Overview of the Distributed Symmetric Multiprocessing Software Architecture

By Peter Robinson
Technical Marketing Manager
Symmetric Computing
Venture Development Center
University of Massachusetts - Boston
Boston MA 02125
Introduction
        Distributed Symmetric Multiprocessing, or DSMP, is a new kernel extension, or kernel enhancement, that extends the capabilities of the legacy Linux operating system so it can support a scalable, shared-memory architecture over a 40Gb InfiniBand-attached cluster. DSMP comprises two unique software components: the host operating system (OS), which runs on the head node, and a unique lightweight micro-kernel OS, which runs on all “other” servers that make up the cluster. The host OS consists of a Linux image plus a new DSMP kernel, creating a new derivative work as noted in Figure 1. The micro-kernel is a non-Linux image that extends the function of the host OS over the entire cluster. These two OS images (host and micro-kernel) are designed to run on commodity Symmetric Multiprocessing (SMP) servers based on the AMD64 processor.

[Figure 1 – Host DSMP software architecture: System Call Interface (SCI); Process Management (PM); Virtual File System (VFS); Memory Management (MM); Network Stack; DSMP Gasket Interface; ARCH; Device Drivers (DD)]

        The AMD64 architecture was selected over competing platforms for a number of reasons, the primary one being price/performance. Back in 2005, when we conceived DSMP, the AMD Opteron™ Processor was the only x86 solution that supported a high-density, 4P direct-connect architecture in a 1U form factor. As of 4Q09, AMD continues to provide the best value for 4P 1U servers, and it continues to offer the only commercially viable 4P solution on the market today.

A look at supercomputing today
        Supercomputing can be divided into two camps: proprietary shared-memory systems and commodity Message Passing Interface (MPI) clusters. Shared-memory systems are based on commodity processors, such as the PowerPC, the Itanium, or the ever-popular x86, and commodity memory (DRAM SIMMs). At the core of most shared-memory systems is a proprietary fabric. This fabric physically extends the host processor's coherency scheme over multiple nodes, providing low-latency inter-node communication while maintaining system-wide coherency. These ultra-expensive, hardened shared-memory supercomputers are designed to accommodate concurrent, enterprise or transactional processing applications. These applications (VMware, Oracle, dbase, SAP, etc.) can utilize one to 512+ processor-cores and terabytes of shared memory. Most of these applications are optimized for the host OS and the micro-architecture of the host processor, but not for the macro-architecture of the target system. Shared-memory systems are also a great deal easier to develop applications for. In fact, it is rarely necessary to modify code-sets or data-sets to run on a shared-memory system; most SMP software simply plugs and plays, which is why shared-memory supercomputers are in such high demand.




        Conversely, MPI clusters are built entirely from commodity servers connected via Ethernet, InfiniBand, or similar communication fabrics. However, these commodity networks introduce tremendous latency compared to the proprietary fabrics of OEM shared-memory supercomputers. Additionally, cluster computing poses challenges for application providers, who must comply with the strict rules of MPI and work within the memory limitations of the SMP nodes which make up the cluster. Despite the computational and porting overhead, the cost benefits of commodity-based computing solutions make MPI clusters a staple of university and small-business research labs.
        Although MPI is the platform of choice for universities and research labs, data-sets in bioinformatics, Oil & Gas, atmospheric modeling, etc. are becoming too large for single-node Symmetric Multi-Processing (SMP) systems and are impractical for MPI clusters, due to the problems that arise when data-sets must be decomposed across nodes. The alternative is to purchase time on a National Lab shared-memory supercomputer (such as the ORNL peta-scale Cray XT4/XT5 Jaguar supercomputer). The problem with the Jaguar option is cost, time and overkill. In short, the reliability, availability and serviceability (RAS) focus of enterprise computing is quite different from what a researcher wants; for example, researchers and academia:
    •   Do not need a hardened, enterprise-class, 9-9s reliable platform;
    •   Do not run multiple applications concurrently, so there is no need for virtualization;
    •   Run applications that are single-process, multiple-thread;
    •   Have an aversion to spending the time, dollars and staff-hours needed to apply for access to these National Lab machines;
    •   Do not want to wait weeks on end in a queue to run their application;
    •   Are willing to optimize their applications for the target hardware to get the most out of the run;
    •   Ultimately want unencumbered 24/7 access to an affordable shared-memory machine – just like their MPI cluster.

Enter Symmetric Computing
        The design team of Symmetric Computing came out of the research community. As such, they are keenly aware of the problems researchers face today and will face in the future. This awareness drove the development of DSMP and the decision to base it on commodity hardware. Our intent is nothing short of having DSMP do for shared-memory supercomputing what the Beowulf project (MPI) did for cluster computing.




How DSMP works
        As stated in the introduction, DSMP is software that transforms an InfiniBand-connected cluster of homogeneous 1U/4P commodity servers into a shared-memory supercomputer. Although there are two unique kernels (a host kernel and a micro-kernel), for this discussion we will ignore the difference between them because, from the programmer's perspective, there is only one OS image and one kernel. The DSMP kernel provides seven (7) enhancements that transform a cluster into a distributed symmetric multiprocessing platform. They are:
    1.  The shared-memory system;
    2.  The optimized InfiniBand driver, which supports a shared-memory architecture;
    3.  An application-driven, memory-page coherency scheme;
    4.  An enhanced multi-threading service, based on the POSIX thread standard;
    5.  A distributed MuTeX;
    6.  A memory-based distributed disk-queue; and
    7.  A distributed disk array.
        The shared-memory system: The centerpiece of DSMP is its shared-memory architecture. For our example we will assume a three-node 4P system with 64GB of physical memory per node. The three nodes are networked via 40Gb InfiniBand and there is no switch. This is, in fact, our value Treo™ Departmental Supercomputer product offering.
        Figure 2 presents a macro view of the DSMP memory architecture. What becomes quite obvious from viewing this graphic is the use of two memory segments, i.e., local-memory and global-memory.
[Figure 2 – DSMP memory architecture: each SMP node (SMP 0, SMP 1, … SMP n) holds 64GB of physical memory, of which 16GB is local memory for that node's four processors (P0–P3) and the remainder is pooled into the system-wide global memory segments, exchanged between nodes over InfiniBand TX/RX links.]
Both coexist in the SMP physical memory and are evenly distributed over the four AMD64 processors on each of the three servers. However, the memory management unit (MMU) on the AMD Opteron™ processor sees only the local memory (noted in blue in the figure). Local memory is statically allocated by the kernel; for our Treo™ example we will assume 1GB of local memory for every AMD64 core within the server. Hence, there are 16GB of local-memory per server, or 48GB of local-memory allocated from the 192GB of available system-wide memory. The remaining 144GB is global-memory, which is concurrently viewable and accessible by all 48 processor cores within the Treo™ Departmental Supercomputer.
        All memory (local and global) is partitioned into 4,096-byte pages, or 64 AMD64 cache-lines. When there is a cache-line miss from local-memory (a page fault), the kernel identifies a least-recently-used (LRU) memory-page and swaps in the missing memory-page from global-memory. That happens, across the InfiniBand fabric, in just under 5 microseconds, and even faster if the page is on the same physical node.
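
        The arithmetic behind these figures is easy to check. The short C fragment below simply recomputes the Treo™ example numbers quoted above (three nodes, sixteen cores and 64GB per node, 1GB of local memory per core); it is illustrative only and does not query DSMP for anything.

    #include <stdio.h>

    /* Illustrative only: recomputes the Treo(TM) memory split described in
     * the text. The constants are the example values from this whitepaper,
     * not values reported by DSMP itself. */
    int main(void) {
        const int nodes             = 3;   /* 1U/4P servers in the cluster    */
        const int cores_per_node    = 16;  /* four quad-core AMD64 processors */
        const int dram_per_node_gb  = 64;  /* physical memory per server      */
        const int local_per_core_gb = 1;   /* static local-memory allotment   */

        int local_per_node_gb = cores_per_node * local_per_core_gb;  /* 16 GB  */
        int local_total_gb    = local_per_node_gb * nodes;           /* 48 GB  */
        int system_total_gb   = dram_per_node_gb * nodes;            /* 192 GB */
        int global_total_gb   = system_total_gb - local_total_gb;    /* 144 GB */

        printf("local memory : %d GB\n", local_total_gb);
        printf("global memory: %d GB\n", global_total_gb);
        return 0;
    }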

        The Optimized InfiniBand Drivers: The entire success of DSMP revolves around the existence of a low-latency, commercially available network fabric. It wasn't that long ago, following Intel's exit from InfiniBand, that industry experts were forecasting its demise. Today InfiniBand is the fabric of choice for most High Performance Computing (HPC) clusters due to its low latency and high bandwidth.
        To squeeze every last nanosecond of performance out of the fabric, the designer of DSMP bypassed the Linux InfiniBand protocol stack and wrote his own low-level driver. In addition, he developed a set of drivers that leverage the native RDMA capabilities of the InfiniBand host channel adapter (HCA). This allows the HCA to service and move memory-page requests without processor intervention. Hence, RDMA eliminates the overhead of message construction and deconstruction, reducing system-wide latency.

        An application-driven, memory-page coherency scheme: As stated in the introduction, all proprietary supercomputers maintain memory consistency and/or coherency via hardware extensions of the host processor. DSMP takes a different approach, maintaining two separate levels of coherency within the system. First, there is cache-line coherency within the local SMP server. Coherency at this level is maintained by the MMU and the SMP logic native to the AMD64 processor, i.e., Cache-coherent HyperTransport™ Technology. However, global memory-page coherency and consistency are controlled and maintained by the programmer. This approach may seem counterintuitive at first. However, the target market segment for DSMP is technical computing, not enterprise computing, and it is assumed that the end user is familiar with the algorithm and how to optimize it for the target platform (in the same way code is optimized for a Beowulf cluster). The high skill level of the end users, combined with the requirement to use only commodity hardware, drove system-level design decisions that keep a DSMP cluster both affordable and fast. To achieve these goals, new and enhanced Linux primitives were developed. Hence, with some simple, intuitive programming rules, augmented with the new primitives, porting an application to a DSMP platform (while maintaining coherency) is simple and manageable. Those rules are as follows:




    •   Be sensitive to the fact that memory-pages are swapped into and out of local memory from global memory in 4K pages and that it takes about 5 microseconds to complete the swap.
    •   Be careful not to overlap or allocate multiple data-sets within the same memory-page. To help prevent this, a new Alloc( ) primitive is provided to assure alignment (see the sketch following this list).
    •   Because of the way local and global memory are partitioned (within physical memory), care should be taken to distribute processes/threads and their associated data evenly over the four processors. In short, try not to pile up processes/threads on one processor/memory unit; rather, distribute them evenly over the system. POSIX thread primitives are provided to support the distribution of these threads.
    •   If there is a data-set which is “modified-shared” and accessed by multiple processes/threads on an adjacent server, then it will be necessary to use a set of new Linux primitives to maintain coherency, i.e., Sync( ), Lock( ) and Release( ).
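
        As an illustration of the second rule, the sketch below gives each data-set its own 4K-aligned, page-padded buffer so that no two data-sets ever share a memory-page. The whitepaper does not publish Alloc( )'s prototype, so the fragment uses the standard posix_memalign( ) call as a stand-in; on a DSMP system the provided Alloc( ) primitive would serve the same purpose.

    #define _POSIX_C_SOURCE 200112L
    #include <stdlib.h>

    #define DSMP_PAGE 4096  /* DSMP page size: 64 AMD64 cache-lines */

    /* Stand-in for DSMP's Alloc(): the whitepaper does not publish its
     * prototype, so posix_memalign() is used to get the same effect -- a
     * buffer that starts on a 4K page boundary and is padded to whole
     * pages, so adjacent data-sets never share a memory-page. */
    static void *page_aligned_alloc(size_t bytes) {
        void *p = NULL;
        size_t padded = ((bytes + DSMP_PAGE - 1) / DSMP_PAGE) * DSMP_PAGE;
        if (posix_memalign(&p, DSMP_PAGE, padded) != 0)
            return NULL;
        return p;
    }

    int main(void) {
        /* one private working buffer per thread/data-set */
        double *set_a = page_aligned_alloc(1000 * sizeof(double));
        double *set_b = page_aligned_alloc(1000 * sizeof(double));
        /* ... hand set_a and set_b to different threads ... */
        free(set_a);
        free(set_b);
        return 0;
    }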

        Multi-Threading: The “gold standard” for parallelizing Linux C/C++ source code is the POSIX thread library, or Pthreads. POSIX is an acronym for Portable Operating System Interface. The latest version, POSIX.1 (IEEE Std 1003.1, 2004 Edition), was developed by the Austin Common Standards Revision Group (CSRG). To ensure that Pthreads would work with DSMP, each of the two dozen or so POSIX routines was either tested with and/or modified for DSMP and the Treo™ platform.
        The common method for parallelizing a process is via the Fork( ) primitive. Within DSMP there is a flag associated with Fork( ). This flag determines whether the forked thread is to stay local (with the current process on the primary server) or run on one of the remote servers. This allows the programmer to specify how many threads of a given process can be serviced by the head node. Simple analysis will show just how many threads can run concurrently before performance flattens out due to the memory-wall effect or other conditions. Once this value is understood, the remote flag can be used to evenly distribute threads over all the servers within the DSMP system. By default, each successive instance of Fork( ) causes that thread to be associated with the next server in the DSMP system, in round-robin fashion. Hence, a remote Fork( ) of three threads on Treo™ would place the current process on each of the three servers, with one thread per server. The kernel manages the consistency of the process to ensure it executes with the same environment and associated state variables.
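
        A minimal sketch of this placement pattern is shown below. The whitepaper names the Fork( ) primitive and its local/remote flag but does not give the actual prototype or flag values, so the signature, the DSMP_LOCAL/DSMP_REMOTE names and the pthread-based stub are all assumptions made purely for illustration.

    #include <pthread.h>
    #include <stdio.h>

    #define DSMP_LOCAL  0   /* assumed: keep the thread on the head node     */
    #define DSMP_REMOTE 1   /* assumed: place the thread on the next server  */

    /* Stand-in for DSMP's Fork(): here it just creates a local pthread so the
     * example compiles; on DSMP, threads forked with the remote flag would be
     * dispatched to the next server in round-robin order. */
    static int Fork(void *(*start)(void *), void *arg, int placement) {
        pthread_t tid;
        (void)placement;
        return pthread_create(&tid, NULL, start, arg);
    }

    static void *worker(void *arg) {
        printf("worker %ld running\n", (long)arg);
        return NULL;
    }

    int main(void) {
        /* On a three-node Treo, three successive remote forks would land
         * one worker on each server. */
        for (long i = 0; i < 3; i++)
            Fork(worker, (void *)i, DSMP_REMOTE);
        pthread_exit(NULL);   /* let the workers finish before exiting */
    }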
        Coherency at the memory-page level is the responsibility of the programmer. A lot of this is common sense: if a memory-page is accessed by multiple threads and updated (modified-exclusive), then it will be necessary to hold off pending threads until the current thread has updated the page in question. To facilitate this, three DSMP Linux primitives are provided: Sync( ), Lock( ) and Release( ).




        •   Sync( ): as the name implies, synchronizes one (1) local private memory-page with its source global-memory page.
        •   Lock( ): prevents any other process thread from accessing and subsequently modifying the memory-page. Lock( ) also invalidates all other copies of the locked memory-page within the system. If a process thread on an adjacent server accesses a locked memory-page, its execution is suspended until the page is released.
        •   Release( ): as the name implies, releases a previously locked memory-page.

        Lastly, to ensure that data-structures do not overlap, a new DSMP Alloc( ) primitive is provided to force alignment of a given data-structure on a 4K boundary. This primitive assures that the end of one data-structure does not fall inside an adjacent data-structure.
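
        Putting the three primitives together, a modified-shared update might look like the compile-only sketch below. The prototypes are assumptions (the paper gives only the primitive names, not their signatures), and the Lock-then-Sync ordering is one reasonable reading of the rules above rather than a documented requirement.

    #include <stddef.h>

    #define DSMP_PAGE 4096

    /* Assumed prototypes -- the whitepaper names these primitives but does
     * not publish their signatures. Linking requires the DSMP runtime. */
    extern void Lock(void *page);     /* invalidate other copies, block peers  */
    extern void Sync(void *page);     /* refresh local copy from global memory */
    extern void Release(void *page);  /* release a previously locked page      */

    /* A data-set that occupies exactly one DSMP memory-page. */
    typedef struct { double v[DSMP_PAGE / sizeof(double)]; } page_t;

    /* Modified-shared update: lock the page so threads on other servers
     * stall, pull in the current contents, modify, then release. */
    void update_cell(page_t *page, size_t i, double value) {
        Lock(page);
        Sync(page);
        page->v[i] = value;
        Release(page);
    }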

        Distributed MuTeX: Wikipedia describes MuTeX, or mutual exclusion, as a set of algorithms used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable or a critical section. A distributed MuTeX is nothing more than a DSMP kernel enhancement which ensures that MuTeX functions as expected within the DSMP system. From a programmer's point of view, there are no changes or modifications to MuTeX – it just works.
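
        For example, ordinary Pthreads mutex code such as the following is expected to behave on a DSMP system exactly as it does on a single SMP node, with the kernel enhancement making the exclusion hold across all servers. This is generic Pthreads code, not DSMP-specific code.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    /* Each worker increments a shared counter inside the usual
     * lock/unlock pair -- no DSMP-specific calls are involved. */
    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* expect 400000 */
        return 0;
    }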

        Memory-based distributed disk-queue: A new DSMP primitive, D_file( ), provides a high-bandwidth, low-latency elastic queue for data that is intended to be written to a low-bandwidth interface, such as a Hard Disk Drive (HDD) or the network. This distributed input/output queue is a memory (DRAM) based storage buffer which effectively eliminates the bottlenecks that occur when multiple threads compete for a low-bandwidth device such as an HDD. Once the current process retires, the contents of the queue are sent to the target I/O device and the queue is released.
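
        A sketch of how such a queue might be used follows. The whitepaper names D_file( ) but not its signature, so the prototype, parameters and file name below are hypothetical; the point is only that writer threads append into the DRAM-backed queue rather than contending for the disk directly.

    #include <stddef.h>

    /* Assumed prototype -- the whitepaper names D_file() but does not publish
     * its signature; treat this declaration as illustrative only. */
    extern int D_file(const char *path, const void *buf, size_t len);

    /* Each worker thread appends its results to the elastic, memory-backed
     * queue. Per the text, DSMP drains the queue to the slow device (HDD or
     * network) only after the current process retires. */
    void emit_block(const double *block, size_t count) {
        D_file("results.bin", block, count * sizeof(double));  /* hypothetical file name */
    }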

        A distributed disk array: A distributed disk array is implemented by the kernel through enhancements made to the Linux striped volume manager. These enhancements extend the Linux volume manager over the entire network, presenting a single consolidated drive to the OS. On Treo™ the distributed disk array is made up of six (6) 1TB drives – two per server – for a single 6TB storage device.

DSMP Performance
        Performance of a supercomputer is a function of two metrics:
          1) Processor performance (computational throughput);
          2) Global memory read/write performance, which can be further divided into:
              a. Stream performance (continuous R/W memory bandwidth), and
              b. Random read/write performance (memory R/W latency).

        The extraordinary thing about DSMP™ is that it is based on commodity components. That's important because DSMP performance scales with the performance of the commodity components from which it is made. As an example, random read/write latency for Treo™ went down 40% with the availability of 40Gb InfiniBand. Furthermore, this move from a 20Gb to a 40Gb fabric caused no appreciable increase in the cost of a Treo™ system (and no changes to the DSMP software were needed).



Also, within this same timeframe, AMD64 processor density went from quad-core to six-core, again
without any appreciable increase in the cost of the total system. Therefore, over time the performance gap
between DSMP™ shared-memory supercomputers and proprietary shared-memory systems will close.

        Today, proprietary shared-memory system providers quote intra-node bandwidth on the order of 2.5GB/sec and random access times on the order of 1µsec. That's a difference of ~4:1 in bandwidth and ~5:1 in R/W latency over DSMP™. At first glance, this much disparity might appear to be a disadvantage, but that is not necessarily the case, for three reasons. First, DSMP random R/W latency is based on the time it takes to move 4,096B, versus the 64B or 128B moved in <1µsec by SGI and others; that's a 64:1 or 32:1 difference in cache-line or page size. In addition, the processors used in these proprietary systems might have enhanced floating-point capabilities, but they might run slower, in some cases much slower, than a 2.8GHz quad-core AMD Opteron™ Processor. So performance is not tied entirely to memory latency or processor performance, but is a function of many system variables as well as the algorithm and the way the data is structured.

        A second and more important reason why DSMP performance is not a problem is access. That is, having open and unencumbered 24/7 access to a shared-memory system. As an example, let's assume it takes 24 hours to run a job on the ORNL Jaguar supercomputer with an allocation of 48 processors and 150GB of shared memory. However, it takes months to submit a proposal and gain approval. Then there is an additional wait in the queue of around 14 days to access the system – typical for this type of engagement. If we assume the DSMP™ shared-memory supercomputer is 1/5 the performance of the one at Oak Ridge (due to memory latency, bandwidth and related factors), then it would take five times longer to get the same results – that's 120 hours versus 24. However, when you take into account the two-week queue time, the results are available 10 days sooner. In the same time-frame, you could have run the job three times over.

        The third and final reason is value. Today, an entry-level Treo™ departmental supercomputer costs only $49,950, configured with 144GB of shared memory, 48 2.8GHz AMD64 processor cores and 6TB of disk storage (University pricing). A comparable shared-memory platform from an OEM would approach $1,000,000 (not including maintenance and licensing fees); that is 1/20 of the price at 1/5 the performance. With the introduction of the Treo™ departmental supercomputer, universities and researchers have a new option, based on the same market forces that drove the emergence of the MPI cluster, i.e., commodity hardware, value and availability. Today, Symmetric Computing offers four unique configurations of Treo™, from 48 to 72 AMD64 cores and 144GB to 336GB of shared memory (see the table below).




    Treo™ P/N      Quad-core 2.8GHz   Six-core 2.6GHz   4GB PC5300 DIMMs   8GB PC5300 DIMMs   Total Shared Memory
    SCA161604-3    269 Giga-flops     -                 192 GB             -                  144GB
    SCA241604-3    -                  374 Giga-flops    192 GB             -                  120GB
    SCA241608-3    -                  374 Giga-flops    -                  384 GB             312GB
    SCA161608-3    269 Giga-flops     -                 -                  384 GB             336GB

        Looking forward to 1Q10, the Symmetric Computing engineering staff will introduce a 10-node blade center delivering 1.2 Tera-flops of peak throughput with 640GB or 1.28TB of system memory. In addition, we are working with our partners to deliver turn-key platforms tuned for application-specific missions – such as next-generation sequencing, HMMER, BLAST, etc.

Conclusion
        Symmetric Computing's overall goal is to make supercomputing accessible and affordable to a broad range of end users. We believe that DSMP is to shared-memory computing what Beowulf/MPI was to distributed-memory computing. We are focused on delivering affordable, commodity-based technical computing solutions that serve an entirely new market – the Departmental Supercomputer. Our initial focus is to provide open applications optimized to run under DSMP and on Treo™, to accelerate scientific developments in Biosciences and Bioinformatics. We continue to expand our scope of applications and remain committed to delivering Supercomputing to the Masses.

About Symmetric Computing
        Symmetric Computing is a Boston-based software company with offices at the Venture Development Center on the campus of the University of Massachusetts – Boston. We design software to accelerate the use and application of shared-memory computing systems for Bioinformatics, Oil & Gas, Post-Production Editing, Financial Analysis and related fields. Symmetric Computing is dedicated to delivering standards-based, customer-focused technical computing solutions for users ranging from universities to enterprises. For more information, visit www.symmetriccomputing.com.




