HIGH PERFORMANCE COMPUTING
TECHNOLOGY COMPASS 2012/13

Application areas: Automotive, CAE, Aerospace, CAD, Engineering, Simulation, Risk Analysis, Price Modelling, High Throughput Computing, Big Data Analytics, Life Sciences
TECHNOLOGY COMPASS
TABLE OF CONTENTS AND INTRODUCTION

HIGH PERFORMANCE COMPUTING ............................................................ 4
Performance Turns Into Productivity ........................................................ 6
Flexible deployment with xCAT ................................................................. 8

CLUSTER MANAGEMENT MADE EASY ...................................................... 12
Bright Cluster Manager ........................................................................... 14

INTELLIGENT HPC WORKLOAD MANAGEMENT ........................................ 28
Moab HPC Suite – Enterprise Edition ....................................................... 30
New in Moab 7.0 ...................................................................................... 34
Moab HPC Suite – Basic Edition ............................................................... 37
Moab HPC Suite – Grid Option ................................................................. 43

NICE ENGINFRAME .................................................................................. 50
A technical portal for remote visualization .............................................. 52
Application highlights ............................................................................. 54
Desktop Cloud Virtualization ................................................................... 57
Remote Visualization ............................................................................... 58

INTEL CLUSTER READY ............................................................................ 62
A Quality Standard for HPC Clusters ........................................................ 64
Intel Cluster Ready builds HPC Momentum ............................................. 69
The transtec Benchmarking Center ......................................................... 73

WINDOWS HPC SERVER 2008 R2 ............................................................. 74
Elements of the Microsoft HPC Solution .................................................. 76
Deployment, system management, and monitoring ................................ 78
Job scheduling ......................................................................................... 80
Service-oriented architecture .................................................................. 82
Networking and MPI ................................................................................ 85
Microsoft Office Excel support ................................................................ 88

PARALLEL NFS ......................................................................................... 90
The New Standard for HPC Storage ......................................................... 92
What's new in NFS 4.1? ........................................................................... 94
Panasas HPC Storage ............................................................................... 99

NVIDIA GPU COMPUTING ..................................................................... 110
The CUDA Architecture ......................................................................... 112
Codename "Fermi" ................................................................................ 116
Introducing NVIDIA Parallel Nsight ....................................................... 122
QLogic TrueScale InfiniBand and GPUs ................................................. 126

INFINIBAND ........................................................................................... 130
High-speed interconnects ..................................................................... 132
Top 10 Reasons to Use QLogic TrueScale InfiniBand ............................. 136
Intel MPI Library 4.0 Performance ........................................................ 139
InfiniBand Fabric Suite (IFS) – What's New in Version 6.0 .................... 141

PARSTREAM ........................................................................................... 144
Big Data Analytics ................................................................................. 146

GLOSSARY ............................................................................................. 156
MORE THAN 30 YEARS OF EXPERIENCE IN SCIENTIFIC COMPUTING

1980 marked the beginning of a decade in which numerous startups were created, some of which later transformed into big players in the IT market. Technical innovations brought dramatic changes to the nascent computer market. In Tübingen, close to one of Germany's prime and oldest universities, transtec was founded.

In the early days, transtec focused on reselling DEC computers and peripherals, delivering high-performance workstations to university institutes and research facilities. In 1987, SUN/Sparc and storage solutions broadened the portfolio, enhanced by IBM/RS6000 products in 1991. These were the typical workstations and server systems for high performance computing then, used by the majority of researchers worldwide.

In the late 90s, transtec was one of the first companies to offer highly customized HPC cluster solutions based on standard Intel architecture servers, some of which entered the TOP500 list of the world's fastest computing systems.

Thus, given this background and history, it is fair to say that transtec looks back upon more than 30 years of experience in scientific computing; our track record shows nearly 500 HPC installations. With this experience, we know exactly what customers' demands are and how to meet them. High performance and ease of management – this is what customers require today. HPC systems are certainly required to peak-perform, as their name indicates, but that is not enough: they must also be easy to handle. Unwieldy design and operational complexity must be avoided or at least hidden from administrators and particularly from users of HPC computer systems.

transtec HPC solutions deliver ease of management, both in the Linux and Windows worlds, and even where the customer's environment is of a highly heterogeneous nature. Even the dynamic provisioning of HPC resources as needed does not constitute any problem, further leading to maximal utilization of the cluster.

transtec HPC solutions use the latest and most innovative technology. Their superior performance goes hand in hand with energy efficiency, as you would expect from any leading-edge IT solution. We regard these as basic characteristics of any of our solutions.

This brochure focuses on where transtec HPC solutions excel. To name a few: Bright Cluster Manager as the technology leader for unified HPC cluster management, the leading-edge Moab HPC Suite for job and workload management, Intel Cluster Ready certification as an independent quality standard for our systems, and Panasas HPC storage systems for the highest performance and best scalability required of an HPC storage system. Again, with these components, usability and ease of management are central issues that are addressed. Also, being an NVIDIA Tesla Preferred Provider, transtec is able to provide customers with well-designed, extremely powerful solutions for Tesla GPU computing. QLogic's InfiniBand Fabric Suite makes managing a large InfiniBand fabric easier than ever before – transtec masterfully combines excellent and well-chosen components into a fine-tuned, customer-specific, and thoroughly designed HPC solution.

Last but not least, your decision for a transtec HPC solution means you opt for the most intensive customer care and the best service in HPC. Our experts will be glad to bring in their expertise and support to assist you at any stage, from HPC design to daily cluster operations, to HPC Cloud Services.

Have fun reading the transtec HPC Compass 2012/13!
High Performance Computing (HPC) has been with us from the very beginning of the computer era. High-performance computers were built to solve numerous problems which the "human computers" could not handle. The term HPC just hadn't been coined yet. More importantly, some of the early principles have changed fundamentally.

HPC systems in the early days were much different from those we see today. First, we saw enormous mainframes from large computer manufacturers, including a proprietary operating system and job management system. Second, at universities and research institutes, workstations made inroads and scientists carried out calculations on their dedicated Unix or VMS workstations. In either case, if you needed more computing power, you scaled up, i.e. you bought a bigger machine.

Today the term High-Performance Computing has gained a fundamentally new meaning. HPC is now perceived as a way to tackle complex mathematical, scientific or engineering problems. The integration of industry-standard, "off-the-shelf" server hardware into HPC clusters facilitates the construction of computer networks of such power as no single system could ever achieve. The new paradigm for parallelization is scaling out.
HIGH PERFORMANCE COMPUTING
PERFORMANCE TURNS INTO PRODUCTIVITY

Computer-supported simulation of realistic processes (so-called Computer Aided Engineering – CAE) has established itself as a third key pillar in the field of science and research, alongside theory and experimentation. It is nowadays inconceivable that an aircraft manufacturer or a Formula One racing team would operate without using simulation software. And scientific calculations, such as in the fields of astrophysics, medicine, pharmaceuticals and bio-informatics, will to a large extent be dependent on supercomputers in the future. Software manufacturers long ago recognized the benefit of high-performance computers based on powerful standard servers and ported their programs to them accordingly.

The main advantage of scale-out supercomputers is just that: they are infinitely scalable, at least in principle. Since they are based on standard hardware components, such a supercomputer can be charged with more power whenever the computational capacity of the system is no longer sufficient, simply by adding additional nodes of the same kind. A cumbersome switch to a different technology can be avoided in most cases.

The primary rationale in using HPC clusters is to grow, to scale out computing capacity as far as necessary. To reach that goal, an HPC cluster returns most of the investment when it is continuously fed with computing problems. The secondary reason for building scale-out supercomputers is to maximize the utilization of the system.

"transtec HPC solutions are meant to provide customers with unparalleled ease-of-management and ease-of-use. Apart from that, deciding for a transtec HPC solution means deciding for the most intensive customer care and the best service imaginable."
Dr. Oliver Tennert, Director Technology Management & HPC Solutions
VARIATIONS ON THE THEME: MPP AND SMP
Parallel computations exist in two major variants today. Applications running in parallel on multiple compute nodes are frequently so-called Massively Parallel Processing (MPP) applications. MPP indicates that the individual processes each utilize exclusive memory areas. This means that such jobs are predestined to be computed in parallel, distributed across the nodes in a cluster. The individual processes can thus utilize the separate units of the respective node – especially the RAM, the CPU power and the disk I/O.

Communication between the individual processes is implemented in a standardized way through the MPI software interface (Message Passing Interface), which abstracts the underlying network connections between the nodes from the processes. However, the MPI standard (current version 2.0) merely requires source code compatibility, not binary compatibility, so an off-the-shelf application usually needs specific versions of MPI libraries in order to run. Examples of MPI implementations are OpenMPI, MPICH2, MVAPICH2, Intel MPI or – for Windows clusters – MS-MPI.

If the individual processes engage in a large amount of communication, the response time of the network (latency) becomes important. Latency in a Gigabit Ethernet or a 10GE network is typically around 10 µs. High-speed interconnects such as InfiniBand reduce latency by a factor of 10, down to as low as 1 µs. Therefore, high-speed interconnects can greatly speed up total processing.

The other frequently used variant is SMP applications. SMP, in this HPC context, stands for Shared Memory Processing. It involves the use of shared memory areas, the specific implementation of which depends on the choice of the underlying operating system. Consequently, SMP jobs generally only run on a single node, where they can in turn be multi-threaded and thus be parallelized across the number of CPUs per node. For many HPC applications, both the MPP and the SMP variant can be chosen.

Many applications are not inherently suitable for parallel execution. In such a case, there is no communication between the individual compute nodes, and therefore no need for a high-speed network between them; nevertheless, multiple computing jobs can be run simultaneously and sequentially on each individual node, depending on the number of CPUs. In order to ensure optimum computing performance for these applications, it must be examined how many CPUs and cores deliver the optimum performance. We typically find applications of this sequential type of work in the fields of data analysis or Monte-Carlo simulations.
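To make the MPP programming model described above more concrete, the following minimal example distributes a computation across MPI processes and collects the result on one rank. It uses the Python MPI bindings mpi4py purely for brevity; the same pattern applies, with more boilerplate, to any of the MPI implementations listed above.

```python
# Minimal MPP-style sketch using MPI via the mpi4py bindings:
# every process computes a partial sum, and rank 0 gathers the total.
# Run with e.g.:  mpirun -np 4 python partial_sums.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's index within the job
size = comm.Get_size()      # total number of processes across all nodes

# Each rank works on its own, exclusive slice of the problem (here: 0..999).
chunk = range(rank, 1000, size)
partial = sum(chunk)

# Explicit message passing: reduce all partial results onto rank 0.
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print(f"sum over {size} processes: {total}")
```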
HIGH PERFORMANCE COMPUTING
FLEXIBLE DEPLOYMENT WITH XCAT

xCAT as a Powerful and Flexible Deployment Tool
xCAT (Extreme Cluster Administration Tool) is an open source toolkit for the deployment and low-level administration of HPC cluster environments, small as well as large ones.

xCAT provides simple commands for hardware control, node discovery, the collection of MAC addresses, and node deployment with local disks (diskful) or without (diskless). The cluster configuration is stored in a relational database. Node groups for different operating system images can be defined. Also, user-specific scripts can be executed automatically at installation time.

xCAT Provides the Following Low-Level Administrative Features
- Remote console support
- Parallel remote shell and remote copy commands
- Plugins for various monitoring tools like Ganglia or Nagios
- Hardware control commands for node discovery, collecting MAC addresses, remote power switching and resetting of nodes
- Automatic configuration of syslog, remote shell, DNS, DHCP, and ntp within the cluster
- Extensive documentation and man pages

For cluster monitoring, we install and configure the open source tool Ganglia or the even more powerful open source solution Nagios, according to the customer's preferences and requirements.
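The xCAT hardware control and discovery commands listed above are plain command-line tools and therefore easy to script. The following sketch assumes a management node with xCAT's standard nodels and rpower commands installed; the node group name "compute" is only an illustrative placeholder.

```python
# Minimal sketch: driving xCAT's node inventory and power control from Python.
# Assumes xCAT's standard commands (nodels, rpower) are installed on the
# management node; the node group "compute" is an illustrative placeholder.
import subprocess

def nodels(noderange):
    """Return the list of node names that xCAT resolves for a noderange."""
    out = subprocess.check_output(["nodels", noderange], text=True)
    return out.split()

def power_status(noderange):
    """Query the power state of all nodes in a noderange via their BMCs."""
    out = subprocess.check_output(["rpower", noderange, "stat"], text=True)
    # rpower prints one "node: state" line per node
    return dict(line.split(":", 1) for line in out.splitlines() if ":" in line)

if __name__ == "__main__":
    nodes = nodels("compute")
    print(f"{len(nodes)} nodes in group 'compute'")
    for node, state in sorted(power_status("compute").items()):
        print(f"{node:>12}: {state.strip()}")
```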
Local Installation or Diskless Installation
We offer a diskful or a diskless installation of the cluster nodes. A diskless installation means the operating system is hosted partially within the main memory; larger parts may or may not be included via NFS or other means. This approach allows for deploying large numbers of nodes very efficiently, and the cluster is up and running within a very small timescale. Also, updating the cluster can be done in a very efficient way: only the boot image has to be updated and the nodes rebooted; after this, the nodes run either a new kernel or even a new operating system. Moreover, with this approach, partitioning the cluster can be done very efficiently, either for testing purposes or for allocating different cluster partitions to different users or applications.

Development Tools, Middleware, and Applications
Depending on the application, optimization strategy, or underlying architecture, different compilers lead to code of very different performance. Moreover, different, mainly commercial, applications require different MPI implementations. And even when the code is self-developed, developers often prefer one MPI implementation over another.

According to the customer's wishes, we install various compilers, MPI middleware, as well as job management systems like Parastation, Grid Engine, Torque/Maui, or the very powerful Moab HPC Suite for high-level cluster management.
SERVICES AND CUSTOMER CARE FROM A TO Z
[Diagram: the transtec HPC service cycle – individual presales consulting; application-, customer-, and site-specific sizing of the HPC solution; benchmarking of different systems; burn-in tests of systems; software and OS installation; hardware assembly; integration into the customer's environment; onsite installation of the HPC solution; application installation; continual improvement; maintenance, onsite support and training; managed services.]
HPC @ TRANSTEC: SERVICES AND CUSTOMER CARE FROM A TO Z
transtec AG has over 30 years of experience in scientific computing and is one of the earliest manufacturers of HPC clusters. For nearly a decade, transtec has delivered highly customized High Performance clusters based on standard components to academic and industry customers across Europe, with all the high quality standards and the customer-centric approach that transtec is well known for.

Every transtec HPC solution is more than just a rack full of hardware – it is a comprehensive solution with everything the HPC user, owner, and operator need.

In the early stages of any customer's HPC project, transtec experts provide extensive and detailed consulting to the customer – they benefit from expertise and experience. Consulting is followed by benchmarking of different systems with either specifically crafted customer code or generally accepted benchmarking routines; this aids customers in sizing and devising the optimal and detailed HPC configuration.

Each and every piece of HPC hardware that leaves our factory undergoes a burn-in procedure of 24 hours or more if necessary. We make sure that any hardware shipped meets our and our customers' quality requirements. transtec HPC solutions are turnkey solutions. By default, a transtec HPC cluster has everything installed and configured – from hardware and operating system to important middleware components like cluster management or developer tools and the customer's production applications. Onsite delivery means onsite integration into the customer's production environment, be it establishing network connectivity to the corporate network, or setting up software and configuration parts.

transtec HPC clusters are ready-to-run systems – we deliver, you turn the key, the system delivers high performance. Every HPC project entails transfer to production: IT operation processes and policies apply to the new HPC system. Effectively, IT personnel is trained hands-on, introduced to hardware components and software, with all operational aspects of configuration management.

transtec services do not stop when the implementation project ends. Beyond transfer to production, transtec takes care. transtec offers a variety of support and service options, tailored to the customer's needs. When you are in need of a new installation, a major reconfiguration or an update of your solution, transtec is able to support your staff and, if you lack the resources for maintaining the cluster yourself, maintain the HPC solution for you. From Professional Services to Managed Services for daily operations and required service levels, transtec will be your complete HPC service and solution provider. transtec's high standards of performance, reliability and dependability assure your productivity and complete satisfaction.

transtec's offerings of HPC Managed Services give customers the possibility of having the complete management and administration of the HPC cluster handled by transtec service specialists, in an ITIL-compliant way. Moreover, transtec's HPC on Demand services help provide access to HPC resources whenever customers need them, for example because they do not have the possibility of owning and running an HPC cluster themselves, due to lacking infrastructure, know-how, or admin staff.
Bright Cluster Manager removes the complexity from the installation, management and use of HPC clusters, without compromising performance or capability. With Bright Cluster Manager, an administrator can easily install, use and manage multiple clusters simultaneously, without the need for expert knowledge of Linux or HPC.
CLUSTER MANAGEMENT MADE EASY
BRIGHT CLUSTER MANAGER

A UNIFIED APPROACH
Other cluster management offerings take a "toolkit" approach in which a Linux distribution is combined with many third-party tools for provisioning, monitoring, alerting, etc.

This approach has critical limitations because those separate tools were not designed to work together, were not designed for HPC, and were not designed to scale. Furthermore, each of the tools has its own interface (mostly command-line based), and each has its own daemons and databases. Countless hours of scripting and testing by highly skilled people are required to get the tools to work for a specific cluster, and much of it goes undocumented.

Bright Cluster Manager takes a much more fundamental, integrated and unified approach. It was designed and written from the ground up for straightforward, efficient, comprehensive cluster management. It has a single lightweight daemon, a central database for all monitoring and configuration data, and a single CLI and GUI for all cluster management functionality. This approach makes Bright Cluster Manager extremely easy to use, scalable, secure and reliable, complete, flexible, and easy to maintain and support.

THE CLUSTER INSTALLER TAKES THE ADMINISTRATOR THROUGH THE INSTALLATION PROCESS AND OFFERS ADVANCED OPTIONS SUCH AS "EXPRESS" AND "REMOTE".

BY SELECTING A CLUSTER NODE IN THE TREE ON THE LEFT AND THE TASKS TAB ON THE RIGHT, THE ADMINISTRATOR CAN EXECUTE A NUMBER OF POWERFUL TASKS ON THAT NODE WITH JUST A SINGLE MOUSE CLICK.

EASE OF INSTALLATION
Bright Cluster Manager is easy to install. Typically, system administrators can install and test a fully functional cluster from "bare metal" in less than an hour. Configuration choices made during the installation can be modified afterwards. Multiple installation modes are available, including unattended and remote modes. Cluster nodes can be automatically identified based on switch ports rather than MAC addresses, improving the speed and reliability of installation, as well as subsequent maintenance.
EASE OF USE
Bright Cluster Manager is easy to use. System administrators have two options: the intuitive Cluster Management Graphical User Interface (CMGUI) and the powerful Cluster Management Shell (CMSH). The CMGUI is a standalone desktop application that provides a single system view for managing all hardware and software aspects of the cluster through a single point of control. Administrative functions are streamlined as all tasks are performed through one intuitive, visual interface. Multiple clusters can be managed simultaneously. The CMGUI runs on Linux, Windows and MacOS (coming soon) and can be extended using plugins. The CMSH provides practically the same functionality as the CMGUI, but via a command-line interface. The CMSH can be used both interactively and in batch mode via scripts. Either way, system administrators now have unprecedented flexibility and control over their clusters.

CLUSTER METRICS, SUCH AS GPU AND CPU TEMPERATURES, FAN SPEEDS AND NETWORK STATISTICS, CAN BE VISUALIZED BY SIMPLY DRAGGING AND DROPPING THEM FROM THE LIST ON THE LEFT INTO A GRAPHING WINDOW ON THE RIGHT. MULTIPLE METRICS CAN BE COMBINED IN ONE GRAPH AND GRAPHS CAN BE ZOOMED INTO. GRAPH LAYOUT AND COLORS CAN BE TAILORED TO YOUR REQUIREMENTS.
SUPPORT FOR LINUX AND WINDOWS
Bright Cluster Manager is based on Linux and is available with a choice of pre-integrated, pre-configured and optimized Linux distributions, including SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS and Scientific Linux. Dual-boot installations with Windows HPC Server are supported as well, allowing nodes to boot either from the Bright-managed Linux head node or from the Windows-managed head node.

THE STATUS OF CLUSTER NODES, SWITCHES, OTHER HARDWARE, AS WELL AS UP TO SIX METRICS CAN BE VISUALIZED IN THE RACKVIEW. A ZOOM-OUT OPTION IS AVAILABLE FOR CLUSTERS WITH MANY RACKS.

THE OVERVIEW TAB PROVIDES INSTANT, HIGH-LEVEL INSIGHT INTO THE STATUS OF THE CLUSTER.

EXTENSIVE DEVELOPMENT ENVIRONMENT
Bright Cluster Manager provides an extensive HPC development environment for both serial and parallel applications, including the following (some optional):
- Compilers, including full suites from GNU, Intel, AMD and Portland Group
- Debuggers and profilers, including the GNU debugger and profiler, TAU, TotalView, Allinea DDT and Allinea OPT
- GPU libraries, including CUDA and OpenCL
- MPI libraries, including OpenMPI, MPICH, MPICH2, MPICH-MX, MPICH2-MX, MVAPICH and MVAPICH2; all cross-compiled with the compilers installed on Bright Cluster Manager, and optimized for high-speed interconnects such as InfiniBand and Myrinet
- Mathematical libraries, including ACML, FFTW, GMP, GotoBLAS, MKL and ScaLAPACK
- Other libraries, including Global Arrays, HDF5, IPP, TBB, NetCDF and PETSc

THE PARALLEL SHELL ALLOWS FOR SIMULTANEOUS EXECUTION OF COMMANDS OR SCRIPTS ACROSS NODE GROUPS OR ACROSS THE ENTIRE CLUSTER.

Bright Cluster Manager also provides Environment Modules to make it easy to maintain multiple versions of compilers, libraries and applications for different users on the cluster, without creating compatibility conflicts. Each Environment Module file contains the information needed to configure the shell for an application, and automatically sets these variables correctly for the particular application when it is loaded. Bright Cluster Manager includes many preconfigured module files for many scenarios, such as combinations of compilers, mathematical and MPI libraries.
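Environment Modules are normally used interactively ("module load ..."), but they can also be driven from scripts. The sketch below assumes the modulecmd helper that ships with Environment Modules is on the PATH; the module names are placeholders and depend on what is installed on the cluster.

```python
# Minimal sketch: loading Environment Modules from Python, assuming the
# "modulecmd" binary of Environment Modules is available on the PATH.
import os
import subprocess

def module(*args):
    """Run 'modulecmd python <args>' and apply the environment changes it emits."""
    # modulecmd prints Python statements (os.environ[...] = ...) on stdout
    code = subprocess.check_output(["modulecmd", "python"] + list(args))
    exec(code)

# Example: make a specific compiler/MPI combination available to this process.
# The module names below are placeholders; actual names depend on the cluster.
module("load", "intel/12.1")
module("load", "openmpi/intel/1.4")
print(os.environ.get("LOADEDMODULES", ""))
```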
POWERFUL IMAGE MANAGEMENT AND PROVISIONING
Bright Cluster Manager features sophisticated software image management and provisioning capability. A virtually unlimited number of images can be created and assigned to as many different categories of nodes as required. Default or custom Linux kernels can be assigned to individual images. Incremental changes to images can be deployed to live nodes without rebooting or re-installation. The provisioning system propagates only changes to the images, minimizing time and impact on system performance and availability. Provisioning capability can be assigned to any number of nodes on-the-fly, for maximum flexibility and scalability. Bright Cluster Manager can also provision over InfiniBand and to RAM disk.

COMPREHENSIVE MONITORING
With Bright Cluster Manager, system administrators can collect, monitor, visualize and analyze a comprehensive set of metrics. Practically all software and hardware metrics available to the Linux kernel, and all hardware management interface metrics (IPMI, iLO, etc.), are sampled.
CLUSTER MANAGEMENT MADE EASY
BRIGHT CLUSTER MANAGER
HIGH PERFORMANCE MEETS EFFICIENCY
Initially, massively parallel systems constitute a challenge to
both administrators and users. They are complex beasts. Any-
one building HPC clusters will need to tame the beast, master
the complexity and present users and administrators with an
easy-to-use, easy-to-manage system landscape.
Leading HPC solution providers such as transtec achieve this
goal. They hide the complexity of HPC under the hood and
match high performance with efficiency and ease-of-use for
both users and administrators. The “P” in “HPC” gains a double
meaning: “Performance” plus “Productivity”.
Cluster and workload management software like Moab HPC
Suite, Bright Cluster Manager or QLogic IFS provide the means
to master and hide the inherent complexity of HPC systems. For
administrators and users, HPC clusters are presented as single,
large machines, with many different tuning parameters. The
software also provides a unified view of existing clusters when-
ever unified management is added as a requirement by the
customer at any point in time after the first installation. Thus,
daily routine tasks such as job management, user management,
queue partitioning and management, can be performed easily
with either graphical or web-based tools, without any advanced
scripting skills or technical expertise required from the adminis-
trator or user.
THE BRIGHT ADVANTAGE
Bright Cluster Manager offers many advantages that lead to improved productivity, uptime, scalability, performance and security, while reducing total cost of ownership.

Rapid Productivity Gains
- Easy to learn and use, with an intuitive GUI
- Quick installation: from bare metal to a cluster ready to use, in less than an hour
- Fast, flexible provisioning: incremental, live, disk-full, disk-less, provisioning over InfiniBand, auto node discovery
- Comprehensive monitoring: on-the-fly graphs, rackview, multiple clusters, custom metrics
- Powerful automation: thresholds, alerts, actions
- Complete GPU support: NVIDIA, AMD ATI, CUDA, OpenCL
- On-demand SMP: instant ScaleMP virtual SMP deployment
- Powerful cluster management shell and SOAP API for automating tasks and creating custom capabilities
- Seamless integration with leading workload managers: PBS Pro, Moab, Maui, SLURM, Grid Engine, Torque, LSF
- Integrated (parallel) application development environment
- Easy maintenance: automatically update your cluster from Linux and Bright Computing repositories
- Web-based user portal

Maximum Uptime
- Unattended, robust head node failover to spare head node
- Powerful cluster automation functionality allows preemptive actions based on monitoring thresholds
- Comprehensive cluster monitoring and health checking framework, including automatic sidelining of unhealthy nodes to prevent job failure

Scalability from Deskside to TOP500
- Off-loadable provisioning for maximum scalability
- Proven on some of the world's largest clusters

Minimum Overhead/Maximum Performance
- Single lightweight daemon drives all functionality
- Daemon heavily optimized to minimize effect on operating system and applications
- Single database stores all metric and configuration data

Top Security
- Automated security and other updates from key-signed repositories
- Encrypted external and internal communications (optional)
- X.509v3 certificate-based public-key authentication
- Role-based access control and complete audit trail
- Firewalls and secure LDAP
Examples include CPU and GPU temperatures, fan speeds, switches, hard disk SMART information, system load, memory utilization, network statistics, storage metrics, power systems statistics, and workload management statistics. Custom metrics can also easily be defined.

Metric sampling is done very efficiently – in one process, or out-of-band where possible. System administrators have full flexibility over how and when metrics are sampled, and historic data can be consolidated over time to save disk space.
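A custom metric is typically backed by a small sampling script or program that reports a single numeric value. The sketch below illustrates such a sampler, reading a CPU temperature from the Linux hwmon interface; the sysfs path and scaling are assumptions that vary by hardware, and how the script is registered as a metric is part of the Bright Cluster Manager configuration and not shown here.

```python
# Minimal sketch of a metric sampler: print one numeric value on stdout.
# Reads the first temperature sensor exposed by the Linux hwmon interface;
# the sysfs path and millidegree scaling are assumptions that vary by node.
import glob
import sys

def cpu_temperature_celsius():
    for path in sorted(glob.glob("/sys/class/hwmon/hwmon*/temp1_input")):
        with open(path) as f:
            return int(f.read().strip()) / 1000.0   # values are in millidegrees
    return None

temp = cpu_temperature_celsius()
if temp is None:
    sys.exit(1)       # no sensor found: report failure to the caller
print(temp)
```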
THE AUTOMATION CONFIGURATION WIZARD GUIDES THE SYSTEM ADMINISTRATOR THROUGH THE STEPS OF DEFINING A RULE: SELECTING METRICS, DEFINING THRESHOLDS AND SPECIFYING ACTIONS.

CLUSTER MANAGEMENT AUTOMATION
Cluster management automation takes preemptive actions when predetermined system thresholds are exceeded, saving time and preventing hardware damage. System thresholds can be configured on any of the available metrics. The built-in configuration wizard guides the system administrator through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions.
For example, a temperature threshold for GPUs can be established that results in the system automatically shutting down an overheated GPU unit and sending an SMS message to the system administrator's mobile phone. Several predefined actions are available, but any Linux command or script can be configured as an action.
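Because any Linux command or script can be configured as an action, an action can be as simple as the following sketch, which appends the triggering event to a log file and notifies an administrator by mail; the recipient address, log path and use of the local mail command are placeholders.

```python
# Minimal sketch of a custom automation action: log the triggering event and
# notify an administrator. Recipient address and log file path are placeholders.
import datetime
import subprocess
import sys

ADMIN = "hpc-admin@example.com"          # placeholder recipient
LOGFILE = "/var/log/cluster-actions.log" # placeholder log location

def main():
    event = " ".join(sys.argv[1:]) or "threshold exceeded"
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(LOGFILE, "a") as log:
        log.write(f"{stamp} {event}\n")
    # Hand the notification off to the local mail system, if one is configured.
    subprocess.run(["mail", "-s", f"Cluster alert: {event}", ADMIN],
                   input="Automated action triggered.\n", text=True, check=False)

if __name__ == "__main__":
    main()
```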
EXAMPLE GRAPHS THAT VISUALIZE METRICS ON A GPU CLUSTER.

COMPREHENSIVE GPU MANAGEMENT
Bright Cluster Manager radically reduces the time and effort of managing GPUs, and fully integrates these devices into the single view of the overall system. Bright includes powerful GPU management and monitoring capability that leverages functionality in NVIDIA Tesla GPUs. System administrators can easily assume maximum control of the GPUs and gain instant and time-based status insight. In addition to the standard cluster management capabilities, Bright Cluster Manager monitors the full range of GPU metrics, including:
- GPU temperature, fan speed, utilization
- GPU exclusivity, compute, display, persistence mode
- GPU memory utilization, ECC statistics
- Unit fan speed, serial number, temperature, power usage, voltages and currents, LED status, firmware
- Board serial, driver version, PCI info

Beyond metrics, Bright Cluster Manager features built-in support for GPU computing with CUDA and OpenCL libraries. Switching between current and previous versions of CUDA and OpenCL has also been made easy.

MULTI-TASKING VIA PARALLEL SHELL
The parallel shell allows simultaneous execution of multiple commands and scripts across the cluster as a whole, or across easily definable groups of nodes. Output from the executed commands is displayed in a convenient way with variable levels of verbosity. Running commands and scripts can be killed easily if necessary. The parallel shell is available through both the CMGUI and the CMSH.
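The parallel shell itself ships with Bright Cluster Manager; purely to illustrate the underlying idea of fanning one command out to many nodes and collecting the output, here is a generic sketch based on SSH and a thread pool. The node names and SSH settings are assumptions, and this is not how Bright's parallel shell is implemented.

```python
# Generic illustration of parallel command execution across nodes over SSH.
# This is not Bright's parallel shell, just the underlying idea: fan out one
# command to many nodes concurrently and collect each node's output.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:03d}" for i in range(1, 17)]   # placeholder node names

def run_on(node, command):
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", node, command],
        capture_output=True, text=True, timeout=30,
    )
    return node, proc.returncode, proc.stdout.strip()

with ThreadPoolExecutor(max_workers=16) as pool:
    for node, rc, out in pool.map(lambda n: run_on(n, "uptime"), NODES):
        print(f"{node}: rc={rc} {out}")
```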
INTEGRATED WORKLOAD MANAGEMENT
Bright Cluster Manager is integrated with a wide selection of free and commercial workload managers. This integration provides a number of benefits:
- The selected workload manager gets automatically installed and configured
- Many workload manager metrics are monitored
- The GUI provides a user-friendly interface for configuring, monitoring and managing the selected workload manager
- The CMSH and the SOAP API provide direct and powerful access to a number of workload manager commands and metrics
- Reliable workload manager failover is properly configured
- The workload manager is continuously made aware of the health state of nodes (see the section on Health Checking)

WORKLOAD MANAGEMENT QUEUES CAN BE VIEWED AND CONFIGURED FROM THE GUI, WITHOUT THE NEED FOR WORKLOAD MANAGEMENT EXPERTISE.

CREATING AND DISMANTLING A VIRTUAL SMP NODE CAN BE ACHIEVED WITH JUST A FEW CLICKS WITHIN THE GUI OR A SINGLE COMMAND IN THE CLUSTER MANAGEMENT SHELL.
The following user-selectable workload managers are tightly integrated with Bright Cluster Manager: PBS Pro, Moab, Maui, LSF, SLURM, Grid Engine, and Torque. Alternatively, Lava, LoadLeveler or other workload managers can be installed on top of Bright Cluster Manager.

INTEGRATED SMP SUPPORT
Bright Cluster Manager – Advanced Edition dynamically aggregates multiple cluster nodes into a single virtual SMP node, using ScaleMP's Versatile SMP™ (vSMP) architecture. Creating and dismantling a virtual SMP node can be achieved with just a few clicks within the CMGUI. Virtual SMP nodes can also be launched and dismantled automatically using the scripting capabilities of the CMSH. In Bright Cluster Manager a virtual SMP node behaves like any other node, enabling transparent, on-the-fly provisioning, configuration, monitoring and management of virtual SMP nodes as part of the overall system management.

MAXIMUM UPTIME WITH HEAD NODE FAILOVER
Bright Cluster Manager – Advanced Edition allows two head nodes to be configured in active-active failover mode. Both head nodes are on active duty, but if one fails, the other takes over all tasks, seamlessly.

MAXIMUM UPTIME WITH HEALTH CHECKING
Bright Cluster Manager – Advanced Edition includes a powerful cluster health checking framework that maximizes system uptime. It continually checks multiple health indicators for all hardware and software components and proactively initiates corrective actions. It can also automatically perform a series of standard and user-defined tests just before starting a new job, to ensure a successful execution. Examples of corrective actions include autonomous bypass of faulty nodes, automatic job requeuing to avoid queue flushing, and process "jailing" to allocate, track, trace and flush completed user processes. The health checking framework ensures the highest job throughput, the best overall cluster efficiency and the lowest administration overhead.

WEB-BASED USER PORTAL
The web-based user portal provides read-only access to essential cluster information, including a general overview of the cluster status, node hardware and software properties, workload manager statistics and user-customizable graphs. The User Portal can easily be customized and expanded using PHP and the SOAP API.

USER AND GROUP MANAGEMENT
Users can be added to the cluster through the CMGUI or the CMSH. Bright Cluster Manager comes with a pre-configured LDAP database, but an external LDAP service, or an alternative authentication system, can be used instead.

ROLE-BASED ACCESS CONTROL AND AUDITING
Bright Cluster Manager's role-based access control mechanism allows administrator privileges to be defined on a per-role basis.
Administrator actions can be audited using an audit file which stores all of their write actions.

TOP CLUSTER SECURITY
Bright Cluster Manager offers an unprecedented level of security that can easily be tailored to local requirements. Security features include:
- Automated security and other updates from key-signed Linux and Bright Computing repositories
- Encrypted internal and external communications
- X.509v3 certificate-based public-key authentication to the cluster management infrastructure
- Role-based access control and complete audit trail
- Firewalls and secure LDAP
- Secure shell access

THE WEB-BASED USER PORTAL PROVIDES READ-ONLY ACCESS TO ESSENTIAL CLUSTER INFORMATION, INCLUDING A GENERAL OVERVIEW OF THE CLUSTER STATUS, NODE HARDWARE AND SOFTWARE PROPERTIES, WORKLOAD MANAGER STATISTICS AND USER-CUSTOMIZABLE GRAPHS.

"The building blocks for transtec HPC solutions must be chosen according to our goals of ease of management and ease of use. With Bright Cluster Manager, we are happy to have the technology leader at hand, meeting these requirements, and our customers value that."
Armin Jäger, HPC Solution Engineer
MULTI-CLUSTER CAPABILITY
Bright Cluster Manager is ideal for organizations that need to manage multiple clusters, either in one or in multiple locations. Capabilities include:
- All cluster management and monitoring functionality available for all clusters through one GUI
- Selecting any set of configurations in one cluster and exporting them to any or all other clusters with a few mouse clicks
- Making node images available to other clusters

STANDARD AND ADVANCED EDITIONS
Bright Cluster Manager is available in two editions: Standard and Advanced. The table below lists the differences. You can easily upgrade from the Standard to the Advanced Edition as your cluster grows in size or complexity.

DOCUMENTATION AND SERVICES
A comprehensive system administrator manual and user manual are included in PDF format. Customized training and professional services are available. Services include various levels of support, installation services and consultancy.

BRIGHT CLUSTER MANAGER CAN MANAGE MULTIPLE CLUSTERS SIMULTANEOUSLY. THIS OVERVIEW SHOWS CLUSTERS IN OSLO, ABU DHABI AND HOUSTON, ALL MANAGED THROUGH ONE GUI.

CLUSTER HEALTH CHECKS CAN BE VISUALIZED IN THE RACKVIEW. THIS SCREENSHOT SHOWS THAT GPU UNIT 41 FAILS A HEALTH CHECK CALLED "ALLFANSRUNNING".
FEATURE STANDARD ADVANCED
Choice of Linux distributions x x
Intel Cluster Ready x x
Cluster Management GUI x x
Cluster Management Shell x x
Web-Based User Portal x x
SOAP API x x
Node Provisioning x x
Node Identification x x
Cluster Monitoring x x
Cluster Automation x x
User Management x x
Parallel Shell x x
Workload Manager Integration x x
Cluster Security x x
Compilers x x
Debuggers & Profilers x x
MPI Libraries x x
Mathematical Libraries x x
Environment Modules x x
NVIDIA CUDA & OpenCL x x
GPU Management & Monitoring x x
ScaleMP Management & Monitoring - x
Redundant Failover Head Nodes - x
Cluster Health Checking - x
Off-loadable Provisioning - x
Suggested Number of Nodes 4–128 129–10,000+
Multi-Cluster Management - x
Standard Support x x
Premium Support Optional Optional
While all HPC systems face challenges in workload demand,
resource complexity, and scale, enterprise HPC systems face
more stringent challenges and expectations. Enterprise HPC
systems must meet mission-critical and priority HPC workload
demands for commercial businesses and business-oriented
research and academic organizations. They have complex SLAs
and priorities to balance. Their HPC workloads directly impact
the revenue, product delivery, and organizational objectives
of their organizations.
INTELLIGENT HPC WORKLOAD MANAGEMENT
MOAB HPC SUITE – ENTERPRISE EDITION

MOAB HPC SUITE
Moab is the most powerful intelligence engine for policy-based, predictive scheduling across workloads and resources. Moab HPC Suite accelerates results delivery and maximizes utilization while simplifying workload management across complex, heterogeneous cluster environments. The Moab HPC Suite products leverage the multi-dimensional policies in Moab to continually model and monitor workloads, resources, SLAs, and priorities to optimize workload output. These policies utilize the unique Moab management abstraction layer that integrates data across heterogeneous resources and resource managers to maximize control as you automate workload management actions.

Managing the World's Top Systems, Ready to Manage Yours
Moab manages the largest, most scale-intensive and complex HPC environments in the world, including 40% of the top 10 supercomputing systems, nearly 40% of the top 25, and 36% of the compute cores in the top 100 systems, based on rankings from www.Top500.org. So you know it is battle-tested and ready to efficiently and intelligently manage the complexities of your environment.

"With Moab HPC Suite, we can meet very demanding customers' requirements as regards unified management of heterogeneous cluster environments and grid management, and provide them with flexible and powerful configuration and reporting options. Our customers value that highly."
Thomas Gebert, HPC Solution Architect

MOAB HPC SUITE – ENTERPRISE EDITION
Moab HPC Suite – Enterprise Edition provides enterprise-ready HPC workload management that self-optimizes the productivity, workload uptime and meeting of SLAs and business priorities for HPC systems and HPC cloud. It uses the battle-tested and patented Moab intelligence engine to automate the mission-critical workload priorities of enterprise HPC systems. Enterprise customers benefit from a single integrated product that brings
together key enterprise HPC capabilities, implementation, training, and 24x7 support services to speed the realization of benefits from their HPC system for their business. Moab HPC Suite – Enterprise Edition delivers:
- Productivity acceleration
- Uptime automation
- Auto-SLA enforcement
- Grid- and cloud-ready HPC management

Designed to Solve Enterprise HPC Challenges
While all HPC systems face challenges in workload and resource complexity, scale and demand, enterprise HPC systems face more stringent challenges and expectations. Enterprise HPC systems must meet mission-critical and priority HPC workload demands for commercial businesses and business-oriented research and academic organizations. These organizations have complex SLAs and priorities to balance. And their HPC workloads directly impact the revenue, product delivery, and organizational objectives of their organizations.

Enterprise HPC organizations must eliminate job delays and failures. They are also seeking to improve resource utilization and workload management efficiency across multiple heterogeneous systems. To maximize user productivity, they are required to make it easier for users to access and use HPC resources, and even to expand to other clusters or HPC cloud to better handle workload demand and surges.

BENEFITS
Moab HPC Suite – Enterprise Edition offers key benefits to reduce costs, improve service performance, and accelerate the productivity of enterprise HPC systems. These benefits drive the achievement of business objectives and outcomes that depend on the results the enterprise HPC systems deliver. Moab HPC Suite – Enterprise Edition delivers:

Productivity acceleration to get more results faster and at a lower cost
Moab HPC Suite – Enterprise Edition gets more results delivered faster from HPC resources to lower costs while accelerating overall system, user and administrator productivity. Moab provides the unmatched scalability, 90-99 percent utilization, and fast and simple job submission that is required to maximize productivity in enterprise HPC organizations. The Moab intelligence engine optimizes workload scheduling and orchestrates resource provisioning and management to maximize workload speed and quantity. It also unifies workload management across heterogeneous resources, resource managers and even multiple clusters to reduce management complexity and costs.

Uptime automation to ensure workload completes successfully
HPC job and resource failures in enterprise HPC systems lead to delayed results and missed organizational opportunities and objectives. Moab HPC Suite – Enterprise Edition intelligently automates workload and resource uptime in HPC systems to ensure that workload completes reliably and avoids these failures.

Auto-SLA enforcement to consistently meet service guarantees and business priorities
Moab HPC Suite – Enterprise Edition uses the powerful Moab intelligence engine to optimally schedule and dynamically adjust workload to consistently meet service level agreements (SLAs), guarantees, and business priorities. This automatically
ensures that the right workloads are completed at the optimal times, taking into account the complex mix of departments, priorities and SLAs to be balanced.

Grid- and Cloud-ready HPC management to more efficiently manage and meet workload demand
The benefits of a traditional HPC environment can be extended to more efficiently manage and meet workload and resource demand by sharing workload across multiple clusters, through the grid management and HPC cloud management capabilities provided in Moab HPC Suite – Enterprise Edition.

CAPABILITIES
Moab HPC Suite – Enterprise Edition brings together key enterprise HPC capabilities into a single integrated product that self-optimizes the productivity, workload uptime, and meeting of SLAs and priorities for HPC systems and HPC Cloud.

Productivity acceleration capabilities deliver more results faster, lower costs, and increase resource, user and administrator productivity:
- Massive scalability accelerates job response and throughput, including support for high throughput computing
- Workload-optimized allocation policies and provisioning get more results out of existing heterogeneous resources to reduce costs
- Workload unification across heterogeneous clusters maximizes resource availability for workloads and administration efficiency by managing workload as one cluster
- Simplified HPC submission and control for both users and administrators with job arrays, templates, a self-service submission portal and an administrator dashboard
- Optimized intelligent scheduling that packs workloads and backfills around priority jobs and reservations while balancing SLAs to efficiently use all available resources
- Advanced scheduling and management of GPGPUs for jobs to maximize their utilization, including auto-detection, policy-based GPGPU scheduling and GPGPU metrics reporting
- Workload-aware auto-power management reduces energy use and costs by 30-40 percent with intelligent workload consolidation and auto-power management

Uptime automation capabilities ensure workload completes successfully and reliably, avoiding failures and missed organizational opportunities and objectives:
- Intelligent resource placement prevents job failures with granular resource modeling that ensures all workload requirements are met while avoiding at-risk resources
- Auto-response to incidents and events maximizes job and system uptime with configurable actions to pre-failure conditions, amber alerts, or other metrics and monitors
- Workload-aware maintenance scheduling helps maintain a stable HPC system without disrupting workload productivity
- Real-world services expertise ensures fast time to value and system uptime with an included package of implementation, training, and 24x7 remote support services

Auto-SLA enforcement schedules and adjusts workload to consistently meet service guarantees and business priorities so the right workloads are completed at the optimal times:
- Department budget enforcement schedules resources in line with resource sharing agreements and budgets (i.e. usage limits, usage reports, etc.)
- SLA and priority policies ensure the highest priority workloads are processed first (i.e. quality of service, hierarchical priority weighting, dynamic fairshare policies, etc.)
- Continuous plus future scheduling ensures priorities and guarantees are proactively met as conditions and workload levels change (i.e. future reservations, priorities, and pre-emption)

Grid- and cloud-ready HPC management extends the benefits of your traditional HPC environment to more efficiently manage workload and better meet workload demand:
- Pay-for-use showback and chargeback capabilities track actual resource usage with flexible chargeback options and reporting by user or department
- Manage and share workload across multiple remote clusters to meet growing workload demand or surges with the single self-service portal and intelligence engine, with purchase of Moab HPC Suite – Grid Option

ARCHITECTURE
Moab HPC Suite – Enterprise Edition is architected to integrate on top of your existing job resource managers and other types of resource managers in your environment. It provides policy-based scheduling and management of workloads as well as resource allocation and provisioning orchestration. The Moab intelligence engine makes complex scheduling and management decisions based on all of the data it integrates from the various resource managers and then orchestrates the job and management actions through those resource managers. It does this without requiring any additional agents. This makes it the ideal choice to integrate with existing and new systems
INTELLIGENT
HPC WORKLOAD MANAGEMENT
NEW IN MOAB 7.0
NEW MOAB HPC SUITE 7.0
The new Moab HPC Suite 7.0 releases deliver continued break-
through advancements in scalability, reliability, and job array
management to accelerate system productivity as well as ex-
tended database support. Here is a look at the new capabilities
and the value they offer customers:
TORQUE Resource Manager Scalability and Reliability Ad-
vancements for Petaflop and Beyond
As part of the Moab HPC Suite 7.0 releases, the TORQUE 4.0
resource manager features scalability and reliability advance-
ments to fully exploit Moab scalability. These advancements
maximize your use of increasing hardware capabilities and
enable you to meet growing HPC user needs. Key advancements
in TORQUE 4.0 for Moab HPC Suite 7.0 include:
The new Job Radix enables you to efficiently run jobs that span
tens of thousands or even hundreds of thousands of nodes.
Each MOM daemon now cascades job communication with
multiple other MOM daemons simultaneously to reduce the
job start-up process time to a small fraction of what it would
normally take across a large number of nodes. The Job Radix
eliminates lost jobs and job start-up bottlenecks caused by
having all nodes' MOM daemons communicate with only one
head MOM node. This saves critical minutes on job start-up
process time and allows for higher job throughput.
New MOM daemon communication hierarchy increases the number of nodes supported and reduces the overhead of cluster status updates by distributing communication across multiple nodes instead of a single TORQUE head node. This makes status updates more efficient and enables faster scheduling and responsiveness.
New multi-threading improves response and reliability, allowing for instant feedback to user requests as well as the ability to continue work even if some processes linger.
Improved network communications: all UDP-based communication has been replaced with TCP to make data transfers from node to node more reliable.

Job Array Auto-Cancellation Policies Improve System Productivity
Moab HPC Suite 7.0 improves system productivity with new job array auto-cancellation policies that cancel remaining sub-jobs in an array once the solution is found in the array results. This frees up resources, which would otherwise be running irrelevant jobs, to run other jobs in the queue more quickly. The job array auto-cancellation policies allow you to set auto-cancellation of sub-jobs based on the first, or any, instance of result success or failure, or on specific exit codes.

Extended Database Support Now Includes PostgreSQL and Oracle in Addition to MySQL
The extended database support in Moab HPC Suite 7.0 enables customers to use ODBC-compliant PostgreSQL and Oracle databases in addition to MySQL. This provides customers the flexibility to use the database that best meets their needs or is the standard for their system.

New Moab Web Services Provide Easier Standard Integration and Customization
New Moab Web Services provide easier standard integration and customization for a customer's environment, such as integration with existing user portals, plug-ins of resource managers for rich data integration, and script integration. Customers now have a standard interface to Moab with REST APIs.

Simplified Self-Service and Admin Dashboard Portal Experience
Moab HPC Suite 7.0 features an enhanced self-service and admin dashboard portal with simplified "click-based" job submission for end users as well as new visual cluster dashboard views of nodes, jobs, and reservations for more efficient management. The new Visual Cluster dashboard provides administrators and users with views of their cluster resources that can be filtered by almost any factor, including id, name, IP address, state, power, pending actions, reservations, load, memory, processors, etc. Users can also quickly filter and view their jobs by name, state, user, group, account, wall clock requested, memory requested, start date/time, submit date/time, etc. One-click drill-downs provide additional details and options for management actions.

Resource Usage Accounting Flexibility
Moab HPC Suite 7.0 includes more flexible resource usage accounting options that enable administrators to easily duplicate custom organizational hierarchies such as organizations, groups, projects, business units, cost centers etc. in the Moab Accounting Manager usage budgets and charging structure. This ensures resource usage is budgeted, tracked, and reported or charged back for in the most useful way to admins and their customer groups and users.
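To give an idea of what script integration against the Moab Web Services REST interface described above can look like, the following sketch lists jobs and their states. The host, port, resource path, response layout and credentials are illustrative assumptions; the correct values depend on how MWS is deployed and must be taken from the site's MWS configuration.

```python
# Illustrative sketch of querying Moab Web Services over REST.
# Host, port, resource path, and credentials are placeholders; consult the
# MWS configuration of the actual installation for the correct values.
import base64
import json
import urllib.request

BASE_URL = "http://moab-head.example.com:8080/mws/rest"   # placeholder
USER, PASSWORD = "mws-user", "secret"                      # placeholders

def mws_get(resource):
    req = urllib.request.Request(f"{BASE_URL}/{resource}")
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Assumes the response wraps the job list in a "results" field.
jobs = mws_get("jobs")
for job in jobs.get("results", []):
    print(job.get("name"), job.get("state"))
```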
as well as to manage your HPC system as it grows and expands in the future.

Moab HPC Suite – Enterprise Edition includes the patented Moab intelligence engine that enables it to integrate with and automate management across existing heterogeneous environments to optimize management and workload efficiency. This unique intelligence engine includes:
- Industry-leading multi-dimensional policies that automate the complex real-time decisions and actions for scheduling workload and allocating and adapting resources. These multi-dimensional policies can model and consider the workload requirements, resource attributes and affinities, SLAs and priorities to enable more complex and efficient decisions to be automated.
- Real-time and predictive future environment scheduling that drives more accurate and efficient decisions and service guarantees, as it can proactively adjust scheduling and resource allocations as it projects the impact of workload and resource condition changes.
- An open and flexible management abstraction layer that lets you integrate the data and orchestrate workload actions across the chaos of complex heterogeneous cluster environments and management middleware to maximize workload control, automation, and optimization.

COMPONENTS
Moab HPC Suite – Enterprise Edition includes the following integrated products and technologies for a complete HPC workload management solution:
- Moab Workload Manager: Patented multi-dimensional intelligence engine that automates the complex decisions and orchestrates policy-based workload placement and scheduling as well as resource allocation, provisioning and energy management
intelligence engine that automates the complex decisions and orchestrates policy-based workload placement and scheduling as well as resource allocation, provisioning and energy management
Moab Cluster Manager: Graphical desktop administrator application for managing, configuring, monitoring, and reporting for Moab managed clusters
Moab Viewpoint: Web-based user self-service job submission and management portal and administrator dashboard portal
Moab Accounting Manager: HPC resource use budgeting and accounting tool that enforces resource sharing agreements and limits based on departmental budgets and provides showback and chargeback reporting for resource usage
Moab Services Manager: Integration interfaces to resource managers and third-party tools

Moab HPC Suite – Enterprise Edition is also integrated with TORQUE, which is available as a free download on AdaptiveComputing.com. TORQUE is an open-source job/resource manager that provides continually updated information regarding the state of nodes and workload status. Adaptive Computing is the custodian of the TORQUE project and is actively developing the code base in cooperation with the TORQUE community to provide state of the art resource management. Each Moab HPC Suite product subscription includes support for the Moab HPC Suite as well as TORQUE, if you choose to use TORQUE as the job/resource manager for your cluster.
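To make the TORQUE integration more concrete, the following hedged Python sketch submits a small batch job through TORQUE's standard qsub command. The job script contents and the resource request (nodes, processors per node, walltime) are example values chosen for illustration, not taken from this document.

# Illustrative sketch: submit a batch job to TORQUE with qsub from Python.
# The job script body and resource request are example values only.
import os
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#PBS -N example_job
#PBS -l nodes=2:ppn=8
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
echo "running on $(hostname)"
"""

def submit_job(script_text):
    # Write the job script to a temporary file and hand it to qsub.
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as handle:
        handle.write(script_text)
        path = handle.name
    try:
        # qsub prints the new job identifier on stdout when submission succeeds.
        return subprocess.check_output(["qsub", path], text=True).strip()
    finally:
        os.remove(path)

if __name__ == "__main__":
    print("submitted:", submit_job(JOB_SCRIPT))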
MOAB HPC SUITE – BASIC EDITION
Moab HPC Suite – Basic Edition is a multi-dimensional policy-based workload management system that accelerates and automates the scheduling, managing, monitoring, and reporting of HPC workloads on massive-scale, multi-technology installations. The Moab HPC Suite – Basic Edition patented multi-dimensional decision engine accelerates both the decisions and the orchestration of workload across the ideal combination of diverse resources, including specialized resources like GPGPUs. The speed and accuracy of the decisions and scheduling automation optimize workload throughput and resource utilization, so more work is accomplished in less time with existing resources to control costs and increase the value of HPC investments.

Moab HPC Suite – Basic Edition enables you to address pressing HPC challenges including:
Delays to workload start and end times slowing results
Inconsistent delivery on service guarantees and SLA commitments
Under-utilization of resources
How to efficiently manage workload across heterogeneous and hybrid systems of GPGPUs, hardware, and middleware
How to simplify job submission & management for users and administrators to maximize productivity

Moab HPC Suite – Basic Edition acts as the “brain” of an HPC system to accelerate and automate complex decision-making processes. The patented decision engine is capable of making the complex multi-dimensional policy-based decisions needed to schedule workload to optimize job speed, job success and resource utilization. Moab HPC Suite – Basic Edition integrates decision-making data from, and automates actions through, your system’s existing mix of resource managers. This enables all the dimensions
of real-time granular resource attributes and state as well as the timing of current and future resource commitments to be factored into more efficient and accurate scheduling and allocation decisions. It also dramatically simplifies the management tasks and processes across these complex, heterogeneous environments. Moab works with many of the major resource management and industry-standard resource monitoring tools, covering mixed hardware, network, storage and licenses.
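As a small, hedged example of the node-state data such integrations draw on, the Python sketch below calls TORQUE's pbsnodes command and counts the nodes reported as free. The parsing of the output is an assumption for the example, since the exact format can vary between TORQUE versions.

# Illustrative only: gather node state from TORQUE's pbsnodes command.
# The parsing of "state = ..." lines is an assumption about the output format
# and may need adjusting for a particular TORQUE version.
import subprocess

def count_free_nodes():
    output = subprocess.check_output(["pbsnodes", "-a"], text=True)
    free = 0
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("state =") and "free" in line:
            free += 1
    return free

if __name__ == "__main__":
    print("free nodes:", count_free_nodes())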
Moab HPC Suite – Basic Edition policies are also able to factor
in organizational priorities and complexities when scheduling
workload and allocating resources. Moab ensures workload is pro-
cessed according to organizational priorities and commitments
and that resources are shared fairly across users, groups and even
multiple organizations. This enables organizations to automati-
cally enforce service guarantees and effectively manage organiza-
tional complexities with simple policy-based settings.
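The fairshare idea behind such policy-based settings can be sketched as a simple weighted priority calculation. The formula, weights and targets below are illustrative assumptions only, not the actual priority algorithm used by Moab.

# Illustrative sketch of a fairshare-weighted priority calculation.
# The formula and weights are assumptions chosen to show the idea; they are
# not the actual priority computation used by Moab.
def job_priority(base_priority, fairshare_target, recent_usage, fs_weight=100.0):
    # A group that has used less than its fairshare target gets a boost,
    # a group that has used more than its target is pushed down the queue.
    fairshare_delta = fairshare_target - recent_usage   # both as fractions, e.g. 0.25
    return base_priority + fs_weight * fairshare_delta

if __name__ == "__main__":
    # Group A is under its 30% target, group B is over its 20% target.
    print("group A:", job_priority(1000, 0.30, 0.10))   # boosted
    print("group B:", job_priority(1000, 0.20, 0.35))   # penalized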
BENEFITS
Moab HPC Suite – Basic Edition drives more ROI and results
from your HPC environment including:
Improved job response times and job throughput with a
workload decision engine that accelerates complex wor-
kload scheduling decisions to enable faster job start times
and high throughput computing
Optimized resource utilization to 90-99 percent with multi-
dimensional and predictive workload scheduling to accomp-
lish more with your existing resources
Automated enforcement of service guarantees, priorities,
and resource sharing agreements across users, groups, and
projects
Increased productivity by simplifying HPC use, access, and
control for both users and administrators with job arrays, job templates, an optional user portal, and a GUI administrator management and monitoring tool
Streamlined job turnaround and reduced administrative burden by unifying and automating workload tasks and resource processes across diverse resources and mixed-system environments, including GPGPUs
A scalable workload management architecture that can manage peta-scale and beyond, is grid-ready, compatible with existing infrastructure, and extensible to manage your environment as it grows and evolves

CAPABILITIES
Moab HPC Suite – Basic Edition accelerates workload processing with a patented multi-dimensional decision engine that self-optimizes workload placement, resource utilization and results output while ensuring organizational priorities are met across the users and groups leveraging the HPC environment.
Policy-driven scheduling intelligently places workload on the optimal set of diverse resources to maximize job throughput and success as well as utilization and the meeting of workload and group priorities
Priority, SLA and resource sharing policies, such as quality of service, hierarchical priority weighting, and fairshare targets, limits and weights, ensure the highest priority workloads are processed first and resources are shared fairly across users and groups
Allocation policies optimize resource utilization and prevent job failures with granular resource modeling and scheduling, affinity- and node topology-based placement
Backfill job scheduling speeds job throughput and maximizes utilization by scheduling smaller or less demanding jobs where they can fit around priority jobs and reservations to use all available resources
Security policies control which users and groups can access which resources
Checkpointing
Real-time and predictive scheduling ensure job priorities and guarantees are proactively met as conditions and workload levels change
Advanced reservations guarantee that jobs run when required
Maintenance reservations reserve resources for planned future maintenance to avoid disruption to business workloads
Predictive scheduling enables the future workload schedule to be continually forecasted and adjusted along with resource allocations to adapt to changes in conditions and new job and reservation requests
Advanced scheduling and management of GPGPUs for jobs to maximize their utilization
Automatic detection and management of GPGPUs in the environment to eliminate manual configuration and make them immediately available for scheduling
Exclusive allocation and scheduling of GPGPUs on a per-job basis
Policy-based management & scheduling using GPGPU metrics
Quick access to statistics on GPGPU utilization and key metrics, such as error counts, temperature, fan speed, and memory, for optimal management and issue diagnosis
Easier submission, management, and control of job arrays improve user productivity and job throughput efficiency
Users can easily submit thousands of sub-jobs with a single job submission, with an array index differentiating each array
sub-job
Job array usage limit policies enforce number of job maxi-
mums by credentials or class
Simplified reporting and management of job arrays for end
users filters jobs to summarize, track and manage at the
master job level
Scalable job performance to large-scale, extreme-scale, and
high-throughput computing environments
Efficiently manages the submission and scheduling of hund-
reds of thousands of queued job submissions to support
high throughput computing
Fast scheduler response to user commands while scheduling
so users and administrators get the real-time job informati-
on they need
Fast job throughput rate to get results started and delivered
faster and keep utilization of resources up
Open and flexible management abstraction layer easily integrates
with and automates management across existing heterogeneous
resources and middleware to improve management efficiency
Rich data integration and aggregation enables you to set pow-
erful, multi-dimensional policies based on the existing real-time
resource data monitored without adding any new agents
Heterogeneous resource allocation & management for wor-
kloads across mixed hardware, specialty resources such as