7. What is the "average" power consumption?

Workload   Power     % of Linpack
--------   -------   ------------
Linpack    30.5kW*   100%
STREAM     22.1kW*   72.5%
GUPS       23.3kW*   76.4%
Fluent     22.4kW*   73.4%
Idle       15.9kW*   52.1%
Average power consumption depends heavily on:
• the application and its data profile
• the level of code optimization (plus library and MPI optimization)
• the ability of the job scheduler to utilize the system
• bottlenecks in the I/O subsystem and in the OS
* Measured on ICE 8200 system with 128x 2.66GHz Quad-Core Intel® Xeon® Processor 5300 series (1 Rack)
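Each percentage in the table is simply the workload's measured draw relative to the Linpack draw; for STREAM, for example:

```latex
\frac{22.1\,\text{kW}}{30.5\,\text{kW}} \approx 0.725 = 72.5\%
```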
9. Real Memory Bandwidth Requirements

Measurements at LRZ on SGI Altix 4700
Source: Matthias Brehm (LRZ) in inSiDE, Vol. 4 No. 2
[Chart: measured memory bandwidth per application, with a reference line at 1 B/s : 1 Flop/s]
11. SGI UV2

4th Generation SMP System
• The most flexible system!
12. SGI UV Shared Memory Architecture

Commodity Clusters (InfiniBand or Gigabit Ethernet interconnect):
• Each system has its own memory (~64GB) and OS
• Nodes communicate over a commodity interconnect
• Inefficient cross-node communication creates bottlenecks
• Coding required for parallel code execution

SGI UV Platform (SGI NUMAlink interconnect):
• All nodes operate on one large shared memory space (global shared memory to 16TB) under one OS
• Eliminates data passing between nodes
• Big data sets fit entirely in memory
• Less memory per node required
• Simpler to program
• High performance, low cost, easy to deploy
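To make the programming-model contrast concrete, here is a minimal sketch (illustrative only, not SGI sample code). On a cluster, moving a value between nodes requires an explicit MPI send/receive pair:

```c
/* Cluster style: separate address spaces, data must be copied explicitly.
 * Illustrative sketch only. Build with mpicc; run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit copy */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d via message passing\n", value);
    }
    MPI_Finalize();
    return 0;
}
```

On a shared-memory system such as UV, the same exchange is just a store and a load in one address space, for example with OpenMP:

```c
/* Shared-memory style: one address space, no data passing between "nodes".
 * Illustrative sketch only. Build with -fopenmp. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int shared_value = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            shared_value = 42;              /* plain store */
        #pragma omp barrier                 /* order the two accesses */
        if (omp_get_thread_num() == 1)
            printf("thread 1 read %d directly\n", shared_value);  /* plain load */
    }
    return 0;
}
```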
13. The UV2 Advantage
o n g 15 y e a r h e r i t a g e : s a m e p r i n c i p l e s a s A l t i x
4 7 0 0 , …. b u t
ntel Sandy Bridge Xeon Multi-Core Processors
arge scalable Shared Memory System
• Up to 4096 Cores and 64TB per Partition
• Up to 2048 Cores, 4096 Threads and 32TB per Partition
• Multi-partition Systems with up to 16384 Sockets, 2PB in multiple
Partitions
• MPI, UPC Acceleration by Hardware Offload
• Cross Partition Communication
n 2 0 12 w i t h o u t c o m p e t i t i o n
y help of proven SGI ccNuma Architecture
14. SGI UV2 Interconnect with Global Addressing

• NUMAlink® routers connect nodes into multi-rack UV systems
• The HUB snoops the socket QPI and accelerates remote access
• The HUB offloads programming models: MPI, UPC (Co-Array not yet)
[Diagram: a high-radix NUMAlink router connects four Altix UV blades; each blade has a HUB, two CPUs and 64GB of memory per socket, for 512GB of globally addressable memory]
15. UV Foundation: GAM + Communications Offload

• GSM - cc = GAM: globally shared memory without cache coherence gives globally addressable memory
• Partition memory (one OS image): max. 2K cores / 16TB
• PGAS memory (cross-partition)
• Communications offload (GRU + AMU)
  - Accelerates PGAS codes
  - Accelerates MPI codes (MOE, analogous to a TOE)
• GAM: Globally Addressable Memory, 8PB (53-bit addressing)
[Diagram: Intel CPU connects over QPI to the HUB; the HUB's GRU (with TLB) and AMU connect over NUMAlink to other nodes]
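The PGAS/GAM idea, directly referencing memory that lives on another node or partition, can be approximated in portable code with MPI-3 one-sided operations. A minimal sketch (generic MPI, not SGI's GRU/AMU offload API; on UV the HUB would service such remote references in hardware):

```c
/* PGAS-style remote access sketch using MPI one-sided communication.
 * Generic MPI-3 only; not SGI's offload interface. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Expose one integer per rank as remotely addressable memory. */
    int local = rank * 100;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Rank 0 reads rank 1's integer directly; no matching receive exists. */
    MPI_Win_fence(0, win);
    int remote = -1;
    if (rank == 0 && nranks > 1)
        MPI_Get(&remote, 1, MPI_INT, /*target=*/1, /*disp=*/0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("rank 0 fetched %d from rank 1's memory\n", remote);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```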
16. UV1 vs. UV2

Component      UV1                        UV2
Socket         NHM-EX / WSM-EX            SNB-EX-B & SNB-EP / IVB-EX-B & IVB-EP
               QPI 1.0                    QPI 1.1
Hub            3 separate chips (H+H+R)   1 chip
               90nm process               40nm process
               (D) Directory DIMM         No directory DIMM
               (S) Snoop DRAM             No snoop DRAM
                                          Better AMOs
Interconnect   NL5, 6.25 GT/s             NL6, 12.5 GT/s
               8B/10B encoding            Higher-payload encoding
               4 x 12 lanes               16 x 4 lanes
               Cu only                    Cu & optical
               7m max cable               20m max cable
18. Additional Performance Acceleration

• Barrier latency <1usec (4096 threads)
• Altix UV offers up to 3X improvement in MPI reduction operations
• Barrier latency is dramatically better (80x) than on competing platforms
• HPCC benchmarks show the substantial improvement possible with the MPI Offload Engine (MOE)
[Chart: HPCC benchmark results comparing "UV with MOE" vs. "UV, MOE disabled"]
Source: SGI Engineering projections
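Barrier cost on any MPI system can be observed by timing a tight MPI_Barrier loop. A minimal sketch (generic MPI, not an official SGI or HPCC benchmark):

```c
/* Barrier-latency microbenchmark sketch: reports average time per barrier.
 * Generic MPI; numbers depend entirely on the interconnect and scale. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const int iters = 100000;
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* warm up and align all ranks */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg barrier latency: %.3f usec\n", (t1 - t0) / iters * 1e6);
    MPI_Finalize();
    return 0;
}
```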
19. UV2000 16-Socket, 8-Blade IRU

Notes:
• IRU: 10U high by 19" wide by 27" deep
• 8 blades (8 Harps & 16 sockets) per IRU
• 1 or 2 CMCs in rear of IRU
• 3 UV1 12V power supplies
• Nine 12V cooling fans, N+1
• Two signal backplanes plus a power backplane
• 16 NL channels cabled in the air plenum to connect the right and left backplanes
[Diagram, front view: CMC, signal backplanes and power backplane]
20. SGI UV2 Node Architecture and NUMAlink 6

• Two Sandy Bridge-EP or -EX sockets connected by QPI 1.1 at 8GT/s (32GB/s)
• Per socket: 4 DDR3 channels, 2 DPC at 1600MHz; 40 PCIe lanes (PCIe Gen3 x16); same per-socket performance as in a cluster
• UV2-HUB drives 16 x4 NL6 channels at 12.5GT/s; NO memory buffers as in UV1
• NUMAlink 6: 12.5GT/s, i.e. 6.7GB/s net bidirectional bandwidth per link
  - 16 NL6 links, aggregate bandwidth out of the blade: 107.2GB/s
  - 12 NL6 internal links to the backplane, aggregate: 80.4GB/s
  - 4 NL6 external links to routers (NL0-plane and NL1-plane), aggregate: 26.8GB/s
• NUMAlink 6 routers: 16 NL6 ports each
• NUMAlink cables connect the IRU external links
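The aggregate figures follow directly from the 6.7GB/s per-link rate times the link counts:

```latex
16 \times 6.7~\mathrm{GB/s} = 107.2~\mathrm{GB/s},\qquad
12 \times 6.7~\mathrm{GB/s} = 80.4~\mathrm{GB/s},\qquad
4 \times 6.7~\mathrm{GB/s} = 26.8~\mathrm{GB/s}
```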
21. UV2 Topology

System topology: hypercube
• Max 2 hops between blades
[Diagram: IRU and blade connectivity]
22. UV2 Feature Advances

Feature           UV1                UV2
System scale      2048c/4096t        4096c/4096t
Memory/SSI        16TB               64TB
Interconnect      NUMAlink 5         NUMAlink 6 (2.5X data rate)
NL fabric scale   32K sockets        32K+ sockets
Processor         Nehalem-EX         Sandy Bridge
Sockets/rack      64 (large 24")     64 (standard 19")
Reliability       Enterprise class   Enterprise class
23. MIC Architecture

• x86 compatible
• 1.3TF/s double-precision peak
• 340GB/s bandwidth
24. SGI ICE X ...

Fifth Generation ICE System
• The world's fastest supercomputer just got faster!
• Flexible to fit your workload
26. Dialing Up the Density!

             SGI ICE 8400   SGI ICE X (D-Rack)   SGI ICE X (M-Rack)
Nodes        64N            72N                  144N (72 x 2)
Sockets      128            144                  288
Rack width   30"            24"                  28"
27. SGI ICE X Enclosure Design

Building block: increments of two blade enclosures, "one enclosure pair"

Features per enclosure pair:
• 36 blade slots
• Four fabric switch slots
• Integrated management
• Separable power shelf (1U)
[Diagram, rear view: a 21U "building block" combining two blade enclosures (9.5U each) and the 1U power shelf in a 19" rack mount]
28. SGI ICE X Compute Blade

IP-113 (Dakota) for "D-Rack"

Main features:
• Supports single- or dual-plane FDR InfiniBand (FDR mezzanine card options)
• Supports two future Intel® Xeon® processor E5 family CPUs
• Supports up to eight DDR3 DIMMs per socket @ 1600 MT/s
• Houses up to two 2.5" SATA drives for local swap/scratch usage
• Utilizes traditional heat sinks
30. On-Socket Water-Cooling Detail

Used for IP-115 Gemini "twin" blades; replaces the traditional air-cooled heat sinks on the CPUs to enable highest-watt SKU support.
• Resides between the pair of node boards in each blade slot ("M-Rack" deployment)
• Enables highest-watt SKU support (e.g., 130W TDPs)
• Utilizes a liquid-to-water heat exchanger that provisions the required quantity of flow to the M-Racks for cooling
33. ICE Differentiation: OS Noise Synchronization

• OS system noise: CPU cycles stolen from a user application by the OS to do periodic or asynchronous work (monitoring, daemons, garbage collection, etc.)
• The management interface will allow users to select what gets synchronized
• Performance boost on larger-scale systems
[Diagram: with unsynchronized OS noise, each node's system overhead lands at a different time, so the other nodes waste cycles waiting for the barrier to complete; with synchronized OS noise, all nodes take the overhead simultaneously and results arrive faster]
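A toy simulation (an illustrative assumption, not SGI's implementation) shows why synchronization helps: a barrier-synchronized iteration costs as much as its slowest node, so when noise hits nodes at random times, almost every iteration pays the penalty, whereas aligned noise slows only a fraction of iterations.

```c
/* Toy model of OS noise in a barrier-synchronized parallel loop.
 * Each iteration costs max over nodes of (compute + noise).
 * All constants are made-up illustration values. */
#include <stdio.h>
#include <stdlib.h>

#define NODES   64
#define ITERS   10000
#define COMPUTE 1.0     /* work per iteration, arbitrary time units */
#define NOISE   0.1     /* OS overhead when it occurs */

int main(void) {
    srand(42);
    double unsync = 0.0, synced = 0.0;
    for (int i = 0; i < ITERS; i++) {
        /* Unsynchronized: each node is independently hit on ~10% of
         * iterations; with 64 nodes, some node is hit almost every time. */
        double worst = COMPUTE;
        for (int n = 0; n < NODES; n++)
            if (rand() % 10 == 0)
                worst = COMPUTE + NOISE;
        unsync += worst;
        /* Synchronized: all nodes take the overhead in the same ~10% of
         * iterations, so 90% of iterations run noise-free. */
        synced += (i % 10 == 0) ? COMPUTE + NOISE : COMPUTE;
    }
    printf("unsynchronized total: %.0f   synchronized total: %.0f\n",
           unsync, synced);
    return 0;
}
```

The gap widens with node count, matching the note below that unsynchronized noise grows continuously worse as node counts rise.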
36. UN Chief Calls for Urgent Action on Climate Change
NASA Advanced Supercomputing Division
SGI® ICE
Images taken by the Thematic Mapper sensor aboard Landsat 5
Source: USGS Landsat Missions Gallery, U.S. Department of the Interior / U.S. Geological Survey
38. Cyclone Service Models
SGI delivers techincal application
expertise.
Software (SaaS)
SGI delivers commercially
available open and 3rd party
software via the Internet.
SGI Cyclone
SGI offers a platform for
developers
SGI delivers the system
infrastructure.
39. SGI OpenFOAM® Ready for Cyclone

Customer: iVEC and Curtin University, Australia
Problem: solving large-scale CFD problems, like simulating wind flows in the capital city of Perth.
Solution: OpenFOAM scaled better on SGI Cyclone (1024 cores) and was 20x faster than on Amazon EC2.
[Diagram: the user submits a job through the SGI Technical Applications Portal, powered by Cyclone]
Source: Dr Andrew King, Department of Mechanical Engineering, Curtin University of Technology, Australia
40. Balanced Design & Architecture

Would you attach a caravan to an F1 car?
Multiple runs and optimizations have yielded different results; just focus on the graph showing the typical "relative" comparison of Linpack, idle, and application/benchmark power.
The world's fastest supercomputer just got faster! Largest performance boost ever, up to 5x performance-density improvement over the previous industry-leading generation, with the future Intel® Xeon® processor E5 family. Key design innovations and increased flexibility through enhanced R&D investment. The world-renowned SGI quality and performance you love. Entirely built on industry-standard hardware and software components, enabling access to the full spectrum of the Linux ecosystem. The only system in its class that installs production-ready in hours or days, not weeks or months. Flexible to fit your workload: ultimate configuration flexibility in topology/interconnect, power, cooling, CPUs and memory; seamless scalability from tens of teraflops to tens of petaflops; expandability within and across technology generations while maintaining uninterrupted production workflow.
First *over 1PF peak* InfiniBand pure-compute-connected CPU cluster. World's fastest distributed memory system. Top Intel-based overall SPEC_MPIM2007 and SPEC_MPIL2007 performance (base and peak); top AMD-based SPEC_MPIM2007 and SPEC_MPIL2007 performance (base and peak). World's fastest and most scalable computational fluid dynamics system: SGI ICE 8400 demonstrated unmatched parallel scaling up to 3,072 cores with a rating of 1,333.3 standard benchmark jobs per day, and also proved the ability to run ANSYS FLUENT on all 4,092 cores; to date, no other cluster has reported ANSYS FLUENT benchmark results above 2,048 cores. The ANSYS FLUENT benchmark performance increase was achieved with the help of SGI MPI PerfBoost. First and only vendor to support multiple fabric-level topologies, plus flexibility at the node, switch and fabric level, plus application benchmarking expertise for same. First and only vendor capable of live, large-scale compute capacity integration.
"Closed-Loop Airflow" environment: integrated hot-aisle containment; no air from within the cell is mixed with the data center air in which the cell is installed (versus a hot/cold-aisle, open-loop airflow arrangement, in which the air is mixed). Always water-cooled; supports warm-water cooling, with a broad range of acceptable temperatures for additional cost savings. Contains an air-to-water heat exchanger, and a liquid-to-water heat exchanger when cold sinks are deployed. Contains large, "unified" cooling racks for efficiency: compute racks do not have their own cooling at the rack level, which decreases power costs associated with cooling, and all cooling elements utilize one water source.
Synchronizing the OS overhead-related tasks on each node to begin simultaneously on all nodes in the cluster results in significantly fewer wasted cycles over the duration of parallel workloads. The negative effect of unsynchronized OS noise grows continuously worse as node counts rise.
Left: August 1985. Right: August 2010. Iran’s Lake Oroumeih (also spelled Urmia) is the largest lake in the Middle East and the third largest saltwater lake on Earth. But dams on feeder streams, expanded use of ground water, and a decades-long drought have reduced it to 60 percent of the size it was in the 1980s. Light blue tones in the 2010 image represent shallow water and salt deposits. Increased salinity has led to an absence of fish and habitat for migratory waterfowl. At the current rate, the lake will be completely dry by the end of 2013.
Customer name: iVEC and Dr Andrew King, Department of Mechanical Engineering, Curtin University of Technology, Australia. Challenge: iVEC and the Fluid Dynamics Research Group at Curtin University are working together to solve large-scale CFD problems, like simulating wind flows in the capital city of Perth. SGI Cyclone solution: the testing included running OpenFOAM on internal systems, SGI Cyclone, and the Amazon EC2 cloud. SGI Cyclone proved to scale better (1,024 cores) and was much faster!