7. What is the "average" power consumption?

Workload   Power     % of Linpack
--------   -------   ------------
Linpack    30.5kW*   100%
STREAM     22.1kW*   72.5%
GUPS       23.3kW*   76.4%
Fluent     22.4kW*   73.4%
Idle       15.9kW*   52.1%
Average power consumption depends heavily on:
• the application and its data profile
• the level of code optimization (plus library and MPI optimization)
• the ability of the job scheduler to utilize the system
• bottlenecks in the I/O subsystem and in the OS
* Measured on ICE 8200 system with 128x 2.66GHz Quad-Core Intel® Xeon® Processor 5300 series (1 Rack)
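Each percentage in the table is simply the workload's measured draw relative to the Linpack draw; for STREAM, for example:

```latex
\frac{22.1\,\text{kW}}{30.5\,\text{kW}} \approx 0.725 = 72.5\%
```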
9. Real Memory Bandwidth Requirements

Measurements at LRZ on SGI Altix 4700
Source: Matthias Brehm (LRZ) in inSiDE, Vol. 4 No. 2
[Chart: measured memory bandwidth per application, with a reference line at 1 B/s : 1 Flop/s]
11. SGI UV2

4th Generation SMP System
• The most flexible system!
12. SGI UV Shared Memory Architecture

Commodity Clusters (InfiniBand or Gigabit Ethernet interconnect):
• Each system has its own memory (~64GB) and OS
• Nodes communicate over a commodity interconnect
• Inefficient cross-node communication creates bottlenecks
• Coding required for parallel code execution

SGI UV Platform (SGI NUMAlink interconnect):
• All nodes operate on one large shared memory space (global shared memory to 16TB) under one OS
• Eliminates data passing between nodes
• Big data sets fit entirely in memory
• Less memory per node required
• Simpler to program
• High performance, low cost, easy to deploy
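To make the programming-model contrast concrete, here is a minimal sketch (illustrative only, not SGI sample code). On a cluster, moving a value between nodes requires an explicit MPI send/receive pair:

```c
/* Cluster style: separate address spaces, data must be copied explicitly.
 * Illustrative sketch only. Build with mpicc; run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit copy */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d via message passing\n", value);
    }
    MPI_Finalize();
    return 0;
}
```

On a shared-memory system such as UV, the same exchange is just a store and a load in one address space, for example with OpenMP:

```c
/* Shared-memory style: one address space, no data passing between "nodes".
 * Illustrative sketch only. Build with -fopenmp. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int shared_value = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            shared_value = 42;              /* plain store */
        #pragma omp barrier                 /* order the two accesses */
        if (omp_get_thread_num() == 1)
            printf("thread 1 read %d directly\n", shared_value);  /* plain load */
    }
    return 0;
}
```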
13. The UV2 Advantage
o n g 15 y e a r h e r i t a g e : s a m e p r i n c i p l e s a s A l t i x
4 7 0 0 , …. b u t
ntel Sandy Bridge Xeon Multi-Core Processors
arge scalable Shared Memory System
• Up to 4096 Cores and 64TB per Partition
• Up to 2048 Cores, 4096 Threads and 32TB per Partition
• Multi-partition Systems with up to 16384 Sockets, 2PB in multiple
Partitions
• MPI, UPC Acceleration by Hardware Offload
• Cross Partition Communication
n 2 0 12 w i t h o u t c o m p e t i t i o n
y help of proven SGI ccNuma Architecture
14. SGI UV2 Interconnect with Global Addressing

• NUMAlink® routers connect nodes into multi-rack UV systems
• The HUB snoops the socket QPI and accelerates remote access
• The HUB offloads programming models: MPI, UPC (Co-Array not yet)
[Diagram: a high-radix NUMAlink router connects four Altix UV blades; each blade has a HUB, two CPUs and 64GB of memory per socket, for 512GB of globally addressable memory]
15. UV Foundation: GAM + Communications Offload

• GSM - cc = GAM: globally shared memory without cache coherence gives globally addressable memory
• Partition memory (one OS image): max. 2K cores / 16TB
• PGAS memory (cross-partition)
• Communications offload (GRU + AMU)
  - Accelerates PGAS codes
  - Accelerates MPI codes (MOE, analogous to a TOE)
• GAM: Globally Addressable Memory, 8PB (53-bit addressing)
[Diagram: Intel CPU connects over QPI to the HUB; the HUB's GRU (with TLB) and AMU connect over NUMAlink to other nodes]
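The PGAS/GAM idea, directly referencing memory that lives on another node or partition, can be approximated in portable code with MPI-3 one-sided operations. A minimal sketch (generic MPI, not SGI's GRU/AMU offload API; on UV the HUB would service such remote references in hardware):

```c
/* PGAS-style remote access sketch using MPI one-sided communication.
 * Generic MPI-3 only; not SGI's offload interface. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Expose one integer per rank as remotely addressable memory. */
    int local = rank * 100;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Rank 0 reads rank 1's integer directly; no matching receive exists. */
    MPI_Win_fence(0, win);
    int remote = -1;
    if (rank == 0 && nranks > 1)
        MPI_Get(&remote, 1, MPI_INT, /*target=*/1, /*disp=*/0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("rank 0 fetched %d from rank 1's memory\n", remote);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```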
16. UV1 vs. UV2

Component      UV1                        UV2
Socket         NHM-EX / WSM-EX            SNB-EX-B & SNB-EP / IVB-EX-B & IVB-EP
               QPI 1.0                    QPI 1.1
Hub            3 separate chips (H+H+R)   1 chip
               90nm process               40nm process
               (D) Directory DIMM         No directory DIMM
               (S) Snoop DRAM             No snoop DRAM
                                          Better AMOs
Interconnect   NL5, 6.25 GT/s             NL6, 12.5 GT/s
               8B/10B encoding            Higher-payload encoding
               4 x 12 lanes               16 x 4 lanes
               Cu only                    Cu & optical
               7m max cable               20m max cable
18. Additional Performance Acceleration

• Barrier latency <1usec (4096 threads)
• Altix UV offers up to 3X improvement in MPI reduction operations
• Barrier latency is dramatically better (80x) than on competing platforms
• HPCC benchmarks show the substantial improvement possible with the MPI Offload Engine (MOE)
[Chart: HPCC benchmark results comparing "UV with MOE" vs. "UV, MOE disabled"]
Source: SGI Engineering projections
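Barrier cost on any MPI system can be observed by timing a tight MPI_Barrier loop. A minimal sketch (generic MPI, not an official SGI or HPCC benchmark):

```c
/* Barrier-latency microbenchmark sketch: reports average time per barrier.
 * Generic MPI; numbers depend entirely on the interconnect and scale. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const int iters = 100000;
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* warm up and align all ranks */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg barrier latency: %.3f usec\n", (t1 - t0) / iters * 1e6);
    MPI_Finalize();
    return 0;
}
```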
19. UV2000 16-Socket, 8-Blade IRU

Notes:
• IRU: 10U high by 19" wide by 27" deep
• 8 blades (8 Harps & 16 sockets) per IRU
• 1 or 2 CMCs in rear of IRU
• 3 UV1 12V power supplies
• Nine 12V cooling fans, N+1
• Two signal backplanes plus a power backplane
• 16 NL channels cabled in the air plenum to connect the right and left backplanes
[Diagram, front view: CMC, signal backplanes and power backplane]
20. SGI UV2 Node Architecture and NUMAlink 6

• Two Sandy Bridge-EP or -EX sockets connected by QPI 1.1 at 8GT/s (32GB/s)
• Per socket: 4 DDR3 channels, 2 DPC at 1600MHz; 40 PCIe lanes (PCIe Gen3 x16); same per-socket performance as in a cluster
• UV2-HUB drives 16 x4 NL6 channels at 12.5GT/s; NO memory buffers as in UV1
• NUMAlink 6: 12.5GT/s, i.e. 6.7GB/s net bidirectional bandwidth per link
  - 16 NL6 links, aggregate bandwidth out of the blade: 107.2GB/s
  - 12 NL6 internal links to the backplane, aggregate: 80.4GB/s
  - 4 NL6 external links to routers (NL0-plane and NL1-plane), aggregate: 26.8GB/s
• NUMAlink 6 routers: 16 NL6 ports each
• NUMAlink cables connect the IRU external links
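The aggregate figures follow directly from the 6.7GB/s per-link rate times the link counts:

```latex
16 \times 6.7~\mathrm{GB/s} = 107.2~\mathrm{GB/s},\qquad
12 \times 6.7~\mathrm{GB/s} = 80.4~\mathrm{GB/s},\qquad
4 \times 6.7~\mathrm{GB/s} = 26.8~\mathrm{GB/s}
```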
21. UV2 Topology

System topology: hypercube
• Max 2 hops between blades
[Diagram: IRU and blade connectivity]
22. UV2 Feature Advances

Feature           UV1                UV2
System scale      2048c/4096t        4096c/4096t
Memory/SSI        16TB               64TB
Interconnect      NUMAlink 5         NUMAlink 6 (2.5X data rate)
NL fabric scale   32K sockets        32K+ sockets
Processor         Nehalem-EX         Sandy Bridge
Sockets/rack      64 (large 24")     64 (standard 19")
Reliability       Enterprise class   Enterprise class
23. MIC Architecture

• x86 compatible
• 1.3TF/s double-precision peak
• 340GB/s bandwidth
24. SGI ICE X ...

Fifth Generation ICE System
• The world's fastest supercomputer just got faster!
• Flexible to fit your workload
26. Dialing Up the Density!

             SGI ICE 8400   SGI ICE X (D-Rack)   SGI ICE X (M-Rack)
Nodes        64N            72N                  144N (72 x 2)
Sockets      128            144                  288
Rack width   30"            24"                  28"
27. SGI ICE X Enclosure Design

Building block: increments of two blade enclosures, "one enclosure pair"

Features per enclosure pair:
• 36 blade slots
• Four fabric switch slots
• Integrated management
• Separable power shelf (1U)
[Diagram, rear view: a 21U "building block" combining two blade enclosures (9.5U each) and the 1U power shelf in a 19" rack mount]
28. SGI ICE X Compute Blade

IP-113 (Dakota) for "D-Rack"

Main features:
• Supports single- or dual-plane FDR InfiniBand (FDR mezzanine card options)
• Supports two future Intel® Xeon® processor E5 family CPUs
• Supports up to eight DDR3 DIMMs per socket @ 1600 MT/s
• Houses up to two 2.5" SATA drives for local swap/scratch usage
• Utilizes traditional heat sinks
30. On-Socket Water-Cooling Detail

Used for IP-115 Gemini "twin" blades; replaces the traditional air-cooled heat sinks on the CPUs to enable highest-watt SKU support.
• Resides between the pair of node boards in each blade slot ("M-Rack" deployment)
• Enables highest-watt SKU support (e.g., 130W TDPs)
• Utilizes a liquid-to-water heat exchanger that provisions the required quantity of flow to the M-Racks for cooling
33. ICE Differentiation: OS Noise Synchronization

• OS system noise: CPU cycles stolen from a user application by the OS to do periodic or asynchronous work (monitoring, daemons, garbage collection, etc.)
• The management interface will allow users to select what gets synchronized
• Performance boost on larger-scale systems
[Diagram: with unsynchronized OS noise, each node's system overhead lands at a different time, so the other nodes waste cycles waiting for the barrier to complete; with synchronized OS noise, all nodes take the overhead simultaneously and results arrive faster]
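A toy simulation (an illustrative assumption, not SGI's implementation) shows why synchronization helps: a barrier-synchronized iteration costs as much as its slowest node, so when noise hits nodes at random times, almost every iteration pays the penalty, whereas aligned noise slows only a fraction of iterations.

```c
/* Toy model of OS noise in a barrier-synchronized parallel loop.
 * Each iteration costs max over nodes of (compute + noise).
 * All constants are made-up illustration values. */
#include <stdio.h>
#include <stdlib.h>

#define NODES   64
#define ITERS   10000
#define COMPUTE 1.0     /* work per iteration, arbitrary time units */
#define NOISE   0.1     /* OS overhead when it occurs */

int main(void) {
    srand(42);
    double unsync = 0.0, synced = 0.0;
    for (int i = 0; i < ITERS; i++) {
        /* Unsynchronized: each node is independently hit on ~10% of
         * iterations; with 64 nodes, some node is hit almost every time. */
        double worst = COMPUTE;
        for (int n = 0; n < NODES; n++)
            if (rand() % 10 == 0)
                worst = COMPUTE + NOISE;
        unsync += worst;
        /* Synchronized: all nodes take the overhead in the same ~10% of
         * iterations, so 90% of iterations run noise-free. */
        synced += (i % 10 == 0) ? COMPUTE + NOISE : COMPUTE;
    }
    printf("unsynchronized total: %.0f   synchronized total: %.0f\n",
           unsync, synced);
    return 0;
}
```

The gap widens with node count, matching the note below that unsynchronized noise grows continuously worse as node counts rise.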
36. UN Chief Calls for Urgent Action on Climate Change
NASA Advanced Supercomputing Division
SGI® ICE
Images taken by the Thematic Mapper sensor aboard Landsat 5
Source: USGS Landsat Missions Gallery, U.S. Department of the Interior / U.S. Geological Survey
38. Cyclone Service Models
SGI delivers techincal application
expertise.
Software (SaaS)
SGI delivers commercially
available open and 3rd party
software via the Internet.
SGI Cyclone
SGI offers a platform for
developers
SGI delivers the system
infrastructure.
39. SGI OpenFOAM® Ready for Cyclone

Customer: iVEC and Curtin University, Australia
Problem: solving large-scale CFD problems, like simulating wind flows in the capital city of Perth.
Solution: OpenFOAM scaled better on SGI Cyclone (1024 cores) and was 20x faster than on Amazon EC2.
[Diagram: the user submits a job through the SGI Technical Applications Portal, powered by Cyclone]
Source: Dr Andrew King, Department of Mechanical Engineering, Curtin University of Technology, Australia
40. Balanced Design & Architecture

Would you attach a caravan to an F1 car?
Multiple runs and optimizations have yielded different results; just focus on the graph showing the typical "relative" comparison of Linpack, idle, and application/benchmark power.
The world's fastest supercomputer just got faster! Largest performance boost ever, up to 5x performance-density improvement over the previous industry-leading generation, with the future Intel® Xeon® processor E5 family. Key design innovations and increased flexibility through enhanced R&D investment. The world-renowned SGI quality and performance you love. Entirely built on industry-standard hardware and software components, enabling access to the full spectrum of the Linux ecosystem. The only system in its class that installs production-ready in hours or days, not weeks or months. Flexible to fit your workload: ultimate configuration flexibility in topology/interconnect, power, cooling, CPUs and memory; seamless scalability from tens of teraflops to tens of petaflops; expandability within and across technology generations while maintaining uninterrupted production workflow.
First *over 1PF peak* InfiniBand pure-compute-connected CPU cluster. World's fastest distributed memory system. Top Intel-based overall SPEC_MPIM2007 and SPEC_MPIL2007 performance (base and peak); top AMD-based SPEC_MPIM2007 and SPEC_MPIL2007 performance (base and peak). World's fastest and most scalable computational fluid dynamics system: SGI ICE 8400 demonstrated unmatched parallel scaling up to 3,072 cores with a rating of 1,333.3 standard benchmark jobs per day, and also proved the ability to run ANSYS FLUENT on all 4,092 cores; to date, no other cluster has reported ANSYS FLUENT benchmark results above 2,048 cores. The ANSYS FLUENT benchmark performance increase was achieved with the help of SGI MPI PerfBoost. First and only vendor to support multiple fabric-level topologies, plus flexibility at the node, switch and fabric level, plus application benchmarking expertise for same. First and only vendor capable of live, large-scale compute capacity integration.
"Closed-Loop Airflow" environment: integrated hot-aisle containment; no air from within the cell is mixed with the data center air in which the cell is installed (versus a hot/cold-aisle, open-loop airflow arrangement, in which the air is mixed). Always water-cooled; supports warm-water cooling, with a broad range of acceptable temperatures for additional cost savings. Contains an air-to-water heat exchanger, and a liquid-to-water heat exchanger when cold sinks are deployed. Contains large, "unified" cooling racks for efficiency: compute racks do not have their own cooling at the rack level, which decreases power costs associated with cooling, and all cooling elements utilize one water source.
Synchronizing the OS overhead-related tasks on each node to begin simultaneously on all nodes in the cluster results in significantly fewer wasted cycles over the duration of parallel workloads. The negative effect of unsynchronized OS noise grows continuously worse as node counts rise.
Left: August 1985. Right: August 2010. Iran’s Lake Oroumeih (also spelled Urmia) is the largest lake in the Middle East and the third largest saltwater lake on Earth. But dams on feeder streams, expanded use of ground water, and a decades-long drought have reduced it to 60 percent of the size it was in the 1980s. Light blue tones in the 2010 image represent shallow water and salt deposits. Increased salinity has led to an absence of fish and habitat for migratory waterfowl. At the current rate, the lake will be completely dry by the end of 2013.
Customer name: iVEC and Dr Andrew King, Department of Mechanical Engineering, Curtin University of Technology, Australia. Challenge: iVEC and the Fluid Dynamics Research Group at Curtin University are working together to solve large-scale CFD problems, like simulating wind flows in the capital city of Perth. SGI Cyclone solution: the testing included running OpenFOAM on internal systems, SGI Cyclone, and the Amazon EC2 cloud. SGI Cyclone proved to scale better (1,024 cores) and was much faster!