Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind
ACM Computing Frontiers 2006
© 2006 IBM Corporation
Computer Architecture at the turn of the millennium
ƒ “The end of architecture” was being proclaimed
– Frequency scaling as performance driver
– State of the art microprocessors
•multiple instruction issue
•out of order architecture
•register renaming
•deep pipelines
ƒ Little or no focus on compilers
– Questions about the need for compiler research
ƒ Academic papers focused on microarchitecture tweaks
The age of frequency scaling
SCALING (by factor α): Voltage V/α; Oxide thickness tox/α; Wire width W/α; Gate length L/α; Diffusion depth xd/α; Substrate doping α·NA
RESULTS: Higher density ~α²; Higher speed ~α; Power per circuit ~1/α²; Power density ~constant
[Figure: scaled MOSFET cross-section — gate, n+ source, n+ drain, and wiring, with L, xd, W, tox, and V all reduced by 1/α and substrate doping increased to α·NA]
Source: Dennard et al., JSSC 1974.
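The "power density ~constant" result follows from the scaled quantities above; a compact restatement of Dennard's argument using the standard CV²f dynamic-power model (my summary, not reproduced from the slide):

```latex
P_{ckt} \propto C V^{2} f, \qquad C \to C/\alpha,\ V \to V/\alpha,\ f \to \alpha f
\;\Rightarrow\; P_{ckt} \to P_{ckt}/\alpha^{2};
\quad \text{area} \to \text{area}/\alpha^{2}
\;\Rightarrow\; \text{power density remains constant.}
```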
[Chart: performance (BIPS) vs. total FO4 per pipeline stage — performance rises as pipelines get deeper (fewer FO4 per stage)]
Frequency scaling
ƒ A trusted standby
– Frequency as the tide that raises all boats
– Increased performance across all applications
– Kept the industry going for a decade
ƒ Massive investment to keep going
– Equipment cost
– Power increase
– Manufacturing variability
– New materials
… but reality was closing in!
[Charts: BIPS and BIPS³/W relative to the optimal-FO4 design point vs. total FO4 per stage; active power vs. leakage power trends]
Power crisis
[Chart: Tox (Å), Vdd, and Vt vs. gate length Lgate (µm), compared against classic scaling]
ƒ The power crisis is not
“natural”
– Created by deviating from
ideal scaling theory
– Vdd and Vt not scaled by α
• additional performance with
increased voltage
ƒ Laws of physics
– P_chip = n_transistors × P_transistor
ƒ Marginal performance gain
per transistor low
– significant power increase
– decreased power/performance
efficiency
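As a rough illustration of why abandoning voltage scaling creates the crisis (same CV²f model as above; my arithmetic, not from the slide):

```latex
\text{With } V \text{ fixed, } C \to C/\alpha,\ f \to \alpha f:\quad
P_{ckt} \propto C V^{2} f \;\to\; (C/\alpha)\,V^{2}\,(\alpha f) = C V^{2} f,
```

so per-circuit power no longer shrinks, while α² more circuits fit on the same die and P_chip = n_transistors · P_transistor grows accordingly.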
The power inefficiency of deep pipelining
[Chart: BIPS and BIPS³/W relative to the optimal-FO4 design point vs. total FO4 per stage — the power-performance optimum (BIPS³/W) sits at a much shallower pipeline than the performance optimum (BIPS)]
ƒ Deep pipelining increases number of latches and switching rate
⇒ power increases with at least f²
ƒ Latch insertion delay limits gains of pipelining
Source: Srinivasan et al., MICRO 2002
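A rough sketch of the at-least-f² dependence, using standard dynamic-power reasoning (my paraphrase, not the paper's derivation):

```latex
P_{dyn} \propto N_{latch} \cdot C_{latch} \cdot V^{2} \cdot f, \qquad
N_{latch} \propto f \ \text{(more, shorter stages)}
\;\Rightarrow\; P_{dyn} \gtrsim k\,f^{2}
```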
Hitting the memory wall
ƒ Latency gap
– Memory speedup lags behind
processor speedup
– limits ILP
ƒ Chip I/O bandwidth gap
– Less bandwidth per MIPS
ƒ Latency gap as application
bandwidth gap
– Typically (much) less than
chip I/O bandwidth
[Chart: normalized MFLOPS vs. memory latency, 1990–2003. Source: McKee, Computing Frontiers 2004]
usable bandwidth = (avg. request size × no. of in-flight requests) / roundtrip latency
Source: Burger et al., ISCA 1996
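The relation above is Little's law applied to memory requests. As an illustrative (not measured) example with numbers of the magnitude discussed later in the talk — 128-byte requests, 16 in flight, a ~500-cycle round trip at 3.2 GHz:

```latex
\text{usable bandwidth} = \frac{128\ \text{B} \times 16}{500\ \text{cycles} / 3.2\ \text{GHz}}
\approx \frac{2048\ \text{B}}{156\ \text{ns}} \approx 13\ \text{GB/s}
```

far below chip I/O peak unless many more requests are kept in flight.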
Our Y2K challenge
ƒ 10× performance of desktop systems
ƒ 1 TeraFlop / second with a four-node configuration
ƒ 1 Byte bandwidth per 1 Flop
– “golden rule for balanced supercomputer design”
ƒ scalable design across a range of design points
ƒ mass-produced and low cost
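Reading these targets together (my arithmetic, for illustration only):

```latex
\frac{1\ \text{TFLOP/s}}{4\ \text{nodes}} = 250\ \text{GFLOP/s per node}
\;\Rightarrow\; \sim\!250\ \text{GB/s of bandwidth per node at 1 B/Flop}
```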
Cell Design Goals
ƒ Provide the platform for the future of computing
– 10× performance of desktop systems shipping in 2005
ƒ Computing density as main challenge
–Dramatically increase performance per X
• X = Area, Power, Volume, Cost,…
ƒ Single core designs offer diminishing returns on
investment
– In power, area, design complexity and verification cost
Necessity as the mother of invention
ƒ Increase power-performance efficiency
– Simple designs are more efficient in terms of power and area
ƒ Increase memory subsystem efficiency
– Increase data transaction size
– Increase number of concurrently outstanding transactions
⇒ Exploit larger fraction of chip I/O bandwidth
ƒ Use CMOS density scaling
– Exploit density instead of frequency scaling to deliver increased aggregate performance
ƒ Use compilers to extract parallelism from application
– Exploit application parallelism to translate aggregate performance to application performance
Exploit application-level parallelism
ƒ Data-level parallelism
– SIMD parallelism improves performance with little overhead
ƒ Thread-level parallelism
– Exploit application threads with multi-core design approach
ƒ Memory-level parallelism
– Improve memory access efficiency by increasing number of
parallel memory transactions
ƒ Compute-transfer parallelism
– Transfer data in parallel to execution by exploiting application
knowledge
Concept phase ideas
[Concept diagram — Chip Multiprocessor: efficient use of memory interface; large architected register set; high frequency and low power; reverse Vdd scaling for low power; modular design / design reuse; control cost of coherency]
Cell Architecture
ƒ Heterogeneous multi-core system architecture
– Power Processor Element for control tasks
– Synergistic Processor Elements for data-intensive processing
ƒ Synergistic Processor Element (SPE) consists of
– Synergistic Processor Unit (SPU)
– Synergistic Memory Flow Control (SMF)
[Block diagram: one PPE (64-bit Power Architecture with VMX; PPU with PXU, L1, and L2 at 32B/cycle) and eight SPEs (each an SPU with SXU, local store LS, and SMF) on the EIB (up to 96B/cycle, 16B/cycle per port); MIC to dual XDR™ memory and BIC to FlexIO™ at 16B/cycle (2x)]
Shifting the Balance of Power with Cell Broadband Engine
ƒ Data processor instead of control system
– Control-centric code stable over time
– Big growth in data processing needs
• Modeling
• Games
• Digital media
• Scientific applications
ƒ Today’s architectures are built on a 40 year old data model
– Efficiency as defined in 1964
– Big overhead per data operation
– Parallelism added as an after-thought
Powering Cell – the Synergistic Processor Unit
[Block diagram: SPU with vector register file (VRF), permute unit (PERM), load/store unit (LSU), vector floating-point unit (VFPU), vector fixed-point unit (VFXU), single-port local store SRAM, instruction fetch with instruction line buffer (ILB), and issue/branch logic; 2-instruction issue, 16B/64B/128B datapaths to the local store and SMF. Source: Gschwind et al., Hot Chips 17, 2005]
Density Computing in SPEs
ƒ Today, execution units only fraction of core area and power
– Bigger fraction goes to other functions
• Address translation and privilege levels
• Instruction reordering
• Register renaming
• Cache hierarchy
ƒ Cell changes this ratio to increase performance per area and
power
– Architectural focus on data processing
• Wide datapaths
• More and wide architectural registers
• Data privatization and single level processor-local store
• All code executes in a single (user) privilege level
• Static scheduling
Streamlined Architecture
ƒ Architectural focus on simplicity
– Aids achievable operating frequency
– Optimize circuits for common performance case
– Compiler aids in layering traditional hardware functions
– Leverage 20 years of architecture research
ƒ Focus on statically scheduled data parallelism
– Focus on data parallel instructions
• No separate scalar execution units
• Scalar operations mapped onto data parallel dataflow
– Exploit wide data paths
• data processing
• instruction fetch
– Address impediments to static scheduling
• Large register set
• Reduce latencies by eliminating non-essential functionality
SPE Highlights
ƒ RISC organization
– 32 bit fixed instructions
– Load/store architecture
– Unified register file
ƒ User-mode architecture
– No page translation within SPU
ƒ SIMD dataflow
– Broad set of operations (8, 16, 32, 64 Bit)
– Graphics SP-Float
– IEEE DP-Float
ƒ 256KB Local Store
– Combined I & D
ƒ DMA block transfer
– using Power Architecture memory
translation
[SPE die plot: local store arrays (LS), GPR file, even and odd fixed-point units (FXU EVN/ODD), single-precision (SFP) and double-precision (DP) floating point, control, channel, DMA, SMM, ATO, SBI, RTB, and forwarding macros; 14.5 mm² in 90nm SOI. Source: Kahle, Spring Processor Forum 2005]
Synergistic Processing
[Concept diagram linking the design choices: data-parallel select, static prediction, simpler μarch, sequential fetch, local store, high frequency, shorter pipeline, large basic blocks, deterministic latency, large register file, instruction bundling, DLP, wide data paths, compute density, single-port local store, static scheduling, scalar layering, ILP optimization]
Efficient data-sharing between scalar and SIMD processing
ƒ Legacy architectures separate scalar and SIMD
processing
– Data sharing between SIMD and scalar processing units
expensive
• Transfer penalty between register files
– Defeats data-parallel performance improvement in many
scenarios
ƒ Unified register file facilitates data sharing for
efficient exploitation of data parallelism
– Allow exploitation of data parallelism without data transfer
penalty
• Data-parallelism always an improvement
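Below is a minimal sketch of what scalar layering on the unified SIMD register file looks like in SPU C, using the vector types and spu_* intrinsics of the SPU language extensions; the function and variable names are illustrative, not taken from the talk.

```c
#include <spu_intrinsics.h>

/* Illustrative only: multiply a float array (already in local store) by a
 * scalar coefficient and accumulate a sum.  The scalar never leaves the
 * unified 128-bit register file: it is replicated with spu_splats() and the
 * final reduction is read back with spu_extract().  Assumes n is a multiple
 * of 4 and a[] is 16-byte aligned.                                          */
float scale_and_sum(float *a, int n, float coeff)
{
    vector float vcoeff = spu_splats(coeff);   /* scalar -> all 4 slots      */
    vector float vsum   = spu_splats(0.0f);
    vector float *va    = (vector float *)a;

    for (int i = 0; i < n / 4; i++)
        vsum = spu_madd(va[i], vcoeff, vsum);  /* fused multiply-add         */

    /* horizontal reduction of the four partial sums */
    return spu_extract(vsum, 0) + spu_extract(vsum, 1)
         + spu_extract(vsum, 2) + spu_extract(vsum, 3);
}
```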
SPU Pipeline
[Pipeline diagram — front end: IF1–IF5 (instruction fetch), IB1–IB2 (instruction buffer), ID1–ID3 (instruction decode), IS1–IS2 (instruction issue), RF1–RF2 (register file access); back end: EX1…EXn execution stages followed by WB (write back), with execution depth differing for fixed-point, permute, load/store, floating-point, and branch instructions]
SPE Block Diagram
[Block diagram: register file with result forwarding and staging feeding the permute, load/store, floating-point, fixed-point, branch, and channel units; single-port SRAM local store (256kB, 128B read / 128B write); instruction issue unit / instruction line buffer; DMA unit to the on-chip coherent bus; datapaths of 8, 16, 64, and 128 bytes per cycle]
Synergistic Memory Flow Control
[Block diagram: SMF containing the DMA engine and DMA queue, MMU with RMT, atomic facility, MMIO interface, load/store translation, and bus interface control, connecting the SPU (SXU + LS) to the data, snoop, and control buses]
ƒ SMF implements memory
management and mapping
ƒ SMF operates in parallel to SPU
– Independent compute and transfer
– Command interface from SPU
• DMA queue decouples SMF & SPU
• MMIO-interface for remote nodes
ƒ Block transfer between system
memory and local store
ƒ SPE programs reference system
memory using user-level
effective address space
– Ease of data sharing
– Local store to local store transfers
– Protection
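A minimal sketch of the SPU-side command interface follows, using SDK-style MFC calls from <spu_mfcio.h>; the effective address, block size, and tag choice are illustrative assumptions, not code from the talk.

```c
#include <spu_mfcio.h>

#define BLOCK 16384                       /* one maximum-size DMA transfer   */
static char buf[BLOCK] __attribute__((aligned(128)));

/* Illustrative only: pull one block from system memory (effective address
 * "ea") into local store, compute on it, and push the result back.  Tag 0
 * identifies the transfers; completion is a simple blocking wait here.     */
void process_block(unsigned long long ea)
{
    const unsigned int tag = 0;

    mfc_get(buf, ea, BLOCK, tag, 0, 0);   /* system memory -> local store    */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();            /* wait for the DMA to complete    */

    /* ... compute on buf entirely out of local store ... */

    mfc_put(buf, ea, BLOCK, tag, 0, 0);   /* local store -> system memory    */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}
```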
Synergistic Memory Flow Control
ƒ Bus Interface
– Up to 16 outstanding DMA
requests
– Requests up to 16KByte
– Token-based Bus Access
Management
Source: Clark et al., Hot Chips 17, 2005
ƒ Element Interconnect Bus
– Four 16 byte data rings
– Multiple concurrent transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
– 200+ GByte/s @ 3.2+ GHz
[Diagram: Element Interconnect Bus — four 16B data rings with a central data arbiter connecting SPE0–SPE7, the PPE, the MIC, BIF/IOIF0, and IOIF1]
Memory efficiency as key to application performance
ƒ Greatest challenge in translating peak into app performance
– Peak Flops useless without way to feed data
ƒ Cache miss provides too little data too late
– Inefficient for streaming / bulk data processing
– Initiates transfer when transfer results are already needed
– Application-controlled data fetch avoids not-on-time data delivery
ƒ SMF is a better way to look at an old problem
– Fetch data blocks based on algorithm
– Blocks reflect application data structures
Traditional memory usage
ƒ Long latency memory access operations exposed
ƒ Cannot overlap 500+ cycles with computation
ƒ Memory access latency severely impacts application performance
[Timeline legend: computation, memory protocol, memory idle, memory contention]
Exploiting memory-level parallelism
ƒ Reduce performance impact of memory accesses with concurrent
access
ƒ Carefully scheduled memory accesses in numeric code
ƒ Out-of-order execution increases chance to discover more
concurrent accesses
– Overlapping 500-cycle latency with computation using OoO is illusory
ƒ Bandwidth limited by queue size and roundtrip latency
Exploiting MLP with Chip Multiprocessing
ƒ More threads ⇒ more memory level parallelism
– overlap accesses from multiple cores
A new form of parallelism: CTP
ƒ Compute-transfer parallelism
– Concurrent execution of compute and transfer increases
efficiency
• Avoid costly and unnecessary serialization
– Application thread has two threads of control
• SPU → computation thread
• SMF → transfer thread
ƒ Optimize memory access and data transfer at the
application level
– Exploit programmer and application knowledge about
data access patterns
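A minimal double-buffering sketch of compute-transfer parallelism, under the same assumptions as the earlier <spu_mfcio.h> example (block size, tag usage, and compute() are illustrative): the SMF streams block i+1 while the SPU computes on block i.

```c
#include <spu_mfcio.h>

#define BLOCK 16384
static char buf[2][BLOCK] __attribute__((aligned(128)));

extern void compute(char *data, int len);     /* stands in for the app kernel */

/* Illustrative only: walk nblocks contiguous blocks starting at effective
 * address "ea", overlapping the DMA for the next block with computation on
 * the current one.  The buffer index doubles as the DMA tag (0 or 1).       */
void stream_blocks(unsigned long long ea, int nblocks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, BLOCK, cur, 0, 0);            /* prefetch block 0    */

    for (int i = 0; i < nblocks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nblocks)                            /* start next transfer */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * BLOCK,
                    BLOCK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);                   /* wait for current    */
        mfc_read_tag_status_all();
        compute(buf[cur], BLOCK);                       /* overlaps next DMA   */

        cur = next;
    }
}
```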
Memory access in Cell with MLP and CTP
ƒ Super-linear performance gains observed
– decouple data fetch and use
Synergistic Memory Management
[Concept diagram linking the design choices: multi-core, many outstanding transfers, sequential fetch, high throughput, timely data delivery, shared I & D, reduced miss complexity, parallel compute & transfer, improved code generation, decoupled fetch and access, explicit data transfer, storage density, single-port local store, application control, reduced coherence cost, low-latency access, block fetch, deeper fetch queue]
Cell BE Implementation Characteristics
ƒ 241M transistors
ƒ 235 mm²
ƒ Design operates across wide
frequency range
– Optimize for power & yield
ƒ > 200 GFlops (SP) @3.2GHz
ƒ > 20 GFlops (DP) @3.2GHz
ƒ Up to 25.6 GB/s memory bandwidth
ƒ Up to 75 GB/s I/O bandwidth
ƒ 100+ simultaneous bus
transactions
– 16+8 entry DMA queue per SPE
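The >200 GFLOPS single-precision figure is consistent with simple peak arithmetic; as a back-of-envelope check (assuming one 4-wide single-precision fused multiply-add per SPU per cycle, not counting the PPE's VMX unit):

```latex
8\ \text{SPEs} \times 4\ \text{SP lanes} \times 2\ \tfrac{\text{flops}}{\text{FMA}} \times 3.2\ \text{GHz} = 204.8\ \text{GFLOP/s}
```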
[Chart: relative frequency increase vs. power consumption as a function of supply voltage. Source: Kahle, Spring Processor Forum 2005]
Cell Broadband Engine
System configurations
[Diagrams: a single Cell BE processor with dual XDR™ memory and IOIF0/IOIF1; two Cell BE processors coupled directly over the BIF; four Cell BE processors connected through a switch on the BIF]
ƒ Game console systems
ƒ Blades
ƒ HDTV
ƒ Home media servers
ƒ Supercomputers
Compiling for the Cell Broadband Engine
ƒ The lesson of “RISC computing”
– Architecture provides fast, streamlined primitives to compiler
– Compiler uses primitives to implement higher-level idioms
– If the compiler can't target it ⇒ do not include in architecture
ƒ Compiler focus throughout project
– Prototype compiler soon after first proposal
– Cell compiler team has made significant advances in
• Automatic SIMD code generation
• Automatic parallelization
• Data privatization
[Diagram: the Cell design balances programmability, programmer productivity, and raw hardware performance]
Single source automatic parallelization
ƒ Single source compiler
– Address legacy code and programmer productivity
– Partition a single source program into SPE and PPE functions
ƒ Automatic parallelization
– Across multiple threads
– Vectorization to exploit SIMD units
ƒ Compiler managed local store
– Data and code blocking and tiling
– “Software cache” managed by compiler
ƒ Range of programmer input
– Fully automatic
– Programmer-guided
• pragmas, intrinsics, OpenMP directives
ƒ Prototype developed in IBM Research
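As a minimal sketch of the programmer-guided end of that range, here is an ordinary single-source loop annotated with an OpenMP directive of the kind such a compiler accepts; this is illustrative code, not the IBM prototype's actual input.

```c
#include <omp.h>

/* Illustrative only: one source file, no explicit SPE code or DMA.  A
 * Cell-aware single-source compiler can outline the loop body to SPE
 * threads, tile x[] and y[] through local store, and SIMDize the body.  */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```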
Compiler-driven performance
ƒ Code partitioning
– Thread-level parallelism
– Handle local store size
– “Outlining” into SPE overlays
ƒ Data partitioning
– Data access sequencing
– Double buffering
– “Software cache”
ƒ Data parallelization
– SIMD vectorization
– Handling of data with unknown alignment
Source: Eichenberger et al., PACT 2005
Raising the bar with parallelism…
ƒ Data-parallelism and static ILP in both core types
– Results in low overhead per operation
ƒ Multithreaded programming is key to great Cell BE performance
– Exploit application parallelism with 9 cores
– Regardless of whether code exploits DLP & ILP
– Challenge regardless of homogeneity/heterogeneity
ƒ Leverage parallelism between data processing and data transfer
– A new level of parallelism exploiting bulk data transfer
– Simultaneous processing on SPUs and data transfer on SMFs
– Offers superlinear gains beyond MIPS-scaling
… while understanding the trade-offs
ƒ Uniprocessor efficiency is actually low
– Gelsinger’s Law captures historic (performance) efficiency
• 1.4x performance for 2x transistors
– Marginal uni-processor (performance) efficiency is 40% (or lower!)
• And power efficiency is even worse
ƒ The “true” bar is marginal uniprocessor efficiency
– A multiprocessor “only” has to beat a uniprocessor to be the better
solution
– Many low-hanging fruit to be picked in multithreading applications
• Embarrassing application parallelism which has not been exploited
Cell: a Synergistic System Architecture
ƒ Cell is not a collection of different processors, but a synergistic whole
– Operation paradigms, data formats and semantics consistent
– Share address translation and memory protection model
ƒ SPE optimized for efficient data processing
– SPEs share Cell system functions provided by Power Architecture
– SMF implements interface to memory
• Copy in/copy out to local storage
ƒ Power Architecture provides system functions
– Virtualization
– Address translation and protection
– External exception handling
ƒ Element Interconnect Bus integrates system as data transport hub
A new wave of innovation
ƒ CMPs everywhere
– same ISA CMPs
– different ISA CMPs
– homogeneous CMP
– heterogeneous CMP
– thread-rich CMPs
– GPUs as compute engine
ƒ Compilers and software
must and will follow…
ƒ Novel systems
– Power4
– BOA
– Piranha
– Cyclops
– Niagara
– BlueGene
– Power5
– Xbox360
– Cell BE
Conclusion
ƒ Established performance drivers are running out of steam
ƒ Need new approach to build high performance system
– Application and system-centric optimization
– Chip multiprocessing
– Understand and manage memory access
– Need compilers to orchestrate data and instruction flow
ƒ Need new push on understanding workload behavior
ƒ Need renewed focus on compilation technology
ƒ An exciting journey is ahead of us!
© Copyright International Business Machines Corporation 2005,6.
All Rights Reserved. Printed in the United States May 2006.
The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
IBM IBM Logo Power Architecture
Other company, product and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document are
NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result
in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change
IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity
under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific
environments, and is presented as an illustration. The results obtained in other operating environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied
upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable
for damages arising directly or indirectly from any use of the information contained in this document.
Contenu connexe

Similaire à Chip Multiprocessing and the Cell Broadband Engine.pdf

IBM Power9 Features and Specifications
IBM Power9 Features and SpecificationsIBM Power9 Features and Specifications
IBM Power9 Features and Specificationsinside-BigData.com
 
Chapter 1.pptx
Chapter 1.pptxChapter 1.pptx
Chapter 1.pptxclaudio48
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 
SISTec Microelectronics VLSI design
SISTec Microelectronics VLSI designSISTec Microelectronics VLSI design
SISTec Microelectronics VLSI designDr. Ravi Mishra
 
FUNDAMENTALS OF COMPUTER DESIGN
FUNDAMENTALS OF COMPUTER DESIGNFUNDAMENTALS OF COMPUTER DESIGN
FUNDAMENTALS OF COMPUTER DESIGNvenkatraman227
 
Semiconductor overview
Semiconductor overviewSemiconductor overview
Semiconductor overviewNabil Chouba
 
Low-Power Design and Verification
Low-Power Design and VerificationLow-Power Design and Verification
Low-Power Design and VerificationDVClub
 
Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architectureSynergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architectureMichael Gschwind
 
VLSI unit 1 Technology - S.ppt
VLSI unit 1 Technology - S.pptVLSI unit 1 Technology - S.ppt
VLSI unit 1 Technology - S.pptindrajeetPatel22
 
Hardware and Software Architectures for the CELL BROADBAND ENGINE processor
Hardware and Software Architectures for the CELL BROADBAND ENGINE processorHardware and Software Architectures for the CELL BROADBAND ENGINE processor
Hardware and Software Architectures for the CELL BROADBAND ENGINE processorSlide_N
 
287233027-Chapter-1-Fundamentals-of-Computer-Design-ppt.ppt
287233027-Chapter-1-Fundamentals-of-Computer-Design-ppt.ppt287233027-Chapter-1-Fundamentals-of-Computer-Design-ppt.ppt
287233027-Chapter-1-Fundamentals-of-Computer-Design-ppt.pptDrUrvashiBansal
 
Cell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupCell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupSlide_N
 
Hardware architecture of Summit Supercomputer
 Hardware architecture of Summit Supercomputer Hardware architecture of Summit Supercomputer
Hardware architecture of Summit SupercomputerVigneshwarRamaswamy
 
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class...
Empirically Derived Abstractions in Uncore Power Modeling for a  Server-Class...Empirically Derived Abstractions in Uncore Power Modeling for a  Server-Class...
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class...Arun Joseph
 
Deview 2013 rise of the wimpy machines - john mao
Deview 2013   rise of the wimpy machines - john maoDeview 2013   rise of the wimpy machines - john mao
Deview 2013 rise of the wimpy machines - john maoNAVER D2
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaFacultad de Informática UCM
 

Similaire à Chip Multiprocessing and the Cell Broadband Engine.pdf (20)

IBM Power9 Features and Specifications
IBM Power9 Features and SpecificationsIBM Power9 Features and Specifications
IBM Power9 Features and Specifications
 
Chapter 1.pptx
Chapter 1.pptxChapter 1.pptx
Chapter 1.pptx
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
Hipeac 2018 keynote Talk
Hipeac 2018 keynote TalkHipeac 2018 keynote Talk
Hipeac 2018 keynote Talk
 
SISTec Microelectronics VLSI design
SISTec Microelectronics VLSI designSISTec Microelectronics VLSI design
SISTec Microelectronics VLSI design
 
FUNDAMENTALS OF COMPUTER DESIGN
FUNDAMENTALS OF COMPUTER DESIGNFUNDAMENTALS OF COMPUTER DESIGN
FUNDAMENTALS OF COMPUTER DESIGN
 
Greendroid ppt
Greendroid pptGreendroid ppt
Greendroid ppt
 
Semiconductor overview
Semiconductor overviewSemiconductor overview
Semiconductor overview
 
Low-Power Design and Verification
Low-Power Design and VerificationLow-Power Design and Verification
Low-Power Design and Verification
 
Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architectureSynergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architecture
 
Sushant
SushantSushant
Sushant
 
VLSI unit 1 Technology - S.ppt
VLSI unit 1 Technology - S.pptVLSI unit 1 Technology - S.ppt
VLSI unit 1 Technology - S.ppt
 
Hardware and Software Architectures for the CELL BROADBAND ENGINE processor
Hardware and Software Architectures for the CELL BROADBAND ENGINE processorHardware and Software Architectures for the CELL BROADBAND ENGINE processor
Hardware and Software Architectures for the CELL BROADBAND ENGINE processor
 
287233027-Chapter-1-Fundamentals-of-Computer-Design-ppt.ppt
287233027-Chapter-1-Fundamentals-of-Computer-Design-ppt.ppt287233027-Chapter-1-Fundamentals-of-Computer-Design-ppt.ppt
287233027-Chapter-1-Fundamentals-of-Computer-Design-ppt.ppt
 
Cell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupCell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology Group
 
Hardware architecture of Summit Supercomputer
 Hardware architecture of Summit Supercomputer Hardware architecture of Summit Supercomputer
Hardware architecture of Summit Supercomputer
 
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class...
Empirically Derived Abstractions in Uncore Power Modeling for a  Server-Class...Empirically Derived Abstractions in Uncore Power Modeling for a  Server-Class...
Empirically Derived Abstractions in Uncore Power Modeling for a Server-Class...
 
Deview 2013 rise of the wimpy machines - john mao
Deview 2013   rise of the wimpy machines - john maoDeview 2013   rise of the wimpy machines - john mao
Deview 2013 rise of the wimpy machines - john mao
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 

Plus de Slide_N

SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...Slide_N
 
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfParallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfSlide_N
 
Experiences with PlayStation VR - Sony Interactive Entertainment
Experiences with PlayStation VR  - Sony Interactive EntertainmentExperiences with PlayStation VR  - Sony Interactive Entertainment
Experiences with PlayStation VR - Sony Interactive EntertainmentSlide_N
 
SPU-based Deferred Shading for Battlefield 3 on Playstation 3
SPU-based Deferred Shading for Battlefield 3 on Playstation 3SPU-based Deferred Shading for Battlefield 3 on Playstation 3
SPU-based Deferred Shading for Battlefield 3 on Playstation 3Slide_N
 
Filtering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-AliasingFiltering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-AliasingSlide_N
 
New Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiNew Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiSlide_N
 
Sony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSlide_N
 
Sony Transformation 60
Sony Transformation 60 Sony Transformation 60
Sony Transformation 60 Slide_N
 
Moving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomMoving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomSlide_N
 
The Technology behind PlayStation 2
The Technology behind PlayStation 2The Technology behind PlayStation 2
The Technology behind PlayStation 2Slide_N
 
Cell Technology for Graphics and Visualization
Cell Technology for Graphics and VisualizationCell Technology for Graphics and Visualization
Cell Technology for Graphics and VisualizationSlide_N
 
Industry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignIndustry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignSlide_N
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotSlide_N
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: TheorySlide_N
 
Network Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMNetwork Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMSlide_N
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Slide_N
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionSlide_N
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerSlide_N
 
The Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesThe Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesSlide_N
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Slide_N
 

Plus de Slide_N (20)

SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
 
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfParallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
 
Experiences with PlayStation VR - Sony Interactive Entertainment
Experiences with PlayStation VR  - Sony Interactive EntertainmentExperiences with PlayStation VR  - Sony Interactive Entertainment
Experiences with PlayStation VR - Sony Interactive Entertainment
 
SPU-based Deferred Shading for Battlefield 3 on Playstation 3
SPU-based Deferred Shading for Battlefield 3 on Playstation 3SPU-based Deferred Shading for Battlefield 3 on Playstation 3
SPU-based Deferred Shading for Battlefield 3 on Playstation 3
 
Filtering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-AliasingFiltering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-Aliasing
 
New Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiNew Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - Kutaragi
 
Sony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSony Transformation 60 - Kutaragi
Sony Transformation 60 - Kutaragi
 
Sony Transformation 60
Sony Transformation 60 Sony Transformation 60
Sony Transformation 60
 
Moving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomMoving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living Room
 
The Technology behind PlayStation 2
The Technology behind PlayStation 2The Technology behind PlayStation 2
The Technology behind PlayStation 2
 
Cell Technology for Graphics and Visualization
Cell Technology for Graphics and VisualizationCell Technology for Graphics and Visualization
Cell Technology for Graphics and Visualization
 
Industry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignIndustry Trends in Microprocessor Design
Industry Trends in Microprocessor Design
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: Theory
 
Network Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMNetwork Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTM
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of Destruction
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM Power
 
The Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesThe Visual Computing Revolution Continues
The Visual Computing Revolution Continues
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
 

Dernier

Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?Mark Billinghurst
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandIES VE
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideStefan Dietze
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTopCSSGallery
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024Stephen Perrenod
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfFIDO Alliance
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctBrainSell Technologies
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 

Dernier (20)

Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development Companies
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 

Chip Multiprocessing and the Cell Broadband Engine.pdf

  • 1. Cell Broadband Engine © 2006 IBM Corporation Chip Multiprocessing and the Cell Broadband Engine Michael Gschwind ACM Computing Frontiers 2006
  • 2. Cell Broadband Engine © 2006 IBM Corporation 2 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Computer Architecture at the turn of the millenium ƒ “The end of architecture” was being proclaimed – Frequency scaling as performance driver – State of the art microprocessors •multiple instruction issue •out of order architecture •register renaming •deep pipelines ƒ Little or no focus on compilers – Questions about the need for compiler research ƒ Academic papers focused on microarchitecture tweaks
  • 3. Cell Broadband Engine © 2006 IBM Corporation 3 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 The age of frequency scaling SCALING: Voltage: V/α Oxide: tox /α Wire width: W/α Gate width: L/α Diffusion: xd /α Substrate: α * NA RESULTS: Higher Density: ~α2 Higher Speed: ~α Power/ckt: ~1/α2 Power Density: ~Constant p substrate, doping α*NA Scaled Device L/α xd/α GATE n+ source n+ drain WIRING Voltage, V / α W/α tox/α Source: Dennard et al., JSSC 1974. 0.5 0.6 0.7 0.8 0.9 1 7 10 13 16 19 22 25 28 31 34 37 Total FO4 Per Stage Performance bips deeper pipeline
  • 4. Cell Broadband Engine © 2006 IBM Corporation 4 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Frequency scaling ƒ A trusted standby – Frequency as the tide that raises all boats – Increased performance across all applications – Kept the industry going for a decade ƒ Massive investment to keep going – Equipment cost – Power increase – Manufacturing variability – New materials
  • 5. Cell Broadband Engine © 2006 IBM Corporation 5 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 … but reality was closing in! 0 0.2 0.4 0.6 0.8 1 7 10 13 16 19 22 25 28 31 34 37 Total FO4 Per Stage Relative to Optimal FO4 bips bips^3/W 0.001 0.01 0.1 1 10 100 1000 0.01 0.1 1 Active Power Leakage Power
  • 6. Cell Broadband Engine © 2006 IBM Corporation 6 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Power crisis 0.1 1 10 100 1000 0.01 0.1 1 gate length Lgate (µm) classic scaling Tox(Å) Vdd Vt ƒ The power crisis is not “natural” – Created by deviating from ideal scaling theory – Vdd and Vt not scaled by α • additional performance with increased voltage ƒ Laws of physics – Pchip = ntransistors * Ptransistor ƒ Marginal performance gain per transistor low – significant power increase – decreased power/performance efficiency
  • 7. Cell Broadband Engine © 2006 IBM Corporation 7 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 The power inefficiency of deep pipelining 0 0.2 0.4 0.6 0.8 1 7 10 13 16 19 22 25 28 31 34 37 Total FO4 Per Stage Relative to Optimal FO4 bips bips^3/W ƒ Deep pipelining increases number of latches and switching rate ⇒ power increases with at least f2 ƒ Latch insertion delay limits gains of pipelining Power-performance optimal Performance optimal Source: Srinivasan et al., MICRO 2002 deeper pipeline
  • 8. Cell Broadband Engine © 2006 IBM Corporation 8 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Hitting the memory wall ƒ Latency gap – Memory speedup lags behind processor speedup – limits ILP ƒ Chip I/O bandwidth gap – Less bandwidth per MIPS ƒ Latency gap as application bandwidth gap – Typically (much) less than chip I/O bandwidth -60 -40 -20 0 20 40 60 80 100 1990 2003 normalized MFLOPS Memory latency Source: McKee, Computing Frontiers 2004 Source: Burger et al., ISCA 1996 usable bandwidth avg. request size roundtrip latency no. of inflight requests
  • 9. Cell Broadband Engine © 2006 IBM Corporation 9 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Our Y2K challenge ƒ 10× performance of desktop systems ƒ 1 TeraFlop / second with a four-node configuration ƒ 1 Byte bandwidth per 1 Flop – “golden rule for balanced supercomputer design” ƒ scalable design across a range of design points ƒ mass-produced and low cost
  • 10. Cell Broadband Engine © 2006 IBM Corporation 10 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Cell Design Goals ƒ Provide the platform for the future of computing – 10× performance of desktop systems shipping in 2005 ƒ Computing density as main challenge –Dramatically increase performance per X • X = Area, Power, Volume, Cost,… ƒ Single core designs offer diminishing returns on investment – In power, area, design complexity and verification cost
  • 11. Cell Broadband Engine © 2006 IBM Corporation 11 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Necessity as the mother of invention ƒ Increase power-performance efficiency –Simple designs are more efficient in terms of power and area ƒ Increase memory subsystem efficiency –Increasing data transaction size –Increase number of concurrently outstanding transactions ⇒ Exploit larger fraction of chip I/O bandwidth ƒ Use CMOS density scaling –Exploit density instead of frequency scaling to deliver increased aggregate performance ƒ Use compilers to extract parallelism from application –Exploit application parallelism to translate aggregate performance to application performance
  • 12. Cell Broadband Engine © 2006 IBM Corporation 12 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Exploit application-level parallelism ƒ Data-level parallelism – SIMD parallelism improves performance with little overhead ƒ Thread-level parallelism – Exploit application threads with multi-core design approach ƒ Memory-level parallelism – Improve memory access efficiency by increasing number of parallel memory transactions ƒ Compute-transfer parallelism – Transfer data in parallel to execution by exploiting application knowledge
  • 13. Cell Broadband Engine © 2006 IBM Corporation 13 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Concept phase ideas Chip Multiprocessor efficient use of memory interface large architected register set high frequency and low power! reverse Vdd scaling for low power modular design design reuse control cost of coherency
  • 14. Cell Broadband Engine © 2006 IBM Corporation 14 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Cell Architecture ƒ Heterogeneous multi- core system architecture – Power Processor Element for control tasks – Synergistic Processor Elements for data- intensive processing ƒ Synergistic Processor Element (SPE) consists of – Synergistic Processor Unit (SPU) – Synergistic Memory Flow Control (SMF) 16B/cycle (2x) 16B/cycle BIC FlexIOTM MIC Dual XDRTM 16B/cycle EIB (up to 96B/cycle) 16B/cycle 64-bit Power Architecture with VMX PPE SPE LS SXU SPU SMF PXU L1 PPU 16B/cycle L2 32B/cycle LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF
  • 15. Cell Broadband Engine © 2006 IBM Corporation 15 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Shifting the Balance of Power with Cell Broadband Engine ƒ Data processor instead of control system – Control-centric code stable over time – Big growth in data processing needs • Modeling • Games • Digital media • Scientific applications ƒ Today's architectures are built on a 40-year-old data model – Efficiency as defined in 1964 – Big overhead per data operation – Parallelism added as an afterthought
  • 16. Cell Broadband Engine © 2006 IBM Corporation 16 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Powering Cell – the Synergistic Processor Unit [SPU block diagram: fetch and instruction line buffer (ILB) feeding issue/branch; permute (PERM), load/store (LSU), vector floating-point (VFPU) and vector fixed-point (VFXU) units over a unified vector register file (VRF); single-port Local Store SRAM; SMF interface. Source: Gschwind et al., Hot Chips 17, 2005]
  • 17. Cell Broadband Engine © 2006 IBM Corporation 17 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Density Computing in SPEs ƒ Today, execution units only fraction of core area and power – Bigger fraction goes to other functions • Address translation and privilege levels • Instruction reordering • Register renaming • Cache hierarchy ƒ Cell changes this ratio to increase performance per area and power – Architectural focus on data processing • Wide datapaths • More and wide architectural registers • Data privatization and single level processor-local store • All code executes in a single (user) privilege level • Static scheduling
  • 18. Cell Broadband Engine © 2006 IBM Corporation 18 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Streamlined Architecture ƒ Architectural focus on simplicity – Aids achievable operating frequency – Optimize circuits for common performance case – Compiler aids in layering traditional hardware functions – Leverage 20 years of architecture research ƒ Focus on statically scheduled data parallelism – Focus on data parallel instructions • No separate scalar execution units • Scalar operations mapped onto data parallel dataflow – Exploit wide data paths • data processing • instruction fetch – Address impediments to static scheduling • Large register set • Reduce latencies by eliminating non-essential functionality
  • 19. Cell Broadband Engine © 2006 IBM Corporation 19 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 SPE Highlights ƒ RISC organization – 32-bit fixed-length instructions – Load/store architecture – Unified register file ƒ User-mode architecture – No page translation within SPU ƒ SIMD dataflow – Broad set of operations (8, 16, 32, 64 bit) – Graphics SP-Float – IEEE DP-Float ƒ 256KB Local Store – Combined I & D ƒ DMA block transfer – using Power Architecture memory translation [SPE die plot: LS arrays, GPR, FXU ODD, FXU EVN, SFP, DP, CONTROL, CHANNEL, DMA, SMM, ATO, SBI, RTB, FWD; 14.5mm² in 90nm SOI. Source: Kahle, Spring Processor Forum 2005]
  • 20. Cell Broadband Engine © 2006 IBM Corporation 20 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Synergistic Processing [Concept map: data-parallel select – static prediction – simpler μarch – sequential fetch – local store – high frequency – shorter pipeline – large basic blocks – deterministic latency – large register file – instruction bundling – DLP – wide data paths – compute density – single port – static scheduling – scalar layering – ILP optimization]
  • 21. Cell Broadband Engine © 2006 IBM Corporation 21 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Efficient data-sharing between scalar and SIMD processing ƒ Legacy architectures separate scalar and SIMD processing – Data sharing between SIMD and scalar processing units expensive • Transfer penalty between register files – Defeats data-parallel performance improvement in many scenarios ƒ Unified register file facilitates data sharing for efficient exploitation of data parallelism – Allow exploitation of data parallelism without data transfer penalty • Data-parallelism always an improvement
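  A sketch of what the unified register file buys in practice, again assuming the Cell SDK's SPU intrinsics; the dot-product routine below is illustrative, not from the talk. The SIMD partial sums and the final scalar reduction use the same 128-bit registers, so no data ever crosses between separate scalar and vector register files:

    #include <spu_intrinsics.h>

    float dot(const vector float *a, const vector float *b, int nvec)
    {
        vector float sum = spu_splats(0.0f);      /* all four lanes start at 0 */
        int i;
        for (i = 0; i < nvec; i++)
            sum = spu_madd(a[i], b[i], sum);      /* fused multiply-add across 4 lanes */
        /* scalar reduction directly out of the SIMD register - no transfer penalty */
        return spu_extract(sum, 0) + spu_extract(sum, 1)
             + spu_extract(sum, 2) + spu_extract(sum, 3);
    }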
  • 22. Cell Broadband Engine © 2006 IBM Corporation 22 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 SPU Pipeline [Pipeline diagram – front end: instruction fetch (IF1–IF5), instruction buffer (IB1–IB2), decode (ID1–ID3), issue (IS1–IS2), register file access (RF1–RF2); back end: execution stages EX1–EX6, whose number depends on instruction class (branch, load/store, permute, fixed point, floating point), followed by write back (WB)]
  • 23. Cell Broadband Engine © 2006 IBM Corporation 23 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 SPE Block Diagram [Block diagram: instruction issue unit / instruction line buffer feeding permute, load/store, floating-point, fixed-point, branch and channel units through result forwarding and staging over the register file; single-port Local Store (256kB) SRAM with 128B read and 128B write ports; DMA unit attached to the on-chip coherent bus; 8, 16, 64 and 128 Byte/cycle datapaths]
  • 24. Cell Broadband Engine © 2006 IBM Corporation 24 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Synergistic Memory Flow Control [SMF block diagram: DMA engine and DMA queue, MMU and RMT, atomic facility, bus interface control; data, snoop and control buses; Ld/St and MMIO command interfaces from the SPU] ƒ SMF implements memory management and mapping ƒ SMF operates in parallel to SPU – Independent compute and transfer – Command interface from SPU • DMA queue decouples SMF & SPU • MMIO interface for remote nodes ƒ Block transfer between system memory and local store ƒ SPE programs reference system memory using user-level effective address space – Ease of data sharing – Local store to local store transfers – Protection
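  A minimal sketch of the block-transfer interface from the SPU side, assuming the Cell SDK's MFC intrinsics in spu_mfcio.h; the buffer size, tag number and alignment are illustrative:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define TAG 1
    volatile char buf[16384] __attribute__((aligned(128)));   /* local store buffer */

    /* Pull one block from system memory (user-level effective address ea) into
     * the local store via the SMF, then wait for that tag group to complete. */
    void fetch_block(uint64_t ea, unsigned int size)
    {
        mfc_get((void *)buf, ea, size, TAG, 0, 0);   /* queue the DMA on the SMF */
        mfc_write_tag_mask(1 << TAG);                /* select which tag group to wait on */
        mfc_read_tag_status_all();                   /* block until the transfer is done */
    }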
  • 25. Cell Broadband Engine © 2006 IBM Corporation 25 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Synergistic Memory Flow Control ƒ Bus Interface – Up to 16 outstanding DMA requests – Requests up to 16KByte – Token-based bus access management ƒ Element Interconnect Bus – Four 16-byte data rings – Multiple concurrent transfers – 96B/cycle peak bandwidth – Over 100 outstanding requests – 200+ GByte/s @ 3.2+ GHz [EIB diagram: data arbiter and four 16B rings connecting SPE0–SPE7, PPE, MIC, BIF/IOIF0 and IOIF1. Source: Clark et al., Hot Chips 17, 2005]
  • 26. Cell Broadband Engine © 2006 IBM Corporation 26 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Memory efficiency as key to application performance ƒ Greatest challenge in translating peak into app performance – Peak Flops useless without way to feed data ƒ Cache miss provides too little data too late – Inefficient for streaming / bulk data processing – Initiates transfer when transfer results are already needed – Application-controlled data fetch avoids not-on-time data delivery ƒ SMF is a better way to look at an old problem – Fetch data blocks based on algorithm – Blocks reflect application data structures
  • 27. Cell Broadband Engine © 2006 IBM Corporation 27 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Traditional memory usage ƒ Long latency memory access operations exposed ƒ Cannot overlap 500+ cycles with computation ƒ Memory access latency severely impacts application performance [Timeline: computation serialized with memory protocol, memory idle and memory contention phases]
  • 28. Cell Broadband Engine © 2006 IBM Corporation 28 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Exploiting memory-level parallelism ƒ Reduce performance impact of memory accesses with concurrent accesses ƒ Carefully scheduled memory accesses in numeric code ƒ Out-of-order execution increases the chance to discover more concurrent accesses – Overlapping a 500-cycle latency with computation using OoO is illusory ƒ Bandwidth limited by queue size and roundtrip latency
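  One way to expose this memory-level parallelism explicitly on Cell is to queue several independent block fetches before waiting on any of them, so the memory subsystem works on many requests at once. A sketch under the same Cell SDK assumptions as above, with illustrative buffer count and sizes:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define NBUF 8
    volatile char buf[NBUF][4096] __attribute__((aligned(128)));

    void fetch_many(const uint64_t ea[NBUF])
    {
        int i;
        for (i = 0; i < NBUF; i++)
            mfc_get((void *)buf[i], ea[i], 4096, i, 0, 0);  /* one tag per request, all in flight */
        mfc_write_tag_mask((1 << NBUF) - 1);                /* wait on all eight tag groups at once */
        mfc_read_tag_status_all();
    }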
  • 29. Cell Broadband Engine © 2006 IBM Corporation 29 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Exploiting MLP with Chip Multiprocessing ƒ More threads ⇒ more memory-level parallelism – overlap accesses from multiple cores
  • 30. Cell Broadband Engine © 2006 IBM Corporation 30 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 A new form of parallelism: CTP ƒ Compute-transfer parallelism – Concurrent execution of compute and transfer increases efficiency • Avoid costly and unnecessary serialization – Application thread has two threads of control • SPU ⇒ computation thread • SMF ⇒ transfer thread ƒ Optimize memory access and data transfer at the application level – Exploit programmer and application knowledge about data access patterns
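  The canonical way to realize compute-transfer parallelism on an SPE is double buffering: while the SPU processes one block, the SMF streams in the next. A sketch under the same Cell SDK assumptions; process(), the contiguous block layout and the block size are placeholders for application code:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define BUFSZ 4096
    volatile char buf[2][BUFSZ] __attribute__((aligned(128)));

    extern void process(volatile char *block, unsigned int size);  /* application compute kernel */

    void stream(uint64_t ea, int nblocks)   /* ea: block-aligned effective address of the input */
    {
        int cur = 0, i;
        mfc_get((void *)buf[cur], ea, BUFSZ, cur, 0, 0);           /* prime the first block */
        for (i = 0; i < nblocks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nblocks)                                   /* start the next transfer early */
                mfc_get((void *)buf[next], ea + (uint64_t)(i + 1) * BUFSZ, BUFSZ, next, 0, 0);
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();                             /* wait only for the current block */
            process(buf[cur], BUFSZ);                              /* compute while the next block streams in */
            cur = next;
        }
    }

  Apart from the very first block, the SPU never waits for memory: the computation thread (SPU) and the transfer thread (SMF) run in parallel, which is the decoupling of data fetch and use referred to on the next slide.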
  • 31. Cell Broadband Engine © 2006 IBM Corporation 31 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Memory access in Cell with MLP and CTP ƒ Super-linear performance gains observed – decouple data fetch and use
  • 32. Cell Broadband Engine © 2006 IBM Corporation 32 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Synergistic Memory Management [Concept map: multi core – many outstanding transfers – sequential fetch – high throughput – timely data delivery – shared I & D – reduce miss complexity – parallel compute & transfer – improved code gen – decouple fetch and access – explicit data transfer – storage density – single port – application control – reduced coherence cost – low latency access – block fetch – deeper fetch queue – local store]
  • 33. Cell Broadband Engine © 2006 IBM Corporation 33 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Cell BE Implementation Characteristics ƒ 241M transistors ƒ 235mm² ƒ Design operates across wide frequency range – Optimize for power & yield ƒ > 200 GFlops (SP) @ 3.2GHz ƒ > 20 GFlops (DP) @ 3.2GHz ƒ Up to 25.6 GB/s memory bandwidth ƒ Up to 75 GB/s I/O bandwidth ƒ 100+ simultaneous bus transactions – 16+8 entry DMA queue per SPE [Chart: relative frequency increase vs. power consumption and voltage. Source: Kahle, Spring Processor Forum 2005]
  • 34. Cell Broadband Engine © 2006 IBM Corporation 34 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine
  • 35. Cell Broadband Engine © 2006 IBM Corporation 35 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 System configurations [Diagrams: a single Cell BE processor with dual XDR memory and IOIF; two processors coupled directly over the BIF; larger configurations with multiple processors connected through a switch] ƒ Game console systems ƒ Blades ƒ HDTV ƒ Home media servers ƒ Supercomputers
  • 36. Cell Broadband Engine © 2006 IBM Corporation 36 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Compiling for the Cell Broadband Engine ƒ The lesson of “RISC computing” – Architecture provides fast, streamlined primitives to the compiler – Compiler uses primitives to implement higher-level idioms – If the compiler can’t target it ⇒ do not include it in the architecture ƒ Compiler focus throughout the project – Prototype compiler soon after first proposal – Cell compiler team has made significant advances in • Automatic SIMD code generation • Automatic parallelization • Data privatization [Diagram: programmability – programmer productivity – raw hardware performance]
  • 37. Cell Broadband Engine © 2006 IBM Corporation 37 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Single source automatic parallelization ƒ Single source compiler – Address legacy code and programmer productivity – Partition a single source program into SPE and PPE functions ƒ Automatic parallelization – Across multiple threads – Vectorization to exploit SIMD units ƒ Compiler managed local store – Data and code blocking and tiling – “Software cache” managed by compiler ƒ Range of programmer input – Fully automatic – Programmer-guided • pragmas, intrinsics, OpenMP directives ƒ Prototype developed in IBM Research
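  As a sketch of the single-source model (not the prototype compiler's exact interface), the programmer writes ordinary C with an OpenMP directive, one of the programmer-guided inputs the slide lists, and leaves partitioning, SPE code generation, SIMDization and local-store management to the compiler:

    /* Ordinary single-source C. A single-source compiler of the kind described
     * above can distribute the iterations across SPE threads, SIMDize the body,
     * and stage x[] and y[] through the local store (blocking / software cache). */
    void scale(float *y, const float *x, float a, int n)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            y[i] = a * x[i];
    }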
  • 38. Cell Broadband Engine © 2006 IBM Corporation 38 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Compiler-driven performance ƒ Code partitioning – Thread-level parallelism – Handle local store size – “Outlining” into SPE overlays ƒ Data partitioning – Data access sequencing – Double buffering – “Software cache” ƒ Data parallelization – SIMD vectorization – Handling of data with unknown alignment Source: Eichenberger et al., PACT 2005
  • 39. Cell Broadband Engine © 2006 IBM Corporation 39 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Raising the bar with parallelism… ƒ Data-parallelism and static ILP in both core types – Results in low overhead per operation ƒ Multithreaded programming is key to great Cell BE performance – Exploit application parallelism with 9 cores – Regardless of whether code exploits DLP & ILP – Challenge regardless of homogeneity/heterogeneity ƒ Leverage parallelism between data processing and data transfer – A new level of parallelism exploiting bulk data transfer – Simultaneous processing on SPUs and data transfer on SMFs – Offers superlinear gains beyond MIPS-scaling
  • 40. Cell Broadband Engine © 2006 IBM Corporation 40 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 … while understanding the trade-offs ƒ Uniprocessor efficiency is actually low – Gelsinger’s Law captures historic (performance) efficiency • 1.4x performance for 2x transistors – Marginal uni-processor (performance) efficiency is 40% (or lower!) • doubling the transistor count yields only 0.4x additional performance, so the added transistors run at 40% of their proportional value • And power efficiency is even worse ƒ The “true” bar is marginal uniprocessor efficiency – A multiprocessor “only” has to beat a uniprocessor to be the better solution – Many low-hanging fruit to be picked in multithreading applications • Embarrassingly parallel application workloads that have not yet been exploited
  • 41. Cell Broadband Engine © 2006 IBM Corporation 41 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Cell: a Synergistic System Architecture ƒ Cell is not a collection of different processors, but a synergistic whole – Operation paradigms, data formats and semantics consistent – Share address translation and memory protection model ƒ SPE optimized for efficient data processing – SPEs share Cell system functions provided by Power Architecture – SMF implements interface to memory • Copy in/copy out to local storage ƒ Power Architecture provides system functions – Virtualization – Address translation and protection – External exception handling ƒ Element Interconnect Bus integrates system as data transport hub
  • 42. Cell Broadband Engine © 2006 IBM Corporation 42 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 A new wave of innovation ƒ CMPs everywhere – same ISA CMPs – different ISA CMPs – homogeneous CMP – heterogeneous CMP – thread-rich CMPs – GPUs as compute engine ƒ Compilers and software must and will follow… ƒ Novel systems – Power4 – BOA – Piranha – Cyclops – Niagara – BlueGene – Power5 – Xbox360 – Cell BE
  • 43. Cell Broadband Engine © 2006 IBM Corporation 43 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 Conclusion ƒ Established performance drivers are running out of steam ƒ Need new approach to build high performance system – Application and system-centric optimization – Chip multiprocessing – Understand and manage memory access – Need compilers to orchestrate data and instruction flow ƒ Need new push on understanding workload behavior ƒ Need renewed focus on compilation technology ƒ An exciting journey is ahead of us!
  • 44. Cell Broadband Engine © 2006 IBM Corporation 44 Chip Multiprocessing and the Cell Broadband Engine M. Gschwind, Computing Frontiers 2006 © Copyright International Business Machines Corporation 2005,6. All Rights Reserved. Printed in the United States May 2006. The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both. IBM IBM Logo Power Architecture Other company, product and service names may be trademarks or service marks of others. All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary. While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.