4. The “Software Crisis”
“To put it quite bluntly: as long as there were no
machines, programming was no problem at all; when
we had a few weak computers, programming became a
mild problem, and now we have gigantic computers,
programming has become an equally gigantic problem."
-- E. Dijkstra, 1972 Turing Award Lecture
Raul Goycoolea S., Multiprocessor Programming, 16 February 2012
5. The First Software Crisis
• Time Frame: ’60s and ’70s
• Problem: Assembly Language Programming
Computers could handle larger, more complex programs
• Needed to get Abstraction and Portability without
losing Performance
6. How Did We Solve the First Software Crisis?
• High-level languages for von Neumann machines
FORTRAN and C
• Provided a “common machine language” for uniprocessors
Common properties: single flow of control, single memory image
Differences: register file, ISA, functional units
7. The Second Software Crisis
• Time Frame: ’80s and ’90s
• Problem: Inability to build and maintain complex and
robust applications requiring multi-million lines of
code developed by hundreds of programmers
Computers could handle larger, more complex programs
• Needed to get Composability, Malleability and
Maintainability
High performance was not an issue; it was left to Moore’s Law
8. How Did We Solve the Second
Software Crisis?
• Object Oriented Programming
C++, C# and Java
• Also…
Better tools
• Component libraries, Purify
Better software engineering methodology
• Design patterns, specification, testing, code
reviews
9. Today:
Programmers are Oblivious to Processors
• Solid boundary between Hardware and Software
• Programmers don’t have to know anything about the
processor
High level languages abstract away the processors
Ex: Java bytecode is machine independent
Moore’s law does not require the programmers to know anything
about the processors to get good speedups
• Programs are oblivious of the processor and work on all
processors
A program written in the ’70s in C still works today, and is much faster
• This abstraction provides a lot of freedom for the
programmers
10. The Origins of a Third Crisis
• Time Frame: 2005 to 20??
• Problem: Sequential performance is left behind by
Moore’s law
• Needed continuous and reasonable performance
improvements
to support new features
to support larger datasets
• While sustaining portability, malleability and
maintainability without unduly increasing the complexity
faced by the programmer; this is critical to keeping up with
the current rate of evolution in software
11. The Road to Multicore: Moore’s Law
[Chart: uniprocessor performance (vs. VAX-11/780) and number of transistors, 1978–2016. Performance grew roughly 25%/year until the mid-1980s, then 52%/year, across the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2, while transistor counts climbed from 100,000 toward 1,000,000,000. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
12. The Road to Multicore: Uniprocessor Performance (SPECint)
[Chart: SPECint2000 performance, 1985–2007, for the Intel 386 through Pentium, Pentium 2/3/4 and Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, and AMD K6/K7/x86-64 families.]
13. The Road to Multicore:
Uniprocessor Performance (SPECint)
General-purpose unicores have stopped historic
performance scaling
Power consumption
Wire delays
DRAM access latency
Diminishing returns of more instruction-level parallelism
14. Power Consumption (watts)
[Chart: power consumption, 1985–2007, on a log scale from 1 to 1000 W, for the Intel 386 through Itanium, Alpha, Sparc, MIPS, HP PA, PowerPC, and AMD families.]
15. Power Efficiency (watts/spec)
[Chart: power efficiency (watts/spec), 1982–2006, ranging from 0 to 0.7, for the Intel 386 through Itanium, Alpha, Sparc, MIPS, HP PA, PowerPC, and AMD families.]
18. Power Density (W/cm²)
[Chart: power density of Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386 and 486 to the Pentium, 1970s–2010s, approaching that of a hot plate and trending toward a nuclear reactor, a rocket nozzle, and the Sun’s surface. Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W).]
• Cube relationship between cycle time and power in CPU architectures
• Heat is becoming an unmanageable problem
19. Diminishing Returns
• The ’80s: Superscalar expansion
50% per year improvement in performance
Transistors applied to implicit parallelism
- pipelined processor (10 CPI --> 1 CPI)
• The ’90s: The Era of Diminishing Returns
Squeezing out the last implicit parallelism
2-way to 6-way issue, out-of-order issue, branch prediction
1 CPI --> 0.5 CPI
Performance below expectations; projects delayed and canceled
• The ’00s: The Beginning of the Multicore Era
The need for Explicit Parallelism
20. The Multicore Era Arrives
• MIT Raw: 16 cores (2002)
• Intel Tanglewood: dual-core IA-64
• Intel Dempsey: dual-core Xeon
• Intel Montecito: dual-core IA-64, 1.7 billion transistors
• Intel Pentium D (Smithfield)
• Intel Tejas & Jayhawk: unicore (4 GHz P4), cancelled
• IBM Power 6: dual core
• IBM Power 4 and 5: dual cores since 2001
• Intel Pentium Extreme: 3.2 GHz dual core
• Intel Yonah: dual-core mobile
• AMD Opteron: dual core
• Sun Olympus and Niagara: 8 processor cores
• IBM Cell: scalable multicore
(Timeline spans 2H 2004 through 2H 2006.)
Unicores are headed for extinction; now everything is multicore.
24. What is Parallel Computing?
Traditionally, software has been written for serial computation:
• To be run on a single computer having a single Central Processing Unit (CPU)
• A problem is broken into a discrete series of instructions
• Instructions are executed one after another
• Only one instruction may execute at any moment in time
25. What is Parallel Computing?
In the simplest sense, parallel computing is the simultaneous use of multiple
compute resources to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken into discrete parts that can be solved concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different CPUs
26. Options in Parallel Computing?
The compute resources might be:
• A single computer with multiple processors;
• An arbitrary number of computers connected by a network;
• A combination of both.
The computational problem should be able to:
• Be broken apart into discrete pieces of work that can be solved
simultaneously;
• Execute multiple program instructions at any moment in time;
• Be solved in less time with multiple compute resources than with a
single compute resource.
28. The Real World is Massively Parallel
• Parallel computing is an evolution of serial computing that
attempts to emulate what has always been the state of
affairs in the natural world: many complex, interrelated
events happening at the same time, yet within a sequence.
For example:
• Galaxy formation
• Planetary movement
• Weather and ocean patterns
• Tectonic plate drift
• Rush hour traffic
• Automobile assembly line
• Building a jet
• Ordering a hamburger at the drive-through
29. Architecture Concepts
Von Neumann Architecture
• Named after the Hungarian mathematician John von Neumann who first authored
the general requirements for an electronic computer in his 1945 papers
• Since then, virtually all computers have followed this basic design, differing from
earlier computers which were programmed through "hard wiring”
• Comprised of four main components:
• Memory
• Control Unit
• Arithmetic Logic Unit
• Input/Output
• Read/write, random access memory is used to store
both program instructions and data
• Program instructions are coded data which tell
the computer to do something
• Data is simply information to be used by the
program
• Control unit fetches instructions/data from memory, decodes
the instructions and then sequentially coordinates operations
to accomplish the programmed task.
• Arithmetic Logic Unit performs basic arithmetic operations
• Input/Output is the interface to the human operator
30. Flynn’s Taxonomy
• There are different ways to classify parallel computers. One of the more
widely used classifications, in use since 1966, is called Flynn's
Taxonomy.
• Flynn's taxonomy distinguishes multi-processor computer architectures
according to how they can be classified along the two independent
dimensions of Instruction and Data. Each of these dimensions can
have only one of two possible states: Single or Multiple.
• The matrix below defines the 4 possible classifications according to
Flynn:

                         Single Data    Multiple Data
  Single Instruction     SISD           SIMD
  Multiple Instruction   MISD           MIMD
31. Single Instruction, Single Data (SISD):
• A serial (non-parallel) computer
• Single Instruction: Only one instruction stream is
being acted on by the CPU during any one clock
cycle
• Single Data: Only one data stream is being used
as input during any one clock cycle
• Deterministic execution
• This is the oldest type of computer and, even today,
the most common
• Examples: older generation mainframes,
minicomputers and workstations; most modern
day PCs.
33. Single Instruction, Multiple Data
(SIMD):
• A type of parallel computer
• Single Instruction: All processing units execute the same instruction at any
given clock cycle
• Multiple Data: Each processing unit can operate on a different data element
• Best suited for specialized problems characterized by a high degree of
regularity, such as graphics/image processing.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
• Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820,
ETA10
• Most modern computers, particularly those with graphics processing units
(GPUs), employ SIMD instructions and execution units.
34. Single Instruction, Multiple Data (SIMD):
[Photos: ILLIAC IV, MasPar, TM CM-2, Cell, GPU, Cray X-MP, Cray Y-MP.]
35. Multiple Instruction, Single Data (MISD):
• A type of parallel computer
• Multiple Instruction: Each processing unit operates on the data
independently via separate instruction streams.
• Single Data: A single data stream is fed into multiple processing
units.
• Few actual examples of this class of parallel computer have ever
existed. One is the experimental Carnegie-Mellon C.mmp computer
(1971).
• Some conceivable uses might be:
• multiple frequency filters operating on a single signal stream
• multiple cryptography algorithms attempting to crack a single coded
message.
37. Multiple Instruction, Multiple Data (MIMD):
• A type of parallel computer
• Multiple Instruction: Every processor may be executing a different
instruction stream
• Multiple Data: Every processor may be working with a different
data stream
• Execution can be synchronous or asynchronous, deterministic or
non-deterministic
• Currently the most common type of parallel computer; most
modern supercomputers fall into this category.
• Examples: most current supercomputers, networked parallel
computer clusters and "grids", multi-processor SMP computers,
multi-core PCs.
Note: many MIMD architectures also include SIMD execution sub-components
39. Multiple Instruction, Multiple Data (MIMD):
[Photos: IBM Power, HP AlphaServer, Intel IA32/x64, Oracle SPARC, Cray XT3, Oracle Exadata/Exalogic.]
40. Parallel Computer Memory Architecture
Shared Memory
Shared memory parallel computers vary widely, but generally have in common the
ability for all processors to access all memory as global address space.
Multiple processors can operate independently but share the same memory
resources.
Changes in a memory location effected by one processor are visible to all other
processors.
Shared memory machines can be divided into two main classes based upon
memory access times: UMA and NUMA.
Uniform Memory Access (UMA):
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
42. Basic structure of a centralized
shared-memory multiprocessor
[Diagram: four processors, each with one or more levels of cache, sharing one physical memory over a bus.]
Multiple processor-cache subsystems share the same physical memory, typically connected by a bus.
In larger designs, multiple buses, or even a switch, may be used, but the key architectural property
remains: uniform access time to all memory from all processors.
43. Basic Architecture of a Distributed Multiprocessor
[Diagram: eight nodes, each containing a processor with cache, memory, and I/O, connected by an interconnection network.]
Consists of individual nodes containing a processor, some memory, typically some I/O, and an interface to an
interconnection network that connects all the nodes. Individual nodes may contain a small number of
processors, which may be interconnected by a small bus or a different interconnection technology, which is less
scalable than the global interconnection network.
44. Issues in Parallel Machine Design
• Communication: how do parallel operations communicate data results?
• Synchronization: how are parallel operations coordinated?
• Resource Management: how are a large number of parallel tasks scheduled onto finite hardware?
• Scalability: how large a machine can be built?
45. Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
49. Implicit Parallelism: Superscalar Processors
Issue varying numbers of instructions per clock
• Statically scheduled
- using compiler techniques
- in-order execution
• Dynamically scheduled
- extracting ILP by examining 100s of instructions
- scheduling them in parallel as operands become available
- renaming registers to eliminate anti-dependences
- out-of-order execution
- speculative execution
50. Pipelining Execution
IF: instruction fetch; ID: instruction decode; EX: execution; WB: write back

Cycle             1    2    3    4    5    6    7    8
Instruction i     IF   ID   EX   WB
Instruction i+1        IF   ID   EX   WB
Instruction i+2             IF   ID   EX   WB
Instruction i+3                  IF   ID   EX   WB
Instruction i+4                       IF   ID   EX   WB
51. Super-Scalar Execution
A 2-issue super-scalar machine: one integer and one floating-point instruction issue together each cycle.

Cycle             1    2    3    4    5    6    7
Integer           IF   ID   EX   WB
Floating point    IF   ID   EX   WB
Integer                IF   ID   EX   WB
Floating point         IF   ID   EX   WB
Integer                     IF   ID   EX   WB
Floating point              IF   ID   EX   WB
Integer                          IF   ID   EX   WB
Floating point                   IF   ID   EX   WB
52. Data Dependence and Hazards
True data dependence (intrinsic, data-dependent) between instructions:
I: add r1,r2,r3
J: sub r4,r1,r3
If two instructions are data dependent, they cannot execute
simultaneously, be completely overlapped, or execute out of order.
If a data dependence causes a hazard in the pipeline, it is
called a Read After Write (RAW) hazard.
53. ILP and Data Dependencies, Hazards
HW/SW must preserve program order: the order instructions would
execute in if run sequentially, as determined by the original source program.
Dependences are a property of programs.
Importance of data dependencies:
1) indicates the possibility of a hazard
2) determines the order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly
be exploited
Goal: exploit parallelism by preserving program order only
where it affects the outcome of the program.
54. Name Dependence #1: Anti-dependence
Name dependence: two instructions use the same register or
memory location, called a name, but there is no flow of data between
the instructions associated with that name; there are two versions of
name dependence.
InstrJ writes an operand before InstrI reads it:
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
If an anti-dependence causes a hazard in the pipeline, it is called a
Write After Read (WAR) hazard.
55. Name Dependence #2: Output Dependence
InstrJ writes an operand before InstrI writes it:
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “output dependence” by compiler writers.
This also results from the reuse of the name “r1”.
If an output dependence causes a hazard in the pipeline, it is called a
Write After Write (WAW) hazard.
Instructions involved in a name dependence can execute
simultaneously if the name used in the instructions is changed so the
instructions do not conflict.
Register renaming resolves name dependences for registers;
renaming can be done either by the compiler or by hardware.
56. Control Dependencies
Every instruction is control dependent on some set of
branches, and, in general, these control dependencies must
be preserved to preserve program order.
if p1 {
S1;
};
if p2 {
S2;
}
S1 is control dependent on p1, and S2 is control dependent
on p2 but not on p1.
Control dependence need not always be preserved: we are
willing to execute instructions that should not have been
executed, thereby violating the control dependences, if we can
do so without affecting the correctness of the program.
This is the basis of speculative execution.
57. Speculation
Greater ILP: overcome control dependence by speculating in
hardware on the outcome of branches and executing the
program as if the guesses were correct.
Speculation ⇒ fetch, issue, and execute
instructions as if branch predictions were always
correct.
(Dynamic scheduling alone ⇒ only fetches and issues
such instructions.)
Essentially a data-flow execution model: operations
execute as soon as their operands are available.
58. Speculation Is Rampant in Modern Superscalars
Different predictors:
• Branch prediction
• Value prediction
• Prefetching (memory access pattern prediction)
Inefficient:
• Predictions can go wrong
• Wrongly predicted data has to be flushed
• Even when it does not impact performance, it consumes power
60. Explicit Parallel Processors
Parallelism is exposed to software: the compiler or the programmer
Many different forms, from loosely coupled multiprocessors to tightly coupled VLIW
61. Little’s Law
Parallelism = Throughput × Latency
[Diagram: one operation; throughput per cycle vs. latency in cycles.]
To maintain a throughput of T operations/cycle when each operation has a
latency of L cycles, T × L independent operations are needed.
For fixed parallelism:
• decreased latency allows increased throughput
• decreased throughput allows increased latency tolerance
For example, sustaining 2 operations per cycle with a 4-cycle latency
requires 2 × 4 = 8 independent operations in flight.
64. What is concurrency?
What is a sequential program?
A single thread of control that executes one instruction and, when it is
finished, executes the next logical instruction.
What is a concurrent program?
A collection of autonomous sequential threads, executing (logically) in
parallel.
The implementation (i.e. execution) of a collection of threads can be:
• Multiprogramming: threads multiplex their executions on a single processor
• Multiprocessing: threads multiplex their executions on a multiprocessor or a multicore system
• Distributed Processing: processes multiplex their executions on several different machines
65. Concurrency and Parallelism
Concurrency is not (only) parallelism.
Interleaved concurrency: logically simultaneous processing;
interleaved execution on a single processor.
Parallelism: physically simultaneous processing;
requires a multiprocessor or a multicore system.
[Diagram: tasks A, B, C interleaved over time on one processor vs. running simultaneously over time on separate processors.]
66. Other Types of Synchronization
There are a lot of ways to use concurrency in programming:
• Semaphores
• Blocking & non-blocking queues
• Concurrent hash maps
• Copy-on-write arrays
• Exchangers
• Barriers
• Futures
• Thread pool support
67. Potential Concurrency Problems
• Deadlock: two or more threads stop and wait for each other
• Livelock: two or more threads continue to execute, but make
no progress toward the ultimate goal
• Starvation: some thread gets deferred forever
• Lack of fairness: not every thread gets its turn to make progress
• Race condition: some possible interleaving of threads results in an
undesired computation result
68. Parallelism Conclusions
• Concurrency and parallelism are important concepts in computer science
• Concurrency can simplify programming, but it can be very hard to
understand and debug concurrent programs
• Parallelism is critical for high performance, from supercomputers in
national labs to multicores and GPUs on your desktop
• Concurrency is the basis for writing parallel programs
• Next lecture: how to write a parallel program
69. Architecture Recap
Two primary patterns of multicore architecture design:
Shared memory (e.g., Intel Core 2 Duo/Quad)
• One copy of data shared among many cores
• Atomicity, locking and synchronization essential for correctness
• Many scalability issues
[Diagram: processors P1…Pn connected through an interconnection network to a single shared memory.]
Distributed memory (e.g., Cell)
• Cores primarily access local memory
• Explicit data exchange between cores
• Data distribution and communication orchestration essential for performance
[Diagram: processors P1…Pn, each with its own memory M1…Mn, connected by an interconnection network.]
70. Programming Shared Memory Processors
Processors 1…n ask for X; there is only one place to look.
• Communication through shared variables
• Race conditions possible
• Use synchronization to protect from conflicts
• Change how data is stored to minimize synchronization
[Diagram: processors P1…Pn share a memory containing x through an interconnection network.]
71. Example of Parallelization
Data parallel: perform the same computation but operate on different data.
• A single process can fork multiple concurrent threads
• Each thread encapsulates its own execution path
• Each thread has local state and shared resources
• Threads communicate through shared resources such as global memory

for (i = 0; i < 12; i++)
  C[i] = A[i] + B[i];

[Diagram: a fork spawns threads handling i = 0–3, 4–7, and 8–11, which then join at a barrier.]
72. Example Parallelization with Threads

int A[12] = {...}; int B[12] = {...}; int C[12];

void *add_arrays(void *arg)
{
  int start = (int)(long)arg;
  int i;
  for (i = start; i < start + 4; i++)
    C[i] = A[i] + B[i];
  return NULL;
}

int main(int argc, char *argv[])
{
  pthread_t thread_ids[3];
  int rc, t;
  for (t = 0; t < 3; t++) {
    rc = pthread_create(&thread_ids[t],
                        NULL /* attributes */,
                        add_arrays /* function */,
                        (void *)(long)(t * 4) /* arg to function */);
  }
  pthread_exit(NULL);
}

[Diagram: a fork spawns threads handling i = 0–3, 4–7, and 8–11, which then join at a barrier.]
73. Types of Parallelism
Data parallelism: perform the same computation but operate on different data.
Control parallelism: perform different functions.

pthread_create(/* thread id */,
               /* attributes */,
               /* any function */,
               /* args to function */);

[Diagram: fork (threads) … join (barrier).]
74. Memory Access Latency in Shared Memory Architectures
Uniform Memory Access (UMA)
• Centrally located memory
• All processors are equidistant (equal access times)
Non-Uniform Memory Access (NUMA)
• Physically partitioned but accessible by all
• Processors have the same address space
• Placement of data affects performance
75. Summary of Parallel Performance Factors
• Coverage or extent of parallelism in the algorithm
• Granularity of data partitioning among processors
• Locality of computation and communication
… so how do I parallelize my program?
76. Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
78. Common Steps to Create a Parallel Program
[Diagram: a sequential computation is partitioned into tasks, which are assigned to processes (p0–p3), orchestrated, and mapped onto processors (P0–P3): decomposition → assignment → orchestration → mapping.]
79. Decomposition (Amdahl’s Law)
• Identify concurrency and decide at what level to exploit it
• Break up the computation into tasks to be divided among processes
Tasks may become available dynamically
The number of tasks may vary with time
• Enough tasks to keep processors busy
The number of tasks available at a time is an upper bound on the
achievable speedup
80. Granularity
• Specify a mechanism to divide work among cores
Balance work and reduce communication
• Structured approaches usually work well
Code inspection or understanding of the application
Well-known design patterns
• As programmers, we worry about partitioning first
Independent of architecture or programming model
But complexity often affects decisions!
81. Orchestration and Mapping
• Computation and communication concurrency
• Preserve locality of data
• Schedule tasks to satisfy dependences early
82. Parallel Programming by Pattern
• Provides a cookbook to systematically guide programmers
Decompose, Assign, Orchestrate, Map
Can lead to high-quality solutions in some domains
• Provides a common vocabulary for the programming community
Each pattern has a name, providing a vocabulary for
discussing solutions
• Helps with software reusability, malleability, and modularity
Written in a prescribed format to allow the reader to
quickly understand the solution and its context
• Otherwise it is too difficult for programmers, and software will not
fully exploit parallel hardware
83. History
Berkeley architecture professor Christopher Alexander published,
in 1977, patterns for city planning, landscaping, and
architecture in an attempt to capture principles for “living”
design.
85. Patterns in Object-Oriented Programming
• Design Patterns: Elements of Reusable Object-Oriented
Software (1995)
Gang of Four (GoF): Gamma, Helm, Johnson, Vlissides
• Catalogue of patterns: creational, structural, behavioral
86. Patterns for Parallelizing Programs
4 Design Spaces
Algorithm Expression
• Finding Concurrency: expose concurrent tasks
• Algorithm Structure: map tasks to processes to exploit parallel architecture
Software Construction
• Supporting Structures: code and data structuring patterns
• Implementation Mechanisms: low-level mechanisms used to write parallel programs
From Patterns for Parallel Programming, Mattson, Sanders, and Massingill (2005).
87. Here’s my algorithm, where’s the concurrency?
[Diagram: MPEG decoder. The MPEG bit stream enters the VLD, which splits macroblocks and motion vectors; frequency-encoded macroblocks pass through ZigZag, IQuantization, IDCT, and Saturation to become spatially encoded macroblocks, while differentially coded motion vectors go through Motion Vector Decode and Repeat; the streams join for Motion Compensation, producing the recovered picture, followed by Picture Reorder, Color Conversion, and Display.]
88. Here’s my algorithm, where’s the concurrency?
Task decomposition
• Independent coarse-grained computation
• Inherent to the algorithm
• A sequence of statements (instructions) that operate together as a group
• Corresponds to some logical part of the program
• Usually follows from the way the programmer thinks about a problem
[Diagram: the same MPEG decoder pipeline, with its independent stages highlighted as tasks.]
89. Here’s my algorithm, where’s the concurrency?
Task decomposition
• Parallelism in the application
Data decomposition
• The same computation is applied to small data chunks derived from a large data set
[Diagram: the same MPEG decoder pipeline, with data-parallel stages highlighted.]
90. Here’s my algorithm, where’s the concurrency?
Task decomposition
• Parallelism in the application
Data decomposition
• Same computation, many data
Pipeline decomposition
• Data assembly lines
• Producer-consumer chains
[Diagram: the same MPEG decoder pipeline, viewed as a producer-consumer pipeline.]
91. Guidelines for Task Decomposition
• Algorithms start with a good understanding of the problem being solved
• Programs often naturally decompose into tasks
Two common decompositions are function calls and distinct loop iterations
• It is easier to start with many tasks and later fuse them than to start
with too few tasks and later try to split them
92. Flexibility
Program design should afford flexibility in the number and
size of tasks generated
–
–
Tasks should not tied to a specific architecture
Fixed tasks vs. Parameterized tasks
Efficiency
Tasks should have enough work to amortize the cost of
creating and managing them
Tasks should be sufficiently independent so that managing
dependencies doesn’t become the bottleneck
Simplicity
The code has to remain readable, easy to understand,
and easy to debug
Guidelines for Task Decomposition
93. Data decomposition is often implied by task
decomposition
Programmers need to address task and data
decomposition to create a parallel program
Which decomposition to start with?
Data decomposition is a good starting point when
Main computation is organized around manipulation of a
large data structure
Similar operations are applied to different parts of the
data structure
Guidelines for Data Decomposition
Raul Goycoolea S.
Multiprocessor Programming, 16 February 2012
94. Array data structures
Decomposition of arrays along rows, columns, blocks
Recursive data structures
Example: decomposition of trees into sub-trees
[Figure: divide-and-conquer decomposition. The problem is recursively split into subproblems, each subproblem is computed, and the partial results are merged back into the final solution.]
Common Data Decompositions
95. Flexibility
Size and number of data chunks should support a wide
range of executions
Efficiency
Data chunks should generate comparable amounts of
work (for load balancing)
Simplicity
Complex data decompositions can get difficult to manage
and debug
Guidelines for Data Decompositions
96. Data is flowing through a sequence of stages
Assembly line is a good analogy
What’s a prime example of pipeline decomposition in
computer architecture?
Instruction pipeline in modern CPUs
What’s an example pipeline you may use in your UNIX shell?
Pipes in UNIX: cat foobar.c | grep bar | wc
Other examples
Signal processing
Graphics
[Figure: the MPEG stages ZigZag, IQuantization, IDCT, and Saturation as an example pipeline.]
Guidelines for Pipeline Decomposition
97. Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
99. Coverage or extent of parallelism in algorithm
Amdahl’s Law
Granularity of partitioning among processors
Communication cost and load balancing
Locality of computation and communication
Communication between processors or between
processors and their memories
Review: Keys to Parallel Performance
100. Communication cost model:

C = f × (o + l + (n/m)/B + t_c − overlap)

f = frequency of messages
o = overhead per message (at both ends)
l = network delay per message
n/m = average message size (total data sent / number of messages)
B = bandwidth along path (determined by network)
t_c = cost induced by contention per message
overlap = amount of latency hidden by overlapping communication with computation
Communication Cost Model
102. Computation to communication ratio limits
performance gains from pipelining
Get Data
Compute
Get Data
Compute
Where else to look for performance?
Limits in Pipelining Communication
103. Determined by program implementation and
interactions with the architecture
Examples:
Poor distribution of data across distributed memories
Unnecessarily fetching data that is not used
Redundant data fetches
Artifactual Communication
104. In uniprocessors, CPU communicates with memory
Loads and stores are to uniprocessors as
“get” and “put” are to distributed memory
multiprocessors
How is communication overlap enhanced in
uniprocessors?
Spatial locality
Temporal locality
Lessons From Uniprocessors
105. CPU asks for data at address 1000
Memory sends data at address 1000 … 1064
Amount of data sent depends on architecture
parameters such as the cache block size
Works well if CPU actually ends up using data from
1001, 1002, …, 1064
Otherwise wasted bandwidth and cache capacity
Spatial Locality
106. Main memory access is expensive
Memory hierarchy adds small but fast memories
(caches) near the CPU
Memories get bigger as distance
from CPU increases
CPU asks for data at address 1000
Memory hierarchy anticipates more accesses to same
address and stores a local copy
Works well if CPU actually ends up using data from 1000 over
and over and over …
Otherwise wasted cache capacity
[Figure: memory hierarchy with a level-1 cache, a larger level-2 cache, and main memory, each level growing in size with distance from the CPU.]
Temporal Locality
107. Data is transferred in chunks to amortize
communication cost
Cell: DMA gets up to 16K
Usually get a contiguous chunk of memory
Spatial locality
Computation should exhibit good spatial locality
characteristics
Temporal locality
Reorder computation to maximize use of data fetched
Reducing Artifactual Costs in
Distributed Memory Architectures
108. Tasks mapped to execution units (threads)
Threads run on individual processors (cores)
finish line: sequential time + longest parallel time
Two keys to faster execution
Load balance the work among the processors
Make execution on each processor faster
[Figure: execution alternates between sequential and parallel phases across the processors.]
Single Thread Performance
109. Need some way of
measuring performance
Coarse grained
measurements
% gcc sample.c
% time a.out
2.312u 0.062s 0:02.50 94.8%
% gcc sample.c –O3
% time a.out
1.921u 0.093s 0:02.03 99.0%
… but did we learn much
about what’s going on?
#define N (1 << 23)
#define T (10)
#include <stdio.h>
#include <string.h>
double a[N], b[N];
void cleara(double a[N]) {
  int i;
  for (i = 0; i < N; i++) {
    a[i] = 0;
  }
}
int main() {
  double s = 0, s2 = 0; int i, j;
  for (j = 0; j < T; j++) {
    for (i = 0; i < N; i++) {
      b[i] = 0;
    }
    cleara(a);
    memset(a, 0, sizeof(a));
    for (i = 0; i < N; i++) {
      s += a[i] * b[i];
      s2 += a[i] * a[i] + b[i] * b[i];
    }
  }
  printf("s %f s2 %f\n", s, s2);
  return 0;
}
record stop time
record start time
Understanding Performance
110. Increasingly possible to get accurate measurements
using performance counters
Special registers in the hardware to measure events
Insert code to start, read, and stop counter
Measure exactly what you want, anywhere you want
Can measure communication and computation duration
But requires manual changes
Monitoring nested scopes is an issue
Heisenberg effect: counters can perturb execution time
[Figure: timeline showing where the counter is cleared/started and stopped around the measured region.]
Measurements Using Counters
111. Event-based profiling
Interrupt execution when an event counter reaches a
threshold
Time-based profiling
Interrupt execution every t seconds
Works without modifying your code
Does not require that you know where problem might be
Supports multiple languages and programming models
Quite efficient for appropriate sampling frequencies
Dynamic Profiling
112. Cycles (clock ticks)
Pipeline stalls
Cache hits
Cache misses
Number of instructions
Number of loads
Number of stores
Number of floating point operations
…
Counter Examples
113. Processor utilization
Cycles / Wall Clock Time
Instructions per cycle
Instructions / Cycles
Instructions per memory operation
Instructions / Loads + Stores
Average number of instructions per load miss
Instructions / L1 Load Misses
Memory traffic
(Loads + Stores) * Lk cache line size
Bandwidth consumed
(Loads + Stores) * Lk cache line size / Wall Clock Time
Many others
Cache miss rate
Branch misprediction rate
…
Useful Derived Measurements
115. GNU gprof
Widely available with UNIX/Linux distributions
gcc –O2 –pg foo.c –o foo
./foo
gprof foo
HPC Toolkit
http://www.hipersoft.rice.edu/hpctoolkit/
PAPI
http://icl.cs.utk.edu/papi/
VTune
http://www.intel.com/cd/software/products/asmo-na/eng/vtune/
Many others
Popular Runtime Profiling Tools
116. Instruction level parallelism
Multiple functional units, deeply pipelined, speculation, ...
Data level parallelism
SIMD (Single Inst, Multiple Data): short vector instructions
(multimedia extensions)
Hardware is simpler, no heavily ported register files
Instructions are more compact
Reduces instruction fetch bandwidth
Complex memory hierarchies
Multiple level caches, many outstanding misses, prefetching, ...
Performance in Uniprocessors
time = compute + wait
117. SIMD registers hold short vectors
Instruction operates on all elements in a SIMD register at once

Scalar code (scalar registers hold a[i], b[i], c[i]):
for (int i = 0; i < n; i += 1) {
  c[i] = a[i] + b[i];
}

Vector code (SIMD registers hold a[i:i+3], b[i:i+3], c[i:i+3]):
for (int i = 0; i < n; i += 4) {
  c[i:i+3] = a[i:i+3] + b[i:i+3];
}
Single Instruction, Multiple Data
118. For Example Cell
SPU has 128 128-bit registers
All instructions are SIMD instructions
Registers are treated as short vectors of 8/16/32-bit
integers or single/double-precision floats
Instruction Set   Architecture   SIMD Width   Floating Point
AltiVec           PowerPC        128          yes
MMX/SSE           Intel          64/128       yes
3DNow!            AMD            64           yes
VIS               Sun            64           no
MAX2              HP             64           no
MVI               Alpha          64           no
MDMX              MIPS V         64           yes
SIMD in Major Instruction Set
Architectures (ISAs)
119. Library calls and inline assembly
Difficult to program
Not portable
Different extensions to the same ISA
MMX and SSE
SSE vs. 3DNow!
Compiler vs. Crypto Oracle T4
Using SIMD Instructions
120. Tune the parallelism first
Then tune performance on individual processors
Modern processors are complex
Need instruction level parallelism for performance
Understanding performance requires a lot of probing
Optimize for the memory hierarchy
Memory is much slower than processors
Multi-layer memory hierarchies try to hide the speed gap
Data locality is essential for performance
Programming for Performance
121. May have to change everything!
Algorithms, data structures, program structure
Focus on the biggest performance impediments
Too many issues to study everything
Remember the law of diminishing returns
Programming for Performance
122. Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
124. Parallel Execution
Parallelizing Compilers
Dependence Analysis
Increasing Parallelization Opportunities
Generation of Parallel Loops
Communication Code Generation
Compilers Outline
125. Instruction Level Parallelism
(ILP)
Task Level Parallelism (TLP)
Loop Level Parallelism (LLP)
or Data Parallelism
Pipeline Parallelism
Divide and Conquer
Parallelism
Scheduling and Hardware
Mainly by hand
Hand or Compiler Generated
Hardware or Streaming
Recursive functions
Types of Parallelism
126. 90% of the execution time in 10% of the code
Mostly in loops
If parallel, can get good performance
Load balancing
Relatively easy to analyze
Why Loops?
127. FORALL
No “loop carried
dependences”
Fully parallel
FORACROSS
Some “loop carried
dependences”
Programmer Defined Parallel Loop
128. Parallel Execution
Parallelizing Compilers
Dependence Analysis
Increasing Parallelization Opportunities
Generation of Parallel Loops
Communication Code Generation
Outline
129. Finding FORALL Loops out of FOR loops
Examples
FOR I = 0 to 5
A[I+1] = A[I] + 1
FOR I = 0 to 5
A[I] = A[I+6] + 1
FOR I = 0 to 5
A[2*I] = A[2*I + 1] + 1
Parallelizing Compilers
130. True dependence:
a = ...
... = a
Anti dependence:
... = a
a = ...
Output dependence:
a = ...
a = ...
Definition:
Data dependence exists between dynamic instances i and j iff
either i or j is a write operation
i and j refer to the same variable
i executes before j
How about array accesses within loops?
Dependences
131. Parallel Execution
Parallelizing Compilers
Dependence Analysis
Increasing Parallelization Opportunities
Generation of Parallel Loops
Communication Code Generation
Outline
132. FOR I = 0 to 5
A[I] = A[I] + 1
[Figure: iteration space (iterations 0 to 5) mapped onto the data space (array elements A[0] to A[12]); each iteration I writes A[I] using only the value it reads from the same A[I].]
Array Access in a Loop
133. Find data dependences in loop
For every pair of array accesses to the same array:
If the first access has at least one dynamic instance (an iteration) in
which it refers to a location in the array that the second access also
refers to in at least one of the later dynamic instances (iterations),
then there is a data dependence between the statements
(Note that an access can pair with itself – output dependences)
Definition
Loop-carried dependence:
a dependence that crosses a loop boundary
If there are no loop-carried dependences, the loop is parallelizable
Recognizing FORALL Loops
134. FOR I = 1 to n
FOR J = 1 to n
A[I, J] = A[I-1, J+1] + 1
FOR I = 1 to n
FOR J = 1 to n
A[I] = A[I-1] + 1
[Figure: iteration-space diagrams (I-J plane) showing the dependence direction for each loop nest.]
What is the Dependence?
135. Parallel Execution
Parallelizing Compilers
Dependence Analysis
Increasing Parallelization Opportunities
Generation of Parallel Loops
Communication Code Generation
Outline
136. Scalar Privatization
Reduction Recognition
Induction Variable Identification
Array Privatization
Interprocedural Parallelization
Loop Transformations
Granularity of Parallelism
Increasing Parallelization
Opportunities
137. Example
FOR i = 1 to n
X = A[i] * 3;
B[i] = X;
Is there a loop carried dependence?
What is the type of dependence?
Scalar Privatization
138. Reduction Analysis:
Only associative operations
The result is never used within the loop
Transformation
Integer Xtmp[NUMPROC];
Barrier();
FOR i = myPid*Iters to MIN((myPid+1)*Iters, n)
Xtmp[myPid] = Xtmp[myPid] + A[i];
Barrier();
If(myPid == 0) {
FOR p = 0 to NUMPROC-1
X = X + Xtmp[p];
…
Reduction Recognition
139. Example
FOR i = 0 to N
A[i] = 2^i;
After strength reduction
t = 1
FOR i = 0 to N
A[i] = t;
t = t*2;
What happened to loop carried dependences?
Need to do opposite of this!
Perform induction variable analysis
Rewrite IVs as a function of the loop variable
Induction Variables
140. Similar to scalar privatization
However, analysis is more complex
Array Data Dependence Analysis:
Checks if two iterations access the same location
Array Data Flow Analysis:
Checks if two iterations access the same value
Transformations
Similar to scalar privatization
Private copy for each processor or expand with an additional
dimension
Array Privatization
141. Function calls will make a loop unparallelizable
Reduction of available parallelism
A lot of inner-loop parallelism
Solutions
Interprocedural Analysis
Inlining
Interprocedural Parallelization
142. Cache Coherent Shared Memory Machine
Generate code for the parallel loop nest
No Cache Coherent Shared Memory
or Distributed Memory Machines
Generate code for the parallel loop nest
Identify communication
Generate communication code
Communication Code Generation
143. Eliminating redundant communication
Communication aggregation
Multi-cast identification
Local memory management
Communication Optimizations
144. Automatic parallelization of loops with arrays
Requires Data Dependence Analysis
Iteration space & data space abstraction
An integer programming problem
Many optimizations that’ll increase parallelism
Transforming loop nests and communication code generation
Fourier-Motzkin Elimination provides a nice framework
Summary
145. Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Future of Parallel Architectures
147. "I think there is a world market for
maybe five computers."
– Thomas Watson, chairman of IBM, 1949
"There is no reason in the world
anyone would want a computer in their
home. No reason."
– Ken Olsen, Chairman, DEC, 1977
"640K of RAM ought to be enough for
anybody."
– Bill Gates, 1981
Predicting the Future is Always Risky
148. Evolution
Relatively easy to predict
Extrapolate the trends
Revolution
A completely new technology or solution
Hard to Predict
Paradigm Shifts can occur in both
Future = Evolution + Revolution
150. Look at the trends
Moore’s Law
Power Consumption
Wire Delay
Hardware Complexity
Parallelizing Compilers
Program Design Methodologies
Design Drivers are different in
Different Generations
Evolution
151. [Figure: uniprocessor performance (relative to the VAX-11/780, log scale) and transistor counts, 1978-2016, for the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, and Itanium 2. Performance grew about 25%/year early on and 52%/year from the mid-1980s, then flattened, while transistor counts kept climbing from tens of thousands toward a billion.]
From Hennessy and Patterson, Computer Architecture:
A Quantitative Approach, 4th edition, 2006
The Road to Multicore: Moore’s Law
152. [Figure: SPECint2000 performance (log scale) of uniprocessors from 1985 to 2007: Intel 386 and 486 through Pentium, Pentium 2/3/4, and Itanium; Alpha 21064/21164/21264; Sparc, SuperSparc, and Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, and x86-64. The performance curve flattens in the early 2000s.]
The Road to Multicore:
Uniprocessor Performance (SPECint)
153. The Road to Multicore:
Uniprocessor Performance (SPECint)
General-purpose unicores have stopped historic
performance scaling
Power consumption
Wire delays
DRAM access latency
Diminishing returns of more instruction-level parallelism
154. [Figure: power consumption (watts, log scale from 1 to 1000) of the same processor families from 1985 to 2007, rising steadily from the Intel 386 and 486 toward the Pentium 4 and Itanium.]
Power Consumption (watts)
155. [Figure: power efficiency (watts/SPEC, from 0 to 0.7) of the same processor families from 1982 to 2006; efficiency worsens sharply for the latest high-frequency designs.]
Power Efficiency (watts/spec)
158. [Figure: power density (W/cm²) of Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 to the Pentium, extrapolating past a hot plate toward a nuclear reactor, a rocket nozzle, and the sun’s surface.]
Intel Developer Forum, Spring 2004 - Pat Gelsinger
(Pentium at 90 W)
Cube relationship between the cycle time and power
Heat is becoming an unmanageable problem for CPU architecture
159. Improvement in Automatic Parallelization
[Figure: timeline from 1970 to 2010. 1970s-80s: automatic parallelizing compilers for FORTRAN and vectorization technology. 1990s: compiling for instruction-level parallelism; prevalence of type-unsafe languages and complex data structures (C, C++). 2000s: type-safe languages (Java, C#). 2010s: demand driven by multicores?]
162. Don’t have to contend with uniprocessors
The era of Moore’s Law induced performance gains is over!
Parallel programming will be required by the masses
– not just a few supercomputer super-users
Novel Opportunities in Multicores
163. Don’t have to contend with uniprocessors
The era of Moore’s Law induced performance gains is over!
Parallel programming will be required by the masses
– not just a few supercomputer super-users
Not your same old multiprocessor problem
How does going from Multiprocessors to Multicores impact
programs?
What changed?
Where is the Impact?
Communication Bandwidth
Communication Latency
Novel Opportunities in Multicores
164. How much data can be communicated
between two cores?
What changed?
Number of wires
IO is the true bottleneck
On-chip wire density is very high
Clock rate
IO is slower than on-chip
Multiplexing
No sharing of pins
Impact on programming model?
Massive data exchange is possible
Data movement is not the bottleneck
Processor affinity not that important
32 Giga bits/sec → ~300 Tera bits/sec (10,000X)
Communication Bandwidth
165. How long does it take for a round trip
communication?
What changed?
Length of wire
Very short wires are faster
Pipeline stages
No multiplexing
On-chip is much closer
Bypass and speculation?
Impact on programming model?
Ultra-fast synchronization
Can run real-time apps
on multiple cores
~200 cycles → ~4 cycles (50X)
Communication Latency
166. [Figure: three organizations. Traditional multiprocessor: processing elements (PE) with caches ($$), each attached to its own memory across chip boundaries. Basic multicore (IBM Power): a few PEs with caches sharing memory from one chip. Integrated multicore (8-core, 8-thread Oracle T4): many PEs with caches and an on-chip interconnect sharing memory.]
Past, Present and the Future?
167. Summary
• As technology evolves, the inherent flexibility of multiprocessors
adapts to new requirements
• Processors can be used at any time for many kinds
of applications
• Optimization adapts processors to High Performance
requirements
168. References
• Author: Raul Goycoolea, Oracle Corporation.
• A search on the WWW for "parallel programming" or "parallel computing" will yield a
wide variety of information.
• Recommended reading:
• "Designing and Building Parallel Programs". Ian Foster. http://www-unix.mcs.anl.gov/dbpp/
• "Introduction to Parallel Computing". Ananth Grama, Anshul Gupta, George Karypis,
Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/
• "Overview of Recent Supercomputers". A.J. van der Steen, Jack Dongarra.
www.phys.uu.nl/~steen/web03/overview.html
• MIT Multicore Programming Class: 6.189
• Prof. Saman Amarasinghe
• Photos/Graphics have been created by the author, obtained from non-copyrighted,
government or public domain (such as http://commons.wikimedia.org/) sources, or used
with the permission of authors from other presentations and web pages.
169. Twitter
http://twitter.com/raul_goycoolea
Raul Goycoolea Seoane
Keep in Touch
Facebook
http://www.facebook.com/raul.goycoolea
Linkedin
http://www.linkedin.com/in/raulgoy
Blog
http://blogs.oracle.com/raulgoy/