Parallel Computing Architecture &
Programming Techniques
Raul Goycoolea S.
Solution Architect Manager
Oracle Enterprise Architecture Group
Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Antecedents of
Parallel
Computing
The “Software Crisis”
“To put it quite bluntly: as long as there were no
machines, programming was no problem at all; when
we had a few weak computers, programming became a
mild problem, and now we have gigantic computers,
programming has become an equally gigantic problem."
-- E. Dijkstra, 1972 Turing Award Lecture
The First Software Crisis
• Time Frame: ’60s and ’70s
• Problem: Assembly Language Programming
Computers could not handle larger, more complex programs
• Needed to get Abstraction and Portability without
losing Performance
Common Properties
Single flow of control
Single memory image
Differences:
Register File
ISA
Functional Units
How Did We Solve The First Software
Crisis?
• High-level languages for von-Neumann machines
FORTRAN and C
• Provided “common machine language” for
uniprocessors
The Second Software Crisis
• Time Frame: ’80s and ’90s
• Problem: Inability to build and maintain complex and
robust applications requiring multi-million lines of
code developed by hundreds of programmers
Computers could now handle larger, more complex programs
• Needed to get Composability, Malleability and
Maintainability
High performance was not an issue; it was left to Moore’s Law
How Did We Solve the Second
Software Crisis?
• Object Oriented Programming
C++, C# and Java
• Also…
Better tools
• Component libraries, Purify
Better software engineering methodology
• Design patterns, specification, testing, code
reviews
Today:
Programmers are Oblivious to Processors
• Solid boundary between Hardware and Software
• Programmers don’t have to know anything about the
processor
High level languages abstract away the processors
Ex: Java bytecode is machine independent
Moore’s law does not require the programmers to know anything
about the processors to get good speedups
• Programs are oblivious of the processor; they work on all processors
A program written in the ’70s in C still works, and runs much faster, today
• This abstraction provides a lot of freedom for the
programmers
The Origins of a Third Crisis
• Time Frame: 2005 to 20??
• Problem: Sequential performance is left behind by
Moore’s law
• Needed continuous and reasonable performance
improvements
to support new features
to support larger datasets
• While sustaining portability, malleability and maintainability without unduly increasing the complexity faced by the programmer, which is critical to keep up with the current rate of evolution in software
[Chart: uniprocessor performance (vs. VAX-11/780) grew ~25%/year, then ~52%/year, while transistor counts rose from ~100,000 toward 1,000,000,000, 1978-2016, across the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2]
From Hennessy and Patterson, Computer Architecture:
A Quantitative Approach, 4th edition, 2006
The Road to Multicore: Moore’s Law
[Chart: uniprocessor SPECint2000 performance, 1985-2007, for Intel (386, 486, Pentium through Itanium), Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, and AMD K6/K7/x86-64]
The Road to Multicore:
Uniprocessor Performance (SPECint)
The Road to Multicore:
Uniprocessor Performance (SPECint)
General-purpose unicores have stopped historic
performance scaling
Power consumption
Wire delays
DRAM access latency
Diminishing returns of more instruction-level parallelism
[Chart: power consumption (watts) of uniprocessors, 1985-2007, for Intel 386/486/Pentium through Itanium, Alpha, Sparc, MIPS, HP PA, PowerPC and AMD parts, rising from a few watts to over 100 W]
Power Consumption (watts)
[Chart: power efficiency (watts/SPEC), 1982-2006, worsening over successive generations of the same processor families]
Power Efficiency (watts/spec)
[Chart: fraction of a 400 mm2 die reachable by a wire in one clock cycle, for clocks from 700 MHz to 13.5 GHz, as process technology scales from 0.26 to 0.06 microns, 1996-2014. From the SIA Roadmap]
Range of a Wire in One Clock Cycle
[Chart: processor performance improves ~60%/yr (2X/1.5yr) while DRAM improves ~9%/yr (2X/10yrs), 1980-2004, a widening gap]
DRAM Access Latency
• Access times are a
speed of light issue
• Memory technology is
also changing
SRAM are getting harder to
scale
DRAM is no longer cheapest
cost/bit
• Power efficiency is an
issue here as well
[Chart: power density (W/cm2) from the 4004, 8008, 8080, 8085, 8086, 286, 386 and 486 to the Pentium, trending toward hot plate, nuclear reactor, rocket nozzle and Sun's-surface levels. Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W)]
Cube relationship between cycle time and power in CPU architectures
Heat is becoming an unmanageable problem
Diminishing Returns
• The ’80s: Superscalar expansion
50% per year improvement in performance
Transistors applied to implicit parallelism
- pipeline processor (10 CPI --> 1 CPI)
• The ’90s: The Era of Diminishing Returns
Squeezing out the last implicit parallelism
2-way to 6-way issue, out-of-order issue, branch prediction
1 CPI --> 0.5 CPI
Performance below expectations; projects delayed & canceled
• The ’00s: The Beginning of the Multicore Era
The need for Explicit Parallelism
MIT Raw: 16 cores (2002)
Intel Tanglewood: dual core IA/64
Intel Dempsey: dual core Xeon
Intel Montecito: 1.7 billion transistors, dual core IA/64
Intel Pentium D (Smithfield)
Intel Tejas & Jayhawk: unicore (4GHz P4), cancelled
IBM Power 6: dual core
IBM Power 4 and 5: dual cores since 2001
Intel Pentium Extreme: 3.2GHz dual core
Intel Yonah: dual core mobile
AMD Opteron: dual core
Sun Olympus and Niagara: 8 processor cores
IBM Cell: scalable multicore
(Timeline: 2H 2004 through 2H 2006)
Unicores are going extinct; now everything is multicore
[Chart: number of cores per chip, 1970-2010, from single-core 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium and Itanium 2 to multicores such as Power4, PExtreme, Power6, Yonah, PA-8800, Opteron, Opteron 4P, Xeon MP, Tanglewood, Cell, Niagara, Raw, Cavium Octeon, Raza XLR, Cisco CSR-1, Intel Tflops, Picochip PC102, Broadcom 1480, Xbox360 and Ambric AM2045, spanning 2 to 512 cores]
Multicores Future
Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Introduction to
Parallel
Architectures
Traditionally, software has been written for serial computation:
• To be run on a single computer having a single Central Processing Unit (CPU)
• A problem is broken into a discrete series of instructions
• Instructions are executed one after another
• Only one instruction may execute at any moment in time
What is Parallel Computing?
What is Parallel Computing?
In the simplest sense, parallel computing is the simultaneous use of multiple
compute resources to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken into discrete parts that can be solved concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different CPUs
Options in Parallel Computing?
The compute resources might be:
• A single computer with multiple processors;
• An arbitrary number of computers connected by a network;
• A combination of both.
The computational problem should be able to:
• Be broken apart into discrete pieces of work that can be solved
simultaneously;
• Execute multiple program instructions at any moment in time;
• Be solved in less time with multiple compute resources than with a
single compute resource.
The Real World is Massively Parallel
• Parallel computing is an evolution of serial computing that
attempts to emulate what has always been the state of
affairs in the natural world: many complex, interrelated
events happening at the same time, yet within a sequence.
For example:
• Galaxy formation
• Planetary movement
• Weather and ocean patterns
• Tectonic plate drift
• Rush hour traffic
• Automobile assembly line
• Building a jet
• Ordering a hamburger at the drive through.
Architecture Concepts
Von Neumann Architecture
• Named after the Hungarian mathematician John von Neumann who first authored
the general requirements for an electronic computer in his 1945 papers
• Since then, virtually all computers have followed this basic design, differing from
earlier computers which were programmed through "hard wiring”
• Comprised of four main components:
• Memory
• Control Unit
• Arithmetic Logic Unit
• Input/Output
• Read/write, random access memory is used to store
both program instructions and data
• Program instructions are coded data which tell
the computer to do something
• Data is simply information to be used by the
program
• Control unit fetches instructions/data from memory, decodes
the instructions and then sequentially coordinates operations
to accomplish the programmed task.
• Arithmetic Logic Unit performs basic arithmetic operations
• Input/Output is the interface to the human operator
Flynn’s Taxonomy
• There are different ways to classify parallel computers. One of the more
widely used classifications, in use since 1966, is called Flynn's
Taxonomy.
• Flynn's taxonomy distinguishes multi-processor computer architectures
according to how they can be classified along the two independent
dimensions of Instruction and Data. Each of these dimensions can
have only one of two possible states: Single or Multiple.
• The matrix below defines the 4 possible classifications according to
Flynn:
Single Instruction, Single Data (SISD):
• A serial (non-parallel) computer
• Single Instruction: Only one instruction stream is
being acted on by the CPU during any one clock
cycle
• Single Data: Only one data stream is being used
as input during any one clock cycle
• Deterministic execution
• This is the oldest and even today, the most
common type of computer
• Examples: older generation mainframes,
minicomputers and workstations; most modern
day PCs.
Single Instruction, Single Data (SISD):
Single Instruction, Multiple Data
(SIMD):
• A type of parallel computer
• Single Instruction: All processing units execute the same instruction at any
given clock cycle
• Multiple Data: Each processing unit can operate on a different data element
• Best suited for specialized problems characterized by a high degree of
regularity, such as graphics/image processing.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
• Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820,
ETA10
• Most modern computers, particularly those with graphics processor units
(GPUs) employ SIMD instructions and execution units.
Single Instruction, Multiple Data
(SIMD):
ILLIAC IV MasPar TM CM-2 Cell GPU
Cray X-MP Cray Y-MP
• A type of parallel computer
• Multiple Instruction: Each processing unit operates on the data
independently via separate instruction streams.
• Single Data: A single data stream is fed into multiple processing
units.
• Few actual examples of this class of parallel computer have ever
existed. One is the experimental Carnegie-Mellon C.mmp computer
(1971).
• Some conceivable uses might be:
• multiple frequency filters operating on a single signal stream
• multiple cryptography algorithms attempting to crack a single coded
message.
Multiple Instruction, Single Data
(MISD):
Multiple Instruction, Single Data
(MISD):
• A type of parallel computer
• Multiple Instruction: Every processor may be executing a different
instruction stream
• Multiple Data: Every processor may be working with a different
data stream
• Execution can be synchronous or asynchronous, deterministic or
non-deterministic
• Currently, the most common type of parallel computer - most
modern supercomputers fall into this category.
• Examples: most current supercomputers, networked parallel
computer clusters and "grids", multi-processor SMP computers,
multi-core PCs.
Note: many MIMD architectures also include SIMD execution sub-components
Multiple Instruction, Multiple Data
(MIMD):
Multiple Instruction, Multiple Data
(MIMD):
Multiple Instruction, Multiple Data
(MIMD):
IBM Power HP Alphaserver Intel IA32/x64
Oracle SPARC Cray XT3 Oracle Exadata/Exalogic
Parallel Computer Memory Architecture
Shared Memory
Shared memory parallel computers vary widely, but generally have in common the
ability for all processors to access all memory as global address space.
Multiple processors can operate independently but share the same memory
resources.
Changes in a memory location effected by one processor are visible to all other
processors.
Shared memory machines can be divided into two main classes based upon
memory access times: UMA and NUMA.
Uniform Memory Access (UMA):
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
Parallel Computer Memory Architecture
Shared Memory
Shared Memory (UMA) Shared Memory (NUMA)
Basic structure of a centralized
shared-memory multiprocessor
[Diagram: four processors, each with one or more levels of cache, sharing a single main memory over a bus]
Multiple processor-cache subsystems share the same physical memory, typically connected by a bus.
In larger designs, multiple buses, or even a switch, may be used, but the key architectural property remains: uniform access time to all memory from all processors.
[Diagram: eight nodes, each consisting of a processor with cache, local memory and I/O, connected by an interconnection network]
Basic Architecture of a Distributed
Multiprocessor
Consists of individual nodes containing a processor, some memory, typically some I/O, and an interface to an
interconnection network that connects all the nodes. Individual nodes may contain a small number of
processors, which may be interconnected by a small bus or a different interconnection technology, which is less
scalable than the global interconnection network.
Communication
how do parallel operations communicate data results?
Synchronization
how are parallel operations coordinated?
Resource Management
how are a large number of parallel tasks scheduled onto
finite hardware?
Scalability
how large a machine can be built?
Issues in Parallel Machine Design
Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Parallel
Programming
Concepts
[Diagram: implicit parallelism is extracted by hardware (superscalar processors) or by the compiler; explicit parallelism is expressed directly for explicitly parallel architectures]
Implicit vs. Explicit Parallelism
Implicit Parallelism: Superscalar Processors
Explicit Parallelism
Shared Instruction Processors
Shared Sequencer Processors
Shared Network Processors
Shared Memory Processors
Multicore Processors
Outline
Issue varying numbers of instructions per clock
Statically scheduled: using compiler techniques; in-order execution
Dynamically scheduled: extracting ILP by examining 100s of instructions; scheduling them in parallel as operands become available; renaming registers to eliminate anti-dependences; out-of-order execution; speculative execution
Implicit Parallelism: Superscalar
Processors
[Diagram: five instructions (i through i+4) flowing through a four-stage pipeline over cycles 1-8, with a new instruction entering each cycle]
IF: instruction fetch, ID: instruction decode, EX: execution, WB: write back
Pipelining Execution
[Diagram: a 2-issue super-scalar machine issuing one integer and one floating-point instruction per cycle, each pair flowing through IF, ID, EX, WB over cycles 1-7]
Super-Scalar Execution
Intrinsic data dependence (aka true dependence) between instructions:
I: add r1,r2,r3
J: sub r4,r1,r3
If two instructions are data dependent, they cannot execute simultaneously, be completely overlapped, or execute out of order
If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard
Data Dependence and Hazards
HW/SW must preserve program order:
order instructions would execute in if executed sequentially as
determined by original source program
Dependences are a property of programs
Importance of the data dependencies
1) indicates the possibility of a hazard
2) determines order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly
be exploited
Goal: exploit parallelism by preserving program order only
where it affects the outcome of the program
ILP and Data Dependencies, Hazards
Name dependence: when 2 instructions use same register or
memory location, called a name, but no flow of data between
the instructions associated with that name; 2 versions of
name dependence
Instr J writes operand before Instr I reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”
If anti-dependence caused a hazard in the pipeline, called a
Write After Read (WAR) hazard
Name Dependence #1: Anti-dependence
Instr J writes operand before Instr I writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “output dependence” by compiler writers.
This also results from the reuse of name “r1”
If output dependence causes a hazard in the pipeline, it is called a
Write After Write (WAW) hazard
Instructions involved in a name dependence can execute
simultaneously if name used in instructions is changed so
instructions do not conflict
Register renaming resolves name dependence for registers
Renaming can be done either by compiler or by HW
Name Dependence #2: Output
Dependence
Every instruction is control dependent on some set of
branches, and, in general, these control dependencies must
be preserved to preserve program order
if p1 {
S1;
};
if p2 {
S2;
}
S1 is control dependent on p1, and S2 is control dependent
on p2 but not on p1.
Control dependence need not always be preserved:
we are willing to execute instructions that should not have been
executed, thereby violating the control dependences, if we can
do so without affecting the correctness of the program
Speculative Execution
Control Dependencies
Greater ILP: Overcome control dependence by hardware
speculating on outcome of branches and executing
program as if guesses were correct
Speculation ⇒ fetch, issue, and execute
instructions as if branch predictions were always
correct
Dynamic scheduling ⇒ only fetches and issues
instructions
Essentially a data flow execution model: Operations
execute as soon as their operands are available
Speculation
Different predictors
Branch Prediction
Value Prediction
Prefetching (memory access pattern prediction)
Inefficient
Predictions can go wrong
Has to flush out wrongly predicted data
Even when this does not impact performance, it consumes power
Speculation is Rampant in Modern
Superscalars
Implicit Parallelism: Superscalar Processors
Explicit Parallelism
Shared Instruction Processors
Shared Sequencer Processors
Shared Network Processors
Shared Memory Processors
Multicore Processors
Outline
Parallelism is exposed to software
Compiler or Programmer
Many different forms
Loosely coupled Multiprocessors to tightly coupled VLIW
Explicit Parallel Processors
[Diagram: one operation occupying a box whose width is its latency in cycles and whose height is the throughput per cycle]
Parallelism = Throughput * Latency
To maintain throughput T/cycle when each operation has
latency L cycles, need T*L independent operations
For fixed parallelism:
decreased latency allows increased throughput
decreased throughput allows increased latency tolerance
Little’s Law
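A quick worked example of Little's Law (the numbers are illustrative, not from the slides): to sustain a throughput of T = 4 operations per cycle when each operation has a latency of L = 5 cycles, the code must expose T*L = 20 independent operations at all times; if only 10 independent operations are available, the sustainable throughput drops to 10/5 = 2 operations per cycle.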
[Diagram: four timelines illustrating Data-Level Parallelism (DLP), Instruction-Level Parallelism (ILP), Pipelining, and Thread-Level Parallelism (TLP)]
Types of Software Parallelism
[Diagram: data-parallel and thread-parallel work can be translated into instruction-level parallelism and pipelining]
Translating Parallelism Types
What is a sequential program?
A single thread of control that executes one instruction and when it is
finished execute the next logical instruction
What is a concurrent program?
A collection of autonomous sequential threads, executing (logically) in
parallel
The implementation (i.e. execution) of a collection of threads can be:
Multiprogramming
– Threads multiplex their executions on a single processor.
Multiprocessing
– Threads multiplex their executions on a multiprocessor or a multicore system
Distributed Processing
– Processes multiplex their executions on several different machines
What is concurrency?
Concurrency is not (only) parallelism
Interleaved Concurrency
Logically simultaneous processing
Interleaved execution on a single
processor
Parallelism
Physically simultaneous processing
Requires a multiprocessor or a
multicore system
[Diagram: tasks A, B and C interleaved over time on one processor (concurrency) vs. running at the same time on different processors (parallelism)]
Concurrency and Parallelism
There are a lot of ways to use Concurrency in
Programming
Semaphores
Blocking & non-blocking queues
Concurrent hash maps
Copy-on-write arrays
Exchangers
Barriers
Futures
Thread pool support
Other Types of Synchronization
Deadlock
Two or more threads stop and wait for each other
Livelock
Two or more threads continue to execute, but make
no progress toward the ultimate goal
Starvation
Some thread gets deferred forever
Lack of fairness
Not every thread gets a turn to make progress
Race Condition
Some possible interleaving of threads results in an
undesired computation result
Potential Concurrency Problems
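As a minimal sketch of the first problem above (deadlock), two pthreads that take the same two mutexes in opposite order can each end up waiting forever for the lock the other one holds; the lock names and structure are illustrative, not from the slides:

#include <pthread.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void *thread1(void *arg)
{
  pthread_mutex_lock(&lock_a);
  pthread_mutex_lock(&lock_b);   /* blocks forever if thread2 already holds lock_b */
  /* ... critical section ... */
  pthread_mutex_unlock(&lock_b);
  pthread_mutex_unlock(&lock_a);
  return NULL;
}

void *thread2(void *arg)
{
  pthread_mutex_lock(&lock_b);
  pthread_mutex_lock(&lock_a);   /* blocks forever if thread1 already holds lock_a */
  /* ... critical section ... */
  pthread_mutex_unlock(&lock_a);
  pthread_mutex_unlock(&lock_b);
  return NULL;
}

The standard fix is to make every thread acquire the locks in one agreed global order.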
Concurrency and Parallelism are important concepts
in Computer Science
Concurrency can simplify programming
However it can be very hard to understand and debug
concurrent programs
Parallelism is critical for high performance
From Supercomputers in national labs to
Multicores and GPUs on your desktop
Concurrency is the basis for writing parallel programs
Next Lecture: How to write a Parallel Program
Parallelism Conclusions
Shared memory (ex: Intel Core 2 Duo/Quad)
One copy of data shared among many cores
Atomicity, locking and synchronization essential for correctness
Many scalability issues
Distributed memory (ex: Cell)
Cores primarily access local memory
Explicit data exchange between cores
Data distribution and communication orchestration is essential for performance
[Diagram: shared memory (P1…Pn connected through an interconnection network to one memory) vs. distributed memory (P1…Pn each with a local memory M1…Mn, connected by an interconnection network)]
Two primary patterns of multicore architecture design
Architecture Recap
Processor 1…n ask for X
There is only one place to look
Communication through
shared variables
Race conditions possible
Use synchronization to protect from conflicts
Change how data is stored to minimize synchronization
[Diagram: processors P1…Pn accessing a single shared memory holding x through an interconnection network]
Programming Shared Memory Processors
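A minimal sketch of protecting a shared variable with synchronization (a pthread mutex is one option; the variable, thread count and iteration count are illustrative, not from the slides):

#include <pthread.h>
#include <stdio.h>

long x = 0;                                        /* the shared variable all processors ask for */
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
  for (int i = 0; i < 100000; i++) {
    pthread_mutex_lock(&x_lock);                   /* synchronization protects the update */
    x = x + 1;                                     /* read-modify-write is now atomic */
    pthread_mutex_unlock(&x_lock);
  }
  return NULL;
}

int main(void)
{
  pthread_t t[4];
  for (int i = 0; i < 4; i++)
    pthread_create(&t[i], NULL, worker, NULL);
  for (int i = 0; i < 4; i++)
    pthread_join(t[i], NULL);
  printf("x = %ld\n", x);                          /* 400000 with the lock; unpredictable without it */
  return 0;
}

Without the lock the four increments race and updates are lost; restructuring how the data is stored (for example, one private counter per thread, summed at the end) minimizes how often synchronization is needed.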
Data parallel
Perform same computation
but operate on different data
A single process can fork
multiple concurrent threads
Each thread encapsulates its own execution path
Each thread has local state and shared resources
Threads communicate through shared resources
such as global memory
for (i = 0; i < 12; i++)
C[i] = A[i] + B[i];
[Diagram: fork (threads), the 12 iterations i = 0 … 11 executed concurrently in blocks, then join (barrier)]
Example of Parallelization
#include <pthread.h>

int A[12] = {...}; int B[12] = {...}; int C[12];

void *add_arrays(void *arg)
{
  int start = (int)(long)arg;   /* starting index passed through the void * argument */
  int i;
  for (i = start; i < start + 4; i++)
    C[i] = A[i] + B[i];
  return NULL;
}

int main(int argc, char *argv[])
{
  pthread_t thread_ids[3];      /* three threads of four elements each cover C[0..11] */
  int rc, t;
  for (t = 0; t < 3; t++) {
    rc = pthread_create(&thread_ids[t],
                        NULL /* attributes */,
                        add_arrays /* function */,
                        (void *)(long)(t * 4) /* args to function */);
  }
  pthread_exit(NULL);           /* main exits; the worker threads run to completion */
}
[Diagram: fork (threads), iterations i = 0 … 11 split into blocks of four, then join (barrier)]
Example Parallelization with Threads
Data parallelism
Perform same computation
but operate on different data
Control parallelism
Perform different functions
fork (threads) … join (barrier)
pthread_create(/* thread id */,
               /* attributes */,
               /* any function */,
               /* args to function */);
Types of Parallelism
Uniform Memory Access (UMA)
Centrally located memory
All processors are equidistant (access times)
Non-Uniform Memory Access (NUMA)
Physically partitioned but accessible by all
Processors have the same address space
Placement of data affects performance
Memory Access Latency in Shared
Memory Architectures
Coverage or extent of parallelism in algorithm
Granularity of data partitioning among processors
Locality of computation and communication
… so how do I parallelize my program?
Summary of Parallel Performance
Factors
Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Parallel
Design
Patterns
[Diagram: partitioning a sequential computation into a parallel program in four steps: decomposition into tasks, assignment of tasks to processes p0…p3, orchestration, and mapping of processes to processors P0…P3]
Common Steps to Create a Parallel
Program
Identify concurrency and decide at what level to
exploit it
Break up computation into tasks to be divided
among processes
Tasks may become available dynamically
Number of tasks may vary with time
Enough tasks to keep processors busy
Number of tasks available at a time is upper bound on
achievable speedup
Decomposition (Amdahl’s Law)
Specify mechanism to divide work among cores
Balance work and reduce communication
Structured approaches usually work well
Code inspection or understanding of application
Well-known design patterns
As programmers, we worry about partitioning first
Independent of architecture or programming model
But complexity often affects decisions!
Granularity
Computation and communication concurrency
Preserve locality of data
Schedule tasks to satisfy dependences early
Orchestration and Mapping
Provides a cookbook to systematically guide programmers
Decompose, Assign, Orchestrate, Map
Can lead to high quality solutions in some domains
Provide common vocabulary to the programming community
Each pattern has a name, providing a vocabulary for
discussing solutions
Helps with software reusability, malleability, and modularity
Written in prescribed format to allow the reader to
quickly understand the solution and its context
Otherwise, too difficult for programmers, and software will not
fully exploit parallel hardware
Parallel Programming by Pattern
Berkeley architecture professor Christopher Alexander
In 1977, published patterns for city planning, landscaping, and architecture in an attempt to capture principles for “living” design
History
Example 167 (p. 783)
Design Patterns: Elements of Reusable Object-
Oriented Software (1995)
Gang of Four (GOF): Gamma, Helm, Johnson, Vlissides
Catalogue of patterns
Creation, structural, behavioral
Patterns in Object-Oriented
Programming
4 Design Spaces
Algorithm Expression
Finding Concurrency: expose concurrent tasks
Algorithm Structure: map tasks to processes to exploit parallel architecture
Software Construction
Supporting Structures: code and data structuring patterns
Implementation Mechanisms: low level mechanisms used to write parallel programs
Patterns for Parallel
Programming. Mattson,
Sanders, and Massingill
(2005).
Patterns for Parallelizing Programs
[Diagram: MPEG decoder block diagram. The MPEG bit stream goes through VLD, producing macroblocks and motion vectors; the stream splits into frequency-encoded macroblocks (ZigZag, IQuantization, IDCT, Saturation) and differentially coded motion vectors (Motion Vector Decode, Repeat); the spatially encoded macroblocks and motion vectors join for Motion Compensation, and the recovered picture passes through Picture Reorder, Color Conversion and Display]
Here’s my algorithm, Where’s the
concurrency?
Task decomposition
Independent coarse-grained
computation
Inherent to algorithm
Sequence of statements
(instructions) that operate
together as a group
Corresponds to some logical
part of program
Usually follows from the way
programmer thinks about a
problem
[Diagram: the MPEG decoder again, with the independent coarse-grained stages of the task decomposition highlighted]
Here’s my algorithm, Where’s the
concurrency?
[Diagram: the MPEG decoder again, with the data decomposition highlighted: the same computation applied to each macroblock]
Task decomposition
Parallelism in the application
Data decomposition
Same computation is applied
to small data chunks derived
from large data set
Here’s my algorithm, Where’s the
concurrency?
[Diagram: the MPEG decoder again, with the pipeline decomposition (producer-consumer chain of stages) highlighted]
Task decomposition
Parallelism in the application
Data decomposition
Same computation many data
Pipeline decomposition
Data assembly lines
Producer-consumer chains
Here’s my algorithm, Where’s the
concurrency?
Algorithms start with a good understanding of the
problem being solved
Programs often naturally decompose into tasks
Two common decompositions are function calls and distinct loop iterations
Easier to start with many tasks and later fuse them,
rather than too few tasks and later try to split them
Guidelines for Task Decomposition
Flexibility
Program design should afford flexibility in the number and size of tasks generated
Tasks should not be tied to a specific architecture
Fixed tasks vs. parameterized tasks
Efficiency
Tasks should have enough work to amortize the cost of creating and managing them
Tasks should be sufficiently independent so that managing dependencies doesn't become the bottleneck
Simplicity
The code has to remain readable and easy to understand,
and debug
Guidelines for Task Decomposition
Data decomposition is often implied by task
decomposition
Programmers need to address task and data
decomposition to create a parallel program
Which decomposition to start with?
Data decomposition is a good starting point when
Main computation is organized around manipulation of a
large data structure
Similar operations are applied to different parts of the
data structure
Guidelines for Data Decomposition
Array data structures
Decomposition of arrays along rows, columns, blocks
Recursive data structures
Example: decomposition of trees into sub-trees
[Diagram: divide and conquer: the problem is split into subproblems, each subproblem is computed, and the partial results are merged into the solution]
Common Data Decompositions
Flexibility
Size and number of data chunks should support a wide
range of executions
Efficiency
Data chunks should generate comparable amounts of
work (for load balancing)
Simplicity
Complex data compositions can get difficult to manage
and debug
Guidelines for Data Decompositions
Data is flowing through a sequence of stages
Assembly line is a good analogy
What’s a prime example of pipeline decomposition in
computer architecture?
Instruction pipeline in modern CPUs
What’s an example pipeline you may use in your UNIX shell?
Pipes in UNIX: cat foobar.c | grep bar | wc
Other examples
Signal processing
Graphics
(e.g., the ZigZag, IQuantization, IDCT and Saturation stages of the MPEG decoder)
Guidelines for Pipeline Decomposition
Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Performance &
Optimization
Coverage or extent of parallelism in algorithm
Amdahl's Law
Granularity of partitioning among processors
Communication cost and load balancing
Locality of computation and communication
Communication between processors or between
processors and their memories
Review: Keys to Parallel Performance
C = f * (o + l + n/m/B + t_contention - overlap)
f = frequency of messages
o = overhead per message (at both ends)
l = network delay per message
n = total data sent
m = number of messages (so n/m is the data per message)
B = bandwidth along path (determined by network)
t_contention = cost induced by contention per message
overlap = amount of latency hidden by concurrency with computation
Communication Cost Model
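To make the model concrete, a small illustrative calculation (all numbers invented for the example, not taken from the slides): with f = 1,000 messages, o = 1 µs of per-message overhead, l = 2 µs of network delay, n = 8 MB of total data sent in m = 1,000 messages over B = 1 GB/s, and no contention, each message carries 8 KB and needs about 8 µs of transfer time, so C is roughly 1,000 * (1 + 2 + 8) µs = 11 ms; overlapping 4 µs of each message with computation reduces this to about 7 ms.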
[Diagram: between synchronization points, "Get Data" phases leave the CPU idle and "Compute" phases leave the memory idle unless communication is overlapped with computation]
Overlapping Communication with
Computation
Computation to communication ratio limits
performance gains from pipelining
[Diagram: pipelined Get Data / Compute stages]
Where else to look for performance?
Limits in Pipelining Communication
Determined by program implementation and
interactions with the architecture
Examples:
Poor distribution of data across distributed memories
Unnecessarily fetching data that is not used
Redundant data fetches
Artifactual Communication
In uniprocessors, CPU communicates with memory
Loads and stores are to uniprocessors as “get” and “put” are to distributed memory multiprocessors
How is communication overlap enhanced in uniprocessors?
Spatial locality
Temporal locality
Lessons From Uniprocessors
CPU asks for data at address 1000
Memory sends data at address 1000 … 1064
Amount of data sent depends on architecture
parameters such as the cache block size
Works well if CPU actually ends up using data from
1001, 1002, …, 1064
Otherwise wasted bandwidth and cache capacity
Spatial Locality
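A minimal C sketch of the effect (the array name and size are invented for the example): traversing a row-major matrix row by row uses every element of each fetched cache line, while traversing it column by column touches a new line on almost every access and wastes most of each fetched block.

#define N 1024
double m[N][N];

double sum_by_rows(void)        /* good spatial locality: consecutive addresses */
{
  double s = 0.0;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      s += m[i][j];
  return s;
}

double sum_by_cols(void)        /* poor spatial locality: stride of N doubles */
{
  double s = 0.0;
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      s += m[i][j];
  return s;
}

Both functions compute the same sum; only the access order, and therefore how much of each fetched cache block is actually used, differs.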
Main memory access is expensive
Memory hierarchy adds small but fast memories
(caches) near the CPU
Memories get bigger as distance
from CPU increases
CPU asks for data at address 1000
Memory hierarchy anticipates more accesses to same
address and stores a local copy
Works well if CPU actually ends up using data from 1000 over
and over and over …
Otherwise wasted cache capacity
[Diagram: memory hierarchy with a level 1 cache, a level 2 cache and main memory, growing in size and distance from the CPU]
Temporal Locality
Data is transferred in chunks to amortize
communication cost
Cell: DMA gets up to 16K
Usually get a contiguous chunk of memory
Spatial locality
Computation should exhibit good spatial locality
characteristics
Temporal locality
Reorder computation to maximize use of data fetched
Reducing Artifactual Costs in
Distributed Memory Architectures
Tasks mapped to execution units (threads)
Threads run on individual processors (cores)
finish line: sequential time + longest parallel time
Two keys to faster execution
Load balance the work among the processors
Make execution on each processor faster
Single Thread Performance
Need some way of
measuring performance
Coarse grained
measurements
% gcc sample.c
% time a.out
2.312u 0.062s 0:02.50 94.8%
% gcc sample.c –O3
% time a.out
1.921u 0.093s 0:02.03 99.0%
… but did we learn much
about what’s going on?
#define N (1 << 23)
#define T (10)
#include <string.h>
#include <stdio.h>
double a[N], b[N];
void cleara(double a[N]) {
  int i;
  for (i = 0; i < N; i++) {
    a[i] = 0;
  }
}
int main() {
  double s = 0, s2 = 0; int i, j;
  /* record start time here */
  for (j = 0; j < T; j++) {
    for (i = 0; i < N; i++) {
      b[i] = 0;
    }
    cleara(a);
    memset(a, 0, sizeof(a));
    for (i = 0; i < N; i++) {
      s += a[i] * b[i];
      s2 += a[i] * a[i] + b[i] * b[i];
    }
  }
  /* record stop time here */
  printf("s %f s2 %f\n", s, s2);
}
Understanding Performance
Increasingly possible to get accurate measurements
using performance counters
Special registers in the hardware to measure events
Insert code to start, read, and stop counter
Measure exactly what you want, anywhere you want
Can measure communication and computation duration
But requires manual changes
Monitoring nested scopes is an issue
Heisenberg effect: counters can perturb execution time
Measurements Using Counters
Event-based profiling
Interrupt execution when an event counter reaches a
threshold
Time-based profiling
Interrupt execution every t seconds
Works without modifying your code
Does not require that you know where problem might be
Supports multiple languages and programming models
Quite efficient for appropriate sampling frequencies
Dynamic Profiling
Cycles (clock ticks)
Pipeline stalls
Cache hits
Cache misses
Number of instructions
Number of loads
Number of stores
Number of floating point operations
…
Counter Examples
Processor utilization
Cycles / Wall Clock Time
Instructions per cycle
Instructions / Cycles
Instructions per memory operation
Instructions / Loads + Stores
Average number of instructions per load miss
Instructions / L1 Load Misses
Memory traffic
(Loads + Stores) * Lk cache line size
Bandwidth consumed
(Loads + Stores) * Lk cache line size / Wall Clock Time
Many others
Cache miss rate
Branch misprediction rate
…
Useful Derived Measurements
[Diagram: common profiling workflow: the compiler turns application source into binary object code; a run profiles execution and produces a performance profile; binary analysis and source correlation are used to interpret the profile]
Common Profiling Workflow
GNU gprof
Widely available with UNIX/Linux distributions
gcc –O2 –pg foo.c –o foo
./foo
gprof foo
HPC Toolkit
http://www.hipersoft.rice.edu/hpctoolkit/
PAPI
http://icl.cs.utk.edu/papi/
VTune
http://www.intel.com/cd/software/products/asmo-na/eng/vtune/
Many others
Popular Runtime Profiling Tools
Instruction level parallelism
Multiple functional units, deeply pipelined, speculation, ...
Data level parallelism
SIMD (Single Instruction, Multiple Data): short vector instructions (multimedia extensions)
Hardware is simpler, no heavily ported register files
Instructions are more compact
Reduces instruction fetch bandwidth
Complex memory hierarchies
Multiple level caches, many outstanding misses, prefetching, …
Performance in Uniprocessors
time = compute + wait
SIMD registers hold short vectors
Instruction operates on all elements in a SIMD register at once
Scalar code (one element of a, b, c per scalar register per iteration):
for (int i = 0; i < n; i += 1) {
  c[i] = a[i] + b[i]
}
Vector code (four elements of a, b, c per SIMD register per iteration):
for (int i = 0; i < n; i += 4) {
  c[i:i+3] = a[i:i+3] + b[i:i+3]
}
Single Instruction, Multiple Data
For Example Cell
SPU has 128 128-bit registers
All instructions are SIMD instructions
Registers are treated as short vectors of 8/16/32-bit
integers or single/double-precision floats
Instruction Set | Architecture | SIMD Width | Floating Point
AltiVec         | PowerPC      | 128        | yes
MMX/SSE         | Intel        | 64/128     | yes
3DNow!          | AMD          | 64         | yes
VIS             | Sun          | 64         | no
MAX2            | HP           | 64         | no
MVI             | Alpha        | 64         | no
MDMX            | MIPS V       | 64         | yes
SIMD in Major Instruction Set
Architectures (ISAs)
Library calls and inline assembly
Difficult to program
Not portable
Different extensions to the same ISA
MMX and SSE
SSE vs. 3DNow!
Compiler vs. Crypto Oracle T4
Using SIMD Instructions
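As a sketch of the library-call/intrinsics route mentioned above (SSE intrinsics from <xmmintrin.h>; the function name and the use of unaligned loads are choices made for this example, not from the slides), the scalar addition loop can be written so that each instruction adds four floats at once:

#include <xmmintrin.h>                             /* SSE intrinsics */

void add_arrays_sse(float *c, const float *a, const float *b, int n)
{
  int i;
  for (i = 0; i + 4 <= n; i += 4) {                /* four floats per 128-bit register */
    __m128 va = _mm_loadu_ps(&a[i]);               /* unaligned loads for simplicity */
    __m128 vb = _mm_loadu_ps(&b[i]);
    _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));      /* c[i:i+3] = a[i:i+3] + b[i:i+3] */
  }
  for (; i < n; i++)                               /* scalar cleanup for the remainder */
    c[i] = a[i] + b[i];
}

This is essentially what an auto-vectorizing compiler tries to generate; hand-written intrinsics give more control but, as noted above, are not portable across the ISAs in the table.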
Tune the parallelism first
Then tune performance on individual processors
Modern processors are complex
Need instruction level parallelism for performance
Understanding performance requires a lot of probing
Optimize for the memory hierarchy
Memory is much slower than processors
Multi-layer memory hierarchies try to hide the speed gap
Data locality is essential for performance
Programming for Performance
May have to change everything!
Algorithms, data structures, program structure
Focus on the biggest performance impediments
Too many issues to study everything
Remember the law of diminishing returns
Programming for Performance
Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Parallel
Compilers
Parallel Execution
Parallelizing Compilers
Dependence Analysis
Increasing Parallelization Opportunities
Generation of Parallel Loops
Communication Code Generation
Compilers Outline
Instruction Level Parallelism (ILP): scheduling and hardware
Task Level Parallelism (TLP): mainly by hand
Loop Level Parallelism (LLP) or Data Parallelism: hand or compiler generated
Pipeline Parallelism: hardware or streaming
Divide and Conquer Parallelism: recursive functions
Types of Parallelism
90% of the execution time in 10% of the code
Mostly in loops
If parallel, can get good performance
Load balancing
Relatively easy to analyze
Why Loops?
FORALL
No “loop carried
dependences”
Fully parallel
FORACROSS
Some “loop carried
dependences”
Programmer Defined Parallel Loop
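One common way to express a FORALL-style loop in C is an OpenMP parallel for, shown here as a sketch (OpenMP is only one possible notation and the array names are illustrative): every iteration writes a distinct A[i], so there is no loop-carried dependence and the iterations may run in any order on any number of threads.

#include <omp.h>

void forall_example(double *A, const double *B, int n)
{
  /* FORALL: iterations are independent, so the runtime may
     distribute them across threads in any order */
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    A[i] = B[i] * 2.0 + 1.0;
}

Built with an OpenMP-enabled compiler (for example gcc -fopenmp). A FORACROSS loop, which has loop-carried dependences, cannot be expressed this way without extra synchronization or restructuring.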
Parallel Execution
Parallelizing Compilers
Dependence Analysis
Increasing Parallelization Opportunities
Generation of Parallel Loops
Communication Code Generation
Outline
Finding FORALL Loops out of FOR loops
Examples
FOR I = 0 to 5
A[I+1] = A[I] + 1
FOR I = 0 to 5
A[I] = A[I+6] + 1
For I = 0 to 5
A[2*I] = A[2*I + 1] + 1
Parallelizing Compilers
True dependence
a =
= a
Anti dependence
= a
a =
Output dependence
a =
a =
Definition:
Data dependence exists between two dynamic instances i and j iff:
either i or j is a write operation,
i and j refer to the same variable, and
i executes before j
How about array accesses within loops?
Dependences
Parallel Execution
Parallelizing Compilers
Dependence Analysis
Increasing Parallelization Opportunities
Generation of Parallel Loops
Communication Code Generation
Outline
FOR I = 0 to 5
A[I] = A[I] + 1
[Diagram: iteration space I = 0 … 5 mapped to data space A[0] … A[12]; each iteration both reads and writes only its own element A[I]]
Array Access in a Loop
Find data dependences in loop
For every pair of array accesses to the same array
If the first access has at least one dynamic instance (an iteration) in
which it refers to a location in the array that the second access also
refers to in at least one of the later dynamic instances (iterations).
Then there is a data dependence between the statements
(Note that same array can refer to itself – output dependences)
Definition
Loop-carried dependence:
dependence that crosses a loop boundary
If there are no loop-carried dependences, the loop is parallelizable
Recognizing FORALL Loops
FOR I = 1 to n
FOR J = 1 to n
A[I, J] = A[I-1, J+1] + 1
FOR I = 1 to n
FOR J = 1 to n
A[I] = A[I-1] + 1
What is the Dependence?
Parallel Execution
Parallelizing Compilers
Dependence Analysis
Increasing Parallelization Opportunities
Generation of Parallel Loops
Communication Code Generation
Outline
Scalar Privatization
Reduction Recognition
Induction Variable Identification
Array Privatization
Interprocedural Parallelization
Loop Transformations
Granularity of Parallelism
Increasing Parallelization
Opportunities
Example
FOR i = 1 to n
X = A[i] * 3;
B[i] = X;
Is there a loop carried dependence?
What is the type of dependence?
Scalar Privatization
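A minimal sketch of the transformation, using OpenMP's private clause as one possible notation (the loop body is the one from the slide; the enclosing function and types are added so the fragment is self-contained): X is written before it is read in every iteration, so it carries only a name dependence, and giving each thread its own copy makes the loop fully parallel.

#include <omp.h>

void privatized_loop(const int *A, int *B, int n)
{
  int i, X;
  /* X carries an anti/output (name) dependence across iterations,
     not a flow of values, so each thread keeps a private copy */
  #pragma omp parallel for private(X)
  for (i = 1; i <= n; i++) {
    X = A[i] * 3;
    B[i] = X;
  }
}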
Reduction Analysis:
Only associative operations
The result is never used within the loop
Transformation
Integer Xtmp[NUMPROC];
Barrier();
FOR i = myPid*Iters to MIN((myPid+1)*Iters, n)
Xtmp[myPid] = Xtmp[myPid] + A[i];
Barrier();
If(myPid == 0) {
FOR p = 0 to NUMPROC-1
X = X + Xtmp[p];
…
Reduction Recognition
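The same idea written with OpenMP's reduction clause, as a sketch equivalent to the explicit Xtmp[myPid] accumulation above (function and variable names are illustrative):

#include <omp.h>

double sum_reduction(const double *A, int n)
{
  double X = 0.0;
  /* each thread accumulates into a private copy of X; the partial
     sums are combined after the loop, like the Xtmp[] version */
  #pragma omp parallel for reduction(+:X)
  for (int i = 0; i < n; i++)
    X += A[i];
  return X;
}

This is safe precisely because the operation is associative and X is not otherwise used within the loop, the two conditions listed above.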
Example
FOR i = 0 to N
A[i] = 2^i;
After strength reduction
t = 1
FOR i = 0 to N
A[i] = t;
t = t*2;
What happened to loop carried dependences?
Need to do opposite of this!
Perform induction variable analysis
Rewrite IVs as a function of the loop variable
Induction Variables
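A small C sketch of the rewrite (the types, bounds and use of ldexp() are choices made for this example): expressing t directly as a function of the loop index removes the t = t*2 loop-carried dependence, at the cost of recomputing the power in each iteration.

#include <math.h>

void fill_powers_of_two(double *A, int N)
{
  /* strength-reduced, serial form (loop-carried dependence on t):
       double t = 1.0;
       for (int i = 0; i <= N; i++) { A[i] = t; t = t * 2.0; }      */

  /* induction variable rewritten as a function of i: parallelizable */
  for (int i = 0; i <= N; i++)
    A[i] = ldexp(1.0, i);        /* 1.0 * 2^i, no dependence between iterations */
}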
Similar to scalar privatization
However, analysis is more complex
Array Data Dependence Analysis:
Checks if two iterations access the same location
Array Data Flow Analysis:
Checks if two iterations access the same value
Transformations
Similar to scalar privatization
Private copy for each processor or expand with an additional
dimension
Array Privatization
Function calls will make a loop unparallelizable
Reduction of available parallelism
A lot of inner-loop parallelism
Solutions
Interprocedural Analysis
Inlining
Interprocedural Parallelization
Cache Coherent Shared Memory Machine
Generate code for the parallel loop nest
No Cache Coherent Shared Memory
or Distributed Memory Machines
Generate code for the parallel loop nest
Identify communication
Generate communication code
Communication Code Generation
Eliminating redundant communication
Communication aggregation
Multi-cast identification
Local memory management
Communication Optimizations
Automatic parallelization of loops with arrays
Requires Data Dependence Analysis
Iteration space & data space abstraction
An integer programming problem
Many optimizations that’ll increase parallelism
Transforming loop nests and communication code generation
Fourier-Motzkin Elimination provides a nice framework
Summary
Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Future of Parallel Architectures
Future of
Parallel
Architectures
"I think there is a world market for
maybe five computers."
– Thomas Watson, chairman of IBM, 1949
"There is no reason in the world
anyone would want a computer in their
home. No reason."
– Ken Olsen, Chairman, DEC, 1977
"640K of RAM ought to be enough for
anybody."
– Bill Gates, 1981
Predicting the Future is Always Risky
Evolution
Relatively easy to predict
Extrapolate the trends
Revolution
A completely new technology or solution
Hard to Predict
Paradigm Shifts can occur in both
Future = Evolution + Revolution
Evolution
Trends
Architecture
Languages, Compilers and Tools
Revolution
Crossing the Abstraction Boundaries
Outline
Look at the trends
Moore's Law
Power Consumption
Wire Delay
Hardware Complexity
Parallelizing Compilers
Program Design Methodologies
Design Drivers are different in
Different Generations
Evolution
[Chart repeated: uniprocessor performance vs. transistor counts, 1978-2016. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
The Road to Multicore: Moore’s Law
[Chart repeated: uniprocessor SPECint2000 performance, 1985-2007]
The Road to Multicore:
Uniprocessor Performance (SPECint)
The Road to Multicore:
Uniprocessor Performance (SPECint)
General-purpose unicores have stopped historic
performance scaling
Power consumption
Wire delays
DRAM access latency
Diminishing returns of more instruction-level parallelism
[Chart repeated: uniprocessor power consumption (watts), 1985-2007]
Power Consumption (watts)
[Chart repeated: power efficiency (watts/SPEC), 1982-2006]
Power Efficiency (watts/spec)
[Chart repeated: range of a wire in one clock cycle on a 400 mm2 die, 1996-2014. From the SIA Roadmap]
Range of a Wire in One Clock Cycle
[Chart repeated: processor (~60%/yr) vs. DRAM (~9%/yr) performance improvement, 1980-2004]
DRAM Access Latency
• Access times are a
speed of light issue
• Memory technology is
also changing
SRAM are getting harder to
scale
DRAM is no longer cheapest
cost/bit
• Power efficiency is an
issue here as well
[Chart repeated: power density (W/cm2) approaching hot plate, nuclear reactor, rocket nozzle and Sun's-surface levels. Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W)]
Cube relationship between cycle time and power in CPU architectures
Heat is becoming an unmanageable problem
Improvement in Automatic Parallelization (timeline, 1970-2010):
Automatic parallelizing compilers for FORTRAN
Vectorization technology
Compiling for Instruction Level Parallelism
Prevalence of type unsafe languages and complex data structures (C, C++)
Typesafe languages (Java, C#)
Demand driven by multicores?
[Chart repeated: number of cores per chip, 1970-2010, from unicores to multicores with 2 to 512 cores]
Multicores Future
Raul Goycoolea S.
Multiprocessor Programming 16016 February 2012
Outline
• Evolution
  Trends
  Architecture
  Languages, Compilers and Tools
• Revolution
  Crossing the Abstraction Boundaries
Raul Goycoolea S.
Multiprocessor Programming 161, 16 February 2012
Novel Opportunities in Multicores
• Don't have to contend with uniprocessors
  The era of Moore's Law induced performance gains is over!
• Parallel programming will be required by the masses
  – not just a few supercomputer super-users
Raul Goycoolea S.
Multiprocessor Programming 162, 16 February 2012
Novel Opportunities in Multicores
• Don't have to contend with uniprocessors
  The era of Moore's Law induced performance gains is over!
• Parallel programming will be required by the masses
  – not just a few supercomputer super-users
• Not your same old multiprocessor problem
  How does going from multiprocessors to multicores impact programs?
  What changed? Where is the impact?
  – Communication bandwidth
  – Communication latency
Raul Goycoolea S.
Multiprocessor Programming 163, 16 February 2012
Communication Bandwidth
How much data can be communicated between two cores?
What changed?
• Number of wires
  – IO is the true bottleneck
  – On-chip wire density is very high
• Clock rate
  – IO is slower than on-chip
• Multiplexing
  – No sharing of pins
Impact on programming model?
• Massive data exchange is possible
• Data movement is not the bottleneck, so processor affinity is not that important
32 Giga bits/sec off-chip vs. ~300 Tera bits/sec on-chip: roughly 10,000X (see the arithmetic note below)
Raul Goycoolea S.
Multiprocessor Programming 164, 16 February 2012
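A quick sanity check of the 10,000X figure (simple arithmetic, not from the slide):

  \frac{\sim 300\ \text{Tb/s}}{32\ \text{Gb/s}} = \frac{\sim 300{,}000\ \text{Gb/s}}{32\ \text{Gb/s}} \approx 9{,}400 \approx 10^{4}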
Communication Latency
How long does it take for a round-trip communication?
What changed?
• Length of wire
  – Very short wires are faster
• Pipeline stages
  – No multiplexing
  – On-chip is much closer
  – Bypass and speculation?
Impact on programming model?
• Ultra-fast synchronization
• Can run real-time apps on multiple cores
~200 cycles off-chip vs. ~4 cycles on-chip: roughly 50X lower latency (see the sketch below)
Raul Goycoolea S.
Multiprocessor Programming 165, 16 February 2012
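To see why dropping the round trip from about 200 cycles to about 4 (roughly 50X) changes the programming model, consider a barrier: its cost is a handful of core-to-core round trips, so it is only cheap when each round trip takes a few cycles. The sketch below is a minimal sense-reversing spin barrier written with C11 atomics and pthreads; it is illustrative only, and the names NTHREADS, spin_barrier_t, barrier_wait and worker are invented for this example rather than taken from the deck.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Sense-reversing spin barrier: the last thread to arrive flips the
   shared sense, releasing everyone spinning on it. */
typedef struct {
    atomic_int count;    /* threads still to arrive in this round */
    atomic_int sense;    /* shared sense, flips every round       */
    int nthreads;
} spin_barrier_t;

static spin_barrier_t bar = { NTHREADS, 0, NTHREADS };

static void barrier_wait(spin_barrier_t *b, int *local_sense)
{
    *local_sense = !*local_sense;               /* flip my private sense    */
    if (atomic_fetch_sub(&b->count, 1) == 1) {  /* I am the last to arrive  */
        atomic_store(&b->count, b->nthreads);   /* reset for the next round */
        atomic_store(&b->sense, *local_sense);  /* release the spinners     */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;  /* spin until the last arrival flips the shared sense */
    }
}

static void *worker(void *arg)
{
    long id = (long)arg;
    int local_sense = 0;
    for (int step = 0; step < 3; step++) {
        /* per-step computation on this core's share of the data goes here */
        barrier_wait(&bar, &local_sense);
        printf("thread %ld passed barrier %d\n", id, step);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

At multiprocessor latencies (~200 cycles per round trip) a barrier this fine-grained would be prohibitively expensive; at ~4 cycles it becomes reasonable to synchronize cores very frequently, which is what the slide means by ultra-fast synchronization.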
Past, Present and the Future?
[Diagram of three designs:]
• Traditional multiprocessor: separate processing elements (PE), each with its own cache ($$) and memory
• Basic multicore (IBM Power): a few PEs with caches on one chip, sharing memory
• Integrated multicore (8-core, 8-threads-per-core Oracle T4): PEs, caches and interconnect integrated on one chip, sharing memory
Raul Goycoolea S.
Multiprocessor Programming 166, 16 February 2012
Summary
• As technology evolves, the inherent flexibility of multiprocessors lets them adapt to new requirements
• Processors can be used at any time for many kinds of applications
• Optimization adapts processors to high-performance requirements
Raul Goycoolea S.
Multiprocessor Programming 167, 16 February 2012
References
• Author: Raul Goycoolea, Oracle Corporation.
• A search on the WWW for "parallel programming" or "parallel computing" will yield a wide variety of information.
• Recommended reading:
  "Designing and Building Parallel Programs". Ian Foster. http://www-unix.mcs.anl.gov/dbpp/
  "Introduction to Parallel Computing". Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/
  "Overview of Recent Supercomputers". A.J. van der Steen, Jack Dongarra. www.phys.uu.nl/~steen/web03/overview.html
• MIT Multicore Programming Class: 6.189, Prof. Saman Amarasinghe
• Photos/Graphics have been created by the author, obtained from non-copyrighted, government or public domain (such as http://commons.wikimedia.org/) sources, or used with the permission of authors from other presentations and web pages.
168
<Insert Picture Here>
Keep in Touch
Raul Goycoolea Seoane
Twitter: http://twitter.com/raul_goycoolea
Facebook: http://www.facebook.com/raul.goycoolea
Linkedin: http://www.linkedin.com/in/raulgoy
Blog: http://blogs.oracle.com/raulgoy/
Raul Goycoolea S.
Multiprocessor Programming 169, 16 February 2012
Questions?
Multiprocessor architecture and programming

  • 1. Parallel Computing Architecture & Programming Techniques Raul Goycoolea S. Solution Architect Manager Oracle Enterprise Architecture Group
  • 2. <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 216 February 2012
  • 4. The “Software Crisis” “To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem." -- E. Dijkstra, 1972 Turing Award Lecture Raul Goycoolea S. Multiprocessor Programming 416 February 2012
  • 5. The First Software Crisis • Time Frame: ’60s and ’70s • Problem: Assembly Language Programming Computers could handle larger more complex programs • Needed to get Abstraction and Portability without losing Performance Raul Goycoolea S. Multiprocessor Programming 516 February 2012
  • 6. Common Properties Single flow of control Single memory image Differences: Register File ISA Functional Units How Did We Solve The First Software Crisis? • High-level languages for von-Neumann machines FORTRAN and C • Provided “common machine language” for uniprocessors Raul Goycoolea S. Multiprocessor Programming 616 February 2012
  • 7. The Second Software Crisis • Time Frame: ’80s and ’90s • Problem: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers Computers could handle larger more complex programs • Needed to get Composability, Malleability and Maintainability High-performance was not an issue left for Moore’s Law Raul Goycoolea S. Multiprocessor Programming 716 February 2012
  • 8. How Did We Solve the Second Software Crisis? • Object Oriented Programming C++, C# and Java • Also… Better tools • Component libraries, Purify Better software engineering methodology • Design patterns, specification, testing, code reviews Raul Goycoolea S. Multiprocessor Programming 816 February 2012
  • 9. Today: Programmers are Oblivious to Processors • Solid boundary between Hardware and Software • Programmers don’t have to know anything about the processor High level languages abstract away the processors Ex: Java bytecode is machine independent Moore’s law does not require the programmers to know anything about the processors to get good speedups • Programs are oblivious of the processor works on all processors A program written in ’70 using C still works and is much faster today • This abstraction provides a lot of freedom for the programmers Raul Goycoolea S. Multiprocessor Programming 916 February 2012
  • 10. The Origins of a Third Crisis • Time Frame: 2005 to 20?? • Problem: Sequential performance is left behind by Moore’s law • Needed continuous and reasonable performance improvements to support new features to support larger datasets • While sustaining portability, malleability and maintainability without unduly increasing complexity faced by the programmer critical to keep-up with the current rate of evolution in software Raul Goycoolea S. Multiprocessor Programming 1016 February 2012
  • 11. Performance(vs.VAX-11/780) NumberofTransistors 52%/year 100 1000 10000 100000 1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016 %/year 10 8086 1 286 25%/year 386 486 Pentium P2 P3 P4 Itanium 2 Itanium 1,000,000,000 100,000 10,000 1,000,000 10,000,000 100,000,000 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006 The Road to Multicore: Moore’s Law Raul Goycoolea S. Multiprocessor Programming 1116 February 2012
  • 12. Specint2000 10000.00 1000.00 100.00 10.00 1.00 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 intel pentium intel pentium2 intel pentium3 intel pentium4 intel itanium Alpha 21064 Alpha 21164 Alpha 21264 Spar c Super Spar c Spar c64 Mips HP PA Power PC AMD K6 AMD K7 AMD x86-64 The Road to Multicore: Uniprocessor Performance (SPECint) Raul Goycoolea S. Multiprocessor Programming 1216 February 2012 Intel 386 Intel 486
  • 13. The Road to Multicore: Uniprocessor Performance (SPECint) General-purpose unicores have stopped historic performance scaling Power consumption Wire delays DRAM access latency Diminishing returns of more instruction-level parallelism Raul Goycoolea S. Multiprocessor Programming 1316 February 2012
  • 14. Power 1000 100 10 1 85 87 89 91 93 95 97 99 01 03 05 07 Intel 386 Intel 486 intel pentium intel pentium2 intel pentium3 intel pentium4 intel itanium Alpha21064 Alpha21164 Alpha21264 Sparc SuperSparc Sparc64 Mips HPPA Power PC AMDK6 AMDK7 AMDx86-64 Power Consumption (watts) Raul Goycoolea S. Multiprocessor Programming 1416 February 2012
  • 15. Watts/Spec 0.7 0.6 0.5 0.4 0.3 0.2 0.1 1982 1984 1987 1990 1993 1995 1998 2001 2004 2006 Year intel 386 intel 486 intel pentium intel pentium 2 intel pentium 3 intel pentium 4 intel itanium Alpha 21064 Alpha 21164 Alpha 21264 Sparc SuperSparc Sparc64 Mips HP PA Power PC AMD K6 AMD K7 AMD x86-64 0 Power Efficiency (watts/spec) Raul Goycoolea S. Multiprocessor Programming 1516 February 2012
  • 16. Process(microns) 0.06 0.04 0.02 0 0.26 0.24 0.22 0.2 0.18 0.16 0.14 0.12 0.1 0.08 1996 1998 2000 2002 2008 2010 2012 20142004 2006 Year 700 MHz 1.25 GHz 2.1 GHz 6 GHz 10 GHz 13.5 GHz • 400 mm2 Die • From the SIA Roadmap Range of a Wire in One Clock Cycle Raul Goycoolea S. Multiprocessor Programming 1616 February 2012
  • 17. Performance 19 84 19 94 19 92 19 82 19 88 19 86 19 80 19 96 19 98 20 00 20 02 19 90 20 04 1000000 10000 100 1 Year µProc 60%/yr. (2X/1.5yr) DRAM 9%/yr. (2X/10 yrs) DRAM Access Latency • Access times are a speed of light issue • Memory technology is also changing SRAM are getting harder to scale DRAM is no longer cheapest cost/bit • Power efficiency is an issue here as well Raul Goycoolea S. Multiprocessor Programming 1716 February 2012
  • 18. PowerDensity(W/cm2) 10,000 1,000 „70 „80 „90 „00 „10 10 4004 8008 8080 1 8086 8085 286 386 486 Pentium® Hot Plate Nuclear Reactor 100 Sun‟s Surface Rocket Nozzle Intel Developer Forum, Spring 2004 - Pat Gelsinger (Pentium at 90 W) Cube relationship between the cycle time and power CPUs Architecture Heat becoming an unmanageable problem Raul Goycoolea S. Multiprocessor Programming 1816 February 2012
  • 19. Diminishing Returns • The ’80s: Superscalar expansion 50% per year improvement in performance Transistors applied to implicit parallelism - pipeline processor (10 CPI --> 1 CPI) • The ’90s: The Era of Diminishing Returns Squeaking out the last implicit parallelism 2-way to 6-way issue, out-of-order issue, branch prediction 1 CPI --> 0.5 CPI Performance below expectations projects delayed & canceled • The ’00s: The Beginning of the Multicore Era The need for Explicit Parallelism Raul Goycoolea S. Multiprocessor Programming 1916 February 2012
  • 20. Mit Raw 16 Cores 2002 Intel Tanglewood Dual Core IA/64 Intel Dempsey Dual Core Xeon Intel Montecito 1.7 Billion transistors Dual Core IA/64 Intel Pentium D (Smithfield) Cancelled Intel Tejas & Jayhawk Unicore (4GHz P4) IBM Power 6 Dual Core IBM Power 4 and 5 Dual Cores Since 2001 Intel Pentium Extreme 3.2GHz Dual Core Intel Yonah Dual Core Mobile AMD Opteron Dual Core Sun Olympus and Niagara 8 Processor Cores IBM Cell Scalable Multicore … 1H 2005 1H 2006 2H 20062H 20052H 2004 Unicores are on extinction Now all is multicore
  • 21. # of 1985 199019801970 1975 1995 2000 2005 Raw Cavium Octeon Raza XLR CSR-1 Intel Tflops Picochip PC102 Cisco Niagara Boardcom 1480 Xbox360 2010 2 1 8 4 32 cores 16 128 64 512 256 Cell Opteron 4P Xeon MP Ambric AM2045 4004 8008 80868080 286 386 486 Pentium PA-8800 Opteron Tanglewood Power4 PExtreme Power6 Yonah P2 P3 Itanium P4 Athlon Itanium 2 Multicores Future Raul Goycoolea S. Multiprocessor Programming 2116 February 2012
  • 22. <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 2216 February 2012
  • 24. Traditionally, software has been written for serial computation: • To be run on a single computer having a single Central Processing Unit (CPU) • A problem is broken into a discrete series of instructions • Instructions are executed one after another • Only one instruction may execute at any moment in time What is Parallel Computing? Raul Goycoolea S. Multiprocessor Programming 2416 February 2012
  • 25. What is Parallel Computing? In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: • To be run using multiple CPUs • A problem is broken into discrete parts that can be solved concurrently • Each part is further broken down to a series of instructions • Instructions from each part execute simultaneously on different CPUs Raul Goycoolea S. Multiprocessor Programming 2516 February 2012
  • 26. Options in Parallel Computing? The compute resources might be: • A single computer with multiple processors; • An arbitrary number of computers connected by a network; • A combination of both. The computational problem should be able to: • Be broken apart into discrete pieces of work that can be solved simultaneously; • Execute multiple program instructions at any moment in time; • Be solved in less time with multiple compute resources than with a single compute resource. Raul Goycoolea S. Multiprocessor Programming 2616 February 2012
  • 27. 27
  • 28. The Real World is Massively Parallel • Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence. For example: • Galaxy formation • Planetary movement • Weather and ocean patterns • Tectonic plate drift Rush hour traffic • Automobile assembly line • Building a jet • Ordering a hamburger at the drive through. Raul Goycoolea S. Multiprocessor Programming 2816 February 2012
  • 29. Architecture Concepts Von Neumann Architecture • Named after the Hungarian mathematician John von Neumann who first authored the general requirements for an electronic computer in his 1945 papers • Since then, virtually all computers have followed this basic design, differing from earlier computers which were programmed through "hard wiring” • Comprised of four main components: • Memory • Control Unit • Arithmetic Logic Unit • Input/Output • Read/write, random access memory is used to store both program instructions and data • Program instructions are coded data which tell the computer to do something • Data is simply information to be used by the program • Control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task. • Aritmetic Unit performs basic arithmetic operations • Input/Output is the interface to the human operator Raul Goycoolea S. Multiprocessor Programming 2916 February 2012
  • 30. Flynn’s Taxonomy • There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy. • Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple. • The matrix below defines the 4 possible classifications according to Flynn: Raul Goycoolea S. Multiprocessor Programming 3016 February 2012
  • 31. Single Instruction, Single Data (SISD): • A serial (non-parallel) computer • Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle • Single Data: Only one data stream is being used as input during any one clock cycle • Deterministic execution • This is the oldest and even today, the most common type of computer • Examples: older generation mainframes, minicomputers and workstations; most modern day PCs. Raul Goycoolea S. Multiprocessor Programming 3116 February 2012
  • 32. Single Instruction, Single Data (SISD): Raul Goycoolea S. Multiprocessor Programming 3216 February 2012
  • 33. Single Instruction, Multiple Data (SIMD): • A type of parallel computer • Single Instruction: All processing units execute the same instruction at any given clock cycle • Multiple Data: Each processing unit can operate on a different data element • Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing. • Synchronous (lockstep) and deterministic execution • Two varieties: Processor Arrays and Vector Pipelines • Examples: • Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV • Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10 • Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD instructions and execution units. Raul Goycoolea S. Multiprocessor Programming 3316 February 2012
  • 34. Single Instruction, Multiple Data (SIMD): ILLIAC IV MasPar TM CM-2 Cell GPU Cray X-MP Cray Y-MP Raul Goycoolea S. Multiprocessor Programming 3416 February 2012
  • 35. • A type of parallel computer • Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams. • Single Data: A single data stream is fed into multiple processing units. • Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971). • Some conceivable uses might be: • multiple frequency filters operating on a single signal stream • multiple cryptography algorithms attempting to crack a single coded message. Multiple Instruction, Single Data (MISD): Raul Goycoolea S. Multiprocessor Programming 3516 February 2012
  • 36. Multiple Instruction, Single Data (MISD): Raul Goycoolea S. Multiprocessor Programming 3616 February 2012
  • 37. • A type of parallel computer • Multiple Instruction: Every processor may be executing a different instruction stream • Multiple Data: Every processor may be working with a different data stream • Execution can be synchronous or asynchronous, deterministic or non-deterministic • Currently, the most common type of parallel computer - most modern supercomputers fall into this category. • Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs. Note: many MIMD architectures also include SIMD execution sub-components Multiple Instruction, Multiple Data (MIMD): Raul Goycoolea S. Multiprocessor Programming 3716 February 2012
  • 38. Multiple Instruction, Multiple Data (MIMD): Raul Goycoolea S. Multiprocessor Programming 3816 February 2012
  • 39. Multiple Instruction, Multiple Data (MIMD): IBM Power HP Alphaserver Intel IA32/x64 Oracle SPARC Cray XT3 Oracle Exadata/Exalogic Raul Goycoolea S. Multiprocessor Programming 3916 February 2012
  • 40. Parallel Computer Memory Architecture Shared Memory Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space. Multiple processors can operate independently but share the same memory resources. Changes in a memory location effected by one processor are visible to all other processors. Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA. Uniform Memory Access (UMA): • Most commonly represented today by Symmetric Multiprocessor (SMP) machines • Identical processors Non-Uniform Memory Access (NUMA): • Often made by physically linking two or more SMPs • One SMP can directly access memory of another SMP 40 Raul Goycoolea S. Multiprocessor Programming 4016 February 2012
  • 41. Parallel Computer Memory Architecture Shared Memory 41 Shared Memory (UMA) Shared Memory (NUMA) Raul Goycoolea S. Multiprocessor Programming 4116 February 2012
  • 42. Basic structure of a centralized shared-memory multiprocessor Processor Processor Processor Processor One or more levels of Cache One or more levels of Cache One or more levels of Cache One or more levels of Cache Multiple processor-cache subsystems share the same physical memory, typically connected by a bus. In larger designs, multiple buses, or even a switch may be used, but the key architectural property: uniform access time o all memory from all processors remains. Raul Goycoolea S. Multiprocessor Programming 4216 February 2012
  • 43. Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Interconnection Network Basic Architecture of a Distributed Multiprocessor Consists of individual nodes containing a processor, some memory, typically some I/O, and an interface to an interconnection network that connects all the nodes. Individual nodes may contain a small number of processors, which may be interconnected by a small bus or a different interconnection technology, which is less scalable than the global interconnection network. Raul Goycoolea S. Multiprocessor Programming 4316 February 2012
  • 44. Communication how do parallel operations communicate data results? Synchronization how are parallel operations coordinated? Resource Management how are a large number of parallel tasks scheduled onto finite hardware? Scalability how large a machine can be built? Issues in Parallel Machine Design Raul Goycoolea S. Multiprocessor Programming 4416 February 2012
  • 45. <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 4516 February 2012
  • 47. ExplicitImplicit Hardware Compiler Superscalar Processors Explicitly Parallel Architectures Implicit vs. Explicit Parallelism Raul Goycoolea S. Multiprocessor Programming 4716 February 2012
  • 48. Implicit Parallelism: Superscalar Processors Explicit Parallelism Shared Instruction Processors Shared Sequencer Processors Shared Network Processors Shared Memory Processors Multicore Processors Outline Raul Goycoolea S. Multiprocessor Programming 4816 February 2012
  • 49. Issue varying numbers of instructions per clock statically scheduled – – using compiler techniques in-order execution dynamically scheduled – – – – – Extracting ILP by examining 100‟s of instructions Scheduling them in parallel as operands become available Rename registers to eliminate anti dependences out-of-order execution Speculative execution Implicit Parallelism: Superscalar Processors Raul Goycoolea S. Multiprocessor Programming 4916 February 2012
  • 50. Instruction i IF ID EX WB IF ID EX WB IF ID EX WB IF ID EX WB IF ID EX WB Instruction i+1 Instruction i+2 Instruction i+3 Instruction i+4 Instruction # 1 2 3 4 5 6 7 8 IF: Instruction fetch EX : Execution Cycles ID : Instruction decode WB : Write back Pipelining Execution Raul Goycoolea S. Multiprocessor Programming 5016 February 2012
  • 51. Instruction type 1 2 3 4 5 6 7 Cycles Integer Floating point IF IF ID ID EX EX WB WB Integer Floating point Integer Floating point Integer Floating point IF IF ID ID EX EX WB WB IF IF ID ID EX EX WB WB IF IF ID ID EX EX WB WB 2-issue super-scalar machine Super-Scalar Execution Raul Goycoolea S. Multiprocessor Programming 5116 February 2012
  • 52. Intrinsic data dependent (aka true dependence) on Instructions: I: add r1,r2,r3 J: sub r4,r1,r3 If two instructions are data dependent, they cannot execute simultaneously, be completely overlapped or execute in out-of- order If data dependence caused a hazard in pipeline, called a Read After Write (RAW) hazard Data Dependence and Hazards Raul Goycoolea S. Multiprocessor Programming 5216 February 2012
  • 53. HW/SW must preserve program order: order instructions would execute in if executed sequentially as determined by original source program Dependences are a property of programs Importance of the data dependencies 1) indicates the possibility of a hazard 2) determines order in which results must be calculated 3) sets an upper bound on how much parallelism can possibly be exploited Goal: exploit parallelism by preserving program order only where it affects the outcome of the program ILP and Data Dependencies, Hazards Raul Goycoolea S. Multiprocessor Programming 5316 February 2012
  • 54. Name dependence: when 2 instructions use same register or memory location, called a name, but no flow of data between the instructions associated with that name; 2 versions of name dependence InstrJ writes operand before InstrIreads it I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1” If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard Name Dependence #1: Anti-dependece Raul Goycoolea S. Multiprocessor Programming 5416 February 2012
  • 55. Instruction writes operand before InstrIwrites it. I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an “output dependence” by compiler writers. This also results from the reuse of name “r1” If anti-dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard Instructions involved in a name dependence can execute simultaneously if name used in instructions is changed so instructions do not conflict Register renaming resolves name dependence for registers Renaming can be done either by compiler or by HW Name Dependence #1: Output Dependence Raul Goycoolea S. Multiprocessor Programming 5516 February 2012
  • 56. Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order if p1 { S1; }; if p2 { S2; } S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. Control dependence need not be preserved willing to execute instructions that should not have been executed, thereby violating the control dependences, if can do so without affecting correctness of the program Speculative Execution Control Dependencies Raul Goycoolea S. Multiprocessor Programming 5616 February 2012
  • 57. Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correct Speculation ⇒ fetch, issue, and execute instructions as if branch predictions were always correct Dynamic scheduling ⇒ only fetches and issues instructions Essentially a data flow execution model: Operations execute as soon as their operands are available Speculation Raul Goycoolea S. Multiprocessor Programming 5716 February 2012
  • 58. Different predictors Branch Prediction Value Prediction Prefetching (memory access pattern prediction) Inefficient Predictions can go wrong Has to flush out wrongly predicted data While not impacting performance, it consumes power Speculation in Rampant in Modern Superscalars Raul Goycoolea S. Multiprocessor Programming 5816 February 2012
  • 59. Implicit Parallelism: Superscalar Processors Explicit Parallelism Shared Instruction Processors Shared Sequencer Processors Shared Network Processors Shared Memory Processors Multicore Processors Outline Raul Goycoolea S. Multiprocessor Programming 5916 February 2012
  • 60. Parallelism is exposed to software Compiler or Programmer Many different forms Loosely coupled Multiprocessors to tightly coupled VLIW Explicit Parallel Processors Raul Goycoolea S. Multiprocessor Programming 6016 February 2012
  • 61. Throughput per Cycle One Operation Latency in Cycles Parallelism = Throughput * Latency To maintain throughput T/cycle when each operation has latency L cycles, need T*L independent operations For fixed parallelism: decreased latency allows increased throughput decreased throughput allows increased latency tolerance Little’s Law Raul Goycoolea S. Multiprocessor Programming 6116 February 2012
  • 62. Time Time Time Time Data-Level Parallelism (DLP) Instruction-Level Parallelism (ILP) Pipelining Thread-Level Parallelism (TLP) Types of Software Parallelism Raul Goycoolea S. Multiprocessor Programming 6216 February 2012
  • 64. What is a sequential program? A single thread of control that executes one instruction and when it is finished execute the next logical instruction What is a concurrent program? A collection of autonomous sequential threads, executing (logically) in parallel The implementation (i.e. execution) of a collection of threads can be: Multiprogramming – Threads multiplex their executions on a single processor. Multiprocessing – Threads multiplex their executions on a multiprocessor or a multicore system Distributed Processing – Processes multiplex their executions on several different machines What is concurrency? Raul Goycoolea S. Multiprocessor Programming 6416 February 2012
  • 65. Concurrency is not (only) parallelism Interleaved Concurrency Logically simultaneous processing Interleaved execution on a single processor Parallelism Physically simultaneous processing Requires a multiprocessors or a multicore system Time Time A B C A B C Concurrency and Parallelism Raul Goycoolea S. Multiprocessor Programming 6516 February 2012
  • 66. There are a lot of ways to use Concurrency in Programming Semaphores Blocking & non-blocking queues Concurrent hash maps Copy-on-write arrays Exchangers Barriers Futures Thread pool support Other Types of Synchronization Raul Goycoolea S. Multiprocessor Programming 6616 February 2012
  • 67. Deadlock Two or more threads stop and wait for each other Livelock Two or more threads continue to execute, but make no progress toward the ultimate goal Starvation Some thread gets deferred forever Lack of fairness Each thread gets a turn to make progress Race Condition Some possible interleaving of threads results in an undesired computation result Potential Concurrency Problems Raul Goycoolea S. Multiprocessor Programming 6716 February 2012
  • 68. Concurrency and Parallelism are important concepts in Computer Science Concurrency can simplify programming However it can be very hard to understand and debug concurrent programs Parallelism is critical for high performance From Supercomputers in national labs to Multicores and GPUs on your desktop Concurrency is the basis for writing parallel programs Next Lecture: How to write a Parallel Program Parallelism Conclusions Raul Goycoolea S. Multiprocessor Programming 6816 February 2012
  • 69. Shared memory – – – – Ex: Intel Core 2 Duo/Quad One copy of data shared among many cores Atomicity, locking and synchronization essential for correctness Many scalability issues Distributed memory – – – – Ex: Cell Cores primarily access local memory Explicit data exchange between cores Data distribution and communication orchestration is essential for performance P1 P2 P3 Pn Memory Interconnection Network Interconnection Network P1 P2 P3 Pn M1 M2 M3 Mn Two primary patterns of multicore architecture design Architecture Recap Raul Goycoolea S. Multiprocessor Programming 6916 February 2012
  • 70. Processor 1…n ask for X There is only one place to look Communication through shared variables Race conditions possible Use synchronization to protect from conflicts Change how data is stored to minimize synchronization P1 P2 P3 Pn Memory x Interconnection Network Programming Shared Memory Processors Raul Goycoolea S. Multiprocessor Programming 7016 February 2012
  • 71. Data parallel Perform same computation but operate on different data A single process can fork multiple concurrent threads Each thread encapsulate its own execution path Each thread has local state and shared resources Threads communicate through shared resources such as global memory for (i = 0; i < 12; i++) C[i] = A[i] + B[i]; i=0 i=1 i=2 i=3 i=8 i=9 i = 10 i = 11 i=4 i=5 i=6 i=7 join (barrier) fork (threads) Example of Parallelization Raul Goycoolea S. Multiprocessor Programming 7116 February 2012
  • 72. int A[12] = {...}; int B[12] = {...}; int C[12]; void add_arrays(int start) { int i; for (i = start; i < start + 4; i++) C[i] = A[i] + B[i]; } int main (int argc, char *argv[]) { pthread_t threads_ids[3]; int rc, t; for(t = 0; t < 4; t++) { rc = pthread_create(&thread_ids[t], NULL /* attributes */, add_arrays /* function */, t * 4 /* args to function */); } pthread_exit(NULL); } join (barrier) i=0 i=1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9 i = 10 i = 11 fork (threads) Example Parallelization with Threads Raul Goycoolea S. Multiprocessor Programming 7216 February 2012
  • 73. Data parallelism Perform same computation but operate on different data Control parallelism Perform different functions fork (threads) join (barrier) pthread_create(/* thread id */, /* attributes */, /* /* any function args to function */, */); Types of Parallelism Raul Goycoolea S. Multiprocessor Programming 7316 February 2012
  • 74. Uniform Memory Access (UMA) Centrally located memory All processors are equidistant (access times) Non-Uniform Access (NUMA) Physically partitioned but accessible by all Processors have the same address space Placement of data affects performance Memory Access Latency in Shared Memory Architectures Raul Goycoolea S. Multiprocessor Programming 7416 February 2012
  • 75. Coverage or extent of parallelism in algorithm Granularity of data partitioning among processors Locality of computation and communication … so how do I parallelize my program? Summary of Parallel Performance Factors Raul Goycoolea S. Multiprocessor Programming 7516 February 2012
  • 76. <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 7616 February 2012
  • 78. P0 Tasks Processes Processors P1 P2 P3 p0 p1 p2 p3 p0 p1 p2 p3 Partitioning Sequential computation Parallel program d e c o m p o s i t i o n a s s i g n m e n t o r c h e s t r a t i o n m a p p i n g Common Steps to Create a Parallel Program
  • 79. Identify concurrency and decide at what level to exploit it Break up computation into tasks to be divided among processes Tasks may become available dynamically Number of tasks may vary with time Enough tasks to keep processors busy Number of tasks available at a time is upper bound on achievable speedup Decomposition (Amdahl’s Law)
  • 80. Specify mechanism to divide work among core Balance work and reduce communication Structured approaches usually work well Code inspection or understanding of application Well-known design patterns As programmers, we worry about partitioning first Independent of architecture or programming model But complexity often affect decisions! Granularity
  • 81. Computation and communication concurrency Preserve locality of data Schedule tasks to satisfy dependences early Orchestration and Mapping
  • 82. Provides a cookbook to systematically guide programmers Decompose, Assign, Orchestrate, Map Can lead to high quality solutions in some domains Provide common vocabulary to the programming community Each pattern has a name, providing a vocabulary for discussing solutions Helps with software reusability, malleability, and modularity Written in prescribed format to allow the reader to quickly understand the solution and its context Otherwise, too difficult for programmers, and software will not fully exploit parallel hardware Parallel Programming by Pattern
  • 83. Berkeley architecture professor Christopher Alexander In 1977, patterns for city planning, landscaping, and architecture in an attempt to capture principles for “living” design History
  • 85. Design Patterns: Elements of Reusable Object- Oriented Software (1995) Gang of Four (GOF): Gamma, Helm, Johnson, Vlissides Catalogue of patterns Creation, structural, behavioral Patterns in Object-Oriented Programming
  • 86. Algorithm Expression Finding Concurrency Expose concurrent tasks Algorithm Structure Map tasks to processes to exploit parallel architecture 4 Design Spaces Software Construction Supporting Structures Code and data structuring patterns Implementation Mechanisms Low level mechanisms used to write parallel programs Patterns for Parallel Programming. Mattson, Sanders, and Massingill (2005). Patterns for Parallelizing Programs
  • 87. split frequency encoded macroblocks ZigZag IQuantization IDCT Saturation spatially encoded macroblocks differentially coded motion vectors Motion Vector Decode Repeat motion vectors MPEG bit stream VLD macroblocks, motion vectors MPEG Decoder join Motion Compensation recovered picture Picture Reorder Color Conversion Display Here’s my algorithm, Where’s the concurrency?
  • 88. Task decomposition Independent coarse-grained computation Inherent to algorithm Sequence of statements (instructions) that operate together as a group Corresponds to some logical part of program Usually follows from the way programmer thinks about a problem join motion vectorsspatially encoded macroblocks IDCT Saturation MPEG Decoder frequency encoded macroblocks ZigZag IQuantization MPEG bit stream VLD macroblocks, motion vectors split differentially coded motion vectors Motion Vector Decode Repeat Motion Compensation recovered picture Picture Reorder Color Conversion Display Here’s my algorithm, Where’s the concurrency?
  • 89. join motion vectors Saturation spatially encoded macroblocks MPEG Decoder frequency encoded macroblocks ZigZag IQuantization IDCT Motion Compensation recovered picture Picture Reorder Color Conversion Display MPEG bit stream VLD macroblocks, motion vectors split differentially coded motion vectors Motion Vector Decode Repeat Task decomposition Parallelism in the application Data decomposition Same computation is applied to small data chunks derived from large data set Here’s my algorithm, Where’s the concurrency?
  • 90. motion vectorsspatially encoded macroblocks MPEG Decoder frequency encoded macroblocks ZigZag IQuantization IDCT Saturation join Motion Compensation recovered picture Picture Reorder Color Conversion Display MPEG bit stream VLD macroblocks, motion vectors split differentially coded motion vectors Motion Vector Decode Repeat Task decomposition Parallelism in the application Data decomposition Same computation many data Pipeline decomposition Data assembly lines Producer-consumer chains Here’s my algorithm, Where’s the concurrency?
  • 91. Algorithms start with a good understanding of the problem being solved Programs often naturally decompose into tasks Two common decompositions are – – Function calls and Distinct loop iterations Easier to start with many tasks and later fuse them, rather than too few tasks and later try to split them Guidelines for Task Decomposition
  • 92. Flexibility Program design should afford flexibility in the number and size of tasks generated – – Tasks should not tied to a specific architecture Fixed tasks vs. Parameterized tasks Efficiency Tasks should have enough work to amortize the cost of creating and managing them Tasks should be sufficiently independent so that managing dependencies doesn‟t become the bottleneck Simplicity The code has to remain readable and easy to understand, and debug Guidelines for Task Decomposition
  • 93. Data decomposition is often implied by task decomposition Programmers need to address task and data decomposition to create a parallel program Which decomposition to start with? Data decomposition is a good starting point when Main computation is organized around manipulation of a large data structure Similar operations are applied to different parts of the data structure Guidelines for Data Decomposition Raul Goycoolea S. Multiprocessor Programming 9316 February 2012
  • 94. Array data structures Decomposition of arrays along rows, columns, blocks Recursive data structures Example: decomposition of trees into sub-trees problem compute subproblem compute subproblem compute subproblem compute subproblem merge subproblem merge subproblem merge solution subproblem split subproblem split split Common Data Decompositions Raul Goycoolea S. Multiprocessor Programming 9416 February 2012
  • 95. Flexibility Size and number of data chunks should support a wide range of executions Efficiency Data chunks should generate comparable amounts of work (for load balancing) Simplicity Complex data compositions can get difficult to manage and debug Raul Goycoolea S. Multiprocessor Programming 9516 February 2012 Guidelines for Data Decompositions
  • 96. Data is flowing through a sequence of stages Assembly line is a good analogy What’s a prime example of pipeline decomposition in computer architecture? Instruction pipeline in modern CPUs What’s an example pipeline you may use in your UNIX shell? Pipes in UNIX: cat foobar.c | grep bar | wc Other examples Signal processing Graphics ZigZag IQuantization IDCT Saturation Guidelines for Pipeline Decomposition Raul Goycoolea S. Multiprocessor Programming 9616 February 2012
  • 97. <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 9716 February 2012
  • 99. Review: Keys to Parallel Performance: Coverage, or extent of parallelism in the algorithm (Amdahl's Law); granularity of partitioning among processors (communication cost and load balancing); locality of computation and communication (communication between processors, or between processors and their memories).
  • 100. Communication Cost Model: C = f * (o + l + n/(m*B) + tc - overlap), where f = frequency of messages, o = overhead per message (at both ends), l = network delay per message, n = total data sent, m = number of messages, B = bandwidth along path (determined by network), tc = cost induced by contention per message, and overlap = amount of latency hidden by concurrency with computation.
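To make the reconstructed cost expression concrete, here is a small hedged helper in C; the parameter names are mine, and the units only need to be mutually consistent (e.g. seconds and bytes).

    /* Illustrative only: total communication time under the linear cost
       model above. */
    double comm_cost(double f,        /* frequency (number) of messages      */
                     double o,        /* overhead per message, at both ends  */
                     double l,        /* network delay per message           */
                     double n,        /* total data sent                     */
                     double m,        /* number of messages                  */
                     double B,        /* bandwidth along the path            */
                     double tc,       /* contention cost per message         */
                     double overlap)  /* latency hidden by computation       */
    {
        return f * (o + l + n / (m * B) + tc - overlap);
    }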
  • 101. Overlapping Communication with Computation (timeline diagram: alternating Get Data and Compute phases up to a synchronization point; the CPU is idle during Get Data and memory is idle during Compute, so overlapping the next Get Data with the current Compute hides communication latency).
  • 102. Limits in Pipelining Communication: the computation-to-communication ratio limits the performance gains from pipelining (timeline diagram: Get Data / Compute / Get Data / Compute). Where else to look for performance?
  • 103. Determined by program implementation and interactions with the architecture Examples: Poor distribution of data across distributed memories Unnecessarily fetching data that is not used Redundant data fetches Artifactual Communication
  • 104. Lessons From Uniprocessors: In uniprocessors, the CPU communicates with memory. Loads and stores are to uniprocessors as "get" and "put" are to distributed memory multiprocessors. How is communication overlap enhanced in uniprocessors? Through spatial locality and temporal locality.
  • 105. CPU asks for data at address 1000 Memory sends data at address 1000 … 1064 Amount of data sent depends on architecture parameters such as the cache block size Works well if CPU actually ends up using data from 1001, 1002, …, 1064 Otherwise wasted bandwidth and cache capacity Spatial Locality
  • 106. Main memory access is expensive Memory hierarchy adds small but fast memories (caches) near the CPU Memories get bigger as distance from CPU increases CPU asks for data at address 1000 Memory hierarchy anticipates more accesses to same address and stores a local copy Works well if CPU actually ends up using data from 1000 over and over and over … Otherwise wasted cache capacity main memory cache (level 2) cache (level 1) Temporal Locality
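As a small illustration of spatial locality (my own sketch, not from the slides): traversing a C matrix in row-major order uses every byte of each fetched cache line, while traversing the same data column by column strides across lines and wastes most of the fetched bandwidth and cache capacity.

    #include <stdio.h>

    #define ROWS 2048
    #define COLS 2048

    static double m[ROWS][COLS];

    /* Good spatial locality: consecutive accesses touch consecutive
       addresses, so each fetched cache line is fully used. */
    static double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                s += m[i][j];
        return s;
    }

    /* Poor spatial locality: consecutive accesses are COLS * 8 bytes
       apart, so most of each fetched line (and of the cache) is wasted. */
    static double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                s += m[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }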
  • 107. Data is transferred in chunks to amortize communication cost Cell: DMA gets up to 16K Usually get a contiguous chunk of memory Spatial locality Computation should exhibit good spatial locality characteristics Temporal locality Reorder computation to maximize use of data fetched Reducing Artifactual Costs in Distributed Memory Architectures
  • 108. Single Thread Performance: Tasks are mapped to execution units (threads); threads run on individual processors (cores). (Diagram: sequential and parallel phases alternate; the finish line is the sequential time plus the longest parallel time.) Two keys to faster execution: load balance the work among the processors, and make execution on each processor faster.
  • 109. Understanding Performance: We need some way of measuring performance. Coarse-grained measurements:
% gcc sample.c
% time a.out
2.312u 0.062s 0:02.50 94.8%
% gcc sample.c -O3
% time a.out
1.921u 0.093s 0:02.03 99.0%
... but did we learn much about what's going on? The measured program (the slide marks where to record the start and stop times around the region of interest):
#define N (1 << 23)
#define T (10)
#include <stdio.h>
#include <string.h>
double a[N], b[N];
void cleara(double a[N]) {
  int i;
  for (i = 0; i < N; i++) {
    a[i] = 0;
  }
}
int main() {
  double s = 0, s2 = 0;
  int i, j;
  for (j = 0; j < T; j++) {
    for (i = 0; i < N; i++) {
      b[i] = 0;
    }
    cleara(a);
    memset(a, 0, sizeof(a));
    for (i = 0; i < N; i++) {
      s += a[i] * b[i];
      s2 += a[i] * a[i] + b[i] * b[i];
    }
  }
  printf("s %f s2 %f\n", s, s2);
  return 0;
}
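For finer-grained "record start time / record stop time" measurements than the shell's time command, one simple option (a sketch of mine, not the course's tooling) is the POSIX clock_gettime interface; the placeholder function below stands in for whatever region is being measured.

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    /* Wall-clock seconds since an arbitrary, monotonic origin. */
    static double now_seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
    }

    /* Placeholder for the code being measured. */
    static void region_of_interest(void) {
        volatile double s = 0.0;
        for (int i = 0; i < 10000000; i++)
            s += i * 0.5;
    }

    int main(void) {
        double start = now_seconds();   /* record start time */
        region_of_interest();
        double stop = now_seconds();    /* record stop time  */
        printf("elapsed: %.6f s\n", stop - start);
        return 0;
    }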
  • 110. Measurements Using Counters: It is increasingly possible to get accurate measurements using performance counters, special registers in the hardware that count events. Insert code to clear/start, read, and stop the counter; measure exactly what you want, anywhere you want; can measure communication and computation duration. But this requires manual changes, monitoring nested scopes is an issue, and there is a Heisenberg effect: the counters can perturb execution time.
  • 111. Event-based profiling Interrupt execution when an event counter reaches a threshold Time-based profiling Interrupt execution every t seconds Works without modifying your code Does not require that you know where problem might be Supports multiple languages and programming models Quite efficient for appropriate sampling frequencies Dynamic Profiling
  • 112. Cycles (clock ticks) Pipeline stalls Cache hits Cache misses Number of instructions Number of loads Number of stores Number of floating point operations … Counter Examples
  • 113. Useful Derived Measurements: Processor utilization = Cycles / Wall Clock Time. Instructions per cycle = Instructions / Cycles. Instructions per memory operation = Instructions / (Loads + Stores). Average number of instructions per load miss = Instructions / L1 Load Misses. Memory traffic = (Loads + Stores) * Lk Cache Line Size. Bandwidth consumed = (Loads + Stores) * Lk Cache Line Size / Wall Clock Time. Many others: cache miss rate, branch misprediction rate, ...
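As a trivial, hedged sketch of how such derived metrics fall out of the raw counts (the struct and field names are mine, purely illustrative):

    /* Illustrative only: derive a few of the metrics above from raw counts. */
    struct counters {
        double cycles;
        double instructions;
        double loads, stores;
        double l1_load_misses;
        double line_size;        /* cache line size in bytes */
        double wall_clock_secs;  /* elapsed wall-clock time  */
    };

    double instructions_per_cycle(const struct counters *c) {
        return c->instructions / c->cycles;
    }

    double instructions_per_memory_op(const struct counters *c) {
        return c->instructions / (c->loads + c->stores);
    }

    double instructions_per_load_miss(const struct counters *c) {
        return c->instructions / c->l1_load_misses;
    }

    double memory_traffic_bytes(const struct counters *c) {
        return (c->loads + c->stores) * c->line_size;
    }

    double bandwidth_consumed(const struct counters *c) {
        return memory_traffic_bytes(c) / c->wall_clock_secs;
    }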
  • 115. Popular Runtime Profiling Tools: GNU gprof: widely available with UNIX/Linux distributions; gcc -O2 -pg foo.c -o foo; ./foo; gprof foo. HPC Toolkit: http://www.hipersoft.rice.edu/hpctoolkit/ PAPI: http://icl.cs.utk.edu/papi/ VTune: http://www.intel.com/cd/software/products/asmo-na/eng/vtune/ Many others.
  • 116. Performance in Uniprocessors: time = compute + wait. Instruction level parallelism: multiple functional units, deeply pipelined, speculation, ... Data level parallelism: SIMD (Single Instruction, Multiple Data), short vector instructions (multimedia extensions); the hardware is simpler (no heavily ported register files), instructions are more compact, and instruction fetch bandwidth is reduced. Complex memory hierarchies: multiple levels of cache, many outstanding misses, prefetching, ...
  • 117. Single Instruction, Multiple Data SIMD registers hold short vectors Instruction operates on all elements in SIMD register at once a b c Vector code for (int i = 0; i < n; i += 4) { c[i:i+3] = a[i:i+3] + b[i:i+3] } SIMD register Scalar code for (int i = 0; i < n; i+=1) { c[i] = a[i] + b[i] } a b c scalar register Single Instruction, Multiple Data
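On x86 the vector pseudo-code above maps onto SSE intrinsics roughly as follows; this is a sketch that assumes float arrays and a length that is a multiple of four, and it uses unaligned loads/stores for simplicity.

    #include <xmmintrin.h>   /* SSE: 128-bit vectors holding four floats */

    /* c[i] = a[i] + b[i], four elements per instruction.
       Assumes n is a multiple of 4. */
    void vec_add(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
        }
    }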
  • 118. SIMD in Major Instruction Set Architectures (ISAs)
Instruction Set   Architecture   SIMD Width   Floating Point
AltiVec           PowerPC        128          yes
MMX/SSE           Intel          64/128       yes
3DNow!            AMD            64           yes
VIS               Sun            64           no
MAX2              HP             64           no
MVI               Alpha          64           no
MDMX              MIPS V         64           yes
For example, the Cell SPU has 128 128-bit registers, all instructions are SIMD instructions, and registers are treated as short vectors of 8/16/32-bit integers or single/double-precision floats.
  • 119. Using SIMD Instructions: Library calls and inline assembly are difficult to program and not portable. There are different extensions to the same ISA (MMX and SSE; SSE vs. 3DNow!; compiler support vs. the Oracle T4 crypto instructions).
  • 120. Tune the parallelism first Then tune performance on individual processors Modern processors are complex Need instruction level parallelism for performance Understanding performance requires a lot of probing Optimize for the memory hierarchy Memory is much slower than processors Multi-layer memory hierarchies try to hide the speed gap Data locality is essential for performance Programming for Performance
  • 121. May have to change everything! Algorithms, data structures, program structure Focus on the biggest performance impediments Too many issues to study everything Remember the law of diminishing returns Programming for Performance
  • 122. <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 12216 February 2012
  • 124. Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel Loops Communication Code Generation Compilers Outline Raul Goycoolea S. Multiprocessor Programming 12416 February 2012
  • 125. Types of Parallelism: Instruction Level Parallelism (ILP): exploited by scheduling and hardware. Task Level Parallelism (TLP): mainly by hand. Loop Level Parallelism (LLP) or Data Parallelism: by hand or compiler generated. Pipeline Parallelism: hardware or streaming. Divide and Conquer Parallelism: recursive functions. Raul Goycoolea S. Multiprocessor Programming 12516 February 2012
  • 126. 90% of the execution time in 10% of the code Mostly in loops If parallel, can get good performance Load balancing Relatively easy to analyze Why Loops? Raul Goycoolea S. Multiprocessor Programming 12616 February 2012
  • 127. FORALL No “loop carried dependences” Fully parallel FORACROSS Some “loop carried dependences” Programmer Defined Parallel Loop Raul Goycoolea S. Multiprocessor Programming 12716 February 2012
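In C, FORALL semantics correspond roughly to an OpenMP parallel loop; the sketch below is mine (compile with -fopenmp). A FORACROSS-style loop with loop-carried dependences would instead need ordered constructs or explicit synchronization.

    /* FORALL-style loop: no loop-carried dependences, so iterations may
       run in any order, and in parallel, without changing the result. */
    void forall_add_one(double *A, const double *B, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            A[i] = B[i] + 1.0;
    }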
  • 128. Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel Loops Communication Code Generation Outline Raul Goycoolea S. Multiprocessor Programming 12816 February 2012
  • 129. Finding FORALL Loops out of FOR loops Examples FOR I = 0 to 5 A[I+1] = A[I] + 1 FOR I = 0 to 5 A[I] = A[I+6] + 1 For I = 0 to 5 A[2*I] = A[2*I + 1] + 1 Parallelizing Compilers Raul Goycoolea S. Multiprocessor Programming 12916 February 2012
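Rewritten in C with my own dependence analysis as comments (hedged; the iteration bounds follow the slide): the point is which of the three loops a parallelizing compiler could legally convert to a FORALL.

    /* Assumes A has at least 12 elements, so all accesses are in bounds. */
    void dependence_examples(double A[12]) {
        /* Loop 1: iteration I writes A[I+1], which iteration I+1 then
           reads: a loop-carried true dependence, so NOT a FORALL. */
        for (int I = 0; I <= 5; I++)
            A[I + 1] = A[I] + 1;

        /* Loop 2: writes touch A[0..5], reads touch A[6..11]; the sets
           never overlap within the iteration range, so there is no
           loop-carried dependence and the loop can run as a FORALL. */
        for (int I = 0; I <= 5; I++)
            A[I] = A[I + 6] + 1;

        /* Loop 3: writes touch even indices A[0,2,...,10], reads touch
           odd indices A[1,3,...,11]; again no overlap, so also a FORALL. */
        for (int I = 0; I <= 5; I++)
            A[2 * I] = A[2 * I + 1] + 1;
    }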
  • 130. Dependences: True dependence: a = ... followed by ... = a. Anti dependence: ... = a followed by a = .... Output dependence: a = ... followed by a = .... Definition: a data dependence exists for dynamic instances i and j iff either i or j is a write operation, i and j refer to the same variable, and i executes before j. How about array accesses within loops? Raul Goycoolea S. Multiprocessor Programming 13016 February 2012
  • 131. Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel Loops Communication Code Generation Outline Raul Goycoolea S. Multiprocessor Programming 13116 February 2012
  • 132. Array Access in a Loop: FOR I = 0 to 5: A[I] = A[I] + 1. (Diagram: the iteration space 0..5 is mapped onto the data space A[0]..A[12]; each iteration I reads and writes only element A[I].) Raul Goycoolea S. Multiprocessor Programming 13216 February 2012
  • 133. Recognizing FORALL Loops: Find the data dependences in the loop. For every pair of array accesses to the same array: if the first access has at least one dynamic instance (an iteration) in which it refers to a location in the array that the second access also refers to in at least one of the later dynamic instances (iterations), then there is a data dependence between the statements. (Note that the same array access can refer to itself: output dependences.) Definition: a loop-carried dependence is a dependence that crosses a loop boundary. If there are no loop-carried dependences, the loop is parallelizable. Raul Goycoolea S. Multiprocessor Programming 13316 February 2012
  • 134. What is the Dependence? Example 1: FOR I = 1 to n, FOR J = 1 to n: A[I, J] = A[I-1, J+1] + 1. Example 2: FOR I = 1 to n, FOR J = 1 to n: A[I] = A[I-1] + 1. (Iteration space diagrams over I and J.) Raul Goycoolea S. Multiprocessor Programming 13416 February 2012
  • 135. Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel Loops Communication Code Generation Outline Raul Goycoolea S. Multiprocessor Programming 13516 February 2012
  • 136. Scalar Privatization Reduction Recognition Induction Variable Identification Array Privatization Interprocedural Parallelization Loop Transformations Granularity of Parallelism Increasing Parallelization Opportunities Raul Goycoolea S. Multiprocessor Programming 13616 February 2012
  • 137. Example FOR i = 1 to n X = A[i] * 3; B[i] = X; Is there a loop carried dependence? What is the type of dependence? Scalar Privatization Raul Goycoolea S. Multiprocessor Programming 13716 February 2012
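In OpenMP terms (my sketch, not the slide's notation), scalar privatization is the private clause: each thread gets its own copy of X, which removes the loop-carried anti and output dependences on the shared temporary while leaving the flow of values within each iteration untouched. Compile with -fopenmp.

    /* The shared temporary X would otherwise carry anti/output dependences
       between iterations; privatizing it makes the loop fully parallel. */
    void scale_copy(const double *A, double *B, int n) {
        double X;
        #pragma omp parallel for private(X)
        for (int i = 0; i < n; i++) {
            X = A[i] * 3.0;
            B[i] = X;
        }
    }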
  • 138. Reduction Analysis: Only associative operations The result is never used within the loop Transformation Integer Xtmp[NUMPROC]; Barrier(); FOR i = myPid*Iters to MIN((myPid+1)*Iters, n) Xtmp[myPid] = Xtmp[myPid] + A[i]; Barrier(); If(myPid == 0) { FOR p = 0 to NUMPROC-1 X = X + Xtmp[p]; … Reduction Recognition Raul Goycoolea S. Multiprocessor Programming 13816 February 2012
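The hand-written Xtmp[] transformation on the slide is essentially what OpenMP's reduction clause generates; a hedged equivalent in C (compile with -fopenmp):

    /* Each thread accumulates a private partial sum; the partial sums are
       combined with the associative + operator at the end of the loop. */
    double sum_reduce(const double *A, int n) {
        double X = 0.0;
        #pragma omp parallel for reduction(+:X)
        for (int i = 0; i < n; i++)
            X += A[i];
        return X;
    }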
  • 139. Example FOR i = 0 to N A[i] = 2^i; After strength reduction t = 1 FOR i = 0 to N A[i] = t; t = t*2; What happened to loop carried dependences? Need to do opposite of this! Perform induction variable analysis Rewrite IVs as a function of the loop variable Induction Variables Raul Goycoolea S. Multiprocessor Programming 13916 February 2012
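Illustrated in C (my own sketch): the strength-reduced form carries a dependence through t, while rewriting the induction variable as a closed-form function of i removes the dependence and leaves a FORALL-style loop (compile with -fopenmp, link with -lm).

    #include <math.h>

    /* Strength-reduced form: t = t * 2 makes every iteration depend on
       the previous one, so the loop must run serially. */
    void powers_serial(double *A, int N) {
        double t = 1.0;
        for (int i = 0; i <= N; i++) {
            A[i] = t;
            t = t * 2.0;
        }
    }

    /* After induction-variable analysis: t is rewritten as a function of
       the loop variable, the loop-carried dependence disappears, and the
       loop can run in parallel. */
    void powers_parallel(double *A, int N) {
        #pragma omp parallel for
        for (int i = 0; i <= N; i++)
            A[i] = pow(2.0, (double)i);
    }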
  • 140. Similar to scalar privatization However, analysis is more complex Array Data Dependence Analysis: Checks if two iterations access the same location Array Data Flow Analysis: Checks if two iterations access the same value Transformations Similar to scalar privatization Private copy for each processor or expand with an additional dimension Array Privatization Raul Goycoolea S. Multiprocessor Programming 14016 February 2012
  • 141. Interprocedural Parallelization: Function calls can make a loop unparallelizable, reducing the available parallelism even when there is a lot of inner-loop parallelism. Solutions: interprocedural analysis, inlining. Raul Goycoolea S. Multiprocessor Programming 14116 February 2012
  • 142. Cache Coherent Shared Memory Machine Generate code for the parallel loop nest No Cache Coherent Shared Memory or Distributed Memory Machines Generate code for the parallel loop nest Identify communication Generate communication code Communication Code Generation Raul Goycoolea S. Multiprocessor Programming 14216 February 2012
  • 143. Eliminating redundant communication Communication aggregation Multi-cast identification Local memory management Communication Optimizations Raul Goycoolea S. Multiprocessor Programming 14316 February 2012
  • 144. Automatic parallelization of loops with arrays Requires Data Dependence Analysis Iteration space & data space abstraction An integer programming problem Many optimizations that’ll increase parallelism Transforming loop nests and communication code generation Fourier-Motzkin Elimination provides a nice framework Summary Raul Goycoolea S. Multiprocessor Programming 14416 February 2012
  • 145. <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 14516 February 2012
  • 147. "I think there is a world market for maybe five computers.“ – Thomas Watson, chairman of IBM, 1949 "There is no reason in the world anyone would want a computer in their home. No reason.” – Ken Olsen, Chairman, DEC, 1977 "640K of RAM ought to be enough for anybody.” – Bill Gates, 1981 Predicting the Future is Always Risky Raul Goycoolea S. Multiprocessor Programming 14716 February 2012
  • 148. Evolution Relatively easy to predict Extrapolate the trends Revolution A completely new technology or solution Hard to Predict Paradigm Shifts can occur in both Future = Evolution + Revolution Raul Goycoolea S. Multiprocessor Programming 14816 February 2012
  • 149. Evolution Trends Architecture Languages, Compilers and Tools Revolution Crossing the Abstraction Boundaries Outline Raul Goycoolea S. Multiprocessor Programming 14916 February 2012
  • 150. Evolution: Look at the trends: Moore's Law, power consumption, wire delay, hardware complexity, parallelizing compilers, program design methodologies. Design drivers are different in different generations. Raul Goycoolea S. Multiprocessor Programming 15016 February 2012
  • 151. The Road to Multicore: Moore's Law (chart: number of transistors and performance relative to the VAX-11/780, 1978-2016, for the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2; performance grows at roughly 25%/year early on and 52%/year later. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.) Raul Goycoolea S. Multiprocessor Programming 15116 February 2012
  • 152. The Road to Multicore: Uniprocessor Performance (SPECint) (chart: SPECint2000 performance, 1985-2007, for the Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4 and Itanium, the Alpha 21064, 21164 and 21264, Sparc, SuperSparc and Sparc64, MIPS, HP PA, PowerPC, and AMD K6, K7 and x86-64.) Raul Goycoolea S. Multiprocessor Programming 15216 February 2012
  • 153. The Road to Multicore: Uniprocessor Performance (SPECint) General-purpose unicores have stopped historic performance scaling Power consumption Wire delays DRAM access latency Diminishing returns of more instruction-level parallelism Raul Goycoolea S. Multiprocessor Programming 15316 February 2012
  • 154. Power Consumption (watts) (chart: power consumption, 1985-2007, for the same processor families: Intel 386 through Pentium 4 and Itanium, Alpha 21064/21164/21264, Sparc/SuperSparc/Sparc64, MIPS, HP PA, PowerPC, and AMD K6/K7/x86-64.) Raul Goycoolea S. Multiprocessor Programming 15416 February 2012
  • 155. Power Efficiency (watts/spec) (chart: watts per SPEC, 1982-2006, for the same processor families.) Raul Goycoolea S. Multiprocessor Programming 15516 February 2012
  • 156. Range of a Wire in One Clock Cycle (chart: process size in microns, 1996-2014, against the reach of a wire at clock rates of 700 MHz, 1.25 GHz, 2.1 GHz, 6 GHz, 10 GHz and 13.5 GHz; 400 mm2 die; data from the SIA Roadmap.) Raul Goycoolea S. Multiprocessor Programming 15616 February 2012
  • 157. DRAM Access Latency (chart, 1980-2004: processor performance improves at about 60%/year, 2X every 1.5 years, while DRAM improves at about 9%/year, 2X every 10 years.) Access times are a speed-of-light issue. Memory technology is also changing: SRAM is getting harder to scale, and DRAM is no longer the cheapest cost/bit. Power efficiency is an issue here as well. Raul Goycoolea S. Multiprocessor Programming 15716 February 2012
  • 158. CPUs Architecture: Heat becoming an unmanageable problem (chart: power density in W/cm2, 1970s to 2010s, from the 4004, 8008, 8080, 8085, 8086, 286, 386 and 486 to the Pentium, trending from "hot plate" toward "nuclear reactor", "rocket nozzle" and "Sun's surface"; Intel Developer Forum, Spring 2004, Pat Gelsinger; Pentium at 90 W.) There is a cube relationship between cycle time and power. Raul Goycoolea S. Multiprocessor Programming 15816 February 2012
  • 159. Compiling for Instruction Level Parallelism (timeline, 1970-2010, of improvement in automatic parallelization): automatic parallelizing compilers for FORTRAN; vectorization technology; the prevalence of type-unsafe languages and complex data structures (C, C++); type-safe languages (Java, C#); demand driven by multicores? Raul Goycoolea S. Multiprocessor Programming 15916 February 2012
  • 160. Multicores Future (chart: cores per chip, 1970-2010, on a scale from 1 to 512 cores, covering the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, PA-8800, Power4, Power6, PExtreme, Yonah, Opteron, Xeon MP, Tanglewood, Opteron 4P, Cell, Xbox360, Niagara, Raw, Cavium Octeon, Raza XLR, Broadcom 1480, Cisco CSR-1, Intel Tflops, Picochip PC102 and Ambric AM2045.) Raul Goycoolea S. Multiprocessor Programming 16016 February 2012
  • 161. Evolution Trends Architecture Languages, Compilers and Tools Revolution Crossing the Abstraction Boundaries Outline Raul Goycoolea S. Multiprocessor Programming 16116 February 2012
  • 162. Novel Opportunities in Multicores: Don't have to contend with uniprocessors: the era of Moore's Law induced performance gains is over! Parallel programming will be required by the masses, not just a few supercomputer super-users. Raul Goycoolea S. Multiprocessor Programming 16216 February 2012
  • 163. Novel Opportunities in Multicores: Don't have to contend with uniprocessors: the era of Moore's Law induced performance gains is over! Parallel programming will be required by the masses, not just a few supercomputer super-users. And it is not your same old multiprocessor problem: how does going from multiprocessors to multicores impact programs? What changed? Where is the impact? Communication bandwidth and communication latency. Raul Goycoolea S. Multiprocessor Programming 16316 February 2012
  • 164. Communication Bandwidth: How much data can be communicated between two cores? What changed? Number of wires: IO is the true bottleneck, while on-chip wire density is very high. Clock rate: IO is slower than on-chip. Multiplexing: no sharing of pins. Impact on the programming model? Massive data exchange is possible, data movement is not the bottleneck, and processor affinity is not that important. (Roughly 32 Gigabits/sec off-chip versus ~300 Terabits/sec on-chip: about 10,000X.) Raul Goycoolea S. Multiprocessor Programming 16416 February 2012
  • 165. Communication Latency: How long does a round-trip communication take? What changed? Length of wire: very short wires are faster. Pipeline stages: no multiplexing, on-chip is much closer, bypass and speculation? Impact on the programming model? Ultra-fast synchronization; can run real-time apps on multiple cores. (~200 cycles off-chip versus ~4 cycles on-chip: about 50X.) Raul Goycoolea S. Multiprocessor Programming 16516 February 2012
  • 166. Past, Present and the Future? (Diagrams comparing a traditional multiprocessor, a basic multicore such as IBM Power, and an integrated multicore such as the 8-core, 8-thread Oracle T4, each drawn in terms of processing elements (PE), caches ($$) and memories.) Raul Goycoolea S. Multiprocessor Programming 16616 February 2012
  • 167. Summary • As technology evolves, the inherent flexibility of multiprocessors lets them adapt to new requirements • Processors can be used at any time for many kinds of applications • Optimization adapts processors to high-performance requirements Raul Goycoolea S. Multiprocessor Programming 16716 February 2012
  • 168. References • Author: Raul Goycoolea, Oracle Corporation. • A search on the WWW for "parallel programming" or "parallel computing" will yield a wide variety of information. • Recommended reading: • "Designing and Building Parallel Programs". Ian Foster. http://www-unix.mcs.anl.gov/dbpp/ • "Introduction to Parallel Computing". Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/ • "Overview of Recent Supercomputers". A.J. van der Steen, Jack Dongarra. www.phys.uu.nl/~steen/web03/overview.html • MIT Multicore Programming Class: 6.189, Prof. Saman Amarasinghe. • Photos/Graphics have been created by the author, obtained from non-copyrighted, government or public domain (such as http://commons.wikimedia.org/) sources, or used with the permission of authors from other presentations and web pages. 168
  • 169. <Insert Picture Here> Keep in Touch: Raul Goycoolea Seoane. Twitter: http://twitter.com/raul_goycoolea Facebook: http://www.facebook.com/raul.goycoolea Linkedin: http://www.linkedin.com/in/raulgoy Blog: http://blogs.oracle.com/raulgoy/ Raul Goycoolea S. Multiprocessor Programming 16916 February 2012
