Introduction to Parallel
Programming
Center for Institutional Research
Computing
Slides for the book "An Introduction to Parallel
Programming", by Peter Pacheco (available from the
publisher website: http://booksite.elsevier.com/9780123742605/)
Copyright © 2010, Elsevier Inc. All rights Reserved
1
Serial hardware and software
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
input
output
programs
Computer runs one
program at a time.
2
Why we need to write parallel
programs
• Running multiple instances of a serial program often isn't very useful.
• Think of running multiple instances of your favorite game.
• What you really want is for it to run faster.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
3
How do we write parallel programs?
• Task parallelism
– Partition the various tasks carried out in solving the problem among the cores.
• Data parallelism
– Partition the data used in solving the problem among the cores.
– Each core carries out similar operations on its part of the data (a minimal sketch follows this slide).
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
4
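To make the data-parallel idea concrete, here is a minimal serial sketch (an illustration of the concept, not code from this workshop): the data is split into blocks, each block could be handled by a different core, and the per-core partial sums are combined at the end. The array size and core count are arbitrary choices for the example.

#include <stdio.h>

#define N 300         /* e.g., 300 exams */
#define NUM_CORES 3   /* e.g., 3 graders/cores */

int main(void) {
    int work[N];
    for (int i = 0; i < N; i++) work[i] = 1;

    /* Data parallelism: each core gets its own block of the data
       and performs the same operation (summing) on that block. */
    int partial[NUM_CORES] = {0};
    int block = N / NUM_CORES;            /* assumes N divides evenly */
    for (int c = 0; c < NUM_CORES; c++)   /* conceptually, each c runs on a different core */
        for (int i = c * block; i < (c + 1) * block; i++)
            partial[c] += work[i];

    /* The partial sums are then combined into one result. */
    int total = 0;
    for (int c = 0; c < NUM_CORES; c++) total += partial[c];
    printf("total = %d\n", total);
    return 0;
}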
Professor P
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
15 questions
300 exams
5
Professor P's grading assistants
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
TA#1
TA#2 TA#3
6
Division of work –
data parallelism
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
TA#1
TA#2
TA#3
100 exams
100 exams
100 exams
7
Division of work –
task parallelism
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
TA#1
TA#2
TA#3
Questions 1 - 5
Questions 6 - 10
Questions 11 - 15
Questions 1 - 7
Questions 8 - 11
Questions 12 - 15
Partitioning strategy:
- either by number
- or by workload
or
or
or
8
Coordination
• Cores usually need to coordinate their work.
• Communication – one or more cores send their current partial sums to another core.
• Load balancing – share the work evenly among the cores so that one is not heavily loaded.
• Synchronization – because each core works at its own pace, make sure cores do not get too far ahead of the rest.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
9
What we'll be doing
• Learning to write programs that are explicitly parallel.
• Using the C language.
• Using the OpenMP extension to C (multi-threading for shared memory)
– Others you can investigate after this workshop:
• Message-Passing Interface (MPI)
• POSIX Threads (Pthreads)
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
10
Essential concepts
• Memory
• Process execution terminology
• Configuration of Kamiak
• Coding concepts for parallelism - Parallel program design
• Performance
• OpenMP
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
11
.c programs we will use
• loop.c
• sync.c
• sumcomp.c
• matrix_vector.c (for independent study)
• Step 1: log into Kamiak
• Step 2: run the command: training
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
12
Memory
• Two major classes of parallel programming models:
– Shared Memory
– Distributed Memory
13
Shared Memory Architecture
[Diagram: a compute node with several CPU cores, each with its own cache ($), connected by a memory bus to a shared memory (a single shared address space) and by an I/O bus to disk; threads run on the cores.]
All threads can see a single shared address space. Therefore, they can see each other's data.
Each compute node has 1-2 CPUs, each with 10-14 cores
14
Multi-Threading (for shared
memory architectures)
• Threads are contained within processes
– One process => multiple threads
• All threads of a process share the same address space (in memory).
• Threads have the capability to run concurrently (executing different instructions and accessing different pieces of data at the same time).
• But if a resource is occupied by another thread, they form a queue and wait.
– For maximum throughput, it is ideal to map each thread to a unique/distinct core.
Copyright © 2010, Elsevier
Inc. All rights Reserved
15
[Diagram: threads mapped onto CPU cores, each core with its own cache ($), all sharing one memory address space.]
A process and two threads
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
the "master" thread
starting a thread is called forking
terminating a thread is called joining
16
Distributed Memory Architecture
[Diagram: compute nodes 1 through m, each with its own local memory and processes running on its cores, connected by a network interconnect.]
Processes cannot see each other's memory address space.
They have to send inter-process messages (using MPI).
17
Distributed Memory System
• Clusters (most popular)
– A collection of commodity systems.
– Connected by a commodity interconnection network.
• Nodes of a cluster are individual computers joined by a communication network.
a.k.a. hybrid systems
Kamiak provides an Infiniband interconnect between all compute nodes
18
Single Program Models:
SIMD vs. MIMD
19
• SP: Single Program
– Your parallel program is a single program that you execute on all threads (or processes)
• SI: Single Instruction
– Each thread should be executing the same line of code at any given clock cycle.
• MI: Multiple Instruction
– Each thread (or process) could be independently running a different line of your code (instruction) concurrently
• MD: Multiple Data
– Each thread (or process) could be operating on/accessing a different piece of the data from memory concurrently
Single Program Models:
SIMD vs. MIMD
20
[Diagram: two parallel regions shown side by side, each bracketed by "// Begin: parallel region of the code" and "// End: parallel region of the code".
SIMD: all threads (Thread 1, Thread 2, Thread 3) execute the same line of code at any instant, though they may be accessing different pieces of data.
MIMD: each thread may be at a different line of the parallel region at the same time.]
Foster's methodology
1. Partitioning: divide the computation to be
performed and the data operated on by
the computation into small tasks.
The focus here should be on identifying
tasks that can be executed in parallel.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
21
Foster's methodology
2. Communication: determine what
communication needs to be carried out
among the tasks identified in the previous
step.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
22
Foster's methodology
3. Agglomeration or aggregation: combine
tasks and communications identified in
the first step into larger tasks.
For example, if task A must be executed
before task B can be executed, it may
make sense to aggregate them into a
single composite task.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
23
Foster's methodology
4. Mapping: assign the composite tasks
identified in the previous step to
processes/threads.
This should be done so that
communication is minimized, and each
process/thread gets roughly the same
amount of work.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
24
Example from sum.c
• Open sum.c
• What can be parallelized here?
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
25
OPENMP FOR SHARED MEMORY
MULTITHREADED PROGRAMMING
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
26
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Roadmap
• Writing programs that use OpenMP.
• Using OpenMP to parallelize many serial for loops with only small changes to the source code.
• Task parallelism.
• Explicit thread synchronization.
• Standard problems in shared-memory programming.
27
Pragmas
• Special preprocessor instructions.
• Typically added to a system to allow behaviors that aren't part of the basic C specification.
• Compilers that don't support the pragmas ignore them.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
#pragma
28
OpenMP pragmas
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
• #include <omp.h>
• # pragma omp parallel
– Most basic parallel directive.
– The number of threads that run the following structured block of code is determined by the run-time system.
29
clause
• Text that modifies a directive.
• The num_threads clause can be added to a parallel directive.
• It allows the programmer to specify the number of threads that should execute the following block.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
# pragma omp parallel num_threads ( thread_count )
30
Some terminology
• In OpenMP parlance, the collection of threads executing the parallel block (the original thread and the new threads) is called a team, the original thread is called the master, and the additional threads are called workers.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
31
In case the compiler doesn't
support OpenMP
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
# include <omp.h>
#ifdef _OPENMP
# include <omp.h>
#endif
32
In case the compiler doesn't
support OpenMP
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
# ifdef _OPENMP
int my_rank = omp_get_thread_num ( );
int thread_count = omp_get_num_threads ( );
# else
int my_rank = 0;
int thread_count = 1;
# endif
33
#include <stdio.h>
int main()
{
printf("Hello world\n");
return 0;
}
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
34
Serial version of "hello world"
How do we invoke omp in this case?
Compile it: gcc hello.c
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
After invoking omp
35
Compile it: gcc -fopenmp hello.c
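The code for this slide did not survive extraction; as a stand-in, here is a minimal OpenMP "hello world" in the spirit of Pacheco's example (taking the thread count from the command line is an assumption of this sketch, not something specified on the slide):

#include <stdio.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

void Hello(void) {   /* each thread runs this function */
#ifdef _OPENMP
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
#else
    int my_rank = 0;
    int thread_count = 1;
#endif
    printf("Hello from thread %d of %d\n", my_rank, thread_count);
}

int main(int argc, char* argv[]) {
    int thread_count = strtol(argv[1], NULL, 10);   /* number of threads from the command line */

#pragma omp parallel num_threads(thread_count)
    Hello();

    return 0;
}

Compile with gcc -g -Wall -fopenmp -o hello hello.c and run, for example, as ./hello 4.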
Parallel Code Template (OpenMP)
#include <omp.h>
int main(…) {
… // let p be the user-specified #threads
omp_set_num_threads(p);
#pragma omp parallel
{
… // OpenMP parallel region where p threads are active and running concurrently
}
}
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
36
Scope
• In serial programming, the scope of a variable consists of those parts of a program in which the variable can be used.
• In OpenMP, the scope of a variable refers to the set of threads that can access the variable in a parallel block.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
37
Scope in OpenMP
• A variable that can be accessed by all the threads in the team has shared scope.
• A variable that can only be accessed by a single thread has private scope.
• The default scope for variables declared before a parallel block is shared (see the sketch after this slide).
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
38
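A small sketch of these scoping rules (my own example; thread_count = 4 is an arbitrary choice): x is declared before the parallel block and so is shared, while my_x is declared inside the block and so is private to each thread.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int thread_count = 4;   /* arbitrary choice for this sketch */
    int x = 5;              /* declared before the parallel block => shared by default */

#pragma omp parallel num_threads(thread_count)
    {
        int my_x = omp_get_thread_num();   /* declared inside the block => private */
        printf("thread %d sees shared x = %d and private my_x = %d\n",
               omp_get_thread_num(), x, my_x);
    }
    return 0;
}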
Loop.c
• Let's go over our first code – loop-serial.c
• Now edit it to employ OpenMP and save it as loop-parallel.c
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
39
Performance
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
40
Taking Timings
• What is time?
• Start to finish?
• A program segment of interest?
• CPU time?
• Wall clock time?
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
41
Taking Timings
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Theoretical timing function vs. the actual wall-clock timers: MPI_Wtime (MPI) and omp_get_wtime (OpenMP); see the sketch after this slide.
42
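For example, wall-clock time around a program segment of interest can be taken with omp_get_wtime (a sketch; the parallel region being timed is just a placeholder):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double start = omp_get_wtime();    /* wall-clock time before the segment */

#pragma omp parallel
    {
        /* ... program segment of interest ... */
    }

    double finish = omp_get_wtime();   /* wall-clock time after the segment */
    printf("Elapsed time = %e seconds\n", finish - start);
    return 0;
}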
• Number of threads = p
• Serial run-time = Tserial
• Parallel run-time = Tparallel
Copyright © 2010, Elsevier
Inc. All rights Reserved
Ideal (linear) speedup: Tparallel = Tserial / p
Speedup: S = Tserial / Tparallel
43
Scalability
• In general, a problem is scalable if it can handle ever-increasing problem sizes.
• If we increase the number of processes/threads and can keep the efficiency fixed without increasing the problem size, the problem is strongly scalable.
• If we can keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, the problem is weakly scalable.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
44
Studying Scalability
[Empty table to be filled in: rows are input sizes n = 1,000, 2,000, 4,000, 8,000, 16,000; columns are thread counts p = 1, 2, 4, 8, 16.]
Table records the parallel runtime (in seconds) for varying values of n and p.
45
It is conventional to test scalability in powers of two (or by doubling n and p).
Studying Scalability
Parallel runtime (in seconds) for varying values of n and p:
n \ p        1        2       4       8      16
1,000      800      410     201     150     100
2,000    1,601      802     409     210     120
4,000    3,100    1,504     789     399     208
8,000    6,010    3,005   1,500     758     376
16,000  12,000    6,000   3,001   1,509     758
It is conventional to test scalability in powers of two (or by doubling n and p).
Strong scaling behavior: read across a row (fixed n, increasing p).
Weak scaling behavior: read down the diagonal (n and p doubled together).
46
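A worked reading of the table: for n = 1,000 the runtime falls from 800 s on 1 thread to 100 s on 16 threads, a speedup of S = 800/100 = 8 and an efficiency of E = S/p = 8/16 = 0.5, so strong scaling levels off. Along the diagonal, doubling n and p together keeps the runtime roughly constant (800, 802, 789, 758, 758 s), which is the weak scaling behavior marked above.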
Studying Scalability
[Charts: left, parallel runtime in seconds vs. p (1 to 16) for n = 1,000 to 16,000; right, speedup vs. p for the same inputs, together with the ideal (linear) speedup line.]
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
47
Timings with loop-parallel.c
(your code) or loop.c (our code)
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
48
You try!
< 10 minutes
Serial vs. Parallel Reduction
[Diagram: serial sum of x1 … x8. A single thread applies + seven times, building up x1+2, x1+2+3, …, x1+…+8. Serial process: 1 thread, 7 operations.]
49
[Diagram: parallel (tree) sum of x1 … x8 on threads t1 … t8. Values are added in pairs, the pair sums are combined, and then the two halves are combined. Parallel process: 8 threads, 3 operations, i.e. log2 p stages.]
50
Reduction operators
• A reduction operator is a binary operation (such as addition or multiplication).
• A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result.
• All of the intermediate results of the operation should be stored in the same variable: the reduction variable (see the sketch after this slide).
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
51
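In OpenMP the reduction pattern is expressed with the reduction clause; here is a minimal sketch (my own example, summing an array of ones) in which global_sum is the reduction variable:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    double a[N], global_sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* Each thread keeps a private copy of global_sum; OpenMP combines
       the copies with + when the loop ends. */
#pragma omp parallel for reduction(+: global_sum)
    for (int i = 0; i < N; i++)
        global_sum += a[i];

    printf("sum = %f\n", global_sum);
    return 0;
}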
Computing a sum
• Open sumcompute-serial.c
• Let's go over it
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
52
< 5 minutes
Mutual exclusion
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
# pragma omp critical
global_result += my_result ;
only one thread can execute
the following structured block at a time
53
Example
• Open sync-unsafe.c
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
54
Synchronization
• Synchronization imposes order constraints and is used to protect access to shared data
• Types of synchronization:
– critical
– atomic
– locks
– others (barrier, ordered, flush)
• We will work on an exercise involving critical, atomic, and locks
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
55
Critical
#pragma omp parallel for schedule(static) shared(a)
for(i = 0; i < n; i++)
{
#pragma omp critical
{
a = a+1;
}
}
Threads wait here: only one thread at a time does the operation "a = a+1". So this is a piece of sequential code inside the for loop.
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
56
Atomic
• Atomic provides mutual exclusion, but it only applies to the load/update of a memory location
• It is applied only to the (single) assignment statement that immediately follows it
• The atomic construct may only be used together with an expression statement using one of the operators: +, *, -, /, &, ^, |, <<, >>
• The atomic construct does not prevent multiple threads from executing function() at the same time (see the example below)
Code example:
int ic, i, n;
ic = 0;
#pragma omp parallel shared(n,ic) private(i)
for (i = 0; i < n; i++)
{
#pragma omp atomic
ic = ic + function(c);
}
Atomic only protects the update of ic
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
57
Locks
• A lock consists of a data structure and functions that allow the programmer to explicitly enforce mutual exclusion in a critical section.
58
[Diagram: threads queue up at Acquire lock(i); only one thread at a time executes the critical section before Release lock(i).]
Difference from a critical section:
- You can have multiple locks
- A thread can try for any specific lock
- => we can use this to acquire data-level locks
e.g., two threads can access different array indices without waiting.
Illustration of Locking Operation
• The protected region contains the update of a shared variable
• One thread acquires the lock and performs the update
• Meanwhile, other threads perform some other work
• When the lock is released again, the other threads perform the update
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
59
A Locks Code Example
long long int a=0;
long long int i;
omp_lock_t my_lock;
// init lock
omp_init_lock(&my_lock);
#pragma omp parallel for
for(i = 0; i < n; i++)
{
omp_set_lock(&my_lock);
a+=1;
omp_unset_lock(&my_lock);
}
omp_destroy_lock(&my_lock);
1. Define lock variable
2. Initialize lock
3. Set lock
4. Unset lock
5. Destroy lock
Compiling and running sync.c:
gcc -g -Wall -fopenmp -o sync sync.c
./sync #of-iterations #of-threads
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
60
Some Caveats
1. You shouldn't mix the different types of mutual exclusion for a single critical section.
2. There is no guarantee of fairness in mutual exclusion constructs.
3. It can be dangerous to "nest" mutual exclusion constructs.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
61
• Default schedule:
Copyright © 2010, Elsevier
Inc. All rights Reserved
Loop.c example
#pragma omp parallel for schedule(static) private(a) // creates N threads to run the next enclosed block
for(i = 0; i < loops; i++)
{
a = 6+7*8;
}
62
The Runtime Schedule Type
• The system uses the environment variable OMP_SCHEDULE to determine at run-time how to schedule the loop.
• The OMP_SCHEDULE environment variable can take on any of the values that can be used for a static, dynamic, or guided schedule.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
63
schedule ( type , chunksize )
– static: chunks are assigned before the loop is executed.
– dynamic or guided: chunks are assigned while the loop is executing.
– auto / runtime: determined by the compiler and/or the run-time system.
Copyright © 2010, Elsevier
Inc. All rights Reserved
Controls how loop iterations are assigned to threads (see the sketch after this slide):
• Consecutive iterations are broken into chunks
• chunksize is the number of iterations in each chunk
• A positive integer
• Default is 1
64
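An illustrative sketch (not from the slides; do_work, i, and n are placeholders) showing the schedule clause on a parallel for, including schedule(runtime), whose behavior is taken from OMP_SCHEDULE:

#include <omp.h>

void do_work(int i) { /* placeholder loop body */ }

int main(void) {
    int i, n = 100;

    /* dynamic, chunksize 4: iterations handed out to threads in chunks of 4, on demand */
#pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++)
        do_work(i);

    /* runtime: decided when the program runs, e.g. after: export OMP_SCHEDULE="guided,2" */
#pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        do_work(i);

    return 0;
}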
schedule types can prevent load
imbalance
[Diagram: Gantt-style comparison, threads vs. time, of a static schedule and a dynamic schedule for threads 0-7. Under the static schedule the thread finishing first sits idle until the thread finishing last completes (idle time); under the dynamic schedule the work is better balanced.]
Copyright © 2010, Elsevier Inc. All rights Reserved
65
Static: default
Static, n: set chunksize
[Diagram: iteration space 0 to N-1 divided among Threads 0-3 under the static, static,n, dynamic, and guided schedules; with static,n the chunksize determines how many consecutive iterations each thread receives per round.]
Copyright © 2010, Elsevier
Inc. All rights Reserved
66
Dynamic: a thread executes a chunk; when done, it requests another one
[Diagram: the same iteration space, 0 to N-1, with chunks handed out to Threads 0-3 on demand under the dynamic schedule.]
Copyright © 2010, Elsevier
Inc. All rights Reserved
67
Guided: a thread executes a chunk; when done, it requests another one; new chunks decrease in size (until chunksize is reached)
[Diagram: the same iteration space, 0 to N-1, with successively smaller chunks handed out to Threads 0-3 under the guided schedule.]
Copyright © 2010, Elsevier
Inc. All rights Reserved
68
multiplication.c
• Go over this code at a conceptual level – show where and how to parallelize
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
69
Matrix-vector multiplication
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
70
M x V = X: Parallelization Strategies
71
[Diagram: row decomposition. The m x n matrix M is split into blocks of rows, one block per thread (Threads 0-3); M (m x n) times V (n x 1) gives X (m x 1). An OpenMP sketch follows these two slides.]
• Each thread gets m/p rows
• Time taken is proportional to (mn)/p per thread
• No need for any synchronization (static scheduling will do)
M x V = X: Parallelization Strategies
72
[Diagram: column decomposition. The m x n matrix M is split into blocks of columns, one block per thread (Threads 0-3); each thread produces a partial result vector, and each X[i] = x0 + x1 + x2 + x3 is obtained by a reduction across the threads.]
• Each thread gets n/p columns
• Time taken is proportional to (mn)/p + the time for the reduction, per thread
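A minimal OpenMP sketch of the row decomposition, in the spirit of the deck's matrix_vector.c (the function name, the row-major 1-D storage of A, and the small test data are my own assumptions):

#include <stdio.h>
#include <omp.h>

/* y = A * x, with the m x n matrix A stored row-major in a 1-D array */
void mat_vect(double A[], double x[], double y[], int m, int n) {
    int i, j;
    /* Row decomposition: each thread gets a block of rows; no synchronization
       is needed because each y[i] is written by exactly one thread. */
#pragma omp parallel for private(j) schedule(static)
    for (i = 0; i < m; i++) {
        y[i] = 0.0;
        for (j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];
    }
}

int main(void) {
    int m = 4, n = 3;
    double A[] = {1,2,3, 4,5,6, 7,8,9, 10,11,12};
    double x[] = {1, 1, 1};
    double y[4];
    mat_vect(A, x, y, m, n);
    for (int i = 0; i < m; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}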
Extra slides
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
73
What happened?
1. OpenMP compilers don't check for dependences among iterations in a loop that's being parallelized with a parallel for directive.
2. A loop in which the results of one or more iterations depend on other iterations cannot, in general, be correctly parallelized by OpenMP.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
74
Estimating π
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
75
OpenMP solution #1
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
loop dependency
76
OpenMP solution #2
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Ensures factor has private scope (see the sketch after this slide).
77
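For reference, a sketch along the lines of that second solution (the series length n is an arbitrary choice here): factor is recomputed from k inside the loop and kept private, and sum is a reduction variable, which removes the loop dependency.

#include <stdio.h>
#include <omp.h>

int main(void) {
    long long n = 100000000;   /* number of series terms; arbitrary for this sketch */
    long long k;
    double factor, sum = 0.0;

#pragma omp parallel for private(factor) reduction(+: sum)
    for (k = 0; k < n; k++) {
        factor = (k % 2 == 0) ? 1.0 : -1.0;   /* recomputed from k, so no dependence on the previous iteration */
        sum += factor / (2*k + 1);
    }

    printf("pi is approximately %.14f\n", 4.0 * sum);
    return 0;
}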
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Thread-Safety
78
An operating system "process"
• An instance of a computer program that is being executed.
• Components of a process:
– The executable machine language program.
– A block of memory.
– Descriptors of resources the OS has allocated to the process.
– Security information.
– Information about the state of the process.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
79
Shared Memory System
• Each processor can access each memory location.
• The processors usually communicate implicitly by accessing shared data structures.
• Example: multiple CPU cores on a single chip
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Kamiak compute nodes have multiple
CPUs each with multiple cores
80
UMA multicore system
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Figure 2.5
Time to access all
the memory locations
will be the same for
all the cores.
81
NUMA multicore system
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Figure 2.6
A memory location a core is
directly connected to can be
accessed faster than a memory
location that must be accessed
through another chip.
82
Input and Output
• However, because of the indeterminacy of the order of output to stdout, in most cases only a single process/thread will be used for all output to stdout other than debugging output.
• Debug output should always include the rank or id of the process/thread that's generating the output.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
83
Input and Output
• Only a single process/thread will attempt to access any single file other than stdin, stdout, or stderr. So, for example, each process/thread can open its own, private file for reading or writing, but no two processes/threads will open the same file.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
84
Division of work –
data parallelism
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
85
Division of work –
task parallelism
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Tasks
1) Receiving
2) Addition
86
Distributed Memory on Kamiak
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Figure 2.4
[Diagram: compute nodes 1 through N connected in a distributed-memory configuration.]
Each compute node has 1-2 CPUs, each with 10-14 cores
87
The burden is on software
• Hardware and compilers can keep up the pace needed.
• From now on…
– In shared memory programs:
• Start a single process and fork threads.
• Threads carry out tasks.
– In distributed memory programs:
• Start multiple processes.
• Processes carry out tasks.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
89
Writing Parallel Programs
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
double x[n], y[n];
…
for (i = 0; i < n; i++)
x[i] += y[i];
1. Divide the work among the
processes/threads
(a) so each process/thread
gets roughly the same
amount of work
(b) and communication is
minimized.
2. Arrange for the processes/threads to synchronize.
3. Arrange for communication among processes/threads.
I think we will
eventually
introduce this when
doing sum. I would
remove this slide
from here.
90
Ananth can you put in a simple
example here?
Remove Slide
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
91
OpenMP (ask ananth to tailor more
for the loop.c code)
• An API for shared-memory parallel programming.
• MP = multiprocessing
• Designed for systems in which each thread or process can potentially have access to all available memory.
• The system is viewed as a collection of cores or CPUs, all of which have access to main memory.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Remove Slide
92
A process forking and joining two
threads
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
93
Of noteā€¦
• There may be system-defined limitations on the number of threads that a program can start.
• The OpenMP standard doesn't guarantee that this will actually start thread_count threads.
• Most current systems can start hundreds or even thousands of threads.
• Unless we're trying to start a lot of threads, we will almost always get the desired number of threads.
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
94
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Have people take the serial version and make it into the OpenMP version of loop.c
- have a small example code that has the pragma and OpenMP calls in it
- then have them adapt loop.c to have the OpenMP in it… give them 10 minutes to do while we go around the room and help… then walk them through our parallel version of loop.c
95
SCHEDULING LOOPS IN SYNC.C
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
96
Locks
• A lock implies a memory fence of all thread-visible variables
• The lock routines are used to guarantee that only one thread accesses a variable at a time, to avoid race conditions
• C/C++ lock variables must have type "omp_lock_t" or "omp_nest_lock_t" (we will not discuss nested locks in this workshop)
• All lock functions require an argument that is a pointer to omp_lock_t or omp_nest_lock_t
• Simple lock functions:
– omp_init_lock(omp_lock_t*);
– omp_set_lock(omp_lock_t*);
– omp_unset_lock(omp_lock_t*);
– omp_test_lock(omp_lock_t*);
– omp_destroy_lock(omp_lock_t*);
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
97
How to Use Locks
1) Define the lock variables
2) Initialize the lock via a call to omp_init_lock
3) Set the lock using omp_set_lock or omp_test_lock. The
latter checks whether the lock is actually available
before attempting to set it. It is useful to achieve
asynchronous thread execution.
4) Unset a lock after the work is done via a call to
omp_unset_lock.
5) Remove the lock association via a call to
omp_destroy_lock.
Copyright Ā© 2010, Elsevier Inc. All rights
Reserved
98
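A small sketch of step 3's omp_test_lock (my own example): each thread tries the lock and, if it is busy, can do other useful work instead of blocking.

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t my_lock;   /* 1) define the lock variable */
    int updates = 0;

    omp_init_lock(&my_lock);   /* 2) initialize the lock */

#pragma omp parallel num_threads(4)
    {
        int done = 0;
        while (!done) {
            if (omp_test_lock(&my_lock)) {   /* 3) try to set; returns nonzero on success */
                updates++;                   /* protected update */
                omp_unset_lock(&my_lock);    /* 4) unset the lock */
                done = 1;
            } else {
                /* lock busy: do some other work, then try again */
            }
        }
    }

    omp_destroy_lock(&my_lock);   /* 5) destroy the lock */
    printf("updates = %d\n", updates);
    return 0;
}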
Matrix-vector multiplication
Copyright Ā© 2010, Elsevier
Inc. All rights Reserved
Run-times and efficiencies
of matrix-vector multiplication
(times are in seconds)
99
More Related Content

Similar to 6-9-2017-slides-vFinal.pptx

Multicore_Architecture Book.pdf
Multicore_Architecture Book.pdfMulticore_Architecture Book.pdf
Multicore_Architecture Book.pdfSwatantraPrakash5
Ā 
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMPAlgoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMPPier Luca Lanzi
Ā 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Sudarshan Mondal
Ā 
Coding For Cores - C# Way
Coding For Cores - C# WayCoding For Cores - C# Way
Coding For Cores - C# WayBishnu Rawal
Ā 
PARALLEL ARCHITECTURE AND COMPUTING - SHORT NOTES
PARALLEL ARCHITECTURE AND COMPUTING - SHORT NOTESPARALLEL ARCHITECTURE AND COMPUTING - SHORT NOTES
PARALLEL ARCHITECTURE AND COMPUTING - SHORT NOTESsuthi
Ā 
Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Marcirio Chaves
Ā 
25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptxGopalPatidar13
Ā 
slides8 SharedMemory.ppt
slides8 SharedMemory.pptslides8 SharedMemory.ppt
slides8 SharedMemory.pptaminnezarat
Ā 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPrashant Rane
Ā 
Asynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NETAsynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NETssusere19c741
Ā 
CS4200 2019 Lecture 1: Introduction
CS4200 2019 Lecture 1: IntroductionCS4200 2019 Lecture 1: Introduction
CS4200 2019 Lecture 1: IntroductionEelco Visser
Ā 
Parallel computing persentation
Parallel computing persentationParallel computing persentation
Parallel computing persentationVIKAS SINGH BHADOURIA
Ā 

Similar to 6-9-2017-slides-vFinal.pptx (20)

Multicore_Architecture Book.pdf
Multicore_Architecture Book.pdfMulticore_Architecture Book.pdf
Multicore_Architecture Book.pdf
Ā 
Parallel computation
Parallel computationParallel computation
Parallel computation
Ā 
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMPAlgoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Ā 
parallel-computation.pdf
parallel-computation.pdfparallel-computation.pdf
parallel-computation.pdf
Ā 
Multithreading by rj
Multithreading by rjMultithreading by rj
Multithreading by rj
Ā 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)
Ā 
Lecture6
Lecture6Lecture6
Lecture6
Ā 
Coding For Cores - C# Way
Coding For Cores - C# WayCoding For Cores - C# Way
Coding For Cores - C# Way
Ā 
PARALLEL ARCHITECTURE AND COMPUTING - SHORT NOTES
PARALLEL ARCHITECTURE AND COMPUTING - SHORT NOTESPARALLEL ARCHITECTURE AND COMPUTING - SHORT NOTES
PARALLEL ARCHITECTURE AND COMPUTING - SHORT NOTES
Ā 
Lecture1
Lecture1Lecture1
Lecture1
Ā 
Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1
Ā 
25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx
Ā 
slides8 SharedMemory.ppt
slides8 SharedMemory.pptslides8 SharedMemory.ppt
slides8 SharedMemory.ppt
Ā 
Java multi thread programming on cmp system
Java multi thread programming on cmp systemJava multi thread programming on cmp system
Java multi thread programming on cmp system
Ā 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCD
Ā 
Os lectures
Os lecturesOs lectures
Os lectures
Ā 
Threads
ThreadsThreads
Threads
Ā 
Asynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NETAsynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NET
Ā 
CS4200 2019 Lecture 1: Introduction
CS4200 2019 Lecture 1: IntroductionCS4200 2019 Lecture 1: Introduction
CS4200 2019 Lecture 1: Introduction
Ā 
Parallel computing persentation
Parallel computing persentationParallel computing persentation
Parallel computing persentation
Ā 

Recently uploaded

DragonBall PowerPoint Template for demo.pptx
DragonBall PowerPoint Template for demo.pptxDragonBall PowerPoint Template for demo.pptx
DragonBall PowerPoint Template for demo.pptxmirandajeremy200221
Ā 
WAEC Carpentry and Joinery Past Questions
WAEC Carpentry and Joinery Past QuestionsWAEC Carpentry and Joinery Past Questions
WAEC Carpentry and Joinery Past QuestionsCharles Obaleagbon
Ā 
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...Call Girls in Nagpur High Profile
Ā 
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...ranjana rawat
Ā 
Fashion trends before and after covid.pptx
Fashion trends before and after covid.pptxFashion trends before and after covid.pptx
Fashion trends before and after covid.pptxVanshNarang19
Ā 
Verified Trusted Call Girls AdugodišŸ’˜ 9352852248 Good Looking standard Profil...
Verified Trusted Call Girls AdugodišŸ’˜ 9352852248  Good Looking standard Profil...Verified Trusted Call Girls AdugodišŸ’˜ 9352852248  Good Looking standard Profil...
Verified Trusted Call Girls AdugodišŸ’˜ 9352852248 Good Looking standard Profil...kumaririma588
Ā 
Stark Industries Marketing Plan (1).pptx
Stark Industries Marketing Plan (1).pptxStark Industries Marketing Plan (1).pptx
Stark Industries Marketing Plan (1).pptxjeswinjees
Ā 
SD_The MATATAG Curriculum Training Design.pptx
SD_The MATATAG Curriculum Training Design.pptxSD_The MATATAG Curriculum Training Design.pptx
SD_The MATATAG Curriculum Training Design.pptxjanettecruzeiro1
Ā 
Editorial design Magazine design project.pdf
Editorial design Magazine design project.pdfEditorial design Magazine design project.pdf
Editorial design Magazine design project.pdftbatkhuu1
Ā 
The_Canvas_of_Creative_Mastery_Newsletter_April_2024_Version.pdf
The_Canvas_of_Creative_Mastery_Newsletter_April_2024_Version.pdfThe_Canvas_of_Creative_Mastery_Newsletter_April_2024_Version.pdf
The_Canvas_of_Creative_Mastery_Newsletter_April_2024_Version.pdfAmirYakdi
Ā 
Design Inspiration for College by Slidesgo.pptx
Design Inspiration for College by Slidesgo.pptxDesign Inspiration for College by Slidesgo.pptx
Design Inspiration for College by Slidesgo.pptxTusharBahuguna2
Ā 
VIP Model Call Girls Kalyani Nagar ( Pune ) Call ON 8005736733 Starting From ...
VIP Model Call Girls Kalyani Nagar ( Pune ) Call ON 8005736733 Starting From ...VIP Model Call Girls Kalyani Nagar ( Pune ) Call ON 8005736733 Starting From ...
VIP Model Call Girls Kalyani Nagar ( Pune ) Call ON 8005736733 Starting From ...SUHANI PANDEY
Ā 
Escorts Service Basapura ā˜Ž 7737669865ā˜Ž Book Your One night Stand (Bangalore)
Escorts Service Basapura ā˜Ž 7737669865ā˜Ž Book Your One night Stand (Bangalore)Escorts Service Basapura ā˜Ž 7737669865ā˜Ž Book Your One night Stand (Bangalore)
Escorts Service Basapura ā˜Ž 7737669865ā˜Ž Book Your One night Stand (Bangalore)amitlee9823
Ā 
VVIP CALL GIRLS Lucknow šŸ’“ Lucknow < Renuka Sharma > 7877925207 Escorts Service
VVIP CALL GIRLS Lucknow šŸ’“ Lucknow < Renuka Sharma > 7877925207 Escorts ServiceVVIP CALL GIRLS Lucknow šŸ’“ Lucknow < Renuka Sharma > 7877925207 Escorts Service
VVIP CALL GIRLS Lucknow šŸ’“ Lucknow < Renuka Sharma > 7877925207 Escorts Servicearoranaina404
Ā 
Best VIP Call Girls Noida Sector 44 Call Me: 8448380779
Best VIP Call Girls Noida Sector 44 Call Me: 8448380779Best VIP Call Girls Noida Sector 44 Call Me: 8448380779
Best VIP Call Girls Noida Sector 44 Call Me: 8448380779Delhi Call girls
Ā 
Chapter 19_DDA_TOD Policy_First Draft 2012.pdf
Chapter 19_DDA_TOD Policy_First Draft 2012.pdfChapter 19_DDA_TOD Policy_First Draft 2012.pdf
Chapter 19_DDA_TOD Policy_First Draft 2012.pdfParomita Roy
Ā 
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779Delhi Call girls
Ā 
Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Gi...
Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Gi...Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Gi...
Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Gi...Pooja Nehwal
Ā 
VIP Call Girls Service Mehdipatnam Hyderabad Call +91-8250192130
VIP Call Girls Service Mehdipatnam Hyderabad Call +91-8250192130VIP Call Girls Service Mehdipatnam Hyderabad Call +91-8250192130
VIP Call Girls Service Mehdipatnam Hyderabad Call +91-8250192130Suhani Kapoor
Ā 

Recently uploaded (20)

DragonBall PowerPoint Template for demo.pptx
DragonBall PowerPoint Template for demo.pptxDragonBall PowerPoint Template for demo.pptx
DragonBall PowerPoint Template for demo.pptx
Ā 
WAEC Carpentry and Joinery Past Questions
WAEC Carpentry and Joinery Past QuestionsWAEC Carpentry and Joinery Past Questions
WAEC Carpentry and Joinery Past Questions
Ā 
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
VVIP Pune Call Girls Hadapsar (7001035870) Pune Escorts Nearby with Complete ...
Ā 
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
(AISHA) Ambegaon Khurd Call Girls Just Call 7001035870 [ Cash on Delivery ] P...
Ā 
Fashion trends before and after covid.pptx
Fashion trends before and after covid.pptxFashion trends before and after covid.pptx
Fashion trends before and after covid.pptx
Ā 
Verified Trusted Call Girls AdugodišŸ’˜ 9352852248 Good Looking standard Profil...
Verified Trusted Call Girls AdugodišŸ’˜ 9352852248  Good Looking standard Profil...Verified Trusted Call Girls AdugodišŸ’˜ 9352852248  Good Looking standard Profil...
Verified Trusted Call Girls AdugodišŸ’˜ 9352852248 Good Looking standard Profil...
Ā 
Stark Industries Marketing Plan (1).pptx
Stark Industries Marketing Plan (1).pptxStark Industries Marketing Plan (1).pptx
Stark Industries Marketing Plan (1).pptx
Ā 
SD_The MATATAG Curriculum Training Design.pptx
SD_The MATATAG Curriculum Training Design.pptxSD_The MATATAG Curriculum Training Design.pptx
SD_The MATATAG Curriculum Training Design.pptx
Ā 
Editorial design Magazine design project.pdf
Editorial design Magazine design project.pdfEditorial design Magazine design project.pdf
Editorial design Magazine design project.pdf
Ā 
The_Canvas_of_Creative_Mastery_Newsletter_April_2024_Version.pdf
The_Canvas_of_Creative_Mastery_Newsletter_April_2024_Version.pdfThe_Canvas_of_Creative_Mastery_Newsletter_April_2024_Version.pdf
The_Canvas_of_Creative_Mastery_Newsletter_April_2024_Version.pdf
Ā 
Design Inspiration for College by Slidesgo.pptx
Design Inspiration for College by Slidesgo.pptxDesign Inspiration for College by Slidesgo.pptx
Design Inspiration for College by Slidesgo.pptx
Ā 
VIP Model Call Girls Kalyani Nagar ( Pune ) Call ON 8005736733 Starting From ...
VIP Model Call Girls Kalyani Nagar ( Pune ) Call ON 8005736733 Starting From ...VIP Model Call Girls Kalyani Nagar ( Pune ) Call ON 8005736733 Starting From ...
VIP Model Call Girls Kalyani Nagar ( Pune ) Call ON 8005736733 Starting From ...
Ā 
Escorts Service Basapura ā˜Ž 7737669865ā˜Ž Book Your One night Stand (Bangalore)
Escorts Service Basapura ā˜Ž 7737669865ā˜Ž Book Your One night Stand (Bangalore)Escorts Service Basapura ā˜Ž 7737669865ā˜Ž Book Your One night Stand (Bangalore)
Escorts Service Basapura ā˜Ž 7737669865ā˜Ž Book Your One night Stand (Bangalore)
Ā 
VVIP CALL GIRLS Lucknow šŸ’“ Lucknow < Renuka Sharma > 7877925207 Escorts Service
VVIP CALL GIRLS Lucknow šŸ’“ Lucknow < Renuka Sharma > 7877925207 Escorts ServiceVVIP CALL GIRLS Lucknow šŸ’“ Lucknow < Renuka Sharma > 7877925207 Escorts Service
VVIP CALL GIRLS Lucknow šŸ’“ Lucknow < Renuka Sharma > 7877925207 Escorts Service
Ā 
young call girls in Vivek ViharšŸ” 9953056974 šŸ” Delhi escort Service
young call girls in Vivek ViharšŸ” 9953056974 šŸ” Delhi escort Serviceyoung call girls in Vivek ViharšŸ” 9953056974 šŸ” Delhi escort Service
young call girls in Vivek ViharšŸ” 9953056974 šŸ” Delhi escort Service
Ā 
Best VIP Call Girls Noida Sector 44 Call Me: 8448380779
Best VIP Call Girls Noida Sector 44 Call Me: 8448380779Best VIP Call Girls Noida Sector 44 Call Me: 8448380779
Best VIP Call Girls Noida Sector 44 Call Me: 8448380779
Ā 
Chapter 19_DDA_TOD Policy_First Draft 2012.pdf
Chapter 19_DDA_TOD Policy_First Draft 2012.pdfChapter 19_DDA_TOD Policy_First Draft 2012.pdf
Chapter 19_DDA_TOD Policy_First Draft 2012.pdf
Ā 
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Ā 
Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Gi...
Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Gi...Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Gi...
Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Gi...
Ā 
VIP Call Girls Service Mehdipatnam Hyderabad Call +91-8250192130
VIP Call Girls Service Mehdipatnam Hyderabad Call +91-8250192130VIP Call Girls Service Mehdipatnam Hyderabad Call +91-8250192130
VIP Call Girls Service Mehdipatnam Hyderabad Call +91-8250192130
Ā 

6-9-2017-slides-vFinal.pptx

  • 1. Introduction to Parallel Programming Center for Institutional Research Computing Slides for the book "An introduction to Parallel Programming", by Peter Pacheco (available from the publisher website): http://booksite.elsevier.com/9780123 742605/ Copyright Ā© 2010, Elsevier Inc. All rights Reserved 1
  • 2. Serial hardware and software Copyright Ā© 2010, Elsevier Inc. All rights Reserved input output programs Computer runs one program at a time. 2
  • 3. Why we need to write parallel programs ā€¢ Running multiple instances of a serial program often isnā€™t very useful. ā€¢ Think of running multiple instances of your favorite game. ā€¢ What you really want is for it to run faster. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 3
  • 4. How do we write parallel programs? ā€¢ Task parallelism ā€“ Partition various tasks carried out solving the problem among the cores. ā€¢ Data parallelism ā€“ Partition the data used in solving the problem among the cores. ā€“ Each core carries out similar operations on itā€™s part of the data. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 4
  • 5. Professor P Copyright Ā© 2010, Elsevier Inc. All rights Reserved 15 questions 300 exams 5
  • 6. Professor Pā€™s grading assistants Copyright Ā© 2010, Elsevier Inc. All rights Reserved TA#1 TA#2 TA#3 6
  • 7. Division of work ā€“ data parallelism Copyright Ā© 2010, Elsevier Inc. All rights Reserved TA#1 TA#2 TA#3 100 exams 100 exams 100 exams 7
  • 8. Division of work ā€“ task parallelism Copyright Ā© 2010, Elsevier Inc. All rights Reserved TA#1 TA#2 TA#3 Questions 1 - 5 Questions 6 - 10 Questions 11 - 15 Questions 1 - 7 Questions 8 - 11 Questions 12 - 15 Partitioning strategy: - either by number - Or by workload or or or 8
  • 9. Coordination ā€¢ Cores usually need to coordinate their work. ā€¢ Communication ā€“ one or more cores send their current partial sums to another core. ā€¢ Load balancing ā€“ share the work evenly among the cores so that one is not heavily loaded. ā€¢ Synchronization ā€“ because each core works at its own pace, make sure cores do not get too far ahead of the rest. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 9
  • 10. What weā€™ll be doing ā€¢ Learning to write programs that are explicitly parallel. ā€¢ Using the C language. ā€¢ Using the OpenMP extension to C (multi-threading for shared memory) ā€“ Others you can investigate after this workshop: ā€¢ Message-Passing Interface (MPI) ā€¢ Posix Threads (Pthreads) Copyright Ā© 2010, Elsevier Inc. All rights Reserved 10
  • 11. Essential concepts ā€¢ Memory ā€¢ Process execution terminology ā€¢ Configuration of Kamiak ā€¢ Coding concepts for parallelism - Parallel program design ā€¢ Performance ā€¢ OpenMP Copyright Ā© 2010, Elsevier Inc. All rights Reserved 11
  • 12. .c programs we will use ā€¢ loop.c ā€¢ sync.c ā€¢ sumcomp.c ā€¢ matrix_vector.c (for independent study) ā€¢ Step 1: log into kamiak ā€¢ Step 2: run command: training Copyright Ā© 2010, Elsevier Inc. All rights Reserved 12
  • 13. Memory ā€¢ Two major classes of parallel programming models: ā€“ Shared Memory ā€“ Distributed Memory 13
  • 14. Shared Memory Architecture CPU core #1 Memory (Shared address space) disk Memory bus Compute node I/O bus Cache ($) CPU core Cache ($) ā€¦ CPU core Cache ($) Threads Threads Threads All threads can see a single shared address space. Therefore, they can see each otherā€™s data. Each compute node has 1-2 CPUs each with 10-14 cores 14
  • 15. Multi-Threading (for shared memory architectures) ā€¢ Threads are contained within processes ā€“ One process => multiple threads ā€¢ All threads of a process share the same address space (in memory). ā€¢ Threads have the capability to run concurrently (executing different instructions and accessing different pieces of data at the same time) ā€¢ But if the resource is occupied by another thread, they form a queue and wait. ā€“ For maximum throughput, it is ideal to map each thread to a unique/distinct core Copyright Ā© 2010, Elsevier Inc. All rights Reserved 15 CPU cores Cache ($) Threads Memory (Shared address space)
  • 16. A process and two threads Copyright Ā© 2010, Elsevier Inc. All rights Reserved the ā€œmasterā€ thread starting a thread Is called forking terminating a thread Is called joining 16
  • 17. Distributed Memory Architecture Local Memory Compute node 1 Local Memory Compute node 2 Local Memory Compute node m ā€¦ā€¦ā€¦ Network Interconnect Processes running on cores Processes running on cores Processes running on cores Processes cannot see each otherā€™s memory address space. They have to send inter-process messages (using MPI). 17
  • 18. Distributed Memory System ā€¢ Clusters (most popular) ā€“ A collection of commodity systems. ā€“ Connected by a commodity interconnection network. ā€¢ Nodes of a cluster are individual computers joined by a communication network. a.k.a. hybrid systems Kamiak provides an Infiniband interconnect between all compute nodes 18
  • 19. Single Program Models: SIMD vs. MIMD 19 ā€¢ SP: Single Program ā€“ Your parallel program is a single program that you execute on all threads (or processes) ā€¢ SI: Single Instruction ā€“ Each thread should be executing the same line of code at any given clock cycle. ā€¢ MI: Multiple Instruction ā€“ Each thread (or process) could be independently running a different line of your code (instruction) concurrently ā€¢ MD: Multiple Data ā€“ Each thread (or process) could be operating/accessing a different piece of the data from the memory concurrently
  • 20. Single Program Models: SIMD vs. MIMD 20 // Begin: parallel region of the code ā€¦ .. .. .. .. .. .. .. .. .. .. // End: parallel region of the code // Begin: parallel region of the code ā€¦ .. .. .. .. .. .. .. .. .. .. // End: parallel region of the code All threads executing the same line of code. They may be accessing different pieces of data. Thread 1 Thread 2 Thread 3 SIMD MIMD
  • 21. Fosterā€™s methodology 1. Partitioning: divide the computation to be performed and the data operated on by the computation into small tasks. The focus here should be on identifying tasks that can be executed in parallel. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 21
  • 22. Fosterā€™s methodology 2. Communication: determine what communication needs to be carried out among the tasks identified in the previous step. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 22
  • 23. Fosterā€™s methodology 3. Agglomeration or aggregation: combine tasks and communications identified in the first step into larger tasks. For example, if task A must be executed before task B can be executed, it may make sense to aggregate them into a single composite task. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 23
  • 24. Fosterā€™s methodology 4. Mapping: assign the composite tasks identified in the previous step to processes/threads. This should be done so that communication is minimized, and each process/thread gets roughly the same amount of work. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 24
  • 25. Example from sum.c ā€¢ Open sum.c ā€¢ What can be parallelized here? Copyright Ā© 2010, Elsevier Inc. All rights Reserved 25
  • 26. OPENMP FOR SHARED MEMORY MULTITHREADED PROGRAMMING Copyright Ā© 2010, Elsevier Inc. All rights Reserved 26
  • 27. Copyright Ā© 2010, Elsevier Inc. All rights Reserved Roadmap ā€¢ Writing programs that use OpenMP. ā€¢ Using OpenMP to parallelize many serial for loops with only small changes to the source code. ā€¢ Task parallelism. ā€¢ Explicit thread synchronization. ā€¢ Standard problems in shared-memory programming. 27
  • 28. Pragmas ā€¢ Special preprocessor instructions. ā€¢ Typically added to a system to allow behaviors that arenā€™t part of the basic C specification. ā€¢ Compilers that donā€™t support the pragmas ignore them. Copyright Ā© 2010, Elsevier Inc. All rights Reserved #pragma 28
  • 29. OpenMp pragmas Copyright Ā© 2010, Elsevier Inc. All rights Reserved ā€¢ # pragma omp parallel ā€¢ # include omp.h ā€“ Most basic parallel directive. ā€“ The number of threads that run the following structured block of code is determined by the run-time system. 29
  • 30. clause ā€¢ Text that modifies a directive. ā€¢ The num_threads clause can be added to a parallel directive. ā€¢ It allows the programmer to specify the number of threads that should execute the following block. Copyright Ā© 2010, Elsevier Inc. All rights Reserved # pragma omp parallel num_threads ( thread_count ) 30
  • 31. Some terminology ā€¢ In OpenMP parlance the collection of threads executing the parallel block ā€” the original thread and the new threads ā€” is called a team, the original thread is called the master, and the additional threads are called worker. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 31
  • 32. In case the compiler doesnā€™t support OpenMP Copyright Ā© 2010, Elsevier Inc. All rights Reserved # include <omp.h> #ifdef _OPENMP # include <omp.h> #endif 32
  • 33. In case the compiler doesnā€™t support OpenMP Copyright Ā© 2010, Elsevier Inc. All rights Reserved # ifdef _OPENMP int my_rank = omp_get_thread_num ( ); int thread_count = omp_get_num_threads ( ); # e l s e int my_rank = 0; int thread_count = 1; # endif 33
  • 34. #include <stdio.h> int main() { printf("Hello worldn"); return 0; } Copyright Ā© 2010, Elsevier Inc. All rights Reserved 34 Serial version of ā€œhello worldā€ How do we invoke omp in this case? Compile it: gcc hello.c
  • 35. Copyright Ā© 2010, Elsevier Inc. All rights Reserved After invoking omp 35 Compile it: gcc ā€“fopenmp hello.c
  • 36. Parallel Code Template (OpenMP) #include <omp.h> main(ā€¦) { ā€¦ // let p be the user-specified #threads omp_set_num_threads(p); #pragma omp parallel for { ā€¦. // openmp parallel region where p threads are active and running concurrently } Copyright Ā© 2010, Elsevier Inc. All rights Reserved 36
  • 37. Scope ā€¢ In serial programming, the scope of a variable consists of those parts of a program in which the variable can be used. ā€¢ In OpenMP, the scope of a variable refers to the set of threads that can access the variable in a parallel block. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 37
  • 38. Scope in OpenMP ā€¢ A variable that can be accessed by all the threads in the team has shared scope. ā€¢ A variable that can only be accessed by a single thread has private scope. ā€¢ The default scope for variables declared before a parallel block is shared. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 38
  • 39. Loop.c ā€¢ Lets go over our first code ā€“ loop-serial.c ā€¢ Now edit to employ omp, save as loop- parallel.c Copyright Ā© 2010, Elsevier Inc. All rights Reserved 39
  • 40. Performance Copyright Ā© 2010, Elsevier Inc. All rights Reserved 40
  • 41. Taking Timings ā€¢ What is time? ā€¢ Start to finish? ā€¢ A program segment of interest? ā€¢ CPU time? ā€¢ Wall clock time? Copyright Ā© 2010, Elsevier Inc. All rights Reserved 41
  • 42. Taking Timings Copyright Ā© 2010, Elsevier Inc. All rights Reserved theoretical function MPI_Wtime omp_get_wtime 42
  • 43. ā€¢ Number of threads = p ā€¢ Serial run-time = Tserial ā€¢ Parallel run-time = Tparallel Copyright Ā© 2010, Elsevier Inc. All rights Reserved Tparallel = Tserial / p Speedup Tserial Tparallel S = 43
  • 44. Scalability ā€¢ In general, a problem is scalable if it can handle ever increasing problem sizes. ā€¢ If we increase the number of processes/threads and keep the efficiency fixed without increasing problem size, the problem is strongly scalable. ā€¢ If we keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, the problem is weakly scalable. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 44
  • 45. Studying Scalability Input size (n) Number of threads (p) 1 2 4 8 16 1,000 2,000 4,000 8,000 16,000 Table records the parallel runtime (in seconds) for varying values of n and p. 45 It is conventional to test scalability in powers of two (or by doubling n and p).
  • 46. Studying Scalability Input size (n) Number of threads (p) 1 2 4 8 16 1,000 800 410 201 150 100 2,000 1,601 802 409 210 120 4,000 3,100 1,504 789 399 208 8,000 6,010 3,005 1,500 758 376 16,000 12,000 6,000 3,001 1,509 758 It is conventional to test scalability in powers of two (or by doubling n and p). Table records the parallel runtime (in seconds) for varying values of n and p. Strong scaling behavior Weak scaling behavior 46
  • 47. Studying Scalability 0 2000 4000 6000 8000 10000 12000 14000 1 2 4 8 16 n=1,000 n=2,000 n=4,000 n=8,000 n=16,000 p Time (sec) 0 2 4 6 8 10 12 14 16 18 1 2 4 8 16 n=1,000 n=2,000 n=4,000 n=8,000 n=16,000 Ideal Speedup p Copyright Ā© 2010, Elsevier Inc. All rights Reserved 47
  • 48. Timings with loop-parallel.c (your code) or loop.c (our code) Copyright Ā© 2010, Elsevier Inc. All rights Reserved 48 You try! < 10 minutes
  • 49. x1 x1 +2 x3 x4 x5 x6 x7 +1 +2 +3 +4 +5 +6 x2 x1+2 +3 x1+2+ 3+4 x1+2+3+ 4+5 x1+2+3+4 +5+6 x1+2+3+4+5 +6+7 x1+2+3+4+5+ 6+7+8 x8 Serial Process (1 thread, 7 operations) Serial vs. Parallel Reduction 49 +7
  • 50. x1+2+3+4+5+ 6+7+8 x5+6+7+8 x1+2+3+4 x1+2 x1+2 x1+2 x1 x2 x3 x4 x5 x6 x7 x8 t1 t2 t3 t4 t5 t6 t7 t8 x1+2 +1 +2 +3 Parallel Process (8 threads, 3 operations) 50 log2 p stages
  • 51. Reduction operators ā€¢ A reduction operator is a binary operation (such as addition or multiplication). ā€¢ A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result. ā€¢ All of the intermediate results of the operation should be stored in the same variable: the reduction variable. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 51
  • 52. Computing a sum ā€¢ Open sumcompute-serial.c ā€¢ Lets go over it Copyright Ā© 2010, Elsevier Inc. All rights Reserved 52 < 5 minutes
  • 53. Mutual exclusion Copyright Ā© 2010, Elsevier Inc. All rights Reserved # pragma omp critical global_result += my_result ; only one thread can execute the following structured block at a time 53
  • 54. Example ā€¢ Open sync-unsafe.c Copyright Ā© 2010, Elsevier Inc. All rights Reserved 54
  • 55. Synchronization ā€¢ Synchronization imposes order constraints and is used to protect access to shared data ā€¢ Types of synchronization: ā€“ critical ā€“ atomic ā€“ locks ā€“ others (barrier, ordered, flush) ā€¢ We will work on an exercise involving critical, atomic, and locks Copyright Ā© 2010, Elsevier Inc. All rights Reserved 55
  • 56. Critical #pragma omp parallel for schedule(static) shared(a) for(i = 0; i < n; i++) { #pragma omp critical { a = a+1; } } Threads wait here: only one thread at a time does the operation: ā€œa = a+1ā€. So this is a piece of sequential code inside the for loop. Copyright Ā© 2010, Elsevier Inc. All rights Reserved 56
  • 57. Atomic ā€¢ Atomic provides mutual exclusion but only applies to the load/update of a memory location ā€¢ It is applied only to the (single) assignment statement that immediately follows it ā€¢ Atomic construct may only be used together with an expression statement with one of operations: +, *, -, /, &, ^, |, <<, >> ā€¢ Atomic construct does not prevent multiple threads from executing the function() at the same time (see the example below) Code example: int ic, i, n; ic = 0; #pragma omp parallel shared(n,ic) private(i) for (i=0; i++; i<n) { #pragma omp atomic ic = ic + function(c); } Atomic only protects the update of ic Copyright Ā© 2010, Elsevier Inc. All rights Reserved 57
  • 58. Locks • A lock consists of a data structure and functions that allow the programmer to explicitly enforce mutual exclusion in a critical section.
    Acquire lock(i)
    // critical section
    Release lock(i)
  Threads queue up; only one thread executes the critical section at a time. Differences from a critical section: you can have multiple locks, and a thread can try for any specific lock. This lets us acquire data-level locks, e.g., two threads can update different array indices without waiting for each other (see the sketch below). 58
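    A minimal sketch of data-level locking with one lock per array element; counts[], NBUCKETS, and the histogram() wrapper are illustrative names, not part of the workshop code.
    #include <omp.h>

    #define NBUCKETS 128
    long counts[NBUCKETS];                    /* shared counters, zero-initialized at file scope */
    omp_lock_t locks[NBUCKETS];               /* one lock per array element */

    void histogram(const int *data, int n) {
        for (int b = 0; b < NBUCKETS; b++)
            omp_init_lock(&locks[b]);

        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            int b = data[i] % NBUCKETS;       /* which element this iteration updates */
            omp_set_lock(&locks[b]);
            counts[b] += 1;                   /* only updates to the same element are serialized */
            omp_unset_lock(&locks[b]);
        }

        for (int b = 0; b < NBUCKETS; b++)
            omp_destroy_lock(&locks[b]);
    }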
  • 59. Illustration of Locking Operation • The protected region contains the update of a shared variable • One thread acquires the lock and performs the update • Meanwhile, the other threads perform some other work • When the lock is released, the next thread acquires it and performs its update, and so on Copyright © 2010, Elsevier Inc. All rights Reserved 59
  • 60. A Locks Code Example
    long long int a = 0;
    long long int i;
    omp_lock_t my_lock;                  // 1. Define the lock variable
    omp_init_lock(&my_lock);             // 2. Initialize the lock
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        omp_set_lock(&my_lock);          // 3. Set the lock
        a += 1;
        omp_unset_lock(&my_lock);        // 4. Unset the lock
    }
    omp_destroy_lock(&my_lock);          // 5. Destroy the lock
  Compiling and running sync.c:
    gcc -g -Wall -fopenmp -o sync sync.c
    ./sync #of-iteration #of-threads
  Copyright © 2010, Elsevier Inc. All rights Reserved 60
  • 61. Some Caveats 1. You shouldn't mix the different types of mutual exclusion for a single critical section. 2. There is no guarantee of fairness in mutual exclusion constructs. 3. It can be dangerous to "nest" mutual exclusion constructs. Copyright © 2010, Elsevier Inc. All rights Reserved 61
  • 62. loop.c example – Default schedule:
    #pragma omp parallel for schedule(static) private(a)   // creates N threads to run the next enclosed block
    for (i = 0; i < loops; i++) {
        a = 6 + 7*8;
    }
  Copyright © 2010, Elsevier Inc. All rights Reserved 62
  • 63. The Runtime Schedule Type • The system uses the environment variable OMP_SCHEDULE to determine at run time how to schedule the loop. • The OMP_SCHEDULE environment variable can take on any of the values that can be used for a static, dynamic, or guided schedule. Copyright © 2010, Elsevier Inc. All rights Reserved 63
  • 64. schedule(type, chunksize) controls how loop iterations are assigned to threads. – static: iterations are assigned before the loop is executed. – dynamic or guided: iterations are assigned while the loop is executing. – auto / runtime: determined by the compiler and/or the run-time system. – chunksize: consecutive iterations are broken into chunks; chunksize (a positive integer, default 1) is the number of iterations in each chunk. A short sketch of the clause follows. Copyright © 2010, Elsevier Inc. All rights Reserved 64
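    A minimal sketch of the clause in use; work(), n, and the chunk sizes are illustrative:
    /* Hand out chunks of 4 iterations to threads on demand. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++)
        work(i);

    /* Defer the choice to run time; the schedule is then read from OMP_SCHEDULE,
       e.g.  export OMP_SCHEDULE="guided,8"  before running the program. */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        work(i);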
  • 65. Schedule types can prevent load imbalance: [Diagram: threads 0–7 over time; with a static schedule the first thread to finish sits idle until the last thread finishes, while with a dynamic schedule the threads finish at roughly the same time.] Copyright © 2010, Elsevier Inc. All rights Reserved 65
  • 66. static (the default) and static,n (set chunksize): [Diagram: the iteration space 0 ... N-1 assigned to threads 0–3; static gives each thread one contiguous block, while static,n deals out chunks of n consecutive iterations to the threads in round-robin order.] Copyright © 2010, Elsevier Inc. All rights Reserved 66
  • 67. dynamic: a thread executes a chunk and, when done, requests another one. [Diagram: chunks of the iteration space 0 ... N-1 handed out to threads 0–3 on demand, in no fixed order.] Copyright © 2010, Elsevier Inc. All rights Reserved 67
  • 68. guided: a thread executes a chunk and, when done, requests another one; new chunks decrease in size (until chunksize is reached). [Diagram: chunks of the iteration space 0 ... N-1 shrink over time as they are handed to threads 0–3.] Copyright © 2010, Elsevier Inc. All rights Reserved 68
  • 69. multiplication.c • Go over this code at a conceptual level – show where and how to parallelize Copyright © 2010, Elsevier Inc. All rights Reserved 69
  • 70. Matrix-vector multiplication Copyright © 2010, Elsevier Inc. All rights Reserved 70
  • 71. M x V = X: Parallelization Strategies – Row Decomposition. [Diagram: the m x n matrix is split into horizontal blocks of rows, one block per thread; each block is multiplied by the n x 1 vector to produce part of the m x 1 result.] • Each thread gets m/p rows • Time taken per thread is proportional to (mn)/p • No need for any synchronization (static scheduling will do); see the sketch below. 71
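    A minimal sketch of the row decomposition, assuming M, V, X, m, and n are declared elsewhere with the dimensions shown on the slide:
    /* Each thread computes a block of rows of X = M x V independently. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; i++) {
        X[i] = 0.0;
        for (int j = 0; j < n; j++)
            X[i] += M[i][j] * V[j];   /* no shared writes: each row belongs to exactly one thread */
    }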
  • 72. M x V = X: Parallelization Strategies – Column Decomposition. [Diagram: the m x n matrix is split into vertical blocks of columns, one block per thread; each thread's partial result vector (x0, x1, x2, x3) must then be summed, element by element, into X[i], i.e., a reduction.] • Each thread gets n/p columns • Time taken per thread is proportional to (mn)/p plus the time for the reduction; see the sketch below. 72
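    A minimal sketch of the column decomposition; it assumes an OpenMP 4.5 or newer compiler, since it uses an array-section reduction to combine the per-thread partial result vectors, and the same M, V, X, m, n as above:
    for (int i = 0; i < m; i++)
        X[i] = 0.0;

    /* Each thread handles a block of columns, accumulating into a private copy of X;
       the private copies are summed element-wise into X when the loop ends. */
    #pragma omp parallel for reduction(+:X[0:m])
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            X[i] += M[i][j] * V[j];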
  • 73. Extra slides Copyright © 2010, Elsevier Inc. All rights Reserved 73
  • 74. What happened? 1. OpenMP compilers don't check for dependences among iterations in a loop that's being parallelized with a parallel for directive. 2. A loop in which the results of one or more iterations depend on other iterations cannot, in general, be correctly parallelized by OpenMP. An example follows. Copyright © 2010, Elsevier Inc. All rights Reserved 74
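    For instance, the following sketch (a Fibonacci loop, not part of the workshop code) has a loop-carried dependence: iteration i needs the values written by iterations i-1 and i-2, so the parallel for directive silently produces wrong results.
    fib[0] = fib[1] = 1;
    #pragma omp parallel for              /* WRONG: the compiler will not warn you */
    for (int i = 2; i < n; i++)
        fib[i] = fib[i-1] + fib[i-2];     /* depends on the results of earlier iterations */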
  • 75. Estimating π Copyright © 2010, Elsevier Inc. All rights Reserved 75
  • 76. OpenMP solution #1 Copyright © 2010, Elsevier Inc. All rights Reserved loop dependency 76
  • 77. OpenMP solution #2 Copyright © 2010, Elsevier Inc. All rights Reserved Ensures factor has private scope. 77
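    A sketch of the kind of code these two slides refer to: estimating π from the series π = 4(1 - 1/3 + 1/5 - 1/7 + ...), with n terms. The exact variable names on the original slides may differ. Giving factor private scope and making sum a reduction variable removes the loop dependency of solution #1.
    double sum = 0.0, factor;
    #pragma omp parallel for private(factor) reduction(+:sum)
    for (long k = 0; k < n; k++) {
        factor = (k % 2 == 0) ? 1.0 : -1.0;   /* recomputed each iteration, not carried over */
        sum += factor / (2*k + 1);
    }
    double pi_estimate = 4.0 * sum;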
  • 78. Copyright © 2010, Elsevier Inc. All rights Reserved Thread-Safety 78
  • 79. An operating system "process" • An instance of a computer program that is being executed. • Components of a process: – The executable machine language program. – A block of memory. – Descriptors of resources the OS has allocated to the process. – Security information. – Information about the state of the process. Copyright © 2010, Elsevier Inc. All rights Reserved 79
  • 80. Shared Memory System • Each processor can access each memory location. • The processors usually communicate implicitly by accessing shared data structures. • Example: multiple CPU cores on a single chip Copyright © 2010, Elsevier Inc. All rights Reserved Kamiak compute nodes have multiple CPUs each with multiple cores 80
  • 81. UMA multicore system Copyright © 2010, Elsevier Inc. All rights Reserved Figure 2.5 Time to access all the memory locations will be the same for all the cores. 81
  • 82. NUMA multicore system Copyright © 2010, Elsevier Inc. All rights Reserved Figure 2.6 A memory location a core is directly connected to can be accessed faster than a memory location that must be accessed through another chip. 82
  • 83. Input and Output • Because of the indeterminacy of the order of output to stdout, in most cases only a single process/thread will be used for all output to stdout other than debugging output. • Debug output should always include the rank or id of the process/thread that's generating the output. Copyright © 2010, Elsevier Inc. All rights Reserved 83
  • 84. Input and Output • Only a single process/thread will attempt to access any single file other than stdin, stdout, or stderr. So, for example, each process/thread can open its own, private file for reading or writing, but no two processes/threads will open the same file. Copyright © 2010, Elsevier Inc. All rights Reserved 84
  • 85. Division of work – data parallelism Copyright © 2010, Elsevier Inc. All rights Reserved 85
  • 86. Division of work – task parallelism Copyright © 2010, Elsevier Inc. All rights Reserved Tasks 1) Receiving 2) Addition 86
  • 87. Distributed Memory on Kamiak Copyright © 2010, Elsevier Inc. All rights Reserved [Figure 2.4: compute node 1, compute node 2, compute node 3, ..., compute node N connected by a network] Each compute node has 1-2 CPUs each with 10-14 cores 87
  • 88. The burden is on software • Hardware and compilers can keep up the pace needed. • From now on… – In shared memory programs: • Start a single process and fork threads. • Threads carry out tasks. – In distributed memory programs: • Start multiple processes. • Processes carry out tasks. Copyright © 2010, Elsevier Inc. All rights Reserved 89
  • 89. Writing Parallel Programs Copyright © 2010, Elsevier Inc. All rights Reserved
    double x[n], y[n];
    ...
    for (i = 0; i < n; i++)
        x[i] += y[i];
  1. Divide the work among the processes/threads (a) so each process/thread gets roughly the same amount of work and (b) communication is minimized. 2. Arrange for the processes/threads to synchronize. 3. Arrange for communication among processes/threads. An OpenMP version of this loop is sketched below. 90
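    A minimal OpenMP sketch of the same loop: the parallel for directive divides the n iterations among the threads in roughly equal blocks, which takes care of the work division for this simple case.
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        x[i] += y[i];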
  • 91. OpenMP • An API for shared-memory parallel programming. • MP = multiprocessing • Designed for systems in which each thread or process can potentially have access to all available memory. • The system is viewed as a collection of cores or CPUs, all of which have access to main memory. Copyright © 2010, Elsevier Inc. All rights Reserved 92
  • 92. A process forking and joining two threads Copyright © 2010, Elsevier Inc. All rights Reserved 93
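    A minimal sketch of that fork/join structure in OpenMP; the printed message is illustrative.
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel num_threads(2)          /* fork: a team of two threads runs the block */
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                            /* join: only the initial thread continues here */
        return 0;
    }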
  • 93. Of note… • There may be system-defined limitations on the number of threads that a program can start. • The OpenMP standard doesn't guarantee that this will actually start thread_count threads. • Most current systems can start hundreds or even thousands of threads. • Unless we're trying to start a lot of threads, we will almost always get the desired number of threads (a sketch of checking the count actually received follows). Copyright © 2010, Elsevier Inc. All rights Reserved 94
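    A minimal sketch of that check, assuming thread_count holds the requested number and that <stdio.h> and <omp.h> are included:
    #pragma omp parallel num_threads(thread_count)
    {
        int got = omp_get_num_threads();             /* how many threads we actually received */
        if (omp_get_thread_num() == 0 && got != thread_count)
            printf("Requested %d threads, got %d\n", thread_count, got);
    }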
  • 94. Exercise: take the serial version and turn it into the OpenMP version of loop.c. We will show a small example code that has the pragma and OpenMP calls in it; you then adapt loop.c to use OpenMP (about 10 minutes, while we go around the room and help), after which we will walk through our parallel version of loop.c. Copyright © 2010, Elsevier Inc. All rights Reserved 95
  • 95. SCHEDULING LOOPS IN SYNC.C Copyright © 2010, Elsevier Inc. All rights Reserved 96
  • 96. Locks • A lock implies a memory fence of all thread-visible variables • The lock routines are used to guarantee that only one thread accesses a variable at a time, to avoid race conditions • C/C++ lock variables must have type omp_lock_t or omp_nest_lock_t (nested locks are not discussed in this workshop) • All lock functions take an argument that is a pointer to omp_lock_t or omp_nest_lock_t • Simple lock functions: – omp_init_lock(omp_lock_t*); – omp_set_lock(omp_lock_t*); – omp_unset_lock(omp_lock_t*); – omp_test_lock(omp_lock_t*); – omp_destroy_lock(omp_lock_t*); Copyright © 2010, Elsevier Inc. All rights Reserved 97
  • 97. How to Use Locks 1) Define the lock variables 2) Initialize the lock via a call to omp_init_lock 3) Set the lock using omp_set_lock or omp_test_lock. The latter checks whether the lock is actually available before attempting to set it, which is useful for keeping threads busy instead of blocking (asynchronous execution); see the sketch below. 4) Unset the lock after the work is done via a call to omp_unset_lock 5) Remove the lock association via a call to omp_destroy_lock Copyright © 2010, Elsevier Inc. All rights Reserved 98
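    A minimal sketch of omp_test_lock; update_shared_total() and do_local_work() are hypothetical helpers, and my_lock is assumed to be an already-initialized omp_lock_t.
    if (omp_test_lock(&my_lock)) {       /* returns nonzero if the lock was acquired */
        update_shared_total();           /* the critical section */
        omp_unset_lock(&my_lock);
    } else {
        do_local_work();                 /* lock was busy: do something useful instead of waiting */
    }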
  • 98. Matrix-vector multiplication Copyright © 2010, Elsevier Inc. All rights Reserved Run-times and efficiencies of matrix-vector multiplication (times are in seconds) 99
