13. Memory
• Two major classes of parallel programming models:
  – Shared Memory
  – Distributed Memory
14. Shared Memory Architecture
[Figure: a compute node containing memory (a shared address space) connected by a memory bus to several CPU cores, each with its own cache ($); an I/O bus connects to disk. Threads run on the cores.]
All threads can see a single shared address space. Therefore, they can see each other's data.
Each compute node has 1-2 CPUs, each with 10-14 cores.
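As a sketch of this shared-address-space model, here is a minimal OpenMP example (the array name and size are illustrative, not from the slides): every thread writes into the same shared array, and the main thread can then read back what each thread stored.

/* Compile with: gcc -fopenmp. The array "data" lives in the single
 * shared address space, so every thread can read and write it; the
 * loop index decides which piece of the data each thread touches. */
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    int data[N];                         /* shared: visible to all threads */

    #pragma omp parallel for             /* threads split the iterations */
    for (int i = 0; i < N; i++) {
        data[i] = omp_get_thread_num();  /* each thread writes its own id */
    }

    for (int i = 0; i < N; i++)          /* back on one thread: read it all */
        printf("data[%d] was written by thread %d\n", i, data[i]);

    return 0;
}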
17. Distributed Memory Architecture
[Figure: compute nodes 1 through m, each with its own local memory and processes running on its cores, joined by a network interconnect.]
Processes cannot see each other's memory address space.
They have to send inter-process messages (using MPI).
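As a sketch of message passing between separate address spaces, here is a minimal MPI example (the value sent is illustrative): rank 0 must explicitly send its data before rank 1 can see it, because rank 1 cannot simply read rank 0's variable.

/* Compile with mpicc, run with: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                    /* lives only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}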
18. Distributed Memory System
• Clusters (most popular)
  – A collection of commodity systems.
  – Connected by a commodity interconnection network.
• Nodes of a cluster are individual computers joined by a communication network.
a.k.a. hybrid systems
Kamiak provides an Infiniband interconnect between all compute nodes.
19. Single Program Models: SIMD vs. MIMD
• SP: Single Program
  – Your parallel program is a single program that you execute on all threads (or processes).
• SI: Single Instruction
  – Each thread should be executing the same line of code at any given clock cycle.
• MI: Multiple Instruction
  – Each thread (or process) could be independently running a different line of your code (instruction) concurrently.
• MD: Multiple Data
  – Each thread (or process) could be operating on/accessing a different piece of the data from memory concurrently.
20. Single Program Models: SIMD vs. MIMD
[Figure: two copies of the same parallel region of code (from "// Begin: parallel region" to "// End: parallel region"), shown for threads 1, 2, and 3. SIMD side: all threads are executing the same line of code, though they may be accessing different pieces of data. MIMD side: each thread may be on a different line of the parallel region.]
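A small OpenMP sketch of this single-program idea (array names and sizes are illustrative): the first loop is SIMD-style, with every thread running the same statement on different data, while the second region is MIMD-style, with threads diverging onto different lines of the same program.

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    double a[N], b[N];

    /* SIMD-style: every thread executes the same line on different data */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;
        b[i] = 0.0;
    }

    /* MIMD-style: threads diverge onto different lines of the program */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        if (tid < N) {                   /* at most N threads touch b */
            if (tid % 2 == 0)
                b[tid] = a[tid] + 1.0;   /* even-numbered threads do one thing */
            else
                b[tid] = a[tid] - 1.0;   /* odd-numbered threads do another */
        }
    }

    printf("a[1] = %.1f, b[1] = %.1f\n", a[1], b[1]);
    return 0;
}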
45. Studying Scalability
[Table skeleton: rows for input size n = 1,000; 2,000; 4,000; 8,000; 16,000 and columns for number of threads p = 1, 2, 4, 8, 16; each cell will record the parallel runtime (in seconds) for that (n, p) pair.]
It is conventional to test scalability in powers of two (i.e., by doubling n and p).
46. Studying Scalability
Table records the parallel runtime (in seconds) for varying values of n and p.
It is conventional to test scalability in powers of two (i.e., by doubling n and p).

Input size (n)    p = 1    p = 2    p = 4    p = 8    p = 16
 1,000              800      410      201      150       100
 2,000            1,601      802      409      210       120
 4,000            3,100    1,504      789      399       208
 8,000            6,010    3,005    1,500      758       376
16,000           12,000    6,000    3,001    1,509       758

Strong scaling behavior: read across a row (n fixed, p increasing).
Weak scaling behavior: read along the diagonal (n and p doubled together).
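As a worked example (computed from the table, not given on the slides), the n = 16,000 row can be turned into speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p:

#include <stdio.h>

int main(void) {
    double t[] = {12000, 6000, 3001, 1509, 758};   /* runtimes (s) for n = 16,000 */
    int    p[] = {1, 2, 4, 8, 16};

    for (int i = 0; i < 5; i++) {
        double speedup    = t[0] / t[i];           /* S(p) = T(1)/T(p) */
        double efficiency = speedup / p[i];        /* E(p) = S(p)/p    */
        printf("p=%2d  T=%6.0f s  speedup=%5.2f  efficiency=%.2f\n",
               p[i], t[i], speedup, efficiency);
    }
    return 0;
}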
58. Locks
• A lock consists of a data structure and functions that allow the programmer to explicitly enforce mutual exclusion in a critical section.
[Figure: the pattern Acquire lock(i) / critical section / Release lock(i). Threads queue up at the acquire; only one thread at a time executes the critical section.]
Difference from a critical section:
  – You can have multiple locks.
  – A thread can try for any specific lock.
  – We can therefore use locks to implement data-level locking, e.g., two threads can access different array indices without waiting.
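A sketch of such data-level locking with OpenMP locks (the bin layout and counts are illustrative): one lock per array bin, so threads updating different bins never wait for each other, while threads hitting the same bin are serialized.

#include <stdio.h>
#include <omp.h>

#define NBINS  8
#define NITEMS 10000

int main(void) {
    int count[NBINS] = {0};
    omp_lock_t lock[NBINS];

    for (int b = 0; b < NBINS; b++)
        omp_init_lock(&lock[b]);           /* one lock per bin */

    #pragma omp parallel for
    for (int i = 0; i < NITEMS; i++) {
        int b = i % NBINS;                 /* bin this item belongs to */
        omp_set_lock(&lock[b]);            /* acquire lock(b): other threads on bin b queue up */
        count[b]++;                        /* critical section for bin b only */
        omp_unset_lock(&lock[b]);          /* release lock(b) */
    }

    for (int b = 0; b < NBINS; b++) {
        printf("bin %d: %d\n", b, count[b]);
        omp_destroy_lock(&lock[b]);
    }
    return 0;
}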
71. M x V = X: Parallelization Strategies
[Figure: Row Decomposition. The m x n matrix M is split into horizontal blocks of rows, one block per thread (Thread 0 through Thread 3); each block is multiplied by the n x 1 vector V to produce the corresponding entries of the m x 1 result X.]
• Each thread gets m/p rows.
• Time taken per thread is proportional to (mn)/p.
• No need for any synchronization (static scheduling will do).
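A minimal OpenMP sketch of the row decomposition (matrix sizes and values are illustrative): a static schedule hands each thread a contiguous block of roughly m/p rows, and no synchronization is needed because each thread writes disjoint entries of X.

#include <stdio.h>

#define M 8
#define N 4

int main(void) {
    double A[M][N], V[N], X[M];

    for (int j = 0; j < N; j++) V[j] = 1.0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = i + j;

    #pragma omp parallel for schedule(static)   /* each thread gets ~M/p contiguous rows */
    for (int i = 0; i < M; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * V[j];              /* one full row times V */
        X[i] = sum;                             /* no other thread writes X[i] */
    }

    for (int i = 0; i < M; i++)
        printf("X[%d] = %.1f\n", i, X[i]);
    return 0;
}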
72. M x V = X: Parallelization Strategies
[Figure: Column Decomposition. The m x n matrix M is split into vertical blocks of columns, one block per thread (Thread 0 through Thread 3); each thread multiplies its columns by the matching entries of the n x 1 vector V, producing a partial m x 1 result.]
• Each thread gets n/p columns.
• Time taken per thread is proportional to (mn)/p plus the time for the reduction.
• Reduction: each output entry is the sum of the per-thread partial results, X[i] = x0 + x1 + x2 + x3.
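A sketch of the column decomposition using an OpenMP array reduction, which requires OpenMP 4.5 or later (matrix sizes and values are illustrative): each thread accumulates into a private copy of X, and the private copies are summed in the reduction step.

#include <stdio.h>

#define M 8
#define N 4

int main(void) {
    double A[M][N], V[N], X[M] = {0.0};

    for (int j = 0; j < N; j++) V[j] = 1.0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = i + j;

    /* Threads split the columns; each accumulates into a private copy
       of X, and the copies are summed at the end (the reduction). */
    #pragma omp parallel for reduction(+:X)
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            X[i] += A[i][j] * V[j];

    for (int i = 0; i < M; i++)
        printf("X[%d] = %.1f\n", i, X[i]);
    return 0;
}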
Shared-memory: the cores can share access to the computer's memory. Coordinate the cores by having them examine and update shared memory locations. Example: a single system with multiple CPU cores.
Please refer to the detailed slides on cache coherence and thread-safety in Chapter 4.