2. OpenMP: What is it?
OpenMP is an Application Program Interface (API) for
• explicit,
• portable,
• shared-memory parallel programming
• in C/C++ and Fortran.
OpenMP consists of
• compiler directives,
• runtime calls and
• environment variables.
It is supported by all major compilers on Unix and
Windows platforms:
GNU, IBM, Oracle, Intel, PGI, Absoft, Lahey/Fujitsu,
PathScale, HP, MS, Cray.
3. OpenMP Programming Model
➢ Designed for multi-processor/core, shared
memory machines.
➢ OpenMP programs accomplish parallelism
exclusively through the use of threads.
➢ Programmer has full control over
parallelization.
➢ Consists of a set of #pragmas (Compiler
Instructions/ Directives) that control how the
program works.
4. OpenMP: Core Elements
Directives & Pragmas
▪ Forking Threads (parallel region)
▪ Work Sharing
▪ Synchronization
▪ Data Environment
User level runtime functions & Env. variables
5. Thread Creation/Fork-Join
All OpenMP programs begin as a single process: the master
thread.
The master thread executes sequentially until the
first parallel region construct is encountered.
FORK: the master thread then creates a team of
parallel threads.
The statements in the program that are enclosed by the
parallel region construct are then executed in parallel
among the various team threads.
JOIN: When the team threads complete the statements in
the parallel region construct, they synchronize and
terminate, leaving only the master thread.
6. Thread Creation/Fork-Join
The master thread spawns a team of threads as needed.
Parallelism is added incrementally until performance goals are
met, i.e. the sequential program evolves into a parallel
program.
7. OpenMP Run Time Variables
❖Modify/check/get info about the number of threads
omp_get_num_threads() //number of threads in use
omp_get_thread_num() //tells which thread you are
omp_get_max_threads() //max threads that can be used
❖Are we in a parallel region?
omp_in_parallel()
❖How many processors in the system?
omp_get_num_procs()
❖Explicit locks
omp_[set|unset]_lock()
And several more...
8. OpenMP: Few Syntax Details
❖Most of the constructs in OpenMP are compiler directives or
pragmas
For C/C++ the pragmas take the form
#pragma omp construct [clause [clause]…]
For Fortran, the directives take one of the forms
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
❖Header File or Fortran 90 module
#include <omp.h>
use omp_lib
10. Compiling OpenMP code
❖Same code can run on single-core or multi-core machines
❖Compiler directives are picked up ONLY when the
program is instructed to be compiled in OpenMP mode.
❖Method depends on the compiler
G++
$ g++ -o foo foo.c -fopenmp
ICC
$ icc -o foo foo.c -fopenmp
11. Running OpenMP code
❖Controlling the number of threads at runtime
The default number of threads = number of online
processors on the machine.
C shell : setenv OMP_NUM_THREADS number
Bash shell: export OMP_NUM_THREADS=number
Runtime OpenMP function: omp_set_num_threads(4)
Clause in #pragma for the parallel region
❖Execution Timing: #include <omp.h>
double stime = omp_get_wtime();
longfunction();
double etime = omp_get_wtime();
double total = etime - stime;
12. Thread Creation/Fork-Join
To create a 4-thread parallel region:
Each thread calls pooh(ID,A) for ID = 0 to 3
13. OpenMP: Core Elements
Directives & Pragmas
▪ Forking Threads (parallel region)
▪ Work Sharing
▪ Synchronization
▪ Data Environment
User level runtime functions & Env. variables
14. Data vs. Task Parallelism
Data parallelism
A large number of data elements, where each element
(or possibly a subset of elements) needs to be processed
to produce a result. When this processing can be done in
parallel, we have data parallelism.
Task parallelism
A collection of tasks that need to be completed. If
these tasks can be performed in parallel, you are faced
with a task-parallel job.
15. OpenMP: Work Sharing
A work-sharing construct divides the
execution of the enclosed code region among
different threads.
Categories of work sharing in OpenMP:
• omp for
• omp sections
16. Threads are assigned
independent sets of iterations.
Threads must wait at the end
of the work sharing construct.
#pragma omp for
#pragma omp parallel for
18. Schedule Clause
How is the work divided among threads?
Directives for work distribution
19. OpenMP for Parallelization
for (int i = 2; i < 10; i++)
{
x[i] = a * x[i-1] + b;
}
Can all loops be parallelized?
Loop iterations have to be independent.
Simple test: if the results differ when the code is executed
backwards, the loop cannot be parallelized!
Between 2 synchronization points, if at least 1 thread
writes to a memory location that at least 1 other thread
reads from => the result is non-deterministic.
20. Work Sharing: sections
The SECTIONS directive is a non-iterative work-sharing
construct.
➢ It specifies that the enclosed section(s) of code are to be
divided among the threads in the team.
➢ Each SECTION is executed ONCE by a thread in the
team.
22. OpenMP: Core Elements
Directives & Pragmas
▪ Forking Threads (parallel region)
▪ Work Sharing
▪ Synchronization
▪ Data Environment
User level runtime functions & Env. variables
23. Synchronization Constructs
Synchronization is achieved by:
1) Barriers (Task Dependencies)
Implicit : sync points exist at the end of
parallel – necessary barrier – can't be removed
for – can be removed by using the nowait clause
sections – can be removed by using the nowait clause
single – can be removed by using the nowait clause
Explicit : must be used when ordering is required
#pragma omp barrier
Each thread waits until all threads arrive at the barrier.
25. Data Dependencies
OpenMP assumes that there is NO data
dependency across jobs running in parallel.
When the omp parallel directive is placed around
a code block, it is the programmer's
responsibility to make sure data dependency is
ruled out.
26. Race Condition
Non-deterministic behaviour:
Two or more threads access a shared variable at the same time.
When threads A and B are both executing an update of the same
variable, the final value depends on the interleaving.
27. Synchronization Constructs
2) Mutual Exclusion (Data Dependencies)
Critical sections : protect access to shared & modifiable data,
allowing ONLY ONE thread to enter at a given time
#pragma omp critical
#pragma omp atomic – special case of critical, less overhead;
only one thread updates the variable at a time
Locks
29. OpenMP: Core Elements
Directives & Pragmas
▪ Forking Threads (parallel region)
▪ Work Sharing
▪ Synchronization
▪ Data Environment
User level runtime functions & Env. variables
30. OpenMP: Data Scoping
Challenge in shared-memory parallelization => managing the data environment
Scoping
OpenMP shared variable : can be read/written by all threads in the team.
OpenMP private variable : each thread has its own local copy of this variable.
int i;               // shared
int j;
#pragma omp parallel private(j)
{
int k;               // private: local to the parallel region
i = …….
j = ……..             // private: listed in the private clause
k = …
}
Loop variables in an omp for are private;
local variables in the parallel region are private.
Alter the default behaviour with the default
clause:
#pragma omp parallel default(shared) private(x)
{ ... }
#pragma omp parallel default(none) shared(matrix) private(x)
{ ... }
(In C/C++ only default(shared) and default(none) are allowed;
default(private) exists only in Fortran.)
31. OpenMP: private Clause
• Creates a separate copy of the variable for each thread.
• Private variables are not initialized on entry.
• The value that thread 1 stores in x is different from
the value thread 2 stores in x.
32. OpenMP Parallel Programming
➢ Start with a parallelizable algorithm
Loop level parallelism
➢ Implement Serially : Optimized Serial Program
➢ Test, Debug & Time to solution
➢ Annotate the code with parallelization and
Synchronization directives
➢ Remove Race Conditions, False Sharing
➢ Test and Debug
➢ Measure speed-up
33. Problem: Count the number of times each ASCII character occurs in a page of text
Input: ASCII text, stored as an ARRAY of characters; number of bins (128)
Output: Histogram with 128 buckets – one for each ASCII character
➢Start with a parallelizable algorithm
▪Loop level parallelism?
void compute_histogram_st(char *page, int page_size, int *histogram)
{
for(int i = 0; i < page_size; i++){
char read_character = page[i];
histogram[read_character]++;
}
}
Can this loop be parallelized?
34. Annotate the code with parallelization and
synchronization directives
void compute_histogram_st(char *page, int page_size, int *histogram)
{
#pragma omp parallel for
for(int i = 0; i < page_size; i++) {
char read_character = page[i];
histogram[read_character]++;
}
}
This will not work! Why?
read_character is private, but histogram is shared:
the increment needs mutual exclusion (a critical section).
35. Problem: Count the number of times each ASCII character occurs in a page of text
Input: ASCII text, stored as an ARRAY of characters; number of bins (128)
Output: Histogram with 128 buckets – one for each ASCII character
Could be slower than the serial code:
overhead = critical section + parallelization.
void compute_histogram_st(char *page, int page_size, int *histogram)
{
#pragma omp parallel for
for(int i = 0; i < page_size; i++){
char read_character = page[i];
#pragma omp atomic
histogram[read_character]++;
}
}
36. void compute_histogram(char *page, int page_size, int *histogram, int num_bins)
{
#pragma omp parallel
{
int local_histogram[num_bins];        /* one private copy per thread        */
for(int i = 0; i < num_bins; i++)     /* a VLA cannot take an initializer,  */
local_histogram[i] = 0;               /* so zero it explicitly              */
#pragma omp for
for(int i = 0; i < page_size; i++){
char read_character = page[i];
local_histogram[read_character]++;    /* each thread updates its local copy */
}
#pragma omp critical
for(int i = 0; i < num_bins; i++){
histogram[i] += local_histogram[i];   /* combine the thread-local bins into */
}                                     /* the shared histogram               */
}
}
37. OpenMP: Reduction
One or more variables that are private to each thread are the subject of a
reduction operation at the end of the parallel region.
#pragma omp for reduction(operator : var)
Operators: + , * , - , & , | , && , || , ^
Combines the multiple local copies of var from the threads into a single
copy at the master.
sum = 0;
#pragma omp parallel for
for (int i = 0; i < 9; i++)
{
sum += a[i];   // race on the shared sum – the reduction clause fixes it
}
38. OpenMP: Reduction
sum = 0;
#pragma omp parallel for shared(a) reduction(+: sum)
for (int i = 0; i < 9; i++)
{
sum += a[i];
}
With 3 threads:
sum_loc1 = a[0] + a[1] + a[2]
sum_loc2 = a[3] + a[4] + a[5]
sum_loc3 = a[6] + a[7] + a[8]
sum = sum_loc1 + sum_loc2 + sum_loc3
40. Serial Code
static long num_steps = 100000;
double step;
void main ()
{
int i; double x, pi, sum = 0.0;
step = 1.0 / (double) num_steps;
for (i = 0; i < num_steps; i++)
{
x = (i + 0.5) * step;
sum = sum + 4.0 / (1.0 + x*x);
}
pi = step * sum;
}
41. Computing π by the method of Numerical Integration
Serial Code:
static long num_steps = 100000;
double step;
void main ()
{
int i; double x, pi, sum = 0.0;
step = 1.0 / (double) num_steps;
for (i = 0; i < num_steps; i++) {
x = (i + 0.5) * step;
sum = sum + 4.0 / (1.0 + x*x);
}
pi = step * sum;
}
Parallel Code:
#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000;
double step;
void main ()
{
int i; double x, pi, sum = 0.0;
step = 1.0 / (double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for reduction(+:sum) private(x)
for (i = 0; i < num_steps; i++) {
x = (i + 0.5) * step;
sum = sum + 4.0 / (1.0 + x*x);
}
pi = step * sum;
}