Options and trade-offs for
parallelism and concurrency in
Modern C++
Mats Brorsson
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• Conclusions
Parallelism everywhere
• Server level parallelism
• Distributed memory
• Multicore architectures
• Shared memory
• Instruction-level parallelism
• Vector parallelism
• Thread parallelism
• Hardware vs software threads
• Simultaneous multithreading
• ”Switch-on-event” multithreading
Vector vs Thread parallelism
• Vector parallelism maps naturally to
Regular Data Parallelism
• Inner loops can (sometimes) be vectorized
• Ways to vectorize:
• Auto-vectorization
• Vector Intrinsics
• _mm_add_ps(__m128 a, __m128 b)
• Compiler hints
• Cilk Plus array notation
• #pragma omp simd
Images courtesy Intel and Rebel Science News
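As a small illustration of the compiler-hint route, here is a sketch of an element-wise vector add annotated with #pragma omp simd. The function name vec_add is made up for this example. With GCC the pragma takes effect under -fopenmp or -fopenmp-simd; without those flags it is ignored and the loop still runs correctly as scalar code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise c = a + b (hypothetical helper, not from the slides).
// The pragma asks the compiler to vectorize the loop; whether it actually
// does depends on the compiler, the flags and the target.
void vec_add(const std::vector<float>& a,
             const std::vector<float>& b,
             std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The loop body has no dependence between iterations, which is the regular data parallelism that vector units are built for.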
Key features for Performance
• Data locality
• Chunks that fit in cache
• Reuse data locally
• Avoid cache conflicts
• Use few virtual pages
• Avoid false sharing
• Parallel slack
• Specify potential parallelism much
higher than the actual parallelism
• Load balance
• All threads have the same amount of
work to do
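To make the false-sharing point concrete, here is a sketch in which each thread accumulates into its own counter, padded so that two counters should not share a cache line. The names PaddedCounter and count_even are invented for this example, 64 bytes is an assumed cache-line size, and C++17 is assumed so that std::vector honours the over-alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One counter per thread, aligned to an ASSUMED 64-byte cache line so that
// neighbouring counters do not end up in the same line.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Counts the even values in data with nthreads threads (hypothetical helper).
long count_even(const std::vector<int>& data, unsigned nthreads) {
    assert(nthreads > 0);
    std::vector<PaddedCounter> counts(nthreads);
    std::vector<std::thread> workers;
    const std::size_t n = data.size();
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread works on its own contiguous slice of the input.
            const std::size_t begin = t * n / nthreads;
            const std::size_t end = (t + 1) * n / nthreads;
            for (std::size_t i = begin; i < end; ++i)
                if (data[i] % 2 == 0) ++counts[t].value;
        });
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (const auto& c : counts) total += c.value;
    return total;
}
```

Without the alignas, the result would be the same but the counters could sit in one cache line, so every increment might invalidate that line in the other cores' caches.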
Example: SAXPY, scaling of a vector
• SAXPY scales a vector x by a factor a and adds the vector y: y = a*x + y
• y is used for both input and output
• Single-precision floating point (DAXPY: double precision)
• Low arithmetic intensity
• Little arithmetic work compared to the amount of data consumed and produced
• Per element: 2 FLOPs, 8 bytes read and 4 bytes written
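For reference, a serial SAXPY can be sketched as follows. The name saxpy_serial is chosen for this example. Per element it does one multiply and one add while moving 12 bytes, which gives roughly 2/12 ≈ 0.17 FLOPs per byte.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = a*x + y, element by element (hypothetical helper name).
// Each iteration reads x[i] and y[i] (8 bytes) and writes y[i] (4 bytes).
void saxpy_serial(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * x[i] + y[i];
}
```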
The Map Pattern
• Applies a function to every
element of a collection of data items
• Elemental function
• No side effects
• Embarrassingly parallel
• Often combined with collective patterns
Crash course C++11 thread programming
Compile: g++ -std=c++11 -O -pthread -o hello-threads
#include <thread> at top of file
std::thread t; Declare a thread object (not yet running); acts as thread handle
t = std::thread(foo, a1, a2); Start a new thread at function foo(a1,a2)
with arguments a1 and a2
t.join(); Join with thread t: wait for the thread with handle t to finish
std::mutex m; Declare mutual exclusion lock
m.lock() Enter critical section protected by m
m.unlock(); Leave the critical section
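The calls above can be combined into a small sketch: several threads increment a shared counter inside a critical section. The function name locked_count is made up for this example, and it uses std::lock_guard, which calls lock() and unlock() for you, instead of calling them by hand.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Starts nthreads threads that each add `increments` to a shared counter
// (hypothetical helper). The mutex makes each increment a critical section.
long locked_count(unsigned nthreads, long increments) {
    assert(increments >= 0);
    long counter = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            for (long i = 0; i < increments; ++i) {
                std::lock_guard<std::mutex> guard(m);  // m.lock() here
                ++counter;
            }                                          // m.unlock() at scope exit
        });
    }
    for (auto& w : workers) w.join();  // wait for every thread to finish
    return counter;
}
```

Taking the lock for every single increment is deliberately naive; it shows the API, not a good granularity.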
Explicit threading on
multicores (in C++11)
• Create one thread per core
• Divide the work manually
• Substantial amount of extra
code over the serial
• Inflexible scheduling of work
void saxpy_t(float a, const vector<float> &x,
             vector<float> &y, int nthreads,
             int thr_id) {
  int n = x.size();
  // Each thread gets one contiguous block of iterations
  int start = thr_id*n/nthreads;
  int end = min((thr_id+1)*n/nthreads, n);
  for (int i = start; i < end; i++)
    y[i] = a*x[i] + y[i];
}
int main(…) {
  …
  vector<thread> tarr;
  for (int i = 0; i < nthreads; i++){
    tarr.push_back(thread(saxpy_t, a, ref(x),
                          ref(y), nthreads, i));
  }
  // Wait for threads to finish
  for (auto & t : tarr){
    t.join();
  }
}
Load imbalance?
What happens when the iterations are different?
void map_serial(
  float a, // scale factor
  const std::vector<float> &x, // input vec
  std::vector<float> &y // output and input vec
) {
  int n = x.size();
  for (int i = 0; i < n; i++)
    y[i] = map(i);
}
Explicit threading with
load imbalance
• Atomic update of index variable
• Fine granularity of load balancing
• Overheads in multiple threads
wanting to update index
• Note that the declaration of
saxpy_index to be atomic
guarantees no data races
• saxpy_index.fetch_add(1) returns
the old value and atomically adds 1
to it.
std::atomic<int> saxpy_index {0};
void saxpy_t(float a, const vector<float> &x,
             vector<float> &y) {
  int n = x.size();
  // fetch_add returns the old value, so the first index handed out is 0
  int i = saxpy_index.fetch_add(1);
  while (i < n) {
    y[i] = map(i, a*x[i] + y[i]);
    i = saxpy_index.fetch_add(1);
  }
}
int main(…) {
  …
  vector<thread> tarr;
  saxpy_index = 0;
  for (int i = 0; i < num_threads; i++){
    tarr.push_back(thread(saxpy_t, a,
                          ref(x), ref(y)));
  }
  // Wait for threads to finish
  for (auto & t : tarr){
    t.join();
  }
}
Load imbalance with chunks
• The index variable might be a bottleneck
• Use CHUNKS so that each thread works on a range of indices
• Note that the declaration of
index to be atomic guarantees
no data races
std::atomic<int> saxpy_index {0};
void saxpy_t(float a, const std::vector<float> &x,
             std::vector<float> &y) {
  int n = x.size();
  // Grab CHUNK consecutive indices at a time; fetch_add returns the old value
  int c = saxpy_index.fetch_add(CHUNK);
  while (c < n) {
    for (int i = c; i < min(n, c+CHUNK); i++)
      y[i] = map(i, a*x[i] + y[i]);
    c = saxpy_index.fetch_add(CHUNK);
  }
}
Sequence of maps vs Map of Sequence
• Also called: Code fusion
• Do this whenever possible!
• Increases the arithmetic intensity
• Less data to load and store
• Explicit changes needed
• Make sure consecutive elemental
functions do not refer to memory
either through compiler
optimizations or by design
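The difference between a sequence of maps and a map of a sequence can be sketched with two versions of the same computation, y[i] = a*x[i] + b. The function names and the choice of scale-then-offset are made up for this example; the first version writes and re-reads an intermediate vector, the fused one keeps the intermediate value in a register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequence of maps: two passes over memory with an intermediate vector.
std::vector<float> scale_then_offset(const std::vector<float>& x, float a, float b) {
    std::vector<float> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = a * x[i];   // map 1
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = tmp[i] + b;   // map 2
    return y;
}

// Map of a sequence (code fusion): one pass, no intermediate vector.
std::vector<float> scale_offset_fused(const std::vector<float>& x, float a, float b) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float scaled = a * x[i];  // stays in a register
        y[i] = scaled + b;
    }
    return y;
}

// Worked check (also serves as usage): both versions agree on this input.
void fusion_example() {
    const std::vector<float> x{1.0f, 2.0f, 3.0f};
    assert(scale_then_offset(x, 2.0f, 1.0f) == scale_offset_fused(x, 2.0f, 1.0f));
}
```

Both versions do the same arithmetic, but the fused one loads and stores fewer bytes per element, which is what raises the arithmetic intensity.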
Cache fusion optimization
• Almost as important as code fusion
• Break down maps to sequences
of smaller maps, executed by
each thread
• Keep aggregate data small
enough to fit in cache
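One way to read the bullets above is as tiling: instead of running each map over the whole collection, run all maps over one tile before moving on. The sketch below does this for two made-up maps (times two, then plus one). The name two_maps_tiled and the tile parameter are assumptions for illustration; a suitable tile size depends on the cache and has to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Applies two maps tile by tile so that the second map finds the tile's data
// still in cache (hypothetical helper). Tiles are independent, so the outer
// loop can be shared among threads; the pragma is ignored without OpenMP.
void two_maps_tiled(std::vector<float>& v, std::size_t tile) {
    assert(tile > 0);
    const std::size_t n = v.size();
    #pragma omp parallel for
    for (std::size_t start = 0; start < n; start += tile) {
        const std::size_t end = std::min(n, start + tile);
        for (std::size_t i = start; i < end; ++i) v[i] = v[i] * 2.0f;  // first map
        for (std::size_t i = start; i < end; ++i) v[i] = v[i] + 1.0f;  // second map
    }
}
```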
Higher abstraction models
• OpenMP
void saxpy_par_openmp(float a,
                      const vector<float> & x,
                      vector<float> & y) {
  auto n = x.size();
  #pragma omp parallel for
  for (size_t i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }
}
• TBB
auto n = x.size();
tbb::parallel_for(size_t(0), n,
  [&]( size_t i ) {
    y[i] = a * x[i] + y[i];
  });
• Parallel STL (C++17)
std::transform(std::execution::par, x.begin(),
  x.end(), y.begin(), y.begin(),
  [=](float x, float y){
    return a*x + y;
  });
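OpenMP support for map: for loops
• Threads are assigned independent sets of iterations
• Work-sharing construct
Figure: the 16 iterations (i=0 to i=15) are shared among the threads of the parallel region, followed by an implicit barrier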
#pragma omp parallel
#pragma omp for
for (i=0; i < 16; i++)
c[i] = b[i] + a[i];
The Reduce Pattern
• Combining the elements of a
collection of data to a single value
• A combiner function is used to
combine elements pairwise
• The combiner function must be
associative for parallelism
• (𝑎 ⊗ 𝑏) ⊗ 𝑐 = 𝑎 ⊗ (𝑏 ⊗ 𝑐)
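A caveat worth keeping in mind is that floating-point addition is only approximately associative, so a parallel reduction may round differently from the serial one. The sketch below shows the two groupings of a three-element sum; the function names are made up, and the inputs are chosen so that the difference is visible.

```cpp
#include <cassert>

// The two ways of grouping a three-element reduction with + as combiner.
float left_grouping(float a, float b, float c)  { return (a + b) + c; }
float right_grouping(float a, float b, float c) { return a + (b + c); }

// Worked example: 1.0f is far below the rounding step of 1e20f, so it is
// lost when it is added to -1e20f first.
void grouping_example() {
    assert(left_grouping(1e20f, -1e20f, 1.0f) == 1.0f);   // (0) + 1
    assert(right_grouping(1e20f, -1e20f, 1.0f) == 0.0f);  // 1e20 + (-1e20)
}
```

For well-scaled data the difference is usually small, but it is the reason a parallel sum is not guaranteed to be bit-identical to the serial one.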
Serial reduction
Example: Dot-product
float sdot(const vector<float> &x, const vector<float> &y){
  float sum = 0.0;
  for (int i = 0; i < x.size(); i++)
    sum += x[i]*y[i];
  return sum;
}
Note that this is a fusion of a map (vector element product) and the reduce (sum).
Implementation of parallel reduction
• Simple approach:
• Let each thread make the reduce on its part of the data
• Let one (master) thread combine the results to a scalar value
Each thread performs local reduction in parallel
Master thread reduces to scalar value
Thread 0 1 2 3
Awkward way of returning results from a
thread: dot-product example
Plain C/C++:
void sprod(const vector<float> &a,
           const vector<float> &b,
           int start,
           int end,
           float &sum) {
  float lsum = 0.0;
  for (int i=start; i < end; i++)
    lsum += a[i] * b[i];
  sum = lsum; // hand the partial result back through the reference
}
#include <functional>
#include <thread>
using namespace std;
…
vector<float> sum_array(nthr, 0.0);
vector<thread> t_arr;
for (i = 0; i < nthr; i++) {
  int start=i*size/nthr;
  int end = (i+1)*size/nthr;
  if (i==nthr-1)
    end = size;
  t_arr.push_back(thread(
    sprod, cref(a), cref(b),
    start, end,
    ref(sum_array[i])));
}
for (i = 0; i < nthr; i++){
  t_arr[i].join();
  sum += sum_array[i];
}
The Async function using futures
Plain C/C++:
float sprod(const vector<float> &a,
            const vector<float> &b,
            int start,
            int end) {
  float lsum = 0.0;
  for (int i=start; i < end; i++)
    lsum += a[i] * b[i];
  return lsum;
}
#include <functional>
#include <thread>
#include <future>
using namespace std;
…
vector<future<float>> f_arr(nthr);
for (i = 0; i < nthr; i++) {
  int start=i*size/nthr;
  int end = (i+1)*size/nthr;
  if (i==nthr-1)
    end = size;
  // cref avoids copying the vectors into every task
  f_arr[i] = async(launch::async,
                   sprod, cref(a), cref(b),
                   start, end);
}
for (i = 0; i < nthr; i++){
  sum += f_arr[i].get();
}
Definition of async
• The template function async runs the function f asynchronously
(potentially in a separate thread) and returns a std::future that will
eventually hold the result of that function call.
• The launch::async argument makes the function run on a separate
thread (which could be held in a thread pool, or created for this call)
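A minimal sketch of this definition in use: one half of a sum is computed by std::async while the calling thread computes the other half, and get() waits for the result. The function names sum_range and sum_two_halves are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Sums v[begin, end) (hypothetical helper).
long sum_range(const std::vector<int>& v, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= v.size());
    return std::accumulate(v.begin() + begin, v.begin() + end, 0L);
}

// Sums the upper half asynchronously and the lower half on the calling thread.
long sum_two_halves(const std::vector<int>& v) {
    const std::size_t mid = v.size() / 2;
    // std::cref passes the vector by reference instead of copying it.
    std::future<long> upper =
        std::async(std::launch::async, sum_range, std::cref(v), mid, v.size());
    const long lower = sum_range(v, 0, mid);
    return lower + upper.get();  // get() blocks until the result is ready
}
```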
Numerical Integration
Mathematically, we know that:
$\int_0^1 \frac{4.0}{1+x^2}\,dx = \pi$
We can approximate the
integral as a sum of rectangles:
$\sum_{i=0}^{N} F(x_i)\,\Delta x \approx \pi$
Where each rectangle has
width $\Delta x$ and height $F(x_i)$ at
the middle of interval $i$.
Serial PI Program
static long num_steps = 100000;
double step;
int main ()
{ int i; double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  for (i=0;i< num_steps; i++){
    x = (i+0.5)*step;          // midpoint of interval i
    // The Map: evaluate 4.0/(1.0+x*x); The Reduction: accumulate into sum
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}
Map reduce in OpenMP
static long num_steps = 100000;
double step;
int main ()
{
  int i; double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  // Each thread sums into a private copy of sum; the copies are combined at the end
  #pragma omp parallel for reduction(+:sum) private(x)
  for (i=0;i< num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}
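The same reduction clause also covers the earlier dot-product example. The sketch below uses the made-up name dot_openmp; compiled without OpenMP the pragma is ignored and the loop runs serially with the same result for this kind of input.

```cpp
#include <cassert>
#include <vector>

// Dot product as a fused map (element product) and reduce (sum).
// reduction(+:sum) gives every thread a private sum and adds them at the end.
float dot_openmp(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Compared with the std::thread and std::async versions, the partitioning, the per-thread partial sums and the final combine are all left to the OpenMP run-time.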
Main challenges in writing parallel software
• Difficult to write composable parallel software
• The parallel models of different languages do not work well together
• Poor resource management
• Difficult to write portable parallel software
Make Tasks a First Class Citizen
• Separation of concerns
• Concentrate on exposing parallelism
• Not how it is mapped onto hardware
scheduler HW
• The (naïve) sequential Fibonacci calculation
int fib(int n){
  if( n<2 ) return n;
  else {
    int a,b;
    a = fib(n-1);
    b = fib(n-2);
    return b+a;
  }
}
An example of task-parallelism
Parallelism in fib:
• The two calls to fib are independent and can
be computed in any order and in parallel
• It helps that fib is side-effect free but
disjoint side-effects are OK
The need for synchronization:
• The return statement must be executed
after both recursive calls have been
completed because of data-dependence on
a and b.
A task-parallel fib in OpenMP 3+
int fib(int n){
  if( n<2 ) return n;
  else {
    int a,b;
    #pragma omp task shared(a)
    a = fib(n-1);
    #pragma omp task shared(b)
    b = fib(n-2);
    #pragma omp taskwait
    return b+a;
  }
}
Starting code:
...
#pragma omp parallel
#pragma omp single
fib(n);
...
Work-stealing schedulers
• Cores work on tasks in their
own queue
• Generated tasks are put in local queue
• If empty, select a random
queue to steal work from
• Steal from:
• Continuation, or
• Generated tasks
Task-centric parallel models
Gaining momentum
• C/C++:
• OpenMP , Cilk Plus, TBB, GCD...
• C#:
• Microsoft TPL
• Java:
• fork/join
• Erlang:
• processes
• X10:
• activities
• Etc…
Task model benefits
• Automatic load-balancing through work-stealing
• Serial semantics => debug in serial mode
• Composable parallelism
• Parallel libraries can be called from parallel code
• Can be mixed with data-parallelism
• SIMD/Vector instructions
• Data-parallel workers can also be tasks
• Adapts naturally to
• Different number of cores, even in run-time
• Different speeds of cores (e.g. ARM big.LITTLE)
Greatest thing since sliced bread?
• Overheads are still too big to ignore when creating tasks
• Tasks need to have high enough arithmetic intensity to amortize the cost of
creation and scheduling
• Different models do not use the same run-time
• You can’t have a task in TBB calling a function creating tasks written in
OpenMP
• Still no accepted way to target different ISAs
• Research is going on
• The operating system does not know about tasks
• Current OSs only schedule threads
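A common way to deal with the task-creation overhead is a cutoff: below some problem size, stop creating tasks and run serially. The sketch below applies this to the Fibonacci example. The names fib_serial, fib_task and fib_parallel and the cutoff parameter are assumptions for illustration, and a good cutoff value has to be found by measurement.

```cpp
#include <cassert>

// Plain recursive version, used below the cutoff.
long fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Creates tasks only while n is above the cutoff, so that each task has
// enough work to amortize its creation and scheduling cost.
long fib_task(int n, int cutoff) {
    if (n < 2) return n;
    if (n <= cutoff) return fib_serial(n);
    long a = 0, b = 0;
    #pragma omp task shared(a)
    a = fib_task(n - 1, cutoff);
    #pragma omp task shared(b)
    b = fib_task(n - 2, cutoff);
    #pragma omp taskwait
    return a + b;
}

// Entry point: one thread creates the root call, the team executes the tasks.
long fib_parallel(int n, int cutoff) {
    assert(n >= 0);
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n, cutoff);
    return result;
}
```

Without OpenMP enabled the pragmas are ignored and the code degenerates to the serial recursion, which is the serial-semantics property mentioned among the task model benefits.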
Heterogeneous processing
• Same ISA, performance-heterogeneous processing is transparent
• Real heterogeneity is challenging
OpenMP 4.0: extending OpenMP with
dependence annotations
sort(A);
sort(B);
sort(C);
sort(D);
merge(A,B,E);
merge(C,D,F);
OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task
sort(A);
#pragma omp task
sort(B);
#pragma omp task
sort(C);
#pragma omp task
sort(D);
#pragma omp taskwait
#pragma omp task
merge(A,B,E);
#pragma omp task
merge(C,D,F);
OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task depend(inout:A)
sort(A);
#pragma omp task depend(inout:B)
sort(B);
#pragma omp task depend(inout:C)
sort(C);
#pragma omp task depend(inout:D)
sort(D);
// taskwait not needed
#pragma omp task depend(in:A,B) depend(out:E)
merge(A,B,E);
#pragma omp task depend(in:C,D) depend(out:F)
merge(C,D,F);
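A runnable sketch of the pattern above, reduced to two sorts and one merge. The function name sort_and_merge is made up, and the dependences are expressed on array sections of raw pointers, which is one way of naming the data in a depend clause. Without OpenMP the pragmas are ignored and the statements run in program order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts A and B as two independent tasks and merges them in a third task
// that the run-time starts only after both sorts have finished.
std::vector<int> sort_and_merge(std::vector<int> A, std::vector<int> B) {
    assert(!A.empty() && !B.empty());  // keeps the array sections non-empty
    std::vector<int> E(A.size() + B.size());
    int* a = A.data();
    int* b = B.data();
    int* e = E.data();
    std::size_t na = A.size(), nb = B.size();
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(inout: a[0:na])
        std::sort(a, a + na);
        #pragma omp task depend(inout: b[0:nb])
        std::sort(b, b + nb);
        // No taskwait: the in-dependences order this task after both sorts.
        #pragma omp task depend(in: a[0:na], b[0:nb]) depend(out: e[0:na+nb])
        std::merge(a, a + na, b, b + nb, e);
    }  // implicit barrier: all tasks are complete here
    return E;
}
```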
Benefits of tasks in OpenMP 4.0
• More parallelism can be exposed
• Complex synchronization patterns can be avoided/automated
• Knowing a task's memory usage/footprint, offloading to accelerators
can now be made almost transparent to the user
- In terms of memory handling and execution!
Writing heterogeneous code
GPU)
#pragma omp task device(gpu) implements(inc_arr)
void cuda_inc_arr(int *A, int *B) {
  cuda_inc_array_kernel <<<4,256>>>(A,B);
}
TilePRO64)
#pragma omp task device(tilera) implements(inc_arr)
void tilera_inc_arr(int *A,int *B) {
  #pragma omp parallel
  {
    int i = omp_get_thread_num();
    B[i] += A[i];
  }
}
Task Spawn)
#pragma omp task depend(in:A, out:B) target(tilera,gpu,host)
inc_arr(&A[0], &B[0]);
Conclusions
• Parallelism should be exploited
at all levels
• Use higher abstraction models
and compiler support
• Vector instructions
• STL Parallel Algorithms
• TBB / OpenMP
• Only then, use threads
• Measure performance
bottlenecks with profilers
• Beware of
• Granularity
• False sharing

Options and trade offs for parallelism and concurrency in Modern C++

  • 1. Options and trade-offs for parallelism and concurrency in Modern C++ Mats Brorsson
  • 2. Outline • Parallelism and Concurrency in C++ • Higher abstraction models: • OpenMP • TBB • Parallel STL • Parallel map-reduce with threads and OpenMP • Task-parallel models • Threads vs tasks • OpenMP tasks • Conclusions
  • 3. Parallelism everywhere • Server level parallelism • Distributed memory • Multicore architectures • Shared memory • Instruction-level parallelism • Vector parallelism • Thread parallelism • Hardware vs software threads • Simultaneous multithreading • ”Switch-on-event” multithreading
  • 4. Vector vs Thread parallelism • Vector parallelism maps naturally to Regular Data Parallelism • Inner loops can (sometimes) be vectorized • Ways to vectorize: • Auto-vectorization • Vector Intrinsics • _mm_add_ps(__m128 a, __m128 b) • Compiler hints • Cilk Plus array notation • #pragma omp simd Images courtesy Intel and Rebel Science News
  • 5. Key features for Performance • Data locality • Chunks that fit in cache • Reuse data locally • Avoid cache conflicts • Use few virtual pages • Avoid false sharing • Parallel slack • Specify potential parallelism much higher than the actual parallelism • Load balance • All threads have the same amount of work to do
  • 6. Example: SAXPY, scaling of a vector • SAXPY scales a vector x by a factor a and adds the vector y: y = a*x + y • y is used for both input and output • Single-precision floating point (DAXPY: double precision) • Low arithmetic intensity • Little arithmetic work compared to the amount of data consumed and produced • Per element: 2 FLOPs, 8 bytes read and 4 bytes written
  • 7. The Map Pattern • Applies a function to every element of a collection of data items • Elemental function • No side effects • Embarrassingly parallel • Often combined with collective patterns
  • 8. Crash course C++11 thread programming Compile: g++ -std=c++11 -O -pthread -o hello-threads #include <thread> at top of file std::thread t; Declare a thread object (not yet running); acts as thread handle t = std::thread(foo, a1, a2); Start a new thread at function foo(a1,a2) with arguments a1 and a2 t.join(); Join with thread t: wait for the thread with handle t to finish std::mutex m; Declare mutual exclusion lock m.lock(); Enter critical section protected by m m.unlock(); Leave the critical section
  • 9. Explicit threading on multicores (in C++11) • Create one thread per core • Divide the work manually • Substantial amount of extra code over the serial • Inflexible scheduling of work void saxpy_t(float a, const vector<float> &x, vector<float> &y, int nthreads, int thr_id) { int n = x.size(); int start = thr_id*n/nthreads; int end = min((thr_id+1)*n/nthreads, n); for (int i = start; i < end; i++) y[i] = a*x[i] + y[i]; } int main(…) { … vector<thread> tarr; for (int i = 0; i < nthreads; i++){ tarr.push_back(thread(saxpy_t, a, ref(x), ref(y), nthreads, i)); } // Wait for threads to finish for (auto & t : tarr){ t.join(); } }
  • 11. What happens when the iterations are different? void map_serial( float a; // scale factor const std::vector<float> &x; // input vec std::vector<float> &y; // output and input vec ) { for (int i = 0; i < n; i++) y[i] = map(i); }
  • 12. Explicit threading with load imbalance • Atomic update of index variable • Fine granularity of load balancing • Overheads in multiple threads wanting to update index • Note that the declaration of saxpy_index to be atomic guarantees no data races • saxpy_index.fetch_add(1) returns the old value and atomically adds 1 to it. std::atomic<int> saxpy_index {0}; void saxpy_t(float a, const vector<float> &x, vector<float> &y) { int n = x.size(); int i = saxpy_index.fetch_add(1); while (i < n) { y[i] = map(i, a*x[i] + y[i]); i = saxpy_index.fetch_add(1); } } int main(…) { … vector<thread> tarr; saxpy_index = 0; for (int i = 0; i < num_threads; i++){ tarr.push_back(thread(saxpy_t, a, ref(x), ref(y))); } // Wait for threads to finish for (auto & t : tarr){ t.join(); } }
  • 13. Load imbalance with chunks • The index variable might be a bottleneck • Use CHUNKS so that each thread works on a range of indices • Note that the declaration of index to be atomic guarantees no data races std::atomic<int> saxpy_index {0}; void saxpy_t(float a, const std::vector<float> &x, std::vector<float> &y) { int n = x.size(); int c = saxpy_index.fetch_add(CHUNK); while (c < n) { for (int i = c; i < min(n, c+CHUNK); i++) y[i] = map(i, a*x[i] + y[i]); c = saxpy_index.fetch_add(CHUNK); } }
  • 14. Sequence of maps vs Map of Sequence • Also called: Code fusion • Do this whenever possible! • Increases the Arithmetic intensity • Less data to load and store • Explicit changes needed • Make sure consecutive elemental functions do not refer to memory either through compiler optimizations or by design
  • 15. Cache fusion optimization • Almost as important as code fusion • Break down maps to sequences of smaller maps, executed by each thread • Keep aggregate data small enough to fit in cache
  • 16. Outline • Parallelism and Concurrency in C++ • Higher abstraction models: • OpenMP • TBB • Parallel STL • Parallel map-reduce with threads and OpenMP • Task-parallel models • Threads vs tasks • OpenMP tasks • TBB • Performance implications • Conclusions
  • 17. Higher abstraction models • OpenMP void saxpy_par_openmp(float a, const vector<float> & x, vector<float> & y) { auto n = x.size(); #pragma omp parallel for for (size_t i = 0; i < n; i++) { y[i] = a * x[i] + y[i]; } } • TBB auto n = x.size(); tbb::parallel_for(size_t(0), n, [&]( size_t i ) { y[i] = a * x[i] + y[i]; }); • Parallel STL (C++17) std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(), [=](float x, float y){ return a*x + y; });
  • 18. • Threads are assigned independent set of iterations • Work-sharing construct 24 OpenMP support for map: for loops parallel Work sharing i=0 i=1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9 i=10 i=11 i=12 i=13 i=14 i=15 Implicit barrier #pragma omp parallel #pragma omp for for (i=0; i < 16; i++) c[i] = b[i] + a[i];
  • 19. Outline • Parallelism and Concurrency in C++ • Higher abstraction models: • OpenMP • TBB • Parallel STL • Parallel map-reduce with threads and OpenMP • Task-parallel models • Threads vs tasks • OpenMP tasks • TBB • Performance implications • Conclusions
  • 20. The Reduce Pattern • Combining the elements of a collection of data to a single value • A combiner function is used to combine elements pairwise • The combiner function must be associative for parallelism • (𝑎 ⊗ 𝑏) ⊗ 𝑐 = 𝑎 ⊗ (𝑏 ⊗ 𝑐)
  • 21. Serial reduction Example: Dot-product float sdot(const vector<float> &x, const vector<float> &y){ float sum = 0.0; for (int i = 0; i < x.size(); i++) sum += x[i]*y[i]; return sum; } Note that this is a fusion of a map (vector element product) and the reduce (sum).
  • 22. Implementation of parallel reduction • Simple approach: • Let each thread make the reduce on its part of the data • Let one (master) thread combine the results to a scalar value Each thread performs local reduction in parallel Master thread reduces to scalar value Thread 0 1 2 3
  • 23. Awkward way of returning results from a thread: dot-product example Plain C/C++: void sprod(const vector<float> &a, const vector<float> &b, int start, int end, float &sum) { float lsum = 0.0; for (int i=start; i < end; i++) lsum += a[i] * b[i]; sum = lsum; } #include <functional> #include <thread> using namespace std; … vector<float> sum_array(nthr, 0.0); vector<thread> t_arr; for (i = 0; i < nthr; i++) { int start=i*size/nthr; int end = (i+1)*size/nthr; if (i==nthr-1) end = size; t_arr.push_back(thread( sprod, cref(a), cref(b), start, end, ref(sum_array[i]))); } for (i = 0; i < nthr; i++){ t_arr[i].join(); sum += sum_array[i]; }
  • 24. The Async function using futures Plain C/C++: float sprod(const vector<float> &a, const vector<float> &b, int start, int end) { float lsum = 0.0; for (int i=start; i < end; i++) lsum += a[i] * b[i]; return lsum; } #include <functional> #include <thread> #include <future> using namespace std; … vector<future<float>> f_arr(nthr); for (i = 0; i < nthr; i++) { int start=i*size/nthr; int end = (i+1)*size/nthr; if (i==nthr-1) end = size; f_arr[i] = async(launch::async, sprod, cref(a), cref(b), start, end); } for (i = 0; i < nthr; i++){ sum += f_arr[i].get(); }
  • 25. Definition of async • The template function async runs the function f asynchronously (potentially in a separate thread) and returns a std::future that will eventually hold the result of that function call. • The launch::async argument makes the function run on a separate thread (which could be held in a thread pool, or created for this call)
  • 26. Example: Numerical Integration Mathematically, we know that: $\int_0^1 \frac{4.0}{1+x^2}\,dx = \pi$ We can approximate the integral as a sum of rectangles: $\sum_{i=0}^{N} F(x_i)\,\Delta x \approx \pi$ Where each rectangle has width $\Delta x$ and height $F(x_i)$ at the middle of interval $i$. (Figure: plot of F(x) = 4.0/(1+x^2) over X from 0.0 to 1.0)
  • 27. 33 Serial PI Program The Map The Reduction static long num_steps = 100000; double step; void main () { int i; double x, pi, sum = 0.0; step = 1.0/(double) num_steps; for (i=0;i< num_steps; i++){ x = (i+0.5)*step; sum = sum + 4.0/(1.0+x*x); } pi = step * sum; }
  • 28. Map reduce in OpenMP static long num_steps = 100000; double step; void main () { int i; double x, pi, sum = 0.0; step = 1.0/(double) num_steps; #pragma omp parallel for reduction(+:sum) private(x) for (i=0;i< num_steps; i++){ x = (i+0.5)*step; sum = sum + 4.0/(1.0+x*x); } pi = step * sum; }
  • 29. Outline • Parallelism and Concurrency in C++ • Higher abstraction models: • OpenMP • TBB • Parallel STL • Parallel map-reduce with threads and OpenMP • Task-parallel models • Threads vs tasks • OpenMP tasks • TBB • Performance implications • Conclusions
  • 30. Main challenges in writing parallel software • Difficult to write composable parallel software • The parallel models of different languages do not work well together • Poor resource management • Difficult to write portable parallel software
  • 31. Make Tasks a First Class Citizen • Separation of concerns • Concentrate on exposing parallelism • Not how it is mapped onto hardware Run-time scheduler HW
  • 32. • The (naïve) sequential Fibonacci calculation int fib(int n){ if( n<2 ) return n; else { int a,b; a = fib(n-1); b = fib(n-2); return b+a; } } An example of task-parallelism Parallelism in fib: • The two calls to fib are independent and can be computed in any order and in parallel • It helps that fib is side-effect free but disjoint side-effects are OK The need for synchronization: • The return statement must be executed after both recursive calls have been completed because of data-dependence on a and b.
  • 33. int fib(int n){ if( n<2 ) return n; else { int a,b; #pragma omp task shared(a) a = fib(n-1); #pragma omp task shared(b) b = fib(n-2); #pragma omp taskwait return b+a; } } A task-parallel fib in OpenMP 3+ Starting code: ... #pragma omp parallel #pragma omp single fib(n); ...
  • 34. Work-stealing schedulers • Cores work on tasks in their own queue • Generated tasks are put in local queue • If empty, select a random queue to steal work from • Steal from: • Continuation, or • Generated tasks c c c c
  • 35. Task-centric parallel models Gaining momentum • C/C++: • OpenMP , Cilk Plus, TBB, GCD... • C#: • Microsoft TPL • Java: • fork/join • Erlang: • processes • X10: • activities • Etc…
  • 36. Task model benefits • Automatic load-balancing through work-stealing • Serial semantics => debug in serial mode • Composable parallelism • Parallel libraries can be called from parallel code • Can be mixed with data-parallelism • SIMD/Vector instructions • Data-parallel workers can also be tasks • Adapts naturally to • Different number of cores, even in run-time • Different speeds of cores (e.g. ARM big.LITTLE)
  • 37. Greatest thing since sliced bread? • Overheads are still too big to ignore when creating tasks • Tasks need to have high enough arithmetic intensity to amortize the cost of creation and scheduling • Different models do not use the same run-time • You can’t have a task in TBB calling a function creating tasks written in OpenMP • Still no accepted way to target different ISAs • Research is going on • The operating system does not know about tasks • Current OSs only schedule threads
  • 38. Heterogeneous processing • Same ISA, performance-heterogeneous processing is transparent • Real heterogeneity is challenging
  • 39. OpenMP 4.0: extending OpenMP with dependence annotations sort(A); sort(B); sort(C); sort(D); merge(A,B,E); merge(C,D,F); A B C D A, B C, D
  • 40. OpenMP 4.0: extending OpenMP with dependence annotations #pragma omp task sort(A); #pragma omp task sort(B); #pragma omp task sort(C); #pragma omp task sort(D); #pragma omp taskwait #pragma omp task merge(A,B,E); #pragma omp task merge(C,D,F); A B C D A, B C, D
  • 41. OpenMP 4.0: extending OpenMP with dependence annotations #pragma omp task depend(inout:A) sort(A); #pragma omp task depend(inout:B) sort(B); #pragma omp task depend(inout:C) sort(C); #pragma omp task depend(inout:D) sort(D); // taskwait not needed #pragma omp task depend(in:A,B) depend(out:E) merge(A,B,E); #pragma omp task depend(in:C,D) depend(out:F) merge(C,D,F); A B C D A, B C, D
  • 42. Benefits of tasks in OpenMP 4.0 • More parallelism can be exposed • Complex synchronization patterns can be avoided/automated • Knowing a task's memory usage/footprint, offloading to accelerators can now be made almost transparent to the user - In terms of memory handling and execution!
  • 43. Writing heterogeneous code GPU) #pragma omp task device(gpu) implements(inc_arr) void cuda_inc_arr(int *A, int *B) { cuda_inc_array_kernel <<<4,256>>>(A,B); } TilePRO64) #pragma omp task device(tilera) implements(inc_arr) void tilera_inc_arr(int *A,int *B) { #pragma omp parallel { int i = omp_get_thread_num(); B[i] += A[i]; } } Task Spawn) #pragma omp task depend(in:A, out:B) target(tilera,gpu,host) inc_arr(&A[0], &B[0]);
  • 44. Conclusions • Parallelism should be exploited at all levels • Use higher abstraction models and compiler support • Vector instructions • STL Parallel Algorithms • TBB / OpenMP • Only then, use threads • Measure performance bottlenecks with profilers • Beware of • Granularity • False sharing