Options and trade-offs for
parallelism and concurrency in
Modern C++
Mats Brorsson
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• Conclusions
Parallelism everywhere
• Server level parallelism
• Distributed memory
• Multicore architectures
• Shared memory
• Instruction-level parallelism
• Vector parallelism
• Thread parallelism
• Hardware vs software threads
• Simultaneous multithreading
• ”Switch-on-event” multithreading
Vector vs Thread parallelism
• Vector parallelism maps naturally to
Regular Data Parallelism
• Inner loops can (sometimes) be
vectorized
• Ways to vectorize:
• Auto-vectorization
• Vector Intrinsics
• _mm_add_ps(__m128 x, __m128 y)
• Compiler hints
• Cilk Plus array notation
• #pragma omp simd
Images courtesy Intel and Rebel Science News
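As a sketch of the #pragma omp simd route (not from the original slides; the function name saxpy_simd is illustrative):

// Hypothetical SAXPY-style loop; the pragma asks the compiler to vectorize it
// (compile with -fopenmp or -fopenmp-simd).
void saxpy_simd(float a, const float *x, float *y, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   // independent iterations map onto vector lanes
}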
Key features for Performance
• Data locality
• Chunks that fit in cache
• Reuse data locally
• Avoid cache conflicts
• Use few virtual pages
• Avoid false sharing
• Parallel slack
• Specify potential parallelism much
higher than the actual parallelism
• Load balance
• All threads have the same amount of
work to do
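A hedged illustration of false sharing (not from the slides): two threads update adjacent counters; padding each counter onto its own cache line avoids the line ping-ponging between cores. Names and constants are illustrative.

#include <atomic>
#include <thread>

struct Unpadded { std::atomic<long> c0{0}, c1{0}; };        // c0 and c1 likely share a cache line
struct Padded   { alignas(64) std::atomic<long> c0{0};
                  alignas(64) std::atomic<long> c1{0}; };    // each counter gets its own line

int main() {
    Padded counters;   // swap in Unpadded to observe the slowdown caused by false sharing
    std::thread t0([&]{ for (long i = 0; i < 10000000; i++) counters.c0++; });
    std::thread t1([&]{ for (long i = 0; i < 10000000; i++) counters.c1++; });
    t0.join();
    t1.join();
    return 0;
}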
Example: SAXPY, scaling of a vector
• SAXPY scales a vector x by a factor a and adds the vector y: y = a·x + y
• y is used for both input and output
• Single-precision floating point (DAXPY: double precision)
• Low arithmetic intensity
• Little arithmetic work compared to the amount of data consumed and
produced
• 2 FLOPS, 8 bytes read and 4 bytes written
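For reference, a minimal serial SAXPY, assuming std::vector<float> operands as in the threaded versions later in the deck:

#include <cstddef>
#include <vector>

// y = a*x + y; per element: 1 multiply + 1 add (2 FLOPs),
// 8 bytes read (x[i] and y[i]) and 4 bytes written (y[i]).
void saxpy_serial(float a, const std::vector<float> &x, std::vector<float> &y) {
    for (std::size_t i = 0; i < x.size(); i++)
        y[i] = a * x[i] + y[i];
}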
The Map Pattern
• Applies a function to every
element of a collection of data
items
• Elemental function
• No side effects
• Embarrassingly parallel
• Often combined with collective
patterns
Crash course C++11 thread programming
Compile: g++ -std=c++11 -O -lpthread -o hello-threads hello-threads.cc
#include <thread> at top of file
std::thread t; Declare a thread object; acts as a thread handle
std::thread t(foo, a1, a2); Construct and start a new thread running foo with arguments a1 and a2
t.join(); Join with thread t: wait for the thread with handle t to finish
std::mutex m; Declare mutual exclusion lock
m.lock(); Enter the critical section protected by m
m.unlock(); Leave the critical section
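A hedged sketch of what a hello-threads.cc along these lines might look like (the actual program is not shown in the slides); it exercises every primitive in the table above:

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;                       // protects std::cout

void hello(int id) {
    m.lock();                       // enter critical section
    std::cout << "Hello from thread " << id << "\n";
    m.unlock();                     // leave critical section
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; i++)
        threads.push_back(std::thread(hello, i));  // start a thread running hello(i)
    for (auto &t : threads)
        t.join();                                  // wait for each thread to finish
    return 0;
}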
Explicit threading on
multicores (in C++11)
• Create one thread per core
• Divide the work manually
• Substantial amount of extra
code over the serial version
• Inflexible scheduling of work
void saxpy_t(float a, const vector<float> &x,
vector<float> &y, int nthreads,
int thr_id) {
int n = x.size();
int start = thr_id*n/nthreads;
int end = min((thr_id+1)*n/nthreads, n);
for (int i = start; i < end; i++)
y[i] = a*x[i] + y[i];
}
void main(…)
…
vector<thread> tarr;
for (int i = 0; i < nthreads; i++){
tarr.push_back(thread(saxpy_t, a, ref(x),
ref(y), nthreads, i));
}
// Wait for threads to finish
for (auto & t : tarr){
t.join();
}
Load imbalance?
What happens when the iterations are
different?
void map_serial(
float a,                     // scale factor
const std::vector<float> &x, // input vec
std::vector<float> &y)       // output and input vec
{
int n = x.size();
for (int i = 0; i < n; i++)
y[i] = map(i);               // the cost of map(i) may vary per iteration
}
Explicit threading with
load imbalance
• Atomic update of index variable
• Fine granularity of load balancing
• Overheads in multiple threads
wanting to update index
• Note that declaring saxpy_index
as std::atomic guarantees no
data races on it
• saxpy_index.fetch_add(1) returns
the old value and atomically adds 1
to it.
std::atomic<int> saxpy_index {0};
void saxpy_t(float a, vector<float> &x,
vector<float> &y) {
int n = x.size();
int i = saxpy_index.fetch_add(1);
while (i < n) {
y[i] = map(i, a*x[i] + y[i]);
i = saxpy_index.fetch_add(1);
}
}
void main(…)
…
vector<thread> tarr;
for (int i = 0; i < num_threads; i++){
tarr.push_back(thread(saxpy_t, a,
ref(x), ref(y)));
}
// Wait
// join
for (auto & t : tarr){
t.join();
}
Load imbalance with chunks
• The index variable might be a
bottleneck
• Use chunks so that each
thread works on a range of
indices
• Note that declaring the index
as std::atomic guarantees no
data races on it
std::atomic<int> saxpy_index {0};
void saxpy_t(float a, const std::vector<float> &x,
std::vector<float> &y) {
int n = x.size();
int c = saxpy_index.fetch_add(CHUNK);
while (c < n) {
for (int i = c; i < min(n, c+CHUNK); i++)
y[i] = map(i, a*x[i] + y[i]);
c = saxpy_index.fetch_add(CHUNK);
}
}
Sequence of maps vs Map of Sequence
• Also called: Code fusion
• Do this whenever possible!
• Increases the Arithmetic
intensity
• Less data to load and store
• Explicit changes needed
• Make sure consecutive elemental
functions pass intermediate results
without going through memory, either
through compiler optimizations or by
design (sketched below)
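A hedged illustration of the difference (f stands for a hypothetical second elemental function):

// Unfused: two separate maps, each streaming y through memory once.
for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];   // map 1
for (int i = 0; i < n; i++) y[i] = f(y[i]);           // map 2

// Fused: one pass over the data, higher arithmetic intensity per byte moved.
for (int i = 0; i < n; i++) y[i] = f(a * x[i] + y[i]);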
Cache fusion optimization
• Almost as important as code
fusion
• Break down maps to sequences
of smaller maps, executed by
each thread
• Keep aggregate data small
enough to fit in cache
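A sketch of the strip-mining idea, with CHUNK and f purely illustrative (std::min is from <algorithm>):

// Process cache-sized chunks, applying the whole sequence of maps to a chunk
// while it is still resident in cache.
const int CHUNK = 4096;                     // illustrative; tune to the cache size
for (int c = 0; c < n; c += CHUNK) {
    int end = std::min(n, c + CHUNK);
    for (int i = c; i < end; i++) y[i] = a * x[i] + y[i];  // map 1 on the chunk
    for (int i = c; i < end; i++) y[i] = f(y[i]);          // map 2 reuses the chunk from cache
}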
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
Higher abstraction models
• OpenMP
void saxpy_par_openmp(float a,
const vector<float> & x,
vector<float> & y) {
auto n = x.size();
#pragma omp parallel for
for (auto i = 0; i < n; i++) {
y[i] = a * x[i] + y[i];
}
}
• TBB
auto n = x.size();
tbb::parallel_for(size_t(0), n,
[&]( size_t i ) {
y[i] = a * x[i] + y[i];
});
• Parallel STL (C++17)
std::transform(std::execution::par, x.begin(),
x.end(),
y.begin(),
y.begin(),
[=](float x, float y){
return a*x + y;
});
OpenMP support for map: for loops
• Threads are assigned
independent sets of
iterations
• Work-sharing
construct
[Figure: the 16 iterations (i=0 … i=15) of the parallel region are divided among the threads by the work-sharing construct, followed by an implicit barrier]
#pragma omp parallel
#pragma omp for
for (i=0; i < 16; i++)
c[i] = b[i] + a[i];
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
The Reduce Pattern
• Combining the elements of a
collection of data to a single
value
• A combiner function is used to
combine elements pairwise
• The combiner function must be
associative for parallelism
• (𝑎 ⊗ 𝑏) ⊗ 𝑐 = 𝑎 ⊗ (𝑏 ⊗ 𝑐)
Serial reduction
Example: Dot-product
float sdot(const vector<float> &x, const vector<float> &y){
float sum = 0.0;
for (int i = 0; i < x.size(); i++)
sum += x[i]*y[i];
return sum;
}
Note that this is a fusion of a map (vector element product) and the reduce (sum).
Implementation of parallel reduction
• Simple approach:
• Let each thread make the reduce on its part of the data
• Let one (master) thread combine the results to a scalar value
[Figure: each thread (0–3) performs a local reduction on its part of the data in parallel; the master thread then reduces the partial results to a scalar value]
Awkward way of returning results from a
thread: dot-product example
Plain C/C++:
void sprod(const vector<float> &a,
const vector<float> &b,
int start,
int end,
float &sum) {
float lsum = 0.0;
for (int i=start; i < end; i++)
lsum += a[i] * b[i];
sum = lsum; // the result is returned through the reference argument
}
#include <thread>
using namespace std;
…
vector<float> sum_array(nthr, 0.0);
vector<thread> t_arr;
for (i = 0; i < nthr; i++) {
int start=i*size/nthr;
int end = (i+1)*size/nthr;
if (i==nthr-1)
end = size;
t_arr.push_back(thread(
sprod, ref(a), ref(b),
start, end,
ref(sum_array[i])));
}
for (i = 0; i < nthr; i++){
t_arr[i].join();
sum += sum_array[i];
}
The Async function using futures
Plain C/C++:
float sprod(const vector<float> &a,
const vector<float> &b,
int start,
int end) {
float lsum = 0.0;
for (int i=start; i < end; i++)
lsum += a[i] * b[i];
return lsum;
}
#include <thread>
#include <future>
using namespace std;
…
vector<future<float>> f_arr(nthr);
for (i = 0; i < nthr; i++) {
int start=i*size/nthr;
int end = (i+1)*size/nthr;
if (i==nthr-1)
end = size;
f_arr[i] = async(launch::async,
sprod, ref(a), ref(b),
start, end);
}
for (i = 0; i < nthr; i++){
sum += f_arr[i].get();
}
Definition of async
• The template function async runs the function f asynchronously
(potentially in a separate thread) and returns a std::future that will
eventually hold the result of that function call.
• The launch::async argument makes the function run on a separate
thread (which could be held in a thread pool, or created for this call)
Example:
Numerical Integration
Mathematically, we know that:
∫₀¹ 4.0/(1+x²) dx = π
We can approximate the
integral as a sum of
rectangles:
π ≈ Σ (i = 0..N) F(xᵢ)·Δx
Where each rectangle has
width Δx and height F(xᵢ) at
the middle of interval i.
[Figure: F(x) = 4/(1+x²) on 0 ≤ x ≤ 1, approximated by rectangles; y-axis from 0.0 to 4.0]
Serial PI Program
The Map
The Reduction
static long num_steps = 100000;
double step;
void main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
Map reduce in OpenMP
static long num_steps = 100000;
double step;
void main ()
{
int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
#pragma omp parallel for reduction(+:sum) private(x)
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
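For comparison, a hedged sketch of the same map-reduce with the C++17 parallel algorithms; this assumes a standard library that ships <execution> (with GCC/libstdc++ it typically also requires linking against TBB). The vector of midpoints is materialized only to keep the sketch simple.

#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    const long num_steps = 100000;
    const double step = 1.0 / (double)num_steps;

    // The map: the midpoint of each interval.
    std::vector<double> xs(num_steps);
    for (long i = 0; i < num_steps; i++) xs[i] = (i + 0.5) * step;

    // The reduce: sum 4/(1+x^2) over all midpoints, potentially in parallel.
    double sum = std::transform_reduce(std::execution::par,
                                       xs.begin(), xs.end(), 0.0,
                                       std::plus<>(),
                                       [](double x) { return 4.0 / (1.0 + x * x); });

    std::printf("pi = %f\n", step * sum);
    return 0;
}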
Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
Main challenges in writing parallel software
• Difficult to write composable parallel software
• The parallel models of different languages do not work well together
• Poor resource management
• Difficult to write portable parallel software
Make Tasks a First Class Citizen
• Separation of concerns
• Concentrate on exposing parallelism
• Not how it is mapped onto hardware
[Figure: the programmer exposes tasks; a run-time scheduler maps them onto the hardware (HW)]
• The (naïve) sequential Fibonacci calculation
int fib(int n){
if( n<2 ) return n;
else {
int a,b;
a = fib(n-1);
b = fib(n-2);
return b+a;
}
}
An example of task-parallelism
Parallelism in fib:
• The two calls to fib are independent and can
be computed in any order and in parallel
• It helps that fib is side-effect free but
disjoint side-effects are OK
The need for synchronization:
• The return statement must be executed
after both recursive calls have been
completed because of data-dependence on
a and b.
int fib(int n){
if( n<2 ) return n;
else {
int a,b;
#pragma omp task shared(a)
a = fib(n-1);
#pragma omp task shared(b)
b = fib(n-2);
#pragma omp taskwait
return b+a;
}
}
A task-parallel fib in OpenMP 3+
Starting code:
...
#pragma omp parallel
#pragma omp single
fib(n);
...
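For comparison (not from the slides), roughly the same pattern expressed with TBB's task_group:

#include <tbb/task_group.h>

// Sketch of a TBB counterpart to the OpenMP task version above.
int fib_tbb(int n) {
    if (n < 2) return n;
    int a, b;
    tbb::task_group g;
    g.run([&] { a = fib_tbb(n - 1); });  // spawn the first recursive call as a task
    g.run([&] { b = fib_tbb(n - 2); });  // spawn the second recursive call as a task
    g.wait();                            // plays the role of #pragma omp taskwait
    return a + b;
}

In practice one would fall back to the plain serial fib below some cutoff value of n, so that task-creation overhead is amortized (see the granularity caveat later in the deck).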
Work-stealing schedulers
• Cores work on tasks in their
own queue
• Generated tasks are put in local
queue
• If empty, select a random
queue to steal work from
• Steal from:
• Continuation, or
• Generated tasks
[Figure: one task queue per core ('c'), with idle cores stealing from other queues]
Task-centric parallel models
Gaining momentum
• C/C++:
• OpenMP, Cilk Plus, TBB, GCD...
• C#:
• Microsoft TPL
• Java:
• fork/join
• Erlang:
• processes
• X10:
• activities
• Etc…
Task model benefits
• Automatic load-balancing through work-stealing
• Serial semantics => debug in serial mode
• Composable parallelism
• Parallel libraries can be called from parallel code
• Can be mixed with data-parallelism
• SIMD/Vector instructions
• Data-parallel workers can also be tasks
• Adapts naturally to
• Different number of cores, even in run-time
• Different speeds of cores (e.g. ARM big.LITTLE)
Greatest thing since sliced bread?
• Overheads are still too big to not care about when creating tasks
• Tasks need to have high enough arithmetic intensity to amortize the cost of
creation and scheduling
• Different models do not use the same run-time
• You can’t have a task in TBB calling a function creating tasks written in
OpenMP
• Still no accepted way to target different ISAs
• Research is going on
• The operating system does not know about tasks
• Current OSs only schedule threads
Heterogeneous processing
• With the same ISA, performance-heterogeneous processing is transparent
• Real heterogeneity is challenging
OpenMP 4.0: extending OpenMP with
dependence annotations
sort(A);
sort(B);
sort(C);
sort(D);
merge(A,B,E);
merge(C,D,F);
[Figure: task dependence graph — the four sorts of A, B, C, D feed merge(A,B)→E and merge(C,D)→F]
OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task
sort(A);
#pragma omp task
sort(B);
#pragma omp task
sort(C);
#pragma omp task
sort(D);
#pragma omp taskwait
#pragma omp task
merge(A,B,E);
#pragma omp task
merge(C,D,F);
OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task depend(inout:A)
sort(A);
#pragma omp task depend(inout:B)
sort(B);
#pragma omp task depend(inout:C)
sort(C);
#pragma omp task depend(inout:D)
sort(D);
// taskwait not needed
#pragma omp task depend(in:A,B, out:E)
merge(A,B,E);
#pragma omp task depend(in:C,D, out:F)
merge(C,D,F);
Benefits of tasks in OpenMP 4.0
• More parallelism can be exposed
• Complex synchronization patterns can be avoided/automated
• Since the run-time knows a task's memory usage/footprint, offloading to accelerators
can now be made almost transparent to the user
- In terms of both memory handling and execution!
Writing heterogeneous code
GPU:
#pragma omp task device(gpu)
implements(inc_arr)
void cuda_inc_arr(int *A, int *B) {
cuda_inc_array_kernel <<<4,256>>>(A,B);
}
TilePRO64:
#pragma omp task device(tilera) implements(inc_arr)
void tilera_inc_arr(int *A,int *B) {
#pragma omp parallel
{
int i = omp_get_thread_num();
B[i] += A[i];
}
}
Task spawn:
#pragma omp task depend(in:A, out:B) target(tilera,gpu,host)
inc_arr(&A[0], &B[0]);
Conclusions
• Parallelism should be exploited
at all levels
• Use higher abstraction models
and compiler support
• Vector instructions
• STL Parallel Algorithms
• TBB / OpenMP
• Only then, use threads
• Measure performance
bottlenecks with profilers
• Beware of
• Granularity
• False sharing