Work Replication with Parallel Region
#pragma omp parallel
{
   for (j = 0; j < 10; j++)
      printf("Hello\n");
}
On 5 threads we get 50 printouts of hello, since each thread
executes all 10 iterations concurrently with the other 4 threads
#pragma omp parallel for
for (j = 0; j < 10; j++)
   printf("Hello\n");
Regardless of the number of threads we get 10 printouts of hello,
since the loop iterations are divided among the team of threads
and each iteration is executed exactly once
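For illustration, here is a minimal compilable sketch of the two cases above (an assumed example, not from the original slides); printing the thread number makes the difference visible:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Work replication: every thread executes the whole loop. */
    #pragma omp parallel
    {
        for (int j = 0; j < 10; j++)
            printf("replicated: thread %d, j = %d\n", omp_get_thread_num(), j);
    }

    /* Worksharing: the 10 iterations are split across the team. */
    #pragma omp parallel for
    for (int j = 0; j < 10; j++)
        printf("shared: thread %d, j = %d\n", omp_get_thread_num(), j);

    return 0;
}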
NOWAIT clause : C
The nowait clause removes the implied barrier at the end of the first
worksharing loop, so threads that finish it can start the second loop
immediately; this is safe here because the two loops work on different arrays
#pragma omp parallel
{
   #pragma omp for nowait
   for (j = 1; j < n; j++)
      b[j] = (a[j] + a[j-1]) / 2.0;
   #pragma omp for
   for (j = 1; j < n; j++)
      c[j] = d[j] / e[j];
}
Parallel Sections
• So far we have divided the work of one task among
threads
• Parallel sections allow us to assign different tasks to
different threads
– Need to make sure that none of the later tasks depends
on the results of the earlier ones
• This is helpful when it is difficult or impossible to speed up the
individual tasks by parallelizing them internally
• The code for the entire sequence of tasks or sections
begins with a sections directive and ends with an end
sections directive
• The beginning of each section is marked by a section
directive which is optional for the very first section
Fortran sections directive
!$omp parallel sections [clause...]
[!$omp section]
   code for 1st section
!$omp section
   code for 2nd section
!$omp section
   code for 3rd section
   .
   .
!$omp end parallel sections
C/C++ sections directive
#pragma omp parallel sections [clause...]
{
   [#pragma omp section]
   code for 1st section
   #pragma omp section
   code for 2nd section
   #pragma omp section
   code for 3rd section
   .
   .
}
• clause can be private, firstprivate, lastprivate, reduction
• In Fortran the NOWAIT clause goes at the end:
!$omp end sections [nowait]
• In C/C++ NOWAIT is provided with the omp sections
pragma: #pragma omp sections nowait
• Each section is executed once and each thread executes
zero or more sections
• A thread may execute more than one section if there are
more sections than threads
• It is not possible to determine if one section will be
executed before another or if two sections will be
executed by the same thread
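As a concrete illustration, a minimal sketch in C (an assumed example; sort_data and filter_data are hypothetical independent tasks):

#include <stdio.h>
#include <omp.h>

/* Two independent tasks suitable for separate sections. */
static void sort_data(void)   { printf("sorting on thread %d\n", omp_get_thread_num()); }
static void filter_data(void) { printf("filtering on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        sort_data();        /* one thread executes this section */

        #pragma omp section
        filter_data();      /* another (or the same) thread executes this one */
    }
    return 0;
}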
Assigning work to a single thread
• Within a parallel region a block of code may be executed
just once by any one of the threads in the team
– There is an implicit barrier at the end of single (unless a
nowait clause is supplied)
– Clause can be private or firstprivate
Fortran :
!$omp single [clause...]
   block of code to be executed by just one thread
!$omp end single [nowait]
C/C++ :
#pragma omp single [clause ...] [nowait]
   block of code to be executed by just one thread
single for I/O
• A common use of single is reading shared input variables or
writing output within a parallel region
• I/O may not be easy to parallelize
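A small sketch of this pattern (an assumed example): one thread reads the shared input inside the parallel region, and the implicit barrier at the end of single guarantees the value is visible to the whole team before it is used:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int n = 0;            /* shared input value */
    double sum = 0.0;

    #pragma omp parallel shared(n)
    {
        #pragma omp single
        {
            /* Only one thread performs the read; the implicit barrier
               at the end of single makes n visible to all threads. */
            if (scanf("%d", &n) != 1)
                n = 0;
        }

        #pragma omp for reduction(+:sum)
        for (int j = 0; j < n; j++)
            sum += j;
    }

    printf("sum = %f\n", sum);
    return 0;
}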
omp_get_thread_num, omp_get_num_threads
• Remember that OpenMP uses a fork/join model of
parallelization
• Thread teams are only created within a parallel construct
(parallel do/for, parallel)
• omp_get_thread_num and omp_get_num_threads only give
meaningful results inside a parallel construct where threads
have been forked; outside one, the team consists of just the
master thread (num_threads is 1, thread_num is 0)
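A minimal sketch of calling these functions inside and outside a parallel region (an assumed example):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Outside a parallel region: a one-thread team. */
    printf("serial: thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());

    /* Inside a parallel region: each forked thread reports its own id. */
    #pragma omp parallel
    printf("parallel: thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());

    return 0;
}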
Synchronization
• Critical – only one thread at a time may execute the block of code
• Barrier – a point where all threads wait until every thread has arrived
• Other synchronization directives :
– master
– ordered
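A brief sketch of critical and barrier inside one parallel region (an assumed example):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int count = 0;

    #pragma omp parallel
    {
        /* critical: only one thread at a time updates the shared counter. */
        #pragma omp critical
        count++;

        /* barrier: no thread proceeds until every thread has incremented. */
        #pragma omp barrier

        #pragma omp single
        printf("count = %d (equals the number of threads)\n", count);
    }
    return 0;
}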
Synchronization : master directive
• The master directive identifies a structured block of code
that is executed only by the master thread of the team
• There is no implicit barrier at the end of the master directive
• Fortran :
!$omp master
code block
!$omp end master
• C/C++ :
#pragma omp master
code block
master example
!$omp parallel       (or #pragma omp parallel)
!$omp do             (or #pragma omp for)
   loop I = 1 : n
      calculation
   end loop
!$omp master         (or #pragma omp master)
   print result (reduction) from above loop
!$omp end master
   more computation
end parallel region
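A concrete C rendering of this pattern (an assumed sketch): a reduction loop, after which only the master thread prints the result; the implicit barrier at the end of the for guarantees the sum is complete, and since master adds no barrier the other threads continue straight to the remaining computation:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000;
    double sum = 0.0;

    #pragma omp parallel
    {
        /* Worksharing loop with a reduction; its implicit barrier
           guarantees sum is complete before the code below. */
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += i;

        /* Only the master thread prints; no implicit barrier follows. */
        #pragma omp master
        printf("sum = %f\n", sum);

        /* ... more computation performed by all threads ... */
    }
    return 0;
}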
Synchronization : ordered directive
• The structured block following an ordered directive is
executed in the order in which the iterations would be
executed in a sequential loop
• The enclosing parallel do/for must carry the ordered clause
• Fortran :
!$omp ordered
code block
!$omp end ordered
• C/C++:
#pragma omp ordered
code block
ordered example
!$omp parallel do ordered     (or #pragma omp parallel for ordered)
   loop I = 1 : n
      a[I] = ..calculation..
!$omp ordered                 (or #pragma omp ordered)
      print a[I]
!$omp end ordered
   end loop
end parallel loop
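The same pattern as a compilable C sketch (an assumed example): the calculations run in parallel, but the prints appear in sequential loop order:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 16;
    double a[16];

    /* The ordered clause on the loop is required for the ordered
       directive used inside it. */
    #pragma omp parallel for ordered
    for (int i = 0; i < n; i++) {
        a[i] = i * i;                       /* computed in parallel */

        #pragma omp ordered
        printf("a[%d] = %f\n", i, a[i]);    /* printed in loop order */
    }
    return 0;
}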
OpenMP Performance
• Each processor has its own cache in a shared-memory
machine
• Data locality in caches and loop scheduling
• False sharing
Data locality in caches and loop scheduling
• loop j = 0 : n
     loop k = 0 : n
        a[j][k] = k + 1 + a[j][k]
• loop j = 0 : n
     loop k = 0 : n
        a[j][k] = 1. / a[j][k]
• Assume each processor’s cache can hold its part of the matrix
• After the first loop each processor’s cache holds the rows it
worked on (cache-line dependent). In the second loop it may or
may not be assigned those same rows, depending on the schedule
• Static scheduling may provide better cache performance than
dynamic scheduling, since it tends to assign the same iterations
to the same threads in both loops
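A sketch of the idea in C (an assumed example): with schedule(static) and identical loop bounds each thread gets the same block of rows in both passes, so data cached in the first pass can be reused in the second:

#include <omp.h>

#define N 1024

static double a[N][N];

static void two_passes(void)
{
    /* First pass: each thread works on a fixed block of rows. */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            a[j][k] = k + 1 + a[j][k];

    /* Second pass: schedule(static) with the same bounds gives each
       thread the same rows again, so its cached data is reused.
       A dynamic schedule would not guarantee this. */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            a[j][k] = 1.0 / a[j][k];
}

int main(void)
{
    two_passes();
    return 0;
}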
False sharing
• If different processors update adjacent (stride-one) elements of
an array, this can cause poor cache performance
• The cache line holding those elements has to be invalidated
again and again among all the processors
• Parallel loop with schedule (static,1)
     loop j = 1 : n
        a[j] = a[j] + j
• Proc 1 updates a[1], proc 2 updates a[2], etc.
• The shared cache line needs to be invalidated for each
processor’s update – this leads to bad performance
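A C sketch of the problem and a common remedy (an assumed example): schedule(static,1) makes neighbouring threads write neighbouring elements that share a cache line, while the default block-wise static schedule gives each thread a contiguous range:

#include <stdlib.h>
#include <omp.h>

#define N 100000

int main(void)
{
    double *a = calloc(N, sizeof *a);

    /* Poor: chunks of 1 make neighbouring threads write neighbouring
       elements, which sit on the same cache line (false sharing). */
    #pragma omp parallel for schedule(static, 1)
    for (int j = 0; j < N; j++)
        a[j] = a[j] + j;

    /* Better: the default static schedule gives each thread one large
       contiguous block, so threads rarely touch the same cache line. */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        a[j] = a[j] + j;

    free(a);
    return 0;
}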
Look up from OpenMP standard
• Threadprivate
!$omp threadprivate (/cb1/, /cb2/)
#pragma omp threadprivate(list)
• cb1, cb2 are common blocks in Fortran; list is a list of
named file-scope or namespace-scope variables in C/C++
• Threadprivate makes named common blocks private to a
thread but global within the thread
• Threadprivate makes the named file scope or namespace
scope variables (list) private to a thread but file scope
visible within the thread
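A small C sketch of threadprivate (an assumed example): a file-scope counter becomes a per-thread global whose value persists across parallel regions when dynamic threads are disabled and the team size stays the same:

#include <stdio.h>
#include <omp.h>

/* File-scope variable, made per-thread by threadprivate. */
static int calls = 0;
#pragma omp threadprivate(calls)

int main(void)
{
    omp_set_dynamic(0);      /* keep the team size stable between regions */

    #pragma omp parallel
    calls++;                 /* each thread increments its own copy */

    #pragma omp parallel
    {
        calls++;             /* values persist from the previous region */
        printf("thread %d: calls = %d\n", omp_get_thread_num(), calls);
    }
    return 0;
}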
Look up from OpenMP standard
• The atomic directive ensures that a specific memory location is
updated atomically – it can be more efficient than critical because
it can map directly to hardware atomic instructions
• C:
#pragma omp parallel for
for (i = 1; i < n; i++) {
   #pragma omp atomic
   a[index[i]] = a[index[i]] + 1;
}
• Fortran:
!$omp parallel do
do i = 1, n
!$omp atomic
   y(index(i)) = y(index(i)) + c
end do
!$omp end parallel do