April 4-7, 2016 | Silicon Valley
James Beyer, NVIDIA
Jeff Larkin, NVIDIA
GTC16 – April 7, 2016
S6410 - Comparing OpenACC 2.5
and OpenMP 4.5
2
AGENDA
History of OpenMP & OpenACC
Philosophical Differences
Technical Differences
Portability Challenges
Conclusions
3
A Tale of Two Specs.
4
A Brief History of OpenMP
1996 - Architecture Review Board (ARB) formed by several vendors implementing
their own directives for Shared Memory Parallelism (SMP).
1997 - 1.0 was released for C/C++ and Fortran with support for parallelizing loops
across threads.
2000, 2002 – Version 2.0 of Fortran, C/C++ specifications released.
2005 – Version 2.5 released, combining both specs into one.
2008 – Version 3.0 released, added support for tasking
2011 – Version 3.1 released, improved support for tasking
2013 – Version 4.0 released, added support for offloading (and more)
2015 – Version 4.5 released, improved support for offloading targets (and more)
5
A Brief History of OpenACC
2010 – OpenACC founded by CAPS, Cray, PGI, and NVIDIA, to unify directives for
accelerators being developed by CAPS, Cray, and PGI independently
2011 – OpenACC 1.0 released
2013 – OpenACC 2.0 released, adding support for unstructured data management and
clarifying specification language
2015 – OpenACC 2.5 released, containing primarily clarifications, with some additional
features
6
Philosophical Differences
7
OpenMP: Compilers are dumb, users are smart. Restructuring non-parallel code is optional.
OpenACC: Compilers can be smart, and smarter with the user’s help. Non-parallel code must be made parallel.
8
Philosophical Differences
OpenMP:
The OpenMP API covers only user-directed parallelization, wherein the programmer
explicitly specifies the actions to be taken by the compiler and runtime system in
order to execute the program in parallel.
The OpenMP API does not cover compiler-generated automatic parallelization and
directives to the compiler to assist such parallelization.
OpenACC:
The programming model allows the programmer to augment information available
to the compilers, including specification of data local to an accelerator, guidance
on mapping of loops onto an accelerator, and similar performance-related details.
9
Philosophical Trade-offs
OpenMP
• Consistent, predictable behavior between implementations
• Users can parallelize non-parallel code and protect data races explicitly
• Some optimizations are off the table
• Substantially different architectures require substantially different directives
OpenACC
• Quality of implementation will greatly affect performance
• Users must restructure their code to be parallel and free of data races
• Compiler has more freedom and information to optimize
• High-level parallel directives can be applied to different architectures by the compiler
10
Technical Differences
11
Parallel: Similar, but Different
OMP Parallel
• Creates a team of threads
• Very well defined how the number of threads is chosen
• May synchronize within the team
• Data races are the user’s responsibility
ACC Parallel
• Creates 1 or more gangs of workers
• Compiler free to choose the number of gangs, workers, and the vector length
• May not synchronize between gangs
• Data races not allowed
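To make the contrast concrete, here is a minimal sketch (illustrative, not from the slides) of a bare parallel region in each model:

#include <stdio.h>
#include <omp.h>

int main(void) {
  // OpenMP: a team of threads executes the block, one instance per thread;
  // the user may synchronize within the team and owns any data races.
  #pragma omp parallel
  {
    printf("OpenMP thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }

  // OpenACC: one or more gangs of workers execute the block redundantly;
  // the compiler picks gang/worker/vector counts, gangs cannot synchronize
  // with each other, and data races are not allowed.
  #pragma acc parallel
  {
    // in practice almost always paired with a loop directive
  }
  return 0;
}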
12
OMP Teams vs. ACC Parallel
OMP Teams
• Creates a league of 1 or more thread teams
• Compiler free to choose the number of teams, threads, and SIMD lanes
• May not synchronize between teams
• Only available within target regions
ACC Parallel
• Creates 1 or more gangs of workers
• Compiler free to choose the number of gangs, workers, and the vector length
• May not synchronize between gangs
• May be used anywhere
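A small sketch of the structural difference (function names and clauses are illustrative): OpenMP teams exist only inside a target region, while the OpenACC parallel construct can be used anywhere.

void scale_omp(float *a, int n) {
  // A league of teams requires an enclosing target region.
  #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] *= 2.0f;
}

void scale_acc(float *a, int n) {
  // The same OpenACC construct serves host and accelerator alike.
  #pragma acc parallel loop copy(a[0:n])
  for (int i = 0; i < n; i++)
    a[i] *= 2.0f;
}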
13
Compiler-Driven Mode
OpenMP
• Fully user-driven; there is no analogue
• Some compilers choose to go above and beyond after applying OpenMP, but this is not guaranteed
OpenACC
• The kernels directive declares the desire to parallelize a region of code but places the burden of analysis on the compiler
• The compiler is required to be able to do that analysis and make decisions
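A hypothetical illustration of compiler-driven mode: inside a kernels region the compiler must analyze each loop itself, parallelizing what it can prove safe and keeping the rest sequential.

void example(int n, float *restrict a, float *restrict b) {
  #pragma acc kernels
  {
    // Independent iterations: the compiler may parallelize this loop.
    for (int i = 0; i < n; i++)
      a[i] = 2.0f * b[i];

    // Loop-carried dependence: the compiler should detect it and keep
    // this loop sequential rather than produce wrong answers.
    for (int i = 1; i < n; i++)
      a[i] += a[i-1];
  }
}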
14
Loop: Similar but Different
OMP Loop (For/Do)
• Splits (“workshares”) the iterations of the next loop across the threads in the team; guarantees that the user has managed any data races
• Loop will be run over threads, and the scheduling of iterations may restrict the compiler
ACC Loop
• Declares the loop iterations independent and race free (parallel) or interesting and worth analyzing (kernels)
• User can declare independence without declaring scheduling
• Compiler free to schedule with gangs/workers/vector unless overridden by the user
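A minimal saxpy-style sketch of the difference (illustrative only): the OpenMP version pins down a schedule, which the compiler must honor; the OpenACC version asserts independence and leaves the mapping to the compiler.

void saxpy_omp(int n, float a, float *restrict x, float *restrict y) {
  // Workshares the loop across the team; schedule(static,1) constrains
  // exactly how iterations are assigned to threads.
  #pragma omp parallel for schedule(static, 1)
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}

void saxpy_acc(int n, float a, float *restrict x, float *restrict y) {
  // Declares the iterations independent and race free; the compiler
  // picks the gang/worker/vector schedule unless clauses override it.
  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}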
15
Distribute vs. Loop
OMP Distribute
• Must live in a TEAMS region
• Distributes loop iterations over 1 or more thread teams
• Only the master thread of each team runs iterations, until PARALLEL is encountered
• Loop iterations are implicitly independent, but some compiler optimizations are still restricted
ACC Loop
• Declares the loop iterations independent and race free (parallel) or interesting and worth analyzing (kernels)
• Compiler free to schedule with gangs/workers/vector unless overridden by the user
16
Distribute Example
#pragma omp target teams
{
  #pragma omp distribute
  for(i=0; i<n; i++)
    for(j=0; j<m; j++)
      for(k=0; k<p; k++)
}

#pragma acc parallel
{
  #pragma acc loop
  for(i=0; i<n; i++)
    #pragma acc loop
    for(j=0; j<m; j++)
      #pragma acc loop
      for(k=0; k<p; k++)
}
17
Distribute Example
#pragma omp target teams
{
  #pragma omp distribute
  for(i=0; i<n; i++)
    for(j=0; j<m; j++)
      for(k=0; k<p; k++)
}

#pragma acc parallel
{
  #pragma acc loop
  for(i=0; i<n; i++)
    #pragma acc loop
    for(j=0; j<m; j++)
      #pragma acc loop
      for(k=0; k<p; k++)
}
Generate 1 or more thread teams.
Distribute “i” over the teams.
No information about the “j” or “k” loops.
18
Distribute Example
#pragma omp target teams
{
  #pragma omp distribute
  for(i=0; i<n; i++)
    for(j=0; j<m; j++)
      for(k=0; k<p; k++)
}

#pragma acc parallel
{
  #pragma acc loop
  for(i=0; i<n; i++)
    #pragma acc loop
    for(j=0; j<m; j++)
      #pragma acc loop
      for(k=0; k<p; k++)
}
Generate 1 or more gangs.
These loops are independent; do the right thing.
19
Distribute Example
#pragma omp target teams
{
  #pragma omp distribute
  for(i=0; i<n; i++)
    for(j=0; j<m; j++)
      for(k=0; k<p; k++)
}

#pragma acc parallel
{
  #pragma acc loop
  for(i=0; i<n; i++)
    #pragma acc loop
    for(j=0; j<m; j++)
      #pragma acc loop
      for(k=0; k<p; k++)
}
What’s the right thing?
Interchange? Distribute? Workshare?
Vectorize? Stripmine? Ignore? …
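If the compiler’s choice turns out to be wrong, OpenACC also lets the user spell out the mapping without restructuring the nest; a hypothetical explicit schedule for the loops above:

void map_explicitly(int n, int m, int p,
                    float in[n][m][p], float out[n][m][p]) {
  // Explicitly bind each loop level to gang, worker, and vector
  // parallelism (the unclaused version leaves this to the compiler).
  #pragma acc parallel loop gang
  for (int i = 0; i < n; i++)
    #pragma acc loop worker
    for (int j = 0; j < m; j++)
      #pragma acc loop vector
      for (int k = 0; k < p; k++)
        out[i][j][k] = 2.0f * in[i][j][k];
}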
20
Synchronization
OpenMP
• Users may use barriers, critical regions, and/or locks to protect data races
• It’s possible to parallelize non-parallel code
OpenACC
• Users expected to refactor code to remove data races
• Code should be made truly parallel and scalable
21
Synchronization Example
#pragma omp parallel for
for (i=0; i<N; i++)
{
  funcA(p);
  #pragma omp barrier
  funcB(p);
}

// OpenACC has no barrier: the refactored code splits the work into two
// parallel regions, and the kernel boundary provides the synchronization.
void funcA(float p[N]) {
  #pragma acc parallel
  { /* ... */ }
}
void funcB(float p[N]) {
  #pragma acc parallel
  { /* ... */ }
}
22
Synchronization Example
// rand() is not thread safe, so the OpenMP version serializes it
// with a critical section.
#pragma omp parallel for
for (i=0; i<N; i++)
{
  #pragma omp critical
  A[i] = rand();
  A[i] *= 2;
}

// The OpenACC version refactors instead: generation moves into a
// race-free parallelRand(), leaving a truly parallel loop.
parallelRand(A);
#pragma acc parallel loop
for (i=0; i<N; i++)
{
  A[i] *= 2;
}
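parallelRand stands for the refactoring the slide calls for: a generator whose output depends only on the element index, so no critical section is needed. A minimal sketch, assuming a simple counter-based hash is an acceptable stand-in for rand() (the function name and hash constants are illustrative):

void parallelRand(int *restrict A, int N) {
  #pragma acc parallel loop
  for (int i = 0; i < N; i++) {
    // Each element is a pure function of its index: race free by construction.
    unsigned int x = (unsigned int)i * 2654435761u; // Knuth multiplicative hash
    x ^= x >> 15;
    x *= 2246822519u;
    x ^= x >> 13;
    A[i] = (int)(x & 0x7fffffff);
  }
}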
23
Portability Challenges
24
How to Write Portable Code (OMP)
#ifdef GPU
#pragma omp target teams distribute parallel for reduction(max:error) \
        collapse(2) schedule(static,1)
#elif defined(CPU)
#pragma omp parallel for reduction(max:error)
#elif defined(SOMETHING_ELSE)
#pragma omp …
#endif
for( int j = 1; j < n-1; j++)
{
#if defined(CPU) && defined(USE_SIMD)
#pragma omp simd
#endif
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
Ifdefs can be used to
choose particular
directives per device
at compile-time
25
How to Write Portable Code (OMP)
#pragma omp \
#ifdef GPU
  target teams distribute \
#endif
  parallel for reduction(max:error) \
#ifdef GPU
  collapse(2) schedule(static,1)
#endif
for( int j = 1; j < n-1; j++)
{
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
Creative ifdefs might
clean up the code, but
still one target at a
time.
26
How to Write Portable Code (OMP)
usegpu = 1;
#pragma omp target teams distribute parallel for reduction(max:error) \
#ifdef GPU
  collapse(2) schedule(static,1) \
#endif
  if(target:usegpu)
for( int j = 1; j < n-1; j++)
{
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
The OpenMP if clause
can help some too (4.5
improves this).
Note: This example
assumes that a compiler
will choose to generate 1
team when not in a target,
making it the same as a
standard “parallel for.”
27
How to Write Portable Code (ACC)
#pragma acc kernels
{
  for( int j = 1; j < n-1; j++)
  {
    for( int i = 1; i < m-1; i++ )
    {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                          + A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
  }
}
Developer presents the
desire to parallelize to
the compiler, compiler
handles the rest.
28
How to Write Portable Code (ACC)
#pragma acc parallel loop reduction(max:error)
for( int j = 1; j < n-1; j++)
{
  #pragma acc loop reduction(max:error)
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
Developer asserts the
parallelism of the loops
to the compiler,
compiler makes
decision about
scheduling.
29
Execution Time (Smaller is Better)
[Bar chart: execution time in seconds; bar labels give speedup relative to the original code. NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @ 3.00GHz. See GTC16 S6510 for additional information.]
• Original: 1.00X
• CPU Threaded: 5.12X
• GPU Threaded: 0.12X (893 seconds)
• GPU Teams: 1.01X
• GPU Split: 2.47X
• GPU Collapse: 0.96X
• GPU Split Sched: 4.00X
• GPU Collapse Sched: 17.42X
• OpenACC (CPU): 4.96X, same directives (unoptimized)
• OpenACC (GPU): 23.03X, same directives (unoptimized)
30
Compiler Portability (CPU)
OpenMP
Numerous well-tested implementations
• PGI, IBM, Intel, GCC, Cray, …
OpenACC
CPU implementations beginning to emerge
• x86: PGI
• ARM: PathScale
• Power: coming soon
31
Compiler Portability (Offload)
OpenMP
Few mature implementations
• Intel (Phi)
• Cray (GPU, Phi?)
• GCC (Phi; GPUs in development)
• Clang (multiple targets in development)
OpenACC
Multiple mature implementations
• PGI (NVIDIA & AMD)
• PathScale (NVIDIA & AMD)
• Cray (NVIDIA)
• GCC (in development)
32
Conclusions
33
Conclusions
OpenMP & OpenACC, while similar, are still quite different in their approach.
Each approach has clear trade-offs, with no clear “winner.”
It should be possible to translate between the two, but the process may not be automatic.