PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe
1. OpenACC on AMD GPUs and APUs with the PGI Accelerator Compilers
Michael Wolfe
Michael.Wolfe@pgroup.com
http://www.pgroup.com
APU13
San Jose, November, 2013
10. Accelerator Architecture Features
Potentially separate memory (relatively small)
High bandwidth memory interface
Many degrees of parallelism
MIMD parallelism across many cores
SIMD parallelism within a core
Multithreading for latency tolerance
Asynchronous with host
Performance from Parallelism
slower clock, less ILP, simpler control unit, smaller caches
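The features above can be related to OpenACC loop directives. The following is a minimal sketch, not taken from the slides: the function name `scale_rows` and the matrix layout are hypothetical. It assumes the outer loop is spread across cores with `gang` (MIMD), the inner loop across lanes with `vector` (SIMD), and that the region is launched asynchronously so the host can continue until the `wait`. A compiler without OpenACC support ignores the pragmas and runs the loops sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: multiply every element of an nrows x ncols
 * row-major matrix by s. */
void scale_rows(float *a, int nrows, int ncols, float s)
{
    assert(a != NULL && nrows > 0 && ncols > 0);

    /* gang: rows are distributed across cores (MIMD parallelism).
     * copy: the matrix moves to the (potentially separate) device memory
     *       and back again at the end of the region.
     * async(1): the host does not block while the device works. */
    #pragma acc parallel loop gang copy(a[0:nrows*ncols]) async(1)
    for (int i = 0; i < nrows; ++i) {
        /* vector: elements of one row are processed in SIMD fashion. */
        #pragma acc loop vector
        for (int j = 0; j < ncols; ++j)
            a[i * ncols + j] *= s;
    }

    /* Independent host work could be placed here. */

    /* Block until everything queued on async queue 1 has finished. */
    #pragma acc wait(1)
}
```

How gangs and vector lanes are actually mapped to the hardware is left to the compiler, so the clauses here are hints about the loop structure rather than a fixed schedule.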
11. OpenACC
Open Programming Standard for Parallel Computing
“PGI OpenACC will enable programmers to easily develop portable applications that
maximize the performance and power efficiency benefits of the hybrid CPU/GPU
architecture of Titan.”
--Buddy Bland, Titan Project Director, Oak Ridge National Lab
“OpenACC is a technically impressive initiative brought together by members of the
OpenMP Working Group on Accelerators, as well as many others. We look forward to
releasing a version of this proposal in the next release of OpenMP.”
--Michael Wong, CEO OpenMP Directives Board
12. OpenACC Overview
Directive-based
Parallel Computation
Data Management
#pragma acc data copyin( a[0:n] ) copy( b[0:n] ) create( tmp[0:n] )
{
    for( int i = 0; i < iters; ++i ){
        relax( a, b, tmp, n );
        relax( b, a, tmp, n );
    }
}

void relax( float *x, float *y, float *t, int n ){
    #pragma acc data present( x[0:n], y[0:n], t[0:n] )
    {
        #pragma acc parallel loop
        for( int j = 0; j < n; ++j )
            t[j] = x[j];
        #pragma acc parallel loop
        for( int j = 1; j < n-1; ++j )
            x[j] = 0.25f*( t[j-1] + t[j+1] +
                           y[n-j+1] + y[n-j-1] );
    }
}
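To see how the data region and the `present` clause fit together in a complete translation unit, here is a hedged, self-contained sketch. The names `relax_sketch` and `run_relax_demo` are hypothetical, and the stencil is a simplified stand-in that reads `y[j-1]` and `y[j+1]` so that every index stays within `[0, n)`; it is not the slide's exact expression. Without OpenACC enabled, the pragmas are ignored and the code runs on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for relax(): copy x into t, then smooth the
 * interior of x. The arrays are assumed to be on the device already. */
static void relax_sketch(float *x, const float *y, float *t, int n)
{
    #pragma acc data present(x[0:n], y[0:n], t[0:n])
    {
        #pragma acc parallel loop
        for (int j = 0; j < n; ++j)
            t[j] = x[j];

        #pragma acc parallel loop
        for (int j = 1; j < n - 1; ++j)
            x[j] = 0.25f * (t[j-1] + t[j+1] + y[j-1] + y[j+1]);
    }
}

/* Runs iters double sweeps and returns b[n/2],
 * or -1.0f if an allocation fails. */
float run_relax_demo(int n, int iters)
{
    assert(n >= 3 && iters >= 0);

    float *a   = malloc(n * sizeof *a);
    float *b   = malloc(n * sizeof *b);
    float *tmp = malloc(n * sizeof *tmp);
    if (!a || !b || !tmp) {
        free(a); free(b); free(tmp);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) {
        a[i] = 1.0f;
        b[i] = 1.0f;
    }

    /* a: host-to-device only; b: copied in and back out;
     * tmp: device scratch space that is never transferred. */
    #pragma acc data copyin(a[0:n]) copy(b[0:n]) create(tmp[0:n])
    {
        for (int i = 0; i < iters; ++i) {
            relax_sketch(a, b, tmp, n);
            relax_sketch(b, a, tmp, n);
        }
    }

    float mid = b[n / 2];   /* b was copied back at the end of the region */
    free(a); free(b); free(tmp);
    return mid;
}
```

Because the data region encloses the whole iteration loop, the arrays would be transferred once rather than on every call; inside the callee, `present` only asserts that they are already on the device.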
13. OpenACC compared to OpenMP
OpenACC                     OpenMP
Data parallelism            Thread parallelism
Parallel per region         Fixed number of threads
Flexible || mapping         Fixed || thread mapping
Structured parallelism      Tasks and loops
Performance portability     ?
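The comparison can be made concrete with one loop written both ways. This is an illustrative sketch only: the function names `saxpy_acc` and `saxpy_omp` are hypothetical, and the OpenMP version uses the classic host `parallel for` construct rather than any accelerator extension. Built without OpenACC or OpenMP enabled, both functions reduce to the same sequential loop.

```c
#include <assert.h>

/* OpenACC: describes a data-parallel loop plus its data movement;
 * the compiler chooses how iterations map to the hardware. */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* OpenMP: iterations are divided among a team of host threads
 * that share the host memory, so no data clauses are needed. */
void saxpy_omp(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The loop bodies are identical; the difference lies in what the directive promises (data parallelism with explicit data movement versus a team of threads over shared memory).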
14. PGI OpenACC Implementation
C, C++, Fortran
pgcc, pgc++, pgfortran
Command line options
-acc
-ta=radeon
-ta=radeon,host
-ta=radeon,nvidia
Planner
maps program ||ism to hardware ||ism
Code Generator
OpenCL API
Runtime
initialization
data management
kernel launches
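As a small illustration of the runtime layer, the sketch below queries how many accelerator devices the OpenACC runtime reports. The function name `accelerator_count` is hypothetical; `acc_get_num_devices` and `acc_device_not_host` come from the standard OpenACC runtime API in `openacc.h`. The `_OPENACC` guard is an assumption made here so that the file also builds with a compiler that has no OpenACC support, in which case the function simply returns 0.

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>   /* standard OpenACC runtime API header */
#endif

/* Returns the number of non-host accelerator devices visible to the
 * OpenACC runtime, or 0 when built without OpenACC enabled.
 * Assumed build line, combining options listed on the slide:
 *     pgcc -acc -ta=radeon file.c                                  */
int accelerator_count(void)
{
    int count = 0;
#ifdef _OPENACC
    count = acc_get_num_devices(acc_device_not_host);
#endif
    assert(count >= 0);
    return count;
}
```

A caller could use the result to decide whether to take an accelerated code path or fall back to the host.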
21. OpenACC on AMD GPUs and APUs
OpenACC is designed for performance portability
PGI Accelerator compilers provide evidence
Target-specific tuning still underway
Open Beta compilers available now
Product version in January 2014