4. Real HPC Users
Most Used Applications in HECToR
Application          % of total jobs   Language   Prog. Model
VASP                 17%               Fortran    MPI+OpenMP
CP2K                 7%                Fortran    MPI+OpenMP
Unified Model (UM)   7%                Fortran    MPI
GROMACS              4%                C++        MPI+OpenMP
Large code bases
Complex algorithms
A mixture of different Fortran flavours
10. Recompiling the code is no longer enough to keep improving performance
11. Porting Applications To New Architectures
Programming CUDA (Host Code)
float a_host[n], b_host[n];
float *a, *b;                /* device pointers */
float c = 2.0f;              /* scalar kernel argument */
// Allocate device memory
cudaMalloc((void**)&a, n * sizeof(float));
cudaMalloc((void**)&b, n * sizeof(float));
// Transfer input data to the device
cudaMemcpy(a, a_host, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(b, b_host, n * sizeof(float), cudaMemcpyHostToDevice);
// Define the grid shape
int blocks = 100;
int threads = 128;
// Execute the kernel
kernel<<<blocks, threads>>>(a, b, c);
// Copy the result back (the kernel writes to b)
cudaMemcpy(b_host, b, n * sizeof(float), cudaMemcpyDeviceToHost);
// Clean up
cudaFree(a);
cudaFree(b);
12. Porting Applications To New Architectures
Programming CUDA (Kernel Source)
// Kernel code
__global__ void kernel(float *a, float *b, float c)
{
    // Get the global index of this thread
    unsigned int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    // Do the computation
    b[index] = a[index] * c;
    // Wait for all threads in the block to reach this point
    __syncthreads();
}
14. Why not use directive-based approaches for
these new heterogeneous architectures?
15. Overview of Our Work
We can’t solve problems by using the same kind of
thinking we used when we created them.
Albert Einstein
The field is undergoing rapid changes: we have to adapt to
them
1. Hybrid MPI+OpenMP (2008)
→ Usage of directives in cluster environments
2. OpenMP extensions (2009)
→ Extensions of OpenMP/La Laguna C (llc) for
heterogeneous architectures
3. Directives for accelerators (2011)
→ Specific accelerator-oriented directives
→ OpenACC (December 2011)
16. Outline
Hybrid MPI+OpenMP
llc and llCoMP
Hybrid llCoMP
Computational Results
Technical Drawbacks
OpenMP-to-GPU
Directives for Accelerators
Conclusions
Future Work and Final Remarks
17. La Laguna C: llc
What it is
Directive-based approach to distributed-memory environments
OpenMP compatible
Additional set of extensions to address particular features
Implements FORALL loops, pipelines, farms, . . . (a minimal sketch follows)
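Since llc is OpenMP compatible, a plain OpenMP loop is already valid llc input; a minimal sketch (illustrative, not taken from [48]):

// Minimal llc input: standard OpenMP annotations are accepted
// unchanged and retargeted to distributed memory (illustrative).
#pragma omp parallel for shared(a, b, n) private(i)
for (i = 0; i < n; i++)
    a[i] = a[i] + b[i];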
Reference
[48] Dorta, A. J. Extensión del modelo de OpenMP a memoria distribuida [Extension of the OpenMP model to distributed memory]. PhD Thesis, Universidad de La Laguna, December 2008.
19. A Hybrid OpenMP+MPI Implementation
Same llc code, extended llCoMP implementation
Directives are replaced by a set of parallel patterns
Improved performance on multicore systems
→ Better usage of inter-core memories (i.e., caches)
→ Lower memory requirements than replicating memory across MPI processes
[Figure: translation scheme]
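As an illustration, the kind of hybrid pattern llCoMP targets can be sketched by hand as follows (illustrative code, not actual compiler output; the even block distribution is an assumption):

#include <mpi.h>
#include <omp.h>

/* Hand-written sketch of the hybrid pattern: MPI distributes blocks of
   iterations across processes, OpenMP parallelizes each local block. */
void scale_add(float *a, const float *b, float c, int n)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int chunk = n / nprocs;        /* assume nprocs divides n evenly */
    int lo = rank * chunk;
    #pragma omp parallel for
    for (int i = lo; i < lo + chunk; i++)
        a[i] = a[i] + c * b[i];
    /* each process now owns an updated block; blocks would then be
       exchanged, e.g. with MPI_Allgather */
}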
22. Technical Drawbacks
llCoMP
The original design of the llCoMP source-to-source (StS) compiler was not flexible enough
Traditional two-pass compiler
Excessive effort to implement new features
More advanced features were needed to implement GPU code generation
24. Outline
Hybrid MPI+OpenMP
OpenMP-to-GPU
Related Work
Yet Another Compiler Framework (YaCF)
Computational Results
Technical Drawbacks
Directives for Accelerators
Conclusions
Future Work and Final Remarks
26. Related Work
Other OpenMP-to-GPU translators: OpenMPC
[82] Lee, S., and Eigenmann, R. OpenMPC: Extended OpenMP programming
and tuning for GPUs. In SC’10: Proceedings of the 2010 ACM/IEEE conference
on Supercomputing. IEEE Computer Society, pp. 1–11.
Other Compiler Frameworks: Cetus, LLVM
[84] Lee, S., Johnson, T. A., and Eigenmann, R. Cetus – an extensible compiler infrastructure for source-to-source transformation. In Languages and Compilers for Parallel Computing, 16th Intl. Workshop, College Station, TX, USA, volume 2958 of LNCS (2003), pp. 539-553.
[81] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO'04. IEEE Computer Society, pp. 75-86.
27. YaCF: Yet Another Compiler Framework
Application programmer writes llc code
Focus on data and algorithm
Architecture independent
Only needs to specify where the parallelism is
System engineer writes template code
Focus on non-functional code
Can reuse code from different patterns (e.g., via inheritance)
29. Main Software Design Patterns
Implementing search and replacement in the IR
Filter: looks for a specific pattern in the IR
→ e.g., finds a #pragma omp parallel construct
Mutator: looks for a node and transforms the IR
→ e.g., applies loop transformations (nesting, flattening, . . . )
→ e.g., replaces a #pragma omp for by a CUDA kernel call
Filters and mutators can be composed to solve more complex problems, as the sketch below illustrates
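A hypothetical C rendering of the two roles (YaCF itself builds on a dynamic language, so all names and structures here are illustrative):

#include <stddef.h>
#include <string.h>

/* Illustrative IR node: a tagged tree with siblings and children. */
typedef struct Node Node;
struct Node {
    const char *kind;      /* e.g. "omp_parallel", "omp_for" */
    Node *child;
    Node *next;
};

/* Filter: walk the IR and return the first node matching a kind. */
Node *filter_find(Node *n, const char *kind)
{
    for (; n != NULL; n = n->next) {
        if (strcmp(n->kind, kind) == 0)
            return n;
        Node *hit = filter_find(n->child, kind);
        if (hit != NULL)
            return hit;
    }
    return NULL;
}

/* Mutator: locate a node via a filter and transform the IR in place,
   e.g. turn an "omp_for" node into a CUDA kernel call node. */
void mutate(Node *root)
{
    Node *target = filter_find(root, "omp_for");
    if (target != NULL)
        target->kind = "cuda_kernel_call";
}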
30. Dynamic Language and Tools
Key Idea: Features Should Require Only a Few Lines of Code
32. CUDA Back-end
Generates a CUDA kernel and memory transfers from the
information obtained during the analysis
Supported syntax
parallel, for, and their combined form are implemented
New directives to support manual optimizations (e.g., interchange)
Syntax taken from an OpenMP proposal by BSC, UJI and others (#pragma omp target)
copy_in, copy_out enable users to provide memory-transfer information
Generated code is human-readable
33. Example
Update Loop from the Molecular Dynamics Code
...
#pragma omp target device(cuda) copy(pos, vel, f) copy_out(a)
#pragma omp parallel for default(shared) private(i, j) firstprivate(rmass, dt)
for (i = 0; i < np; i++) {
    for (j = 0; j < nd; j++) {
        pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
        vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
        a[i][j] = f[i][j] * rmass;
    }
}
40. Technical Drawbacks
Limited to Compile-time Optimizations
Some features require runtime information
→ Kernel grid configuration
Orphaned directives were not supported
→ They would require an inter-procedural analysis module
Some templates were too complex
→ And would need to be replicated to support OpenCL
44. Related Work (I)
hiCUDA
Translates each directive into a CUDA call
It is able to use the GPU Shared Memory
Only works with NVIDIA devices
The programmer still needs to know hardware details
Code Example:
...
#pragma hicuda global alloc c[*][*] copyin

#pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
#pragma hicuda loop_partition over_tblock over_thread
for (i = 0; i < N; i++) {
    #pragma hicuda loop_partition over_tblock over_thread
    for (j = 0; j < N; j++) {
        double sum = 0.0;
        ...
45. Related Work (II)
PGI Accelerator Model
Higher-level (directive-based) approach
Fortran and C are supported
Code Example:
#pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
{
    #pragma acc region
    for (j = 0; j < n; j++)
        for (i = 0; i < l; i++) {
            double sum = 0.0;
            for (k = 0; k < m; k++)
                sum += b[i + k * l] * c[k + j * m];
            a[i + j * l] = sum;
        }
}
46. Our Ongoing Work at that Time: llcl
Extending llc with support for heterogeneous platforms
Compiler + Runtime implementation
→ The Compiler generates runtime code
→ The Runtime handles memory coherence and drives
execution
Compiler optimizations directed by an XML file
A more generic, higher-level approach, not tied to GPUs
47. llcl: Directives
double *a, *b, *c;
...
#pragma llc context name("mxm") copy_in(a[n * l], b[l * m], c[m * n], l, m, n) copy_out(a[n * l])
{
    int i, j, k;
    #pragma llc for shared(a, b, c, l, m, n) private(i, j, k)
    for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
            a[i + j * l] = 0.0;
            for (k = 0; k < m; k++)
                a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
}
...
51. OpenACC: Directives
double *a, *b, *c;
...
#pragma acc data copyin(a[n * l], b[l * m], c[m * n], l, m, n) copyout(a[n * l])
{
    int i, j, k;
    #pragma acc kernels loop private(i, j, k)
    for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
            a[i + j * l] = 0.0;
            for (k = 0; k < m; k++)
                a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
}
...
52. Related Work
OpenACC Implementations (After Announcement)
PGI - Released in February 2012
CAPS - Released in March 2012
Cray - To be released
→ Access to beta release available
We had a first experimental implementation in January 2012
53. accULL: Our OpenACC Implementation
accULL = YaCF + Frangollo
It is a two-layer implementation:
Compiler + Runtime Library
54. Frangollo: the Runtime
Implementation
Lightweight
Standard C++ and STL code
CUDA component written using the CUDA Driver API
OpenCL component written using the C OpenCL interface
Experimental features can be enabled/disabled at compile time
Handles
1. Device discovery, initialization, . . .
2. Memory coherence (registered variables)
3. Kernel execution management (including grid shape)
56. Memory Management
// Create a context to handle memory coherence
ctxt_id = FRG__createContext("name", ...);
...
// Register a variable within the context
FRG__registerVar(ctxt_id, &ptr, offset, size, constraints, ...);
...
// Execute the kernel
FRG__kernelLaunch(ctxt_id, "kernel", param_list, ...);
...
// Finish the context and reconcile variables
FRG__destroyContext(ctxt_id);
57. Kernel Execution
Loading the kernel
A context may have from zero to N named kernels associated with it
The runtime keeps different versions of each kernel
The version to load is chosen according to the platform where execution takes place
Grid shape
Grid shape is estimated using the compute intensity (CI): CI = Nmem / (Cost × Nflops)
→ e.g., Fermi: 512 GFlop/s peak DP, 144 GB/s memory bandwidth, Cost ≈ 3.5
Low CI → favors memory accesses
High CI → favors computation
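A sketch of how such an estimate could drive the grid-shape choice (the helper, threshold, and block sizes are hypothetical, not Frangollo's actual heuristic); note that for the Fermi numbers above 512/144 ≈ 3.5, which suggests Cost is the ratio of peak GFlop/s to memory bandwidth:

/* Hypothetical sketch of a CI-driven block-size choice;
   not the actual Frangollo heuristic. */
double compute_intensity(double n_mem, double n_flops, double cost)
{
    return n_mem / (cost * n_flops);   /* CI = Nmem / (Cost x Nflops) */
}

int threads_per_block(double ci)
{
    /* Illustrative threshold and sizes only: bias the grid shape
       differently for memory-dominated and compute-dominated kernels. */
    return (ci > 1.0) ? 256 : 128;
}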
58. Implementing OpenACC
Putting it all together
1. The compiler driver generates Frangollo interface calls from OpenACC directives (sketched below)
→ Converts data region directives into context creation
→ Generates host and device synchronization
2. Extracts the kernel code
3. Frangollo implements the OpenACC API calls
→ acc_init, acc_malloc/acc_free
4. Implements some optimizations
→ Compiler: loop-invariant code motion, skewing, strip-mining, interchange
→ Kernel extraction: divergence reduction, basic data-dependency analysis
→ Runtime: grid shape estimation, optimized reduction kernels
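For illustration, a data region enclosing a kernels loop (as on slide 51) lowers roughly to the runtime API shown on slide 56; this is a sketch, not actual compiler output, with argument lists abbreviated as before:

/* Sketch of the lowering; elided arguments are kept as "..." */
ctxt_id = FRG__createContext("mxm", ...);                         /* data region  */
FRG__registerVar(ctxt_id, &a, offset, size_a, constraints, ...);  /* data clauses */
FRG__registerVar(ctxt_id, &b, offset, size_b, constraints, ...);
FRG__kernelLaunch(ctxt_id, "mxm_kernel", param_list, ...);        /* kernels loop */
FRG__destroyContext(ctxt_id);            /* reconcile variables with the host */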
65. M×M (II)
#pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N])
{
    #pragma acc kernels loop private(i)
    for (i = 0; i < L; i++)
        #pragma acc loop private(j)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterate over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterate inside a block */
                #pragma acc kernels loop private(j)
                for (j = jj; j < min(N, jj+tile_size); j++)
                    #pragma acc loop private(i)
                    for (i = ii; i < min(L, ii+tile_size); i++)
                        for (k = kk; k < min(M, kk+tile_size); k++)
                            a[i * L + j] += (b[i * L + k] * c[k * M + j]);
            }
}
66. M×M (III)
#pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N] ...)
{
    #pragma acc kernels loop private(i) gang(32)
    for (i = 0; i < L; i++)
        #pragma acc loop private(j) worker(32)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterate over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterate inside a block */
                #pragma acc kernels loop private(j) gang(32)
                for (j = jj; j < min(N, jj+tile_size); j++)
                    #pragma acc loop private(i) worker(32)
                    for (i = ii; i < min(L, ii+tile_size); i++)
                        for (k = kk; k < min(M, kk+tile_size); k++)
                            a[i*L+j] += (b[i*L+k] * c[k*M+j]);
            }
}
67. About Grid Shape and Loop Scheduling Clauses
Optimal gang/worker (i.e., grid shape) values vary:
Among OpenACC implementations
Among platforms (Fermi vs. Kepler, NVIDIA vs. ATI)
What happens if we target a non-GPU accelerator?
Our implementation ignores gang/worker and leaves the decision to the runtime
→ Users can influence the decision through an environment variable
It is possible to enable the gang/worker clauses in our implementation
→ Gang/worker values feed a strip-mining transformation that forces the block/thread shape (work in progress), as sketched below
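Strip-mining itself is a standard loop transformation; as an illustrative sketch (not accULL's generated code), a strip of 32 splits a loop so that the outer loop can be mapped to gangs (blocks) and the inner one to workers (threads):

/* Original loop */
for (i = 0; i < n; i++)
    a[i] = b[i] * c;

/* Strip-mined with a strip size of 32 */
for (ii = 0; ii < n; ii += 32)
    for (i = ii; i < min(n, ii + 32); i++)
        a[i] = b[i] * c;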
70. Needleman-Wunsch (NW)
NW is a nonlinear global optimization method for DNA
sequence alignments
The potential pairs of sequences are organized in a 2D matrix
The method uses dynamic programming to find the optimal alignment
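The underlying recurrence is the classic one; a minimal C sketch (the names F, gap, s() and max3() are illustrative, not the benchmark code):

/* Fill the DP score matrix: each cell depends only on its
   upper-left, upper and left neighbours. */
for (i = 1; i <= len_a; i++)
    for (j = 1; j <= len_b; j++) {
        int diag = F[i-1][j-1] + s(seq_a[i-1], seq_b[j-1]); /* (mis)match */
        int up   = F[i-1][j] + gap;                         /* gap in b */
        int left = F[i][j-1] + gap;                         /* gap in a */
        F[i][j]  = max3(diag, up, left);
    }

Cells on the same anti-diagonal are independent, so the matrix can be processed in wavefronts; this is what makes NW amenable to GPU parallelization.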
74. Directive-based Programming
Support for accelerators in the OpenMP standard may be
added in the future
→ In the meantime, OpenACC can be used to port codes to
GPUs
→ It is possible to combine OpenACC with OpenMP
Generated code does not always match native-code performance
→ But it reduces the development effort while providing reasonable performance
accULL is an interesting research-oriented implementation of
OpenACC
→ First non-commercial OpenACC implementation
→ It is a flexible framework to explore optimizations, new
platforms, . . .
77. accULL Still Has Some Opportunities
Study support for multiple devices (either transparently or in
OpenACC)
Design an MPI component for the runtime
Integration with other projects
Improve the performance of the generated code (e.g., using polyhedral models)
Enhance the support for Extrae/Paraver (experimental tracing
already built-in)
78. Re-use our Know-how
Integrate OpenACC and OmpSs?
The current OmpSs implementation does not automatically generate kernel code
Integrating OpenACC syntax within tasks would enable automatic code generation
Improve portability in accelerator platforms
Leverage development effort
79. Contributions
Reyes, R. and de Sande, F. Automatic code generation for GPUs in
llc. The Journal of Supercomputing 58, 3 (Mar. 2011), pp.
349-356.
Reyes, R. and de Sande, F. Optimization strategies in different CUDA architectures using llCoMP. Microprocessors and Microsystems - Embedded Hardware Design 36, 2 (Mar. 2012), pp. 78-87.
Reyes, R., Fumero, J. J., López, I. and de Sande, F. accULL: an OpenACC implementation with CUDA and OpenCL support. In Euro-Par 2012 Parallel Processing - 18th International Conference, vol. 7484 of LNCS, pp. 871-882.
Reyes, R., Fumero, J. J., López, I. and de Sande, F. A Preliminary Evaluation of OpenACC Implementations. The Journal of Supercomputing (in press).
80. Other contributions
accULL has been released as an Open Source Project
→ http://cap.pcg.ull.es/accull
accULL is currently being evaluated by VectorFabrics
Provided feedback to CAPS, which appears to have been incorporated into their current version
Contacted by members of the OpenACC committee
Two HPC-Europa2 visits by master students from our team
81. Acknowledgements
Spanish MEC
Plan Nacional de I+D+i, contracts TIN2008-06570-C04-03
and TIN2011-24598
Canary Islands Government ACIISI
Contract SolSubC200801000285
TEXT Project (FP7-261580)
HPC-EUROPA2 (project number: 228398)
Universitat Jaume I de Castellón
Universidad de La Laguna
All members of GCAP