Directive-based approach to heterogeneous
computing

  Ruyman Reyes Castro


  High Performance Computing Group
  University of La Laguna


  December 19, 2012
TOP500 Performance Development List




                                      2 / 83
Applications Used in HPC Centers

Usage of HECToR by Area of Expertise




                                        3 / 83
Real HPC Users



Most Used Applications in HECToR
        Application     % of total jobs   Language   Prog. Model
          VASP               17%           Fortran   MPI+OpenMP
          CP2K                7%           Fortran   MPI+OpenMP
    Unified Model (UM)         7%           Fortran       MPI
        GROMACS               4%            C++      MPI+OpenMP


   Large code-bases
   Complex algorithms implemented
   Mixture of different Fortran flavours




                                                                   4 / 83
Knowledge of Programming
Survey conducted in the Swiss National Supercomputing Centre
(2011)




                                                               5 / 83
Are application developers using
       the proper tools?




                                   6 / 83
Complexity Arises (I)




                        7 / 83
Directives: Enhancing Legacy Code (I)



     OpenMP Example
 1    ...
 2   #pragma omp parallel for default(shared) private(i, j)
          firstprivate(rmass, dt)
 3    for (i = 0; i < np; i++) {
 4       for (j = 0; j < nd; j++) {
 5         pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5 * dt*dt*a[i][j];
 6         vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
 7         a[i][j] = f[i][j] * rmass;
 8       }
 9     }
10    ...




                                                                         8 / 83
Complexity Arises (II)




                         9 / 83
Re-compiling the code is no longer enough to
    continue improving the performance




                                               10 / 83
Porting Applications To New Architectures

     Programming CUDA (Host Code)
 1   float a_host[n], b_host[n];
 2   float *a, *b;     // device pointers
 3   float c = 2.0f;   // scalar kernel argument (example value)
 4   // Allocate
 5   cudaMalloc((void**)&a, n * sizeof(float));
 6   cudaMalloc((void**)&b, n * sizeof(float));
 7   // Transfer
 8   cudaMemcpy(a, a_host, n * sizeof(float), cudaMemcpyHostToDevice);
 9   cudaMemcpy(b, b_host, n * sizeof(float), cudaMemcpyHostToDevice);
10   // Define grid shape
11   int blocks = 100;
12   int threads = 128;
13   // Execute
14   kernel<<<blocks,threads>>>(a, b, c);
15   // Copy back the result (the kernel writes b)
16   cudaMemcpy(b_host, b, n * sizeof(float), cudaMemcpyDeviceToHost);
17   // Clean
18   cudaFree(a);
19   cudaFree(b);

                                                                         11 / 83
Porting Applications To New Architectures



     Programming CUDA (Kernel Source)
 1   // Kernel code
 2   __global__ void kernel(float *a, float *b, float c)
 3   {
 4    // Get the index of this thread
 5    unsigned int index = (blockIdx.x * blockDim.x) + threadIdx.x;
 6    // Do the computation
 7    b[index] = a[index] * c;
 8    // Wait for all threads in the block to finish
 9    __syncthreads();
10   }




                                                                      12 / 83
Programmers need faster ways to migrate existing code




                                                        13 / 83
Why not use directive-based approaches for
 these new heterogeneous architectures?




                                             14 / 83
Overview of Our Work

    We can’t solve problems by using the same kind of
    thinking we used when we created them.
                                                    Albert Einstein

The field is undergoing rapid changes: we have to adapt to
them
 1. Hybrid MPI+OpenMP (2008)
    → Usage of directives in cluster environments
 2. OpenMP extensions (2009)
    → Extensions of OpenMP/La Laguna C (llc) for
    heterogeneous architectures
 3. Directives for accelerators (2011)
    → Specific accelerator-oriented directives
    → OpenACC (December 2011)

                                                                      15 / 83
Outline


Hybrid MPI+OpenMP
   llc and llCoMP
   Hybrid llCoMP
   Computational Results
   Technical Drawbacks

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks
La Laguna C: llc


What is
    Directive-based approach to distributed memory environments
    OpenMP compatible
    Additional set of extensions to address particular features
    Implemented FORALL loops, Pipelines, Farms . . .




Reference
[48] Dorta, A. J. Extensión del modelo de OpenMP a memoria
distribuida. PhD Thesis, Universidad de La Laguna, December 2008.



                                                                    17 / 83
Chronological Perspective (Late 2008)



Cores per Socket - System Share   Accelerator - System Share




                                                               18 / 83
A Hybrid OpenMP+MPI Implementation

Same llc code, extended llCoMP implementation
    Directives are replaced by a set of parallel patterns
    Improved performance on multicore systems
    → Better usage of inter-core memories (i.e., cache)
    → Lower memory requirements than pure MPI with replicated
    memory

Translation (sketched after this slide)




                                                               19 / 83
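A minimal sketch of the kind of hybrid translation described above. This is not the actual llCoMP output: it assumes a plain block distribution of a FORALL loop across MPI processes, OpenMP threads inside each process, n a multiple of the number of processes, and a replicated result vector gathered with MPI_Allgather. The names (forall_hybrid, v, n) are illustrative only.

    #include <mpi.h>

    /* Hypothetical hybrid translation of a FORALL loop:
     *   #pragma omp parallel for
     *   for (i = 0; i < n; i++) v[i] = 2.0 * v[i];           */
    void forall_hybrid(double *v, int n)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = n / size;       /* block size per process  */
        int start = rank * chunk;   /* first iteration owned   */

        /* Each process executes its block with OpenMP threads */
        #pragma omp parallel for
        for (int i = start; i < start + chunk; i++)
            v[i] = 2.0 * v[i];

        /* Exchange the computed blocks so that every process
         * ends up with the whole (replicated) vector          */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      v, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }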
llc Code Example

     llc Implementation of the Mandelbrot Set Computation
 1    ...
 2   #pragma omp parallel for default(shared) reduction(+:numoutside)
          private(i, j, ztemp, z) shared(nt, c)
 3   #pragma llc reduction_type (int)
 4       for(i = 0; i < npoints; i++) {
 5         z.creal = c[i].creal; z.cimag = c[i].cimag;
 6         for (j = 0; j < MAXITER; j++) {
 7           ztemp = (z.creal*z.creal) - (z.cimag*z.cimag)+c[i].creal;
 8           z.cimag = z.creal * z.cimag * 2 + c[i].cimag;
 9           z.creal = ztemp;
10           if (z.creal * z.creal + z.cimag * z.cimag > THRESOLD) {
11             numoutside++;
12             break;
13           }
14         }
15    ...


                                                                         20 / 83
Hybrid MPI+OpenMP performance




                                21 / 83
Technical Drawbacks




llCoMP
   The original source-to-source (StS) design of llCoMP was not
   flexible enough
   Traditional two-pass compiler
   Excessive effort to implement new features
   More advanced features were needed to implement GPU code
   generation




                                                              22 / 83
Back to the Drawing Board




                            23 / 83
Outline


Hybrid MPI+OpenMP

OpenMP-to-GPU
  Related Work
  Yet Another Compiler Framework (YaCF)
  Computational Results
  Technical Drawbacks

Directives for Accelerators

Conclusions

Future Work and Final Remarks
Chronological Perspective (Late 2009)



Cores per Socket - System Share   Accelerator - System Share




                                                               25 / 83
Related Work


Other OpenMP-to-GPU translators: OpenMPC
[82] Lee, S., and Eigenmann, R. OpenMPC: Extended OpenMP programming
and tuning for GPUs. In SC’10: Proceedings of the 2010 ACM/IEEE conference
on Supercomputing. IEEE Computer Society, pp. 1–11.

Other Compiler Frameworks: Cetus, LLVM
[84] Lee, S., Johnson., T. A. and Eigenmann, R. Cetus – an extensible compiler
infrastructure for source-to-source transformation. In Languages and Compilers
for Parallel Computing, 16th Intl. Workshop, College Station, TX, USA, volume
2958 of LNCS(2003), pp. 539-553.

[81] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong
program analysis & transformation. In Proceedings of the international
symposium on Code generation and optimization: feedback-directed and runtime
optimization, CGO’04. IEEE Computer Society, pp. 75–47.



                                                                                 26 / 83
YaCF: Yet Another Compiler Framework



Application programmer writes llc code
    Focus on data and algorithm
    Architecture independent
    Only needs to specify where the parallelism is

System engineer writes template code
    Focus on non-functional code
    Can reuse code from different patterns (i.e., inheritance)




                                                              27 / 83
YaCF Software Architecture




                             28 / 83
Main Software Design Patterns




Implementing search and replacement in the IR
    Filter: looks for a specific pattern in the IR
    → E.g., finds a pragma omp parallel construct
    Mutator: looks for a node and transforms the IR
    → E.g., applies loop transformations (nesting, flattening, . . . )
    → E.g., replaces a pragma omp for with a CUDA kernel call
    Can be composed to solve more complex problems




                                                                      29 / 83
Dynamic Language and Tools

Key Idea: Features Should Require Only a Few Lines of Code




                                                             30 / 83
Template Patterns



    Ease back-end implementation
1   <%def name="initialization(var_list, prefix = '', suffix = '')">
2   %for var in var_list:
3     cudaMalloc((void **) (&${prefix}${var.name}${suffix}),
4                     ${var.numelems} * sizeof(${var.type}));
5     cudaMemcpy(${prefix}${var.name}${suffix}, ${var.name},
6                      ${var.numelems} * sizeof(${var.type}),
7                      cudaMemcpyHostToDevice);
8   %endfor
9   </%def>




                                                                       31 / 83
CUDA Back-end


    Generates a CUDA kernel and memory transfers from the
            information obtained during the analysis

Supported syntax
    parallel, for and their condensed form are implemented
    New directives to support manual optimizations (e.g.,
    interchange)
    Syntax taken from an OpenMP proposal by BSC, UJI and
    others (#pragma omp target)
    copy_in, copy_out clauses enable users to provide memory
    transfer information
    Generated code is human-readable

                                                                32 / 83
Example



     Update Loop from the Molecular Dynamics Code
 1    ...
 2   #pragma omp target device(cuda) copy(pos, vel, f) copy_out(a)
 3   #pragma omp parallel for default(shared) private(i, j)
          firstprivate(rmass, dt)
 4     for (i = 0; i < np; i++) {
 5       for (j = 0; j < nd; j++) {
 6         pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
 7         vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
 8         a[i][j] = f[i][j] * rmass;
 9       }
10     }




                                                                       33 / 83
Translation process




                      34 / 83
The Jacobi Iterative Method

 1   error = 0.0;
 2
 3
 4   {
 5
 6       for (i = 0; i < m; i++)
 7         for (j = 0; j < n; j++)
 8           uold[i][j] = u[i][j];
 9
10       for (i = 0; i < (m - 2); i++) {
11         for (j = 0; j < (n - 2); j++) {
12           resid = ...
13           error += resid * resid;
14         }
15       }
16   }
17   k++;
18   error = sqrt(error) / (double) (n * m);


                                               35 / 83
Jacobi OpenMP Source

 1   error = 0.0;
 2
 3   #pragma omp parallel shared(uold, u, ...) private(i, j, resid)
 4   {
 5     #pragma omp for
 6     for (i = 0; i < m; i++)
 7       for (j = 0; j < n; j++)
 8         uold[i][j] = u[i][j];
 9     #pragma omp for reduction(+:error)
10     for (i = 0; i < (m - 2); i++) {
11       for (j = 0; j < (n - 2); j++) {
12         resid = ...
13         error += resid * resid;
14       }
15     }
16   }
17   k++;
18   error = sqrt(error) / (double) (n * m);


                                                                      36 / 83
Jacobi llCoMP v1

 1   error = 0.0;
 2   #pragma omp target device(cuda)
 3   #pragma omp parallel shared(uold, u, ...) private(i, j, resid)
 4   {
 5     #pragma omp for
 6     for (i = 0; i < m; i++)
 7       for (j = 0; j < n; j++)
 8         uold[i][j] = u[i][j];
 9     #pragma omp for reduction(+:error)
10     for (i = 0; i < (m - 2); i++) {
11       for (j = 0; j < (n - 2); j++) {
12         resid = ...
13         error += resid * resid;
14       }
15     }
16   }
17   k++;
18   error = sqrt(error) / (double) (n * m);


                                                                      37 / 83
Jacobi llCoMP v2

 1   error = 0.0;
 2   #pragma omp target device(cuda) copy_in(u, f) copy_out(f)
 3   #pragma omp parallel shared(uold, u, ...) private(i, j, resid)
 4   {
 5     #pragma omp for
 6     for (i = 0; i < m; i++)
 7       for (j = 0; j < n; j++)
 8         uold[i][j] = u[i][j];
 9     #pragma omp for reduction(+:error)
10     for (i = 0; i < (m - 2); i++) {
11       for (j = 0; j < (n - 2); j++) {
12         resid = ...
13         error += resid * resid;
14       }
15     }
16   }
17   k++;
18   error = sqrt(error) / (double) (n * m);


                                                                      38 / 83
Jacobi Iterative Method




                          39 / 83
Technical Drawbacks




Limited to Compile-time Optimizations
    Some features require runtime information
    → Kernel grid configuration
    Orphaned directives were not possible
    → Would require an inter-procedural analysis module
    Some templates were too complex
    → And would need to be replicated to support OpenCL




                                                          40 / 83
Back to the Drawing Board




                            41 / 83
Outline


Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators
   Related Work
   OpenACC
   Accelerator ULL (accULL)
   Results

Conclusions

Future Work and Final Remarks
Chronological Perspective (2011)



Cores per Socket - System Share   Accelerator - System Share




                                                               43 / 83
Related Work (I)

     hiCUDA
         Translates each directive into a CUDA call
         It is able to use the GPU Shared Memory
         Only works with NVIDIA devices
         The programmer still needs to know hardware details

     Code Example:
 1   ...
 2   #pragma hicuda global alloc c [*] [*] copyin
 3
 4   #pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
 5      #pragma hicuda loop_partition over_tblock over_thread
 6      for (i = 0; i < N; i++) {
 7      #pragma hicuda loop_partition over_tblock over_thread
 8      for (j = 0; j < N; j++) {
 9         double sum = 0.0;
10       ...
                                                                   44 / 83
Related Work (II)

     PGI Accelerator Model
         Higher level (directive-based) approach
         Fortran and C are supported
     Code Example:
 1   #pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
 2   {
 3     #pragma acc region
 4     for (j = 0; j < n; j++)
 5         for (i = 0; i < l; i++) {
 6           double sum = 0.0;
 7           for (k = 0; k < m; k++)
 8             sum += b[i + k * l] * c[k + j * m];
 9           a[i + j * l] = sum;
10         }
11   }


                                                                  45 / 83
Our Ongoing Work at that Time: llcl




Extending llc with support for heterogeneous platforms
Compiler + Runtime implementation
→ The Compiler generates runtime code
→ The Runtime handles memory coherence and drives
execution
Compiler optimizations directed by an XML file
More generic/higher level approach - not tied to GPUs




                                                         46 / 83
llcl: Directives


 1   double *a, *b, *c;
 2   ...
 3   #pragma llc context name("mxm") copy_in(a[n * l], b[l * m], 
 4                      c[m * n], l, m, n) copy_out(a[n * l])
 5   {
 6    int i, j, k;
 7    #pragma llc for shared(a, b, c, l, m, n) private(i, j, k)
 8    for (i = 0; i < l; i++)
 9     for (j = 0; j < n; j++) {
10        a[i + j * l] = 0.0;
11        for (k = 0; k < m; k++)
12          a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
13       }
14   }
15   ...




                                                                         47 / 83
llcl: XML Platform Description File



 1   <xml>
 2   <platform name="default">
 3    <region name="compute">
 4      <element name="compute_1" class="loop">
 5        <mutator name="Loop.LoopInterchange"/>
 6         <target device="cuda"/>
 7          <target device="opencl"/>
 8         </element>
 9    </region>
10   </platform>
11   </xml>




                                                   48 / 83
OpenACC Announcement




                       49 / 83
OpenACC Announcement




                       50 / 83
OpenACC: Directives


 1   double *a, *b, *c;
 2   ...
 3   #pragma acc data copy_in(a[n * l],b[l * m],c[m * n], l, m, n)
          copy_out(a[n * l])
 4   {
 5    int i, j, k;
 6    #pragma acc kernels loop private(i, j, k)
 7    for (i = 0; i < l; i++)
 8     for (j = 0; j < n; j++) {
 9        a[i + j * l] = 0.0;
10        for (k = 0; k < m; k++)
11          a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
12       }
13   }
14   ...




                                                                         51 / 83
Related Work




OpenACC Implementations (After Announcement)
    PGI - Released in February 2012
    CAPS - Released in March 2012
    Cray - To be released
    → Access to beta release available
We had a first experimental implementation in January 2012




                                                            52 / 83
accULL: Our OpenACC Implementation



accULL = YaCF + Frangollo
It is a two-layer implementation:
                     Compiler + Runtime Library




                                                  53 / 83
Frangollo: the Runtime


Implementation
    Lightweight
    Standard C++ and STL code
    CUDA component written using the CUDA Driver API
    OpenCL component written using the C OpenCL interface
    Experimental features can be enabled/disabled at compile time

Handles
 1. Device discovery, initialization, . . .
 2. Memory coherence (registered variables)
 3. Kernel execution management (including grid shape)


                                                                    54 / 83
Frangollo Layered Structure




                              55 / 83
Memory Management



 1   // Creates a context to handle memory coherence
 2   ctxt_id = FRG__createContext("name", ...);
 3   ...
 4   // Register a variable within the context
 5   FRG__registerVar(ctxt_id, &ptr, offset, size, constraints, ...);
 6   ...
 7   // Execute the kernel
 8   FRG__kernelLaunch(ctxt_id, "kernel", param_list, ...);
 9   ...
10   // Finish the context and reconcile variables
11   FRG__destroyContext(ctxt_id);




                                                                        56 / 83
Kernel Execution

Loading the kernel
    A context may have from zero to N named kernels associated with it
    The runtime keeps a different version of each kernel per device type
    The version loaded depends on the platform where it is executed

Grid shape
    Grid shape is estimated using the compute intensity (CI):
    CI = N_mem / (Cost × N_flops)
    → E.g., Fermi: 512 GFlop/s double precision, 144 GB/s memory
    bandwidth, Cost ≈ 512/144 ≈ 3.5
    Low CI → favors memory accesses
    High CI → favors computation
    (a toy illustration of this heuristic follows this slide)

                                                                      57 / 83
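A toy illustration of the heuristic described above, not Frangollo's actual code: the threshold, the block sizes and which shape "favors" what are purely illustrative assumptions.

    /* Toy grid-shape heuristic based on compute intensity (CI).
     * Cost is the device flops-to-bandwidth ratio, e.g. Fermi:
     * 512 GFlop/s / 144 GB/s ~= 3.5.  The 1.0 threshold and the
     * returned block sizes are assumptions for illustration.   */
    int pick_block_size(double n_mem, double n_flops, double cost)
    {
        double ci = n_mem / (cost * n_flops);   /* compute intensity */
        if (ci < 1.0)
            return 128;   /* low CI: shape chosen to favor memory accesses */
        return 256;       /* high CI: shape chosen to favor computation    */
    }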
Implementing OpenACC

Putting it all together (a sketch of the generated calls follows this slide)
 1. The compiler driver generates Frangollo interface calls from
    OpenACC directives
    → Converts data region directives into context creation
    → Generates Host and Device synchronization
 2. Extracts the kernel code
 3. Frangollo implements OpenACC API calls
    → acc init, acc malloc/acc free
 4. Implements some optimizations
    → Compiler: loop invariant, skewing, strip-mining, interchange
    → Kernel extraction: divergence reduction, data-dependency
    analysis (basic)
    → Runtime: grid shape estimation, optimized reduction kernels


                                                                     58 / 83
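A rough sketch of this translation scheme. The call names follow the Memory Management slide; the return type, argument lists and kernel name are simplified assumptions, and the elided arguments ("...") are kept as in that slide.

    /* OpenACC source:
     *   #pragma acc data copy(a[0:n])
     *   {
     *     #pragma acc kernels loop
     *     for (i = 0; i < n; i++) a[i] = 2.0 * a[i];
     *   }
     *
     * Sketch of the calls the compiler driver could emit:         */
    void scale_translated(double *a, int n)
    {
        /* data region  -> runtime context                         */
        int ctxt_id = FRG__createContext("scale", ...);

        /* copy(a[0:n]) -> register the variable within the context */
        FRG__registerVar(ctxt_id, &a, 0, n * sizeof(double), ...);

        /* kernels loop -> launch the extracted kernel; the runtime
         * selects the device version and estimates the grid shape  */
        FRG__kernelLaunch(ctxt_id, "scale_kernel_0", ...);

        /* end of data region -> copy back and release resources    */
        FRG__destroyContext(ctxt_id);
    }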
Building an OpenACC Code with accULL




                                       59 / 83
Compliance with the OpenACC Standard

Table: Compliance with the OpenACC 1.0 standard (directives)
                 Construct                 Supported by
                  kernels                PGI, HMPP, accULL
                   loop                  PGI, HMPP, accULL
               kernels loop              PGI, HMPP, accULL
                 parallel                    PGI, HMPP
                  update                    Implemented
        copy, copyin, copyout, . . .     PGI, HMPP, accULL
       pcopy, pcopyin, pcopyout, . . .    PGI, HMPP, accULL
                   async                         PGI
             deviceptr clause                    PGI
                   host                        accULL
                 collapse                      accULL


  Table: Compliance with the OpenACC 1.0 standard (API)
            API Call               Supported by
            acc init             PGI, HMPP, accULL
         acc set device     PGI, HMPP, accULL(no effect)
         acc get device          PGI, HMPP, accULL
                                                               60 / 83
Experimental Platforms

Garoe: A desktop computer
    Intel Core i7 930 processor (2.80 GHz), 4 GB RAM
    2 GPU devices attached:
        Tesla C1060
        Tesla C2050 (Fermi)

Peco: A cluster node
    2 quad-core Intel Xeon E5410 (2.25 GHz) processors,
    24 GB RAM
    Attached Tesla C2050 (Fermi)

Drago: A shared-memory system
    4 Intel Xeon E7 4850 CPUs, 6 GB RAM
    Accelerator platform: Intel OpenCL SDK 1.5, running on the
    CPU                                                          61 / 83
Software



Compiler versions (Pre-OpenACC)
    PGI Compiler Toolkit 12.2 with the PGI Accelerator
    Programming Model 1.3
    hiCUDA: 0.9

Compiler versions (OpenACC)
    PGI Compiler Toolkit 12.6
    CAPS HMPP: 3.2.3




                                                         62 / 83
Matrix Multiplication (M × M) (I)

 1   #pragma acc data name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
 2   {
 3   #pragma acc kernels loop private(i, j) collapse(2)
 4   for (i = 0; i < L; i++)
 5     for (j = 0; j < N; j++)
 6       a[i * L + j] = 0.0;
 7   /* Iterates over blocks */
 8   for (ii = 0; ii < L; ii += tile_size)
 9    for (jj = 0; jj < N; jj += tile_size)
10     for (kk = 0; kk < M; kk += tile_size) {
11      /* Iterates inside a block */
12      #pragma acc kernels loop collapse(2) private(i,j,k)
13      for (j = jj; j < min(N, jj+tile_size); j++)
14       for (i = ii; i < min(L, ii+tile_size); i++)
15        for (k = kk; k < min(M, kk+tile_size); k++)
16         a[i*L+j] += (b[i*L+k] * c[k*M+j]);
17      }
18   }


                                                                        63 / 83
Floating Point Performance for M×M in Peco




                                             64 / 83
M×M (II)

 1   #pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N])
 2   {
 3   #pragma acc kernels loop private(i)
 4   for (i = 0; i < L; i++)
 5    #pragma acc loop private(j)
 6     for (j = 0; j < N; j++)
 7       a[i * L + j] = 0.0;
 8   /* Iterates over blocks */
 9   for (ii = 0; ii < L; ii += tile_size)
10    for (jj = 0; jj < N; jj += tile_size)
11     for (kk = 0; kk < M; kk += tile_size) {
12      /* Iterates inside a block */
13      #pragma acc kernels loop private(j)
14      for (j = jj; j < min(N, jj+tile_size); j++)
15      #pragma acc loop private(i)
16       for (i = ii; i < min(L, ii+tile_size); i++)
17        for (k = kk; k < min(M, kk+tile_size); k++)
18         a[i * L + j] += (b[i * L + k] * c[k * M + j]);
19      }
20   }
                                                            65 / 83
M×M (III)

 1   #pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N] ...)
 2   {
 3   #pragma acc kernels loop private(i) gang(32)
 4   for (i = 0; i < L; i++)
 5    #pragma acc loop private(j) worker(32)
 6     for (j = 0; j < N; j++)
 7       a[i * L + j] = 0.0;
 8   /* Iterates over blocks */
 9   for (ii = 0; ii < L; ii += tile_size)
10    for (jj = 0; jj < N; jj += tile_size)
11     for (kk = 0; kk < M; kk += tile_size) {
12      /* Iterates inside a block */
13      #pragma acc kernels loop private(j) gang(32)
14      for (j = jj; j < min(N, jj+tile_size); j++)
15      #pragma acc loop private(i) worker(32)
16       for (i = ii; i < min(L, ii+tile_size); i++)
17        for (k = kk; k < min(M, kk+tile_size); k++)
18         a[i*L+j] += (b[i*L+k] * c[k*M+j]);
19      }
20   }
                                                                66 / 83
About Grid Shape and Loop Scheduling Clauses


Optimal gang/worker (i.e., grid shape) values vary
    Among OpenACC implementations
    Among platforms (Fermi vs Kepler?, NVIDIA vs ATI?)
    What happens if we implement a non-GPU accelerator?
    Our implementation ignores gang/worker and leaves the decision
    to the runtime
    → The user can influence the decision with an environment variable

    It is possible to enable the gang/worker clauses in our
    implementation
    → Gang/worker feed a strip-mining transformation that forces the
    block/thread configuration (WIP, sketched after this slide)


                                                                    67 / 83
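A sketch of the strip-mining transformation that gang/worker would drive, under the assumption that strips map to gangs (CUDA blocks) and the iterations inside a strip map to workers (CUDA threads); body() is a placeholder for the original loop body.

    extern void body(int i);   /* placeholder for the original loop body */

    /* Strip-mined form of: for (i = 0; i < n; i++) body(i);
     * annotated with gang(32)/worker(32).                     */
    void strip_mined(int n)
    {
        /* strips of 32 iterations -> gangs (CUDA blocks)      */
        for (int ii = 0; ii < n; ii += 32)
            /* iterations inside a strip -> workers (threads)  */
            for (int i = ii; i < ii + 32 && i < n; i++)
                body(i);
    }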
Effect of Varying Gang/Worker




                               68 / 83
OpenMP vs Frangollo+OpenCL in Drago




                                      69 / 83
Needleman-Wunsch (NW)




NW is a nonlinear global optimization method for DNA
sequence alignment
The potential pairs of sequences are organized in a 2D matrix
The method uses dynamic programming to find the optimum
alignment (the recurrence is sketched after this slide)


                                                                70 / 83
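A minimal sketch of the dynamic-programming recurrence behind NW, in serial form; the benchmark versions parallelize over anti-diagonals because each cell depends on its north, west and north-west neighbours. The names (score, similarity, gap_penalty, a, b, la, lb) are placeholders, and row 0 and column 0 are assumed to be pre-initialised with gap penalties.

    /* score[i][j]: best score aligning the first i symbols of
     * sequence a against the first j symbols of sequence b.   */
    void nw_fill(int **score, const int *a, const int *b,
                 int la, int lb, int **similarity, int gap_penalty)
    {
        for (int i = 1; i <= la; i++)
            for (int j = 1; j <= lb; j++) {
                int diag = score[i-1][j-1] + similarity[a[i-1]][b[j-1]];
                int up   = score[i-1][j]   - gap_penalty;
                int left = score[i][j-1]   - gap_penalty;
                int best = diag;
                if (up   > best) best = up;
                if (left > best) best = left;
                score[i][j] = best;
            }
    }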
Performance Comparison of NW in Garoe




                                        71 / 83
Overall Comparison




                     72 / 83
Outline



Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks
Directive-based Programming

Support for accelerators in the OpenMP standard may be
added in the future
→ In the meantime, OpenACC can be used to port codes to
GPUs
→ It is possible to combine OpenACC with OpenMP (a small
sketch follows this slide)
Generated code does not always match native-code
performance
→ But it reduces the development effort while delivering
acceptable performance
accULL is an interesting research-oriented implementation of
OpenACC
→ First non-commercial OpenACC implementation
→ It is a flexible framework to explore optimizations, new
platforms, . . .

                                                               74 / 83
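A small illustration of the combination mentioned above, not taken from the thesis codes: OpenMP keeps driving the multicore part while an OpenACC region offloads a data-parallel loop. x, y, n and init_value() are placeholders.

    extern double init_value(int i);   /* placeholder initialisation */

    void combined(double *x, double *y, int n)
    {
        /* Multicore part: plain OpenMP */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] = init_value(i);

        /* Accelerator part: the hot loop offloaded with OpenACC */
        #pragma acc data copyin(x[0:n]) copyout(y[0:n])
        {
            #pragma acc kernels loop
            for (int i = 0; i < n; i++)
                y[i] = 2.0 * x[i];
        }
    }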
Outline



Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks
Back to the Drawing Board?




                             76 / 83
accULL Still Has Some Opportunities



Study support for multiple devices (either transparently or in
OpenACC)
Design an MPI component for the runtime
Integration with other projects
Improve the performance of the generated code (e.g., using
polyhedral models)
Enhance the support for Extrae/Paraver (experimental tracing
already built-in)




                                                                 77 / 83
Re-use our Know-how




Integrate OpenACC and OMPSs?
   The current OMPSs implementation does not automatically
   generate kernel code
   Integrating OpenACC syntax within tasks would enable
   automatic code generation
   Improve portability across accelerator platforms
   Reduce development effort




                                                         78 / 83
Contributions


Reyes, R. and de Sande, F. Automatic code generation for GPUs in
llc. The Journal of Supercomputing 58, 3 (Mar. 2011), pp.
349-356.
Reyes, R. and de Sande, F. Optimization strategies in different CUDA
architectures using llCoMP. Microprocessors and Microsystems -
Embedded Hardware Design 36, 2 (Mar. 2012), pp. 78-87.
Reyes, R., Fumero, J. J., López, I. and de Sande, F. accULL: an
OpenACC implementation with CUDA and OpenCL support. In
Euro-Par 2012 Parallel Processing - 18th International Conference,
vol. 7484 of LNCS, pp. 871-882.
Reyes, R., Fumero, J. J., López, I. and de Sande, F. A Preliminary
Evaluation of OpenACC Implementations. The Journal of
Supercomputing (In Press)



                                                                     79 / 83
Other contributions
    accULL has been released as an Open Source Project
    → http://cap.pcg.ull.es/accull
    accULL is currently being evaluated by VectorFabrics
    We provided feedback to CAPS, which appears to have been
    incorporated into their current version
    We have been contacted by members of the OpenACC committee
    Two HPC-Europa2 visits by our group's master students




                                                                80 / 83
Acknowledgements
   Spanish MEC
   Plan Nacional de I+D+i, contracts TIN2008-06570-C04-03
   and TIN2011-24598
   Canary Islands Government ACIISI
   Contract SolSubC200801000285
   TEXT Project (FP7-261580)
   HPC-EUROPA2 (project number: 228398)
   Universitat Jaume I de Castellón
   Universidad de La Laguna
   All members of GCAP




                                                            81 / 83
Thank you for your attention!




                                82 / 83
Directive-based approach to heterogeneous
computing

  Ruyman Reyes Castro


  High Performance Computing Group
  University of La Laguna


  December 19, 2012
