Directive-based approach to heterogeneous
computing

  Ruyman Reyes Castro


  High Performance Computing Group
  University of La Laguna


  December 19, 2012
TOP500 Performance Development List




                                      2 / 83
Applications Used in HPC Centers

Usage of HECToR by Area of Expertise




                                        3 / 83
Real HPC Users



Most Used Applications in HECToR
        Application     % of total jobs   Language   Prog. Model
          VASP               17%           Fortran   MPI+OpenMP
          CP2K                7%           Fortran   MPI+OpenMP
    Unified Model (UM)         7%           Fortran       MPI
        GROMACS               4%            C++      MPI+OpenMP


   Large code-bases
   Complex algorithms implemented
   Mixture of different Fortran flavours




                                                                   4 / 83
Knowledge of Programming
Survey conducted in the Swiss National Supercomputing Centre
(2011)




                                                               5 / 83
Are application developers using
       the proper tools?




                                   6 / 83
Complexity Arises (I)




                        7 / 83
Directives: Enhancing Legacy Code (I)



     OpenMP Example
 1    ...
 2   #pragma omp parallel for default(shared) private(i, j)
          firstprivate(rmass, dt)
 3    for (i = 0; i < np; i++) {
 4       for (j = 0; j < nd; j++) {
 5         pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5 * dt*dt*a[i][j];
 6         vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
 7         a[i][j] = f[i][j] * rmass;
 8       }
 9     }
10    ...




                                                                         8 / 83
Complexity Arises (II)




                         9 / 83
Re-compiling the code is no longer enough to
    continue improving the performance




                                               10 / 83
Porting Applications To New Architectures

     Programming CUDA (Host Code)
 1   float a_host[n], b_host[n];
 2   float *a, *b;     // device pointers
 3   float c = 2.0f;   // scalar kernel argument (example value)
 4   // Allocate
 5   cudaMalloc((void**)&a, n * sizeof(float));
 6   cudaMalloc((void**)&b, n * sizeof(float));
 7   // Transfer
 8   cudaMemcpy(a, a_host, n * sizeof(float), cudaMemcpyHostToDevice);
 9   cudaMemcpy(b, b_host, n * sizeof(float), cudaMemcpyHostToDevice);
10   // Define grid shape
11   int blocks = 100;
12   int threads = 128;
13   // Execute
14   kernel<<<blocks,threads>>>(a, b, c);
15   // Copy back the result (the kernel writes b)
16   cudaMemcpy(b_host, b, n * sizeof(float), cudaMemcpyDeviceToHost);
17   // Clean
18   cudaFree(a);
19   cudaFree(b);

                                                                         11 / 83
Porting Applications To New Architectures



     Programming CUDA (Kernel Source)
 1   // Kernel code
 2   __global__ void kernel(float *a, float *b, float c)
 3   {
 4    // Get the index of this thread
 5    unsigned int index = (blockIdx.x * blockDim.x) + threadIdx.x;
 6    // Do the computation
 7    b[index] = a[index] * c;
 8    // Wait for all threads in the block to finish
 9    __syncthreads();
10   }




                                                                      12 / 83
Programmers need faster ways to migrate existing code




                                                        13 / 83
Why not use directive-based approaches for
 these new heterogeneous architectures?




                                             14 / 83
Overview of Our Work

    We can’t solve problems by using the same kind of
    thinking we used when we created them.
                                                    Albert Einstein

The field is undergoing rapid changes: we have to adapt to
them
 1. Hybrid MPI+OpenMP (2008)
    → Usage of directives in cluster environments
 2. OpenMP extensions (2009)
    → Extensions of OpenMP/La Laguna C (llc) for
    heterogeneous architectures
 3. Directives for accelerators (2011)
    → Specific accelerator-oriented directives
    → OpenACC (December 2011)

                                                                      15 / 83
Outline


Hybrid MPI+OpenMP
   llc and llCoMP
   Hybrid llCoMP
   Computational Results
   Technical Drawbacks

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks
La Laguna C: llc


What is
    Directive-based approach to distributed memory environments
    OpenMP compatible
    Additional set of extensions to address particular features
    Implemented FORALL loops, Pipelines, Farms . . .




Reference
[48] Dorta, A. J. Extensión del modelo de OpenMP a memoria
distribuida. PhD Thesis, Universidad de La Laguna, December 2008.



                                                                    17 / 83
Chronological Perspective (Late 2008)



Cores per Socket - System Share   Accelerator - System Share




                                                               18 / 83
A Hybrid OpenMP+MPI Implementation

Same llc code, extended llCoMP implementation
    Directives are replaced by a set of parallel patterns
    Improved performance on multicore systems
    → Better usage of inter-core memories (i.e., cache)
    → Lower memory requirements than pure MPI with replicated
    memory

Translation (sketched after this slide)




                                                               19 / 83
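A minimal sketch of the kind of hybrid translation described above. This is not the actual llCoMP output: it assumes a plain block distribution of a FORALL loop across MPI processes, OpenMP threads inside each process, n a multiple of the number of processes, and a replicated result vector gathered with MPI_Allgather. The names (forall_hybrid, v, n) are illustrative only.

    #include <mpi.h>

    /* Hypothetical hybrid translation of a FORALL loop:
     *   #pragma omp parallel for
     *   for (i = 0; i < n; i++) v[i] = 2.0 * v[i];           */
    void forall_hybrid(double *v, int n)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = n / size;       /* block size per process  */
        int start = rank * chunk;   /* first iteration owned   */

        /* Each process executes its block with OpenMP threads */
        #pragma omp parallel for
        for (int i = start; i < start + chunk; i++)
            v[i] = 2.0 * v[i];

        /* Exchange the computed blocks so that every process
         * ends up with the whole (replicated) vector          */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      v, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }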
llc Code Example

     llc Implementation of the Mandelbrot Set Computation
 1    ...
 2   #pragma omp parallel for default(shared) reduction(+:numoutside)
          private(i, j, ztemp, z) shared(nt, c)
 3   #pragma llc reduction_type (int)
 4       for(i = 0; i < npoints; i++) {
 5         z.creal = c[i].creal; z.cimag = c[i].cimag;
 6         for (j = 0; j < MAXITER; j++) {
 7           ztemp = (z.creal*z.creal) - (z.cimag*z.cimag)+c[i].creal;
 8           z.cimag = z.creal * z.cimag * 2 + c[i].cimag;
 9           z.creal = ztemp;
10           if (z.creal * z.creal + z.cimag * z.cimag > THRESOLD) {
11             numoutside++;
12             break;
13           }
14         }
15    ...


                                                                         20 / 83
Hybrid MPI+OpenMP performance




                                21 / 83
Technical Drawbacks




llCoMP
   The original source-to-source (StS) design of llCoMP was not
   flexible enough
   Traditional two-pass compiler
   Excessive effort to implement new features
   More advanced features were needed to implement GPU code
   generation




                                                              22 / 83
Back to the Drawing Board




                            23 / 83
Outline


Hybrid MPI+OpenMP

OpenMP-to-GPU
  Related Work
  Yet Another Compiler Framework (YaCF)
  Computational Results
  Technical Drawbacks

Directives for Accelerators

Conclusions

Future Work and Final Remarks
Chronological Perspective (Late 2009)



Cores per Socket - System Share   Accelerator - System Share




                                                               25 / 83
Related Work


Other OpenMP-to-GPU translators: OpenMPC
[82] Lee, S., and Eigenmann, R. OpenMPC: Extended OpenMP programming
and tuning for GPUs. In SC’10: Proceedings of the 2010 ACM/IEEE conference
on Supercomputing. IEEE Computer Society, pp. 1–11.

Other Compiler Frameworks: Cetus, LLVM
[84] Lee, S., Johnson., T. A. and Eigenmann, R. Cetus – an extensible compiler
infrastructure for source-to-source transformation. In Languages and Compilers
for Parallel Computing, 16th Intl. Workshop, College Station, TX, USA, volume
2958 of LNCS(2003), pp. 539-553.

[81] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong
program analysis & transformation. In Proceedings of the international
symposium on Code generation and optimization: feedback-directed and runtime
optimization, CGO’04. IEEE Computer Society, pp. 75–47.



                                                                                 26 / 83
YaCF: Yet Another Compiler Framework



Application programmer writes llc code
    Focus on data and algorithm
    Architecture independent
    Only needs to specify where the parallelism is

System engineer writes template code
    Focus on non-functional code
    Can reuse code from different patterns (i.e., inheritance)




                                                              27 / 83
YaCF Software Architecture




                             28 / 83
Main Software Design Patterns




Implementing search and replacement in the IR
    Filter: looks for a specific pattern in the IR
    → E.g., finds a pragma omp parallel construct
    Mutator: looks for a node and transforms the IR
    → E.g., applies loop transformations (nesting, flattening, . . . )
    → E.g., replaces a pragma omp for with a CUDA kernel call
    Can be composed to solve more complex problems




                                                                      29 / 83
Dynamic Language and Tools

Key Idea: Features Should Require Only a Few Lines of Code




                                                             30 / 83
Template Patterns



    Ease back-end implementation
1   <%def name="initialization(var_list, prefix = '', suffix = '')">
2   %for var in var_list:
3     cudaMalloc((void **) (&${prefix}${var.name}${suffix}),
4                     ${var.numelems} * sizeof(${var.type}));
5     cudaMemcpy(${prefix}${var.name}${suffix}, ${var.name},
6                      ${var.numelems} * sizeof(${var.type}),
7                      cudaMemcpyHostToDevice);
8   %endfor
9   </%def>




                                                                       31 / 83
CUDA Back-end


    Generates a CUDA kernel and memory transfers from the
            information obtained during the analysis

Supported syntax
    parallel, for and their condensed form are implemented
    New directives to support manual optimizations (e.g.,
    interchange)
    Syntax taken from an OpenMP proposal by BSC, UJI and
    others (#pragma omp target)
    copy_in, copy_out clauses enable users to provide memory
    transfer information
    Generated code is human-readable

                                                                32 / 83
Example



     Update Loop from the Molecular Dynamics Code
 1    ...
 2   #pragma omp target device(cuda) copy(pos, vel, f) copy_out(a)
 3   #pragma omp parallel for default(shared) private(i, j)
          firstprivate(rmass, dt)
 4     for (i = 0; i < np; i++) {
 5       for (j = 0; j < nd; j++) {
 6         pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
 7         vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
 8         a[i][j] = f[i][j] * rmass;
 9       }
10     }




                                                                       33 / 83
Translation process




                      34 / 83
The Jacobi Iterative Method

 1   error = 0.0;
 2
 3
 4   {
 5
 6       for (i = 0; i < m; i++)
 7         for (j = 0; j < n; j++)
 8           uold[i][j] = u[i][j];
 9
10       for (i = 0; i < (m - 2); i++) {
11         for (j = 0; j < (n - 2); j++) {
12           resid = ...
13           error += resid * resid;
14         }
15       }
16   }
17   k++;
18   error = sqrt(error) / (double) (n * m);


                                               35 / 83
Jacobi OpenMP Source

 1   error = 0.0;
 2
 3   #pragma omp parallel shared(uold, u, ...) private(i, j, resid)
 4   {
 5     #pragma omp for
 6     for (i = 0; i < m; i++)
 7       for (j = 0; j < n; j++)
 8         uold[i][j] = u[i][j];
 9     #pragma omp for reduction(+:error)
10     for (i = 0; i < (m - 2); i++) {
11       for (j = 0; j < (n - 2); j++) {
12         resid = ...
13         error += resid * resid;
14       }
15     }
16   }
17   k++;
18   error = sqrt(error) / (double) (n * m);


                                                                      36 / 83
Jacobi llCoMP v1

 1   error = 0.0;
 2   #pragma omp target device(cuda)
 3   #pragma omp parallel shared(uold, u, ...) private(i, j, resid)
 4   {
 5     #pragma omp for
 6     for (i = 0; i < m; i++)
 7       for (j = 0; j < n; j++)
 8         uold[i][j] = u[i][j];
 9     #pragma omp for reduction(+:error)
10     for (i = 0; i < (m - 2); i++) {
11       for (j = 0; j < (n - 2); j++) {
12         resid = ...
13         error += resid * resid;
14       }
15     }
16   }
17   k++;
18   error = sqrt(error) / (double) (n * m);


                                                                      37 / 83
Jacobi llCoMP v2

 1   error = 0.0;
 2   #pragma omp target device(cuda) copy_in(u, f) copy_out(f)
 3   #pragma omp parallel shared(uold, u, ...) private(i, j, resid)
 4   {
 5     #pragma omp for
 6     for (i = 0; i < m; i++)
 7       for (j = 0; j < n; j++)
 8         uold[i][j] = u[i][j];
 9     #pragma omp for reduction(+:error)
10     for (i = 0; i < (m - 2); i++) {
11       for (j = 0; j < (n - 2); j++) {
12         resid = ...
13         error += resid * resid;
14       }
15     }
16   }
17   k++;
18   error = sqrt(error) / (double) (n * m);


                                                                      38 / 83
Jacobi Iterative Method




                          39 / 83
Technical Drawbacks




Limited to Compile-time Optimizations
    Some features require runtime information
    → Kernel grid configuration
    Orphaned directives were not possible
    → Would require an inter-procedural analysis module
    Some templates were too complex
    → And would need to be replicated to support OpenCL




                                                          40 / 83
Back to the Drawing Board




                            41 / 83
Outline


Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators
   Related Work
   OpenACC
   Accelerator ULL (accULL)
   Results

Conclusions

Future Work and Final Remarks
Chronological Perspective (2011)



Cores per Socket - System Share   Accelerator - System Share




                                                               43 / 83
Related Work (I)

     hiCUDA
         Translates each directive into a CUDA call
         It is able to use the GPU Shared Memory
         Only works with NVIDIA devices
         The programmer still needs to know hardware details

     Code Example:
 1   ...
 2   #pragma hicuda global alloc c [*] [*] copyin
 3
 4   #pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
 5      #pragma hicuda loop_partition over_tblock over_thread
 6      for (i = 0; i < N; i++) {
 7      #pragma hicuda loop_partition over_tblock over_thread
 8      for (j = 0; j < N; j++) {
 9         double sum = 0.0;
10       ...
                                                                   44 / 83
Related Work (II)

     PGI Accelerator Model
         Higher level (directive-based) approach
         Fortran and C are supported
     Code Example:
 1   #pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
 2   {
 3     #pragma acc region
 4     for (j = 0; j < n; j++)
 5         for (i = 0; i < l; i++) {
 6           double sum = 0.0;
 7           for (k = 0; k < m; k++)
 8             sum += b[i + k * l] * c[k + j * m];
 9           a[i + j * l] = sum;
10         }
11   }


                                                                  45 / 83
Our Ongoing Work at that Time: llcl




Extending llc with support for heterogeneous platforms
Compiler + Runtime implementation
→ The Compiler generates runtime code
→ The Runtime handles memory coherence and drives
execution
Compiler optimizations directed by an XML file
More generic/higher level approach - not tied to GPUs




                                                         46 / 83
llcl: Directives


 1   double *a, *b, *c;
 2   ...
 3   #pragma llc context name("mxm") copy_in(a[n * l], b[l * m], 
 4                      c[m * n], l, m, n) copy_out(a[n * l])
 5   {
 6    int i, j, k;
 7    #pragma llc for shared(a, b, c, l, m, n) private(i, j, k)
 8    for (i = 0; i < l; i++)
 9     for (j = 0; j < n; j++) {
10        a[i + j * l] = 0.0;
11        for (k = 0; k < m; k++)
12          a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
13       }
14   }
15   ...




                                                                         47 / 83
llcl: XML Platform Description File



 1   <xml>
 2   <platform name="default">
 3    <region name="compute">
 4      <element name="compute_1" class="loop">
 5        <mutator name="Loop.LoopInterchange"/>
 6         <target device="cuda"/>
 7          <target device="opencl"/>
 8         </element>
 9    </region>
10   </platform>
11   </xml>




                                                   48 / 83
OpenACC Announcement




                       49 / 83
OpenACC Announcement




                       50 / 83
OpenACC: Directives


 1   double *a, *b, *c;
 2   ...
 3   #pragma acc data copy_in(a[n * l],b[l * m],c[m * n], l, m, n)
          copy_out(a[n * l])
 4   {
 5    int i, j, k;
 6    #pragma acc kernels loop private(i, j, k)
 7    for (i = 0; i < l; i++)
 8     for (j = 0; j < n; j++) {
 9        a[i + j * l] = 0.0;
10        for (k = 0; k < m; k++)
11          a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
12       }
13   }
14   ...




                                                                         51 / 83
Related Work




OpenACC Implementations (After Announcement)
    PGI - Released in February 2012
    CAPS - Released in March 2012
    Cray - To be released
    → Access to beta release available
We had a first experimental implementation in January 2012




                                                            52 / 83
accULL: Our OpenACC Implementation



accULL = YaCF + Frangollo
It is a two-layer implementation:
                     Compiler + Runtime Library




                                                  53 / 83
Frangollo: the Runtime


Implementation
    Lightweight
    Standard C++ and STL code
    CUDA component written using the CUDA Driver API
    OpenCL component written using the C OpenCL interface
    Experimental features can be enabled/disabled at compile time

Handles
 1. Device discovery, initialization, . . .
 2. Memory coherence (registered variables)
 3. Kernel execution management (including grid shape)


                                                                    54 / 83
Frangollo Layered Structure




                              55 / 83
Memory Management



 1   // Creates a context to handle memory coherence
 2   ctxt_id = FRG__createContext("name", ...);
 3   ...
 4   // Register a variable within the context
 5   FRG__registerVar(ctxt_id, &ptr, offset, size, constraints, ...);
 6   ...
 7   // Execute the kernel
 8   FRG__kernelLaunch(ctxt_id, "kernel", param_list, ...);
 9   ...
10   // Finish the context and reconcile variables
11   FRG__destroyContext(ctxt_id);




                                                                        56 / 83
Kernel Execution

Loading the kernel
    A context may have from zero to N named kernels associated with it
    The runtime keeps a different version of each kernel per device type
    The version loaded depends on the platform where it is executed

Grid shape
    Grid shape is estimated using the compute intensity (CI):
    CI = N_mem / (Cost × N_flops)
    → E.g., Fermi: 512 GFlop/s double precision, 144 GB/s memory
    bandwidth, Cost ≈ 512/144 ≈ 3.5
    Low CI → favors memory accesses
    High CI → favors computation
    (a toy illustration of this heuristic follows this slide)

                                                                      57 / 83
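A toy illustration of the heuristic described above, not Frangollo's actual code: the threshold, the block sizes and which shape "favors" what are purely illustrative assumptions.

    /* Toy grid-shape heuristic based on compute intensity (CI).
     * Cost is the device flops-to-bandwidth ratio, e.g. Fermi:
     * 512 GFlop/s / 144 GB/s ~= 3.5.  The 1.0 threshold and the
     * returned block sizes are assumptions for illustration.   */
    int pick_block_size(double n_mem, double n_flops, double cost)
    {
        double ci = n_mem / (cost * n_flops);   /* compute intensity */
        if (ci < 1.0)
            return 128;   /* low CI: shape chosen to favor memory accesses */
        return 256;       /* high CI: shape chosen to favor computation    */
    }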
Implementing OpenACC

Putting it all together (a sketch of the generated calls follows this slide)
 1. The compiler driver generates Frangollo interface calls from
    OpenACC directives
    → Converts data region directives into context creation
    → Generates Host and Device synchronization
 2. Extracts the kernel code
 3. Frangollo implements OpenACC API calls
    → acc init, acc malloc/acc free
 4. Implements some optimizations
    → Compiler: loop invariant, skewing, strip-mining, interchange
    → Kernel extraction: divergence reduction, data-dependency
    analysis (basic)
    → Runtime: grid shape estimation, optimized reduction kernels


                                                                     58 / 83
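A rough sketch of this translation scheme. The call names follow the Memory Management slide; the return type, argument lists and kernel name are simplified assumptions, and the elided arguments ("...") are kept as in that slide.

    /* OpenACC source:
     *   #pragma acc data copy(a[0:n])
     *   {
     *     #pragma acc kernels loop
     *     for (i = 0; i < n; i++) a[i] = 2.0 * a[i];
     *   }
     *
     * Sketch of the calls the compiler driver could emit:         */
    void scale_translated(double *a, int n)
    {
        /* data region  -> runtime context                         */
        int ctxt_id = FRG__createContext("scale", ...);

        /* copy(a[0:n]) -> register the variable within the context */
        FRG__registerVar(ctxt_id, &a, 0, n * sizeof(double), ...);

        /* kernels loop -> launch the extracted kernel; the runtime
         * selects the device version and estimates the grid shape  */
        FRG__kernelLaunch(ctxt_id, "scale_kernel_0", ...);

        /* end of data region -> copy back and release resources    */
        FRG__destroyContext(ctxt_id);
    }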
Building an OpenACC Code with accULL




                                       59 / 83
Compliance with the OpenACC Standard

Table: Compliance with the OpenACC 1.0 standard (directives)
                 Construct                 Supported by
                  kernels                PGI, HMPP, accULL
                   loop                  PGI, HMPP, accULL
               kernels loop              PGI, HMPP, accULL
                 parallel                    PGI, HMPP
                  update                    Implemented
        copy, copyin, copyout, . . .     PGI, HMPP, accULL
       pcopy, pcopyin, pcopyout, . . .    PGI, HMPP, accULL
                   async                         PGI
             deviceptr clause                    PGI
                   host                        accULL
                 collapse                      accULL


  Table: Compliance with the OpenACC 1.0 standard (API)
            API Call               Supported by
            acc init             PGI, HMPP, accULL
         acc set device     PGI, HMPP, accULL(no effect)
         acc get device          PGI, HMPP, accULL
                                                               60 / 83
Experimental Platforms

Garoe: A desktop computer
    Intel Core i7 930 processor (2.80 GHz), 4 GB RAM
    2 GPU devices attached:
        Tesla C1060
        Tesla C2050 (Fermi)

Peco: A cluster node
    2 quad-core Intel Xeon E5410 (2.25 GHz) processors,
    24 GB RAM
    Attached Tesla C2050 (Fermi)

Drago: A shared-memory system
    4 Intel Xeon E7 4850 CPUs, 6 GB RAM
    Accelerator platform: Intel OpenCL SDK 1.5, running on the
    CPU                                                          61 / 83
Software



Compiler versions (Pre-OpenACC)
    PGI Compiler Toolkit 12.2 with the PGI Accelerator
    Programming Model 1.3
    hiCUDA: 0.9

Compiler versions (OpenACC)
    PGI Compiler Toolkit 12.6
    CAPS HMPP: 3.2.3




                                                         62 / 83
Matrix Multiplication (M × M) (I)

 1   #pragma acc data name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
 2   {
 3   #pragma acc kernels loop private(i, j) collapse(2)
 4   for (i = 0; i < L; i++)
 5     for (j = 0; j < N; j++)
 6       a[i * L + j] = 0.0;
 7   /* Iterates over blocks */
 8   for (ii = 0; ii < L; ii += tile_size)
 9    for (jj = 0; jj < N; jj += tile_size)
10     for (kk = 0; kk < M; kk += tile_size) {
11      /* Iterates inside a block */
12      #pragma acc kernels loop collapse(2) private(i,j,k)
13      for (j = jj; j < min(N, jj+tile_size); j++)
14       for (i = ii; i < min(L, ii+tile_size); i++)
15        for (k = kk; k < min(M, kk+tile_size); k++)
16         a[i*L+j] += (b[i*L+k] * c[k*M+j]);
17      }
18   }


                                                                        63 / 83
Floating Point Performance for M×M in Peco




                                             64 / 83
M×M (II)

 1   #pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N])
 2   {
 3   #pragma acc kernels loop private(i)
 4   for (i = 0; i < L; i++)
 5    #pragma acc loop private(j)
 6     for (j = 0; j < N; j++)
 7       a[i * L + j] = 0.0;
 8   /* Iterates over blocks */
 9   for (ii = 0; ii < L; ii += tile_size)
10    for (jj = 0; jj < N; jj += tile_size)
11     for (kk = 0; kk < M; kk += tile_size) {
12      /* Iterates inside a block */
13      #pragma acc kernels loop private(j)
14      for (j = jj; j < min(N, jj+tile_size); j++)
15      #pragma acc loop private(i)
16       for (i = ii; i < min(L, ii+tile_size); i++)
17        for (k = kk; k < min(M, kk+tile_size); k++)
18         a[i * L + j] += (b[i * L + k] * c[k * M + j]);
19      }
20   }
                                                            65 / 83
M×M (III)

 1   #pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N] ...)
 2   {
 3   #pragma acc kernels loop private(i) gang(32)
 4   for (i = 0; i < L; i++)
 5    #pragma acc loop private(j) worker(32)
 6     for (j = 0; j < N; j++)
 7       a[i * L + j] = 0.0;
 8   /* Iterates over blocks */
 9   for (ii = 0; ii < L; ii += tile_size)
10    for (jj = 0; jj < N; jj += tile_size)
11     for (kk = 0; kk < M; kk += tile_size) {
12      /* Iterates inside a block */
13      #pragma acc kernels loop private(j) gang(32)
14      for (j = jj; j < min(N, jj+tile_size); j++)
15      #pragma acc loop private(i) worker(32)
16       for (i = ii; i < min(L, ii+tile_size); i++)
17        for (k = kk; k < min(M, kk+tile_size); k++)
18         a[i*L+j] += (b[i*L+k] * c[k*M+j]);
19      }
20   }
                                                                66 / 83
About Grid Shape and Loop Scheduling Clauses


Optimal gang/worker (i.e., grid shape) values vary
    Among OpenACC implementations
    Among platforms (Fermi vs Kepler?, NVIDIA vs ATI?)
    What happens if we implement a non-GPU accelerator?
    Our implementation ignores gang/worker and leaves the decision
    to the runtime
    → The user can influence the decision with an environment variable

    It is possible to enable the gang/worker clauses in our
    implementation
    → Gang/worker feed a strip-mining transformation that forces the
    block/thread configuration (WIP, sketched after this slide)


                                                                    67 / 83
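A sketch of the strip-mining transformation that gang/worker would drive, under the assumption that strips map to gangs (CUDA blocks) and the iterations inside a strip map to workers (CUDA threads); body() is a placeholder for the original loop body.

    extern void body(int i);   /* placeholder for the original loop body */

    /* Strip-mined form of: for (i = 0; i < n; i++) body(i);
     * annotated with gang(32)/worker(32).                     */
    void strip_mined(int n)
    {
        /* strips of 32 iterations -> gangs (CUDA blocks)      */
        for (int ii = 0; ii < n; ii += 32)
            /* iterations inside a strip -> workers (threads)  */
            for (int i = ii; i < ii + 32 && i < n; i++)
                body(i);
    }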
Effect of Varying Gang/Worker




                               68 / 83
OpenMP vs Frangollo+OpenCL in Drago




                                      69 / 83
Needleman-Wunsch (NW)




NW is a nonlinear global optimization method for DNA
sequence alignment
The potential pairs of sequences are organized in a 2D matrix
The method uses dynamic programming to find the optimum
alignment (the recurrence is sketched after this slide)


                                                                70 / 83
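A minimal sketch of the dynamic-programming recurrence behind NW, in serial form; the benchmark versions parallelize over anti-diagonals because each cell depends on its north, west and north-west neighbours. The names (score, similarity, gap_penalty, a, b, la, lb) are placeholders, and row 0 and column 0 are assumed to be pre-initialised with gap penalties.

    /* score[i][j]: best score aligning the first i symbols of
     * sequence a against the first j symbols of sequence b.   */
    void nw_fill(int **score, const int *a, const int *b,
                 int la, int lb, int **similarity, int gap_penalty)
    {
        for (int i = 1; i <= la; i++)
            for (int j = 1; j <= lb; j++) {
                int diag = score[i-1][j-1] + similarity[a[i-1]][b[j-1]];
                int up   = score[i-1][j]   - gap_penalty;
                int left = score[i][j-1]   - gap_penalty;
                int best = diag;
                if (up   > best) best = up;
                if (left > best) best = left;
                score[i][j] = best;
            }
    }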
Performance Comparison of NW in Garoe




                                        71 / 83
Overall Comparison




                     72 / 83
Outline



Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks
Directive-based Programming

Support for accelerators in the OpenMP standard may be
added in the future
→ In the meantime, OpenACC can be used to port codes to
GPUs
→ It is possible to combine OpenACC with OpenMP (a small
sketch follows this slide)
Generated code does not always match native-code
performance
→ But it reduces the development effort while delivering
acceptable performance
accULL is an interesting research-oriented implementation of
OpenACC
→ First non-commercial OpenACC implementation
→ It is a flexible framework to explore optimizations, new
platforms, . . .

                                                               74 / 83
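A small illustration of the combination mentioned above, not taken from the thesis codes: OpenMP keeps driving the multicore part while an OpenACC region offloads a data-parallel loop. x, y, n and init_value() are placeholders.

    extern double init_value(int i);   /* placeholder initialisation */

    void combined(double *x, double *y, int n)
    {
        /* Multicore part: plain OpenMP */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] = init_value(i);

        /* Accelerator part: the hot loop offloaded with OpenACC */
        #pragma acc data copyin(x[0:n]) copyout(y[0:n])
        {
            #pragma acc kernels loop
            for (int i = 0; i < n; i++)
                y[i] = 2.0 * x[i];
        }
    }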
Outline



Hybrid MPI+OpenMP

OpenMP-to-GPU

Directives for Accelerators

Conclusions

Future Work and Final Remarks
Back to the Drawing Board?




                             76 / 83
accULL Still Has Some Opportunities



Study support for multiple devices (either transparently or in
OpenACC)
Design an MPI component for the runtime
Integration with other projects
Improve the performance of the generated code (e.g., using
polyhedral models)
Enhance the support for Extrae/Paraver (experimental tracing
already built-in)




                                                                 77 / 83
Re-use our Know-how




Integrate OpenACC and OMPSs?
   The current OMPSs implementation does not automatically
   generate kernel code
   Integrating OpenACC syntax within tasks would enable
   automatic code generation
   Improve portability across accelerator platforms
   Reduce development effort




                                                         78 / 83
Contributions


Reyes, R. and de Sande, F. Automatic code generation for GPUs in
llc. The Journal of Supercomputing 58, 3 (Mar. 2011), pp.
349-356.
Reyes, R. and de Sande, F. Optimization strategies in different CUDA
architectures using llCoMP. Microprocessors and Microsystems -
Embedded Hardware Design 36, 2 (Mar. 2012), pp. 78-87.
Reyes, R., Fumero, J. J., López, I. and de Sande, F. accULL: an
OpenACC implementation with CUDA and OpenCL support. In
Euro-Par 2012 Parallel Processing - 18th International Conference,
vol. 7484 of LNCS, pp. 871-882.
Reyes, R., Fumero, J. J., López, I. and de Sande, F. A Preliminary
Evaluation of OpenACC Implementations. The Journal of
Supercomputing (In Press)



                                                                     79 / 83
Other contributions
    accULL has been released as an Open Source Project
    → http://cap.pcg.ull.es/accull
    accULL is currently being evaluated by VectorFabrics
    We provided feedback to CAPS, which appears to have been
    incorporated into their current version
    We have been contacted by members of the OpenACC committee
    Two HPC-Europa2 visits by our group's master students




                                                                80 / 83
Acknowledgements
   Spanish MEC
   Plan Nacional de I+D+i, contracts TIN2008-06570-C04-03
   and TIN2011-24598
   Canary Islands Government ACIISI
   Contract SolSubC200801000285
   TEXT Project (FP7-261580)
   HPC-EUROPA2 (project number: 228398)
   Universitat Jaume I de Castellón
   Universidad de La Laguna
   All members of GCAP




                                                            81 / 83
Thank you for your attention!




                                82 / 83
Directive-based approach to heterogeneous
computing

  Ruyman Reyes Castro


  High Performance Computing Group
  University of La Laguna


  December 19, 2012
