
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs

This talk was presented at the DOE Centers of Excellence Performance Portability Workshop in August 2017. In this talk I explore the current status of four OpenMP 4.5 compilers for NVIDIA GPUs and CPUs from the perspective of performance portability, both between compilers and between the GPU and the CPU.


  1. Title: Early Results of OpenMP 4.5 Portability on NVIDIA GPUs. Jeff Larkin, DOE Performance Portability Workshop, August 2017.
  2. Background: Since the last performance portability workshop, several OpenMP implementations for NVIDIA GPUs have emerged or matured. As of August 2017, can these implementations deliver on performance, portability, and performance portability?
     • Will OpenMP Target code be portable between compilers?
     • Will OpenMP Target code be portable with the host?
     I will compare results using four compilers: CLANG, Cray, GCC, and XL.
  3. OpenMP in CLANG: A multi-vendor effort to implement OpenMP, including offloading, in Clang. The runtime is based on the runtime open-sourced by Intel. Current status: much improved since last year. Version used: clang/20170629. Compiler options: -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=$CUDA_HOME
  4. OpenMP in Cray: Due to its experience with OpenACC, Cray's OpenMP 4.x compiler was the first to market for NVIDIA GPUs. Observation: it does not adhere to OpenMP as strictly as the others. Version used: 8.5.5. Compiler options: none required. Note: Cray performance results were obtained on an X86 + P100 system, unlike the other compilers; only GPU performance is being compared.
  5. OpenMP in GCC: The open-source GCC compiler with support for OpenMP offloading to NVIDIA GPUs. Its runtime is also based on the runtime open-sourced by Intel. Current status: mature on the CPU, very immature on the GPU. Version used: 7.1.1 20170718 (experimental). Compiler options: -O3 -fopenmp -foffload="-lm"
  6. OpenMP in XL: IBM's compiler suite, which now includes offloading to NVIDIA GPUs. Same(ish) runtime as CLANG, but compilation by IBM's compiler. Version used: xl/20170727-beta. Compiler options: -O3 -qsmp -qoffload
  7. Case Study: Jacobi Iteration
  8. Example: Jacobi Iteration. Iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of the neighboring points. A common, useful algorithm. Example: solve the Laplace equation in 2D, $\nabla^2 f(x,y) = 0$, using the update
     $A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$
     (The slide's figure shows the five-point stencil: A(i,j) and its neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1).) A serial C sketch of this loop nest follows this slide.
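     As a point of reference, here is a minimal serial sketch of the Jacobi iteration described above (my reconstruction from the code on the slides that follow; n, m, tol, and iter_max are assumed parameters). The later slides offload exactly these two loop nests:

         #include <math.h>
         #include <stdio.h>

         /* Serial Jacobi sweep; assumes an n x m grid in A with scratch
          * space in Anew, iterating until the largest change is below tol. */
         void jacobi(int n, int m, double A[n][m], double Anew[n][m],
                     double tol, int iter_max)
         {
             double error = tol + 1.0; /* force entry into the loop */
             int iter = 0;
             while (error > tol && iter < iter_max) {
                 error = 0.0;
                 /* New value at each interior point: average of 4 neighbors. */
                 for (int j = 1; j < n-1; j++)
                     for (int i = 1; i < m-1; i++) {
                         Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                                            + A[j-1][i] + A[j+1][i]);
                         error = fmax(error, fabs(Anew[j][i] - A[j][i]));
                     }
                 /* Copy the updated interior back for the next sweep. */
                 for (int j = 1; j < n-1; j++)
                     for (int i = 1; i < m-1; i++)
                         A[j][i] = Anew[j][i];
                 if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
                 iter++;
             }
         }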
  9. Teams & Distribute
  10. Teaming Up: explicitly map the arrays once for the entire while loop; each combined construct then spawns thread teams, distributes iterations to those teams, and workshares within those teams.

      #pragma omp target data map(to:Anew) map(A)
      while ( error > tol && iter < iter_max ) {
        error = 0.0;
        #pragma omp target teams distribute parallel for reduction(max:error) map(error)
        for( int j = 1; j < n-1; j++) {
          for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
          }
        }
        #pragma omp target teams distribute parallel for
        for( int j = 1; j < n-1; j++) {
          for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
          }
        }
        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
      }
  11. Execution Time in seconds (smaller is better; each bar is broken down into Data, Kernels, and Other). CLANG, GCC, XL: IBM "Minsky" with NVIDIA Tesla P100; Cray: Cray XC-40 with NVIDIA Tesla P100.

      Compiler    Time (s)
      CLANG        7.786044
      Cray         1.851838
      GCC         42.543545
      GCC simd     8.930509
      XL          17.8542
      XL simd     11.487634
  12. The same execution-time data, replotted without the GCC bar so the remaining results are readable at a finer scale.
  13. Increasing Parallelism
  14. Increasing Parallelism: currently both our distributed and our workshared parallelism come from the same loop.
      • We could collapse the two loops together.
      • We could move the PARALLEL to the inner loop.
      The COLLAPSE(N) clause turns the next N loops into one, linearized loop. This will give us more parallelism to distribute, if we so choose.
  15. Collapse: collapse the two loops into one and then parallelize this new loop across both teams and threads. (A sketch of the linearization that collapse(2) performs follows this slide.)

      #pragma omp target teams distribute parallel for reduction(max:error) map(error) collapse(2)
      for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
          Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
          error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
      }
      #pragma omp target teams distribute parallel for collapse(2)
      for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
          A[j][i] = Anew[j][i];
        }
      }
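      Conceptually, collapse(2) behaves like the following hand-linearized loop (my illustration, not the talk's code and not what any compiler literally emits; it reuses n, m, A, and Anew from the example):

          /* One linearized iteration space over all interior points, shared
           * across teams and threads, standing in for collapse(2) on the
           * copy loop. */
          long inner = m - 2;                  /* trip count of the i loop */
          long total = (long)(n - 2) * inner;  /* combined trip count      */
          #pragma omp target teams distribute parallel for
          for (long idx = 0; idx < total; idx++) {
              int j = 1 + (int)(idx / inner);  /* recover the outer index  */
              int i = 1 + (int)(idx % inner);  /* recover the inner index  */
              A[j][i] = Anew[j][i];
          }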
  16. Execution Time in seconds with collapse(2) (smaller is better; Data / Kernels / Other breakdown). Same systems as above.

      Compiler    Time (s)
      CLANG        1.490654
      Cray         1.820148
      GCC         41.812337
      XL           3.706288
  17. The same collapse(2) data, replotted without the GCC bar for a finer scale.
  18. Splitting Teams & Parallel: distribute the "j" loop over teams and workshare the "i" loop over threads.

      #pragma omp target teams distribute map(error)
      for( int j = 1; j < n-1; j++) {
        #pragma omp parallel for reduction(max:error)
        for( int i = 1; i < m-1; i++ ) {
          Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
          error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
      }
      #pragma omp target teams distribute
      for( int j = 1; j < n-1; j++) {
        #pragma omp parallel for
        for( int i = 1; i < m-1; i++ ) {
          A[j][i] = Anew[j][i];
        }
      }
  19. Execution Time in seconds with split teams/parallel (smaller is better; Data / Kernels / Other breakdown). Same systems as above.

      Compiler    Time (s)
      CLANG        2.30662
      Cray         1.94593
      GCC         49.474303
      GCC simd    12.261814
      XL          14.559997
  20. The same split-teams data, replotted without the GCC bar for a finer scale.
  21. Host Fallback
  22. Fallback to the Host Processor: most OpenMP users would like to write one set of directives for both the host and the device, but is this really possible? Using the "if" clause, offloading can be enabled or disabled at runtime; the compiler must build both CPU and GPU code and select between them at runtime. (A sketch of one way to set the switch appears after this slide.)

      #pragma omp target teams distribute parallel for reduction(max:error) map(error) collapse(2) if(target:use_gpu)
      for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
          Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
          error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
      }
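      One plausible way to drive the use_gpu flag at runtime (my sketch; the USE_GPU environment variable and the helper name are assumptions, not from the talk):

          #include <stdlib.h>
          #include <string.h>

          /* Hypothetical runtime switch: offload only when USE_GPU=1 is set.
           * The if(target:use_gpu) clause above then selects the GPU or the
           * host version of the compiled region. */
          static int read_use_gpu(void)
          {
              const char *env = getenv("USE_GPU");
              return env != NULL && strcmp(env, "1") == 0;
          }

          /* e.g. in main(): int use_gpu = read_use_gpu(); */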
  23. Host Fallback vs. Host Native: OpenMP performance as a percentage of reference CPU threading (0-120%) for the Teams, Collapse, and Split variants under CLANG, Cray, GCC, and XL. CLANG, GCC, XL: IBM "Minsky" with NVIDIA Tesla P100; Cray: Cray XC-40 with NVIDIA Tesla P100.
  24. Conclusions
  25. Conclusions: OpenMP offloading compilers for NVIDIA GPUs have improved dramatically over the past year and are ready for real use.
      • Will OpenMP Target code be portable between compilers? Maybe. The compilers are at various levels of maturity, and SIMD support and requirements are inconsistent between them.
      • Will OpenMP Target code be portable with the host? Highly compiler-dependent: XL does this very well, CLANG somewhat well, and GCC and Cray did poorly.
