SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
Jeff Larkin, August 2017 DOE Performance Portability Workshop
Early Results of OpenMP 4.5 Portability
on NVIDIA GPUs
2
Background
Since the last performance portability workshop, several OpenMP
implementations for NVIDIA GPUs have emerged or matured
As of August 2017, can these implementations deliver on performance,
portability, and performance portability?
• Will OpenMP Target code be portable between compilers?
• Will OpenMP Target code be portable with the host?
I will compare results using 4 compilers: CLANG, Cray, GCC, and XL
8/23/2017
3
OpenMP In Clang
Multi-vendor effort to implement OpenMP in Clang (including offloading)
Runtime based on open-sourced runtime from Intel.
Current status: much improved since last year!
Version used: clang/20170629
Compiler Options:
-O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=$CUDA_HOME
8/23/2017
4
OpenMP In Cray
Due to its experience with OpenACC, Cray’s OpenMP 4.x compiler was the first to
market for NVIDIA GPUs.
Observation: Does not adhere to OpenMP as strictly as the others.
Version used: 8.5.5
Compiler Options: None Required
Note: Cray performance results were obtained on an X86 + P100 system, unlike the
other compilers. Only GPU performance is being compared.
8/23/2017
5
OpenMP In GCC
Open-source GCC compiler with support for OpenMP offloading to NVIDIA GPUs
Runtime also based on open-sourced runtime from Intel
Current status: Mature on CPU, Very immature on GPU
Version used: 7.1.1 20170718 (experimental)
Compiler Options:
-O3 -fopenmp -foffload="-lm"
8/23/2017
6
OpenMP In XL
IBM’s compiler suite, which now includes offloading to NVIDIA GPUs.
Same(ish) runtime as CLANG, but compilation by IBM’s compiler
Version used: xl/20170727-beta
Compiler Options:
-O3 -qsmp -qoffload
8/23/2017
7
Case Study: Jacobi Iteration
8
Example: Jacobi Iteration
Iteratively converges to correct value (e.g. Temperature), by computing new
values at each point from the average of neighboring points.
Common, useful algorithm
Example: Solve Laplace equation in 2D: 𝛁 𝟐 𝒇(𝒙, 𝒚) = 𝟎
A(i,j)
A(i+1,j)A(i-1,j)
A(i,j-1)
A(i,j+1)
𝐴 𝑘+1 𝑖, 𝑗 =
𝐴 𝑘(𝑖 − 1, 𝑗) + 𝐴 𝑘 𝑖 + 1, 𝑗 + 𝐴 𝑘 𝑖, 𝑗 − 1 + 𝐴 𝑘 𝑖, 𝑗 + 1
4
9
Teams & Distribute
10
Teaming Up
#pragma omp target data map(to:Anew) map(A)
while ( error > tol && iter < iter_max )
{
error = 0.0;
#pragma omp target teams distribute parallel for reduction(max:error) map(error)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute parallel for
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
if(iter % 100 == 0) printf("%5d, %0.6fn", iter, error);
iter++;
}
Explicitly maps arrays
for the entire while
loop.
• Spawns thread teams
• Distributes iterations
to those teams
• Workshares within
those teams.
13
Execution Time (Smaller is Better)ExecutionTime(seconds)
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100
7.786044 1.851838 42.543545 8.930509 17.8542 11.487634
0
5
10
15
20
25
30
35
40
45
CLANG Cray GCC GCC simd XL XL simd
Data Kernels Other
14
Execution Time (Smaller is Better)ExecutionTime(seconds)
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100
7.786044 1.851838 8.930509 17.8542 11.487634
0
2
4
6
8
10
12
14
16
18
20
CLANG Cray GCC simd XL XL simd
Data Kernels Other
15
Increasing Parallelism
16
Increasing Parallelism
Currently both our distributed and workshared parallelism comes from the same
loop.
• We could collapse them together
• We could move the PARALLEL to the inner loop
The COLLAPSE(N) clause
• Turns the next N loops into one, linearized loop.
• This will give us more parallelism to distribute, if we so choose.
8/23/2017
17
Collapse
#pragma omp target teams distribute parallel for reduction(max:error) map(error) 
collapse(2)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute parallel for collapse(2)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Collapse the two loops
into one and then
parallelize this new
loop across both teams
and threads.
18
Execution Time (Smaller is Better)ExecutionTime(seconds)
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100
1.490654 1.820148 41.812337 3.706288
0
5
10
15
20
25
30
35
40
45
CLANG Cray GCC XL
Data Kernels Other
19
Execution Time (Smaller is Better)ExecutionTime(seconds)
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100
1.490654 1.820148 3.706288
0
0.5
1
1.5
2
2.5
3
3.5
4
CLANG Cray XL
Data Kernels Other
20
Splitting Teams & Parallel
#pragma omp target teams distribute map(error)
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for reduction(max:error)
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Distribute the “j” loop
over teams.
Workshare the “i” loop
over threads.
21
Execution Time (Smaller is Better)ExecutionTime(seconds)
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100
2.30662 1.94593 49.474303 12.261814 14.559997
0
10
20
30
40
50
60
CLANG Cray GCC GCC simd XL
Data Kernels Other
22
Execution Time (Smaller is Better)ExecutionTime(seconds)
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100
2.30662 1.94593 12.261814 14.559997
0
2
4
6
8
10
12
14
16
CLANG Cray GCC simd XL
Data Kernels Other
23
Host Fallback
24
Fallback to the Host Processor
Most OpenMP users would like to write 1 set of directives for host and device,
but is this really possible?
Using the “if” clause, offloading can be enabled/disabled at runtime.
#pragma omp target teams distribute parallel for reduction(max:error) map(error) 
collapse(2) if(target:use_gpu)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
Compiler must build CPU & GPU
codes and select at runtime.
25
Host Fallback vs. Host Native OpenMP%ofReferenceCPUThreading
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100
0%
20%
40%
60%
80%
100%
120%
Teams Collapse Split Teams Collapse Split Teams Collapse Split Teams Collapse Split
XLGCCCrayCLANG
26
Conclusions
27
Conclusions
OpenMP offloading compilers for NVIDIA GPUs have improved dramatically over the
past year and are ready for real use.
• Will OpenMP Target code be portable between compilers?
Maybe. Compilers are of various levels of maturity. SIMD support/requirement
inconsistent.
• Will OpenMP Target code be portable with the host?
Highly compiler-dependent. XL does this very well, CLANG somewhat well, and GCC
and Cray did poorly.
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs

Contenu connexe

Tendances

HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016Ehsan Totoni
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionAkihiro Hayashi
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Akihiro Hayashi
 
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...AMD Developer Central
 
【論文紹介】Relay: A New IR for Machine Learning Frameworks
【論文紹介】Relay: A New IR for Machine Learning Frameworks【論文紹介】Relay: A New IR for Machine Learning Frameworks
【論文紹介】Relay: A New IR for Machine Learning FrameworksTakeo Imai
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...Andrey Karpov
 
Heap sort &amp; bubble sort
Heap sort &amp; bubble sortHeap sort &amp; bubble sort
Heap sort &amp; bubble sortShanmuga Raju
 
Post-processing SAR images on Xeon Phi - a porting exercise
Post-processing SAR images on Xeon Phi - a porting exercisePost-processing SAR images on Xeon Phi - a porting exercise
Post-processing SAR images on Xeon Phi - a porting exerciseIntel IT Center
 
Scott Anderson [InfluxData] | Map & Reduce – The Powerhouses of Custom Flux F...
Scott Anderson [InfluxData] | Map & Reduce – The Powerhouses of Custom Flux F...Scott Anderson [InfluxData] | Map & Reduce – The Powerhouses of Custom Flux F...
Scott Anderson [InfluxData] | Map & Reduce – The Powerhouses of Custom Flux F...InfluxData
 
Yampa AFRP Introduction
Yampa AFRP IntroductionYampa AFRP Introduction
Yampa AFRP IntroductionChengHui Weng
 
190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pubJaewook. Kang
 
SPU Optimizations - Part 2
SPU Optimizations - Part 2SPU Optimizations - Part 2
SPU Optimizations - Part 2Naughty Dog
 
FPGA Implementation of a GA
FPGA Implementation of a GAFPGA Implementation of a GA
FPGA Implementation of a GAHocine Merabti
 
function* - ES6, generators, and all that (JSRomandie meetup, February 2014)
function* - ES6, generators, and all that (JSRomandie meetup, February 2014)function* - ES6, generators, and all that (JSRomandie meetup, February 2014)
function* - ES6, generators, and all that (JSRomandie meetup, February 2014)Igalia
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloadedmistercteam
 
lecture 6
lecture 6lecture 6
lecture 6sajinsc
 
Firefox content
Firefox contentFirefox content
Firefox contentUbald Agi
 

Tendances (18)

HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
Yacf
YacfYacf
Yacf
 
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
 
【論文紹介】Relay: A New IR for Machine Learning Frameworks
【論文紹介】Relay: A New IR for Machine Learning Frameworks【論文紹介】Relay: A New IR for Machine Learning Frameworks
【論文紹介】Relay: A New IR for Machine Learning Frameworks
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...
 
Heap sort &amp; bubble sort
Heap sort &amp; bubble sortHeap sort &amp; bubble sort
Heap sort &amp; bubble sort
 
Post-processing SAR images on Xeon Phi - a porting exercise
Post-processing SAR images on Xeon Phi - a porting exercisePost-processing SAR images on Xeon Phi - a porting exercise
Post-processing SAR images on Xeon Phi - a porting exercise
 
Scott Anderson [InfluxData] | Map & Reduce – The Powerhouses of Custom Flux F...
Scott Anderson [InfluxData] | Map & Reduce – The Powerhouses of Custom Flux F...Scott Anderson [InfluxData] | Map & Reduce – The Powerhouses of Custom Flux F...
Scott Anderson [InfluxData] | Map & Reduce – The Powerhouses of Custom Flux F...
 
Yampa AFRP Introduction
Yampa AFRP IntroductionYampa AFRP Introduction
Yampa AFRP Introduction
 
190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub
 
SPU Optimizations - Part 2
SPU Optimizations - Part 2SPU Optimizations - Part 2
SPU Optimizations - Part 2
 
FPGA Implementation of a GA
FPGA Implementation of a GAFPGA Implementation of a GA
FPGA Implementation of a GA
 
function* - ES6, generators, and all that (JSRomandie meetup, February 2014)
function* - ES6, generators, and all that (JSRomandie meetup, February 2014)function* - ES6, generators, and all that (JSRomandie meetup, February 2014)
function* - ES6, generators, and all that (JSRomandie meetup, February 2014)
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloaded
 
lecture 6
lecture 6lecture 6
lecture 6
 
Firefox content
Firefox contentFirefox content
Firefox content
 

Similaire à Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs

Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA Japan
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
 
MapReduce: teoria e prática
MapReduce: teoria e práticaMapReduce: teoria e prática
MapReduce: teoria e práticaPET Computação
 
Cypher for Gremlin
Cypher for GremlinCypher for Gremlin
Cypher for GremlinopenCypher
 
Subtle Asynchrony by Jeff Hammond
Subtle Asynchrony by Jeff HammondSubtle Asynchrony by Jeff Hammond
Subtle Asynchrony by Jeff HammondPatrick Diehl
 
Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Cdiscount
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkDB Tsai
 
Alpine Spark Implementation - Technical
Alpine Spark Implementation - TechnicalAlpine Spark Implementation - Technical
Alpine Spark Implementation - Technicalalpinedatalabs
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidiaMail.ru Group
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascaleinside-BigData.com
 
Threaded Programming
Threaded ProgrammingThreaded Programming
Threaded ProgrammingSri Prasanna
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...AMD Developer Central
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
SC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIASC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIAJeff Larkin
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...AMD Developer Central
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache SparkDB Tsai
 

Similaire à Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs (20)

Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
MapReduce: teoria e prática
MapReduce: teoria e práticaMapReduce: teoria e prática
MapReduce: teoria e prática
 
Cypher for Gremlin
Cypher for GremlinCypher for Gremlin
Cypher for Gremlin
 
Subtle Asynchrony by Jeff Hammond
Subtle Asynchrony by Jeff HammondSubtle Asynchrony by Jeff Hammond
Subtle Asynchrony by Jeff Hammond
 
Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache Spark
 
Alpine Spark Implementation - Technical
Alpine Spark Implementation - TechnicalAlpine Spark Implementation - Technical
Alpine Spark Implementation - Technical
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
 
Clojure basics
Clojure basicsClojure basics
Clojure basics
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 
Threaded Programming
Threaded ProgrammingThreaded Programming
Threaded Programming
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
SC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIASC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIA
 
presentation
presentationpresentation
presentation
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark
 

Plus de Jeff Larkin

FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesJeff Larkin
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesJeff Larkin
 
Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Jeff Larkin
 
Progress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEProgress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEJeff Larkin
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialJeff Larkin
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingJeff Larkin
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
A Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsA Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsJeff Larkin
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best PracticesJeff Larkin
 
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Jeff Larkin
 

Plus de Jeff Larkin (12)

FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid Architectures
 
Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7
 
Progress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEProgress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SE
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorial
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU Computing
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
A Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsA Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming Models
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best Practices
 
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 

Dernier (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs

  • 1. Jeff Larkin, August 2017 DOE Performance Portability Workshop Early Results of OpenMP 4.5 Portability on NVIDIA GPUs
  • 2. 2 Background Since the last performance portability workshop, several OpenMP implementations for NVIDIA GPUs have emerged or matured As of August 2017, can these implementations deliver on performance, portability, and performance portability? • Will OpenMP Target code be portable between compilers? • Will OpenMP Target code be portable with the host? I will compare results using 4 compilers: CLANG, Cray, GCC, and XL 8/23/2017
  • 3. 3 OpenMP In Clang Multi-vendor effort to implement OpenMP in Clang (including offloading) Runtime based on open-sourced runtime from Intel. Current status: much improved since last year! Version used: clang/20170629 Compiler Options: -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=$CUDA_HOME 8/23/2017
  • 4. 4 OpenMP In Cray Due to its experience with OpenACC, Cray’s OpenMP 4.x compiler was the first to market for NVIDIA GPUs. Observation: Does not adhere to OpenMP as strictly as the others. Version used: 8.5.5 Compiler Options: None Required Note: Cray performance results were obtained on an X86 + P100 system, unlike the other compilers. Only GPU performance is being compared. 8/23/2017
  • 5. 5 OpenMP In GCC Open-source GCC compiler with support for OpenMP offloading to NVIDIA GPUs Runtime also based on open-sourced runtime from Intel Current status: Mature on CPU, Very immature on GPU Version used: 7.1.1 20170718 (experimental) Compiler Options: -O3 -fopenmp -foffload="-lm" 8/23/2017
  • 6. 6 OpenMP In XL IBM’s compiler suite, which now includes offloading to NVIDIA GPUs. Same(ish) runtime as CLANG, but compilation by IBM’s compiler Version used: xl/20170727-beta Compiler Options: -O3 -qsmp -qoffload 8/23/2017
  • 8. 8 Example: Jacobi Iteration Iteratively converges to correct value (e.g. Temperature), by computing new values at each point from the average of neighboring points. Common, useful algorithm Example: Solve Laplace equation in 2D: 𝛁 𝟐 𝒇(𝒙, 𝒚) = 𝟎 A(i,j) A(i+1,j)A(i-1,j) A(i,j-1) A(i,j+1) 𝐴 𝑘+1 𝑖, 𝑗 = 𝐴 𝑘(𝑖 − 1, 𝑗) + 𝐴 𝑘 𝑖 + 1, 𝑗 + 𝐴 𝑘 𝑖, 𝑗 − 1 + 𝐴 𝑘 𝑖, 𝑗 + 1 4
  • 10. 10 Teaming Up #pragma omp target data map(to:Anew) map(A) while ( error > tol && iter < iter_max ) { error = 0.0; #pragma omp target teams distribute parallel for reduction(max:error) map(error) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } } #pragma omp target teams distribute parallel for for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } if(iter % 100 == 0) printf("%5d, %0.6fn", iter, error); iter++; } Explicitly maps arrays for the entire while loop. • Spawns thread teams • Distributes iterations to those teams • Workshares within those teams.
  • 11. 13 Execution Time (Smaller is Better)ExecutionTime(seconds) CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100 7.786044 1.851838 42.543545 8.930509 17.8542 11.487634 0 5 10 15 20 25 30 35 40 45 CLANG Cray GCC GCC simd XL XL simd Data Kernels Other
  • 12. 14 Execution Time (Smaller is Better)ExecutionTime(seconds) CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100 7.786044 1.851838 8.930509 17.8542 11.487634 0 2 4 6 8 10 12 14 16 18 20 CLANG Cray GCC simd XL XL simd Data Kernels Other
  • 14. 16 Increasing Parallelism Currently both our distributed and workshared parallelism comes from the same loop. • We could collapse them together • We could move the PARALLEL to the inner loop The COLLAPSE(N) clause • Turns the next N loops into one, linearized loop. • This will give us more parallelism to distribute, if we so choose. 8/23/2017
  • 15. 17 Collapse #pragma omp target teams distribute parallel for reduction(max:error) map(error) collapse(2) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } } #pragma omp target teams distribute parallel for collapse(2) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } Collapse the two loops into one and then parallelize this new loop across both teams and threads.
  • 16. 18 Execution Time (Smaller is Better)ExecutionTime(seconds) CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100 1.490654 1.820148 41.812337 3.706288 0 5 10 15 20 25 30 35 40 45 CLANG Cray GCC XL Data Kernels Other
  • 17. 19 Execution Time (Smaller is Better)ExecutionTime(seconds) CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100 1.490654 1.820148 3.706288 0 0.5 1 1.5 2 2.5 3 3.5 4 CLANG Cray XL Data Kernels Other
  • 18. 20 Splitting Teams & Parallel #pragma omp target teams distribute map(error) for( int j = 1; j < n-1; j++) { #pragma omp parallel for reduction(max:error) for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } } #pragma omp target teams distribute for( int j = 1; j < n-1; j++) { #pragma omp parallel for for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } Distribute the “j” loop over teams. Workshare the “i” loop over threads.
  • 19. 21 Execution Time (Smaller is Better)ExecutionTime(seconds) CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100 2.30662 1.94593 49.474303 12.261814 14.559997 0 10 20 30 40 50 60 CLANG Cray GCC GCC simd XL Data Kernels Other
  • 20. 22 Execution Time (Smaller is Better)ExecutionTime(seconds) CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100 2.30662 1.94593 12.261814 14.559997 0 2 4 6 8 10 12 14 16 CLANG Cray GCC simd XL Data Kernels Other
  • 22. 24 Fallback to the Host Processor Most OpenMP users would like to write 1 set of directives for host and device, but is this really possible? Using the “if” clause, offloading can be enabled/disabled at runtime. #pragma omp target teams distribute parallel for reduction(max:error) map(error) collapse(2) if(target:use_gpu) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = fmax( error, fabs(Anew[j][i] - A[j][i])); } } Compiler must build CPU & GPU codes and select at runtime.
  • 23. 25 Host Fallback vs. Host Native OpenMP%ofReferenceCPUThreading CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100, Cray: CrayXC-40, NVIDIA TeslaP100 0% 20% 40% 60% 80% 100% 120% Teams Collapse Split Teams Collapse Split Teams Collapse Split Teams Collapse Split XLGCCCrayCLANG
  • 25. 27 Conclusions OpenMP offloading compilers for NVIDIA GPUs have improved dramatically over the past year and are ready for real use. • Will OpenMP Target code be portable between compilers? Maybe. Compilers are of various levels of maturity. SIMD support/requirement inconsistent. • Will OpenMP Target code be portable with the host? Highly compiler-dependent. XL does this very well, CLANG somewhat well, and GCC and Cray did poorly.