spcl.inf.ethz.ch
@spcl_eth
Automatic Accelerator Compilation of the COSMO Physics Core
Tobias Grosser, Siddharth Bhat, Torsten Hoefler
December 2017
Albert Cohen, Sven Verdoolaege, Oleksandr Zinenko (Polly Labs, ENS Paris)
Johannes Doerfert (Uni. Saarbruecken)
Roman Gareev (Ural Federal University)
Hongbin Zheng, Alexandre Isoard (Xilinx)
Swiss Universities / PASC
Qualcomm, ARM, Xilinx
… many others
Weather
Physics Simulations
Machine Learning
Graphics
row = 0;
output_image_ptr = output_image;
output_image_ptr += (NN * dead_rows);
for (r = 0; r < NN - KK + 1; r++) {
    output_image_offset = output_image_ptr;
    output_image_offset += dead_cols;
    col = 0;
    for (c = 0; c < NN - KK + 1; c++) {
        input_image_ptr = input_image;
        input_image_ptr += (NN * row);
        kernel_ptr = kernel;
        /* S0 */ *output_image_offset = 0;
        for (i = 0; i < KK; i++) {
            input_image_offset = input_image_ptr;
            input_image_offset += col;
            kernel_offset = kernel_ptr;
            for (j = 0; j < KK; j++) {
                /* S1 */ temp1 = *input_image_offset++;
                /* S1 */ temp2 = *kernel_offset++;
                /* S1 */ *output_image_offset += temp1 * temp2;
            }
            kernel_ptr += KK;
            input_image_ptr += NN;
        }
        /* S2 */ *output_image_offset = ((*output_image_offset) /
                                         normal_factor);
        output_image_offset++;
        col++;
    }
    output_image_ptr += NN;
    row++;
}
} /* closes the enclosing function (not shown) */
[Figure: sequential software (Fortran, C/C++) versus parallel hardware (multi-core & SIMD CPU, GPU, accelerator). Labels: development time, maintenance cost, performance tuning.]
COSMO: Weather and Climate Model
• 500,000 lines of Fortran
• 18,000 loops
• 19 years of knowledge
• Used in Switzerland, Russia, Germany, Poland, Italy, Israel, Greece, Romania, …
COSMO – Climate Modeling
• Global (low-resolution model)
• Up to 5000 nodes
• Runs “monthly”
Piz Daint, Lugano, Switzerland
COSMO – Weather Forecast
• Regional model
• High-resolution
• Runs “hourly” (20 instances in parallel)
• Today: 40 nodes × 8 GPUs
• Manual translation to GPUs: a 3-year, multi-person project
Can we automate this GPU mapping?
The LLVM Compiler
Frontends
• Static languages: C / C++, Fortran (COSMO), Go / D / C# / …
• Compute languages: Julia (MATLAB style)
• Dynamic languages: JavaScript, Java
Targets
• CPU: Intel / AMD, PowerPC, ARM / MIPS
• GPU: NVIDIA, AMD / ARM
• FPGA: Xilinx, Altera
Polyhedral Model – In a nutshell
Program Code
for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i,j);
Iteration Space (N = 4)
Constraints: 0 ≤ i, i ≤ N = 4, 0 ≤ j, j ≤ i
$D = \{ (i,j) \mid 0 \le i \le N \land 0 \le j \le i \}$
[Figure: the 15 integer points of D for N = 4 in the (i, j) plane, from (i = 0, j = 0) to (i = 4, j = 4)]
Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation
Tobias Grosser, Armin Groesslinger, and Christian Lengauer in Parallel Processing Letters (PPL), April 2012
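A minimal sketch of the iteration space as code, assuming the triangular domain 0 ≤ i ≤ n, 0 ≤ j ≤ i from the slide. The function names count_domain_points and in_domain are illustrative only and not part of Polly.

```c
/* Count the statement instances S(i, j) with 0 <= i <= n and 0 <= j <= i
 * by walking the loop nest from the slide. */
long count_domain_points(int n) {
    long count = 0;
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= i; j++)
            count++;            /* one instance S(i, j) per integer point */
    return count;
}

/* Membership test for the same set, written as the conjunction of the
 * four affine constraints. */
int in_domain(int n, int i, int j) {
    return 0 <= i && i <= n && 0 <= j && j <= i;
}
```

For N = 4 the loop nest visits (4 + 1)(4 + 2) / 2 = 15 points; a point such as (i = 0, j = 1) violates j ≤ i and lies outside the set.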
Static Control Parts - SCoPs
• Structured control
  • IF-conditions
  • Counted FOR-loops (Fortran style)
• Multi-dimensional array accesses (and scalars)
• Loop conditions and IF-conditions are Presburger formulas
• Loop increments are constant (non-parametric)
• Array subscript expressions are piecewise-affine
• Can be modeled precisely with Presburger sets
Polyhedral Model of Static Control Part
for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S: B[i][j] = A[i][j] + A[i][j+1];
• Iteration Space (Domain)
  $I_S = \{ S(i,j) \mid 0 \le i \le N \land 0 \le j \le i \}$
• Schedule
  $\theta_S = \{ S(i,j) \to (i,j) \}$
• Access Relation
  • Reads: $\{ S(i,j) \to A(i,j);\ S(i,j) \to A(i,j+1) \}$
  • Writes: $\{ S(i,j) \to B(i,j) \}$
Polyhedral Schedule: Original
Model
$I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
$\theta_S = \{ S(i,j) \to (i,j) \}$
Code
for (i = 0; i <= n; i++)
  for (j = 0; j <= i; j++)
    S(i, j);
Polyhedral Schedule: Original
Model
$I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
$\theta_S = \{ S(i,j) \to (i,j) \}$
Code
for (c0 = 0; c0 <= n; c0++)
  for (c1 = 0; c1 <= c0; c1++)
    S(c0, c1);
Polyhedral Schedule: Interchanged
Model
$I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
$\theta_S = \{ S(i,j) \to (j,i) \}$
Code
for (c0 = 0; c0 <= n; c0++)
  for (c1 = c0; c1 <= n; c1++)
    S(c1, c0);
Polyhedral Schedule: Strip-mined
Model
$I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
$\theta_S = \{ S(i,j) \to (\lfloor i/4 \rfloor, j, i \bmod 4) \}$
Code
for (c0 = 0; c0 <= floord(n, 4); c0++)
  for (c1 = 0; c1 <= min(n, 4 * c0 + 3); c1++)
    for (c2 = max(0, -4 * c0 + c1);
         c2 <= min(3, n - 4 * c0); c2++)
      S(4 * c0 + c2, c1);
Polyhedral Schedule: Blocked
Model
$I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
$\theta_S = \{ S(i,j) \to (\lfloor i/4 \rfloor, \lfloor j/4 \rfloor, i \bmod 4, j \bmod 4) \}$
Code
for (c0 = 0; c0 <= floord(n, 4); c0++)
  for (c1 = 0; c1 <= c0; c1++)
    for (c2 = 0; c2 <= min(3, n - 4 * c0); c2++)
      for (c3 = 0; c3 <= min(3, 4 * c0 - 4 * c1 + c2); c3++)
        S(4 * c0 + c2, 4 * c1 + c3);
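A schedule only reorders statement instances, so every version of the loop nest should execute each S(i, j) exactly once. The sketch below checks this by brute force for the original, strip-mined and blocked code, assuming the innermost strip-mined loop tests c2 against its upper bound. MAXN and all function names are illustrative assumptions; floord, min and max are written out as small helpers.

```c
#include <string.h>

#define MAXN 64   /* illustrative limit: supports 0 <= n < MAXN */

static int floord(int a, int b) {            /* floor(a / b) for b > 0 */
    int q = a / b, r = a % b;
    return (r < 0) ? q - 1 : q;
}
static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Each variant increments v[i][j] once per executed instance S(i, j). */
static void run_original(int n, int v[MAXN][MAXN]) {
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= i; j++)
            v[i][j]++;
}

static void run_strip_mined(int n, int v[MAXN][MAXN]) {
    for (int c0 = 0; c0 <= floord(n, 4); c0++)
        for (int c1 = 0; c1 <= imin(n, 4 * c0 + 3); c1++)
            for (int c2 = imax(0, -4 * c0 + c1); c2 <= imin(3, n - 4 * c0); c2++)
                v[4 * c0 + c2][c1]++;
}

static void run_blocked(int n, int v[MAXN][MAXN]) {
    for (int c0 = 0; c0 <= floord(n, 4); c0++)
        for (int c1 = 0; c1 <= c0; c1++)
            for (int c2 = 0; c2 <= imin(3, n - 4 * c0); c2++)
                for (int c3 = 0; c3 <= imin(3, 4 * c0 - 4 * c1 + c2); c3++)
                    v[4 * c0 + c2][4 * c1 + c3]++;
}

/* Returns 1 if all three loop nests visit every point of the domain
 * exactly once and nothing outside it, 0 otherwise. */
int schedules_agree(int n) {
    static int a[MAXN][MAXN], b[MAXN][MAXN], c[MAXN][MAXN];
    if (n < 0 || n >= MAXN)
        return 0;
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    memset(c, 0, sizeof c);
    run_original(n, a);
    run_strip_mined(n, b);
    run_blocked(n, c);
    for (int i = 0; i < MAXN; i++)
        for (int j = 0; j < MAXN; j++) {
            int expected = (i <= n && j <= i) ? 1 : 0;
            if (a[i][j] != expected || b[i][j] != expected || c[i][j] != expected)
                return 0;
        }
    return 1;
}
```

Calling schedules_agree(n) for a few values of n, including ones that are not multiples of 4, exercises the partial tiles at the domain boundary.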
Mapping Computation to Device
[Figure: the (i, j) iteration space partitioned into device blocks and threads]
Device Blocks & Threads
$BID = \{ (i,j) \to (\lfloor i/4 \rfloor \bmod 2,\ \lfloor j/3 \rfloor \bmod 2) \}$
$TID = \{ (i,j) \to (i \bmod 4,\ j \bmod 3) \}$
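The block and thread mapping can be sketched as two small functions, assuming non-negative i and j, a 2 × 2 grid of blocks and 4 × 3 threads per block as in the formulas. The type id2_t and the function names are hypothetical.

```c
typedef struct { int x, y; } id2_t;

/* Block ID: tile index in each dimension, wrapped onto a 2 x 2 grid. */
id2_t block_id(int i, int j) {
    id2_t b = { (i / 4) % 2, (j / 3) % 2 };
    return b;
}

/* Thread ID: position inside a 4 x 3 tile. */
id2_t thread_id(int i, int j) {
    id2_t t = { i % 4, j % 3 };
    return t;
}
```

For example, iteration (i = 5, j = 7) lands in block (1, 0) as thread (1, 1); iteration (i = 9, j = 2) wraps around to block (0, 0).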
Polly-ACC: Architecture
Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Tobias Grosser, Torsten Hoefler at International Conference on Supercomputing (ICS), June 2016, Istanbul
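Kernels to Programs – Data Transfers
void heat(int n, float A[n], float hot, float cold) {
  float B[n];                        /* a variable-length array cannot take an initializer */
  for (int i = 0; i < n; i++)
    B[i] = 0;
  for (int i = 0; i < n; i++)        /* GPU kernel */
    A[i] = cold;
  setCenter(n, A, hot, n/4);         /* GPU kernel */
  for (int t = 0; t < T; t++) {      /* T: number of time steps, declared elsewhere */
    average(n, A, B);                /* GPU kernel */
    average(n, B, A);                /* GPU kernel */
    printf("Iteration %d done", t);  /* host code with unknown side effects */
  }
}
Slide labels: OpenCL kernel, CUDA GPU (four kernels), host code with unknown side effects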
Data Transfer – Per Kernel
[Figure: timeline between host memory and device memory; H → D is a host-to-device copy, D → H a device-to-host copy]
initialize(): D → H
setCenter(): D → H
average(): H → D, D → H
average(): H → D, D → H
average(): H → D, D → H
Data Transfer – Inter Kernel Caching
[Figure: timeline between host memory and device memory]
Kernels: initialize(), setCenter(), average(), average(), average()
Transfers: a single H → D and a single D → H
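The difference between the two transfer policies can be sketched with a small counting model. This is an assumption-laden simplification, not Polly-ACC's runtime: one array, a trace string where 'K' is a kernel that uses the array on the device and 'H' is host code that touches it, and a conservative per-kernel policy that copies in and out around every kernel. All names are hypothetical.

```c
typedef struct { int h2d; int d2h; } transfers_t;

/* Per-kernel policy: copy to the device before and back to the host
 * after every kernel launch. */
transfers_t per_kernel_transfers(const char *trace) {
    transfers_t t = { 0, 0 };
    for (; *trace; trace++)
        if (*trace == 'K') { t.h2d++; t.d2h++; }
    return t;
}

/* Cached policy: remember where the valid copy lives and transfer only
 * when the other side needs it. */
transfers_t cached_transfers(const char *trace) {
    transfers_t t = { 0, 0 };
    int valid_on_device = 0;            /* data starts out on the host */
    for (; *trace; trace++) {
        if (*trace == 'K' && !valid_on_device) {
            t.h2d++;                    /* first kernel after a host access */
            valid_on_device = 1;
        } else if (*trace == 'H' && valid_on_device) {
            t.d2h++;                    /* host needs the kernel results */
            valid_on_device = 0;
        }
    }
    return t;
}
```

For the trace "KKKKKH" (five kernels, then the host reads the result) the per-kernel policy performs five copies in each direction, while the cached policy performs one H → D and one D → H.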
Evaluation – Polly-ACC
Workstation: 10-core Sandy Bridge CPU, NVIDIA Titan Black (Kepler) GPU
Lattice Boltzmann (SPEC 2006)
[Figure: run time on the workstation (axis from 0:00 to 4:19) for icc, icc -openmp, clang, and Polly-ACC]
• 4x speedup vs. single-thread CPU
• 2x speedup vs. multi-thread CPU
Cactus ADM (SPEC 2006) - Performance
Statistics - COSMO
• Number of loops
  • 18,093 total
  • 9,760 static control loops (modeled precisely by Polly)
  • 15,245 non-affine memory accesses (approximated by Polly)
  • 11,154 loops after precise modeling, less e.g. due to:
    • Infeasible assumptions taken, or modeling timeouts
• Largest set of loops: 72 loops
• Reasons why loops cannot be modeled
  • Function calls with side effects
  • Uncomputable loop bounds (data-dependent loop bounds?)
Siddharth Bhat
Radiation Computation in COSMO (call graph)
init_radiation
organize_radiation
fesft
• opt_th
• opt_so
• inv_th
• coe_th
• inv_so
• coe_so
Hot functions: must be inlined and interchanged
Compute kernels in all functions
Interprocedural Loop Interchange for GPU Execution (inv_th)
#ifdef _OPENACC
  !$acc parallel
  !$acc loop gang vector
  DO j1 = ki1sc, ki1ec
    CALL coe_th_gpu(pduh2oc(j1, ki3sc), pduh2of(j1, ki3sc), pduco2(j1, ki3sc), &
                    pduo3(j1, ki3sc), …, pa2f(j1), pa3c(j1), pa3f(j1))
  ENDDO
  !$acc end parallel
#else
  CALL coe_th(pduh2oc, pduh2of, pduco2, pduo3, palogp, palogt, podsc, podsf, podac, podaf, &
              …, pa3c, pa3f)
#endif
Pulled out parallel loop for OpenACC annotations
Optical Effect on Solar Layer (inv_th)
DO j3 = ki3sc+1, ki3ec             ! Outer loop is sequential
  CALL coe_th (j3) {               ! Determine effect of the layer in *coe_th*
    ! Optical depth of gases
    DO j1 = ki1sc, ki1ec           ! Inner loop is parallel
      …
      IF (kco2 /= 0) THEN
        zodgf = zodgf + pduco2(j1,j3) * (cobi(kco2,kspec,2) * EXP(coali(kco2,kspec,2) * &
                palogp(j1,j3) - cobti(kco2,kspec,2) * palogt(j1,j3)))
      ENDIF
      …
      zeps = SQRT(zodgf*zodgf)
      …
    ENDDO
  }
  DO j1 = ki1sc, ki1ec             ! Set RHS (inner loop is parallel)
    …
  ENDDO
  DO j1 = ki1sc, ki1ec             ! Elimination and storage of utility variables (inner loop is parallel)
    …
  ENDDO
ENDDO                              ! End of vertical loop over layers
Sequential dependences
Optical Effect on Solar Layer – After interchange
!> Turn loop structure with multiple ip loops inside a
!> single k loop into perfectly nested k-ip loop on GPU.
#ifdef _OPENACC
!$acc parallel
!$acc loop gang vector
DO j1 = ki1sc, ki1ec               ! Outer loop is parallel
  !$acc loop seq
  DO j3 = ki3sc+1, ki3ec           ! Loop over vertical (inner loop is sequential)
    ! Determine effects of layer in *coe_so*
    CALL coe_so_gpu(pduh2oc(j1,j3), pduh2of(j1,j3), …, pa4c(j1), pa4f(j1), pa5c(j1), pa5f(j1))
    ! Elimination
    …
    ztd1 = 1.0_dp/(1.0_dp-pa5f(j1)*(pca2(j1,j3)*ztu6(j1,j3-1)+pcc2(j1,j3)*ztu8(j1,j3-1)))
    ztu9(j1,j3) = pa5c(j1)*pcd1(j1,j3)+ztd6*ztu3(j1,j3) + ztd7*ztu5(j1,j3)
  END DO                           ! Vertical loop
ENDDO
!$acc end parallel
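Why the interchange is legal can be seen on a toy recurrence: every value depends only on the previous layer of the same horizontal point, so the point loop may move outside the layer loop. The sketch below is a stand-in with made-up array sizes, names and arithmetic; it is not the COSMO radiation code.

```c
#define NI 8   /* horizontal points (parallel dimension), illustrative */
#define NK 5   /* vertical layers (sequential dimension), illustrative */

/* Original order: sequential layer loop outside, parallel point loop inside. */
void layers_outer(double u[NI][NK], double rhs[NI][NK]) {
    for (int k = 1; k < NK; k++)
        for (int i = 0; i < NI; i++)
            u[i][k] = 0.5 * u[i][k - 1] + rhs[i][k];
}

/* Interchanged order: parallel point loop outside, sequential layer loop inside. */
void points_outer(double u[NI][NK], double rhs[NI][NK]) {
    for (int i = 0; i < NI; i++)
        for (int k = 1; k < NK; k++)
            u[i][k] = 0.5 * u[i][k - 1] + rhs[i][k];
}

/* Returns 1 if both loop orders compute identical values. */
int interchange_preserves_result(void) {
    double a[NI][NK], b[NI][NK], rhs[NI][NK];
    for (int i = 0; i < NI; i++)
        for (int k = 0; k < NK; k++) {
            rhs[i][k] = i + 0.25 * k;
            a[i][k] = b[i][k] = (k == 0) ? (double)i : 0.0;
        }
    layers_outer(a, rhs);
    points_outer(b, rhs);
    for (int i = 0; i < NI; i++)
        for (int k = 0; k < NK; k++)
            if (a[i][k] != b[i][k])
                return 0;
    return 1;
}
```

Each element is computed by the same operations on the same inputs in both orders, so the comparison can be exact.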
Live-Range Reordering (IMPACT’16, Verdoolaege et al.)
[Figure: loop nests labeled sequential / parallel and parallel / sequential]
• Privatization needed for parallel execution
• False dependences prevent interchange
• Scalable scheduling
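A toy illustration of the false-dependence problem, with made-up names and sizes: a temporary array reused by every layer forces the layers to run in order even where no value flows between them. Giving each iteration its own private scalar removes that reuse, after which the loops can be fused and interchanged. This is a hand-written sketch of the idea, not the live-range reordering algorithm itself.

```c
#define NI 6   /* horizontal points, illustrative */
#define NK 4   /* vertical layers, illustrative */

/* Before: tmp[] is overwritten in every layer k, which creates
 * write-after-read and write-after-write (false) dependences across k. */
void shared_temporary(double out[NI][NK], double in[NI][NK]) {
    double tmp[NI];
    for (int k = 1; k < NK; k++) {
        for (int i = 0; i < NI; i++)
            tmp[i] = 2.0 * in[i][k];
        for (int i = 0; i < NI; i++)
            out[i][k] = out[i][k - 1] + tmp[i];
    }
}

/* After: the temporary is a private scalar, the two inner loops are fused
 * and the point loop is outermost (parallel), the layer loop innermost. */
void privatized_temporary(double out[NI][NK], double in[NI][NK]) {
    for (int i = 0; i < NI; i++)
        for (int k = 1; k < NK; k++) {
            double t = 2.0 * in[i][k];
            out[i][k] = out[i][k - 1] + t;
        }
}

/* Returns 1 if both versions compute identical values. */
int privatization_preserves_result(void) {
    double a[NI][NK], b[NI][NK], in[NI][NK];
    for (int i = 0; i < NI; i++)
        for (int k = 0; k < NK; k++) {
            in[i][k] = 1.0 + i + 0.5 * k;
            a[i][k] = b[i][k] = (k == 0) ? 1.0 : 0.0;
        }
    shared_temporary(a, in);
    privatized_temporary(b, in);
    for (int i = 0; i < NI; i++)
        for (int k = 0; k < NK; k++)
            if (a[i][k] != b[i][k])
                return 0;
    return 1;
}
```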
Polly-ACC: Architecture
Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Tobias Grosser, Torsten Hoefler at International Conference on Supercomputing (ICS), June 2016, Istanbul
• Intrinsics to model multi-dimensional strided arrays
• Better ways to link with NVIDIA libdevice
• Scalable modeling
• Scalable scheduling
• Unified memory
• OpenCL + SPIR-V backend
Memory on CPU + GPU Hybrid Machine
System DRAM System GDDR5
Automatic
Performance
[Figure: COSMO, logarithmic axis from 1 to 1000, comparing Dragonegg + LLVM (CPU only), Cray (CPU only), Polly-ACC (P100), and Manual OpenACC (P100)]
Speedups annotated in the chart: 5x, 4.3x, 22x
All important loop transformations performed
Headroom:
- Kernel compilation (1.5s)
- Register usage (2x)
- Block-size tuning
- Unified-memory overhead?
Per-Kernel Performance
%        Total      Calls   Time-per-call   Kernel
29.98%   1.22414s   939     1.0221ms        FUNC___radiation_rg_MOD_inv_th_SCOP_0_KERNEL_0
19.19%   783.48ms   580     1.1786ms        FUNC___radiation_rg_MOD_inv_so_SCOP_0_KERNEL_1
8.50%    347.10ms   140     146.62us        FUNC___radiation_rg_MOD_fesft_dp_SCOP_11_KERNEL_0
... ~ 50 more
• Per-kernel time is short
• Many small kernels
• Still much longer than OpenACC kernels
Correct Types for Loop Transformations
Maximilian Falkenstein
for (int32 i = 1; i < N; i++)
  for (int32 j = 1; j <= M; j++)
    A(i,j) = A(i-1,j) + A(i,j-1)
[Figure: iteration space with axes i and j]
Correct Types for Loop Transformations
Maximilian Falkenstein
Original loop nest (axes i, j):
for (int32 i = 1; i < N; i++)
  for (int32 j = 1; j <= M; j++)
    A(i,j) = A(i-1,j) + A(i,j-1)
Skewed loop nest (axes i, c = i + j):
for (intX c = 2; c < N+M; c++)
  #pragma simd
  for (intX i = max(1, c-M); i <= min(N-1, c-1); i++)
    A(i,c-i) = A(i-1,c-i) + A(i,c-i-1)
What is X? N + M may need more than 32 bits.
TODAY
• Use 64-bit
• Hope it’s enough
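A minimal sketch of the skewing and of the type problem it creates, assuming the original nest runs over 1 ≤ i < N and 1 ≤ j ≤ M so that the wavefront index c = i + j ranges over 2 ≤ c < N + M and i ≤ min(N-1, c-1). The sizes, initial values and function names are illustrative.

```c
#include <stdint.h>

#define N 6   /* illustrative problem size */
#define M 5   /* illustrative problem size */

/* Returns 1 if the skewed (wavefront) loop nest computes the same array
 * as the original loop nest. */
int skewed_matches_original(void) {
    int64_t a[N][M + 1], b[N][M + 1];
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= M; j++)
            a[i][j] = b[i][j] = i + 2 * j + 1;   /* boundary and filler values */

    /* Original: 32-bit induction variables are enough for i and j. */
    for (int32_t i = 1; i < N; i++)
        for (int32_t j = 1; j <= M; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];

    /* Skewed: c = i + j can exceed the range of i or j alone, so it is
     * given a 64-bit type here. */
    for (int64_t c = 2; c < (int64_t)N + M; c++) {
        int64_t lo = (c - M > 1) ? c - M : 1;
        int64_t hi = (N - 1 < c - 1) ? N - 1 : c - 1;
        for (int64_t i = lo; i <= hi; i++)       /* independent iterations */
            b[i][c - i] = b[i - 1][c - i] + b[i][c - i - 1];
    }

    for (int i = 0; i < N; i++)
        for (int j = 0; j <= M; j++)
            if (a[i][j] != b[i][j])
                return 0;
    return 1;
}

/* Does n + m still fit into a signed 32-bit integer? */
int sum_fits_int32(int32_t n, int32_t m) {
    int64_t s = (int64_t)n + (int64_t)m;
    return s >= INT32_MIN && s <= INT32_MAX;
}
```

With n = INT32_MAX and m = 1 both bounds fit into 32 bits but their sum does not, which is why the type X of the new induction variable cannot simply be copied from the original loops.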
What type would be optimal?
Platform                     Index type
Server or workstation CPU    64-bit
Embedded or HPC GPU          32-bit
Embedded CPU                 32/16 bit
FPGA                         minimal
Ordered from today (64-bit) to tomorrow (minimal)
COSMO
Can we always get 32-bit types?
Precise Solution
for (intX c = 2; c < N+M; c++)
  #pragma simd
  for (intX i = max(1, c-M); i <= min(N-1, c-1); i++)
    A(i, c-i) = A(i-1, c-i) + A(i, c-i-1)
[Figure: expression tree for c - i - 1]
Domain: { (c) : 2 <= c < N + M ∧ INT_MIN <= N, M <= INT_MAX }
f0() = c - i
f1() = c - i - 1
1) calc: min(fX()), max(fX()) under Domain
2) choose type accordingly
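The two steps can be sketched for small, concrete N and M. The sketch replaces the solver by brute-force enumeration of the loop domain, which is exact but only feasible for tiny sizes, and it assumes the inner bound i ≤ min(N-1, c-1). The names range_t, range_of_f1 and signed_bits_needed are hypothetical.

```c
#include <stdint.h>

typedef struct { int64_t lo, hi; } range_t;

/* Step 1: minimum and maximum of f1 = c - i - 1 over the skewed loop
 * domain, found by enumerating every (c, i) pair. */
range_t range_of_f1(int64_t n, int64_t m) {
    range_t r = { INT64_MAX, INT64_MIN };        /* empty range to start */
    for (int64_t c = 2; c < n + m; c++) {
        int64_t lo_i = (c - m > 1) ? c - m : 1;
        int64_t hi_i = (n - 1 < c - 1) ? n - 1 : c - 1;
        for (int64_t i = lo_i; i <= hi_i; i++) {
            int64_t f1 = c - i - 1;
            if (f1 < r.lo) r.lo = f1;
            if (f1 > r.hi) r.hi = f1;
        }
    }
    return r;
}

/* Step 2: smallest signed two's-complement width that holds [lo, hi]. */
int signed_bits_needed(int64_t lo, int64_t hi) {
    for (int bits = 1; bits < 64; bits++) {
        int64_t min_v = -((int64_t)1 << (bits - 1));
        int64_t max_v = ((int64_t)1 << (bits - 1)) - 1;
        if (lo >= min_v && hi <= max_v)
            return bits;
    }
    return 64;
}
```

For N = 6 and M = 5 the expression c - i - 1 equals j - 1 and stays within [0, 4], so 4 signed bits suffice for it even though c itself runs up to 10.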
ILP Solver
• Minimal types
• Potentially costly
Approximations*
• $s(a+b) \le \max(s(a), s(b)) + 1$
• Good, if smaller than native type
* Earlier uses in GCC and Polly
Preconditions
• Assume values fit into 32 bit
• Derive required pre-conditions
[Figure: expression tree over c, i and 1]
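The approximation rule can be checked exhaustively for small widths. The sketch assumes s(x) denotes the number of bits of the smallest signed two's-complement type that holds x; the slide does not define s, and the function names are illustrative.

```c
/* s(v): smallest signed two's-complement width (in bits) that holds v. */
int bits_of(long v) {
    int bits = 1;
    while (v < -(1L << (bits - 1)) || v > (1L << (bits - 1)) - 1)
        bits++;
    return bits;
}

/* The approximation: a sum needs at most one bit more than its wider operand. */
int approx_add_bits(int sa, int sb) {
    return (sa > sb ? sa : sb) + 1;
}

/* Returns 1 if s(a + b) <= max(s(a), s(b)) + 1 holds for all pairs of
 * values representable in `width` signed bits. */
int approximation_is_safe(int width) {
    long lo = -(1L << (width - 1));
    long hi = (1L << (width - 1)) - 1;
    for (long a = lo; a <= hi; a++)
        for (long b = lo; b <= hi; b++)
            if (bits_of(a + b) > approx_add_bits(bits_of(a), bits_of(b)))
                return 0;
    return 1;
}
```

The bound is cheap to evaluate but can overestimate: adding a 32-bit and a 16-bit value is reported as 33 bits even when the actual value range would still fit into 32.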
Type Distribution for LNT SCOPS
32 + epsilon is
almost always
enough!
Compile Time Overhead
[Figure: GPU code generation (5000 lines of code), axis from 0 to 30, for No Types, Solver, Solver + Approx, and Solver + Approx (8 bit)]
Less than 10% overhead vs. no types.
Automatic Compilation to FPGA?
• Automatic translation: floating point → fixed point
• COSMO mostly floating point (single precision / double precision)
• SPIR-V flow to Xilinx HLS tools
• Translate LLVM-IR directly to Verilog
• How to get cache coherence?
• Visited Xilinx: could share some of their software toolchain for Enzian
• How to schedule kernels?
• Partial reconfiguration
• Data caching
• Can we keep data in BRAM?
• Kernel size
• Can we reduce the size of all kernels?
• …
Conclusion
• Optimal & correct types
• Automatic unified memory transfers
• Complex loop transformations
• Hybrid mapping
More Related Content

What's hot

Code GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory SubsystemCode GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory SubsystemMarina Kolpakova
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Vc4c development of opencl compiler for videocore4
Vc4c  development of opencl compiler for videocore4Vc4c  development of opencl compiler for videocore4
Vc4c development of opencl compiler for videocore4nomaddo
 
Enrichment lecture EE Technion (parts A&B) also including the subject of VHDL...
Enrichment lecture EE Technion (parts A&B) also including the subject of VHDL...Enrichment lecture EE Technion (parts A&B) also including the subject of VHDL...
Enrichment lecture EE Technion (parts A&B) also including the subject of VHDL...Amos Zaslavsky
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
C++ AMP 실천 및 적용 전략
C++ AMP 실천 및 적용 전략 C++ AMP 실천 및 적용 전략
C++ AMP 실천 및 적용 전략 명신 김
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVELinaro
 
A Simple Communication System Design Lab #4 with MATLAB Simulink
A Simple Communication System Design Lab #4 with MATLAB SimulinkA Simple Communication System Design Lab #4 with MATLAB Simulink
A Simple Communication System Design Lab #4 with MATLAB SimulinkJaewook. Kang
 
Juan josefumeroarray14
Juan josefumeroarray14Juan josefumeroarray14
Juan josefumeroarray14Juan Fumero
 
C++ amp on linux
C++ amp on linuxC++ amp on linux
C++ amp on linuxMiller Lee
 
Multi-threading your way out
Multi-threading your way outMulti-threading your way out
Multi-threading your way out.NET Crowd
 
190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pubJaewook. Kang
 
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in JavaRuntime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in JavaJuan Fumero
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareDaniel Blezek
 
What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++Microsoft
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
A Simple Communication System Design Lab #1 with MATLAB Simulink
A Simple Communication System Design Lab #1 with MATLAB Simulink A Simple Communication System Design Lab #1 with MATLAB Simulink
A Simple Communication System Design Lab #1 with MATLAB Simulink Jaewook. Kang
 
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...J On The Beach
 

What's hot (20)

Code GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory SubsystemCode GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory Subsystem
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Vc4c development of opencl compiler for videocore4
Vc4c  development of opencl compiler for videocore4Vc4c  development of opencl compiler for videocore4
Vc4c development of opencl compiler for videocore4
 
Enrichment lecture EE Technion (parts A&B) also including the subject of VHDL...
Enrichment lecture EE Technion (parts A&B) also including the subject of VHDL...Enrichment lecture EE Technion (parts A&B) also including the subject of VHDL...
Enrichment lecture EE Technion (parts A&B) also including the subject of VHDL...
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
C++ AMP 실천 및 적용 전략
C++ AMP 실천 및 적용 전략 C++ AMP 실천 및 적용 전략
C++ AMP 실천 및 적용 전략
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVE
 
A Simple Communication System Design Lab #4 with MATLAB Simulink
A Simple Communication System Design Lab #4 with MATLAB SimulinkA Simple Communication System Design Lab #4 with MATLAB Simulink
A Simple Communication System Design Lab #4 with MATLAB Simulink
 
Juan josefumeroarray14
Juan josefumeroarray14Juan josefumeroarray14
Juan josefumeroarray14
 
C++ amp on linux
C++ amp on linuxC++ amp on linux
C++ amp on linux
 
Multi-threading your way out
Multi-threading your way outMulti-threading your way out
Multi-threading your way out
 
190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub
 
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in JavaRuntime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics Hardware
 
What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Assembly language part I
Assembly language part IAssembly language part I
Assembly language part I
 
Yacf
YacfYacf
Yacf
 
A Simple Communication System Design Lab #1 with MATLAB Simulink
A Simple Communication System Design Lab #1 with MATLAB Simulink A Simple Communication System Design Lab #1 with MATLAB Simulink
A Simple Communication System Design Lab #1 with MATLAB Simulink
 
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
 

Similar to Compilation of COSMO for GPU using LLVM

Yoyak ScalaDays 2015
Yoyak ScalaDays 2015Yoyak ScalaDays 2015
Yoyak ScalaDays 2015ihji
 
Introduction to Polyhedral Compilation
Introduction to Polyhedral CompilationIntroduction to Polyhedral Compilation
Introduction to Polyhedral CompilationAkihiro Hayashi
 
Data Structure: Algorithm and analysis
Data Structure: Algorithm and analysisData Structure: Algorithm and analysis
Data Structure: Algorithm and analysisDr. Rajdeep Chatterjee
 
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014PyData
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...Andrey Karpov
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksDavid Gleich
 
Declare Your Language: Transformation by Strategic Term Rewriting
Declare Your Language: Transformation by Strategic Term RewritingDeclare Your Language: Transformation by Strategic Term Rewriting
Declare Your Language: Transformation by Strategic Term RewritingEelco Visser
 
10. Getting Spatial
10. Getting Spatial10. Getting Spatial
10. Getting SpatialFAO
 
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate CompilersFunctional Thursday
 
Haskellで学ぶ関数型言語
Haskellで学ぶ関数型言語Haskellで学ぶ関数型言語
Haskellで学ぶ関数型言語ikdysfm
 
EdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaEdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaLisa Hua
 
Sparse Matrix and Polynomial
Sparse Matrix and PolynomialSparse Matrix and Polynomial
Sparse Matrix and PolynomialAroosa Rajput
 
Introducción a Elixir
Introducción a ElixirIntroducción a Elixir
Introducción a ElixirSvet Ivantchev
 
C Code and the Art of Obfuscation
C Code and the Art of ObfuscationC Code and the Art of Obfuscation
C Code and the Art of Obfuscationguest9006ab
 
talk at Virginia Bioinformatics Institute, December 5, 2013
talk at Virginia Bioinformatics Institute, December 5, 2013talk at Virginia Bioinformatics Institute, December 5, 2013
talk at Virginia Bioinformatics Institute, December 5, 2013ericupnorth
 
Efficient Volume and Edge-Skeleton Computation for Polytopes Given by Oracles
Efficient Volume and Edge-Skeleton Computation for Polytopes Given by OraclesEfficient Volume and Edge-Skeleton Computation for Polytopes Given by Oracles
Efficient Volume and Edge-Skeleton Computation for Polytopes Given by OraclesVissarion Fisikopoulos
 

Similar to Compilation of COSMO for GPU using LLVM (20)

Yoyak ScalaDays 2015
Yoyak ScalaDays 2015Yoyak ScalaDays 2015
Yoyak ScalaDays 2015
 
Introduction to Polyhedral Compilation
Introduction to Polyhedral CompilationIntroduction to Polyhedral Compilation
Introduction to Polyhedral Compilation
 
Data Structure: Algorithm and analysis
Data Structure: Algorithm and analysisData Structure: Algorithm and analysis
Data Structure: Algorithm and analysis
 
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...
 
Function Approx2009
Function Approx2009Function Approx2009
Function Approx2009
 
Stop Monkeys Fall
Stop Monkeys FallStop Monkeys Fall
Stop Monkeys Fall
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networks
 
4th Semester Electronic and Communication Engineering (June/July-2015) Questi...
4th Semester Electronic and Communication Engineering (June/July-2015) Questi...4th Semester Electronic and Communication Engineering (June/July-2015) Questi...
4th Semester Electronic and Communication Engineering (June/July-2015) Questi...
 
Declare Your Language: Transformation by Strategic Term Rewriting
Declare Your Language: Transformation by Strategic Term RewritingDeclare Your Language: Transformation by Strategic Term Rewriting
Declare Your Language: Transformation by Strategic Term Rewriting
 
10. Getting Spatial
10. Getting Spatial10. Getting Spatial
10. Getting Spatial
 
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
 
Haskellで学ぶ関数型言語
Haskellで学ぶ関数型言語Haskellで学ぶ関数型言語
Haskellで学ぶ関数型言語
 
EdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaEdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for Java
 
Complexity.pdf
Complexity.pdfComplexity.pdf
Complexity.pdf
 
Sparse Matrix and Polynomial
Sparse Matrix and PolynomialSparse Matrix and Polynomial
Sparse Matrix and Polynomial
 
Introducción a Elixir
Introducción a ElixirIntroducción a Elixir
Introducción a Elixir
 
C Code and the Art of Obfuscation
C Code and the Art of ObfuscationC Code and the Art of Obfuscation
C Code and the Art of Obfuscation
 
talk at Virginia Bioinformatics Institute, December 5, 2013
talk at Virginia Bioinformatics Institute, December 5, 2013talk at Virginia Bioinformatics Institute, December 5, 2013
talk at Virginia Bioinformatics Institute, December 5, 2013
 
Efficient Volume and Edge-Skeleton Computation for Polytopes Given by Oracles
Efficient Volume and Edge-Skeleton Computation for Polytopes Given by OraclesEfficient Volume and Edge-Skeleton Computation for Polytopes Given by Oracles
Efficient Volume and Edge-Skeleton Computation for Polytopes Given by Oracles
 

More from Linaro

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloLinaro
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaLinaro
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraLinaro
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaLinaro
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018



Compilation of COSMO for GPU using LLVM

  • 1. spcl.inf.ethz.ch @spcl_eth
      Automatic Accelerator Compilation of the COSMO Physics Core
      Tobias Grosser, Siddharth Bhat, Torsten Hoefler
      December 2017
      Albert Cohen, Sven Verdoolaege, Oleksandre Zinenko (Polly Labs, ENS Paris)
      Johannes Doerfert (Uni. Saarbruecken)
      Roman Gereev (Ural Federal University)
      Hongin Zheng, Alexandre Isonard (Xilinx)
      Swiss Universities / PASC
      Qualcomm, ARM, Xilinx
      … many others
  • 3.
        row = 0;
        output_image_ptr = output_image;
        output_image_ptr += (NN * dead_rows);
        for (r = 0; r < NN - KK + 1; r++) {
          output_image_offset = output_image_ptr;
          output_image_offset += dead_cols;
          col = 0;
          for (c = 0; c < NN - KK + 1; c++) {
            input_image_ptr = input_image;
            input_image_ptr += (NN * row);
            kernel_ptr = kernel;
        S0: *output_image_offset = 0;
            for (i = 0; i < KK; i++) {
              input_image_offset = input_image_ptr;
              input_image_offset += col;
              kernel_offset = kernel_ptr;
              for (j = 0; j < KK; j++) {
        S1:     temp1 = *input_image_offset++;
        S1:     temp2 = *kernel_offset++;
        S1:     *output_image_offset += temp1 * temp2;
              }
              kernel_ptr += KK;
              input_image_ptr += NN;
            }
        S2: *output_image_offset = ((*output_image_offset) / normal_factor);
            output_image_offset++;
            col++;
          }
          output_image_ptr += NN;
          row++;
        }
        }
      [Figure: Sequential Software (Fortran, C/C++) vs. Parallel Hardware (Multi-Core & SIMD CPU, Accelerator with many GPUs); Development Time, Maintenance Cost, Performance Tuning]
  • 4. COSMO: Weather and Climate Model
      • 500,000 Lines of Fortran
      • 18,000 Loops
      • 19 Years of Knowledge
      • Used in Switzerland, Russia, Germany, Poland, Italy, Israel, Greece, Romania, …
  • 5. COSMO – Climate Modeling
      • Global (low-resolution model)
      • Up to 5000 nodes
      • Runs “monthly”
      Piz Daint, Lugano, Switzerland
  • 6. COSMO – Weather Forecast
      • Regional model
      • High-resolution
      • Runs “hourly” (20 instances in parallel)
      • Today: 40 Nodes * 8 GPU
      • Manual translation to GPUs
      3-Year, Multi-person Project
      Can we automate this GPU mapping?
  • 7. The LLVM Compiler
      Frontends:
      • Static Languages: C / C++, Fortran, Go / D / C# / …
      • Compute Languages: Julia (MatLab style)
      • Dynamic: JavaScript, Java
      Targets:
      • CPU: Intel / AMD, PowerPC, ARM / MIPS
      • GPU: NVIDIA, AMD / ARM
      • FPGA: Xilinx, Altera
      COSMO
  • 8. Polyhedral Model – In a nutshell
      Program Code:
        for (i = 0; i <= N; i++)
          for (j = 0; j <= i; j++)
            S(i,j);
      Iteration Space: D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }
      [Figure: iteration space for N = 4 plotted over the axes i and j (0 to 5), one point per statement instance, bounded by the constraints 0 ≤ i, i ≤ N = 4, 0 ≤ j, and j ≤ i]
      Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, Tobias Grosser, Armin Groesslinger, and Christian Lengauer, in Parallel Processing Letters (PPL), April 2012
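The iteration space on this slide can be checked by simply running the loop nest. The following is a minimal sketch, not part of the original deck; the function names `in_domain`, `count_domain_points`, and `domain_size_closed_form` are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Membership test for D = { (i, j) | 0 <= i <= n and 0 <= j <= i }. */
int in_domain(int i, int j, int n) {
    return 0 <= i && i <= n && 0 <= j && j <= i;
}

/* Count the statement instances S(i, j) by executing the loop nest
 * from the slide; the increment stands in for the statement S. */
long count_domain_points(int n) {
    assert(n >= 0); /* illustrative precondition, not from the slide */
    long count = 0;
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= i; j++)
            count++;
    return count;
}

/* Closed form for the size of the same triangle: (n + 1)(n + 2) / 2. */
long domain_size_closed_form(int n) {
    return ((long)n + 1) * ((long)n + 2) / 2;
}
```

For N = 4 both functions give 15 points, which matches the triangle drawn on the slide.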
  • 9. Static Control Parts - SCoPs
      • Structured Control
      • IF-conditions
      • Counted FOR-loops (Fortran style)
      • Multi-dimensional array accesses (and scalars)
      • Loop-conditions and IF-conditions are Presburger formulas
      • Loop increments are constant (non-parametric)
      • Array subscript expressions are piecewise-affine
      • Can be modeled precisely with Presburger Sets
  • 10. Polyhedral Model of Static Control Part
        for (i = 0; i <= N; i++)
          for (j = 0; j <= i; j++)
            S: B[i][j] = A[i][j] + A[i][j+1];
      • Iteration Space (Domain): $I_S = \{ S(i,j) \mid 0 \le i \le N \land 0 \le j \le i \}$
      • Schedule: $\theta_S = \{ S(i,j) \to (i,j) \}$
      • Access Relation
        • Reads: $\{ S(i,j) \to A(i,j);\ S(i,j) \to A(i,j+1) \}$
        • Writes: $\{ S(i,j) \to B(i,j) \}$
  • 11. Polyhedral Schedule: Original
      Model:
        $I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
        $\theta_S = \{ S(i,j) \to (i,j) \}$
      Code:
        for (i = 0; i <= n; i++)
          for (j = 0; j <= i; j++)
            S(i, j);
  • 12. Polyhedral Schedule: Original
      Model:
        $I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
        $\theta_S = \{ S(i,j) \to (i,j) \}$
      Code:
        for (c0 = 0; c0 <= n; c0++)
          for (c1 = 0; c1 <= c0; c1++)
            S(c0, c1);
  • 13. Polyhedral Schedule: Interchanged
      Model:
        $I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
        $\theta_S = \{ S(i,j) \to (j,i) \}$
      Code:
        for (c0 = 0; c0 <= n; c0++)
          for (c1 = c0; c1 <= n; c1++)
            S(c1, c0);
  • 14. Polyhedral Schedule: Strip-mined
      Model:
        $I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
        $\theta_S = \{ S(i,j) \to (\lfloor i/4 \rfloor, j, i \bmod 4) \}$
      Code:
        for (c0 = 0; c0 <= floord(n, 4); c0++)
          for (c1 = 0; c1 <= min(n, 4 * c0 + 3); c1++)
            for (c2 = max(0, -4 * c0 + c1); c2 <= min(3, n - 4 * c0); c2++)
              S(4 * c0 + c2, c1);
  • 15. Polyhedral Schedule: Blocked
      Model:
        $I_S = \{ S(i,j) \mid 0 \le i \le n \land 0 \le j \le i \}$
        $\theta_S = \{ S(i,j) \to (\lfloor i/4 \rfloor, \lfloor j/4 \rfloor, i \bmod 4, j \bmod 4) \}$
      Code:
        for (c0 = 0; c0 <= floord(n, 4); c0++)
          for (c1 = 0; c1 <= c0; c1++)
            for (c2 = 0; c2 <= min(3, n - 4 * c0); c2++)
              for (c3 = 0; c3 <= min(3, 4 * c0 - 4 * c1 + c2); c3++)
                S(4 * c0 + c2, 4 * c1 + c3);
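A schedule only reorders statement instances, so each of the four loop nests above should visit every point of the iteration space exactly once. The following is a minimal sketch of such a check, assuming a small upper bound `MAXN`; the helper names (`visit_grid`, `schedule_is_valid`, `run_original`, and so on) are hypothetical, and `floord`, `min_i`, and `max_i` are written out here because the generated code on the slides only uses them by name.

```c
#include <assert.h>

#define MAXN 32 /* assumed upper bound on n for this sketch */

typedef struct { int hits[MAXN + 1][MAXN + 1]; } visit_grid;

static int min_i(int a, int b) { return a < b ? a : b; }
static int max_i(int a, int b) { return a > b ? a : b; }

/* Floor division for a positive divisor. */
static int floord(int n, int d) {
    assert(d > 0);
    return n >= 0 ? n / d : -((-n + d - 1) / d);
}

/* Stands in for the statement S(i, j): record one execution. */
static void visit(visit_grid *g, int i, int j) { g->hits[i][j]++; }

void run_original(visit_grid *g, int n) {
    for (int c0 = 0; c0 <= n; c0++)
        for (int c1 = 0; c1 <= c0; c1++)
            visit(g, c0, c1);
}

void run_interchanged(visit_grid *g, int n) {
    for (int c0 = 0; c0 <= n; c0++)
        for (int c1 = c0; c1 <= n; c1++)
            visit(g, c1, c0);
}

void run_strip_mined(visit_grid *g, int n) {
    for (int c0 = 0; c0 <= floord(n, 4); c0++)
        for (int c1 = 0; c1 <= min_i(n, 4 * c0 + 3); c1++)
            for (int c2 = max_i(0, -4 * c0 + c1); c2 <= min_i(3, n - 4 * c0); c2++)
                visit(g, 4 * c0 + c2, c1);
}

void run_blocked(visit_grid *g, int n) {
    for (int c0 = 0; c0 <= floord(n, 4); c0++)
        for (int c1 = 0; c1 <= c0; c1++)
            for (int c2 = 0; c2 <= min_i(3, n - 4 * c0); c2++)
                for (int c3 = 0; c3 <= min_i(3, 4 * c0 - 4 * c1 + c2); c3++)
                    visit(g, 4 * c0 + c2, 4 * c1 + c3);
}

typedef void (*schedule_fn)(visit_grid *, int);

/* Returns 1 if the loop nest executes every point with 0 <= j <= i <= n
 * exactly once and touches nothing outside that domain, else 0. */
int schedule_is_valid(schedule_fn run, int n) {
    assert(n >= 0 && n <= MAXN);
    visit_grid g = {0};
    run(&g, n);
    for (int i = 0; i <= MAXN; i++)
        for (int j = 0; j <= MAXN; j++) {
            int expected = (i <= n && j <= i) ? 1 : 0;
            if (g.hits[i][j] != expected)
                return 0;
        }
    return 1;
}
```

Calling `schedule_is_valid(run_blocked, 4)` exercises the blocked loop nest on the N = 4 triangle. A brute-force check like this is only practical for small n; it illustrates what the polyhedral model guarantees symbolically.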
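  • 16. Mapping Computation to Device
      [Figure: the iteration space over i and j mapped onto device blocks (indices 0 to 1 in each dimension) and threads (indices 0 to 3 along i, 0 to 2 along j)]
      $BID = \{ (i,j) \to (\lfloor i/4 \rfloor \,\%\, 2,\ \lfloor j/3 \rfloor \,\%\, 2) \}$
      $TID = \{ (i,j) \to (i \,\%\, 4,\ j \,\%\, 3) \}$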
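The two maps on this slide can be written out as plain integer arithmetic. This is a minimal sketch assuming non-negative `i` and `j`; the type `id2` and the functions `block_id` and `thread_id` are hypothetical names, not Polly-ACC APIs.

```c
#include <assert.h>

typedef struct { int x, y; } id2;

/* BID = { (i, j) -> (floor(i / 4) % 2, floor(j / 3) % 2) }:
 * a 2 x 2 grid of blocks, reused cyclically across the iteration space. */
id2 block_id(int i, int j) {
    assert(i >= 0 && j >= 0); /* integer division equals floor only here */
    id2 b = { (i / 4) % 2, (j / 3) % 2 };
    return b;
}

/* TID = { (i, j) -> (i % 4, j % 3) }: 4 x 3 threads inside each block. */
id2 thread_id(int i, int j) {
    assert(i >= 0 && j >= 0);
    id2 t = { i % 4, j % 3 };
    return t;
}
```

Because of the outer `% 2`, the points (1, 1) and (9, 7) land on the same block and thread, so each thread executes several points of the iteration space in sequence.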
  • 17. Polly-ACC: Architecture
      Polly-ACC: Transparent Compilation to Heterogeneous Hardware, Tobias Grosser, Torsten Hoefler, at International Conference on Supercomputing (ICS), June 2016, Istanbul
  • 18. Kernels to Programs – Data Transfers
        void heat(int n, float A[n], float hot, float cold) {
          float B[n] = {0};
          for (int i = 0; i < n; i++)
            A[i] = cold;
          setCenter(n, A, hot, n/4);
          for (int t = 0; t < T; t++) {
            average(n, A, B);
            average(n, B, A);
            printf("Iteration %d done", t);
          }
        }
      [Figure annotations: OpenCL Kernel; Host code with unknown side effects; CUDA GPU]
  • 19. Data Transfer – Per Kernel
      [Figure: timeline of initialize(), setCenter(), and three average() calls between Host Memory and Device Memory; transfers shown along the time axis: D → H, D → H, H → D, D → H, H → D, D → H, H → D, D → H]
  • 20. Data Transfer – Inter Kernel Caching
      [Figure: timeline of initialize(), setCenter(), and three average() calls between Host Memory and Device Memory; only one H → D and one D → H transfer remain]
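The difference between the two transfer strategies can be illustrated by counting copies in a toy model. The sketch below is a deliberate simplification and not the Polly-ACC runtime: it tracks a single array, assumes every kernel both reads and writes it, and all names (`tracked_array`, `launch_kernel_per_kernel`, `launch_kernel_cached`, `host_access_cached`) are hypothetical.

```c
typedef enum { ON_HOST, ON_DEVICE } location;

typedef struct {
    location valid_at;  /* where the up-to-date copy currently lives */
    int host_to_device; /* number of H -> D copies issued so far */
    int device_to_host; /* number of D -> H copies issued so far */
} tracked_array;

tracked_array tracked_array_init(void) {
    tracked_array a = { ON_HOST, 0, 0 };
    return a;
}

/* Per-kernel policy: copy in before and copy out after every launch. */
void launch_kernel_per_kernel(tracked_array *a) {
    a->host_to_device++;
    /* the kernel would run on the device here */
    a->device_to_host++;
    a->valid_at = ON_HOST;
}

/* Caching policy: copy in only if the device copy is stale,
 * and leave the result on the device after the launch. */
void launch_kernel_cached(tracked_array *a) {
    if (a->valid_at == ON_HOST)
        a->host_to_device++;
    a->valid_at = ON_DEVICE;
}

/* Host access under the caching policy: copy back only if needed. */
void host_access_cached(tracked_array *a) {
    if (a->valid_at == ON_DEVICE) {
        a->device_to_host++;
        a->valid_at = ON_HOST;
    }
}

int total_transfers(const tracked_array *a) {
    return a->host_to_device + a->device_to_host;
}
```

In this model, three back-to-back kernels cost six copies under the per-kernel policy and two copies under the caching policy, provided the host touches the data only after the last kernel.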
  • 21. Evaluation – Polly ACC
      Workstation: 10 core SandyBridge, NVIDIA Titan Black (Kepler)
  • 22. Lattice Boltzmann (SPEC 2006)
      [Figure: bar chart of run time on the workstation (axis from 0:00 to 4:19) for icc, icc -openmp, clang, and Polly ACC]
      • 2x speedup vs. multi-thread CPU
      • 4x speedup vs. single-thread CPU
  • 24. Statistics - COSMO (Siddharth Bhat)
      • Number of Loops
        • 18,093 Total
        • 9,760 Static Control Loops (Modeled precisely by Polly)
        • 15,245 Non-Affine Memory Accesses (Approximated by Polly)
        • 11,154 Loops after precise modeling, fewer e.g. due to infeasible assumptions taken, or modeling timeouts
      • Largest set of loops: 72 loops
      • Reasons why loops cannot be modeled
        • Function calls with side-effects
        • Uncomputable loop bounds (data-dependent loop bounds?)
  • 25. Radiation Computation in COSMO (call graph)
      init_radition
      organize_radition
      fesft
        → opt_th
        → opt_so
        → inv_th
            → coe_th
        → inv_so
            → coe_so
      Hot Functions: Must be inlined and interchanged
      Compute kernels in all functions
  • 26. Interprocedural Loop Interchange for GPU Execution (inv_th)
        #ifdef _OPENACC
        !$acc parallel
        !$acc loop gang vector
        DO j1 = ki1sc, ki1ec
          CALL coe_th_gpu(pduh2oc(j1, ki3sc), pduh2of(j1, ki3sc), pduco2(j1, ki3sc), pduo3(j1, ki3sc), …, pa2f(j1), pa3c(j1), pa3f(j1))
        ENDDO
        !$acc end parallel
        #else
        CALL coe_th(pduh2oc, pduh2of, pduco2, pduo3, palogp, palogt, podsc, podsf, podac, podaf, …, pa3c, pa3f)
        #endif
      Pulled out parallel loop for OpenACC Annotations
  • 27. Optical Effect on Solar Layer (inv_th)
        DO j3 = ki3sc+1, ki3ec
          CALL coe_th (j3) {
            ! Determine effect of the layer in *coe_th*
            ! Optical depth of gases
            DO j1 = ki1sc, ki1ec
              …
              IF (kco2 /= 0) THEN
                zodgf = zodgf + pduco2(j1 ,j3)* (cobi(kco2,kspec,2)* EXP ( coali(kco2,kspec,2) * palogp(j1 ,j3) -cobti(kco2,kspec,2) * palogt(j1 ,j3)))
              ENDIF
              …
              zeps=SQRT(zodgf*zodgf)
              …
            ENDDO
          }
          DO j1 = ki1sc, ki1ec ! Set RHS
            …
          ENDDO
          DO j1 = ki1sc, ki1ec ! Elimination and storage of utility variables
            …
          ENDDO
        ENDDO ! End of vertical loop over layers
      Annotations: the outer loop (j3) is sequential because of sequential dependences; each inner loop (j1) is parallel.
  • 28. Optical Effect on Solar Layer – After interchange
        !> Turn loop structure with multiple ip loops inside a
        !> single k loop into perfectly nested k-ip loop on GPU.
        #ifdef _OPENACC
        !$acc parallel
        !$acc loop gang vector
        DO j1 = ki1sc, ki1ec
          !$acc loop seq
          DO j3 = ki3sc+1, ki3ec ! Loop over vertical
            ! Determine effects of layer in *coe_so*
            CALL coe_so_gpu(pduh2oc (j1,j3) , pduh2of (j1,j3) , …, pa4c(j1), pa4f(j1), pa5c(j1), pa5f(j1))
            ! Elimination
            …
            ztd1 = 1.0_dp/(1.0_dp-pa5f(j1)*(pca2(j1,j3)*ztu6(j1,j3-1)+pcc2(j1,j3)*ztu8(j1,j3-1)))
            ztu9(j1,j3) = pa5c(j1)*pcd1(j1,j3)+ztd6*ztu3(j1,j3) + ztd7*ztu5(j1,j3)
          ENDDO
        END DO ! Vertical loop
        !$acc end parallel
      Annotations: the inner loop (j3) is sequential; the outer loop (j1) is parallel.
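The interchange on this slide is legal because the dependence runs only along the vertical index j3, never between columns j1. The following C sketch mirrors that structure with a made-up recurrence; it is not COSMO code, and the names `sweep_level_outer`, `sweep_column_outer`, `NCOLS`, and `NLEVS` are assumptions for illustration.

```c
#define NCOLS 8 /* horizontal points, plays the role of j1 */
#define NLEVS 5 /* vertical levels, plays the role of j3 */

/* Original order: sequential loop over levels outside,
 * parallel loop over columns inside. */
void sweep_level_outer(double u[NCOLS][NLEVS], const double a[NCOLS]) {
    for (int j3 = 1; j3 < NLEVS; j3++)
        for (int j1 = 0; j1 < NCOLS; j1++)
            u[j1][j3] = a[j1] * u[j1][j3 - 1] + 1.0;
}

/* Interchanged order: parallel loop over columns outside,
 * sequential loop over levels inside (one GPU thread per column). */
void sweep_column_outer(double u[NCOLS][NLEVS], const double a[NCOLS]) {
    for (int j1 = 0; j1 < NCOLS; j1++)
        for (int j3 = 1; j3 < NLEVS; j3++)
            u[j1][j3] = a[j1] * u[j1][j3 - 1] + 1.0;
}

/* Returns 1 if both loop orders produce identical values, else 0. */
int interchange_preserves_result(void) {
    double a[NCOLS], u1[NCOLS][NLEVS], u2[NCOLS][NLEVS];
    for (int j1 = 0; j1 < NCOLS; j1++) {
        a[j1] = 0.5 + 0.125 * j1;
        for (int j3 = 0; j3 < NLEVS; j3++)
            u1[j1][j3] = u2[j1][j3] = (j3 == 0) ? 1.0 + j1 : 0.0;
    }
    sweep_level_outer(u1, a);
    sweep_column_outer(u2, a);
    for (int j1 = 0; j1 < NCOLS; j1++)
        for (int j3 = 0; j3 < NLEVS; j3++)
            if (u1[j1][j3] != u2[j1][j3])
                return 0;
    return 1;
}
```

Each element is computed by the same operations in the same order in both versions, so the comparison can use exact equality.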
  • 29. Live Range Reordering (IMPACT’16, Verdoolaege et al.)
      [Figure: loop nest before (sequential outer, parallel inner) and after interchange (parallel outer, sequential inner)]
      • Privatization needed for parallel execution
      • False dependences prevent interchange
      • Scalable Scheduling
  • 30. Polly-ACC: Architecture
      Polly-ACC: Transparent Compilation to Heterogeneous Hardware, Tobias Grosser, Torsten Hoefler, at International Conference on Supercomputing (ICS), June 2016, Istanbul
  • 31. Polly-ACC: Architecture
      Polly-ACC: Transparent Compilation to Heterogeneous Hardware, Tobias Grosser, Torsten Hoefler, at International Conference on Supercomputing (ICS), June 2016, Istanbul
      • Intrinsics to model multi-dimensional strided arrays
      • Better ways to link with NVIDIA libdevice
      • Scalable Modeling
      • Scalable Scheduling
      • Unified Memory
      • OpenCL + SPIR-V Backend
  • 32. Memory on CPU + GPU Hybrid Machine
      [Figure: System DRAM and System GDDR5; Automatic]
  • 33. Performance
      [Figure: bar chart for COSMO on a logarithmic axis (1, 10, 100, 1000) comparing Dragonegg + LLVM (CPU only), Cray (CPU only), Polly-ACC (P100), and Manual OpenACC (P100); annotated speedups: 4.3x, 5x, 22x]
      All important loop transformations performed
      Headroom:
      - Kernel compilation (1.5s)
      - Register usage (2x)
      - Block-size tuning
      - Unified-memory overhead?
  • 34. Per-Kernel Performance
        %        Total      Calls   Time-per-call   Kernel
        29.98%   1.22414s   939     1.0221ms        FUNC___radiation_rg_MOD_inv_th_SCOP_0_KERNEL_0
        19.19%   783.48ms   580     1.1786ms        FUNC___radiation_rg_MOD_inv_so_SCOP_0_KERNEL_1
        8.50%    347.10ms   140     146.62us        FUNC___radiation_rg_MOD_fesft_dp_SCOP_11_KERNEL_0
        ... ~ 50 more
      • Per-kernel time is short
      • Many small kernels
      • Still way longer than OpenACC kernels
  • 35. Correct Types for Loop Transformations (Maximilian Falkenstein)
        for (int32 i = 1; i < N; i++)
          for (int32 j = 1; j <= M; j++)
            A(i,j) = A(i-1,j) + A(i,j-1)
      [Figure: iteration space over the axes i and j]
  • 36. Correct Types for Loop Transformations (Maximilian Falkenstein)
      Original:
        for (int32 i = 1; i < N; i++)
          for (int32 j = 1; j <= M; j++)
            A(i,j) = A(i-1,j) + A(i,j-1)
      Skewed (c = i + j):
        for (intX c = 2; c < N+M; c++)
          #pragma simd
          for (intX i = max(1, c-M); i <= min(N-1, c-1); i++)
            A(i,c-i) = A(i-1,c-i) + A(i,c-i-1)
      [Figure: iteration space over the axes i and j, with wavefronts along i + j]
  • 37. Correct Types for Loop Transformations (Maximilian Falkenstein)
      Original:
        for (int32 i = 1; i < N; i++)
          for (int32 j = 1; j <= M; j++)
            A(i,j) = A(i-1,j) + A(i,j-1)
      Skewed (c = i + j):
        for (intX c = 2; c < N+M; c++)
          #pragma simd
          for (intX i = max(1, c-M); i <= min(N-1, c-1); i++)
            A(i,c-i) = A(i-1,c-i) + A(i,c-i-1)
      [Figure: iteration space over the axes i and j, with wavefronts along i + j]
      What is X? N + M can be larger than 32 bit.
      TODAY
      • Use 64-bit
      • Hope it’s enough
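The skewed loop nest and the type question can both be tried out on a small array. The sketch below follows today's approach from the slide and uses a 64-bit iterator for `c`; the array sizes `N_ROWS` and `M_COLS` and all function names are assumptions made for illustration, and the array elements are integers so that both versions can be compared exactly.

```c
#include <stdint.h>

#define N_ROWS 6 /* plays the role of N */
#define M_COLS 5 /* plays the role of M */

static int64_t min64(int64_t a, int64_t b) { return a < b ? a : b; }
static int64_t max64(int64_t a, int64_t b) { return a > b ? a : b; }

/* Original nest: i in [1, N), j in [1, M]. */
void run_original_nest(int64_t A[N_ROWS][M_COLS + 1]) {
    for (int32_t i = 1; i < N_ROWS; i++)
        for (int32_t j = 1; j <= M_COLS; j++)
            A[i][j] = A[i - 1][j] + A[i][j - 1];
}

/* Skewed nest: c = i + j walks the anti-diagonals; all points on one
 * diagonal depend only on the previous diagonal. */
void run_skewed_nest(int64_t A[N_ROWS][M_COLS + 1]) {
    for (int64_t c = 2; c < (int64_t)N_ROWS + M_COLS; c++)
        for (int64_t i = max64(1, c - M_COLS); i <= min64(N_ROWS - 1, c - 1); i++)
            A[i][c - i] = A[i - 1][c - i] + A[i][c - i - 1];
}

/* Returns 1 if both nests compute the same array, else 0. */
int skewing_preserves_result(void) {
    int64_t a[N_ROWS][M_COLS + 1], b[N_ROWS][M_COLS + 1];
    for (int i = 0; i < N_ROWS; i++)
        for (int j = 0; j <= M_COLS; j++)
            a[i][j] = b[i][j] = (i == 0 || j == 0) ? i + j + 1 : 0;
    run_original_nest(a);
    run_skewed_nest(b);
    for (int i = 0; i < N_ROWS; i++)
        for (int j = 0; j <= M_COLS; j++)
            if (a[i][j] != b[i][j])
                return 0;
    return 1;
}

/* Returns 1 if n + m does not fit into a signed 32-bit integer. */
int sum_needs_more_than_32_bits(int32_t n, int32_t m) {
    int64_t sum = (int64_t)n + (int64_t)m;
    return sum > INT32_MAX || sum < INT32_MIN;
}
```

The last helper shows why `intX` cannot simply be `int32`: the loop bound N + M is computed from two 32-bit values and can fall outside the 32-bit range even though every original iterator fits.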
  • 38. What type would be optimal?
        Platform                    Type
        Server or Workstation CPU   64-bit
        Embedded or HPC GPU         32-bit
        Embedded CPU                32/16 bit
        FPGA                        minimal
      [Figure labels: Today, Tomorrow, COSMO]
      Can we always get 32 bit types?
  • 39. Precise Solution
        for (intX c = 2; c < N+M; c++)
          #pragma simd
          for (intX i = max(1, c-M); i <= min(N-1, c-1); i++)
            A(i, c-i) = A(i-1, c-i) + A(i, c-i-1)
      [Figure: expression tree for the subscript c - i - 1]
      Domain: { (c) : 2 <= c < N + M ∧ INT_MIN <= N, M <= INT_MAX }
      f0() = c - i
      f1() = c - i - 1
      1) calc: min(fX()), max(fX()) under Domain
      2) choose type accordingly
  • 40. ILP Solver
      • Minimal Types
      • Potentially Costly
      Approximations*
      • s(a+b) ≤ max(s(a), s(b)) + 1
      • Good, if smaller than native type
      * Earlier uses in GCC and Polly
      Preconditions
      • Assume values fit into 32 bit
      • Derive required pre-conditions
      [Figure: expression tree for c - i + 1 style subscripts with operators +, - and operands c, i, 1]
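The approximation rule s(a+b) ≤ max(s(a), s(b)) + 1 can be compared against the exact bit width of a value range. This is a minimal sketch, reading s(x) as the number of bits of a two's-complement type that holds x; that reading and the function names `signed_bits_for_range` and `approx_bits_for_sum` are assumptions, not definitions from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest two's-complement width (in bits, at most 64) whose range
 * contains every value in [lo, hi]. */
int signed_bits_for_range(int64_t lo, int64_t hi) {
    assert(lo <= hi);
    for (int bits = 1; bits < 64; bits++) {
        int64_t min_v = -((int64_t)1 << (bits - 1));
        int64_t max_v = ((int64_t)1 << (bits - 1)) - 1;
        if (lo >= min_v && hi <= max_v)
            return bits;
    }
    return 64;
}

/* Approximation from the slide: s(a + b) <= max(s(a), s(b)) + 1. */
int approx_bits_for_sum(int bits_a, int bits_b) {
    return (bits_a > bits_b ? bits_a : bits_b) + 1;
}
```

For two arbitrary 32-bit values the sum ranges over [2 * INT32_MIN, 2 * INT32_MAX], which needs exactly 33 bits, and the approximation also yields 33. When the operands are known to be small, the exact range can be much narrower than the approximation, which is where the precise solver pays off.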
  • 41. Type Distribution for LNT SCOPS
      32 + epsilon is almost always enough!
  • 42. Compile Time Overhead
      [Figure: bar chart (axis 0 to 30) of GPU Code Generation (5000 lines of code) for No Types, Solver, Solver + Approx, and Solver + Approx (8 bit)]
      Less than 10% overhead vs. no types.
  • 43. Automatic Compilation to FPGA?
      • Automatic Translation: Floating Point → Fixed Point
      • COSMO mostly floating point (single precision / double precision)
      • SPIR-V flow to Xilinx HLS Tools
      • Translate LLVM-IR directly to Verilog
      • How to get cache coherence
      • Visited Xilinx: Could share some of their software toolchain for Enzian
      • How to schedule kernels
      • Partial reconfiguration
      • Data Caching
      • Can we keep data in b-ram
      • Kernel Size
      • Can we reduce the size of all kernels?
      • …
  • 44. Conclusion
      • Optimal & Correct Types
      • Automatic Unified Memory Transfers
      • Complex Loop Transformations
      • Hybrid Mapping