By Tobias Grosser, Scalable Parallel Computing Laboratory
The COSMO climate and weather model delivers daily forecasts for Switzerland and many other nations. As a traditional HPC application it was developed with SIMD CPUs in mind, and a large manual effort was required to enable the 2016 move to GPU acceleration. Because today's high-performance computer systems increasingly rely on accelerators to reach peak performance, and manual translation to accelerators is both costly and difficult to maintain, we propose a fully automatic accelerator compiler that translates scientific Fortran codes to CUDA-accelerated GPU systems. Several challenges had to be overcome to make this a reality: 1) improved scalability, 2) automatic data placement using unified memory, 3) loop rescheduling to expose coarse-grained parallelism, 4) inter-procedural loop optimization, and 5) plenty of performance tuning. Our evaluation shows that end-to-end automatic accelerator compilation is possible for non-trivial portions of the COSMO climate model, despite the lack of complete static information. Non-trivial loop optimizations that were previously implemented manually are now performed fully automatically, and memory management happens transparently through unified memory. Our preliminary results show notable performance improvements over sequential CPU code (execution time reduced from 40s to 8s), and we are working on closing the remaining gap to hand-tuned GPU code. This talk is a status update on our most recent efforts and is also intended to gather feedback on future research plans towards automatically mapping COSMO to FPGAs.
Tobias Grosser Bio
Tobias Grosser is a senior researcher in the Scalable Parallel Computing Laboratory (SPCL) of Torsten Hoefler at the Computer Science Department of ETH Zürich. Supported by a Google PhD Fellowship, he received his doctoral degree from Université Pierre et Marie Curie under the supervision of Albert Cohen. Tobias' research takes place at the border of low-level compilers and high-level program transformations, with the goal of enabling complex - but highly beneficial - program transformations in a production compiler environment. He develops the Polly loop optimizer, a loop transformation framework that is today a community project supported through the Polly Labs research laboratory. Tobias also developed advanced tiling schemes for the efficient execution of iterated stencils. Today Tobias leads the heterogeneous compute efforts in the Swiss Universities-funded ComPASC project and is about to start a three-year SNSF Ambizione project at ETH Zürich on advancing automatic compilation and heterogenization techniques.
For more info on the Linaro High Performance Computing (HPC) SIG, visit https://www.linaro.org/sig/hpc/
Compilation of COSMO for GPU using LLVM
Slide 1
Automatic Accelerator Compilation of the COSMO Physics Core
Tobias Grosser, Siddharth Bhat, Torsten Hoefler
December 2017
Albert Cohen, Sven Verdoolaege, Oleksandr Zinenko
Polly Labs, ENS Paris
Johannes Doerfert
Uni. Saarbruecken
Roman Gereev,
Ural Federal University
Hongbin Zheng, Alexandre Isoard
Xilinx
Swiss Universities / PASC
Qualcomm, ARM, Xilinx
… many others
Slide 3
/* Body of a 2D convolution routine (KK x KK kernel over an NN x NN image);
   S0, S1, S2 mark the polyhedral statements. */
row = 0;
output_image_ptr = output_image;
output_image_ptr += (NN * dead_rows);
for (r = 0; r < NN - KK + 1; r++) {
  output_image_offset = output_image_ptr;
  output_image_offset += dead_cols;
  col = 0;
  for (c = 0; c < NN - KK + 1; c++) {
    input_image_ptr = input_image;
    input_image_ptr += (NN * row);
    kernel_ptr = kernel;
    /* S0 */ *output_image_offset = 0;
    for (i = 0; i < KK; i++) {
      input_image_offset = input_image_ptr;
      input_image_offset += col;
      kernel_offset = kernel_ptr;
      for (j = 0; j < KK; j++) {
        /* S1 */ temp1 = *input_image_offset++;
        /* S1 */ temp2 = *kernel_offset++;
        /* S1 */ *output_image_offset += temp1 * temp2;
      }
      kernel_ptr += KK;
      input_image_ptr += NN;
    }
    /* S2 */ *output_image_offset = ((*output_image_offset) / normal_factor);
    output_image_offset++;
    col++;
  }
  output_image_ptr += NN;
  row++;
}
} /* end of enclosing function (opening not shown on the slide) */
[Figure: sequential software (Fortran, C/C++) written for multi-core & SIMD CPUs must be mapped onto parallel accelerator hardware (many GPUs); the concerns are development time, maintenance cost, and performance tuning]
Slide 4
COSMO: Weather and Climate Model
• 500,000 lines of Fortran
• 18,000 loops
• 19 years of knowledge
• Used in Switzerland, Russia, Germany, Poland, Italy, Israel, Greece, Romania, …
Slide 6
COSMO – Weather Forecast
• Regional model
• High-resolution
• Runs "hourly" (20 instances in parallel)
• Today: 40 nodes × 8 GPUs
• Manual translation to GPUs: a 3-year, multi-person project
Can we automate this GPU mapping?
Slide 7
The LLVM Compiler
Frontends:
• Static languages: C / C++, Fortran, Go / D / C# / …
• Compute languages: Julia (MATLAB-style)
• Dynamic languages: JavaScript, Java
• COSMO (via the Fortran frontend)
Targets:
• CPU: Intel / AMD, PowerPC, ARM / MIPS
• GPU: NVIDIA, AMD / ARM
• FPGA: Xilinx, Altera
Slide 8
Polyhedral Model – In a Nutshell
Program code:
for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i, j);
Iteration space (illustrated for N = 4): the constraints 0 ≤ i, i ≤ N, 0 ≤ j, and j ≤ i define the domain
D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }
[Figure: triangular set of iteration points (i, j), from (0, 0) up to (4, 4), bounded by j ≤ i and i ≤ N = 4]
Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation.
Tobias Grosser, Armin Groesslinger, and Christian Lengauer. Parallel Processing Letters (PPL), April 2012.
Slide 9
Static Control Parts - SCoPs
• Structured control: IF-conditions and counted FOR-loops (Fortran style)
• Multi-dimensional array accesses (and scalars)
• Loop conditions and IF-conditions are Presburger formulas
• Loop increments are constant (non-parametric)
• Array subscript expressions are piecewise affine
→ Can be modeled precisely with Presburger sets
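To make "modeled precisely with Presburger sets" concrete, here is a minimal sketch (not from the slides) that builds the iteration domain of the triangular loop nest from the previous slide using the isl library, the Presburger-set library Polly builds on. It assumes isl is installed and linked (e.g. with -lisl).

#include <isl/ctx.h>
#include <isl/set.h>

int main(void) {
  isl_ctx *ctx = isl_ctx_alloc();

  /* Iteration domain of:
   *   for (i = 0; i <= N; i++)
   *     for (j = 0; j <= i; j++)
   *       S(i, j);
   * expressed as a parametric Presburger set. */
  isl_set *domain = isl_set_read_from_str(ctx,
      "[N] -> { S[i, j] : 0 <= i <= N and 0 <= j <= i }");

  /* Print the normalized set (to stderr). */
  isl_set_dump(domain);

  isl_set_free(domain);
  isl_ctx_free(ctx);
  return 0;
}

Loop bounds, conditions, and subscripts of a SCoP all become sets and maps of this form, which the compiler can then analyze and transform exactly.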
Slide 18
Kernels to Programs – Data Transfers
/* T, setCenter(), and average() are defined elsewhere. */
void heat(int n, float A[n], float hot, float cold) {
  float B[n];                          /* scratch array (a VLA cannot carry an initializer) */
  for (int i = 0; i < n; i++) {        /* affine loop: becomes a GPU kernel */
    A[i] = cold;
    B[i] = 0;
  }
  setCenter(n, A, hot, n/4);           /* host code with unknown side effects */
  for (int t = 0; t < T; t++) {
    average(n, A, B);                  /* affine stencil: becomes a CUDA GPU kernel */
    average(n, B, A);                  /* affine stencil: becomes a CUDA GPU kernel */
    printf("Iteration %d done\n", t);  /* host code with unknown side effects */
  }
}
GPU kernels are interleaved with host code whose side effects are unknown to the compiler, so data must be kept consistent between host and device.
Slide 19
Data Transfer – Per Kernel
[Figure: timeline of host and device memory over initialize(), setCenter(), average(), average(), average(). Each generated kernel is bracketed by transfers (H → D before, D → H after); host-side calls such as setCenter() operate on host memory in between, so data is copied back after every kernel.]
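As an illustration of this per-kernel transfer strategy (not actual compiler output), the generated host code corresponds roughly to the following C, using the CUDA runtime API; launch_average_kernel is a hypothetical stand-in for a generated kernel launcher.

#include <cuda_runtime.h>

/* Hypothetical launcher for a generated GPU kernel computing B = average(A). */
void launch_average_kernel(int n, const float *A_dev, float *B_dev);

void average_on_gpu(int n, float *A_host, float *B_host) {
  float *A_dev, *B_dev;
  cudaMalloc((void **)&A_dev, n * sizeof(float));
  cudaMalloc((void **)&B_dev, n * sizeof(float));

  /* H -> D: copy inputs before the kernel ... */
  cudaMemcpy(A_dev, A_host, n * sizeof(float), cudaMemcpyHostToDevice);

  launch_average_kernel(n, A_dev, B_dev);

  /* ... D -> H: copy results back, because the next host statement
     (e.g. printf or setCenter) may inspect or modify the data. */
  cudaMemcpy(B_host, B_dev, n * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(A_dev);
  cudaFree(B_dev);
}

Doing this around every kernel is correct but expensive, which is what motivates the unified-memory approach discussed later.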
Slide 24
Statistics - COSMO
Number of loops:
• 18,093 loops in total
• 9,760 static control loops (modeled precisely by Polly)
• 15,245 non-affine memory accesses (approximated by Polly)
• 11,154 loops after precise modeling; fewer e.g. due to infeasible assumptions taken, or modeling timeouts
• Largest set of loops: 72 loops
Reasons why loops cannot be modeled:
• Function calls with side effects
• Uncomputable loop bounds (data-dependent loop bounds?)
Siddharth Bhat
Slide 27
Optical Effect on Solar Layer (inv_th)
DO j3 = ki3sc+1, ki3ec                    ! outer (vertical) loop: sequential dependences
  CALL coe_th (j3) { ! Determine effect of the layer in *coe_th* (call body shown inlined)
    ! Optical depth of gases
    DO j1 = ki1sc, ki1ec                  ! inner loop: parallel
      …
      IF (kco2 /= 0) THEN
        zodgf = zodgf + pduco2(j1,j3) * (cobi(kco2,kspec,2) * EXP ( coali(kco2,kspec,2) * &
                palogp(j1,j3) - cobti(kco2,kspec,2) * palogt(j1,j3)))
      ENDIF
      …
      zeps = SQRT(zodgf*zodgf)
      …
    ENDDO
  }
  DO j1 = ki1sc, ki1ec ! Set RHS          ! inner loop: parallel
    …
  ENDDO
  DO j1 = ki1sc, ki1ec ! Elimination and storage of utility variables; inner loop: parallel
    …
  ENDDO
ENDDO ! End of vertical loop over layers
The outer (vertical) loop is sequential; each of the inner j1 loops is parallel.
Slide 28
Optical Effect on Solar Layer – After Interchange
!> Turn loop structure with multiple ip loops inside a
!> single k loop into a perfectly nested k-ip loop on the GPU.
#ifdef _OPENACC
!$acc parallel
!$acc loop gang vector
DO j1 = ki1sc, ki1ec                      ! outer loop: parallel
  !$acc loop seq
  DO j3 = ki3sc+1, ki3ec                  ! inner loop over vertical: sequential
    ! Determine effects of layer in *coe_so*
    CALL coe_so_gpu(pduh2oc(j1,j3), pduh2of(j1,j3), …, pa4c(j1), pa4f(j1), pa5c(j1), pa5f(j1))
    ! Elimination
    …
    ztd1 = 1.0_dp/(1.0_dp - pa5f(j1)*(pca2(j1,j3)*ztu6(j1,j3-1) + pcc2(j1,j3)*ztu8(j1,j3-1)))
    ztu9(j1,j3) = pa5c(j1)*pcd1(j1,j3) + ztd6*ztu3(j1,j3) + ztd7*ztu5(j1,j3)
  ENDDO                                   ! End of vertical loop
END DO
!$acc end parallel
After the interchange, the outer j1 loop is parallel and the inner vertical (j3) loop is sequential.
Slide 29
Life Range Reordering (IMPACT'16, Verdoolaege et al.)
[Figure: a nest with a sequential outer loop and parallel inner loops cannot simply be turned into a parallel-outer / sequential-inner nest: false dependences on scalar temporaries prevent the interchange, and privatization of these temporaries is needed for parallel execution. Component: Scalable Scheduling.]
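To illustrate the problem (a simplified sketch, not COSMO code): a scalar temporary that is written and read in every iteration carries false dependences across the parallel dimension, so the loops cannot be interchanged until the temporary is privatized, i.e. each parallel iteration gets its own copy.

/* Before: the scalar t is reused across all j1 iterations, creating
   false (memory-based) dependences that serialize the j1 loop. */
void before(int n, int m, float A[n][m], const float B[n][m]) {
  float t;
  for (int j3 = 1; j3 < n; j3++)       /* sequential: A[j3] depends on A[j3-1] */
    for (int j1 = 0; j1 < m; j1++) {   /* would be parallel, but t is shared */
      t = B[j3][j1] * 2.0f;
      A[j3][j1] = A[j3 - 1][j1] + t;
    }
}

/* After privatization: each j1 iteration owns its copy of t, so the j1
   loop is parallel and can be interchanged to the outside (coarse-grained
   parallelism for the GPU), with the sequential j3 loop kept inside. */
void after(int n, int m, float A[n][m], const float B[n][m]) {
  for (int j1 = 0; j1 < m; j1++)       /* parallel */
    for (int j3 = 1; j3 < n; j3++) {   /* sequential */
      float t = B[j3][j1] * 2.0f;      /* privatized temporary */
      A[j3][j1] = A[j3 - 1][j1] + t;
    }
}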
Slide 31
Polly-ACC: Architecture
Polly-ACC: Transparent Compilation to Heterogeneous Hardware.
Tobias Grosser, Torsten Hoefler. International Conference on Supercomputing (ICS), June 2016, Istanbul.
Components and work items:
• Intrinsics to model multi-dimensional strided arrays
• Better ways to link with NVIDIA libdevice
• Scalable modeling
• Scalable scheduling
• Unified memory
• OpenCL + SPIR-V backend
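As a sketch of what the unified-memory component replaces the explicit transfers with (an illustration using the CUDA managed-memory API, not actual Polly-ACC output): with cudaMallocManaged the same pointer is valid on host and device, so no per-kernel copies are needed and unmodeled host code can touch the data freely. launch_average_kernel is again a hypothetical generated launcher.

#include <cuda_runtime.h>

/* Hypothetical launcher for a generated GPU kernel computing B = average(A). */
void launch_average_kernel(int n, const float *A, float *B);

void average_unified(int n) {
  float *A, *B;
  cudaMallocManaged((void **)&A, n * sizeof(float), cudaMemAttachGlobal);
  cudaMallocManaged((void **)&B, n * sizeof(float), cudaMemAttachGlobal);

  for (int i = 0; i < n; i++)          /* host code can touch managed memory directly */
    A[i] = 0.0f;

  launch_average_kernel(n, A, B);      /* kernel uses the same pointers; no cudaMemcpy */
  cudaDeviceSynchronize();             /* wait before the host reads B again */

  cudaFree(A);
  cudaFree(B);
}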
Slide 34
Per-Kernel Performance
% of time | Total time | Calls | Time per call | Kernel
29.98%    | 1.22414 s  | 939   | 1.0221 ms     | FUNC___radiation_rg_MOD_inv_th_SCOP_0_KERNEL_0
19.19%    | 783.48 ms  | 580   | 1.1786 ms     | FUNC___radiation_rg_MOD_inv_so_SCOP_0_KERNEL_1
 8.50%    | 347.10 ms  | 140   | 146.62 us     | FUNC___radiation_rg_MOD_fesft_dp_SCOP_11_KERNEL_0
...       | (~50 more kernels)
Observations: per-kernel time is short, there are many small kernels, and they are still considerably slower than the hand-written OpenACC kernels.
Slide 35
Correct Types for Loop Transformations
Maximilian Falkenstein
for (int32 i = 1; i < N; i++)
  for (int32 j = 1; j <= M; j++)
    A(i,j) = A(i-1,j) + A(i,j-1)
[Figure: dependences in the (i, j) iteration space]
Slide 36
Correct Types for Loop Transformations
Maximilian Falkenstein
Original loop nest:
for (int32 i = 1; i < N; i++)
  for (int32 j = 1; j <= M; j++)
    A(i,j) = A(i-1,j) + A(i,j-1)
After skewing (c = i + j):
for (intX c = 2; c < N+M; c++)
  #pragma simd
  for (intX i = max(1, c-M); i <= min(N, c-1); i++)
    A(i,c-i) = A(i-1,c-i) + A(i,c-i-1)
[Figure: skewed iteration space with axes i and i + j]
Slide 37
Correct Types for Loop Transformations
Maximilian Falkenstein
(same code as on the previous slide)
What is X? The new loop bound N + M may be larger than 32 bits.
TODAY:
• Use 64-bit
• Hope it's enough
Slide 38
What type would be optimal?
Platform                    | Optimal index type
Server or workstation CPU   | 64-bit
Embedded or HPC GPU         | 32-bit
Embedded CPU                | 32/16-bit
FPGA                        | minimal
Today COSMO targets server CPUs and HPC GPUs; tomorrow, embedded CPUs and FPGAs.
Can we always get 32-bit types?
Slide 39
Precise Solution
for (intX c = 2; c < N+M; c++)
  #pragma simd
  for (intX i = max(1, c-M); i <= min(N, c-1); i++)
    A(i, c-i) = A(i-1, c-i) + A(i, c-i-1)
[Figure: expression tree of the subscript c - i - 1]
Domain: { (c) : 2 <= c < N + M and INT_MIN <= N, M <= INT_MAX }
Subscript expressions: f0() = c - i, f1() = c - i - 1
1) Compute min(fX()) and max(fX()) under the domain.
2) Choose the type accordingly.
For example, the inner-loop bounds imply c - M <= i <= c - 1, hence 1 <= c - i <= M, so f0 fits into 32 bits whenever M does.
Slide 40
Options for computing the required types:
ILP solver
• Minimal types
• Potentially costly
Approximations*
• s(a+b) ≤ max(s(a), s(b)) + 1
• Good, if smaller than the native type
* Earlier uses in GCC and Polly
Preconditions
• Assume values fit into 32 bits
• Derive the required pre-conditions
[Figure: expression tree of a subscript expression built from c, i, and the constant 1]
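As a sketch of the approximation approach (an illustration, not Polly's implementation), the rule s(a+b) ≤ max(s(a), s(b)) + 1 can be applied bottom-up over an expression tree to bound the number of bits a subscript expression may need:

#include <stdint.h>
#include <stdio.h>

/* Conservative bit-width bound for an addition or subtraction:
   if the operands need s(a) and s(b) bits, the result needs at most
   max(s(a), s(b)) + 1 bits. */
static unsigned width_add(unsigned sa, unsigned sb) {
  return (sa > sb ? sa : sb) + 1;
}

/* Bits needed to represent a signed constant k. */
static unsigned width_const(int64_t k) {
  unsigned bits = 1;                       /* sign bit */
  uint64_t mag = (uint64_t)(k < 0 ? -k : k);
  while (mag) { bits++; mag >>= 1; }
  return bits;
}

int main(void) {
  /* Example: c - i - 1, where c and i are known to fit into 32 bits. */
  unsigned s_c = 32, s_i = 32;
  unsigned s_sub  = width_add(s_c, s_i);               /* c - i       -> <= 33 bits */
  unsigned s_expr = width_add(s_sub, width_const(1));  /* (c - i) - 1 -> <= 34 bits */
  printf("c - i - 1 needs at most %u bits\n", s_expr);
  /* 34 > 32, so the approximation alone cannot prove that a 32-bit type is
     safe here; the precise (ILP) solution or derived preconditions would
     be needed instead. */
  return 0;
}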
Slide 43
Automatic Compilation to FPGAs?
• Automatic translation: floating point → fixed point
  COSMO is mostly floating point (single precision / double precision)
• Code generation: SPIR-V flow to the Xilinx HLS tools, or translate LLVM-IR directly to Verilog
• How to get cache coherence
  Visited Xilinx: could share some of their software toolchain for Enzian
• How to schedule kernels: partial reconfiguration
• Data caching: can we keep data in BRAM?
• Kernel size: can we reduce the size of all kernels?
• …