By Tobias Grosser, Scalable Parallel Computing Laboratory
The COSMO climate and weather model delivers daily forecasts for Switzerland and many other nations. As a traditional HPC application it was developed with SIMD CPUs in mind, and a large manual effort was required to enable the 2016 move to GPU acceleration. Because today's high-performance computer systems increasingly rely on accelerators to reach peak performance, and manual translation to accelerators is both costly and difficult to maintain, we propose a fully automatic accelerator compiler that translates scientific Fortran codes to CUDA-accelerated GPU systems. Several challenges had to be overcome to make this a reality: 1) improved scalability, 2) automatic data placement using unified memory, 3) loop rescheduling to expose coarse-grained parallelism, 4) inter-procedural loop optimization, and 5) plenty of performance tuning. Our evaluation shows that end-to-end automatic accelerator compilation is possible for non-trivial portions of the COSMO climate model, despite the lack of complete static information. Non-trivial loop optimizations that were previously implemented manually are now performed fully automatically, and memory management happens transparently through unified memory. Our preliminary results show notable performance improvements over sequential CPU code (execution time reduced from 40s to 8s), and we are working on closing the remaining gap to hand-tuned GPU code. This talk is a status update on our most recent efforts and is also intended to gather feedback on future research plans towards automatically mapping COSMO to FPGAs.
Tobias Grosser Bio
Tobias Grosser is a senior researcher in the Scalable Parallel Computing Laboratory (SPCL) of Torsten Hoefler at the Computer Science Department of ETH Zürich. Supported by a Google PhD Fellowship, he received his doctoral degree from Université Pierre et Marie Curie under the supervision of Albert Cohen. Tobias' research takes place at the border of low-level compilers and high-level program transformations, with the goal of enabling complex - but highly beneficial - program transformations in a production compiler environment. He develops the Polly loop optimizer, a loop transformation framework that is today a community project supported through the Polly Labs research laboratory. Tobias also developed advanced tiling schemes for the efficient execution of iterated stencils. Today Tobias leads the heterogeneous compute efforts in the Swiss Universities-funded ComPASC project and is about to start a three-year SNSF Ambizione project at ETH Zürich on advancing automatic compilation and heterogenization techniques.
For more info on the Linaro High Performance Computing (HPC) SIG, visit https://www.linaro.org/sig/hpc/
Compilation of COSMO for GPU using LLVM
Slide 1
Automatic Accelerator Compilation of the COSMO Physics Core
Tobias Grosser, Siddharth Bhat, Torsten Hoefler
December 2017
Albert Cohen, Sven Verdoolaege, Oleksandr Zinenko
Polly Labs, ENS Paris
Johannes Doerfert
Uni. Saarbruecken
Roman Gereev,
Ural Federal University
Hongbin Zheng, Alexandre Isoard
Xilinx
Swiss Universities / PASC
Qualcomm, ARM, Xilinx
… many others
Slide 3
/* Body of a 2D convolution routine (KK x KK kernel over an NN x NN image);
   S0, S1, S2 mark the polyhedral statements. */
row = 0;
output_image_ptr = output_image;
output_image_ptr += (NN * dead_rows);
for (r = 0; r < NN - KK + 1; r++) {
  output_image_offset = output_image_ptr;
  output_image_offset += dead_cols;
  col = 0;
  for (c = 0; c < NN - KK + 1; c++) {
    input_image_ptr = input_image;
    input_image_ptr += (NN * row);
    kernel_ptr = kernel;
    /* S0 */ *output_image_offset = 0;
    for (i = 0; i < KK; i++) {
      input_image_offset = input_image_ptr;
      input_image_offset += col;
      kernel_offset = kernel_ptr;
      for (j = 0; j < KK; j++) {
        /* S1 */ temp1 = *input_image_offset++;
        /* S1 */ temp2 = *kernel_offset++;
        /* S1 */ *output_image_offset += temp1 * temp2;
      }
      kernel_ptr += KK;
      input_image_ptr += NN;
    }
    /* S2 */ *output_image_offset = ((*output_image_offset) / normal_factor);
    output_image_offset++;
    col++;
  }
  output_image_ptr += NN;
  row++;
}
} /* end of enclosing function (opening not shown on the slide) */
[Figure: sequential software (Fortran, C/C++) written for multi-core & SIMD CPUs must be mapped onto parallel accelerator hardware (many GPUs); the concerns are development time, maintenance cost, and performance tuning]
Slide 4
COSMO: Weather and Climate Model
• 500,000 lines of Fortran
• 18,000 loops
• 19 years of knowledge
• Used in Switzerland, Russia, Germany, Poland, Italy, Israel, Greece, Romania, …
Slide 6
COSMO – Weather Forecast
• Regional model
• High-resolution
• Runs "hourly" (20 instances in parallel)
• Today: 40 nodes × 8 GPUs
• Manual translation to GPUs: a 3-year, multi-person project
Can we automate this GPU mapping?
Slide 7
The LLVM Compiler
Frontends:
• Static languages: C / C++, Fortran, Go / D / C# / …
• Compute languages: Julia (MATLAB-style)
• Dynamic languages: JavaScript, Java
• COSMO (via the Fortran frontend)
Targets:
• CPU: Intel / AMD, PowerPC, ARM / MIPS
• GPU: NVIDIA, AMD / ARM
• FPGA: Xilinx, Altera
Slide 8
Polyhedral Model – In a Nutshell
Program code:
for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i, j);
Iteration space (illustrated for N = 4): the constraints 0 ≤ i, i ≤ N, 0 ≤ j, and j ≤ i define the domain
D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }
[Figure: triangular set of iteration points (i, j), from (0, 0) up to (4, 4), bounded by j ≤ i and i ≤ N = 4]
Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation.
Tobias Grosser, Armin Groesslinger, and Christian Lengauer. Parallel Processing Letters (PPL), April 2012.
Slide 9
Static Control Parts - SCoPs
• Structured control: IF-conditions and counted FOR-loops (Fortran style)
• Multi-dimensional array accesses (and scalars)
• Loop conditions and IF-conditions are Presburger formulas
• Loop increments are constant (non-parametric)
• Array subscript expressions are piecewise affine
→ Can be modeled precisely with Presburger sets
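To make "modeled precisely with Presburger sets" concrete, here is a minimal sketch (not from the slides) that builds the iteration domain of the triangular loop nest from the previous slide using the isl library, the Presburger-set library Polly builds on. It assumes isl is installed and linked (e.g. with -lisl).

#include <isl/ctx.h>
#include <isl/set.h>

int main(void) {
  isl_ctx *ctx = isl_ctx_alloc();

  /* Iteration domain of:
   *   for (i = 0; i <= N; i++)
   *     for (j = 0; j <= i; j++)
   *       S(i, j);
   * expressed as a parametric Presburger set. */
  isl_set *domain = isl_set_read_from_str(ctx,
      "[N] -> { S[i, j] : 0 <= i <= N and 0 <= j <= i }");

  /* Print the normalized set (to stderr). */
  isl_set_dump(domain);

  isl_set_free(domain);
  isl_ctx_free(ctx);
  return 0;
}

Loop bounds, conditions, and subscripts of a SCoP all become sets and maps of this form, which the compiler can then analyze and transform exactly.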
Slide 18
Kernels to Programs – Data Transfers
/* T, setCenter(), and average() are defined elsewhere. */
void heat(int n, float A[n], float hot, float cold) {
  float B[n];                          /* scratch array (a VLA cannot carry an initializer) */
  for (int i = 0; i < n; i++) {        /* affine loop: becomes a GPU kernel */
    A[i] = cold;
    B[i] = 0;
  }
  setCenter(n, A, hot, n/4);           /* host code with unknown side effects */
  for (int t = 0; t < T; t++) {
    average(n, A, B);                  /* affine stencil: becomes a CUDA GPU kernel */
    average(n, B, A);                  /* affine stencil: becomes a CUDA GPU kernel */
    printf("Iteration %d done\n", t);  /* host code with unknown side effects */
  }
}
GPU kernels are interleaved with host code whose side effects are unknown to the compiler, so data must be kept consistent between host and device.
Slide 19
Data Transfer – Per Kernel
[Figure: timeline of host and device memory over initialize(), setCenter(), average(), average(), average(). Each generated kernel is bracketed by transfers (H → D before, D → H after); host-side calls such as setCenter() operate on host memory in between, so data is copied back after every kernel.]
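As an illustration of this per-kernel transfer strategy (not actual compiler output), the generated host code corresponds roughly to the following C, using the CUDA runtime API; launch_average_kernel is a hypothetical stand-in for a generated kernel launcher.

#include <cuda_runtime.h>

/* Hypothetical launcher for a generated GPU kernel computing B = average(A). */
void launch_average_kernel(int n, const float *A_dev, float *B_dev);

void average_on_gpu(int n, float *A_host, float *B_host) {
  float *A_dev, *B_dev;
  cudaMalloc((void **)&A_dev, n * sizeof(float));
  cudaMalloc((void **)&B_dev, n * sizeof(float));

  /* H -> D: copy inputs before the kernel ... */
  cudaMemcpy(A_dev, A_host, n * sizeof(float), cudaMemcpyHostToDevice);

  launch_average_kernel(n, A_dev, B_dev);

  /* ... D -> H: copy results back, because the next host statement
     (e.g. printf or setCenter) may inspect or modify the data. */
  cudaMemcpy(B_host, B_dev, n * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(A_dev);
  cudaFree(B_dev);
}

Doing this around every kernel is correct but expensive, which is what motivates the unified-memory approach discussed later.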
Slide 24
Statistics - COSMO
Number of loops:
• 18,093 loops in total
• 9,760 static control loops (modeled precisely by Polly)
• 15,245 non-affine memory accesses (approximated by Polly)
• 11,154 loops after precise modeling; fewer e.g. due to infeasible assumptions taken, or modeling timeouts
• Largest set of loops: 72 loops
Reasons why loops cannot be modeled:
• Function calls with side effects
• Uncomputable loop bounds (data-dependent loop bounds?)
Siddharth Bhat
Slide 27
Optical Effect on Solar Layer (inv_th)
DO j3 = ki3sc+1, ki3ec                    ! outer (vertical) loop: sequential dependences
  CALL coe_th (j3) { ! Determine effect of the layer in *coe_th* (call body shown inlined)
    ! Optical depth of gases
    DO j1 = ki1sc, ki1ec                  ! inner loop: parallel
      …
      IF (kco2 /= 0) THEN
        zodgf = zodgf + pduco2(j1,j3) * (cobi(kco2,kspec,2) * EXP ( coali(kco2,kspec,2) * &
                palogp(j1,j3) - cobti(kco2,kspec,2) * palogt(j1,j3)))
      ENDIF
      …
      zeps = SQRT(zodgf*zodgf)
      …
    ENDDO
  }
  DO j1 = ki1sc, ki1ec ! Set RHS          ! inner loop: parallel
    …
  ENDDO
  DO j1 = ki1sc, ki1ec ! Elimination and storage of utility variables; inner loop: parallel
    …
  ENDDO
ENDDO ! End of vertical loop over layers
The outer (vertical) loop is sequential; each of the inner j1 loops is parallel.
Slide 28
Optical Effect on Solar Layer – After Interchange
!> Turn loop structure with multiple ip loops inside a
!> single k loop into a perfectly nested k-ip loop on the GPU.
#ifdef _OPENACC
!$acc parallel
!$acc loop gang vector
DO j1 = ki1sc, ki1ec                      ! outer loop: parallel
  !$acc loop seq
  DO j3 = ki3sc+1, ki3ec                  ! inner loop over vertical: sequential
    ! Determine effects of layer in *coe_so*
    CALL coe_so_gpu(pduh2oc(j1,j3), pduh2of(j1,j3), …, pa4c(j1), pa4f(j1), pa5c(j1), pa5f(j1))
    ! Elimination
    …
    ztd1 = 1.0_dp/(1.0_dp - pa5f(j1)*(pca2(j1,j3)*ztu6(j1,j3-1) + pcc2(j1,j3)*ztu8(j1,j3-1)))
    ztu9(j1,j3) = pa5c(j1)*pcd1(j1,j3) + ztd6*ztu3(j1,j3) + ztd7*ztu5(j1,j3)
  ENDDO                                   ! End of vertical loop
END DO
!$acc end parallel
After the interchange, the outer j1 loop is parallel and the inner vertical (j3) loop is sequential.
Slide 29
Life Range Reordering (IMPACT'16, Verdoolaege et al.)
[Figure: a nest with a sequential outer loop and parallel inner loops cannot simply be turned into a parallel-outer / sequential-inner nest: false dependences on scalar temporaries prevent the interchange, and privatization of these temporaries is needed for parallel execution. Component: Scalable Scheduling.]
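To illustrate the problem (a simplified sketch, not COSMO code): a scalar temporary that is written and read in every iteration carries false dependences across the parallel dimension, so the loops cannot be interchanged until the temporary is privatized, i.e. each parallel iteration gets its own copy.

/* Before: the scalar t is reused across all j1 iterations, creating
   false (memory-based) dependences that serialize the j1 loop. */
void before(int n, int m, float A[n][m], const float B[n][m]) {
  float t;
  for (int j3 = 1; j3 < n; j3++)       /* sequential: A[j3] depends on A[j3-1] */
    for (int j1 = 0; j1 < m; j1++) {   /* would be parallel, but t is shared */
      t = B[j3][j1] * 2.0f;
      A[j3][j1] = A[j3 - 1][j1] + t;
    }
}

/* After privatization: each j1 iteration owns its copy of t, so the j1
   loop is parallel and can be interchanged to the outside (coarse-grained
   parallelism for the GPU), with the sequential j3 loop kept inside. */
void after(int n, int m, float A[n][m], const float B[n][m]) {
  for (int j1 = 0; j1 < m; j1++)       /* parallel */
    for (int j3 = 1; j3 < n; j3++) {   /* sequential */
      float t = B[j3][j1] * 2.0f;      /* privatized temporary */
      A[j3][j1] = A[j3 - 1][j1] + t;
    }
}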
Slide 31
Polly-ACC: Architecture
Polly-ACC: Transparent Compilation to Heterogeneous Hardware.
Tobias Grosser, Torsten Hoefler. International Conference on Supercomputing (ICS), June 2016, Istanbul.
Components and work items:
• Intrinsics to model multi-dimensional strided arrays
• Better ways to link with NVIDIA libdevice
• Scalable modeling
• Scalable scheduling
• Unified memory
• OpenCL + SPIR-V backend
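As a sketch of what the unified-memory component replaces the explicit transfers with (an illustration using the CUDA managed-memory API, not actual Polly-ACC output): with cudaMallocManaged the same pointer is valid on host and device, so no per-kernel copies are needed and unmodeled host code can touch the data freely. launch_average_kernel is again a hypothetical generated launcher.

#include <cuda_runtime.h>

/* Hypothetical launcher for a generated GPU kernel computing B = average(A). */
void launch_average_kernel(int n, const float *A, float *B);

void average_unified(int n) {
  float *A, *B;
  cudaMallocManaged((void **)&A, n * sizeof(float), cudaMemAttachGlobal);
  cudaMallocManaged((void **)&B, n * sizeof(float), cudaMemAttachGlobal);

  for (int i = 0; i < n; i++)          /* host code can touch managed memory directly */
    A[i] = 0.0f;

  launch_average_kernel(n, A, B);      /* kernel uses the same pointers; no cudaMemcpy */
  cudaDeviceSynchronize();             /* wait before the host reads B again */

  cudaFree(A);
  cudaFree(B);
}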
Slide 34
Per-Kernel Performance
% of time | Total time | Calls | Time per call | Kernel
29.98%    | 1.22414 s  | 939   | 1.0221 ms     | FUNC___radiation_rg_MOD_inv_th_SCOP_0_KERNEL_0
19.19%    | 783.48 ms  | 580   | 1.1786 ms     | FUNC___radiation_rg_MOD_inv_so_SCOP_0_KERNEL_1
 8.50%    | 347.10 ms  | 140   | 146.62 us     | FUNC___radiation_rg_MOD_fesft_dp_SCOP_11_KERNEL_0
...       | (~50 more kernels)
Observations: per-kernel time is short, there are many small kernels, and they are still considerably slower than the hand-written OpenACC kernels.
Slide 35
Correct Types for Loop Transformations
Maximilian Falkenstein
for (int32 i = 1; i < N; i++)
  for (int32 j = 1; j <= M; j++)
    A(i,j) = A(i-1,j) + A(i,j-1)
[Figure: dependences in the (i, j) iteration space]
Slide 36
Correct Types for Loop Transformations
Maximilian Falkenstein
Original loop nest:
for (int32 i = 1; i < N; i++)
  for (int32 j = 1; j <= M; j++)
    A(i,j) = A(i-1,j) + A(i,j-1)
After skewing (c = i + j):
for (intX c = 2; c < N+M; c++)
  #pragma simd
  for (intX i = max(1, c-M); i <= min(N, c-1); i++)
    A(i,c-i) = A(i-1,c-i) + A(i,c-i-1)
[Figure: skewed iteration space with axes i and i + j]
Slide 37
Correct Types for Loop Transformations
Maximilian Falkenstein
(same code as on the previous slide)
What is X? The new loop bound N + M may be larger than 32 bits.
TODAY:
• Use 64-bit
• Hope it's enough
Slide 38
What type would be optimal?
Platform                    | Optimal index type
Server or workstation CPU   | 64-bit
Embedded or HPC GPU         | 32-bit
Embedded CPU                | 32/16-bit
FPGA                        | minimal
Today COSMO targets server CPUs and HPC GPUs; tomorrow, embedded CPUs and FPGAs.
Can we always get 32-bit types?
Slide 39
Precise Solution
for (intX c = 2; c < N+M; c++)
  #pragma simd
  for (intX i = max(1, c-M); i <= min(N, c-1); i++)
    A(i, c-i) = A(i-1, c-i) + A(i, c-i-1)
[Figure: expression tree of the subscript c - i - 1]
Domain: { (c) : 2 <= c < N + M and INT_MIN <= N, M <= INT_MAX }
Subscript expressions: f0() = c - i, f1() = c - i - 1
1) Compute min(fX()) and max(fX()) under the domain.
2) Choose the type accordingly.
For example, the inner-loop bounds imply c - M <= i <= c - 1, hence 1 <= c - i <= M, so f0 fits into 32 bits whenever M does.
Slide 40
Options for computing the required types:
ILP solver
• Minimal types
• Potentially costly
Approximations*
• s(a+b) ≤ max(s(a), s(b)) + 1
• Good, if smaller than the native type
* Earlier uses in GCC and Polly
Preconditions
• Assume values fit into 32 bits
• Derive the required pre-conditions
[Figure: expression tree of a subscript expression built from c, i, and the constant 1]
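As a sketch of the approximation approach (an illustration, not Polly's implementation), the rule s(a+b) ≤ max(s(a), s(b)) + 1 can be applied bottom-up over an expression tree to bound the number of bits a subscript expression may need:

#include <stdint.h>
#include <stdio.h>

/* Conservative bit-width bound for an addition or subtraction:
   if the operands need s(a) and s(b) bits, the result needs at most
   max(s(a), s(b)) + 1 bits. */
static unsigned width_add(unsigned sa, unsigned sb) {
  return (sa > sb ? sa : sb) + 1;
}

/* Bits needed to represent a signed constant k. */
static unsigned width_const(int64_t k) {
  unsigned bits = 1;                       /* sign bit */
  uint64_t mag = (uint64_t)(k < 0 ? -k : k);
  while (mag) { bits++; mag >>= 1; }
  return bits;
}

int main(void) {
  /* Example: c - i - 1, where c and i are known to fit into 32 bits. */
  unsigned s_c = 32, s_i = 32;
  unsigned s_sub  = width_add(s_c, s_i);               /* c - i       -> <= 33 bits */
  unsigned s_expr = width_add(s_sub, width_const(1));  /* (c - i) - 1 -> <= 34 bits */
  printf("c - i - 1 needs at most %u bits\n", s_expr);
  /* 34 > 32, so the approximation alone cannot prove that a 32-bit type is
     safe here; the precise (ILP) solution or derived preconditions would
     be needed instead. */
  return 0;
}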
Slide 43
Automatic Compilation to FPGAs?
• Automatic translation: floating point → fixed point
  COSMO is mostly floating point (single precision / double precision)
• Code generation: SPIR-V flow to the Xilinx HLS tools, or translate LLVM-IR directly to Verilog
• How to get cache coherence
  Visited Xilinx: could share some of their software toolchain for Enzian
• How to schedule kernels: partial reconfiguration
• Data caching: can we keep data in BRAM?
• Kernel size: can we reduce the size of all kernels?
• …