This document summarizes research on LLVM optimizations for programs written in PGAS (Partitioned Global Address Space) languages such as Chapel. It discusses generating LLVM IR from Chapel to enable optimizations like LICM (Loop Invariant Code Motion). Evaluations show that LLVM optimizations remove many communication operations and improve performance over C code generation for some applications. However, LLVM's address-space constraints and packed wide-pointer overhead hurt performance for other applications. Future work includes evaluating more applications, possibly-remote to definitely-local transformations, and parallel intermediate representations in LLVM.
LLVM Optimizations for PGAS Programs - Case Study: LLVM Wide Pointer Optimizations in Chapel -
1. LLVM Optimizations for PGAS Programs
- Case Study: LLVM Wide Pointer Optimizations in Chapel -
CHIUW 2014 (co-located with IPDPS 2014), Phoenix, Arizona
Akihiro Hayashi, Rishi Surendran, Jisheng Zhao, Vivek Sarkar (Rice University),
Michael Ferguson (Laboratory for Telecommunication Sciences)
2. Background: Programming Model for Large-scale Systems
The Message Passing Interface (MPI) is a ubiquitous programming model, but it introduces non-trivial complexity due to its message-passing semantics.
PGAS languages such as Chapel, X10, Habanero-C and Co-array Fortran provide high-productivity features:
- Task parallelism
- Data distribution
- Synchronization
3. Motivation: Chapel Support for LLVM
LLVM is widely used and easy to extend.
[Diagram: several frontends generate LLVM Intermediate Representation (IR); LLVM's analysis & optimizations operate on the IR; target-specific backends then produce binaries.]
- Frontends: Clang (C/C++), dragonegg (C/C++, Fortran, Ada, Objective-C), the Chapel compiler (chpl), a UPC compiler
- Backends: x86, PowerPC, ARM, and PTX, producing x86, PPC, ARM, and GPU binaries
5. Our ultimate goal: a compiler that can uniformly optimize PGAS programs
Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism.
Two parallel intermediate representations (PIR) as extensions to LLVM IR: Runtime-Independent (RI-PIR) and Runtime-Specific (RS-PIR).
[Diagram: parallel programs (Chapel, X10, CAF, HC, ...) -> 1. RI-PIR generation, 2. analysis, 3. transformation (runtime-independent optimizations, e.g. on task-parallel constructs) -> 1. RS-PIR generation, 2. analysis, 3. transformation (runtime-specific optimizations, e.g. GASNet API) -> binary.]
6. The first step: an LLVM-based Chapel compiler
The Chapel compiler supports LLVM IR generation.
This talk discusses the pros and cons of LLVM-based communication optimizations for Chapel:
- Wide pointer optimization
- Preliminary performance evaluation & analysis using three regular applications
7. Chapel language
An object-oriented PGAS language developed by Cray Inc. as part of the DARPA HPCS program.
Key features:
- Array operators: zip, replicate, remap, ...
- Explicit task parallelism: begin, cobegin
- Locality control: locales
- Data distribution: domain maps
- Synchronization: sync
9. The Pros and Cons of using LLVM for Chapel
Pro: using LLVM's address space feature offers more opportunities for communication optimization than C code generation.

// Chapel
x = remoteData;

With C code generation, the remote access is lowered directly to a Chapel communication API, so the backend compiler's optimizations (e.g. gcc -O3) have few chances to optimize it:

// C code generation
chpl_comm_get(&x, …);

With LLVM code generation, the remote access is expressed as an ordinary load in address space 100, so (1) the existing LLVM passes (e.g. LICM, scalar replacement) can be used for communication optimization, and (2) the access is lowered to Chapel communication APIs only after those optimizations:

// LLVM IR
%x = load i64 addrspace(100)* %xptr
10. Address Space 100 generation in Chapel
Address space 100 = possibly-remote (our convention).
Constructs which generate address space 100:
- Array load/store (except local constructs), e.g. a distributed array:
  var d = {1..128} dmapped Block(boundingBox={1..128});
  var A: [d] int;
- Object and field load/store:
  class circle { var radius: real; … }
  var c1 = new circle(radius=1.0);
- On statement (except when the remote value forwarding optimization applies):
  var loc0: int;
  on Locales[1] { loc0 = …; }
- Ref intent:
  proc habanero(ref v: int): void { v = …; }
11. Motivating example of address space 100
LICM (Loop Invariant Code Motion) by LLVM hoists the remote GET out of the loop:

(Pseudo-code: before LICM)
for i in 1..N {
  // REMOTE GET
  %x = load i64 addrspace(100)* %xptr
  A(i) = %x;
}

(Pseudo-code: after LICM)
// REMOTE GET
%x = load i64 addrspace(100)* %xptr
for i in 1..N {
  A(i) = %x;
}
12. The Pros and Cons of using LLVM for Chapel (Cont'd)
Drawback: using LLVM may lose opportunities for optimization and may add overhead at runtime. In LLVM 3.3, many optimizations assume that the pointer size is the same across all address spaces.

For C code generation (CHPL_WIDE_POINTERS=struct), a 128-bit struct pointer is used, and the locale and address are read directly (wide.locale, wide.addr):

typedef struct wide_ptr_s {
  chpl_localeID_t locale;
  void* addr;
} wide_ptr_t;

For LLVM code generation (CHPL_WIDE_POINTERS=node16), a 64-bit packed pointer is used: a 16-bit locale ID in the high bits and a 48-bit address in the low bits, extracted with shifts and masks (wide >> 48, wide & 48BITS_MASK). This
1. needs more instructions, and
2. loses opportunities for alias analysis.
13. Performance Evaluations: Experimental Methodologies
We tested execution in the following modes:
1. C-Struct (--fast): C code generation + struct pointer + gcc; the conventional code generation in Chapel.
2. LLVM without wide optimization (--fast --llvm): LLVM IR generation + packed pointer; does not use the address space feature.
3. LLVM with wide optimization (--fast --llvm --llvm-wide-opt): LLVM IR generation + packed pointer; uses the address space feature and applies the existing LLVM optimizations.
15. Performance Evaluations: Details of Compiler & Runtime
Compiler:
- Chapel version 1.9.0.23154 (Apr. 2014), built with CHPL_LLVM=llvm, CHPL_WIDE_POINTERS=node16 or struct, CHPL_COMM=gasnet, CHPL_COMM_SUBSTRATE=ibv, CHPL_TASKS=qthreads
- Backend compiler: gcc-4.4.7, LLVM 3.3
Runtime:
- GASNet-1.22.0 (ibv-conduit, mpi-spawner)
- qthreads-1.10 (2 shepherds, 6 workers per shepherd)
16. Stream-EP
From the HPCC benchmark suite. Array size: 2^30.

coforall loc in Locales do on loc {
  // on each locale
  var A, B, C: [D] real(64);
  forall (a, b, c) in zip(A, B, C) do
    a = b + alpha * c;
}
17. Stream-EP Result
Execution time in seconds (lower is better):

                1 locale  2 locales  4 locales  8 locales  16 locales  32 locales
C-Struct          2.56      1.33       0.72       0.41       0.24        0.11
LLVM w/o wopt     6.62      3.22       1.73       1.01       0.62        0.26
LLVM w/ wopt      2.45      1.28       0.72       0.40       0.25        0.10

- C-Struct vs. LLVM w/o wopt: overhead of introducing LLVM + packed pointer (2.6x slower)
- LLVM w/o wopt vs. LLVM w/ wopt: performance improvement by LLVM optimizations (2.7x faster)
- C-Struct vs. LLVM w/ wopt: LLVM + wide opt is faster than the conventional C-Struct (1.1x)
18. Stream-EP Analysis
Dynamic number of Chapel PUT/GET APIs actually executed (16 locales):

C-Struct: 1.39E+11   LLVM w/o wopt: 1.40E+11   LLVM w/ wopt: 5.46E+10

// C-Struct, LLVM w/o wopt
forall (a, b, c) in zip(A, B, C) do
  8 GETs / 1 PUT per iteration

// LLVM w/ wopt: LICM by LLVM hoists the invariant GETs out of the loop
6 GETs (array head, offsets)
forall (a, b, c) in zip(A, B, C) do
  2 GETs / 1 PUT per iteration
20. Cholesky Result
Execution time in seconds (lower is better):

                8 locales  16 locales  32 locales
C-Struct         2401.32     941.70      730.94
LLVM w/o wopt    2781.12    1105.38      902.86
LLVM w/ wopt      858.77     283.32      216.48

- C-Struct vs. LLVM w/o wopt: overhead of introducing LLVM + packed pointer (1.2x slower)
- LLVM w/o wopt vs. LLVM w/ wopt: performance improvement by LLVM optimizations (4.2x faster)
- C-Struct vs. LLVM w/ wopt: LLVM + wide opt is faster than the conventional C-Struct (3.4x)
21. Cholesky Analysis
Dynamic number of Chapel PUT/GET APIs actually executed (2 locales), obtained with 1,000 x 1,000 input (100x100 tile size):

C-Struct: 1.78E+09   LLVM w/o wopt: 1.97E+09   LLVM w/ wopt: 5.89E+08

// C-Struct, LLVM w/o wopt
for jB in zero..tileSize-1 do {
  for kB in zero..tileSize-1 do {
    4 GETs
    for iB in zero..tileSize-1 do {
      8 GETs (+1 GET w/ LLVM)
      1 PUT
}}}

// LLVM w/ wopt
for jB in zero..tileSize-1 do {
  1 GET
  for kB in zero..tileSize-1 do {
    3 GETs
    for iB in zero..tileSize-1 do {
      2 GETs
      1 PUT
}}}
23. Smithwaterman Result
Execution time in seconds (lower is better):

                8 locales  16 locales
C-Struct           381.23     379.01
LLVM w/o wopt     1260.31    1263.76
LLVM w/ wopt       626.38     635.45

- C-Struct vs. LLVM w/o wopt: overhead of introducing LLVM + packed pointer (3.3x slower)
- LLVM w/o wopt vs. LLVM w/ wopt: performance improvement by LLVM optimizations (2.0x faster)
- C-Struct vs. LLVM w/ wopt: LLVM + wide opt is slower than the conventional C-Struct (0.6x)
24. Smithwaterman Analysis
Dynamic number of Chapel PUT/GET APIs actually executed (1 locale), obtained with 1,856 x 1,920 input (232x240 tile size):

C-Struct: 1.41E+08   LLVM w/o wopt: 1.41E+08   LLVM w/ wopt: 5.26E+07

// C-Struct, LLVM w/o wopt
for (ii, jj) in tile_1_2d_domain {
  33 GETs
  1 PUT
}

// LLVM w/ wopt
for (ii, jj) in tile_1_2d_domain {
  12 GETs
  1 PUT
}

No LICM is applied, though there are opportunities.
25. Key Insights
Using address space 100 offers finer-grain optimization opportunities (e.g. for Chapel arrays). A simple array loop:

for i in {1..N} {
  data = A(i);
}

is lowered to per-iteration GETs of the array descriptor as well as of the element:

for i in 1..N {
  head = GET(pointer to array head)
  offset1 = GET(offset)
  data = GET(head + i*offset1)
}

The descriptor GETs are loop-invariant, so there are opportunities for
1. LICM
2. Aggregation
26. Conclusions
The first performance evaluation and analysis of an LLVM-based Chapel compiler:
- Capable of utilizing the existing optimization passes even for remote data (e.g. LICM)
- Removes a significant number of communication API calls
- LLVM w/ wide opt is always better than LLVM w/o wide opt
Stream-EP, Cholesky:
- LLVM-based code generation is faster than C-based code generation (1.04x, 3.4x)
Smithwaterman:
- LLVM-based code generation is slower than C-based code generation due to constraints of the address space feature in LLVM:
  - No LICM, though there are opportunities
  - Significant overhead of the packed wide pointer
27. Future Work
- Evaluate other applications (regular and irregular)
- Possibly-remote to definitely-local transformation by the compiler:

local { A(i) = … }   // hint by programmer
… = A(i);            // definitely local

on Locales[1] {      // hint by programmer
  var A: [D] int;    // definitely local
}

- PIR in LLVM
30. Chapel Array Structure

// modules/internal/DefaultRectangular.chpl
class DefaultRectangularArr: BaseArr {
  ...
  var dom: DefaultRectangularDom(rank=rank, idxType=idxType,
                                 stridable=stridable); /* domain */
  var off: rank*idxType;          /* per-dimension offset (n-based -> 0-based) */
  var blk: rank*idxType;          /* per-dimension multiplier */
  var str: rank*chpl__signedType(idxType); /* per-dimension stride */
  var origin: idxType;            /* used for optimization */
  var factoredOffs: idxType;      /* used for calculating shiftedData */
  var data: _ddata(eltType);      /* pointer to the actual data */
  var shiftedData: _ddata(eltType); /* shifted pointer to the actual data */
  var noinit: bool = false;
  ...

// chpl_module.bc (with LLVM code generation)
%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object = type
{ %chpl_BaseArr_object, %chpl_DefaultRectangularDom_1_int64_t_F_object*,
  [1 x i64], [1 x i64], [1 x i64], i64, i64, i64*, i64*, i8 }
31. Example 1: Array Store (very simple)

proc habanero (A) {
  A(0) = 1;
}

Chapel version: 1.8.0.22047
Compiler options: --llvm --llvm-wide-opt --fast
A "noinline" attribute is added to the function to avoid dead code elimination.
Good afternoon, everyone. My name is Akihiro Hayashi. I'm a postdoc at Rice University.
Today, I'll be talking about LLVM-based optimizations for PGAS programs. In particular, I will focus on the Chapel language and its optimization in this talk.
Let me first talk about the programming model for large-scale systems.
The Message Passing Interface is a very common programming model for large-scale systems, but it is well known that using MPI introduces non-trivial complexity due to its message-passing semantics.
PGAS languages such as Chapel, X10, Habanero-C and CAF are designed to facilitate programming for large-scale systems by providing high-productivity language features such as task parallelism, data distribution and synchronization.
When it comes to compiler optimization, LLVM is an emerging compiler infrastructure that aims to replace conventional compilers like GCC.
Here is an overview of LLVM.
LLVM defines a machine-independent intermediate representation, and it also provides a powerful analyzer and optimizer for LLVM IR.
If you prepare a frontend that generates LLVM IR, you can analyze and optimize code in a language-independent manner. I think the most famous one is Clang, which takes C/C++ and generates LLVM IR.
You finally get a target-specific binary by using a target-specific backend.
The most important thing in this slide is that the Chapel compiler is now capable of generating LLVM IR.
Here is the big picture.
We think it is feasible to build an LLVM-based compiler that can uniformly analyze and optimize PGAS languages, because PGAS languages share a similar philosophy and language design.
That means one sophisticated compiler can optimize several kinds of PGAS languages and generate binaries for several kinds of supercomputers.
This slide shows the details of the universal PGAS compiler. Our plan is to extend LLVM IR to support parallel programs with PGAS and explicit task parallelism.
We are thinking of defining two kinds of parallel intermediate representations as extensions to LLVM IR: a runtime-independent IR and a runtime-specific IR. You may want to detect task-parallel constructs and apply some sort of optimization with the runtime-independent IR.
In this talk, we focus on the LLVM-based Chapel compiler as the first step toward our ultimate goal.
Let's talk about the pros and cons of using LLVM for Chapel.
We believe a good thing about using LLVM is that we can use LLVM's address space feature.
This offers more opportunities for communication optimization than C code generation.
Here are examples of remote GET code.
If you use the C code generator, a remote GET is expressed as a chpl_comm_get API call, but there are few chances for optimization because remote accesses are already lowered to Chapel communication APIs.
On the other hand, if we use LLVM and its address space feature, we can express a remote GET as a single load instruction that involves address space 100.
Suppose xptr is loop-invariant; then we can remove the redundant communication API call by LICM.
But using LLVM has a drawback. Chapel uses wide pointers to associate data with a node. A wide pointer is a C struct, and you can extract the node ID and the address with the dot operator.