1. Dynamic Compilation for Massively Parallel Processors
Gregory Diamos
PhD candidate
Georgia Institute of Technology and NVIDIA Research
April 14, 2011
Gregory Diamos CS264 - Dynamic Compilation 1/62
2. What is an execution model?
3. Goals of programming languages
Programming languages are designed for productivity.
Efficiency is measured in terms of:
1. cost - hardware investment, power consumption, area requirements
2. complexity - application development effort
3. speed - amount of work performed per unit time
4. Goals of processor architecture
Hardware is designed for speed and efficiency.
5. Goals of processor architecture - 2
It is constrained by the limitations of physical devices.
[1] - M. Koyanagi, T. Fukushima, and T. Tanaka. "High-Density Through Silicon Vias for 3-D LSIs." [2] - Novoselov et al. "Electric Field Effect in Atomically Thin Carbon Films." [3] - Intel Corp. 22nm test chip.
7. Goals of execution models
Execution models provide impedance matching between applications and
hardware.
Goals:
leverage common optimizations across multiple applications.
limit the impact of hardware changes on software.
ISAs have traditionally been effective execution models.
8. Programming challenges of heterogeneity
The introduction of heterogeneous and multi-core processors changes the
hardware/software interface:
(Examples: Intel Nehalem, IBM PowerEN, AMD Fusion, NVIDIA Fermi.)
1. multi-core creates multiple interfaces.
2. heterogeneity creates different interfaces.
3. these increase software complexity.
9. Program the entire processor, not individual cores.
(new execution model abstractions are needed)
11. Bulk-synchronous parallel (BSP)
[1] - Leslie Valiant. "A Bridging Model for Parallel Computation."
12. The Parallel Thread eXecution (PTX) Model
PTX defines a kernel as a 2-level grid of bulk-synchronous tasks.
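The two-level hierarchy can be emulated sequentially; a minimal Python sketch (the names `launch` and `saxpy` are hypothetical, not part of PTX):

```python
# Sequential emulation of the PTX grid hierarchy: a kernel runs over a
# grid of CTAs, each CTA a block of threads. CTAs are independent bulk-
# synchronous tasks; only threads within one CTA may synchronize.
def launch(grid_dim, block_dim, kernel, *args):
    for cta in range(grid_dim):            # outer level: CTAs
        for tid in range(block_dim):       # inner level: threads
            kernel(cta, tid, block_dim, *args)

def saxpy(cta, tid, block_dim, a, x, y, out):
    i = cta * block_dim + tid              # global thread id
    if i < len(x):
        out[i] = a * x[i] + y[i]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
launch(2, 2, saxpy, 2.0, x, y, out)
```

Because the CTAs are independent, the outer loop may be reordered or parallelized freely, which is exactly the latitude a dynamic compiler exploits.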
13. Dynamically translating PTX
Dynamic compilers can transform this parallelism to fit the hardware.
18. Binary translators are everywhere
If you are running a browser, you are using dynamic compilation.
20. Low Level Virtual Machines
Compile all programs to a common virtual machine representation (LLVM IR) and keep this representation around.
Perform common optimizations on this IR.
Target various machines by lowering the IR to a native ISA, either statically or via JIT compilation.
22. Execution model translation
Extend binary translation to execution model translation.
Dynamic compilers can map threads and tasks onto the hardware.
23. Different core architectures
Can we target these from the same execution model?
What about efficiency?
25. Mapping CTAs to cores - thread fusion
[Figure: original vs. transformed PTX code, with a scheduler block, register spill/restore blocks, and the barrier splitting the fused loops.]
Transform threads into loops over the program.
Distribute loops to handle barriers.
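The loop transformation can be sketched in plain Python (a toy kernel with one barrier; `fused_cta` and `shared` are hypothetical names):

```python
# Thread fusion (toy sketch): one CPU loop replaces a CTA's threads.
# A barrier cannot sit inside the fused loop, so the kernel is split
# ("loop distribution") into one loop before and one after the barrier.
def fused_cta(num_threads):
    shared = [0] * num_threads          # stand-in for shared memory
    for tid in range(num_threads):      # loop 1: code before the barrier
        shared[tid] = tid * 2
    # --- barrier: all writes above are now visible below ---
    out = [0] * num_threads
    for tid in range(num_threads):      # loop 2: code after the barrier
        out[tid] = shared[(tid + 1) % num_threads]
    return out
```

In the real transformation, per-thread registers that live across the barrier are spilled to arrays before loop 1 ends and restored inside loop 2, matching the spill/restore blocks in the figure.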
26. Mapping CTAs to cores - vectorization
Pack adjacent threads into vector instructions.
Speculate that divergence never occurs, check in case it does.
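The speculation can be illustrated with a toy sketch (hypothetical names; real vectorization emits SIMD instructions rather than list comprehensions):

```python
# Speculative vectorization (sketch): assume all packed lanes take the
# same side of a branch and execute one whole-vector path; fall back to
# per-lane execution only when the divergence check fails.
def vector_step(lanes, cond):
    flags = [cond(v) for v in lanes]
    if all(flags):                       # uniform taken: one vector op
        return [v * 2 for v in lanes]
    if not any(flags):                   # uniform not-taken
        return [v + 1 for v in lanes]
    # divergence detected: serialize per lane
    return [v * 2 if f else v + 1 for v, f in zip(lanes, flags)]
```

The fast paths touch every lane with one operation; the slow path pays the cost of divergence only when it actually occurs.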
27. Mapping CTAs to cores - multiple instruction streams
Instructions from different threads are independent.
Merge instruction streams and statically schedule them on the functional units.
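One way to picture this is round-robin interleaving of independent per-thread streams into a single issue order; a toy sketch with hypothetical instruction strings:

```python
# Toy static scheduler: instructions from different threads never depend
# on each other, so their streams can be interleaved round-robin into one
# issue order that keeps independent work in flight on the functional units.
def interleave(streams):
    issue_order = []
    for bundle in zip(*streams):   # one instruction per thread per cycle
        issue_order.extend(bundle)
    return issue_order

t0 = ["ld r0", "add r0", "st r0"]
t1 = ["ld r1", "mul r1", "st r1"]
```

A real scheduler would also model functional-unit counts and latencies; the interleave only shows where the independence comes from.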
31. Thread frontier analysis
Supporting control flow on SIMD processors requires finding divergent
branches and potential re-converge points.
[Figure: for a short-circuit compound conditional, if((cond1() || cond2()) && (cond3() || cond4())), thread-frontier re-convergence lets threads re-converge earlier than immediate post-dominator re-convergence; a per-thread stack pushes and pops re-convergence points as branches diverge.]
Compiler analysis can identify immediate post-dominators or thread frontiers as re-convergence points.
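The post-dominator variant can be computed with a standard dataflow iteration; a minimal Python sketch on a hypothetical diamond-shaped CFG:

```python
# Iterative post-dominator analysis: a node's post-dominators are itself
# plus the intersection of its successors' post-dominators. The immediate
# post-dominator of a divergent branch is a safe re-convergence point,
# since every path from the branch must pass through it.
def post_dominators(succ, exit_node):
    nodes = list(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == exit_node:
                continue
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

# Diamond CFG: a divergent branch at "entry" re-converges at "exit".
cfg = {"entry": ["then", "else"], "then": ["exit"],
       "else": ["exit"], "exit": []}
pdom = post_dominators(cfg, "exit")
```

Thread-frontier analysis refines this by tracking, per basic block, which other blocks divergent threads may be waiting in, allowing earlier re-convergence than the post-dominator permits.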
35. Reduced memory bandwidth on CPUs
[Figure: the same access pattern optimized for a single-threaded CPU vs. optimized for SIMD (GPU).]
Running the SIMD-optimized version reduces achieved memory bandwidth by 10x for a memory microbenchmark running on a 4-core CPU.
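The mismatch can be illustrated with a toy sketch (hypothetical functions; both compute the same sum, but the fused GPU-style loop walks memory with a large stride):

```python
# GPU-coalesced layout: thread tid touches elements tid, tid+T, tid+2T...
# After fusion onto one CPU core, this becomes a strided walk with poor
# cache-line utilization.
def gpu_style_sum(data, num_threads):
    total = 0
    for tid in range(num_threads):
        for i in range(tid, len(data), num_threads):
            total += data[i]
    return total

# CPU-friendly layout: each fused thread scans a contiguous chunk,
# so every cache line fetched is fully consumed.
def cpu_style_sum(data, num_threads):
    chunk = len(data) // num_threads
    total = 0
    for tid in range(num_threads):
        for i in range(tid * chunk, (tid + 1) * chunk):
            total += data[i]
    return total
```

Python itself hides the cache effect, but the two traversal orders are exactly the access patterns the microbenchmark contrasts.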
36. The good news
37. Scaling across three decades of processors
Many existing applications still scale.
[Figure: measured speedup across processors, annotated 12x and 480x.]
A GTX 280 has 40x more peak FLOPs than a Phenom and 480x more than an Atom.
54. 2. Block streaming
Blocking into pages, shared memory buffers, and transaction sized
chunks makes memory accesses efficient.
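A minimal sketch of the idea (hypothetical names; `CHUNK` stands in for whichever page, shared-memory, or transaction size the stage targets):

```python
# Block streaming (sketch): process a large input in fixed-size chunks so
# each stage works on one contiguous, fast-memory-sized buffer instead of
# issuing many scattered small accesses.
CHUNK = 4  # stand-in for a page / shared-memory / transaction size

def stream_sum(data, chunk=CHUNK):
    total = 0
    for start in range(0, len(data), chunk):
        block = data[start:start + chunk]   # one contiguous transfer
        total += sum(block)                 # work on the staged block
    return total
```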
55. 3. Shared memory merging network
A network for join can be constructed, similar to a sorting network.
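One standard construction is Batcher's odd-even merging network, in which every compare-exchange within a stage is data-independent and so maps naturally onto threads over a shared-memory buffer; a sketch assuming a power-of-two input whose two halves are already sorted:

```python
# Batcher odd-even merging network (sketch): len(a) is a power of two and
# both halves of a are sorted. Recursively merge the even- and odd-indexed
# subsequences, then run one final compare-exchange stage; every
# compare-exchange in a stage is independent and can run in parallel.
def odd_even_merge(a):
    n = len(a)
    if n == 2:
        return [min(a), max(a)]
    even = odd_even_merge(a[0::2])   # halves of a[0::2] are still sorted
    odd = odd_even_merge(a[1::2])
    merged = [None] * n
    merged[0::2], merged[1::2] = even, odd
    for i in range(1, n - 1, 2):     # final compare-exchange stage
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged
```

A join's merge step can use the same network shape to combine two sorted key runs held in shared memory.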
56. 4. Data chunking
Stream compaction packs result data into chunks that can be streamed
out of shared memory efficiently.
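The compaction step can be sketched with a flag/scan/scatter pattern (hypothetical names; a GPU version would compute the prefix sum in parallel):

```python
# Stream compaction (sketch): flag the survivors, take an exclusive
# prefix sum of the flags to assign each survivor a dense output slot,
# then scatter. The packed result can be streamed out contiguously.
def compact(values, keep):
    flags = [1 if keep(v) else 0 for v in values]
    slots, total = [], 0
    for f in flags:                 # exclusive prefix sum over the flags
        slots.append(total)
        total += f
    out = [None] * total
    for v, f, s in zip(values, flags, slots):
        if f:
            out[s] = v              # scatter survivors into dense slots
    return out
```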
61. Conclusions
Emerging heterogeneous architectures need matching execution model
abstractions.
dynamic compilation can enable portability.
When writing massively parallel codes, consider:
data structures and algorithms.
mapping onto the execution model.
transformations in the compiler/runtime.
processor micro-architecture.
62. Thoughts on open source software