1. Dynamic Compilation for Massively Parallel Processors
Gregory Diamos
PhD candidate
Georgia Institute of Technology and NVIDIA Research
April 14, 2011
Gregory Diamos CS264 - Dynamic Compilation 1/62
2. What is an execution model?
3. Goals of programming languages
Programming languages are designed for productivity.
Efficiency is measured in terms of:
1. cost - hardware investment, power consumption, area requirements
2. complexity - application development effort
3. speed - amount of work performed per unit time
4. Goals of processor architecture
Hardware is designed for speed and efficiency.
5. Goals of processor architecture - 2
It is constrained by the limitations of physical devices.
[1] - M. Koyanagi, T. Fukushima, and T. Tanaka. "High-Density Through Silicon Vias for 3-D LSIs." [2] - Novoselov et al. "Electric Field Effect in Atomically Thin Carbon Films." [3] - Intel Corp. 22nm test chip.
7. Goals of execution models
Execution models provide impedance matching between applications and
hardware.
Goals:
leverage common optimizations across multiple applications.
limit the impact of hardware changes on software.
ISAs have traditionally been effective execution models.
8. Programming challenges of heterogeneity
The introduction of heterogeneous and multi-core processors changes the
hardware/software interface:
(Examples: Intel Nehalem, IBM PowerEN, AMD Fusion, NVIDIA Fermi.)
1. multi-core creates multiple interfaces.
2. heterogeneity creates different interfaces.
3. these increase software complexity.
9. Program the entire processor, not individual cores.
(new execution model abstractions are needed)
11. Bulk-synchronous parallel (BSP)
[1] - Leslie Valiant. "A Bridging Model for Parallel Computation."
12. The Parallel Thread eXecution (PTX) Model
PTX defines a kernel as a 2-level grid of bulk-synchronous tasks.
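The two-level hierarchy can be emulated sequentially; a minimal Python sketch (the names `launch` and `saxpy` are hypothetical, not part of PTX):

```python
# Sequential emulation of the PTX grid hierarchy: a kernel runs over a
# grid of CTAs, each CTA a block of threads. CTAs are independent bulk-
# synchronous tasks; only threads within one CTA may synchronize.
def launch(grid_dim, block_dim, kernel, *args):
    for cta in range(grid_dim):            # outer level: CTAs
        for tid in range(block_dim):       # inner level: threads
            kernel(cta, tid, block_dim, *args)

def saxpy(cta, tid, block_dim, a, x, y, out):
    i = cta * block_dim + tid              # global thread id
    if i < len(x):
        out[i] = a * x[i] + y[i]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
launch(2, 2, saxpy, 2.0, x, y, out)
```

Because the CTAs are independent, the outer loop may be reordered or parallelized freely, which is exactly the latitude a dynamic compiler exploits.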
13. Dynamically translating PTX
Dynamic compilers can transform this parallelism to fit the hardware.
18. Binary translators are everywhere
If you are running a browser, you are using dynamic compilation.
20. Low Level Virtual Machines
Compile all programs to a common virtual machine representation (LLVM IR) and keep this representation around.
Perform common optimizations on this IR.
Target various machines by lowering the IR to a native ISA, either statically or via JIT compilation.
22. Execution model translation
Extend binary translation to execution model translation.
Dynamic compilers can map threads and tasks onto the hardware.
23. Different core architectures
Can we target these from the same execution model?
What about efficiency?
25. Mapping CTAs to cores - thread fusion
[Figure: original vs. transformed PTX code, with a scheduler block, register spill/restore blocks, and the barrier splitting the fused loops.]
Transform threads into loops over the program.
Distribute loops to handle barriers.
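The loop transformation can be sketched in plain Python (a toy kernel with one barrier; `fused_cta` and `shared` are hypothetical names):

```python
# Thread fusion (toy sketch): one CPU loop replaces a CTA's threads.
# A barrier cannot sit inside the fused loop, so the kernel is split
# ("loop distribution") into one loop before and one after the barrier.
def fused_cta(num_threads):
    shared = [0] * num_threads          # stand-in for shared memory
    for tid in range(num_threads):      # loop 1: code before the barrier
        shared[tid] = tid * 2
    # --- barrier: all writes above are now visible below ---
    out = [0] * num_threads
    for tid in range(num_threads):      # loop 2: code after the barrier
        out[tid] = shared[(tid + 1) % num_threads]
    return out
```

In the real transformation, per-thread registers that live across the barrier are spilled to arrays before loop 1 ends and restored inside loop 2, matching the spill/restore blocks in the figure.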
26. Mapping CTAs to cores - vectorization
Pack adjacent threads into vector instructions.
Speculate that divergence never occurs, check in case it does.
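The speculation can be illustrated with a toy sketch (hypothetical names; real vectorization emits SIMD instructions rather than list comprehensions):

```python
# Speculative vectorization (sketch): assume all packed lanes take the
# same side of a branch and execute one whole-vector path; fall back to
# per-lane execution only when the divergence check fails.
def vector_step(lanes, cond):
    flags = [cond(v) for v in lanes]
    if all(flags):                       # uniform taken: one vector op
        return [v * 2 for v in lanes]
    if not any(flags):                   # uniform not-taken
        return [v + 1 for v in lanes]
    # divergence detected: serialize per lane
    return [v * 2 if f else v + 1 for v, f in zip(lanes, flags)]
```

The fast paths touch every lane with one operation; the slow path pays the cost of divergence only when it actually occurs.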
27. Mapping CTAs to cores - multiple instruction streams
Instructions from different threads are independent.
Merge instruction streams and statically schedule them on the functional units.
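One way to picture this is round-robin interleaving of independent per-thread streams into a single issue order; a toy sketch with hypothetical instruction strings:

```python
# Toy static scheduler: instructions from different threads never depend
# on each other, so their streams can be interleaved round-robin into one
# issue order that keeps independent work in flight on the functional units.
def interleave(streams):
    issue_order = []
    for bundle in zip(*streams):   # one instruction per thread per cycle
        issue_order.extend(bundle)
    return issue_order

t0 = ["ld r0", "add r0", "st r0"]
t1 = ["ld r1", "mul r1", "st r1"]
```

A real scheduler would also model functional-unit counts and latencies; the interleave only shows where the independence comes from.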
31. Thread frontier analysis
Supporting control flow on SIMD processors requires finding divergent
branches and potential re-converge points.
[Figure: for a short-circuit compound conditional, if((cond1() || cond2()) && (cond3() || cond4())), thread-frontier re-convergence lets threads re-converge earlier than immediate post-dominator re-convergence; a per-thread stack pushes and pops re-convergence points as branches diverge.]
Compiler analysis can identify immediate post-dominators or thread frontiers as re-convergence points.
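The post-dominator variant can be computed with a standard dataflow iteration; a minimal Python sketch on a hypothetical diamond-shaped CFG:

```python
# Iterative post-dominator analysis: a node's post-dominators are itself
# plus the intersection of its successors' post-dominators. The immediate
# post-dominator of a divergent branch is a safe re-convergence point,
# since every path from the branch must pass through it.
def post_dominators(succ, exit_node):
    nodes = list(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == exit_node:
                continue
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

# Diamond CFG: a divergent branch at "entry" re-converges at "exit".
cfg = {"entry": ["then", "else"], "then": ["exit"],
       "else": ["exit"], "exit": []}
pdom = post_dominators(cfg, "exit")
```

Thread-frontier analysis refines this by tracking, per basic block, which other blocks divergent threads may be waiting in, allowing earlier re-convergence than the post-dominator permits.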
35. Reduced memory bandwidth on CPUs
[Figure: the same access pattern optimized for a single-threaded CPU vs. optimized for SIMD (GPU).]
Running the SIMD-optimized version reduces achieved memory bandwidth by 10x for a memory microbenchmark running on a 4-core CPU.
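The mismatch can be illustrated with a toy sketch (hypothetical functions; both compute the same sum, but the fused GPU-style loop walks memory with a large stride):

```python
# GPU-coalesced layout: thread tid touches elements tid, tid+T, tid+2T...
# After fusion onto one CPU core, this becomes a strided walk with poor
# cache-line utilization.
def gpu_style_sum(data, num_threads):
    total = 0
    for tid in range(num_threads):
        for i in range(tid, len(data), num_threads):
            total += data[i]
    return total

# CPU-friendly layout: each fused thread scans a contiguous chunk,
# so every cache line fetched is fully consumed.
def cpu_style_sum(data, num_threads):
    chunk = len(data) // num_threads
    total = 0
    for tid in range(num_threads):
        for i in range(tid * chunk, (tid + 1) * chunk):
            total += data[i]
    return total
```

Python itself hides the cache effect, but the two traversal orders are exactly the access patterns the microbenchmark contrasts.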
36. The good news
37. Scaling across three decades of processors
Many existing applications still scale.
[Figure: measured speedup across processors, annotated 12x and 480x.]
A GTX 280 has 40x more peak FLOPs than a Phenom and 480x more than an Atom.
54. 2. Block streaming
Blocking into pages, shared memory buffers, and transaction sized
chunks makes memory accesses efficient.
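A minimal sketch of the idea (hypothetical names; `CHUNK` stands in for whichever page, shared-memory, or transaction size the stage targets):

```python
# Block streaming (sketch): process a large input in fixed-size chunks so
# each stage works on one contiguous, fast-memory-sized buffer instead of
# issuing many scattered small accesses.
CHUNK = 4  # stand-in for a page / shared-memory / transaction size

def stream_sum(data, chunk=CHUNK):
    total = 0
    for start in range(0, len(data), chunk):
        block = data[start:start + chunk]   # one contiguous transfer
        total += sum(block)                 # work on the staged block
    return total
```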
55. 3. Shared memory merging network
A network for join can be constructed, similar to a sorting network.
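One standard construction is Batcher's odd-even merging network, in which every compare-exchange within a stage is data-independent and so maps naturally onto threads over a shared-memory buffer; a sketch assuming a power-of-two input whose two halves are already sorted:

```python
# Batcher odd-even merging network (sketch): len(a) is a power of two and
# both halves of a are sorted. Recursively merge the even- and odd-indexed
# subsequences, then run one final compare-exchange stage; every
# compare-exchange in a stage is independent and can run in parallel.
def odd_even_merge(a):
    n = len(a)
    if n == 2:
        return [min(a), max(a)]
    even = odd_even_merge(a[0::2])   # halves of a[0::2] are still sorted
    odd = odd_even_merge(a[1::2])
    merged = [None] * n
    merged[0::2], merged[1::2] = even, odd
    for i in range(1, n - 1, 2):     # final compare-exchange stage
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged
```

A join's merge step can use the same network shape to combine two sorted key runs held in shared memory.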
56. 4. Data chunking
Stream compaction packs result data into chunks that can be streamed
out of shared memory efficiently.
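The compaction step can be sketched with a flag/scan/scatter pattern (hypothetical names; a GPU version would compute the prefix sum in parallel):

```python
# Stream compaction (sketch): flag the survivors, take an exclusive
# prefix sum of the flags to assign each survivor a dense output slot,
# then scatter. The packed result can be streamed out contiguously.
def compact(values, keep):
    flags = [1 if keep(v) else 0 for v in values]
    slots, total = [], 0
    for f in flags:                 # exclusive prefix sum over the flags
        slots.append(total)
        total += f
    out = [None] * total
    for v, f, s in zip(values, flags, slots):
        if f:
            out[s] = v              # scatter survivors into dense slots
    return out
```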
61. Conclusions
Emerging heterogeneous architectures need matching execution model
abstractions.
dynamic compilation can enable portability.
When writing massively parallel codes, consider:
data structures and algorithms.
mapping onto the execution model.
transformations in the compiler/runtime.
processor micro-architecture.
62. Thoughts on open source software