Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

HETEROGENEOUS PARTICLE
BASED SIMULATION
Takahiro Harada, AMD

PARTICLE BASED SIMULATION ON THE GPU
 Large number of particles
 Particles with identical size
– Work granularity is almost the same
– Good for the wide SIMD architecture
(Figure: Harada et al. 2007)

PARTICLE BASED SIMULATION
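 Collision: $f_{collide} = \sum_j f_{ij}$
 Integration: $\frac{dv}{dt} = \frac{f}{m}$, $\frac{dx}{dt} = v$
– Discretized: $v \mathrel{+}= \frac{f}{m}\,\Delta t$, $x \mathrel{+}= v\,\Delta t$
 Acceleration structure is used for efficient collision detection
– Uniform grid → Suited for the GPU
– Less divergence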
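The integration step on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function name `integrate` and the sample values are hypothetical, and it assumes the position update uses the already updated velocity, following the order in which the two updates are listed.

```python
def integrate(x, v, f, m, dt):
    """Advance one particle by one time step.

    Applies v += (f / m) * dt, then x += v * dt, component by component.
    """
    v_new = [vi + (fi / m) * dt for vi, fi in zip(v, f)]
    x_new = [xi + vi * dt for xi, vi in zip(x, v_new)]
    return x_new, v_new


# Hypothetical values: a unit-mass particle under a constant downward force.
x, v = integrate(x=[0.0, 1.0], v=[1.0, 0.0], f=[0.0, -10.0], m=1.0, dt=0.1)
```

Every particle runs the same arithmetic with no branching, which is why this stage maps well to wide SIMD hardware.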

DIVERGENCE ON SIMD
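(Figure: SIMD lanes 0 to 7 taking different branches)
void Kernel()
{
    if (A)
        FuncA();      // lanes where A holds
    else if (B)
        FuncB();      // lanes where B holds
    else
        FuncC();      // all remaining lanes
}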
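The cost of divergence can be estimated with a simplified model: a SIMD unit runs every branch body that at least one of its lanes takes, with the other lanes masked off. The sketch below assumes that model; `passes_per_wavefront` and the lane assignments are hypothetical names and data.

```python
def passes_per_wavefront(branch_of_lane):
    """Count how many branch bodies one SIMD wavefront has to execute."""
    return len(set(branch_of_lane))


divergent = ["A", "B", "C", "A", "B", "C", "A", "B"]  # lanes 0..7 disagree
coherent = ["A"] * 8                                   # all lanes agree

print(passes_per_wavefront(divergent))  # 3
print(passes_per_wavefront(coherent))   # 1
```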

PARTICLE BASED SIMULATION ON THE GPU
 Particle collision using a uniform grid
(Figure: SIMD lanes 0 to 7 all executing the same sequence of calls)
void Kernel()
{
    prepare();
    collide(Cell0);   // every lane visits the same 9 cells in the same order
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}
Cell0 Cell1 Cell2
Cell3 Cell4 Cell5
Cell6 Cell7 Cell8
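A serial sketch of the uniform grid lookup, assuming 2D positions, identical radii, and a cell size equal to the particle diameter, so that any contact lies within the 3 × 3 block of cells around a particle. The function names and sample positions are hypothetical; a GPU version would run the outer loop over particles in parallel.

```python
from collections import defaultdict


def build_grid(positions, cell_size):
    """Map each cell coordinate to the indices of the particles inside it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(int(x // cell_size), int(y // cell_size))].append(i)
    return grid


def colliding_pairs(positions, radius):
    """Return the set of overlapping pairs (i, j) with i < j."""
    cell_size = 2.0 * radius  # a contact can only span adjacent cells
    grid = build_grid(positions, cell_size)
    pairs = set()
    for i, (x, y) in enumerate(positions):
        cx, cy = int(x // cell_size), int(y // cell_size)
        for dx in (-1, 0, 1):          # the same 9 cells for every particle
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    if j > i:
                        px, py = positions[j]
                        if (px - x) ** 2 + (py - y) ** 2 < (2.0 * radius) ** 2:
                            pairs.add((i, j))
    return pairs


positions = [(0.0, 0.0), (0.15, 0.0), (1.0, 1.0), (1.05, 1.1)]
print(sorted(colliding_pairs(positions, radius=0.1)))  # [(0, 1), (2, 3)]
```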

MIXED PARTICLE SIMULATION
 Not only small particles
 Difficulty for GPUs
– Large particles interact with small particles
– Large-large collision

CHALLENGE
 Non uniform work granularity
– Small-small(SS) collision
 Uniform, GPU
– Large-large(LL) collision
 Non Uniform, CPU
– Large-small(LS) collision
 Non Uniform, CPU
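The classification above can be written as a small lookup. This is only an illustration: the radius threshold that separates small from large particles, and the names `collision_kind` and `PROCESSOR`, are assumptions, not part of the talk.

```python
def collision_kind(radius_a, radius_b, large_threshold):
    """Classify a particle pair as "SS", "LS", or "LL" by particle size."""
    n_large = int(radius_a >= large_threshold) + int(radius_b >= large_threshold)
    return ("SS", "LS", "LL")[n_large]


# Processor assignment as listed on the slide.
PROCESSOR = {"SS": "GPU", "LS": "CPU", "LL": "CPU"}

kind = collision_kind(0.1, 2.0, large_threshold=1.0)
print(kind, PROCESSOR[kind])  # LS CPU
```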

FUSION ARCHITECTURE
 CPU and GPU are:
– On the same die
– Much closer
– Efficient data sharing
 CPU and GPU are good at different kinds of work
– CPU: serial computation, conditional branches
– GPU: parallel computation
 Able to dispatch each kind of work to the better suited processor:
– Serial work with varying granularity → CPU
– Parallel work with uniform granularity → GPU

MIXED PARTICLE SIMULATION
 Benefits from the Fusion Architecture
– Different kinds of work in one simulation
– CPU & GPU work together
– They share data

METHOD

TWO SIMULATIONS
 Small particles
– Stages: Build Acc. Structure → SS Collision → S Integration
– Data: Position, Velocity, Force, Grid
 Large particles
– Stages: Build Acc. Structure → LL Collision → L Integration
– Data: Position, Velocity, Force
 LS Collision involves both sets of particles

CLASSIFY BY WORK GRANULARITY
 Small particles
 Large particles
(Figure: the stages Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, and L Integration, with the buffers Position, Velocity, Force, and Grid, grouped into Uniform Work and Non Uniform Work)

CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR
 Small particles
 Large particles
(Figure: the stages Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, and L Integration, with the buffers Position, Velocity, Force, and Grid; uniform work is assigned to the GPU and non-uniform work to the CPU)

DATA SHARING
 Small particles
 Large particles
 The grid and the small particle data have to be shared with the CPU for LS collision
– Allocated as zero copy buffers
(Figure: GPU and CPU pipelines; the small particles' Position, Velocity, Grid, and Force buffers are shared with LS Collision on the CPU)

SYNCHRONIZATION
 Small particles
 Large particles
 The grid and the small particle data have to be shared with the CPU for LS collision
– Allocated as zero copy buffers
(Figure: GPU and CPU pipelines with two Synchronization points around the shared Position, Velocity, Grid, and Force buffers)

VISUALIZING WORKLOADS
 Small particles
 Large particles
 Grid construction can be moved to the end of the pipeline
– Unbalanced workload
(Figure: timeline of GPU and CPU work showing Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, Synchronization, and L Integration)

LOAD BALANCING
 Small particles
 Large particles
 To get better load balancing
– The sync exists to pass the force buffer filled by the CPU to the GPU
– Move the LL collision after the sync
(Figure: timeline of GPU and CPU work showing Build Acc. Structure, SS Collision, S Integration, LS Collision, Synchronization, LL Collision, and L Integration)

(Legend: GPU work, CPU work)

MULTI THREADING
(4 THREADS)

FURTHER OPTIMIZATION
(Figure: timeline across GPU, CPU0, CPU1, and CPU2 showing Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, and Synchronization)
1. Not optimized for “Llano”, which has a 4 core CPU
– Only 2 CPU cores were used
– Can use 2 more cores for LS collision
2. LL collision was not optimized
– The CPU waits while the GPU is constructing the grid
– Use the CPU to improve SS collision

OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
 Cannot split the work by large particle indices
– More than 1 large particle can collide with a small particle
– Have to lock the memory on write → Inefficient
 Prepare a local buffer for each thread
– A buffer storing force on small particles
– Lock free
 Local buffers are merged into one
(Figure: large particles L0 and L1 colliding with small particles S0 and S1, processed by Thread0, Thread1, and Thread2)
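The per-thread buffer idea can be sketched as follows. This is a simplified illustration rather than the presenter's implementation: forces are scalars, the contact list is precomputed, and the names `ls_forces_local`, `ls_collision`, and `contacts` are hypothetical. Each task writes only to its own buffer, so no lock is needed even though two large particles touch the same small particle.

```python
from concurrent.futures import ThreadPoolExecutor


def ls_forces_local(large_ids, contacts, num_small):
    """Accumulate forces on small particles into a buffer private to one task."""
    local = [0.0] * num_small
    for large in large_ids:
        for small, force in contacts.get(large, ()):
            local[small] += force
    return local


def ls_collision(contacts, num_large, num_small, num_threads):
    """Split large particles across threads, then merge the local buffers."""
    chunks = [range(t, num_large, num_threads) for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        buffers = list(
            pool.map(lambda ids: ls_forces_local(ids, contacts, num_small), chunks)
        )
    # Merge step: sum the per-thread buffers element by element.
    return [sum(column) for column in zip(*buffers)]


# Hypothetical contacts: large particle -> [(small particle, scalar force)].
# Small particle 1 is hit by both large particles.
contacts = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 0.5)]}
forces = ls_collision(contacts, num_large=2, num_small=2, num_threads=3)
print(forces)  # [1.0, 2.5]
```

In CPython the threads do not run this loop truly in parallel; the sketch shows the buffer layout and the merge, not the speedup.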

(Figure: timeline across GPU, CPU0, CPU1, and CPU2 showing Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., a single LS Collision task, and Synchronization)

(Figure: timeline across GPU, CPU0, CPU1, and CPU2 showing Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., three LS Collision tasks each followed by a Merge, and Synchronization)

OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
 Spatially coherent memory layout improves cache utilization
 As particles move, spatial locality decreases

SPATIAL SORT
 Sort particles by spatial location to improve cache utilization
– Z curve
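A Z curve ordering interleaves the bits of the cell coordinates, so cells that are close in space tend to be close in memory. The sketch below is a 2D version under stated assumptions: non-negative positions, at most 16 bits per cell coordinate, and hypothetical names `part1by1`, `morton2d`, and `spatial_order`.

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit sits between each pair."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(ix, iy):
    """Interleave the bits of two cell coordinates into one Z curve key."""
    return part1by1(ix) | (part1by1(iy) << 1)


def spatial_order(positions, cell_size):
    """Return particle indices ordered along the Z curve."""
    def key(i):
        x, y = positions[i]
        return morton2d(int(x // cell_size), int(y // cell_size))
    return sorted(range(len(positions)), key=key)


positions = [(3.5, 3.5), (0.5, 0.5), (1.5, 0.5), (0.5, 1.5)]
print(spatial_order(positions, cell_size=1.0))  # [1, 2, 3, 0]
```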

CHOOSE SORT
 Requirements
– Full sort was over the budget
– Full sort is not “a must”
– Sort is an optional computation for performance improvement
– Incremental sort
– Use multiple threads
 Solution
– Used a generalized “Odd-even transition sort”

BLOCK TRANSITION SORT
 Generalized “Odd-even transition sort”
 Instead of sorting 2 adjacent elements, sort 2 adjacent blocks
 Iterate until convergence
 Use a thread to sort 2 adjacent blocks
– 6 blocks for 3 threads
– Radix sort
(Figure: Odd-even transition sort compared with Block transition sort)
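A serial sketch of the block version, assuming it follows the usual odd-even scheme (often called odd-even transposition sort): an even pass sorts block pairs (0,1), (2,3), ..., an odd pass sorts pairs (1,2), (3,4), ..., and the passes repeat until nothing moves. The function names and the `max_iterations` parameter are hypothetical, and Python's built-in `sorted` stands in for the radix sort named on the slide. The pairs inside one pass do not overlap, so each could be handed to its own thread.

```python
def block_transition_pass(keys, block_size, phase):
    """Sort each pair of adjacent blocks in place; return True if anything moved."""
    changed = False
    start = phase * block_size            # phase 0: even pairs, phase 1: odd pairs
    while start + block_size < len(keys):
        end = min(start + 2 * block_size, len(keys))
        merged = sorted(keys[start:end])  # stand-in for a per-pair radix sort
        if merged != keys[start:end]:
            keys[start:end] = merged
            changed = True
        start += 2 * block_size
    return changed


def block_transition_sort(keys, block_size, max_iterations=None):
    """Repeat even and odd passes until convergence or an iteration budget."""
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        even = block_transition_pass(keys, block_size, phase=0)
        odd = block_transition_pass(keys, block_size, phase=1)
        iterations += 1
        if not (even or odd):
            break
    return keys


data = [9, 3, 7, 1, 8, 2, 6, 0, 5, 4, 11, 10]  # 6 blocks of 2 keys
print(block_transition_sort(data, block_size=2))
```

Passing a small `max_iterations` gives the incremental behaviour asked for in the requirements: each frame spends a bounded amount of work and leaves the array closer to sorted.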

(Figure: timeline across GPU, CPU0, CPU1, and CPU2 showing Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., three LS Collision tasks each followed by a Merge, and Synchronization)

(Figure: timeline across GPU, CPU0, CPU1, and CPU2 showing Build Acc. Structure, SS Collision, S Integ., three LS Collision tasks each followed by a Merge, LL Coll., L Integ., Synchronization, three S Sorting tasks, and a second Synchronization)

DEMO
(Legend: GPU work, CPU work)

CONCLUSIONS
 Realized a simulation that handles variable sized particles by leveraging the best features of both the CPU and GPU on AMD’s Fusion Architecture
– The CPU is used for work with non-identical compute granularity
– The GPU is used for highly parallel work
 Memory sharing between the CPU and GPU is the key to efficiency
– Avoids wasteful memory copies

REFERENCE
 Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)
 Justin Hensley, Takahiro Harada, Chapter X OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)
