This document presents a heterogeneous CPU-GPU approach to particle-based simulation with particles of varying sizes. Work involving large particles (large-large and large-small collisions) has irregular granularity and runs on the CPU, while small-small collisions have uniform granularity and run on the GPU. Optimizations include multithreading large-small collisions on the CPU, spatially sorting small particles on otherwise idle CPU cores to improve cache utilization in small-small collision, and balancing the load between the CPU and GPU. The approach uses the complementary strengths of the CPU and GPU on AMD's Fusion architecture to simulate mixed particle systems efficiently.
2. 2 Harada, Heterogeneous Particle-based Simulation
PARTICLE BASED SIMULATION ON THE GPU
Large number of particles
Particles of identical size
– Work granularity is almost the same
– Well suited to wide SIMD architectures
Harada et al. 2007
PARTICLE BASED SIMULATION
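Collision
– $f_{\mathrm{collide}} = \sum_j f_{ij}$ (summed over neighboring particles $j$)
Integration
– $\frac{dv}{dt} = \frac{f}{m}$, discretized as $v \mathrel{+}= \frac{f}{m}\,\Delta t$
– $\frac{dx}{dt} = v$, discretized as $x \mathrel{+}= v\,\Delta t$
An acceleration structure is used for efficient collision detection
– Uniform grid → Suited for the GPU
– Less divergence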
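The integration step above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the `Particle` class and `integrate` function are hypothetical names, the example is one-dimensional for brevity, and it assumes the position update uses the freshly updated velocity, as the order of the two update rules suggests.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    x: float  # position (1D for brevity)
    v: float  # velocity
    m: float  # mass


def integrate(p: Particle, f: float, dt: float) -> None:
    """Advance one particle by one time step with the two update rules."""
    p.v += f / p.m * dt  # v += (f / m) * dt
    p.x += p.v * dt      # x += v * dt, using the updated velocity (assumption)


p = Particle(x=0.0, v=1.0, m=2.0)
integrate(p, f=4.0, dt=0.5)
```

Every particle executes the same two statements, which is one reason this step maps well to a wide SIMD processor.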
PARTICLE BASED SIMULATION ON THE GPU
Particle collision using a uniform grid
void Kernel()
{
    prepare();
    // Collide against the particles in each cell of the 3×3 neighborhood
    collide(Cell0);
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}
[Figure: 3×3 neighborhood of grid cells]
Cell0 Cell1 Cell2
Cell3 Cell4 Cell5
Cell6 Cell7 Cell8
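The kernel above can be illustrated with a small CPU-side sketch of the same idea. It is not the GPU implementation from the slides. The function names are hypothetical, the grid is 2D, and the cell size is assumed to be one particle diameter so that the 3×3 neighborhood is sufficient.

```python
from collections import defaultdict
from math import floor, hypot


def build_grid(positions, cell_size):
    """Map each cell coordinate (cx, cy) to the indices of the particles inside it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(floor(x / cell_size), floor(y / cell_size))].append(i)
    return grid


def colliding_pairs(positions, radius):
    """Return sorted index pairs (i, j), i < j, of particles closer than one diameter."""
    cell_size = 2.0 * radius  # one diameter, so the 3x3 neighborhood suffices
    grid = build_grid(positions, cell_size)
    pairs = set()
    for i, (x, y) in enumerate(positions):
        cx, cy = floor(x / cell_size), floor(y / cell_size)
        # The same fixed loop over 9 cells runs for every particle.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    if i < j:
                        xj, yj = positions[j]
                        if hypot(x - xj, y - yj) < 2.0 * radius:
                            pairs.add((i, j))
    return sorted(pairs)


positions = [(0.0, 0.0), (0.15, 0.0), (1.0, 1.0), (0.0, 0.19)]
pairs = colliding_pairs(positions, radius=0.1)
```

Because every particle visits the same nine cells in the same order, the control flow is nearly identical across particles, which matches the "less divergence" point made for the uniform grid.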
MIXED PARTICLE SIMULATION
Particles are not only small; large particles are mixed in
Difficult for GPUs
– Large particles interact with small particles
– Large-large collision
CHALLENGE
Non-uniform work granularity
– Small-small (SS) collision: uniform → GPU
– Large-large (LL) collision: non-uniform → CPU
– Large-small (LS) collision: non-uniform → CPU
FUSION ARCHITECTURE
CPU and GPU are:
– On the same die
– Much closer
– Efficient data sharing
The CPU and GPU are good at different kinds of work
– CPU: serial computation, conditional branches
– GPU: parallel computation
Work can be dispatched to the better-suited processor:
– Serial work with varying granularity → CPU
– Parallel work with uniform granularity → GPU
MIXED PARTICLE SIMULATION
Benefits from the Fusion Architecture
– Different kinds of work in one simulation
– The CPU and GPU work together
– They share data
TWO SIMULATIONS
Small particles
– Stages: Build Acc. Structure, SS Collision, S Integration
– Data: Position, Velocity, Force, Grid
Large particles
– Stages: Build Acc. Structure, LL Collision, L Integration
– Data: Position, Velocity, Force
LS Collision (between the two simulations)
CLASSIFY BY WORK GRANULARITY
Uniform Work
– Build Acc. Structure (small particles), SS Collision, S Integration
Non Uniform Work
– Build Acc. Structure (large particles), LL Collision, LS Collision, L Integration
Data
– Small particles: Position, Velocity, Force, Grid
– Large particles: Position, Velocity, Force
CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR
GPU
– Build Acc. Structure (small particles), SS Collision, S Integration
CPU
– Build Acc. Structure (large particles), LL Collision, LS Collision, L Integration
Data
– Small particles: Position, Velocity, Force, Grid
– Large particles: Position, Velocity, Force
DATA SHARING
The grid and the small particle data have to be shared with the CPU for LS collision
– Allocated as zero-copy buffers
[Pipeline diagram]
– GPU (small particles): Build Acc. Structure, SS Collision, S Integration
– CPU (large particles): Build Acc. Structure, LL Collision, LS Collision, L Integration
– Shared for LS Collision: Position, Velocity, Grid, Force
SYNCHRONIZATION
The grid and the small particle data have to be shared with the CPU for LS collision
– Allocated as zero-copy buffers
[Pipeline diagram]
– GPU (small particles): Build Acc. Structure, SS Collision, S Integration
– CPU (large particles): Build Acc. Structure, LL Collision, LS Collision, L Integration
– Shared for LS Collision: Position, Velocity, Grid, Force
– Two Synchronization points between the GPU and CPU pipelines
VISUALIZING WORKLOADS
Grid construction can be moved to the end of the pipeline
– Unbalanced workload
[Pipeline diagram]
– GPU (small particles): Build Acc. Structure, SS Collision, S Integration
– CPU (large particles): LL Collision, LS Collision, Synchronization, L Integration
LOAD BALANCING
To get better load balancing
– The sync passes the force buffer filled by the CPU to the GPU
– Move the LL collision after the sync
[Pipeline diagram]
– GPU (small particles): Build Acc. Structure, SS Collision, S Integration
– CPU (large particles): LS Collision, Synchronization, LL Collision, L Integration
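The reordering described on this slide can be mimicked with two ordinary threads and an event. This is only a schematic sketch: the "GPU" is simulated by a Python thread, the stage bodies are replaced by log entries, and the exact stage order on the GPU side (SS collision, then wait, then integration, then grid construction for the next step) is an assumption pieced together from the slides. All names are hypothetical.

```python
import threading


def run_step():
    """Run one simulated time step and return the order in which stages ran."""
    log = []  # list.append is atomic in CPython, so no extra lock is used
    ls_forces_ready = threading.Event()

    def gpu_pipeline():
        log.append("GPU: SS collision")
        ls_forces_ready.wait()  # sync: needs the force buffer filled by the CPU
        log.append("GPU: S integration")
        log.append("GPU: build grid for next step")

    def cpu_pipeline():
        log.append("CPU: LS collision")
        ls_forces_ready.set()  # sync: forces on small particles are complete
        log.append("CPU: LL collision")  # moved after the sync
        log.append("CPU: L integration")

    threads = [threading.Thread(target=gpu_pipeline),
               threading.Thread(target=cpu_pipeline)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_step()
```

Signalling right after the LS collision lets the GPU side continue while the CPU is still busy with LL collision, which is the point of moving LL collision after the sync.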
FURTHER OPTIMIZATION
[Pipeline diagram]
– GPU: Build Acc. Structure, SS Collision, S Integ.
– CPU (CPU0, CPU1, CPU2 shown): LL Collision, L Integ., LS Collision
– Synchronization
1. Not optimized for “Llano”, which is a 4-core CPU
– Only 2 CPU cores were used
– 2 more cores can be used for LS collision
2. LL collision was not optimized
– The CPU waits while the GPU is constructing the grid
– Use the CPU to improve SS collision
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
The work cannot simply be split by large particle index
– More than one large particle can collide with the same small particle
– Writes would have to lock memory → Inefficient
Prepare a local buffer for each thread
– The buffer stores the force on small particles
– Lock free
The local buffers are then merged into one
[Figure: large particles L0, L1 and small particles S0, S1, handled by Thread0, Thread1, Thread2]
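The local-buffer scheme can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the function names, the 1D positions, and the linear repulsion force with a `stiffness` parameter are all hypothetical, and the inner loop tests every small particle rather than looking them up through the shared grid.

```python
from concurrent.futures import ThreadPoolExecutor


def ls_forces_chunk(large_chunk, small_positions, large_radius, small_radius, stiffness):
    """Accumulate forces on small particles from one chunk of large particles (1D)."""
    local = [0.0] * len(small_positions)  # private to this task, so no locking
    for lx in large_chunk:
        for s, sx in enumerate(small_positions):
            overlap = (large_radius + small_radius) - abs(sx - lx)
            if overlap > 0.0:
                direction = 1.0 if sx >= lx else -1.0
                local[s] += stiffness * overlap * direction
    return local


def ls_forces(large_positions, small_positions, num_threads=3, **params):
    """Split the large particles across threads, then merge the local buffers."""
    chunks = [large_positions[t::num_threads] for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        local_buffers = list(pool.map(
            lambda chunk: ls_forces_chunk(chunk, small_positions, **params), chunks))
    # Merge: sum the per-thread buffers element by element.
    return [sum(column) for column in zip(*local_buffers)]


# The small particle at 1.0 is touched by both large particles (at 0.0 and 2.0).
forces = ls_forces([0.0, 2.0], [1.0, 0.5],
                   large_radius=1.0, small_radius=0.25, stiffness=1.0)
```

Each task writes only to its own buffer, so two large particles hitting the same small particle never race. The cost is one extra buffer per thread plus the merge pass. In CPython the threads will not run this loop in parallel, so the sketch shows the data layout rather than the speedup.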
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION (CONT.)
[Pipeline diagram, before]
– GPU: Build Acc. Structure, SS Collision, S Integ.
– CPU0, CPU1, CPU2: LL Collision, L Integ., and a single LS Collision task
– Synchronization
[Pipeline diagram, after]
– GPU: Build Acc. Structure, SS Collision, S Integ.
– CPU0, CPU1, CPU2: LL Collision, L Integ., and three LS Collision tasks, each followed by a Merge
– Synchronization
SPATIAL SORT
Sort particles by spatial location to improve cache utilization
– Z curve
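A Z curve ordering is commonly obtained by interleaving the bits of the cell coordinates (a Morton code) and sorting by the result. The sketch below shows that idea in 2D. The function names are hypothetical, coordinates are assumed to be non-negative with cell indices below 2^16, and a full sort is used here only to show the ordering.

```python
def part1by1(n: int) -> int:
    """Spread the low 16 bits of n so that a zero bit separates each original bit."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(cx: int, cy: int) -> int:
    """Interleave the bits of two cell indices: x in even bits, y in odd bits."""
    return part1by1(cx) | (part1by1(cy) << 1)


def sort_by_z_curve(positions, cell_size):
    """Order 2D positions by the Morton code of the grid cell containing them."""
    def key(p):
        return morton2d(int(p[0] // cell_size), int(p[1] // cell_size))
    return sorted(positions, key=key)


ordered = sort_by_z_curve([(1.5, 1.5), (0.5, 0.5), (0.5, 1.5), (1.5, 0.5)], cell_size=1.0)
```

Particles in nearby cells receive nearby keys, so after sorting, neighbors in space tend to be neighbors in memory.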
CHOOSE SORT
Requirements
– A full sort was over the budget
– A full sort is not “a must”: sorting is an optional computation for performance improvement
– Incremental sort
– Use multiple threads
Solution
– Used a generalized “Odd-even transition sort” (commonly called odd-even transposition sort)
BLOCK TRANSITION SORT
Generalized “Odd-even transition sort”
Instead of sorting 2 adjacent elements, sort 2 adjacent blocks
Iterate until convergence
Use one thread to sort each pair of adjacent blocks
– 6 blocks for 3 threads
– Each pair is sorted with a radix sort
[Figure: Odd-even transition sort compared with Block transition sort]
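The block-based scheme described above can be sketched as follows. This is a reading of the slide, not the author's code: the function names and the `max_passes` parameter are hypothetical, Python's built-in `sorted` stands in for the radix sort, and convergence is assumed to mean that one even pass and one odd pass in a row changed nothing.

```python
from concurrent.futures import ThreadPoolExecutor


def sort_block_pair(keys, start, block_size):
    """Sort the two adjacent blocks beginning at `start`; report whether anything moved."""
    end = min(start + 2 * block_size, len(keys))
    merged = sorted(keys[start:end])  # stand-in for the radix sort on the slide
    changed = merged != keys[start:end]
    keys[start:end] = merged
    return changed


def block_transition_pass(keys, block_size, phase, pool):
    """Phase 0 pairs blocks (0,1), (2,3), ...; phase 1 pairs blocks (1,2), (3,4), ..."""
    starts = range(phase * block_size, len(keys), 2 * block_size)
    # Pairs within one pass are disjoint, so tasks never touch the same elements.
    return any(list(pool.map(lambda s: sort_block_pair(keys, s, block_size), starts)))


def block_transition_sort(keys, block_size, num_threads=3, max_passes=None):
    """Alternate even and odd passes until converged or `max_passes` is reached."""
    passes = 0
    clean_passes = 0
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while clean_passes < 2 and (max_passes is None or passes < max_passes):
            changed = block_transition_pass(keys, block_size, passes % 2, pool)
            clean_passes = 0 if changed else clean_passes + 1
            passes += 1
    return keys


# 12 keys in blocks of 2 give 6 blocks, so an even pass has 3 pairs for 3 threads.
full = block_transition_sort(list(range(11, -1, -1)), block_size=2)
partial = block_transition_sort(list(range(11, -1, -1)), block_size=2, max_passes=1)
```

Capping the number of passes gives the incremental behavior listed in the requirements: each call improves the ordering a little, and the work can stop whenever the time budget runs out because a partially sorted array is still valid input for the simulation.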
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
[Pipeline diagram, before]
– GPU: Build Acc. Structure, SS Collision, S Integ.
– CPU0, CPU1, CPU2: LL Collision, L Integ., and three LS Collision tasks, each followed by a Merge
– Synchronization
[Pipeline diagram, after]
– GPU: Build Acc. Structure, SS Collision, S Integ.
– CPU0, CPU1, CPU2: three LS Collision tasks, each followed by a Merge, then LL Coll. and L Integ.
– CPU0, CPU1, CPU2: three S Sorting tasks
– Synchronization between the stages
CONCLUSIONS
Realized a simulation that handles variable-sized particles by leveraging the best features of both the CPU and GPU on AMD’s Fusion Architecture
– The CPU is used for work with non-uniform compute granularity
– The GPU is used for highly parallel work
Memory sharing between the CPU and GPU is the key to efficiency
– Avoids wasteful memory copies
REFERENCES
Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)
Justin Hensley, Takahiro Harada, Chapter X OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)