HETEROGENEOUS PARTICLE-BASED SIMULATION
Takahiro Harada, AMD
2 Harada, Heterogeneous Particle-based Simulation
PARTICLE BASED SIMULATION ON THE GPU
• Large number of particles
• Particles with identical size
– Work granularity is almost the same
– Good for the wide SIMD architecture
Harada et al. 2007
PARTICLE BASED SIMULATION
• Collision: $f_{\mathrm{collide}} = \sum_{j} f_{ij}$
• Integration: $\frac{dv}{dt} = \frac{f}{m}$ and $\frac{dx}{dt} = v$, discretized as $v \mathrel{+}= \frac{f}{m}\,\Delta t$ and $x \mathrel{+}= v\,\Delta t$
• An acceleration structure is used for efficient collision detection
– Uniform grid → Suited for the GPU
– Less divergence
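The integration step on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `integrate` and the list-based vectors are assumptions, and the position update is assumed to use the already updated velocity, as the order of the two updates on the slide suggests.

```python
def integrate(x, v, f, m, dt):
    """Advance one particle by one time step.

    Follows the update order on the slide:
    v += (f / m) * dt, then x += v * dt (using the new v).
    """
    v_new = [vi + (fi / m) * dt for vi, fi in zip(v, f)]
    x_new = [xi + vi * dt for xi, vi in zip(x, v_new)]
    return x_new, v_new


# One step for a 2D particle pushed downwards (illustrative values only).
x, v = integrate(x=[0.0, 1.0], v=[1.0, 0.0], f=[0.0, -10.0], m=2.0, dt=0.5)
```

Every particle runs exactly the same arithmetic here, which is why this step has uniform work granularity.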
DIVERGENCE ON SIMD
[Figure: SIMD lanes 0 to 7]
void Kernel()
{
    if (A)            // lanes where A holds
        FuncA();
    else if (B)       // lanes where B holds
        FuncB();
    else              // all remaining lanes
        FuncC();
}
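A rough way to see why the branches above hurt is a toy cost model, sketched below. It assumes that the lanes of one SIMD group run in lockstep, so the group pays for every branch that at least one lane takes. The function `simd_cost` and the per-branch costs are hypothetical, and real hardware is more complicated than this.

```python
def simd_cost(lane_branches, branch_cost):
    """Cost for one SIMD group whose lanes take the given branches.

    The group executes every branch taken by at least one lane;
    lanes that did not take a branch sit idle while it runs.
    """
    return sum(branch_cost[b] for b in set(lane_branches))


branch_cost = {"A": 10, "B": 10, "C": 10}  # hypothetical costs per branch

# All 8 lanes take the same branch: the group runs FuncA once.
coherent = simd_cost(["A"] * 8, branch_cost)

# Lanes disagree: the group runs FuncA, FuncB and FuncC one after another.
divergent = simd_cost(["A", "B", "C", "A", "B", "C", "A", "B"], branch_cost)
```

Under this model the divergent group takes three times as long as the coherent one for the same number of lanes.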
PARTICLE BASED SIMULATION ON THE GPU
• Particle collision using a uniform grid
[Figure: SIMD lanes 0 to 7]
void Kernel()
{
    prepare();
    // Every lane visits the same 9 cells in the same order.
    collide(Cell0);
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}
Cell0 Cell1 Cell2
Cell3 Cell4 Cell5
Cell6 Cell7 Cell8
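The 9-cell lookup above can be sketched on the CPU as follows. This is a minimal 2D illustration, not the kernel from the talk: `build_grid`, `neighbors`, and the dictionary-based grid are assumptions, and the cell size is assumed to be at least one particle diameter so that the 3 x 3 block of cells is enough.

```python
from collections import defaultdict


def build_grid(positions, cell_size):
    """Map each cell coordinate to the indices of the particles inside it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(int(x // cell_size), int(y // cell_size))].append(i)
    return grid


def neighbors(i, positions, grid, cell_size, radius):
    """Indices of particles overlapping particle i (all share one radius)."""
    xi, yi = positions[i]
    cx, cy = int(xi // cell_size), int(yi // cell_size)
    found = []
    # Same fixed loop for every particle: 3 x 3 = 9 cells (Cell0..Cell8).
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), ()):
                xj, yj = positions[j]
                if j != i and (xj - xi) ** 2 + (yj - yi) ** 2 < (2 * radius) ** 2:
                    found.append(j)
    return found


points = [(0.5, 0.5), (1.2, 0.5), (3.5, 3.5)]
cells = build_grid(points, cell_size=1.0)
touching = neighbors(0, points, cells, cell_size=1.0, radius=0.5)
```

Because every particle walks the same nine cells, the control flow is identical across particles; only the number of particles found per cell varies.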
MIXED PARTICLE SIMULATION
• Not only small particles
• Difficult for GPUs
– Large particles interact with small particles
– Large-large collision
CHALLENGE
• Non-uniform work granularity
– Small-small (SS) collision: uniform, GPU
– Large-large (LL) collision: non-uniform, CPU
– Large-small (LS) collision: non-uniform, CPU
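The classification on this slide can be written down as a small lookup. The sketch below is only illustrative: the threshold `small_radius`, the function `classify_pair`, and the `PROCESSOR` table are assumed names, not part of the original implementation.

```python
def classify_pair(radius_a, radius_b, small_radius):
    """Return "SS", "LS" or "LL" for a pair of particle radii."""
    large_a = radius_a > small_radius
    large_b = radius_b > small_radius
    if large_a and large_b:
        return "LL"
    if large_a or large_b:
        return "LS"
    return "SS"


# Processor assignment as listed on the slide.
PROCESSOR = {"SS": "GPU", "LS": "CPU", "LL": "CPU"}

kind = classify_pair(0.1, 2.0, small_radius=0.1)  # one small, one large
target = PROCESSOR[kind]
```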
FUSION ARCHITECTURE
• CPU and GPU are:
– On the same die
– Much closer
– Able to share data efficiently
• CPU and GPU are good at different kinds of work
– CPU: serial computation, conditional branches
– GPU: parallel computation
• Work can be dispatched accordingly:
– Serial work with varying granularity → CPU
– Parallel work with uniform granularity → GPU
MIXED PARTICLE SIMULATION
• Benefits from the Fusion Architecture
– Different kinds of work in one simulation
– CPU and GPU work together
– They share data
METHOD
TWO SIMULATIONS
[Figure: two pipelines. Small particles: Build Acc. Structure, SS Collision, S Integration, using Position, Velocity, Force and Grid buffers. Large particles: Build Acc. Structure, LL Collision, L Integration, using Position, Velocity and Force buffers. LS Collision connects the two.]
CLASSIFY BY WORK GRANULARITY
[Figure: the stages of the small-particle and large-particle pipelines (Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, L Integration) grouped into Uniform Work and Non Uniform Work.]
CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR
[Figure: the stages of the small-particle and large-particle pipelines (Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, L Integration) grouped by work granularity, with each group assigned to the GPU or the CPU.]
DATA SHARING
• The grid and the small-particle data have to be shared with the CPU for LS collision
– Allocated as zero-copy buffers
[Figure: GPU stages (Build Acc. Structure, SS Collision, S Integration) and CPU stages (Build Acc. Structure, LL Collision, LS Collision, L Integration). The small-particle Position, Velocity, Force and Grid buffers are shared with LS Collision on the CPU.]
SYNCHRONIZATION
• The grid and the small-particle data have to be shared with the CPU for LS collision
– Allocated as zero-copy buffers
[Figure: GPU stages (Build Acc. Structure, SS Collision, S Integration) and CPU stages (Build Acc. Structure, LL Collision, LS Collision, L Integration) sharing the small-particle Position, Velocity, Force and Grid buffers, with two synchronization points between the GPU and the CPU.]
VISUALIZING WORKLOADS
• Grid construction can be moved to the end of the pipeline
– The workload is unbalanced
[Figure: GPU and CPU timelines. GPU: Build Acc. Structure, SS Collision, S Integration. CPU: LL Collision, LS Collision, L Integration. One synchronization point.]
LOAD BALANCING
• To get better load balancing
– The sync exists to pass the force buffer filled by the CPU to the GPU
– Move the LL collision after the sync
[Figure: GPU and CPU timelines. GPU: Build Acc. Structure, SS Collision, S Integration. CPU: LS Collision, LL Collision, L Integration, with LL Collision placed after the synchronization point.]
[Figure: GPU work and CPU work]
MULTI THREADING (4 THREADS)
FURTHER OPTIMIZATION
[Figure: timelines for GPU, CPU0, CPU1 and CPU2. GPU: Build Acc. Structure, SS Collision, S Integ. CPU: LL Collision, L Integ., LS Collision. One synchronization point.]
1. Not optimized for “Llano”, which is a 4-core CPU
– Only 2 CPU cores were used
– Can use 2 more cores for LS collision
2. LL collision was not optimized
– The CPU waits while the GPU is constructing the grid
– Use the CPU to improve SS collision
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
• Cannot simply split the work by large particle indices
– More than 1 large particle can collide with the same small particle
– Have to lock the memory on write → Inefficient
• Prepare a local buffer for each thread
– A buffer storing the force on small particles
– Lock free
• Local buffers are merged into one
[Figure: large particles L0 and L1, small particles S0 and S1, and threads Thread0, Thread1, Thread2]
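The local-buffer idea on this slide can be sketched as below. This is a simplified illustration under several assumptions: forces are scalars, the contact list `contacts` (large index to list of `(small index, force)` pairs) is already known, and the names `ls_forces_local` and `ls_forces` are hypothetical. Python threads are used only to show the data layout; they are not meant as a performance model.

```python
from concurrent.futures import ThreadPoolExecutor


def ls_forces_local(large_ids, contacts, num_small):
    """Accumulate forces from some large particles into a private buffer."""
    local = [0.0] * num_small
    for large in large_ids:
        for small, force in contacts.get(large, ()):
            local[small] += force  # private buffer, so no lock is needed
    return local


def ls_forces(contacts, num_large, num_small, num_threads=3):
    """Split large particles across threads, then merge the local buffers."""
    chunks = [range(t, num_large, num_threads) for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        buffers = list(
            pool.map(lambda ids: ls_forces_local(ids, contacts, num_small), chunks)
        )
    # Merge step: sum the per-thread buffers element by element.
    return [sum(column) for column in zip(*buffers)]


# L0 touches S0 and S1; L1 also touches S1 (illustrative values).
contacts = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 0.5)]}
force_on_small = ls_forces(contacts, num_large=2, num_small=2)
```

Small particle S1 receives force from both large particles, which is exactly the case that would need a lock if all threads wrote to one shared buffer.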
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
[Figure: timelines for GPU, CPU0, CPU1 and CPU2. GPU: Build Acc. Structure, SS Collision, S Integ. CPU: LL Collision, L Integ. and a single LS Collision block. One synchronization point.]
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
[Figure: timelines for GPU, CPU0, CPU1 and CPU2. GPU: Build Acc. Structure, SS Collision, S Integ. CPU: three LS Collision blocks and three Merge blocks across the CPU cores, plus LL Collision and L Integ. Two synchronization points.]
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
• Spatially coherent memory layout improves cache utilization
• As particles move, spatial locality decreases
SPATIAL SORT
• Sort particles by spatial location to improve cache utilization
– Z curve
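A Z-curve sort key can be computed by interleaving the bits of the grid cell coordinates (a Morton code). The sketch below is a 2D version for brevity, limited to 16 bits per axis. The helper names `part1by1` and `morton2d` are assumptions, and the deck does not say how its keys were actually built.

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit separates each of them."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(cx, cy):
    """Interleave the bits of two cell coordinates into one Z-curve key."""
    return part1by1(cx) | (part1by1(cy) << 1)


# Sorting cells by key visits them in Z order.
cells_in_z_order = sorted([(1, 1), (0, 0), (1, 0), (0, 1)], key=lambda c: morton2d(*c))
```

Particles sorted by the key of their cell end up near their spatial neighbors in memory, which is the cache benefit the slide refers to.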
CHOOSE SORT
• Requirements
– A full sort was over the time budget
– A full sort is not “a must”
– Sort is an optional computation for performance improvement
– Incremental sort
– Use multiple threads
• Solution
– Used a generalized “Odd-even transition sort” (commonly called odd-even transposition sort)
BLOCK TRANSITION SORT
• Generalized “Odd-even transition sort”
• Instead of sorting 2 adjacent elements, sort 2 adjacent blocks
• Iterate until convergence
• Use a thread to sort 2 adjacent blocks
– 6 blocks for 3 threads
– Radix sort
[Figure: odd-even transition sort compared with block transition sort]
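The block-based variant described above can be sketched as follows. This is a sequential illustration under stated assumptions: the names `block_transition_pass` and `block_transition_sort` are hypothetical, Python's built-in `sorted` stands in for the radix sort named on the slide, and the block pairs within one phase are processed in a loop rather than on separate threads.

```python
def block_transition_pass(keys, block_size, phase):
    """Sort each pair of adjacent blocks in place; return True if keys moved.

    Phase 0 pairs blocks (0,1), (2,3), ...; phase 1 pairs (1,2), (3,4), ...
    Pairs within one phase do not overlap, so each could go to its own thread.
    """
    changed = False
    start = phase * block_size
    for lo in range(start, len(keys) - block_size, 2 * block_size):
        hi = min(lo + 2 * block_size, len(keys))
        merged = sorted(keys[lo:hi])  # stand-in for the radix sort on the slide
        if merged != keys[lo:hi]:
            keys[lo:hi] = merged
            changed = True
    return changed


def block_transition_sort(keys, block_size):
    """Alternate the two phases until a full round changes nothing."""
    while True:
        moved_even = block_transition_pass(keys, block_size, 0)
        moved_odd = block_transition_pass(keys, block_size, 1)
        if not (moved_even or moved_odd):
            return keys


# 12 keys in 6 blocks of 2: phase 0 has 3 independent pairs, one per thread.
data = [9, 3, 7, 1, 8, 2, 6, 0, 5, 4, 11, 10]
result = block_transition_sort(list(data), block_size=2)
```

Each pass leaves the keys more ordered than before, so a simulation could run only a pass or two per frame and still benefit, which matches the incremental-sort requirement. The sketch assumes `block_size` is smaller than the number of keys.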
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
[Figure: timelines for GPU, CPU0, CPU1 and CPU2. GPU: Build Acc. Structure, SS Collision, S Integ. CPU: three LS Collision blocks and three Merge blocks across the CPU cores, plus LL Collision and L Integ. Two synchronization points.]
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
[Figure: timelines for GPU, CPU0, CPU1 and CPU2. GPU: Build Acc. Structure, SS Collision, S Integ. CPU: three LS Collision blocks, three Merge blocks and three S Sorting blocks across the CPU cores, plus LL Coll. and L Integ. Three synchronization points.]
DEMO
[Figure: GPU work and CPU work]
CONCLUSIONS
• Realized a simulation that handles variable-sized particles by leveraging the best features of both the CPU and GPU on AMD’s Fusion Architecture
– The CPU is used for work with non-identical compute granularity
– The GPU is used for highly parallel work
• Memory sharing between the CPU and GPU is the key to efficiency
– Avoids wasteful memory copies
REFERENCE
• Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)
• Justin Hensley, Takahiro Harada, Chapter X OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)

Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)
