High-Performance Physics Solver Design
for Next Generation Consoles
Vangelis Kokkevis
Steven Osman
Eric Larsen
Simulation Technology Group
Sony Computer Entertainment America
US R&D
This Talk
Optimizing physics simulation on a multi-core
architecture.
Focus on CELL architecture
Variety of simulation domains
Cloth, Rigid Bodies, Fluids, Particles
Practical advice based on real case-studies
Demos!
Basic Issues
Looking for opportunities to parallelize processing
High Level – Many independent solvers on multiple cores
Low Level – One solver, one/multiple cores
Coding with small memory in mind
Streaming
Batching up work
Software Caching
Speeding up processing within each unit
SIMD processing, instruction scheduling
Double-buffering
Parallelizing/optimizing existing code
What is not in this talk?
Details on specific physics algorithms
Too much material for a 1-hour talk
Will provide references to techniques
Much insight on non-CELL platforms
Concentrate on actual results
Concepts should be applicable beyond CELL
The Cell Processor Model
[Diagram: PPU with L1/L2 cache and eight SPEs — each SPE contains an SPU (SPU0–SPU7), a DMA engine, and a 256K Local Store — all connected to Main Memory]
Physics on CELL
Physics should happen mostly on SPUs
There are more of them!
SPUs have greater bandwidth & performance
PPU is busy doing other stuff
[Diagram: the same layout — PPU with L1/L2 and eight SPEs (SPU0–SPU7, each with DMA and 256K LS) sharing Main Memory — with the physics workload mapped onto the SPUs]
SPU Performance Recipe
Large bandwidth to and from main memory
Quick (1-cycle) LS memory access
SIMD instruction set
Concurrent DMA and processing
Challenges:
Limited LS size, shared between code and data
Random accesses of main memory are slow
Cloth Simulation
Cloth mesh simulated as point masses
(vertices) connected via distance constraints
(edges).
[Diagram: mesh triangle with point masses m1, m2, m3 connected by distance constraints d1, d2, d3]
References:
T. Jacobsen, Advanced Character Physics, GDC 2001
A. Meggs, Taking Real-Time Cloth Beyond Curtains, GDC 2005
Simulation Step
1. Compute external forces, fE, per vertex
2. Compute new vertex positions [ Integration ]:
   p(t+1) = (2·p(t) − p(t−1)) + (1/2) · fE · (1/m) · Δt²
3. Fix edge lengths
   Adjust vertex positions
4. Correct penetrations with collision geometry
   Adjust vertex positions
How many vertices?
How many vertices fit in 256K (less actually)?
A lot, surprisingly…
Tips:
Look for opportunities to stream data
Keep in LS only data required for each step
Integration Step
Less than 4000 verts in 200K of memory
16 + 16 + 16 + 4 = 52 bytes / vertex
We don't need to keep them all in LS
Keep vertex data in main memory and bring it in block by block
p(t+1) = (2·p(t) − p(t−1)) + (1/2) · fE · (1/m) · Δt²
Streaming Integration
[Animated diagram over several slides: the per-vertex arrays p(t), p(t−1), fE, and 1/m live in Main Memory, each divided into blocks B0–B3. The SPU repeatedly DMA_INs the next block into Local Store, processes it, and DMA_OUTs the result: DMA_IN B0 → Process B0 → DMA_OUT B0, overlapped with DMA_IN B1 → Process B1 → DMA_OUT B1, and so on through B3.]
Double-buffering
Take advantage of concurrent DMA and
processing to hide transfer times
Without double-buffering:
DMA_IN B0 → Process B0 → DMA_OUT B0 → DMA_IN B1 → Process B1 → DMA_OUT B1 → …
With double-buffering:
DMA_IN B0 → [Process B0 ∥ DMA_IN B1] → [Process B1 ∥ DMA_OUT B0, DMA_IN B2] → [Process B2 ∥ DMA_OUT B1, DMA_IN B3] → …
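A minimal sketch of the double-buffered streaming loop. memcpy stands in for the asynchronous DMA transfers (mfc_get/mfc_put on a real SPU), so the actual overlap isn't modeled — only the buffer rotation; BLOCK and the doubling "processing" step are made up for illustration:

```cpp
#include <cstring>
#include <cstddef>

const size_t BLOCK = 256;              // elements per block (assumption)

void stream_process(float* mainMem, size_t numBlocks) {
    float ls[2][BLOCK];                // two local-store buffers
    int cur = 0;
    std::memcpy(ls[cur], mainMem, BLOCK * sizeof(float));        // DMA_IN B0
    for (size_t b = 0; b < numBlocks; ++b) {
        int next = cur ^ 1;
        if (b + 1 < numBlocks)                                   // DMA_IN next block
            std::memcpy(ls[next], mainMem + (b + 1) * BLOCK, BLOCK * sizeof(float));
        for (size_t i = 0; i < BLOCK; ++i)                       // Process current block
            ls[cur][i] *= 2.0f;
        std::memcpy(mainMem + b * BLOCK, ls[cur], BLOCK * sizeof(float)); // DMA_OUT
        cur = next;                                              // swap buffers
    }
}
```

With real non-blocking DMA, the fetch of block b+1 and the write-back of block b−1 proceed while block b is being processed, which is exactly the overlap the timeline above shows.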
Streaming Data
Streaming is possible when the data access
pattern is simple and predictable (e.g. linear)
Number of verts processed per frame depends on
processing speed and bandwidth but not LS size
Unfortunately, not every step in the cloth
solver can be fully streamed
Fixing edge lengths requires random memory
access…
Fixing Edge Lengths
Points coming out of the integration step don’t
necessarily satisfy edge distance constraints
struct Edge
{
    int   v1;
    int   v2;
    float restLen;
};

// Move p[v1] and p[v2] along the edge direction until the
// edge has length restLen again:
Vector3 d = p[v2] - p[v1];
float len  = sqrt(dot(d, d));
float diff = (len - restLen) / len;
p[v1] += d * 0.5f * diff;
p[v2] -= d * 0.5f * diff;
Fixing Edge Lengths
An iterative process: Fix one edge at a time
by adjusting 2 vertex positions
Requires random access to particle positions
array
Solution:
Keep all particle positions in LS
Stream in edge data
In 200K of LS we can fit 200KB / 16B > 12K vertex positions
Rigid Bodies
Our group is currently porting the AGEIA™
PhysX™ SDK to CELL
Large codebase written with a PC
architecture in mind
Assumes easy random access to memory
Processes tasks sequentially (no parallelism)
Interesting example of how to port existing
code to a multi-core architecture
Starting the Port
Determine all the stages of the rigid body
pipeline
Look for stages that are good candidates for
parallelizing/optimizing
Profile code to make sure we are focusing on
the right parts
Rigid Body Pipeline
(Input: current body positions)
Broadphase Collision Detection
  → Potentially colliding body pairs
Narrowphase Collision Detection
  → Points of contact between bodies
Constraint Prep
  → Constraint Equations
Constraint Solve
  → Updated body velocities
Integration
  → New body positions
Rigid Body Pipeline
[Diagram: the same pipeline with the Narrowphase (NP), Constraint Prep (CP), Constraint Solve (CS), and Integration (I) stages each split into multiple parallel units after Broadphase]
Profiling Scenario
Profiling Results
Cumulative Frame Time
[Chart: cumulative frame time in microseconds (0–70,000) over ~2000 frames, stacked by stage: BROADPHASE, NARROWPHASE, CONSTRAINT_PREP, SOLVER, INTEGRATION, Other]
Running on the SPUs
Three steps:
1. (PPU) Pre-process
“Gather” operation (extract data from PhysX data
structures and pack it in MM)
2. (SPU) Execute
DMA packed data from MM to LS
Process data and store output in LS
DMA output to MM
3. (PPU) Post-process
“Scatter” operation (unpack output data and put back in
PhysX data structures)
Why Involve the PPU?
Required PhysX data is not conveniently packed
Data is often not aligned
We need to use PhysX data structures to avoid
breaking features we haven’t ported
Solutions:
Use list DMAs to bring in data
Modify existing code to force alignment
Change PhysX code to work with new data structures
Batching Up Work
Create work batches for each task
[Diagram: PPU Pre-Process reads the PhysX data structures and writes task descriptions plus batch input/output buffers into work-batch buffers in MM; SPU Execute consumes the batches; PPU Post-Process scatters the results back into the PhysX data structures]
Narrow-phase Collision Detection
Problem:
A list of object pairs that may be colliding
Want to do contact processing on SPUs
Pairs list has references to geometry
[Diagram: bodies A, B, C with pair list (A,C), (A,B), (B,C), …]
Narrow-phase Collision Detection
Data locality
Same bodies may be in several pairs
Geometry may be instanced for different bodies
SPU memory access
Can only access main memory with DMA
No hardware cache
Data reuse must be explicit
Software Cache
Idea: make a (read-only) software cache
Cache entry is one geometric object
Entries have variable size
Basic operation
SPU checks cache for object
If not in cache, object fetched with DMA
Cache returns a local address for object
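A toy illustration of that lookup path. std::unordered_map and memcpy stand in for the hand-rolled hash table and the DMA fetch; the two-buffer replacement scheme from the next slides is omitted, and all names here are assumptions:

```cpp
#include <unordered_map>
#include <cstring>
#include <cstddef>

// Read-only software cache: maps a main-memory address to a
// local-store copy; a miss "DMAs" (memcpy) the object in and
// appends it to the buffer.
struct SoftCache {
    std::unordered_map<const void*, void*> table;  // MM address -> LS address
    char buffer[4096];                             // stand-in local store
    std::size_t used = 0;

    void* lookup(const void* mmAddr, std::size_t size) {
        auto it = table.find(mmAddr);
        if (it != table.end()) return it->second;  // hit: already resident
        void* ls = buffer + used;                  // miss: fetch with "DMA"
        std::memcpy(ls, mmAddr, size);
        used += size;
        table[mmAddr] = ls;
        return ls;
    }
};
```

The caller always receives a local address, so the rest of the narrowphase code never needs to know whether the geometry was already resident.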
Software Cache
Data Structures
Two entry buffers
New entries appended to “current” buffer
Hash-table used to record and find loaded entries
[Diagram: entries A, B, C appended to Buffer 0 (the current buffer), with the next DMA target following them; Buffer 1 is the second buffer]
Software Cache
Data Replacement
When space runs out in a buffer
Overwrite data in second buffer
Considerations
Does not fragment memory
No searches for free space
But does not prefer frequently used data
Software Cache
Hiding the DMA latency
Double-buffering
Start DMA for un-cached entries
Process previously DMA’d entries
Process/pre-fetch batches
Fetch and compute times vary
Batching may improve balance
DMA-lists useful
One DMA command
Multiple chunks of data gathered
[Diagram: the current buffer holds entries D, F, E still being fetched by DMA while the previously fetched entries A, B, C are processed]
Software Caching
Conclusions
Simple cache is practical
Used for small convex objects in PhysX
Design considerations
Tradeoff of cache-logic cycles vs. bandwidth saved
Pre-fetching is important to include
Single SPU Performance
PPU only:   [ Exec ]
PPU + SPU:  PPU [ PreP ] → SPU [ Exec ] → PPU [ PostP ], with the PPU free while the SPU executes
SPU Exec < PPU Exec: SIMD + fast mem access
Multiple SPU Performance
Pre- and Post- processing times determine
how many SPUs can be used effectively
Multiple SPU Performance
[Diagram: timelines for 1, 2, and 3 SPUs — the PPU serially pre/post-processes batches 1–4 while the SPUs execute them; with more SPUs the executes overlap, but the serial PPU portion limits scaling]
PPU vs SPU comparisons
Convex Stack (500 boxes)
[Chart: frame time in microseconds (0–80,000) over ~1300 frames for the PPU-only, 1-SPU, 2-SPUs, 3-SPUs, and 4-SPUs configurations]
Duck Demo
One of our first CELL demos (spring 2005)
Several interacting physics systems:
Rigid bodies (ducks & boats)
Height-field water surface
Cloth with ripping (sails)
Particle based fluids (splashes + cups)
Duck Demo (Lots of Ducks)
Duck Demo
Ambitious project with short deadline
Early PC prototypes of some pieces
Most straightforward way to parallelize:
Dedicate one SPU for each subsystem
Each piece could be developed and tested
individually
Duck Demo Resource Allocation
PPU – main loop
SPU thread synchronization, draw calls
SPU0 – height field water (<50%)
SPU1 – splashes iso-surface (<50%)
SPU2 – cloth sails for boat 1 (<50%)
SPU3 – cloth sails for boat 2 (<50%)
SPU4 – rigid body collision/response (95%)
[Diagram: one-frame timeline showing HF water, iso-surface, cloth ×2, and rigid bodies running concurrently on their SPUs]
Parallelization Recipe
A three-step approach to code
parallelization:
1. Find independent components
2. Run them side-by-side
3. Recursively apply recipe to components
Challenges
Step 1: Find independent components
Where do you look?
Maybe you need to break apart and overlap
your data?
e.g. Broad phase collision detection
Maybe you need to break apart your loop into
individual iterations?
e.g. Solving cloth constraints
Broad Phase Collision Detection
Need to test 600 rigid bodies against each other.
Split the 600 objects into three groups of 200: A, B, C
Test A vs B, A vs C, and B vs C
We can execute all three of these simultaneously
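The grouping idea can be sketched as follows — a hypothetical helper that enumerates the independent group-vs-group batches (within-group pairs also need testing and are omitted here, as on the slide):

```cpp
#include <vector>
#include <utility>

// Split an all-pairs broadphase across groups: each group-vs-group
// batch touches disjoint pair sets, so each can run on its own core.
std::vector<std::pair<int,int>> group_pairs(int numGroups) {
    std::vector<std::pair<int,int>> batches;
    for (int a = 0; a < numGroups; ++a)
        for (int b = a + 1; b < numGroups; ++b)
            batches.push_back({a, b});   // e.g. (A,B), (A,C), (B,C)
    return batches;
}
```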
Cloth Solving
[Diagram: cloth mesh split into two halves A and B joined by a shared strip C]

for (i = 1 to 5) {
    cloth = solve(cloth)
}

for (i = 1 to 5) {
    solve_on_proc1(a);
    solve_on_proc2(b);
    wait_for_all();
    solve_on_proc1(c);
    wait_for_all();
}
…challenges
Step 2: Run them side-by-side
Bandwidth and cache issues
Need good data layout to avoid thrashing cache
or bus
Processor issues
Need efficient processor management scheme
What if the job sizes are very different?
e.g. a suit of cloth and a separate neck tie
Need further refinement of large jobs, or you only
save on the small neck tie time
…challenges
Step 3: Recurse
When do you stop?
Overhead of launching smaller jobs
Synchronization when a stage is done
e.g. Gather results from all collision detection before
solving
But this can go down to the instruction level
e.g. Using Structure-of-Arrays, transform four
independent vectors at once
High Level Parallelization:
Duck Demo
[Diagram: Fluid Simulation → Fluid Surface (dependency exists); Rigid Bodies and Cloth Sails are independent; Cloth Sails further split into Cloth Boat 1 and Cloth Boat 2]
Note that the parts didn't take an equal amount of time to run. We could have done better given time! But cloth was for multiple boats.
Lower Level Parallelization
Rigid Body Simulation (600 bodies example)
[Diagram: Broad Phase Collision Detection, Narrow Phase Collision Detection, and Constraint Solving each split across Proc 1, Proc 2, and Proc 3, operating on object groups A, B, and C]
Structure of Arrays
[Diagram: Array of Structures or "AoS" stores Data[0..7] with X, Y, Z, W interleaved per element; Structure of Arrays or "SoA" stores all X components together, then all Y, all Z, all W. One AoS vector holds the X, Y, Z, W of a single element; one SoA vector holds the same component from four consecutive elements.]
Bonus!
Since W is almost always 0 or 1, we can eliminate it with a
clever math library and save 25% memory and bandwidth!
Lowest Level Parallelization:
Structure-of-Array processing of Particles
Given:
pn(t) = position of particle n at time t
vn(t) = velocity of particle n at time t
p1(ti) = p1(ti−1) + v1(ti−1) · dt + 0.5 · G · dt²
p2(ti) = p2(ti−1) + v2(ti−1) · dt + 0.5 · G · dt²
…
Note they are independent of each other
So we can run four together using SoA:
p{1-4}(ti) = p{1-4}(ti−1) + v{1-4}(ti−1) · dt + 0.5 · G · dt²
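A scalar sketch of that SoA update (the names are illustrative, and gravity is applied to y only as an assumption): with each component in its own array, every line of the inner loop maps onto a single four-wide SIMD operation on the SPU.

```cpp
// SoA particle update: one pass advances four particles at once.
// p(ti) = p(ti-1) + v(ti-1)*dt + 0.5*G*dt^2, with G acting on y.
void step_soa(float px[4], float py[4], float pz[4],
              const float vx[4], const float vy[4], const float vz[4],
              float dt, float g) {
    for (int n = 0; n < 4; ++n) {   // each line = one SIMD op in SoA form
        px[n] += vx[n] * dt;
        py[n] += vy[n] * dt + 0.5f * g * dt * dt;
        pz[n] += vz[n] * dt;
    }
}
```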
Failure Case
Gauss Seidel Solver
Consider a simple position-based solver that
uses distance constraints. Given:
p=current positions of all objects
solve(cn, p) takes p and constraint cn and computes a new p
that satisfies cn
p=solve(c0, p)
p=solve(c1, p)
…
Note that to solve c1, we need the result of c0.
Can’t solve c0 and c1 concurrently!
Failure Case
Possible Solutions
Generally, you're out of luck, but…
Some cases have very limited dependencies
e.g. particle-based cloth solving
Solution: Arrange constraints such that no four
adjacent constraints share cloth particles
Consider a different solver
e.g. Jacobi solvers don’t use updated values until all
constraints have been processed once
But they need more memory (pnew and pcurrent)
And may need more iterations to converge
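A minimal 1-D sketch of the Jacobi variant (hypothetical types; the distance-constraint math mirrors the cloth slide): every constraint reads only pCur and writes its correction into pNew, so all constraints within one iteration are independent and can be solved concurrently.

```cpp
#include <cmath>

struct C { int a, b; float rest; };   // distance constraint between a and b

// One Jacobi iteration over n constraints on np 1-D positions:
// reads pCur, accumulates corrections into pNew.
void jacobi_iter(const float* pCur, float* pNew, const C* cs, int n, int np) {
    for (int i = 0; i < np; ++i) pNew[i] = pCur[i];
    for (int i = 0; i < n; ++i) {          // each iteration is independent
        float d    = pCur[cs[i].b] - pCur[cs[i].a];
        float len  = std::fabs(d);
        float diff = (len - cs[i].rest) / len;
        pNew[cs[i].a] += d * 0.5f * diff;
        pNew[cs[i].b] -= d * 0.5f * diff;
    }
}
```

This is exactly the memory trade-off the slide mentions: two position arrays instead of one, in exchange for constraint-level parallelism.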
Duck Demo (EyeToy + SPH)
Smoothed Particle Hydrodynamics
(SPH) Fluid Simulation
Smoothed-particles
Mass distributed around a point
Density falls to 0 at a radius h
Forces between particles closer than 2h
[Diagram: smoothed particle with density falling to zero at radius h]
SPH Fluid Simulation
High-level parallelism
Put particles in grid cells
Process on different SPUs
(Not used in duck demo)
Low-level parallelism
SIMD and dual-issue on SPU
Large n per cell may be better
Less grid overhead
Loops fast on SPU
SPH Loop
Consider two sets of particles P and Q
E.g., taken from neighbor grid cells
O(n²) problem
Can unroll (e.g., by 4):

for (i = 0; i < numP; i++)
    for (j = 0; j < numQ; j += 4) {
        compute_force(pi, qj);
        compute_force(pi, qj+1);
        compute_force(pi, qj+2);
        compute_force(pi, qj+3);
    }
SPH Loop, SoA
Idea:
Increase SIMD throughput with structure-of-arrays
Transpose and produce combinations
[Diagram: pi and qj … qj+3 transposed from per-particle (x, y, z) layout into SoA p and SoA q vectors, so one SIMD operation combines pi with four q particles at once]
SPH Loop, Software Pipelined
Add software pipelining
Conversion instructions can dual-issue with math
[Diagram: pipeline stages Load[i] → To SoA[i] → Compute[i] → From SoA[i] → Store[i]; in the software-pipelined schedule, Pipe 0 runs Compute[i] while Pipe 1 runs Load[i+1], Store[i−1], To SoA[i+1], and From SoA[i−1]]
Recap
Finding independence is hard!
Across subsystems or within subsystems?
Across iterations or within iterations?
Data level independence?
Instruction level independence?
How about “bandwidth level” independence?
Parallelization overhead
Sometimes running serially wins once the overhead of
parallelization is accounted for
Particle Simulation Demo
Questions?
http://www.research.scea.com/
Contacts:
Vangelis Kokkevis: vangelis_kokkevis@playstation.sony.com
Eric Larsen: eric_larsen@playstation.sony.com
Steven Osman: steven_osman@playstation.sony.com

Apidays New York 2024 - The value of a flexible API Management solution for O...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

High-Performance Physics Solver Design for Next Generation Consoles

  • 1. High-Performance Physics Solver Design for Next Generation Consoles Vangelis Kokkevis Steven Osman Eric Larsen Simulation Technology Group Sony Computer Entertainment America US R&D
  • 2. This Talk Optimizing physics simulation on a multi-core architecture. Focus on CELL architecture Variety of simulation domains Cloth, Rigid Bodies, Fluids, Particles Practical advice based on real case-studies Demos!
  • 3. Basic Issues Looking for opportunities to parallelize processing High Level – Many independent solvers on multiple cores Low Level – One solver, one/multiple cores Coding with small memory in mind Streaming Batching up work Software Caching Speeding up processing within each unit SIMD processing, instruction scheduling Double-buffering Parallelizing/optimizing existing code
  • 4. What is not in this talk? Details on specific physics algorithms Too much material for a 1-hour talk Will provide references to techniques Much insight on non-CELL platforms Concentrate on actual results Concepts should be applicable beyond CELL
  • 5. The Cell Processor Model [Diagram: a PPU with L1/L2 cache and eight SPEs, each containing an SPU (SPU0–SPU7), a DMA engine, and 256K of local store (LS), all connected to main memory.]
  • 6. Physics on CELL Physics should happen mostly on SPUs There’s more of them! SPUs have greater bandwidth & performance PPU is busy doing other stuff [Diagram: the same PPU + eight-SPU layout as the previous slide.]
  • 7. SPU Performance Recipe Large bandwidth to and from main memory Quick (1-cycle) LS memory access SIMD instruction set Concurrent DMA and processing Challenges: Limited LS size, shared between code and data Random accesses of main memory are slow
  • 9. Cloth Simulation Cloth mesh simulated as point masses (vertices) connected via distance constraints (edges). [Diagram: a mesh triangle with masses m1, m2, m3 joined by distance constraints d1, d2, d3.] References: T. Jacobsen, Advanced Character Physics, GDC 2001; A. Meggs, Taking Real-Time Cloth Beyond Curtains, GDC 2005
  • 10. Simulation Step 1. Compute external forces, fE, per vertex 2. Compute new vertex positions [ Integration ]: p(t+1) = (2·p(t) − p(t−1)) + (1/2) · fE · (1/m) · Δt² 3. Fix edge lengths Adjust vertex positions 4. Correct penetrations with collision geometry Adjust vertex positions
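The integration step above can be sketched in plain scalar C++ — a minimal sketch, not the streamed SIMD SPU version; the `Vec3` type and function name are illustrative, and the ½·fE·(1/m)·Δt² form follows the slide:

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };

// One Verlet integration step for a block of vertices:
//   p_new = 2*p - p_prev + 0.5 * (fE / m) * dt^2
// Writes the new position into p and the previous position into pPrev.
void integrate(Vec3* p, Vec3* pPrev, const Vec3* fE,
               const float* invMass, std::size_t n, float dt) {
    const float h = 0.5f * dt * dt;
    for (std::size_t i = 0; i < n; ++i) {
        Vec3 cur = p[i];
        float s = h * invMass[i];
        p[i].x = 2.0f * cur.x - pPrev[i].x + s * fE[i].x;
        p[i].y = 2.0f * cur.y - pPrev[i].y + s * fE[i].y;
        p[i].z = 2.0f * cur.z - pPrev[i].z + s * fE[i].z;
        pPrev[i] = cur; // current position becomes last frame's position
    }
}
```

Velocity never appears explicitly: it is implicit in the difference between `p` and `pPrev`, which is what makes this scheme so cheap to stream.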
  • 11. How many vertices? How many vertices fit in 256K (less actually)? A lot, surprisingly… Tips: Look for opportunities to stream data Keep in LS only data required for each step
  • 12. Integration Step p(t+1) = (2·p(t) − p(t−1)) + (1/2) · fE · (1/m) · Δt² Per-vertex data: pt (16) + pt−1 (16) + fE (16) + 1/m (4) = 52 bytes / vertex Less than 4000 verts in 200K of memory We don’t need to keep them all in LS Keep vertex data in main memory and bring it in in blocks
  • 13.–20. Streaming Integration [Animated diagram over eight slides: the arrays pt, pt−1, fE, and 1/m live in main memory, split into blocks B0–B3; the SPU repeats DMA_IN Bn → Process Bn → DMA_OUT Bn for each block in turn, so only one block’s worth of data occupies the local store at a time.]
  • 21. Double-buffering Take advantage of concurrent DMA and processing to hide transfer times Without double-buffering: DMA_IN B0 → Process B0 → DMA_OUT B0 → DMA_IN B1 → Process B1 → DMA_OUT B1 → … With double-buffering: DMA_IN B0; then DMA_IN B1 overlaps Process B0; then DMA_OUT B0 and DMA_IN B2 overlap Process B1; then DMA_OUT B1 and DMA_IN B3 overlap Process B2; …
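The buffer rotation can be sketched as follows. This is an assumption-laden stand-in: `dmaIn`/`dmaOut` are plain memcpys rather than the real asynchronous MFC DMA, so only the structure of issuing block n+1 before processing block n is shown — on an SPU those transfers would genuinely overlap the compute loop:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-ins for asynchronous SPU DMA (synchronous memcpy here).
static void dmaIn(float* ls, const float* mm, std::size_t n)  { std::memcpy(ls, mm, n * sizeof(float)); }
static void dmaOut(float* mm, const float* ls, std::size_t n) { std::memcpy(mm, ls, n * sizeof(float)); }

// Stream `total` floats through two local buffers of `blockSize` each,
// scaling every value by k. The fetch for the next block is issued before
// the current block is processed.
void streamScale(float* data, std::size_t total, std::size_t blockSize, float k) {
    std::vector<float> store0(blockSize), store1(blockSize);
    float* buf[2] = { store0.data(), store1.data() };
    std::size_t count[2] = {0, 0}, offset[2] = {0, 0};

    // Prime the pipeline: fetch block 0 into buffer 0.
    std::size_t pos = 0;
    int cur = 0;
    count[cur] = std::min(blockSize, total);
    dmaIn(buf[cur], data, count[cur]);
    pos = count[cur];

    while (count[cur] > 0) {
        int next = cur ^ 1;
        // Kick off the next transfer (would be in flight during processing).
        count[next] = (pos < total) ? std::min(blockSize, total - pos) : 0;
        if (count[next] > 0) {
            offset[next] = pos;
            dmaIn(buf[next], data + pos, count[next]);
            pos += count[next];
        }
        // Process the current buffer, then write it back.
        for (std::size_t i = 0; i < count[cur]; ++i)
            buf[cur][i] *= k;
        dmaOut(data + offset[cur], buf[cur], count[cur]);
        cur = next;
    }
}
```

With real asynchronous DMA the only extra step is waiting on the tag group for buffer `cur` before touching it.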
  • 22. Streaming Data Streaming is possible when the data access pattern is simple and predictable (e.g. linear) Number of verts processed per frame depends on processing speed and bandwidth but not LS size Unfortunately, not every step in the cloth solver can be fully streamed Fixing edge lengths requires random memory access…
  • 23. Fixing Edge Lengths Points coming out of the integration step don’t necessarily satisfy edge distance constraints struct Edge { int v1; int v2; float restLen; }; Vector3 d = p[v2] - p[v1]; float len = sqrt(dot(d,d)); float diff = (len - restLen) / len; p[v1] += d * 0.5 * diff; p[v2] -= d * 0.5 * diff; (each endpoint moves half the correction, so a too-long edge is pulled back together)
  • 24. Fixing Edge Lengths An iterative process: Fix one edge at a time by adjusting 2 vertex positions Requires random access to particle positions array Solution: Keep all particle positions in LS Stream in edge data In 200K we can fit 200KB / 16B > 12K vertices
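The iterative relaxation can be sketched in full — a minimal scalar version of the Jakobsen-style pass the slides describe, with illustrative type names (`Vec3f`, `fixEdgeLengths`); each endpoint moves half the error along the edge direction:

```cpp
#include <cmath>
#include <cstddef>

struct Vec3f { float x, y, z; };
struct Edge  { int v1, v2; float restLen; };

// Gauss-Seidel-style edge relaxation: each pass nudges both endpoints of
// every edge halfway toward satisfying its rest length. Positions stay
// resident (in LS) while the edge list can be streamed past them.
void fixEdgeLengths(Vec3f* p, const Edge* edges, std::size_t numEdges, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t e = 0; e < numEdges; ++e) {
            Vec3f& a = p[edges[e].v1];
            Vec3f& b = p[edges[e].v2];
            float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
            float len = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (len == 0.0f) continue;                       // degenerate edge
            float diff = 0.5f * (len - edges[e].restLen) / len;
            a.x += dx * diff; a.y += dy * diff; a.z += dz * diff;
            b.x -= dx * diff; b.y -= dy * diff; b.z -= dz * diff;
        }
    }
}
```

A single isolated edge converges in one pass; edges sharing vertices fight each other, which is why several iterations are run per frame.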
  • 25. Rigid Bodies Our group is currently porting the AGEIA™ PhysX™ SDK to CELL Large codebase written with a PC architecture in mind Assumes easy random access to memory Processes tasks sequentially (no parallelism) Interesting example of how to port existing code to a multi-core architecture
  • 26. Starting the Port Determine all the stages of the rigid body pipeline Look for stages that are good candidates for parallelizing/optimizing Profile code to make sure we are focusing on the right parts
  • 27. Rigid Body Pipeline Broadphase Collision Detection → Narrowphase Collision Detection → Constraint Prep → Constraint Solve → Integration Data flow: current body positions → potentially colliding body pairs → points of contact between bodies → constraint equations → updated body velocities → new body positions
  • 28. Rigid Body Pipeline [Diagram: the same pipeline with Narrowphase (NP), Constraint Prep (CP), Constraint Solve (CS), and Integration (I) each split into multiple parallel instances after a single Broadphase stage.]
  • 30. Profiling Results [Chart: cumulative frame time in microseconds over ~2000 frames, broken down into Broadphase, Narrowphase, Constraint Prep, Solver, Integration, and Other.]
  • 31. Running on the SPUs Three steps: 1. (PPU) Pre-process “Gather” operation (extract data from PhysX data structures and pack it in MM) 2. (SPU) Execute DMA packed data from MM to LS Process data and store output in LS DMA output to MM 3. (PPU) Post-process “Scatter” operation (unpack output data and put back in PhysX data structures)
  • 32. Why Involve the PPU? Required PhysX data is not conveniently packed Data is often not aligned We need to use PhysX data structures to avoid breaking features we haven’t ported Solutions: Use list DMAs to bring in data Modify existing code to force alignment Change PhysX code to work with new data structures
  • 33. Batching Up Work Create work batches for each task [Diagram: PPU pre-process gathers from PhysX data structures into work batch buffers in MM (task description + batch inputs/outputs); SPUs execute the batches; PPU post-process scatters results back into the PhysX data structures.]
  • 34. Narrow-phase Collision Detection Problem: A list of object pairs that may be colliding, e.g. (A,B), (A,C), (B,C), … Want to do contact processing on SPUs Pairs list has references to geometry
  • 35. Narrow-phase Collision Detection Data locality Same bodies may be in several pairs Geometry may be instanced for different bodies SPU memory access Can only access main memory with DMA No hardware cache Data reuse must be explicit
  • 36. Software Cache Idea: make a (read-only) software cache Cache entry is one geometric object Entries have variable size Basic operation SPU checks cache for object If not in cache, object fetched with DMA Cache returns a local address for object
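The basic lookup operation can be sketched as below. This is a deliberately simplified sketch: `std::unordered_map` and a per-entry `std::vector` stand in for the hash table and the two fixed entry buffers of the real design, and the fetch is a memcpy rather than a DMA; the two-buffer replacement and pre-fetching from the later slides are omitted:

```cpp
#include <cstring>
#include <unordered_map>
#include <vector>

// Read-only software cache: maps a "main memory" address to a local copy.
// On a cache miss the object is fetched (memcpy standing in for DMA);
// either way the caller gets a local address for the object.
class SoftCache {
public:
    const void* lookup(const void* mmAddr, std::size_t size) {
        auto it = table_.find(mmAddr);
        if (it != table_.end()) { ++hits_; return it->second.data(); }
        ++misses_;
        std::vector<unsigned char> copy(size);
        std::memcpy(copy.data(), mmAddr, size);   // "DMA" into local store
        return table_.emplace(mmAddr, std::move(copy)).first->second.data();
    }
    int hits() const { return hits_; }
    int misses() const { return misses_; }
private:
    std::unordered_map<const void*, std::vector<unsigned char>> table_;
    int hits_ = 0, misses_ = 0;
};
```

The tradeoff the later slides raise is visible even here: every lookup spends cycles on hashing to save a redundant transfer, so the cache only pays off when objects really are reused across pairs.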
  • 37. Software Cache Data Structures Two entry buffers New entries appended to “current” buffer Hash-table used to record and find loaded entries [Diagram: entries A, B, C filling Buffer 0, with the next DMA appending after them; Buffer 1 alongside.]
  • 38. Software Cache Data Replacement When space runs out in a buffer Overwrite data in second buffer Considerations Does not fragment memory No searches for free space But does not prefer frequently used data
  • 39. Software Cache Hiding the DMA latency Double-buffering Start DMA for un-cached entries Process previously DMA’d entries Process/pre-fetch batches Fetch and compute times vary Batching may improve balance DMA-lists useful One DMA command Multiple chunks of data gathered [Diagram: entries D, F, E being DMA’d into the current buffer while previously fetched entries A, B, C are processed.]
  • 40. Software Caching Conclusions Simple cache is practical Used for small convex objects in PhysX Design considerations Tradeoff of cache-logic cycles vs. bandwidth saved Pre-fetching important to include
  • 41. Single SPU Performance [Diagram: the PPU-only timeline is all Exec; with PPU + SPU, the PPU does PreP and PostP around an SPU Exec phase and is otherwise free. SPU Exec < PPU Exec thanks to SIMD + fast memory access.]
  • 42. Multiple SPU Performance Pre- and Post- processing times determine how many SPUs can be used effectively
  • 43. Multiple SPU Performance [Diagram: four work batches (1–4) scheduled on the PPU alone, then on 1, 2, and 3 SPUs; the batches overlap across SPUs, but the PPU pre- and post-processing remains serial.]
  • 44. PPU vs SPU comparisons [Chart: Convex Stack (500 boxes) — frame time in microseconds over ~1300 frames for PPU-only, 1-SPU, 2-SPUs, 3-SPUs, and 4-SPUs configurations.]
  • 45. Duck Demo One of our first CELL demos (spring 2005) Several interacting physics systems: Rigid bodies (ducks & boats) Height-field water surface Cloth with ripping (sails) Particle based fluids (splashes + cups)
  • 46. Duck Demo (Lots of Ducks)
  • 47. Duck Demo Ambitious project with short deadline Early PC prototypes of some pieces Most straightforward way to parallelize: Dedicate one SPU for each subsystem Each piece could be developed and tested individually
  • 48. Duck Demo Resource Allocation PPU – main loop, SPU thread synchronization, draw calls SPU0 – height field water (<50%) SPU1 – splashes iso-surface (<50%) SPU2 – cloth sails for boat 1 (<50%) SPU3 – cloth sails for boat 2 (<50%) SPU4 – rigid body collision/response (95%) [Diagram: HF water, iso-surface, cloth, and rigid body workloads laid out within one frame.]
  • 49. Parallelization Recipe One three-step approach to code parallelization: 1. Find independent components 2. Run them side-by-side 3. Recursively apply recipe to components
  • 50. Challenges Step 1: Find independent components Where do you look? Maybe you need to break apart and overlap your data? e.g. Broad phase collision detection Maybe you need to break apart your loop into individual iterations? e.g. Solving cloth constraints
  • 51. Broad Phase Collision Detection Need to test 600 rigid bodies against each other. Split the 600 objects into three groups of 200: A, B, and C. Test 200 objects A vs 200 objects B, 200 objects A vs 200 objects C, and 200 objects B vs 200 objects C. We can execute all three of these simultaneously
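The batch decomposition can be sketched as a tiny helper. One assumption to flag: the slide shows only the three cross-group batches (A–B, A–C, B–C), but each group must also be tested against itself, so this sketch emits the self-pairs too (six batches for three groups); `makeBatches` is an illustrative name:

```cpp
#include <utility>
#include <vector>

// Emit group-vs-group work batches for a broad phase split into
// numGroups object groups. a == b covers within-group testing.
std::vector<std::pair<int, int>> makeBatches(int numGroups) {
    std::vector<std::pair<int, int>> batches;
    for (int a = 0; a < numGroups; ++a)
        for (int b = a; b < numGroups; ++b)
            batches.push_back({a, b});   // one batch: test group a vs group b
    return batches;
}
```

Since the broad phase only reads positions, every batch is independent and can be handed to a different SPU.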
  • 52. Cloth Solving for (i=1 to 5) { cloth = solve(cloth); } With three independent pieces A, B, C: for (i=1 to 5) { solve_on_proc1(a); solve_on_proc2(b); wait_for_all(); solve_on_proc1(c); wait_for_all(); }
  • 53. …challenges Step 2: Run them side-by-side Bandwidth and cache issues Need good data layout to avoid thrashing cache or bus Processor issues Need efficient processor management scheme What if the job sizes are very different? e.g. a suit of cloth and a separate neck tie Need further refinement of large jobs, or you only save on the small neck tie time
  • 54. …challenges Step 3: Recurse When do you stop? Overhead of launching smaller jobs Synchronization when a stage is done e.g. Gather results from all collision detection before solving But this can go down to the instruction level e.g. Using Structure-of-Arrays, transform four independent vectors at once
  • 55. High Level Parallelization: Duck Demo [Diagram: dependencies between Fluid Simulation, Fluid Surface, Rigid Bodies, and Cloth Sails; the cloth is further split into Cloth Boat 1 and Cloth Boat 2.] Note that the parts didn’t take an equal amount of time to run. We could have done better given time! But cloth was for multiple boats
  • 56. Lower Level Parallelization Rigid Body Simulation, 600 bodies example [Diagram: objects A, B, C distributed across Proc 1–3 through the Broad Phase Collision Detection, Narrow Phase Collision Detection, and Constraint Solving stages.]
  • 57. Structure of Arrays Array of Structures or “AoS”: Data[0..7] stored as X0 Y0 Z0 W0, X1 Y1 Z1 W1, … — one AoS vector holds all four components of one element. Structure of Arrays or “SoA”: stored as X0 X1 X2 X3, Y0 Y1 Y2 Y3, Z0 Z1 Z2 Z3, W0 W1 W2 W3 — one SoA vector holds the same component from four elements. Bonus! Since W is almost always 0 or 1, we can eliminate it with a clever math library and save 25% memory and bandwidth!
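The AoS-to-SoA transpose the slide describes can be sketched in scalar code — on an SPU this would be a handful of shuffle instructions on quadword registers, so the function here only shows the index mapping, not the real SIMD implementation:

```cpp
// Transpose four AoS vec4s (X Y Z W per element) into SoA form:
// out holds all four Xs, then the Ys, the Zs, and the Ws.
void aosToSoA4(const float in[16], float out[16]) {
    for (int e = 0; e < 4; ++e)          // element index
        for (int c = 0; c < 4; ++c)      // component index: x, y, z, w
            out[c * 4 + e] = in[e * 4 + c];
}
```

After the transpose, one SIMD add on the "X" quadword advances the x-coordinate of four elements at once, which is where the throughput win comes from.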
  • 58. Lowest Level Parallelization: Structure-of-Array processing of Particles Given: pn(t) = position of particle n at time t, vn(t) = velocity of particle n at time t p1(ti) = p1(ti−1) + v1(ti−1) · dt + 0.5 · G · dt² p2(ti) = p2(ti−1) + v2(ti−1) · dt + 0.5 · G · dt² … Note they are independent of each other So we can run four together using SoA: p{1-4}(ti) = p{1-4}(ti−1) + v{1-4}(ti−1) · dt + 0.5 · G · dt²
  • 59. Failure Case Gauss Seidel Solver Consider a simple position-based solver that uses distance constraints. Given: p=current positions of all objects solve(cn, p) takes p and constraint cn and computes a new p that satisfies cn p=solve(c0, p) p=solve(c1, p) … Note that to solve c1, we need the result of c0. Can’t solve c0 and c1 concurrently!
  • 60. Failure Case Possible Solutions Generally, you’re out of luck, but… Some cases have very limited dependencies e.g. particle-based cloth solving Solution: Arrange constraints such that no four adjacent constraints share cloth particles Consider a different solver e.g. Jacobi solvers don’t use updated values until all constraints have been processed once But they need more memory (pnew and pcurrent) And may need more iterations to converge
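The Jacobi alternative can be sketched concretely. This is an illustrative 1D version (positions on a line, illustrative names `DistC`/`jacobiSolve`), not the slides' solver: every constraint reads the same position snapshot and writes into a separate correction accumulator, so all constraints in a pass are independent and could run on different SPUs, at the cost of the extra array and extra iterations the slide warns about:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct DistC { int i, j; float rest; };   // distance constraint between p[i], p[j]

// Jacobi-style distance solver: corrections are accumulated from a fixed
// snapshot of p, then applied in one step at the end of each pass.
void jacobiSolve(std::vector<float>& p, const std::vector<DistC>& cs, int iters) {
    std::vector<float> acc(p.size());     // the "pnew" side: extra memory
    for (int it = 0; it < iters; ++it) {
        std::fill(acc.begin(), acc.end(), 0.0f);
        for (const DistC& c : cs) {
            float d = p[c.j] - p[c.i];
            float len = std::fabs(d);
            if (len == 0.0f) continue;
            float corr = 0.5f * (len - c.rest) / len;
            acc[c.i] += d * corr;         // accumulated, not applied immediately
            acc[c.j] -= d * corr;
        }
        for (std::size_t k = 0; k < p.size(); ++k) p[k] += acc[k];
    }
}
```

A Gauss-Seidel pass would satisfy a lone chain like this faster; the Jacobi form trades iterations for parallelism.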
  • 62. Smoothed Particle Hydrodynamics (SPH) Fluid Simulation Smoothed-particles Mass distributed around a point Density falls to 0 at a radius h Forces between particles closer than 2h h
  • 63. SPH Fluid Simulation High-level parallelism Put particles in grid cells Process on different SPUs (Not used in duck demo) Low-level parallelism SIMD and dual-issue on SPU Large n per cell may be better Less grid overhead Loops fast on SPU
  • 64. SPH Loop Consider two sets of particles P and Q E.g., taken from neighbor grid cells O(n²) problem Can unroll (e.g., by 4): for (i = 0; i < numP; i++) for (j = 0; j < numQ; j += 4) { Compute force(pi, qj); Compute force(pi, qj+1); Compute force(pi, qj+2); Compute force(pi, qj+3); }
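The unrolled loop shape can be sketched as below. Two assumptions to flag: `forcePair` is a trivial stand-in kernel (a linear difference), not the real SPH smoothing kernel, and `numQ` is taken to be a multiple of 4 as the slide's unrolling implies:

```cpp
#include <cstddef>

// Stand-in pair kernel; the real SPH kernel would weight by distance < 2h.
static float forcePair(float p, float q) { return q - p; }

// O(n^2) pairwise loop over particle sets P and Q with the inner loop
// unrolled by 4, giving the SPU independent work to schedule each iteration.
float accumulate(const float* ps, std::size_t numP,
                 const float* qs, std::size_t numQ) {
    float total = 0.0f;
    for (std::size_t i = 0; i < numP; ++i)
        for (std::size_t j = 0; j < numQ; j += 4) {
            total += forcePair(ps[i], qs[j]);
            total += forcePair(ps[i], qs[j + 1]);
            total += forcePair(ps[i], qs[j + 2]);
            total += forcePair(ps[i], qs[j + 3]);
        }
    return total;
}
```

The four body lines have no dependence on one another, which is exactly what the SoA rewrite on the next slide exploits by folding them into single SIMD operations.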
  • 65. SPH Loop, SoA Idea: Increase SIMD throughput with structure-of-arrays Transpose and produce combinations [Diagram: pi and qj…qj+3, each stored as x y z, transposed into SoA p and SoA q.]
  • 66. SPH Loop, Software Pipelined Add software pipelining Conversion instructions can dual-issue with math Unpipelined: Load[i] → To SoA[i] → Compute[i] → From SoA[i] → Store[i] Pipelined: Pipe 0 runs Compute[i], Load[i+1], Store[i−1] while Pipe 1 runs To SoA[i+1], From SoA[i−1]
  • 67. Recap Finding independence is hard! Across subsystems or within subsystems? Across iterations or within iterations? Data level independence? Instruction level independence? How about “bandwidth level” independence? Parallelization overhead Sometimes running serially wins over overhead of parallelization
  • 69. Questions? http://www.research.scea.com/ Contacts: Vangelis Kokkevis: vangelis_kokkevis@playstation.sony.com Eric Larsen: eric_larsen@playstation.sony.com Steven Osman: steven_osman@playstation.sony.com