Learn about new developments in simulating and rendering grass, fur and hair. We’ll show thousands of blades of grass or strands of fur being simulated in real time, as well as our latest findings in Order-Independent Transparency, in this AMD technology presentation from the 2014 Game Developers Conference in San Francisco, March 17-21.
1. Grass, Fur and all things hairy
Nicolas Thibieroz, Gaming Engineering Manager, AMD
Karl Hillesland, Senior Research Engineer, AMD
2. Next-gen Grass, Fur and Hair
● The time for next-gen quality is now
● Tomb Raider pioneered next-gen hair
● Even on PS4/XB1
● Users expect this level of quality for next-gen titles
● You need to start thinking about this
● This talk is about making high-quality fur, grass and hair run at real-time performance
3. TressFX applied to Grass, Fur and Hair
● Variations of the same technique can be used for all those
applications
● In all cases the core principles of next-gen quality are still
needed:
● Compute simulations
● Anti-aliasing
● Transparency
● Volumetric self-shadowing
● A good lighting model
4. Forward Rendering Pipeline – a refresher
● Consists of three steps:
● Hair simulation
● Shade and store fragments into buffers
● Fetch shaded fragments, sort and render
5. Per-Pixel Linked Lists
● Head UAV
● Each pixel location has a “head pointer” to a linked list in the PPLL UAV
● PPLL UAV
● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter)
● A link is created to the fragment pointed to by the head pointer
● Head pointer then points to the new fragment
// Retrieve current pixel count and increase counter
uint uPixelCount = LinkedListUAV.IncrementCounter();
uint uOldStartOffset;
// Exchange indices in LinkedListHead texture corresponding to pixel location
InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
// Append new element at the end of the Fragment and Link Buffer
Element.uNext = uOldStartOffset;
LinkedListUAV[uPixelCount] = Element;
[Diagram: Head UAV entries pointing into the PPLL UAV]
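The insertion steps on this slide can be mirrored on the CPU to check the linking logic. The following is a minimal serial sketch, not the TressFX implementation: `PerPixelLinkedList`, `Fragment`, `kEndOfList` and the packed `color` payload are illustrative names, and the plain integer `counter` stands in for the UAV counter and the atomic exchange.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed marker for "this pixel has no fragments yet".
constexpr uint32_t kEndOfList = 0xFFFFFFFFu;

struct Fragment {
    float    depth;
    uint32_t color;  // placeholder payload
    uint32_t next;   // index of the next node in this pixel's list
};

struct PerPixelLinkedList {
    std::vector<uint32_t> head;   // one head pointer per pixel ("Head UAV")
    std::vector<Fragment> nodes;  // shared node pool ("PPLL UAV")
    uint32_t counter = 0;         // stands in for the UAV counter

    PerPixelLinkedList(uint32_t pixelCount, uint32_t maxNodes)
        : head(pixelCount, kEndOfList), nodes(maxNodes) {}

    // Returns false when the node pool is full; that fragment is dropped.
    bool insert(uint32_t pixel, float depth, uint32_t color) {
        assert(pixel < head.size());
        if (counter >= nodes.size()) return false;
        const uint32_t slot = counter++;      // next open location in the pool
        const uint32_t oldHead = head[pixel]; // serial stand-in for the exchange
        head[pixel] = slot;                   // head now points to the new node
        nodes[slot] = Fragment{depth, color, oldHead};
        return true;
    }

    // Walks one pixel's list, newest fragment first.
    std::vector<float> depthsAt(uint32_t pixel) const {
        std::vector<float> out;
        for (uint32_t i = head[pixel]; i != kEndOfList; i = nodes[i].next)
            out.push_back(nodes[i].depth);
        return out;
    }
};
```

Fragments come back in reverse submission order, not depth order, which is why the resolve pass still has to sort.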
7. Forward Rendering Pipeline – a refresher
Shade and Store fragments into Buffers
[Diagram: VS (world space → homogeneous clip space; extrusion from line segments to non-indexed triangles) → PS (coverage, lighting, shadows) → depth, color, coverage and next stored in the PPLL UAV and Head UAV; null RT, stencil]
8. Forward Rendering Pipeline – a refresher
Fetch shaded fragments, sort and render
[Diagram: full-screen quad → VS → PS; the PS reads the Head UAV and PPLL UAV (stencil), performs fragment sorting and manual blending, and writes to the render target]
9. Forward Rendering Performance
● Main cost in forward rendering mode is in the
shading part
● All fragments are lit and shadowed before being stored
● PPLL storing is typically not the bottleneck!
● Don’t need maximum quality on all fragments
● “tail” fragments need only “good enough” quality
● Solution: Use shader LOD
10. Forward vs Deferred Rendering Pipeline
Deferred rendering pipeline
● Hair simulation
● Store fragment properties into
buffers
● Fetch fragment properties, sort,
shade and render
● Full shading on K-frontmost
fragments
● “Tail” fragments are shaded with a
simpler light equation and
shadowing algorithm
Forward rendering pipeline
● Hair simulation
● Full shading and store
fragments into buffers
● Fetch shaded fragments, sort
and render
12. Deferred Rendering Pipeline – a refresher
Store Fragment Properties into Buffers
[Diagram: index buffer (indexed triangle list) → VS (world space → homogeneous clip space) → PS (coverage) → depth, tangent, coverage and next stored in the PPLL UAV and Head UAV; null RT, stencil]
13. Deferred Rendering Pipeline
Fetch fragments, sort, shade and render
[Diagram: full-screen quad → VS → PS; the PS reads the Head UAV and PPLL UAV (stencil), applies lighting and shadows, and writes to the render target]
● K frontmost fragments: full shading, sorting and manual blending
● Tail fragments: cheap shading, no sorting and manual blending
14. Deferred Rendering Shading LOD Optimization
● Deferred approach allows a reduction in shading cost “Shader LOD”
● Only sort and shade K frontmost fragments at high quality
● “Simple” shading and out-of-order rendering on tail fragments
● Single-tap shadowing on tail fragments
● Very little quality difference compared to full shading
● But much better performance!
Technique                    Cost
Out of order, no shading     1.31 ms
Out of order, shading        2.80 ms
Forward PPLL, shading        3.38 ms
Deferred PPLL, shading       2.13 ms
Fur model with ~130,000 fur strands, running on AMD Radeon 7970 @ 1080p
● Shading cost is ~1.5 ms
● PPLL cost is ~0.58 ms
Fast!
16. Geometry Optimizations
● A great portion of time was spent in the GPU front-end
● 920,000 line segments for fur model
● Expansion from line segments to triangles was done in the GS at first, and then in the VS with Draw()
● Each segment would create a quad (two triangles) with 6 vertices
Draw() method
[Diagram: line segments expanded to quads; every quad carries its own 6 vertices, so shared corners are duplicated]
Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };
DrawIndexed() method
[Diagram: line segments expanded to quads; neighbouring quads share vertices]
Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
● Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!
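The indexed triangle list shown on this slide follows a regular pattern that can be generated offline. The sketch below is one possible way to build it for a single strand; `buildStrandIndices` is a hypothetical helper name, and it assumes each centre-line point i expands to the two vertices 2i and 2i+1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the index list for one strand whose segments are expanded to quads.
// Segment s uses vertices 2s, 2s+1, 2s+2, 2s+3, so adjacent quads share two.
std::vector<uint32_t> buildStrandIndices(uint32_t segmentCount) {
    assert(segmentCount < 0x40000000u);  // keeps 2*s + 3 within 32 bits
    std::vector<uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(segmentCount) * 6u);
    for (uint32_t s = 0; s < segmentCount; ++s) {
        const uint32_t v = 2u * s;
        const uint32_t quad[6] = {v, v + 1u, v + 2u,        // first triangle
                                  v + 2u, v + 1u, v + 3u};  // second triangle
        indices.insert(indices.end(), quad, quad + 6);
    }
    return indices;
}
```

With this layout a strand of S segments references 2(S+1) unique vertices instead of the 6S vertices needed by the non-indexed Draw() path.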
17. Distance-based LOD system Optimization
● Input line segments have a random order
● Just render fewer (but thicker) fragments when far away!
● Needs shading adjustments to ensure smooth quality transitions
● Increase alpha threshold for fragment inclusion when far away
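One way to read "fewer but thicker" is to draw only a prefix of the randomly ordered strand buffer and widen what remains. The sketch below makes that concrete under stated assumptions: the linear falloff, the `nearDist`/`farDist`/`minFraction` parameters and the `1 / fraction` width compensation are all hypothetical choices, not values from the talk, and the alpha-threshold adjustment is not covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct StrandLod {
    uint32_t strandCount;  // how many strands to draw (a prefix of the buffer)
    float    widthScale;   // multiplier applied to strand thickness
};

// Because the input order is random, drawing the first N strands gives a
// roughly uniform subsample of the whole model.
StrandLod selectStrandLod(float distance, uint32_t totalStrands,
                          float nearDist, float farDist, float minFraction) {
    assert(farDist > nearDist && minFraction > 0.0f && minFraction <= 1.0f);
    float t = (distance - nearDist) / (farDist - nearDist);
    t = std::min(1.0f, std::max(0.0f, t));
    // Fraction of strands kept: 1 at nearDist, minFraction at farDist.
    const float fraction = 1.0f + t * (minFraction - 1.0f);
    uint32_t count =
        static_cast<uint32_t>(fraction * static_cast<float>(totalStrands));
    count = std::min(totalStrands, std::max<uint32_t>(count, 1u));
    // Assumed compensation: widen strands so total coverage stays similar.
    return StrandLod{count, 1.0f / fraction};
}
```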
18. Other Optimizations
● PPLL Head UAV uses a RWTexture2D instead of a Buffer
● Results in more efficient caching for UAV accesses
● Avoid GPR indexing for sorting
● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
● Used an ALU-based indexing approach to improve performance
● TO DO: compute shader simulation optimizations
● Currently a set of multiple compute shaders
● Looking at combining some of these, optimizing shaders and output formats
19. Per-Pixel Linked Lists UAV Memory Considerations
● How much memory is needed?
● Guesstimate for a given usage model
● Max (hair pixels x average overdraw) fragments
● What happens when I run out?
● Missing fragments
● What can be done about it?
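The "hair pixels × average overdraw" guesstimate translates directly into a byte budget once a node size is fixed. The sketch below assumes a 16-byte node (four 32-bit fields) and a 4-byte head pointer per screen pixel; both sizes and the function names are illustrative assumptions, not figures from the talk.

```cpp
#include <cassert>
#include <cstdint>

// Node pool size: one node per expected fragment.
uint64_t ppllNodeBytes(uint64_t hairPixels, uint64_t averageOverdraw,
                       uint64_t bytesPerNode) {
    assert(bytesPerNode > 0u);
    return hairPixels * averageOverdraw * bytesPerNode;
}

// Head buffer size: one 32-bit head pointer per screen pixel.
uint64_t headBufferBytes(uint64_t width, uint64_t height) {
    return width * height * 4u;
}
```

For example, 300,000 hair pixels at an average overdraw of 12 with 16-byte nodes come to 57.6 MB for the node pool, plus about 8.3 MB of head pointers at 1920×1080.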
21. PP Linked-List (PPLL) vs. k-Buffer
● PPLL: node pool holding all fragments. How big?
● k-Buffer: fixed-size array of k entries per pixel. Simple memory bound
[Diagram: PPLL node pool next to a screen-sized grid of k-entry arrays]
22. The Front k
Approximation to avoid massive sorting
● Only sort the front k fragments per-pixel
● Blend the rest out-of-order
If deferring for shader LOD … also
● Full quality shade on front k
● Cheap shade on rest
[Figure: per-pixel fragment count; 20 frags/pixel on average, red = over 100]
● k is 4, 8, 16
23. The Front k
● Can’t know the front k until all fragments have been processed
[Diagram: k-buffer and tail]
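The front-k approximation can be written down as a per-pixel resolve over an already collected fragment list. This is a minimal CPU sketch rather than the shader from the talk: it uses a single colour channel, picks the front k with `std::partial_sort`, and assumes the tail is blended with the ordinary "over" operator in whatever order it arrives. `Frag`, `blendOver` and `resolvePixel` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Frag { float depth; float color; float alpha; };  // one channel for brevity

inline float blendOver(float dst, const Frag& f) {
    return dst * (1.0f - f.alpha) + f.color * f.alpha;
}

// Sorts only the k nearest fragments; everything else is blended unsorted.
float resolvePixel(std::vector<Frag> frags, std::size_t k, float background) {
    assert(k > 0);
    const std::size_t front = std::min(k, frags.size());
    // After this call the first `front` entries are the nearest ones,
    // sorted near to far; the remaining entries are in unspecified order.
    std::partial_sort(frags.begin(),
                      frags.begin() + static_cast<std::ptrdiff_t>(front),
                      frags.end(),
                      [](const Frag& a, const Frag& b) { return a.depth < b.depth; });
    float result = background;
    for (std::size_t i = front; i < frags.size(); ++i)  // tail: out of order
        result = blendOver(result, frags[i]);
    for (std::size_t i = front; i-- > 0;)               // front k: far to near
        result = blendOver(result, frags[i]);
    return result;
}
```

When k is at least the fragment count, the result matches a full back-to-front sort; smaller k trades ordering errors in the tail for less sorting work.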
25. If New Fragment in k
1. Swap with furthest
2. Find new furthest
3. Blend with tail
[Diagram: the new fragment replaces the k-buffer entry at the index of the furthest; the displaced tail fragment is blended into the tail color]
26. If not in k
1. Blend with tail
[Diagram: the k-buffer and the index of the furthest are unchanged; the new fragment becomes the tail fragment and is blended into the tail color]
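The two cases above (new fragment inside or outside the front k) combine into one streaming update per fragment. The sketch below is a serial CPU illustration with hypothetical names (`PixelKBuffer`, `KFrag`, `K = 4`); it ignores the locking the GPU version needs and assumes the tail uses the ordinary "over" blend on a single colour channel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

constexpr std::size_t K = 4;  // assumed k; the talk mentions 4, 8 and 16

struct KFrag { float depth; float color; float alpha; };

struct PixelKBuffer {
    std::array<KFrag, K> slots{};
    std::size_t count = 0;     // slots filled so far
    std::size_t furthest = 0;  // index of the furthest stored fragment
    float tail;                // colour accumulated from fragments outside k

    explicit PixelKBuffer(float background) : tail(background) {}

    void findFurthest() {
        furthest = 0;
        for (std::size_t i = 1; i < count; ++i)
            if (slots[i].depth > slots[furthest].depth) furthest = i;
    }

    void add(KFrag f) {
        assert(f.alpha >= 0.0f && f.alpha <= 1.0f);
        if (count < K) {                       // the first k just fill the buffer
            slots[count++] = f;
            findFurthest();
            return;
        }
        if (f.depth < slots[furthest].depth) { // new fragment is in the front k
            std::swap(f, slots[furthest]);     // 1. swap with furthest
            findFurthest();                    // 2. find new furthest
        }
        // 3. blend with tail: f is now the displaced or rejected fragment
        tail = tail * (1.0f - f.alpha) + f.color * f.alpha;
    }
};
```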
27. From PPLL to k-Buffer
For each pixel:
Write frags to mem
For each fragment in each pixel
read fragment from mem
update k-buffer (reg)
blend tail fragment (reg)
Read k-buffer from mem
Sort and blend k-buffer (reg)
update k-buffer (mem)
blend tail fragment (mem)
32. Spinlock Mutex
[allow_uav_condition]
for(; i<MAX_LOOP_COUNT && !bStop; ++i) // Paranoia: bounded spin
{
uint oldID;
InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID); // Try
if( (oldID&RESERVED) != RESERVED )
{
[[ … Do work ]]
DeviceMemoryBarrier();
tRWMutex[vScreenAddress] = (new_max_id<<28)+INITED; // Release
bStop = true;
} // end mutex check
}// end spinlock loop
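The try / do work / release pattern of this spinlock can be reproduced on the CPU with `std::atomic`. This is a sketch under assumptions, not a port of the shader: `kReserved`, `kMaxLoopCount` and `withPixelLock` are hypothetical names, and the lock bit position is chosen arbitrarily.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kReserved = 0x80000000u;  // assumed lock bit
constexpr int kMaxLoopCount = 1024;          // assumed spin bound ("paranoia")

// Runs work(oldValue) while holding the per-pixel lock and stores its result
// as the new unlocked value. Returns false if the lock was never acquired.
template <typename Work>
bool withPixelLock(std::atomic<uint32_t>& mutexWord, Work work) {
    for (int i = 0; i < kMaxLoopCount; ++i) {
        // Try: write the reserved marker and look at what was there before.
        const uint32_t old =
            mutexWord.exchange(kReserved, std::memory_order_acquire);
        if ((old & kReserved) == 0u) {
            const uint32_t released = work(old);  // do work
            assert((released & kReserved) == 0u);
            // Release: publish the new state with the lock bit cleared.
            mutexWord.store(released, std::memory_order_release);
            return true;
        }
    }
    return false;
}
```

The bounded loop mirrors the slide's MAX_LOOP_COUNT guard: a fragment that never gets the lock is dropped rather than hanging.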
33. Find New Max Depth
uint new_max_depth = u_inDepth;
[unroll] for(int t=0; t<KBUFFER_SIZE; t++)
{
uint element_depth = DEPTH( vScreenAddress, t );
if(element_depth > new_max_depth )
{
new_max_depth = element_depth;
new_max_id = t;
}
}
Generally more memory traffic than PPLL
34. Initialization: The first k
Options
● Clear k-buffer fullscreen (0,1)
● Clear k-buffer stenciled, 3rd pass
● Clear on first fragment
● Count
[Diagram: per-pixel word layout: mutex bit, initialized bit, max index (4 bits), count (remainder); high bit marked]
35. The first k
InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount);
[allow_uav_condition]
if(oldCount < KBUFFER_SIZE)
{
DATA(vScreenAddress,oldCount) = u_inData;
DEPTH(vScreenAddress,oldCount) = u_inDepth;
return uint2(u_outDepth,u_outData);
}
[Diagram: per-pixel word layout: mutex bit, initialized bit, max index (4 bits), count (remainder); high bit marked]
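The per-pixel word described on these slides packs a mutex bit, an initialized bit, a 4-bit max index and a fragment count. The exact bit positions are not given, so the sketch below assumes one plausible 32-bit layout (flags in the top two bits, max index next, count in the low 26 bits); the constants and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, high bit first: [mutex][initialized][max index: 4][count: 26]
constexpr uint32_t kMutexBit      = 1u << 31;
constexpr uint32_t kInitBit       = 1u << 30;
constexpr uint32_t kMaxIndexShift = 26u;
constexpr uint32_t kMaxIndexMask  = 0xFu << kMaxIndexShift;
constexpr uint32_t kCountMask     = (1u << kMaxIndexShift) - 1u;

struct PixelState {
    bool     locked;
    bool     initialized;
    uint32_t maxIndex;  // slot holding the furthest fragment, 0..15
    uint32_t count;     // fragments seen so far
};

uint32_t packState(const PixelState& s) {
    assert(s.maxIndex < 16u && s.count <= kCountMask);
    return (s.locked ? kMutexBit : 0u) | (s.initialized ? kInitBit : 0u) |
           (s.maxIndex << kMaxIndexShift) | s.count;
}

PixelState unpackState(uint32_t w) {
    return PixelState{(w & kMutexBit) != 0u, (w & kInitBit) != 0u,
                      (w & kMaxIndexMask) >> kMaxIndexShift, w & kCountMask};
}
```

Keeping the count in the low bits means an atomic add of 1 bumps the count without disturbing the flags, as long as the count field does not overflow.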
36. Models
2k polygons
~20k hairs, ~130k hairs
Stats
2-3.5 M fragments
200-300k pixels
Shading
One point light & shadow
2 shifted specular lobes
39. Performance
Time ratio to out-of-order blending
● Forward PPLL: 1.02 to 1.4
● Forward k-Buffer: 1.2 to 1.4
● Deferred PPLL: 0.7 to 0.9
● Deferred k-Buffer: 0.9 to 1.6
40. K-Buffer in Memory
● Simple memory bound
● Can be less memory
● Usually slower
● Increased memory traffic
42. Hair Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
43. Fur Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
44. Grass Simulation
● Length Constraint
● Local Constraint (1D)
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
45. Constraint Method (iterative)
● Used for length, local and global constraints
● Length is most difficult to converge
● particularly under large movement
[Diagram: strand vertices p0 … pn-1 joined by constraints C0 … Cn-2]
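An iterative length constraint can be sketched as repeated per-segment corrections along the strand. The version below is a generic position-based relaxation, not the TressFX shader: it assumes the root vertex is pinned and that interior vertices share each correction equally, and `Vec3` and `enforceLengths` are illustrative names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Moves vertices so each segment approaches its rest length.
// restLength[i] is the target distance between p[i] and p[i + 1].
void enforceLengths(std::vector<Vec3>& p, const std::vector<float>& restLength,
                    int iterations) {
    assert(!p.empty() && restLength.size() + 1 == p.size());
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i + 1 < p.size(); ++i) {
            const Vec3 d{p[i + 1].x - p[i].x, p[i + 1].y - p[i].y,
                         p[i + 1].z - p[i].z};
            const float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
            if (len < 1e-8f) continue;  // degenerate segment, nothing to fix
            const float stretch = (len - restLength[i]) / len;
            // Assumption: the root (i == 0) is pinned, so its child takes the
            // whole correction; other pairs split it evenly.
            const float wParent = (i == 0) ? 0.0f : 0.5f;
            const float wChild = 1.0f - wParent;
            p[i].x += d.x * stretch * wParent;
            p[i].y += d.y * stretch * wParent;
            p[i].z += d.z * stretch * wParent;
            p[i + 1].x -= d.x * stretch * wChild;
            p[i + 1].y -= d.y * stretch * wChild;
            p[i + 1].z -= d.z * stretch * wChild;
        }
    }
}
```

Each pass fixes one segment at the expense of its neighbour, so the error only shrinks gradually; that is the slow convergence of the length constraint the slide refers to.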
46. Tridiagonal Matrix Formulation
● Direct solve for length constraint
● Almost zero stretch
● Limited to smaller time steps (stability)
● Still cheap
● Leverages matrix structure of strands
● Two sweeps of strand
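The slides do not show the matrix itself, but "two sweeps" matches the standard direct solver for tridiagonal systems (the Thomas algorithm): one forward elimination sweep and one back-substitution sweep. The sketch below is that generic solver, offered as an illustration of why the solve is cheap, not as the TressFX formulation; `solveTridiagonal` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves A x = rhs where A is tridiagonal.
// lower[i] = A[i][i-1] (lower[0] unused), diag[i] = A[i][i],
// upper[i] = A[i][i+1] (upper[n-1] unused). No pivoting is done, so A is
// assumed to be well conditioned (for example diagonally dominant).
std::vector<double> solveTridiagonal(const std::vector<double>& lower,
                                     const std::vector<double>& diag,
                                     const std::vector<double>& upper,
                                     const std::vector<double>& rhs) {
    const std::size_t n = diag.size();
    assert(n > 0 && lower.size() == n && upper.size() == n && rhs.size() == n);
    std::vector<double> c(n), d(n), x(n);
    // Sweep 1: forward elimination from the first row to the last.
    c[0] = upper[0] / diag[0];
    d[0] = rhs[0] / diag[0];
    for (std::size_t i = 1; i < n; ++i) {
        const double m = diag[i] - lower[i] * c[i - 1];
        c[i] = upper[i] / m;
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / m;
    }
    // Sweep 2: back substitution from the last row to the first.
    x[n - 1] = d[n - 1];
    for (std::size_t i = n - 1; i-- > 0;)
        x[i] = d[i] - c[i] * x[i + 1];
    return x;
}
```

The cost is linear in the number of strand vertices, which fits the "still cheap" claim on this slide.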
49. Summary
● Next-gen look is possible now!
● Deferred Rendering for shading LOD is fastest
● k-buffer in memory is an option for memory-constrained
situations
● High-quality grass and fur simulation with compute
Upcoming TressFX 2 SDK sample update with fur scenario at
http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
52. Isoline Tessellation for hair/fur? 1/2
● Isoline tessellation has two tess factors
● First is line density (lines per invocation)
● Second is line detail (segments per line)
● In theory provides easy LOD system
● Variable line density and detail by increasing both tessellation factors
based on distance
Tess = (1,1) Tess = (2,1) Tess = (2,2) Tess = (2,3) Tess = (3,3)
53. Isoline Tessellation for hair/fur? 2/2
● In practice isoline tessellation is not cost effective for this scenario
● Lines are always 1-pixel thick
● Need GS to extrude them into triangles for smooth edges
● Major impact on performance!
● Alternative is to enable MSAA
● Most engines are deferred so this causes a large performance impact
● No extrusion for smoothing edges and no MSAA = poor quality!
● Bottom line: a pure Vertex Shader solution is faster
● LOD benefit is easily done in VS (more on this later)
● Curvature is rarely a problem (dependent on vertices/strands at authoring time)