Grass, Fur and all Things Hairy - AMD at GDC14

Grass, Fur and all things hairy
Nicolas Thibieroz, Gaming Engineering Manager, AMD
Karl Hillesland, Senior Research Engineer, AMD

Next-gen Grass, Fur and Hair
● The time for next-gen quality is now
● Tomb Raider pioneered next-gen hair
● Even on PS4/XB1
● Users expect this level of quality for next-gen titles
● You need to start thinking about this
● This talk is about making high-quality fur,
grass and hair run at real-time performance

TressFX applied to Grass, Fur and Hair
● Variations of the same technique can be used for all those
applications
● In all cases the core principles of next-gen quality are still
needed:
● Compute simulations
● Anti-aliasing
● Transparency
● Volumetric self-shadowing
● A good lighting model

Forward Rendering Pipeline – a refresher
● Consists of three steps:
● Hair simulation
● Shade and store fragments into buffers
● Fetch shaded fragments, sort and render

Per-Pixel Linked Lists
● Head UAV
● Each pixel location has a “head pointer” to a linked list in the PPLL UAV
● PPLL UAV
● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter)
● A link is created to the fragment pointed to by the head pointer
● Head pointer then points to the new fragment

// Retrieve current pixel count and increase counter
uint uPixelCount = LinkedListUAV.IncrementCounter();
uint uOldStartOffset;
// Exchange indices in LinkedListHead texture corresponding to pixel location
InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
// Append new element at the end of the Fragment and Link Buffer
Element.uNext = uOldStartOffset;
LinkedListUAV[uPixelCount] = Element;
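As a rough CPU-side illustration of the per-pixel linked list insertion described on this slide, here is a minimal Python sketch. It is not the shader code: the class name, the END_OF_LIST sentinel and the max_nodes capacity are assumptions made for the example, and the UAV counter and InterlockedExchange are replaced by ordinary single-threaded operations.

```python
from dataclasses import dataclass

END_OF_LIST = 0xFFFFFFFF  # assumed sentinel meaning "no fragment yet"


@dataclass
class Node:
    depth: float
    data: int
    next: int  # index of the previously inserted fragment for this pixel


class PerPixelLinkedLists:
    """Single-threaded stand-in for the Head UAV plus PPLL UAV pair."""

    def __init__(self, width, height, max_nodes):
        self.width = width
        self.head = [END_OF_LIST] * (width * height)  # one head pointer per pixel
        self.nodes = []                               # shared node pool
        self.max_nodes = max_nodes                    # hypothetical pool capacity

    def insert(self, x, y, depth, data):
        # Pool exhausted: the fragment is simply lost
        if len(self.nodes) >= self.max_nodes:
            return False
        index = len(self.nodes)           # plays the role of IncrementCounter()
        address = y * self.width + x
        old_head = self.head[address]     # plays the role of InterlockedExchange()
        self.head[address] = index        # head now points to the new fragment
        self.nodes.append(Node(depth, data, old_head))
        return True

    def fragments(self, x, y):
        # Walk the list from the head: newest fragment first, unsorted by depth
        out = []
        index = self.head[y * self.width + x]
        while index != END_OF_LIST:
            node = self.nodes[index]
            out.append((node.depth, node.data))
            index = node.next
        return out
```

Walking a pixel's list returns fragments in reverse insertion order, which is why the resolve pass still has to sort them by depth.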

Hair Simulation
[Diagram: input geometry (model space) and simulation parameters feed a chain of compute shaders (CS), which writes post-simulation geometry (world space) to a UAV]

Shade and Store fragments into Buffers
[Diagram: the VS extrudes line segments into non-indexed triangles, taking world-space input to homogeneous clip space. The PS computes coverage, lighting and shadows, and stores each fragment (depth, color, coverage, next). Outputs: null RT, stencil, PPLL UAV, Head UAV]

Fetch shaded fragments, sort and render
[Diagram: a full-screen quad is drawn (VS, PS). Inputs: stencil, Head UAV, PPLL UAV. The PS performs fragment sorting and manual blending, and writes to the render target]

Forward Rendering Performance
● Main cost in forward rendering mode is in the
shading part
● All fragments are lit and shadowed before being stored
● PPLL storing is typically not the bottleneck!
● Don’t need maximum quality on all fragments
● “tail” fragments need only “good enough” quality
● Solution: Use shader LOD

Forward vs Deferred Rendering Pipeline
Deferred rendering pipeline
● Hair simulation
● Store fragment properties into
buffers
● Fetch fragment properties, sort,
shade and render
● Full shading on K-frontmost
fragments
● “Tail” fragments are shaded with a
simpler light equation and
shadowing algorithm
Forward rendering pipeline
● Hair simulation
● Full shading and store
fragments into buffers
● Fetch shaded fragments, sort
and render

Deferred Rendering Pipeline
Hair Simulation – unchanged!
[Diagram: input geometry (model space) and simulation parameters feed a chain of compute shaders (CS), which writes post-simulation geometry (world space) to a UAV]

Deferred Rendering Pipeline – a refresher
Store Fragment Properties into Buffers
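[Diagram: the VS reads world-space geometry through an index buffer (indexed triangle list) and outputs homogeneous clip space. The PS computes coverage and stores each fragment (depth, tangent, coverage, next). Outputs: null RT, stencil, PPLL UAV, Head UAV]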

Deferred Rendering Pipeline
Fetch fragments, sort, shade and render
[Diagram: a full-screen quad is drawn (VS, PS). Inputs: stencil, Head UAV, PPLL UAV. The PS applies lighting and shadows and writes to the render target]
● K frontmost fragments: full shading, sorting and manual blending
● Tail fragments: cheap shading, no sorting and manual blending

Deferred Rendering Shading LOD Optimization
● Deferred approach allows a reduction in shading cost “Shader LOD”
● Only sort and shade K frontmost fragments at high quality
● “Simple” shading and out-of-order rendering on tail fragments
● Single-tap shadowing on tail fragments
● Very little quality difference compared to full shading
● But much better performance!
Technique                    Cost
Out of order, no shading     1.31 ms
Out of order, shading        2.80 ms
Forward PPLL, shading        3.38 ms
Deferred PPLL, shading       2.13 ms (fast!)

Fur model with ~130,000 fur strands, running on AMD Radeon 7970 @ 1080p
● Shading cost is ~1.5 ms (2.80 ms - 1.31 ms)
● PPLL cost is ~0.58 ms (3.38 ms - 2.80 ms)

[Comparison images: full-quality shading forced on for all fragments vs. shading LOD]

Geometry Optimizations
● A great portion of time was spent in the GPU front-end
● 920,000 line segments for fur model
● Expansion from line segments to triangles was done in GS and then VS with Draw()
● Each segment would create a quad (two triangles) with 6 vertices
Draw() method
[Diagram: line segments expanded into quads, each quad using 6 unshared vertices]
Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };
DrawIndexed() method
[Diagram: line segments expanded into quads, adjacent quads sharing vertices]
Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
● Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!
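The index pattern on this slide is regular enough to generate offline. The Python sketch below builds it for a single strand and compares vertex counts for the two draw methods. The function names are illustrative, and handling of multiple strands (which must not share vertices across strand boundaries) is left out.

```python
def quad_strip_indices(num_segments):
    """Indexed triangle list for one strand expanded into quads.

    Each strand vertex becomes a pair of extruded vertices (2*i, 2*i + 1),
    so neighbouring quads share the pair that sits between them.
    """
    triangles = []
    for s in range(num_segments):
        v = 2 * s
        triangles.append((v, v + 1, v + 2))      # first triangle of the quad
        triangles.append((v + 2, v + 1, v + 3))  # second triangle of the quad
    return triangles


def vertices_processed_draw(num_segments):
    # Non-indexed Draw(): 6 vertices per segment, none shared
    return 6 * num_segments


def unique_vertices_draw_indexed(num_segments):
    # DrawIndexed(): one vertex pair per strand vertex
    return 2 * (num_segments + 1)
```

For a 10-segment strand this is 60 vertex shader invocations with Draw() against 22 unique vertices with DrawIndexed(), assuming the post-vertex cache catches every reuse.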

Distance-based LOD system Optimization
● Input line segments have a random order
● Just render fewer (but thicker) fragments when far away!
● Needs shading adjustments to ensure smooth quality transitions
● Increase alpha threshold for fragment inclusion when far away
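One way to picture the distance-based LOD is a function that maps camera distance to a strand count and a thickness multiplier. The Python sketch below is only an assumption about how such a mapping could look: the linear falloff, the near/far/min_fraction parameters and the inverse thickness scaling are all invented for illustration. It relies on the strands being stored in random order, so drawing only the first N still covers the model evenly.

```python
def distance_lod(distance, near, far, strand_count, min_fraction=0.1):
    """Return (strands_to_draw, thickness_scale) for a camera distance.

    Hypothetical policy: full density up to `near`, then a linear drop to
    `min_fraction` of the strands at `far` and beyond.
    """
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)                     # clamp to [0, 1]
    fraction = 1.0 - t * (1.0 - min_fraction)
    visible = max(1, round(strand_count * fraction))
    # Thicken the remaining strands so overall coverage stays roughly constant
    thickness_scale = strand_count / visible
    return visible, thickness_scale
```

A real implementation would also fade shading and raise the alpha threshold with distance, as the bullets on this slide note.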

Other Optimizations
● PPLL Head UAV uses a RWTexture2D instead of a Buffer
● Results in more efficient caching for UAV accesses
● Avoid GPR indexing for sorting
● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
● Used an ALU-based indexing approach to improve performance
● TO DO: compute shader simulation optimizations
● Currently a set of multiple compute shaders
● Looking at combining some of these, optimizing shaders and output formats

Per-Pixel Linked Lists UAV Memory Considerations
● How much memory is needed?
● Guesstimate for a given usage model
● Max (hair pixels x average overdraw) fragments
● What happens when I run out?
● Missing fragments
● What can be done about it?

PP Linked-List (PPLL) vs k-Buffer
● PPLL: node pool holding all fragments. How big?
● k-Buffer: fixed-size array of k entries per pixel. Simple memory bound

The Front k
Approximation to avoid massive sorting
● Only sort the front k fragments per-pixel
● Blend the rest out-of-order
If deferring for shader LOD … also
● Full quality shade on front k
● Cheap shade on rest
[Image: depth complexity, 20 fragments per pixel on average; red = over 100]
● k is 4, 8, 16

The Front k (continued)
● Can’t know the front k until all fragments have been processed
[Diagram: fragments split between the k-buffer and the tail]

k-Buffer
For each fragment in each pixel:
[Diagram: the new fragment is compared with the k-buffer entry at the index of the furthest fragment; whichever fragment ends up outside the k-buffer becomes the tail fragment and is blended into the tail color]

If new fragment is in k
1. Swap with furthest
2. Find new furthest
3. Blend with tail

If not in k
1. Blend with tail
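The k-buffer update steps on these slides can be sketched on the CPU as follows. This is a single-pixel, single-threaded Python illustration under assumptions: colors are scalars, smaller depth means closer to the camera, blending is a plain "over" operator, and the function name is made up. It also folds in the final sort-and-blend of the k entries so the result can be checked end to end.

```python
def resolve_pixel(fragments, k, background):
    """Blend fragments for one pixel, keeping only the front k sorted.

    fragments: iterable of (depth, color, alpha) in arrival order.
    Tail fragments are blended out of order as they are evicted.
    """
    kbuf = []          # up to k frontmost fragments seen so far
    furthest = -1      # index of the furthest entry in kbuf
    tail = background  # running tail color

    for frag in fragments:
        if len(kbuf) < k:
            # The first k fragments fill the buffer directly
            kbuf.append(frag)
            if furthest < 0 or frag[0] > kbuf[furthest][0]:
                furthest = len(kbuf) - 1
            continue
        if frag[0] < kbuf[furthest][0]:
            # 1. Swap with furthest: the evicted entry becomes the tail fragment
            frag, kbuf[furthest] = kbuf[furthest], frag
            # 2. Find new furthest
            furthest = max(range(k), key=lambda i: kbuf[i][0])
        # 3. Blend the tail fragment into the tail color (out of order)
        _, color, alpha = frag
        tail = alpha * color + (1.0 - alpha) * tail

    # Final step: sort the front k back to front and blend them over the tail
    for _, color, alpha in sorted(kbuf, key=lambda f: f[0], reverse=True):
        tail = alpha * color + (1.0 - alpha) * tail
    return tail
```

When a pixel has no more than k fragments the result matches a fully sorted blend. With more than k, only the tail is approximated.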

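From PPLL to k-Buffer
PPLL:
● 1st pass, for each fragment: write fragment to memory
● 2nd pass, for each fragment in each pixel: read fragment from memory, update k-buffer (registers), blend tail fragment (registers)
● Sort and blend k-buffer (registers)
k-Buffer in memory:
● 1st pass, for each fragment: update k-buffer (memory), blend tail fragment (memory)
● 2nd pass, for each pixel: read k-buffer from memory, sort and blend k-buffer (registers)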

k-Buffer
● Screen width × screen height × k entries
● 8 bytes each (depth and data)
● PPLL nodes were 12 bytes (depth, data, next)
● k = 4, 8, 16
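The two memory bounds can be compared with a little arithmetic. The Python sketch below uses the entry sizes quoted on the slides (12-byte PPLL nodes, 8-byte k-buffer entries). The helper names are made up, and the per-pixel head pointer and mutex buffers are deliberately left out.

```python
def ppll_bytes(max_fragments, node_bytes=12):
    # Node pool sized for the worst expected fragment count:
    # roughly hair pixels x average overdraw
    return max_fragments * node_bytes


def kbuffer_bytes(width, height, k, entry_bytes=8):
    # Fixed-size array: k entries for every pixel on screen
    return width * height * k * entry_bytes
```

For example, a pool for 3.5 M fragments (the upper figure on the Stats slide) needs 42,000,000 bytes, while a full-screen 1080p k-buffer with k = 4 needs 66,355,200 bytes. Whether the k-buffer ends up smaller therefore depends on k, the resolution it has to cover and the expected fragment count.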

PPLL: 2nd Pass
[Diagram: each new fragment is tested against the k-buffer, held in registers, using the index of the furthest entry; the resulting tail fragment is blended into the tail color]

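k-Buffer in Memory: 1st Pass
[Diagram: each new fragment is tested against the k-buffer, held in memory, using a per-pixel word (mutex, index, …); the resulting tail fragment goes to the blend unit and is blended into the tail color]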

Mutex/Count/Index Buffer
● One 32-bit word per pixel (screen width × screen height)
● Fields, starting from the high bit: mutex bit, initialized bit, max index (4 bits), count (remainder)
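To make the packing concrete, here is a Python sketch of one possible layout for the 32-bit word. The exact bit positions are an assumption read off the field order on this slide (mutex in bit 31, initialized in bit 30, max index in bits 26 to 29, count in the low 26 bits). The spinlock listing shifts the max index by 28, so the layout in the real shader evidently differs from this guess.

```python
MUTEX_BIT = 1 << 31        # assumed position
INIT_BIT = 1 << 30         # assumed position
MAX_INDEX_SHIFT = 26       # assumed position of the 4-bit max index
MAX_INDEX_MASK = 0xF
COUNT_MASK = (1 << MAX_INDEX_SHIFT) - 1


def pack(locked, initialized, max_index, count):
    """Pack the four fields into one 32-bit word."""
    word = count & COUNT_MASK
    word |= (max_index & MAX_INDEX_MASK) << MAX_INDEX_SHIFT
    if initialized:
        word |= INIT_BIT
    if locked:
        word |= MUTEX_BIT
    return word


def unpack(word):
    """Return (locked, initialized, max_index, count)."""
    return (
        bool(word & MUTEX_BIT),
        bool(word & INIT_BIT),
        (word >> MAX_INDEX_SHIFT) & MAX_INDEX_MASK,
        word & COUNT_MASK,
    )
```

A 4-bit max index addresses 16 slots, which is enough for the largest k used in the talk (16).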

Spinlock Mutex
[allow_uav_condition]
for(; i<MAX_LOOP_COUNT && !bStop; ++i) // Paranoia: bound the number of attempts
{
    uint oldID;
    // Try: atomically mark the pixel as reserved and fetch its previous state
    InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID);
    if( (oldID&RESERVED) != RESERVED ) // not reserved before: we hold the mutex
    {
        [[ … Do work ]]
        DeviceMemoryBarrier();
        // Release: overwrite RESERVED with the new state
        tRWMutex[vScreenAddress] = (new_max_id<<28)+INITED;
        bStop = true;
    } // end mutex check
} // end spinlock loop

Find New Max Depth
// new_max_id is declared outside this snippet
uint new_max_depth = u_inDepth;
// Linear scan over all k entries, each one read back from memory
[unroll] for(int t=0; t<KBUFFER_SIZE; t++)
{
    uint element_depth = DEPTH( vScreenAddress, t );
    if(element_depth > new_max_depth )
    {
        new_max_depth = element_depth;
        new_max_id = t;
    }
}
● Generally more memory traffic than PPLL

Initialization: The first k
Options
● Clear k-buffer fullscreen (0,1)
● Clear k-buffer stenciled, 3rd pass
● Clear on first fragment
● Count
[Diagram: mutex/count/index word: mutex bit, initialized bit, max index (4 bits), count (remainder)]

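The first k
uint oldCount;
// Atomically bump the per-pixel count; oldCount is the value before the add
InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount);
[allow_uav_condition]
if(oldCount < KBUFFER_SIZE)
{
    // One of the first k fragments: store it in slot oldCount
    DATA(vScreenAddress,oldCount) = u_inData;
    DEPTH(vScreenAddress,oldCount) = u_inDepth;
    return uint2(u_outDepth,u_outData);
}
[Diagram: mutex/count/index word: mutex bit, initialized bit, max index (4 bits), count (remainder)]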

Models
2k polygons
~20k hairs, ~130k hairs
Stats
2-3.5 M fragments
200-300k pixels
Shading
One point light & shadow
2 shifted specular lobes

Depth Complexity
Grey 1
Blue 8
Green 50
Red 100+

Contention
Max attempts per pixel, k=4
Dark Blue 1
Aqua <=4
Bright Aqua <=8

Performance
Time ratio to out-of-order blending
● Forward PPLL: 1.02 to 1.4
● Forward k-Buffer: 1.2 to 1.4
● Deferred PPLL: 0.7 to 0.9
● Deferred k-Buffer: 0.9 to 1.6

K-Buffer in Memory
● Simple memory bound
● Can be less memory
● Usually slower
● Increased memory traffic

Hair Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)

Fur Simulation
● Local Constraint
● Model Transform

Grass Simulation
● Local Constraint (1D)
● Model Transform

Constraint Method (iterative)
● Used for length, local and global constraints
● Length is most difficult to converge
● particularly under large movement
[Diagram: a strand of n vertices p0 … pn-1 joined by n-1 constraints C0 … Cn-2]
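For readers unfamiliar with iterative constraint solving, the Python sketch below shows a generic position-based length constraint on one strand. It is a textbook-style illustration, not the TressFX implementation: the function name, the pinned root and the equal split of the correction between the two vertices are all assumptions.

```python
import math


def enforce_length(points, rest_length, iterations=1):
    """Iteratively pull each segment of a strand back to rest_length.

    points: list of (x, y, z) strand vertices; points[0] is the pinned root.
    """
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i in range(len(pts) - 1):
            a, b = pts[i], pts[i + 1]
            delta = [b[j] - a[j] for j in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist == 0.0:
                continue  # degenerate segment, no direction to correct along
            stretch = (dist - rest_length) / dist
            # The root never moves; other vertex pairs share the correction
            wa, wb = (0.0, 1.0) if i == 0 else (0.5, 0.5)
            for j in range(3):
                a[j] += wa * stretch * delta[j]
                b[j] -= wb * stretch * delta[j]
    return pts
```

Fixing one segment disturbs its neighbour, so the error only shrinks gradually from pass to pass. This is the convergence problem the slide refers to, and it gets worse with longer strands and larger movements.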

Tridiagonal Matrix Formulation
● Direct solve for length constraint
● Almost zero stretch
● Limited to smaller time steps (stability)
● Still cheap
● Leverages matrix structure of strands
● Two sweeps of strand
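The "two sweeps" correspond to how a tridiagonal system is usually solved directly: one forward elimination pass and one back-substitution pass (the Thomas algorithm). The Python sketch below is a generic solver of that kind, shown only to illustrate the cost structure. It does not build the specific system from the cited paper, and the argument layout is an assumption.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve A x = rhs for tridiagonal A using two sweeps.

    lower[i], diag[i], upper[i] are the sub-, main- and super-diagonal
    entries of row i (lower[0] and upper[-1] are ignored).
    No pivoting: assumes the system is well conditioned, for example
    diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Sweep 1: forward elimination from the first row to the last
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Sweep 2: back substitution from the last row to the first
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Both sweeps are linear in the number of strand vertices, which is consistent with the "still cheap" bullet on this slide.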

Tridiagonal Matrix Formulation
“Tridiagonal Matrix Formulation for
Inextensible Hair Strand Simulation”,
VRIPHYS, 2013

Summary
● Next-gen look is possible now!
● Deferred Rendering for shading LOD is fastest
● k-buffer in memory is an option for memory-constrained
situations
● High-quality grass and fur simulation with compute
Upcoming TressFX 2 SDK sample update with fur scenario at
http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/

Isoline Tessellation for hair/fur? 1/2
● Isoline tessellation has two tess factors
● First is line density (lines per invocation)
● Second is line detail (segments per line)
● In theory provides easy LOD system
● Variable line density and detail by increasing both tessellation factors
based on distance
Tess = (1,1) Tess = (2,1) Tess = (2,2) Tess = (2,3) Tess = (3,3)

Isoline Tessellation for hair/fur? 2/2
● In practice isoline tessellation is not cost effective for this scenario
● Lines are always 1-pixel thick
● Need GS to extrude them into triangles for smooth edges
● Major impact on performance!
● Alternative is to enable MSAA
● Most engines are deferred so this causes a large performance impact
● No extrusion for smoothing edges and no MSAA = poor quality!
● Bottom line: a pure Vertex Shader solution is faster
● LOD benefit is easily done in VS (more on this later)
● Curvature is rarely a problem (dependent on vertices/strands at authoring time)

AA, Self-shadowing and Transparency
[Image sequence: basic rendering; antialiasing; antialiasing + self-shadowing; antialiasing + self-shadowing + transparency]
