Grass, Fur and all things hairy
Nicolas Thibieroz, Gaming Engineering Manager, AMD
Karl Hillesland, Senior Research Engineer, AMD
Next-gen Grass, Fur and Hair
● The time for next-gen quality is now
● Tomb Raider pioneered next-gen hair
● Even on PS4/XB1
● Users expect this level of quality for next-gen titles
● You need to start thinking about this
● This talk is about making high-quality fur,
grass and hair run at real-time performance
TressFX applied to Grass, Fur and Hair
● Variations of the same technique can be used for all those
applications
● In all cases the core principles of next-gen quality are still
needed:
● Compute simulations
● Anti-aliasing
● Transparency
● Volumetric self-shadowing
● A good lighting model
Forward Rendering Pipeline – a refresher
● Consists of three steps:
● Hair simulation
● Shade and store fragments into buffers
● Fetch shaded fragments, sort and render
Per-Pixel Linked Lists
● Head UAV
● Each pixel location has a “head pointer” to a linked list in the PPLL UAV
● PPLL UAV
● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter)
● A link is created to the fragment pointed to by the head pointer
● Head pointer then points to the new fragment
// Retrieve current pixel count and increase counter
uint uPixelCount = LinkedListUAV.IncrementCounter();
uint uOldStartOffset;
// Exchange indices in LinkedListHead texture corresponding to pixel location
InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
// Append new element at the end of the Fragment and Link Buffer
Element.uNext = uOldStartOffset;
LinkedListUAV[uPixelCount] = Element;
[Diagram: Head UAV entries pointing into the PPLL UAV]
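The insertion steps above can be modelled on the CPU. The following is only a Python sketch, not shader code: the class name, the `END_OF_LIST` sentinel and the overflow check are assumptions made for illustration, and a real pixel shader relies on `IncrementCounter()` and `InterlockedExchange()` to stay correct when many fragments are written concurrently.

```python
# CPU-side model of per-pixel linked list insertion (illustrative, not shader code).
END_OF_LIST = 0xFFFFFFFF  # assumed sentinel meaning "no next fragment"


class PerPixelLinkedList:
    def __init__(self, width, height, max_nodes):
        self.width = width
        self.head = [END_OF_LIST] * (width * height)  # models the Head UAV
        self.nodes = [None] * max_nodes               # models the PPLL UAV
        self.counter = 0                              # models the UAV counter

    def insert(self, x, y, depth, data):
        """Add one fragment; returns False when the node pool is full."""
        if self.counter >= len(self.nodes):
            return False  # pool exhausted: this fragment is lost
        index = self.counter           # IncrementCounter() returns the old value
        self.counter += 1
        address = y * self.width + x
        old_head = self.head[address]  # InterlockedExchange: fetch old head...
        self.head[address] = index     # ...and make the new node the head
        self.nodes[index] = (depth, data, old_head)  # link to the previous head
        return True

    def fragments(self, x, y):
        """Walk the list of one pixel, newest fragment first."""
        index = self.head[y * self.width + x]
        while index != END_OF_LIST:
            depth, data, next_index = self.nodes[index]
            yield depth, data
            index = next_index


ppll = PerPixelLinkedList(width=2, height=2, max_nodes=3)
ppll.insert(0, 0, 0.5, "a")
ppll.insert(1, 0, 0.2, "b")
ppll.insert(0, 0, 0.1, "c")
dropped = not ppll.insert(1, 1, 0.9, "d")  # fourth fragment does not fit
```

Note that the list for a pixel comes back in reverse submission order, not depth order, which is why a later pass has to sort.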
Forward Rendering Pipeline – a refresher
Hair Simulation
[Diagram: input geometry (model space) and simulation parameters feed a chain of compute shader (CS) passes, which write post-simulation geometry (world space) to a UAV]
Forward Rendering Pipeline – a refresher
Shade and Store Fragments into Buffers
[Diagram: the VS extrudes world-space line segments into non-indexed triangles in homogeneous clip space; the PS computes coverage, lighting and shadows, renders to a null RT with stencil, and stores each fragment (depth, color, coverage, next) in the PPLL UAV while updating the Head UAV]
Forward Rendering Pipeline – a refresher
Fetch shaded fragments, sort and render
[Diagram: a full-screen quad is drawn with stencil; the PS reads the Head UAV and PPLL UAV, performs fragment sorting and manual blending, and writes to the render target]
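As a rough illustration of the "fragment sorting and manual blending" step, the Python sketch below resolves one pixel on the CPU. It assumes each stored fragment carries a depth, an RGB color and an alpha (coverage) value, that smaller depth means closer to the camera, and that blending uses the standard "over" operator; the function name and data layout are hypothetical.

```python
def resolve_pixel(fragments, background):
    """Blend one pixel's fragments back to front.

    fragments: list of (depth, (r, g, b), alpha); smaller depth = closer.
    """
    color = background
    # Sort far-to-near, then apply the "over" operator one fragment at a time.
    for _, rgb, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = tuple(alpha * c + (1.0 - alpha) * d for c, d in zip(rgb, color))
    return color


# The list order mimics the arbitrary order in which fragments were stored.
frags = [(0.2, (1.0, 0.0, 0.0), 0.5), (0.8, (0.0, 0.0, 1.0), 0.5)]
result = resolve_pixel(frags, background=(0.0, 0.0, 0.0))
```

Because the fragments are sorted before blending, the result does not depend on the order in which they were appended to the list.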
Forward Rendering Performance
● Main cost in forward rendering mode is in the
shading part
● All fragments are lit and shadowed before being stored
● PPLL storing is typically not the bottleneck!
● Don’t need maximum quality on all fragments
● “tail” fragments need only “good enough” quality
● Solution: Use shader LOD
Forward vs Deferred Rendering Pipeline
Deferred rendering pipeline
● Hair simulation
● Store fragment properties into
buffers
● Fetch fragment properties, sort,
shade and render
● Full shading on K-frontmost
fragments
● “Tail” fragments are shaded with a
simpler light equation and
shadowing algorithm
Forward rendering pipeline
● Hair simulation
● Full shading and store
fragments into buffers
● Fetch shaded fragments, sort
and render
Deferred Rendering Pipeline
Hair Simulation – unchanged!
[Diagram: input geometry (model space) and simulation parameters feed a chain of compute shader (CS) passes, which write post-simulation geometry (world space) to a UAV]
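Deferred Rendering Pipeline – a refresher
Store Fragment Properties into Buffers
[Diagram: an indexed triangle list (index buffer) is drawn; the VS transforms world-space geometry to homogeneous clip space; the PS computes coverage, renders to a null RT with stencil, and stores each fragment (depth, tangent, coverage, next) in the PPLL UAV while updating the Head UAV]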
Deferred Rendering Pipeline
Fetch fragments, sort, shade and render
[Diagram: a full-screen quad is drawn with stencil; the PS reads the Head UAV and PPLL UAV, applies lighting and shadows, and writes to the render target]
● K frontmost fragments: full shading, sorting and manual blending
● Tail fragments: cheap shading, no sorting and manual blending
Deferred Rendering Shading LOD Optimization
● Deferred approach allows a reduction in shading cost (“shader LOD”)
● Only sort and shade K frontmost fragments at high quality
● “Simple” shading and out-of-order rendering on tail fragments
● Single-tap shadowing on tail fragments
● Very little quality difference compared to full shading
● But much better performance!
Technique                   Cost
Out of order, no shading    1.31 ms
Out of order, shading       2.80 ms
Forward PPLL, shading       3.38 ms
Deferred PPLL, shading      2.13 ms
Fur model with ~130,000 fur strands, running on AMD Radeon 7970 @ 1080p
Slide annotations:
● Shading cost is ~1.5 ms
● PPLL cost is ~0.58 ms
● Full quality shading forced on for all fragments
● Shading LOD
● Fast!
Geometry Optimizations
● A great portion of time was spent in the GPU front-end
● 920,000 line segments for fur model
● Expansion from line segments to triangles was done in GS and then VS with Draw()
● Each segment would create a quad (two triangles) with 6 vertices
DrawIndexed() method
[Diagram: line segments expanded into quads; neighbouring quads share vertices 0 to 5]
Indexed triangle list = { (0, 1, 2), (2, 1, 3), (2, 3, 4), (4, 3, 5), (…) };
Draw() method
[Diagram: line segments expanded into quads; every triangle has its own vertices 0 to 11, so shared corners are duplicated]
Triangle list = { (0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11), (…) };
● Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!
● Input line segments have a random order
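The indexed pattern shown above, (0, 1, 2), (2, 1, 3), (2, 3, 4), (4, 3, 5), can be generated offline with a few lines of code. This Python sketch assumes that the segments of one strand are consecutive and that each strand vertex expands to two quad corners; the function names are hypothetical and not taken from the TressFX sample.

```python
def strand_indices(num_segments, first_vertex=0):
    """Index list for one strand whose segments are expanded into quads.

    Quad s uses corners b, b+1 (start of segment) and b+2, b+3 (end),
    with b = first_vertex + 2 * s, so adjacent quads share two corners.
    """
    indices = []
    for s in range(num_segments):
        b = first_vertex + 2 * s
        indices += [b, b + 1, b + 2,   # first triangle of the quad
                    b + 2, b + 1, b + 3]  # second triangle of the quad
    return indices


def vertex_counts(num_segments):
    """Vertices processed per strand: (DrawIndexed unique, Draw non-indexed)."""
    return 2 * (num_segments + 1), 6 * num_segments


two_segments = strand_indices(2)
indexed, non_indexed = vertex_counts(2)
```

For a two-segment strand this gives 6 unique vertices with DrawIndexed() against 12 with Draw(), and the gap widens as strands get longer.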
Distance-based LOD system Optimization
● Just render fewer (but thicker) fragments when far away!
● Needs shading adjustments to ensure smooth quality transitions
● Increase alpha threshold for fragment inclusion when far away
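One possible way to express "fewer but thicker" is sketched below in Python. The slides do not give the actual mapping, so everything here is an assumption: the linear ramp between `lod_start` and `lod_end`, the `min_density` floor, and the choice to scale strand width by the inverse of the density so that total covered area stays roughly constant.

```python
def distance_lod(distance, lod_start, lod_end, min_density):
    """Return (fraction of strands to draw, strand width multiplier).

    All parameters are hypothetical tuning values, not TressFX settings.
    """
    t = (distance - lod_start) / (lod_end - lod_start)
    t = min(max(t, 0.0), 1.0)                # clamp to [0, 1]
    density = 1.0 + t * (min_density - 1.0)  # ramp from 1.0 down to min_density
    width_scale = 1.0 / density              # thicker strands compensate for fewer
    return density, width_scale


near = distance_lod(5.0, lod_start=10.0, lod_end=110.0, min_density=0.25)
mid = distance_lod(60.0, lod_start=10.0, lod_end=110.0, min_density=0.25)
far = distance_lod(200.0, lod_start=10.0, lod_end=110.0, min_density=0.25)
```

A continuous ramp like this is one way to avoid visible popping, but the shading adjustments mentioned above would still be needed.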
Other Optimizations
● PPLL Head UAV uses a RWTexture2D instead of a Buffer
● Results in more efficient caching for UAV accesses
● Avoid GPR indexing for sorting
● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
● Used an ALU-based indexing approach to improve performance
● TO DO: compute shader simulation optimizations
● Currently a set of multiple compute shaders
● Looking at combining some of these, optimizing shaders and output formats
Per-Pixel Linked Lists UAV Memory Considerations
● How much memory is needed?
● Guesstimate for a given usage model
● Max (hair pixels x average overdraw) fragments
● What happens when I run out?
● Missing fragments
● What can be done about it?
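The guesstimate above is simple arithmetic. The Python sketch below assumes 12-byte nodes (the node size quoted later in the talk) and a 4-byte head pointer per screen pixel; the input numbers are made-up examples, not measurements.

```python
def ppll_node_pool_bytes(hair_pixels, average_overdraw, node_bytes=12):
    """Upper estimate of the PPLL node pool: one node per stored fragment."""
    max_fragments = hair_pixels * average_overdraw
    return max_fragments * node_bytes


def head_buffer_bytes(width, height, bytes_per_pointer=4):
    """Head UAV: one head pointer per screen pixel (assumed 32-bit)."""
    return width * height * bytes_per_pointer


pool = ppll_node_pool_bytes(hair_pixels=300_000, average_overdraw=10)
head = head_buffer_bytes(1920, 1080)
```

If the scene ends up producing more fragments than the estimate allows, the extra fragments are simply missing from the final image.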
k-Buffer in Memory
[Diagram: a PP Linked-List (PPLL) keeps all fragments in a node pool (how big?); a k-buffer is a fixed-size array of k entries per pixel, which gives a simple memory bound]
The Front k
Approximation to avoid massive sorting
● Only sort the front k fragments per-pixel
● Blend the rest out-of-order
If deferring for shader LOD … also
● Full quality shade on front k
● Cheap shade on rest
[Image: fragments per pixel, 20 on average; red = over 100]
k is 4, 8 or 16
k-Buffer
Tail
Can’t know front k
until all fragments processed
k-Buffer
For each fragment in each pixel:
● If the new fragment is in k
1. Swap with furthest
2. Find new furthest
3. Blend with tail
● If not in k
1. Blend with tail
[Diagram: the k-buffer keeps the index of its furthest entry; a new fragment either replaces that entry, which then becomes the tail fragment, or becomes the tail fragment itself; the tail fragment is blended into the tail color]
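The update rule above can be tried out on the CPU. This Python sketch makes several assumptions that the slides do not spell out: fragments are (depth, rgb, alpha) tuples with smaller depth meaning closer, the tail is blended onto the background with the "over" operator in arrival order, and the first k fragments simply fill the buffer. The function name is hypothetical.

```python
def resolve_with_k_buffer(fragments, k, background):
    """Stream fragments through a k-buffer, then sort and blend the front k."""
    def over(dst, frag):
        _, rgb, a = frag
        return tuple(a * c + (1.0 - a) * d for c, d in zip(rgb, dst))

    k_buffer = []      # the k nearest fragments seen so far
    furthest = -1      # index of the furthest entry in k_buffer
    tail = background  # tail fragments are blended here, out of order

    for frag in fragments:
        if len(k_buffer) < k:
            k_buffer.append(frag)         # the first k fragments fill the buffer
            evicted = None
        elif frag[0] < k_buffer[furthest][0]:
            evicted = k_buffer[furthest]  # 1. swap with furthest
            k_buffer[furthest] = frag
        else:
            evicted = frag                # not in k: goes straight to the tail
        # 2. find new furthest
        furthest = max(range(len(k_buffer)), key=lambda i: k_buffer[i][0])
        if evicted is not None:
            tail = over(tail, evicted)    # 3. blend with tail

    color = tail
    for frag in sorted(k_buffer, key=lambda f: f[0], reverse=True):
        color = over(color, frag)         # front k: sorted, back to front
    return color, k_buffer


frags = [(0.9, (1.0, 0.0, 0.0), 0.5), (0.1, (0.0, 1.0, 0.0), 0.5),
         (0.5, (0.0, 0.0, 1.0), 0.5), (0.3, (1.0, 1.0, 1.0), 0.5)]
color_k2, buffer_k2 = resolve_with_k_buffer(frags, 2, (0.0, 0.0, 0.0))
color_k4, buffer_k4 = resolve_with_k_buffer(frags, 4, (0.0, 0.0, 0.0))
```

With k at least as large as the fragment count the result matches a full sort; with a smaller k only the tail is approximate, and in this particular input the tail fragments happen to arrive back to front, so both results agree.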
From PPLL to k-Buffer
For each pixel:
Write frags to mem
For each fragment in each pixel
read fragment from mem
update k-buffer (reg)
blend tail fragment (reg)
Read k-buffer from mem
Sort and blend k-buffer (reg)
update k-buffer (mem)
blend tail fragment (mem)
k-Buffer memory layout
● Screen width × screen height × k entries
● 8 bytes each (depth and data)
● PPLL nodes were 12 bytes (depth, data, next)
● k = 4, 8, 16
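The memory bound follows directly from the layout above. A small Python sketch, assuming a 1920×1080 target (the resolution quoted for the benchmark) and the 8-byte entries from the slide:

```python
def k_buffer_bytes(width, height, k, entry_bytes=8):
    """Fixed-size k-buffer: k entries of (depth, data) for every screen pixel."""
    return width * height * k * entry_bytes


sizes = {k: k_buffer_bytes(1920, 1080, k) for k in (4, 8, 16)}
```

Unlike the PPLL node pool, this size depends only on resolution and k, never on how much overdraw the scene produces.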
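PPLL: 2nd Pass
[Diagram: each new fragment is compared against the index of the furthest k-buffer entry; the k-buffer is held in registers, and the resulting tail fragment is blended into the tail color]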
k-Buffer in Memory: 1st Pass
[Diagram: the same update, but the k-buffer and the per-pixel mutex, index, … are held in memory; the tail fragment is blended into the tail color by the blend unit]
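Mutex/Count/Index Buffer
● One 32-bit value per pixel (screen width × screen height)
● Mutex bit
● Initialized bit
● Max index (4 bits)
● Count (remainder)
[Diagram: the fields laid out within the 32-bit word, with the high bit marked]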
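To make the packing concrete, here is a Python sketch of one possible bit layout. Only the position of the max index (shifted left by 28, as in the spinlock code in this talk) comes from the source; the positions chosen for the mutex and initialized bits, and all constant names, are assumptions for illustration.

```python
MAX_INDEX_SHIFT = 28        # from the talk's code: (new_max_id << 28)
MAX_INDEX_MASK = 0xF        # 4 bits, enough for k up to 16
MUTEX_BIT = 1 << 27         # assumed position
INITED_BIT = 1 << 26        # assumed position
COUNT_MASK = (1 << 26) - 1  # the remaining low bits hold the count


def pack(max_index, locked, inited, count):
    """Pack the per-pixel state into a single 32-bit word."""
    word = (max_index & MAX_INDEX_MASK) << MAX_INDEX_SHIFT
    if locked:
        word |= MUTEX_BIT
    if inited:
        word |= INITED_BIT
    return word | (count & COUNT_MASK)


def unpack(word):
    """Return (max_index, locked, inited, count)."""
    return ((word >> MAX_INDEX_SHIFT) & MAX_INDEX_MASK,
            bool(word & MUTEX_BIT),
            bool(word & INITED_BIT),
            word & COUNT_MASK)


state = unpack(pack(max_index=5, locked=True, inited=False, count=1234))
```

Keeping the count in the low bits means a plain atomic add of 1 increments it without disturbing the other fields, as long as it never overflows into them.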
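Spinlock Mutex
[allow_uav_condition]
for(; i<MAX_LOOP_COUNT && !bStop; ++i) // Paranoia: bound the number of attempts
{
uint oldID;
// Try: atomically mark the pixel as reserved and fetch its previous state
InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID);
if( (oldID&RESERVED) != RESERVED )
{
// Do work: nobody else held the lock
[[ … Do work ]]
DeviceMemoryBarrier();
// Release: write back the new max index without the RESERVED bit
tRWMutex[vScreenAddress] = (new_max_id<<28)+INITED;
bStop = true;
} // end mutex check
}// end spinlock loop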
Find New Max Depth
uint new_max_depth = u_inDepth;
[unroll] for(int t=0; t<KBUFFER_SIZE; t++)
{
uint element_depth = DEPTH( vScreenAddress, t );
if(element_depth > new_max_depth )
{
new_max_depth = element_depth;
new_max_id = t;
}
}
Generally more memory traffic than PPLL
Initialization: The first k
Options
● Clear k-buffer fullscreen (0,1)
● Clear k-buffer stenciled, 3rd pass
● Clear on first fragment
● Count
The first k
InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount);
[allow_uav_condition]
if(oldCount < KBUFFER_SIZE)
{
DATA(vScreenAddress,oldCount) = u_inData;
DEPTH(vScreenAddress,oldCount) = u_inDepth;
return uint2(u_outDepth,u_outData);
}
Models
2k polygons
~20k hairs
~130k hairs
Stats
2-3.5 M fragments
200-300k pixels
Shading
One point light & shadow
2 shifted specular lobes
Depth Complexity
Grey 1
Blue 8
Green 50
Red 100+
Contention
Max attempts per pixel, k=4
Dark Blue 1
Aqua <=4
Bright Aqua <=8
Performance
Time ratio to out-of-order blending
● Forward PPLL: 1.02 to 1.4
● Forward k-Buffer: 1.2 to 1.4
● Deferred PPLL: 0.7 to 0.9
● Deferred k-Buffer: 0.9 to 1.6
K-Buffer in Memory
● Simple memory bound
● Can be less memory
● Usually slower
● Increased memory traffic
Simulation
Hair Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
Fur Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
Grass Simulation
● Length Constraint
● Local Constraint (1D)
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
Constraint Method (iterative)
● Used for length, local and global constraints
● Length is most difficult to converge
● particularly under large movement
[Diagram: a strand of particles p0 … pn-1 joined by constraints C0 … Cn-2]
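To show why the length constraint is slow to converge under large movement, here is a small Python sketch of an iterative (Gauss-Seidel style) length projection on a 2D strand. It is a generic position-based scheme under stated assumptions (pinned root, equal weights on the other particles, one shared rest length), not the TressFX shader code.

```python
import math


def enforce_length(points, rest_length, iterations):
    """Iteratively restore segment lengths; points[0] is the pinned root."""
    p = [list(q) for q in points]
    for _ in range(iterations):
        for i in range(len(p) - 1):
            dx = p[i + 1][0] - p[i][0]
            dy = p[i + 1][1] - p[i][1]
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue  # degenerate segment: no direction to correct along
            corr = (dist - rest_length) / dist
            # The root never moves; other pairs split the correction equally.
            w0, w1 = (0.0, 1.0) if i == 0 else (0.5, 0.5)
            p[i][0] += w0 * corr * dx
            p[i][1] += w0 * corr * dy
            p[i + 1][0] -= w1 * corr * dx
            p[i + 1][1] -= w1 * corr * dy
    return [tuple(q) for q in p]


def max_stretch(points, rest_length):
    """Largest deviation of any segment from its rest length."""
    return max(abs(math.dist(a, b) - rest_length)
               for a, b in zip(points, points[1:]))


# A strand stretched to twice its rest length by a large movement.
stretched = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0)]
after_5 = max_stretch(enforce_length(stretched, 1.0, 5), 1.0)
after_50 = max_stretch(enforce_length(stretched, 1.0, 50), 1.0)
```

Each pass fixes one segment at the expense of its neighbours, so the error only shrinks gradually: in this example a handful of iterations still leaves visible stretch, while dozens are needed to remove it.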
Tridiagonal Matrix Formulation
● Direct solve for length constraint
● Almost zero stretch
● Limited to smaller time steps (stability)
● Still cheap
● Leverages matrix structure of strands
● Two sweeps of strand
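The "two sweeps" correspond to the forward elimination and back substitution of a tridiagonal solve. The Python sketch below is the generic Thomas algorithm; the actual matrix entries for the hair length constraint come from the cited paper and are not reproduced in these slides, so the example system here is only a stand-in.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] is the sub-diagonal entry of row i (lower[0] is unused),
    upper[i] is the super-diagonal entry of row i (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Sweep 1: forward elimination from the first row to the last.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Sweep 2: back substitution from the last row to the first.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Stand-in system: [[2, -1, 0], [-1, 2, -1], [0, -1, 2]] x = [1, 0, 1]
solution = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                             [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])
```

The cost is linear in the number of particles per strand, which is consistent with the "still cheap" point above.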
Reference: “Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation”, VRIPHYS, 2013
Demos
Summary
● Next-gen look is possible now!
● Deferred Rendering for shading LOD is fastest
● k-buffer in memory is an option for memory-constrained
situations
● High-quality grass and fur simulation with compute
Upcoming TressFX 2 SDK sample update with fur scenario at
http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
Questions?
Extras
Isoline Tessellation for hair/fur? 1/2
● Isoline tessellation has two tess factors
● First is line density (lines per invocation)
● Second is line detail (segments per line)
● In theory provides easy LOD system
● Variable line density and detail by increasing both tessellation factors
based on distance
Tess = (1,1) Tess = (2,1) Tess = (2,2) Tess = (2,3) Tess = (3,3)
Isoline Tessellation for hair/fur? 2/2
● In practice isoline tessellation is not cost effective for this scenario
● Lines are always 1-pixel thick
● Need GS to extrude them into triangles for smooth edges
● Major impact on performance!
● Alternative is to enable MSAA
● Most engines are deferred so this causes a large performance impact
● No extrusion for smoothing edges and no MSAA = poor quality!
● Bottom line: a pure Vertex Shader solution is faster
● LOD benefit is easily done in VS (more on this later)
● Curvature is rarely a problem (dependent on vertices/strands at authoring time)
AA, Self-shadowing and Transparency
[Figure panels: basic rendering; antialiasing; antialiasing + self-shadowing; antialiasing + self-shadowing + transparency]

Grass, Fur and all Things Hairy - AMD at GDC14

  • 1. Grass, Fur and all things hairy Nicolas Thibieroz Karl Hillesland Gaming Engineering Manager, AMD Senior Research Engineer, AMD
  • 2. Next-gen Grass, Fur and Hair ● The time for next-gen quality is now ● Tomb Raider pioneered next-gen hair ● Even on PS4/XB1 ● Users expect this level of quality for next- gen titles ● You need to start thinking about this ● This talk is about making high-quality fur, grass and hair run at real-time performance
  • 3. TressFX applied to Grass, Fur and Hair ● Variations of the same technique can be used for all those applications ● In all cases the core principles of next-gen quality are still needed: ● Compute simulations ● Anti-aliasing ● Transparency ● Volumetric self-shadowing ● A good lighting model
  • 4. Forward Rendering Pipeline – a refresher ● Consists of three steps: ● Hair simulation ● Shade and store fragments into buffers ● Fetch shaded fragments, sort and render
  • 5. // Retrieve current pixel count and increase counter uint uPixelCount = LinkedListUAV.IncrementCounter(); uint uOldStartOffset; // Exchange indices in LinkedListHead texture corresponding to pixel location InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset); // Append new element at the end of the Fragment and Link Buffer Element.uNext = uOldStartOffset; LinkedListUAV[uPixelCount] = Element; ● Head UAV ● Each pixel location has a “head pointer” to a linked list in the PPLL UAV ● PPLL UAV ● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter) ● A link is created to the fragment pointed to by the head pointer ● Head pointer then points to the new fragment Per-Pixel Linked Lists Head UAV PPLL UAV
  • 6. CSCSCS Input Geometry Post-simulation geometry (UAV) Forward Rendering Pipeline – a refresher Hair Simulation Simulation parameters Model space World space
  • 7. Forward Rendering Pipeline – a refresher Shade and Store fragments into Buffers Coverage depth color coverage next Lighting VS PS Homogeneous clip space World space Null RT Stencil PPLL UAV Head UAV Shadows Extrusion from line segments to non-indexed triangles
  • 8. Full Screen Quad Forward Rendering Pipeline – a refresher Fetch shaded fragments, sort and render VS PS Stencil Head UAV PPLL UAV Render target Fragment sorting and manual blending
  • 9. Forward Rendering Performance ● Main cost in forward rendering mode is in the shading part ● All fragments are lit and shadowed before being stored ● PPLL storing is typically not the bottleneck! ● Don’t need maximum quality on all fragments ● “tail” fragments need only “good enough” quality ● Solution: Use shader LOD
  • 10. Forward vs Deferred Rendering Pipeline Deferred rendering pipeline ● Hair simulation ● Store fragment properties into buffers ● Fetch fragment properties, sort, shade and render ● Full shading on K-frontmost fragments ● “Tail” fragments are shaded with a simpler light equation and shadowing algorithm Forward rendering pipeline ● Hair simulation ● Full shading and store fragments into buffers ● Fetch shaded fragments, sort and render
  • 11. CSCSCS Input Geometry Post-simulation geometry (UAV) Deferred Rendering Pipeline Hair Simulation – unchanged! Simulation parameters Model space World space
  • 12. Deferred Rendering Pipeline – a refresher Store Fragment Properties into Buffers Coverage depth tangent coverage next VS PS Homogeneous clip space World space Null RT Stencil PPLL UAV Head UAV Index Buffer Indexed triangle list
  • 13. Deferred Rendering Pipeline Fetch fragments, sort, shade and render VS PS Stencil Head UAV PPLL UAV Render target K frontmost fragments: full shading, sorting and manual blending Lighting Shadows Full Screen Quad Tail fragments: cheap shading, no sorting and manual blending
  • 14. Deferred Rendering Shading LOD Optimization ● Deferred approach allows a reduction in shading cost (“Shader LOD”) ● Only sort and shade K frontmost fragments at high quality ● “Simple” shading and out-of-order rendering on tail fragments ● Single-tap shadowing on tail fragments ● Very little quality difference compared to full shading ● But much better performance!
       Technique                   Cost
       Out of order, no shading    1.31 ms
       Out of order, shading       2.80 ms
       Forward PPLL, shading       3.38 ms
       Deferred PPLL, shading      2.13 ms
       Fur model with ~130,000 fur strands, running on AMD Radeon 7970 @ 1080p
       Shading cost is ~1.5 ms; PPLL cost is ~0.58 ms. Fast!
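The two callout figures appear to be differences between rows of the timing table. The short Python check below reproduces them under that reading, which is an inference from the numbers rather than something the slide states; the variable names are illustrative.

```python
# Timings quoted on the slide, in milliseconds.
cost_ms = {
    "Out of order, no shading": 1.31,
    "Out of order, shading": 2.80,
    "Forward PPLL, shading": 3.38,
    "Deferred PPLL, shading": 2.13,
}

# Shading cost: same out-of-order blending, with and without shading.
shading_ms = cost_ms["Out of order, shading"] - cost_ms["Out of order, no shading"]

# PPLL cost: forward PPLL versus out-of-order blending, both fully shaded.
ppll_ms = cost_ms["Forward PPLL, shading"] - cost_ms["Out of order, shading"]

# What the deferred path (shader LOD) saves relative to the forward path.
deferred_saving_ms = cost_ms["Forward PPLL, shading"] - cost_ms["Deferred PPLL, shading"]

print(f"shading ~{shading_ms:.2f} ms, PPLL ~{ppll_ms:.2f} ms, "
      f"deferred saves ~{deferred_saving_ms:.2f} ms")
```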
  • 15. Full quality shading forced on for all fragments Shading LOD
  • 16. Geometry Optimizations ● A great portion of time was spent in the GPU front-end ● 920,000 line segments for fur model ● Expansion from line segments to triangles was done in GS and then VS with Draw() ● Each segment would create a quad (two triangles) with 6 vertices ● Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!
       Draw() method: Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };
       DrawIndexed() method: Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
       [Diagram: line segments and their expanded quads, with the vertex numbering used by each method]
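The indexed triangle list on this slide follows a regular pattern, so the offline index buffer can be generated with a few lines. The Python sketch below covers a single strand and assumes two extruded vertices per strand point (so segment `s` uses vertices `2s` to `2s + 3`); the function names are illustrative, not part of TressFX.

```python
def strand_index_buffer(num_segments):
    """Indexed triangle list for one strand made of `num_segments` line segments."""
    indices = []
    for s in range(num_segments):
        v = 2 * s  # first of the four vertices shared by this segment's quad
        indices += [v, v + 1, v + 2,   # first triangle of the quad
                    v + 2, v + 1, v + 3]  # second triangle of the quad
    return indices


def vertex_counts(num_segments):
    """Vertices processed per strand: non-indexed Draw() versus DrawIndexed()."""
    non_indexed = 6 * num_segments   # every quad emits 6 unique vertices
    indexed = 2 * (num_segments + 1)  # neighbouring quads share an edge
    return non_indexed, indexed
```

For two segments this yields `(0, 1, 2), (2, 1, 3), (2, 3, 4), (4, 3, 5)`, matching the slide, and the shared vertices are what let the post vertex cache avoid reprocessing them.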
  • 17. Distance-based LOD system Optimization ● Input line segments have a random order ● Just render fewer (but thicker) fragments when far away! ● Needs shading adjustments to ensure smooth quality transitions ● Increase alpha threshold for fragment inclusion when far away
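One way to read "fewer but thicker" is sketched below in Python. Every parameter here (`lod_start`, `lod_end`, `min_fraction`) is hypothetical, and scaling the width by the inverse of the rendered fraction is an assumption chosen to keep total coverage roughly constant; the slides do not give the actual mapping.

```python
def strand_lod(distance, lod_start, lod_end, min_fraction):
    """Return (fraction of strands to draw, strand width scale) for a distance."""
    # 0 at lod_start or closer, 1 at lod_end or further, linear in between.
    t = (distance - lod_start) / (lod_end - lod_start)
    t = min(max(t, 0.0), 1.0)
    fraction = 1.0 - t * (1.0 - min_fraction)
    # Assumed compensation: thicker strands make up for the ones not drawn.
    width_scale = 1.0 / fraction
    return fraction, width_scale


def strands_to_draw(total_strands, fraction):
    """Strands are stored in random order, so drawing a prefix is a fair sample."""
    return max(1, int(total_strands * fraction))
```

Because the input order is already random, no per-frame selection is needed: reducing the draw count is enough to thin the model evenly.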
  • 18. Other Optimizations ● PPLL Head UAV uses a RWTexture2D instead of a Buffer ● Results in more efficient caching for UAV accesses ● Avoid GPR indexing for sorting ● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it ● Used an ALU-based indexing approach to improve performance ● TO DO: compute shader simulation optimizations ● Currently a set of multiple compute shaders ● Looking at combining some of these, optimizing shaders and output formats
  • 19. Per-Pixel Linked Lists UAV Memory Considerations ● How much memory is needed? ● Guesstimate for a given usage model ● Max (hair pixels x average overdraw) fragments ● What happens when I run out? ● Missing fragments ● What can be done about it?
  • 21. PP Linked-List (PPLL) k-Buffer fixed size array Node Pool All fragments How big? [Diagram: grid of k entries per pixel] Simple Memory Bound
  • 22. The Front k Approximation to avoid massive sorting ● Only sort the front k fragments per-pixel ● Blend the rest out-of-order If deferring for shader LOD … also ● Full quality shade on front k ● Cheap shade on rest 20 frags/pixel (ave) Red = over 100 k is 4, 8, 16
  • 23. The Front k Approximation to avoid massive sorting ● Only sort the front k fragments per-pixel ● Blend the rest out-of-order If deferring for shader LOD … also ● Full quality shade on front k ● Cheap shade on rest k-Buffer Tail Can’t know front k until all fragments processed
  • 24. k-Buffer For Each Fragment in Each Pixel Index of furthest New Fragment Blend Tail Color Tail Fragment
  • 25. If New Fragment in k Index of furthest k-Buffer Blend Tail Color If in k 1. Swap with furthest 2. Find new furthest 3. Blend with tail Tail Fragment New Fragment
  • 26. If not in k Index of furthest k-Buffer Blend Tail Color If not in k 1. Blend with tail Tail Fragment New Fragment
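The per-fragment update on the last three slides (swap with the furthest, find the new furthest, blend the displaced fragment into the tail) can be traced with a CPU model of one pixel. The Python sketch below is an illustration only: the `Fragment` type, the blend formula (a standard "over" composite that also tracks remaining transmittance) and the final background composite are assumptions, not the shader used in the talk.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    depth: float   # smaller means closer to the camera
    color: tuple   # (r, g, b)
    alpha: float   # opacity in [0, 1]


def blend_over(acc_color, acc_trans, frag):
    """Composite `frag` on top of everything accumulated so far."""
    a = frag.alpha
    color = tuple(c * (1.0 - a) + f * a for c, f in zip(acc_color, frag.color))
    return color, acc_trans * (1.0 - a)


def resolve_pixel(fragments, k, background=(0.0, 0.0, 0.0)):
    if k < 1:
        raise ValueError("k must be at least 1")
    kbuf = []                             # the k nearest fragments seen so far
    color, trans = (0.0, 0.0, 0.0), 1.0   # tail color and remaining transmittance
    for frag in fragments:                # arrival order, i.e. unsorted
        if len(kbuf) < k:
            kbuf.append(frag)             # the first k fragments just fill the buffer
            continue
        furthest = max(range(k), key=lambda i: kbuf[i].depth)
        if frag.depth < kbuf[furthest].depth:
            # In k: swap with the furthest entry, which becomes the tail fragment.
            kbuf[furthest], frag = frag, kbuf[furthest]
        # Whatever `frag` holds now is a tail fragment: blend it out of order.
        color, trans = blend_over(color, trans, frag)
    # Only the k survivors are sorted, then blended back to front.
    for frag in sorted(kbuf, key=lambda f: f.depth, reverse=True):
        color, trans = blend_over(color, trans, frag)
    return tuple(c + trans * b for c, b in zip(color, background))
```

When a pixel has no more than k fragments the result is the exact sorted composite; with more fragments only the tail is approximated, and the nearest fragments still land on top in the right order.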
  • 27. From PPLL to k-Buffer For each pixel: Write frags to mem For each fragment in each pixel read fragment from mem update k-buffer (reg) blend tail fragment (reg) Read k-buffer from mem Sort and blend k-buffer (reg) update k-buffer (mem) blend tail fragment (mem)
  • 28. k-Buffer Screen Width Screen Height k 8 bytes each (depth and data) PPLL nodes were 12 bytes (depth, data, next) k = 4, 8, 16
  • 29. PPLL: 2nd Pass New Fragment Index of furthest Blend Tail Color Tail Fragment k-Buffer Registers
  • 30. k-Buffer in Memory: 1st Pass New Fragment Index of furthest Blend Tail Color Tail Fragment Mutex, index, … Blend Unit k-Buffer Memory
  • 31. Mutex/Count/Index Buffer Screen Width Screen Height Mutex Bit Initialized Bit Max Index (4 bits) Count (remainder) High bit 32 bits
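The two storage schemes can be compared with simple arithmetic. The Python sketch below uses the entry sizes from the slide (8 bytes per k-buffer entry, 12 bytes per PPLL node) and the "hair pixels x average overdraw" guesstimate from the memory-considerations slide; the function names are illustrative, and the per-pixel head and mutex buffers are deliberately left out.

```python
KBUFFER_ENTRY_BYTES = 8   # depth + data
PPLL_NODE_BYTES = 12      # depth + data + next


def kbuffer_bytes(screen_width, screen_height, k):
    """Fixed-size array: k entries for every pixel on screen."""
    return screen_width * screen_height * k * KBUFFER_ENTRY_BYTES


def ppll_bytes(hair_pixels, average_overdraw):
    """Node pool sized from an estimate of the total fragment count."""
    return int(hair_pixels * average_overdraw) * PPLL_NODE_BYTES


# Example at 1080p with k = 4, and a PPLL budget for 300k hair pixels
# with an assumed average overdraw of 10.
print(kbuffer_bytes(1920, 1080, 4), ppll_bytes(300_000, 10))
```

The k-buffer figure is a hard bound that depends only on resolution and k, while the PPLL figure is a guess that drops fragments if the scene exceeds it.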
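  • 32. Spinlock Mutex (callouts: Paranoia, Try, Do Work, Release)
       [allow_uav_condition]
       for(; i<MAX_LOOP_COUNT && !bStop; ++i)   // Paranoia: the spin is bounded
       {
           uint oldID;
           // Try: atomically write RESERVED and fetch the previous value
           InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID );
           if( (oldID & RESERVED) != RESERVED )   // lock was free, this thread now owns it
           {
               [[ … Do work ]]
               DeviceMemoryBarrier();
               // Release: overwrite the RESERVED value with the new max index
               tRWMutex[vScreenAddress] = (new_max_id<<28) + INITED;
               bStop = true;
           } // end mutex check
       } // end spinlock loop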
  • 33. Find New Max Depth ● Generally more memory traffic than PPLL
       uint new_max_depth = u_inDepth;
       [unroll]
       for(int t=0; t<KBUFFER_SIZE; t++)
       {
           uint element_depth = DEPTH( vScreenAddress, t );
           if( element_depth > new_max_depth )
           {
               new_max_depth = element_depth;
               new_max_id = t;
           }
       }
  • 34. Initialization: The first k Options ● Clear k-buffer fullscreen (0,1) ● Clear k-buffer stenciled, 3rd pass ● Clear on first fragment ● Count Mutex Bit Initialized Bit Max Index (4 bits) Count (remainder) High bit
  • 35. The first k
       InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount );
       [allow_uav_condition]
       if( oldCount < KBUFFER_SIZE )
       {
           DATA(vScreenAddress,oldCount) = u_inData;
           DEPTH(vScreenAddress,oldCount) = u_inDepth;
           return uint2(u_outDepth,u_outData);
       }
       Mutex Bit Initialized Bit Max Index (4 bits) Count (remainder) High bit
  • 36. Models 2k polygons ~20k hairs ~130k hairs Stats 2-3.5 M fragments 200-300k pixels Shading One point light & shadow 2 shifted specular lobes
  • 37. Depth Complexity Grey 1 Blue 8 Green 50 Red 100+
  • 38. Contention Max attempts per pixel, k=4 Dark Blue 1 Aqua <=4 Bright Aqua <=8
  • 39. Performance Time ratio to out-of-order blending ● Forward PPLL: 1.02 to 1.4 ● Forward k-Buffer: 1.2 to 1.4 ● Deferred PPLL: 0.7 to 0.9 ● Deferred k-Buffer: 0.9 to 1.6
  • 40. K-Buffer in Memory ● Simple memory bound ● Can be less memory ● Usually slower ● Increased memory traffic
  • 42. Hair Simulation ● Length Constraint ● Local Constraint ● Global Constraint ● Model Transform ● Collision Shapes ● External Forces (wind, gravity, etc.)
  • 43. Fur Simulation ● Length Constraint ● Local Constraint ● Global Constraint ● Model Transform ● Collision Shapes ● External Forces (wind, gravity, etc.)
  • 44. Grass Simulation ● Length Constraint ● Local Constraint (1D) ● Global Constraint ● Model Transform ● Collision Shapes ● External Forces (wind, gravity, etc.)
  • 45. Constraint Method (iterative) ● Used for length, local and global constraints ● Length is most difficult to converge ● particularly under large movement C0 C1 Cn-2 p0 p2 Pn-2 Pn-1
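The convergence problem mentioned here is easy to reproduce with a toy strand. The Python sketch below applies an iterative length constraint by projecting each segment back to its rest length in turn; pinning the root vertex and splitting the correction evenly between the other vertices are assumptions made for the example, not details taken from the talk.

```python
import math


def enforce_length(points, rest_lengths, iterations):
    """Iteratively project each segment of a strand back to its rest length."""
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i in range(len(pts) - 1):
            a, b = pts[i], pts[i + 1]
            delta = [b[j] - a[j] for j in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist == 0.0:
                continue  # degenerate segment, no direction to correct along
            stretch = (dist - rest_lengths[i]) / dist
            if i == 0:
                # Assumed: the root is attached to the surface and never moves.
                for j in range(3):
                    b[j] -= delta[j] * stretch
            else:
                # Share the correction between both ends of the segment.
                for j in range(3):
                    a[j] += 0.5 * delta[j] * stretch
                    b[j] -= 0.5 * delta[j] * stretch
    return pts


def segment_lengths(pts):
    return [math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1)]
```

Fixing one segment disturbs its neighbour, so a strand that has been stretched by a large movement needs many passes before every segment is close to its rest length.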
  • 46. Tridiagonal Matrix Formulation ● Direct solve for length constraint ● Almost zero stretch ● Limited to smaller time steps (stability) ● Still cheap ● Leverages matrix structure of strands ● Two sweeps of strand
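A tridiagonal system can be solved directly with one forward sweep and one backward sweep (the Thomas algorithm), which is presumably what "two sweeps of strand" refers to. The Python sketch below is the generic solver only; how the hair length constraints are assembled into the matrix is described in the cited paper and is not reproduced here. The argument layout (`lower[0]` and `upper[-1]` unused) is a convention chosen for this example.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve A x = rhs where A is tridiagonal.

    lower[i] is A[i][i-1] (lower[0] is ignored),
    diag[i]  is A[i][i],
    upper[i] is A[i][i+1] (upper[-1] is ignored).
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    # Sweep 1: forward elimination along the strand.
    c[0] = upper[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Sweep 2: back substitution in the opposite direction.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The cost is linear in the number of strand vertices, which fits the slide's "still cheap" remark, and the solver does no pivoting, so it relies on a well-conditioned system.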
  • 47. Tridiagonal Matrix Formulation “Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation”, VRIPHYS, 2013
  • 48. Demos
  • 49. Summary ● Next-gen look is possible now! ● Deferred Rendering for shading LOD is fastest ● k-buffer in memory is an option for memory-constrained situations ● High-quality grass and fur simulation with compute Upcoming TressFX 2 SDK sample update with fur scenario at http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
  • 52. Isoline Tessellation for hair/fur? 1/2 ● Isoline tessellation has two tess factors ● First is line density (lines per invocation) ● Second is line detail (segments per line) ● In theory provides easy LOD system ● Variable line density and detail by increasing both tessellation factors based on distance Tess = (1,1) Tess = (2,1) Tess = (2,2) Tess = (2,3) Tess = (3,3)
  • 53. Isoline Tessellation for hair/fur? 2/2 ● In practice isoline tessellation is not cost effective for this scenario ● Lines are always 1-pixel thick ● Need GS to extrude them into triangles for smooth edges ● Major impact on performance! ● Alternative is to enable MSAA ● Most engines are deferred so this causes a large performance impact ● No extrusion for smoothing edges and no MSAA = poor quality! ● Bottom line: a pure Vertex Shader solution is faster ● LOD benefit is easily done in VS (more on this later) ● Curvature is rarely a problem (dependent on vertices/strands at authoring time)
  • 54. AA, Self-shadowing and Transparency Basic Rendering Antialiasing Antialiasing + Self Shadowing Antialiasing + Self Shadowing + Transparency

Editor's Notes

  1. NEW GIRL
  2. Easy LOD: can easily be done with tessellation