SlideShare a Scribd company logo
1 of 34
Shader Model 5.0 and
Compute Shader


Nick Thibieroz, AMD
DX11 Basics
» New API from Microsoft
» Will be released alongside Windows 7
  »   Runs on Vista as well
» Supports downlevel hardware
  »   DX9, DX10, DX11-class HW supported
  »   Exposed features depend on GPU
» Allows the use of the same API for
  multiple generations of GPUs
  »   However Vista/Windows7 required
» Lots of new features…
Shader Model 5.0
SM5.0 Basics
» All shader types support Shader Model 5.0
  »   Vertex Shader
  »   Hull Shader
  »   Domain Shader
  »   Geometry Shader
  »   Pixel Shader
» Some instructions/declarations/system
  values are shader-specific
» Pull Model
» Shader subroutines
Uniform Indexing
» Can now index resource inputs
  »   Buffer and Texture resources
  »   Constant buffers
  »   Texture samplers
» Indexing occurs on the slot number
  »   E.g. Indexing of multiple texture arrays
  »   E.g. indexing across constant buffer slots
» Index must be a constant expression
Texture2D txDiffuse[2] : register(t0);
Texture2D txDiffuse1   : register(t1);
static uint Indices[4] = { 4, 3, 2, 1 };
float4 PS(PS_INPUT i) : SV_Target
{
  float4 color=txDiffuse[Indices[3]].Sample(sam, i.Tex);
  // float4 color=txDiffuse1.Sample(sam, i.Tex);
}
SV_Coverage
» System value available to PS stage only
» Bit field indicating the samples covered by
  the current primitive
  »   E.g. a value of 0x09 (1001b) indicates that
      sample 0 and 3 are covered by the primitive


» Easy way to detect MSAA edges for per-
  pixel/per-sample processing optimizations
  »   E.g. for MSAA 4x:
  »   bIsEdge=(uCovMask!=0x0F && uCovMask!=0);
Double Precision
» Double precision optionally supported
    »   IEEE 754 format with full precision (0.5 ULP)
    »   Mostly used for applications requiring a high
        amount of precision
    »   Denormalized values support
» Slower performance than single precision!
» Check for support:
D3D11_FEATURE_DATA_DOUBLES fdDoubleSupport;
pDev->CheckFeatureSupport( D3D11_FEATURE_DOUBLES,
                           &fdDoubleSupport,
                           sizeof(fdDoubleSupport) );
if (fdDoubleSupport.DoublePrecisionFloatShaderOps)
{
    // Double precision floating-point supported!
}
Gather()
» Fetches 4 point-sampled values in a single
  texture instruction
» Allows reduction of texture processing
      Better/faster shadow kernels
  »
                                            W Z
  »   Optimized SSAO implementations
» SM 5.0 Gather() more flexible             X Y
  »   Channel selection now supported
  »   Offset support (-32..31 range) for Texture2D
  »   Depth compare version e.g. for shadow mapping
                Gather[Cmp]Red()
                 Gather[Cmp]Green()
                 Gather[Cmp]Blue()
                 Gather[Cmp]Alpha()
Coarse Partial Derivatives
» ddx()/ddy() supplemented by coarse
  version
  »   ddx_coarse()
  »   ddy_coarse()
» Return same derivatives for whole 2x2 quad
  »   Actual derivatives used are IHV-specific
» Faster than “fine” version
  »   Trading quality for performance

                       ddx_coarse(      ) ==
                       ddx_coarse(      ) ==
                       ddx_coarse(      ) ==
                       ddx_coarse(      )

               Same principle applies to ddy_coarse()
Other Instructions
» FP32 to/from FP16 conversion
  »   uint f32tof16(float value);
  »   float f16tof32(uint value);
  »   fp16 stored in low 16 bits of uint
» Bit manipulation
  »   Returns the first occurrence of a set bit
      »   int firstbithigh(int value);
      »   int firstbitlow(int value);
  »   Reverse bit ordering
      »   uint reversebits(uint value);
  »   Useful for packing/compression code
  »   And more…
Unordered Access Views
» New view available in Shader Model 5.0
» UAVs allow binding of resources for arbitrary
  (unordered) read or write operations
  »   Supported in PS 5.0 and CS 5.0
» Applications
  »   Scatter operations
  »   Order-Independent Transparency
  »   Data binning operations
» Pixel Shader limited to 8 RTVs+UAVs total
  »   OMSetRenderTargetsAndUnorderedAccessViews()
» Compute Shader limited to 8 UAVs
  »   CSSetUnorderedAccessViews()
Raw Buffer Views
» New Buffer and View creation flag in SM 5.0
  »   Allows a buffer to be viewed as array of typeless
      32-bit aligned values
      »   Exception: Structured Buffers
  »   Buffer must be created with flag
      D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS
  »   Can be bound as SRV or UAV
      »   SRV: need D3D11_BUFFEREX_SRV_FLAG_RAW flag
      »   UAV: need D3D11_BUFFER_UAV_FLAG_RAW flag
ByteAddressBuffer   MyInputRawBuffer;     // SRV
RWByteAddressBuffer MyOutputRawBuffer;    // UAV

float4 MyPS(PSINPUT input) : COLOR
{
  uint u32BitData;
  u32BitData = MyInputRawBuffer.Load(input.index);// Read from SRV
  MyOutputRawBuffer.Store(input.index, u32BitData);// Write to UAV
  // Rest of code ...
}
Structured Buffers
» New Buffer creation flag in SM 5.0
  »   Ability to read or write a data structure at a
      specified index in a Buffer
  »   Resource must be created with flag
      D3D11_RESOURCE_MISC_BUFFER_STRUCTURED
  »   Can be bound as SRV or UAV
struct MyStruct
{
    float4 vValue1;
    uint   uBitField;
};
StructuredBuffer<MyStruct>   MyInputBuffer;    // SRV
RWStructuredBuffer<MyStruct> MyOutputBuffer;   // UAV

float4 MyPS(PSINPUT input) : COLOR
{
  MyStruct StructElement;
  StructElement = MyInputBuffer[input.index]; // Read from SRV
  MyOutputBuffer[input.index] = StructElement; // Write to UAV
  // Rest of code ...
}
Buffer Append/Consume
» Append Buffer allows new data to be written
  at the end of the buffer
  »   Raw and Structured Buffers only
  »   Useful for building lists, stacks, etc.
» Declaration
      Append[ByteAddress/Structured]Buffer MyAppendBuf;

» Access to write counter (Raw Buffer only)
      uint uCounter = MyRawAppendBuf.IncrementCounter();

» Append data to buffer
      MyRawAppendBuf.Store(uWriteCounter, value);
      MyStructuredAppendBuf.Append(StructElement);

» Can specify counters’ start offset
» Similar API for Consume and reading back a
  buffer
Atomic Operations
» PS and CS support atomic operations
  »   Can be used when multiple threads try to modify
      the same data location (UAV or TLS)
  » Avoid contention
  InterlockedAdd
  InterlockedAnd/InterlockedOr/InterlockedXor
  InterlockedCompareExchange
  InterlockedCompareStore
  InterlockedExchange
  InterlockedMax/InterlockedMin
» Can optionally return original value
» Potential cost in performance
  »   Especially if original value is required
  »   More latency hiding required
Compute Shader
Compute Shader Intro
» A new programmable shader stage in DX11
  »   Independent of the graphic pipeline
» New industry standard for GPGPU
  applications
» CS enables general processing operations
  »   Post-processing
  »   Video filtering
  »   Sorting/Binning
  »   Setting up resources for rendering
  »   Etc.
» Not limited to graphic applications
  »   E.g. AI, pathfinding, physics, compression…
CS 5.0 Features
» Supports Shader Model 5.0 instructions
» Texture sampling and filtering instructions
  »   Explicit derivatives required
» Execution not limited to fixed input/output
» Thread model execution
  »   Full control on the number of times the CS runs
» Read/write access to “on-cache” memory
  »   Thread Local Storage (TLS)
  »   Shared between threads
  »   Synchronization support
» Random access writes
  »   At last!  Enables new possibilities (scattering)
CS Threads
» A thread is the basic CS processing element
» CS declares the number of threads to
  operate on (the “thread group”)
  »   [numthreads(X, Y, Z)]                CS 5.0
      void MyCS(…)                       X*Y*Z<=1024
» To kick off CS execution:              Z<=64
  »   pDev11->Dispatch( nX, nY, nZ );
  »   nX, nY, nZ: number of thread groups to execute
» Number of thread groups can be written
  out to a Buffer as pre-pass
  »   pDev11->DispatchIndirect(LPRESOURCE
      *hBGroupDimensions, DWORD dwOffsetBytes);
  »   Useful for conditional execution
CS Threads & Groups
» pDev11->Dispatch(3, 2, 1);
» [numthreads(4, 4, 1)]
  void MyCS(…)
» Total threads = 3*2*4*4 = 96
CS Parameter Inputs
» pDev11->Dispatch(nX, nY, nZ);
» [numthreads(X, Y, Z)]
  void MyCS(
      uint3 groupID:                SV_GroupID,
      uint3 groupThreadID:          SV_GroupThreadID,
      uint3 dispatchThreadID:       SV_DispatchThreadID,
      uint groupIndex:              SV_GroupIndex);
» groupID.xyz: group offsets from Dispatch()
»   groupID.xyz   є   (0..nX-1, 0..nY-1, 0..nZ-1);
»   Constant within a CS thread group invocation
» groupThreadID.xyz: thread ID in group
»   groupThreadID.xyz    є   (0..X-1, 0..Y-1, 0..Z-1);
»   Independent of Dispatch() parameters
» dispatchThreadID.xyz: global thread offset
»   = groupID.xyz*(X,Y,Z) + groupThreadID.xyz
» groupIndex: flattened version of groupThreadID
CS Bandwidth Advantage
» Memory bandwidth often still a bottleneck
  »   Post-processing, compression, etc.
» Fullscreen filters often require input pixels
  to be fetched multiple times!
  »   Depth of Field, SSAO, Blur, etc.
  »   BW usage depends on TEX cache and kernel size
» TLS allows reduction in BW requirements
» Typical usage model
  »   Each thread reads data from input resource
  »   …and write it into TLS group data
  »   Synchronize threads
  »   Read back and process TLS group data
Thread Local Storage
» Shared between threads
» Read/write access at any location
» Declared in the shader
  »   groupshared float4 vCacheMemory[1024];
» Limited to 32 KB
» Need synchronization before reading back
  data written by other threads
  »   To ensure all threads have finished writing
  »   GroupMemoryBarrier();
  »   GroupMemoryBarrierWithGroupSync();
CS 4.X
» Compute Shader supported on DX10(.1) HW
  »   CS 4.0 on DX10 HW, CS 4.1 on DX10.1 HW
» Useful for prototyping CS on HW device
  before DX11 GPUs become available
» Drivers available from ATI and NVIDIA
» Major differences compared to CS5.0
  »   Max number of threads is 768 total
  »   Dispatch Zn==1 & no DispatchIndirect() support
  »   TLS size is 16 KB
  »   Thread can only write to its own offset in TLS
  »   Atomic operations not supported
  »   Only one UAV can be bound
  »   Only writable resource is Buffer type
PS 5.0 vs CS 5.0
 Example: Gaussian Blur
» Comparison between a PS 5.0 and CS5.0
  implementation of Gaussian Blur
» Two-pass Gaussian Blur
  »   High cost in texture instructions and bandwidth


» Can the compute shader perform better?
Gaussian Blur PS
» Separable filter Horizontal/Vertical pass
  »   Using kernel size of x*y
» For each pixel of each line:
  »   Fetch x texels in a horizontal segment           x
  »   Write H-blurred output pixel in RT:     BH            Gi Pi
» For each pixel of each column:                      i 1

  »   Fetch y texels in a vertical segment from RT
                                              y
  »   Write fully blurred output pixel:   B          Gi Pi
» Problems:                                    i 1

  »   Texels of source texture are read multiple times
  »   This will lead to cache trashing if kernel is large
  »   Also leads to many texture instructions used!
Gaussian Blur PS
Horizontal Pass
           Source texture




              Temp RT
Gaussian Blur PS
Vertical Pass
        Source texture (temp RT)




            Destination RT
Gaussian Blur CS – HP(1)
groupshared float4 HorizontalLine[WIDTH];             // TLS
Texture2D txInput;              // Input texture to read from
RWTexture2D<float4> OutputTexture;            // Tmp output
[numthreads(WIDTH,1,1)]
void GausBlurHoriz(uint3 groupID: SV_GroupID,
       pDevContext->Dispatch(1,HEIGHT,1);
                   uint3 groupThreadID: SV_GroupThreadID)
{
    // Fetch color from input texture
                [numthreads(WIDTH,1,1)]
        Dispatch(1,HEIGHT,1);


    float4 vColor=txInput[int2(groupThreadID.x,groupID.y)];
    // Store it into TLS
    HorizontalLine[groupThreadID.x]=vColor;
    // Synchronize threads
    GroupMemoryBarrierWithGroupSync();


    // Continued on next slide
Gaussian Blur CS – HP(2)
    // Compute horizontal Gaussian blur for each pixel
    vColor = float4(0,0,0,0);
    [unroll]for (int i=-GS2; i<=GS2; i++)
    {
        // Determine offset of pixel to fetch
        int nOffset = groupThreadID.x + i;
        // Clamp offset
        nOffset = clamp(nOffset, 0, WIDTH-1);
        // Add color for pixels within horizontal filter
        vColor += G[GS2+i] * HorizontalLine[nOffset];
    }

    // Store result
    OutputTexture[int2(groupThreadID.x,groupID.y)]=vColor;
}
Gaussian Blur BW:PS                                    vs      CS
» Pixel Shader
  »   # of reads per source pixel: 7 (H) + 7 (V) = 14
  »   # of writes per source pixel: 1 (H) + 1 (V) = 2
  »   Total number of memory operations per pixel: 16
  »   For a 1024x1024 RGBA8 source texture this is 64
      MBytes worth of data transfer
      »   Texture cache will reduce this number
      »   But become less effective as the kernel gets larger

» Compute Shader
  »   # of reads per source pixel: 1 (H) + 1 (V) = 2
  »   # of writes per source pixel: 1 (H) + 1 (V) = 2
  »   Total number of memory operations per pixel: 4
  »   For a 1024x1024 RGBA8 source texture this is 16
      MBytes worth of data transfer
Conclusion
» New Shader Model 5.0 feature set
  extensively powerful
  »   New instructions
  »   Double precision support
  »   Scattering support through UAVs
» Compute Shader
  »   No longer limited to graphic applications
  »   TLS memory allows considerable
      performance savings
» DX11 SDK available for prototyping
  »   Ask your IHV for a CS4.X-enabled driver
  »   REF driver for full SM 5.0 support
Questions?




   nicolas.thibieroz@amd.com

More Related Content

What's hot

Cascade Shadow Mapping
Cascade Shadow MappingCascade Shadow Mapping
Cascade Shadow MappingSukwoo Lee
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Graham Wihlidal
 
Shadow mapping 정리
Shadow mapping 정리Shadow mapping 정리
Shadow mapping 정리changehee lee
 
Implements Cascaded Shadow Maps with using Texture Array
Implements Cascaded Shadow Maps with using Texture ArrayImplements Cascaded Shadow Maps with using Texture Array
Implements Cascaded Shadow Maps with using Texture ArrayYEONG-CHEON YOU
 
Lighting Shading by John Hable
Lighting Shading by John HableLighting Shading by John Hable
Lighting Shading by John HableNaughty Dog
 
Compute shader
Compute shaderCompute shader
Compute shaderQooJuice
 
Direct x 11 입문
Direct x 11 입문Direct x 11 입문
Direct x 11 입문Jin Woo Lee
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility bufferWolfgang Engel
 
Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Philip Hammer
 
Masked Software Occlusion Culling
Masked Software Occlusion CullingMasked Software Occlusion Culling
Masked Software Occlusion CullingIntel® Software
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14AMD Developer Central
 
Technical Deep Dive into the New Prefab System
Technical Deep Dive into the New Prefab SystemTechnical Deep Dive into the New Prefab System
Technical Deep Dive into the New Prefab SystemUnity Technologies
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteElectronic Arts / DICE
 
[데브루키/141206 박민근] 유니티 최적화 테크닉 총정리
[데브루키/141206 박민근] 유니티 최적화 테크닉 총정리[데브루키/141206 박민근] 유니티 최적화 테크닉 총정리
[데브루키/141206 박민근] 유니티 최적화 테크닉 총정리MinGeun Park
 
IndirectDraw with unity
IndirectDraw with unityIndirectDraw with unity
IndirectDraw with unityJung Suk Ko
 
구세대 엔진 신데렐라 만들기 최종본 유트브2
구세대 엔진 신데렐라 만들기 최종본 유트브2구세대 엔진 신데렐라 만들기 최종본 유트브2
구세대 엔진 신데렐라 만들기 최종본 유트브2Kyoung Seok(경석) Ko(고)
 
[0806 박민근] 림 라이팅(rim lighting)
[0806 박민근] 림 라이팅(rim lighting)[0806 박민근] 림 라이팅(rim lighting)
[0806 박민근] 림 라이팅(rim lighting)MinGeun Park
 

What's hot (20)

Cascade Shadow Mapping
Cascade Shadow MappingCascade Shadow Mapping
Cascade Shadow Mapping
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
 
Shadow mapping 정리
Shadow mapping 정리Shadow mapping 정리
Shadow mapping 정리
 
Implements Cascaded Shadow Maps with using Texture Array
Implements Cascaded Shadow Maps with using Texture ArrayImplements Cascaded Shadow Maps with using Texture Array
Implements Cascaded Shadow Maps with using Texture Array
 
Lighting Shading by John Hable
Lighting Shading by John HableLighting Shading by John Hable
Lighting Shading by John Hable
 
Compute shader
Compute shaderCompute shader
Compute shader
 
Direct x 11 입문
Direct x 11 입문Direct x 11 입문
Direct x 11 입문
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility buffer
 
Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2
 
Masked Software Occlusion Culling
Masked Software Occlusion CullingMasked Software Occlusion Culling
Masked Software Occlusion Culling
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
 
Technical Deep Dive into the New Prefab System
Technical Deep Dive into the New Prefab SystemTechnical Deep Dive into the New Prefab System
Technical Deep Dive into the New Prefab System
 
DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in Frostbite
 
Ssao
SsaoSsao
Ssao
 
[데브루키/141206 박민근] 유니티 최적화 테크닉 총정리
[데브루키/141206 박민근] 유니티 최적화 테크닉 총정리[데브루키/141206 박민근] 유니티 최적화 테크닉 총정리
[데브루키/141206 박민근] 유니티 최적화 테크닉 총정리
 
IndirectDraw with unity
IndirectDraw with unityIndirectDraw with unity
IndirectDraw with unity
 
구세대 엔진 신데렐라 만들기 최종본 유트브2
구세대 엔진 신데렐라 만들기 최종본 유트브2구세대 엔진 신데렐라 만들기 최종본 유트브2
구세대 엔진 신데렐라 만들기 최종본 유트브2
 
[0806 박민근] 림 라이팅(rim lighting)
[0806 박민근] 림 라이팅(rim lighting)[0806 박민근] 림 라이팅(rim lighting)
[0806 박민근] 림 라이팅(rim lighting)
 
Voxelizaition with GPU
Voxelizaition with GPUVoxelizaition with GPU
Voxelizaition with GPU
 

Viewers also liked

GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauAMD Developer Central
 
CS 354 Project 2 and Compression
CS 354 Project 2 and CompressionCS 354 Project 2 and Compression
CS 354 Project 2 and CompressionMark Kilgard
 
Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Johan Andersson
 
An Introduction to Writing Custom Unity Shaders!
An Introduction to Writing  Custom Unity Shaders!An Introduction to Writing  Custom Unity Shaders!
An Introduction to Writing Custom Unity Shaders!DevGAMM Conference
 
Shader Programming With Unity
Shader Programming With UnityShader Programming With Unity
Shader Programming With UnityMindstorm Studios
 
Geometry Shader-based Bump Mapping Setup
Geometry Shader-based Bump Mapping SetupGeometry Shader-based Bump Mapping Setup
Geometry Shader-based Bump Mapping SetupMark Kilgard
 
Game Programming 12 - Shaders
Game Programming 12 - ShadersGame Programming 12 - Shaders
Game Programming 12 - ShadersNick Pruehs
 
Shaders - Claudia Doppioslash - Unity With the Best
Shaders - Claudia Doppioslash - Unity With the BestShaders - Claudia Doppioslash - Unity With the Best
Shaders - Claudia Doppioslash - Unity With the BestBeMyApp
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Johan Andersson
 
Play RICOH THETA 360 Videos in Unity Shanyuan Teng
Play RICOH THETA 360 Videos in Unity Shanyuan TengPlay RICOH THETA 360 Videos in Unity Shanyuan Teng
Play RICOH THETA 360 Videos in Unity Shanyuan TengTHETA Unofficial Guide
 
Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)Alexander Dolbilov
 
ApresentaçAo De Tcc Modelo
ApresentaçAo De Tcc ModeloApresentaçAo De Tcc Modelo
ApresentaçAo De Tcc ModeloLindaeloka
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsLinkedIn
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesNed Potter
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging ChallengesAaron Irizarry
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with DataSeth Familian
 
3 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 20173 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 2017Drift
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheLeslie Samuel
 

Viewers also liked (18)

GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
 
CS 354 Project 2 and Compression
CS 354 Project 2 and CompressionCS 354 Project 2 and Compression
CS 354 Project 2 and Compression
 
Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!
 
An Introduction to Writing Custom Unity Shaders!
An Introduction to Writing  Custom Unity Shaders!An Introduction to Writing  Custom Unity Shaders!
An Introduction to Writing Custom Unity Shaders!
 
Shader Programming With Unity
Shader Programming With UnityShader Programming With Unity
Shader Programming With Unity
 
Geometry Shader-based Bump Mapping Setup
Geometry Shader-based Bump Mapping SetupGeometry Shader-based Bump Mapping Setup
Geometry Shader-based Bump Mapping Setup
 
Game Programming 12 - Shaders
Game Programming 12 - ShadersGame Programming 12 - Shaders
Game Programming 12 - Shaders
 
Shaders - Claudia Doppioslash - Unity With the Best
Shaders - Claudia Doppioslash - Unity With the BestShaders - Claudia Doppioslash - Unity With the Best
Shaders - Claudia Doppioslash - Unity With the Best
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
 
Play RICOH THETA 360 Videos in Unity Shanyuan Teng
Play RICOH THETA 360 Videos in Unity Shanyuan TengPlay RICOH THETA 360 Videos in Unity Shanyuan Teng
Play RICOH THETA 360 Videos in Unity Shanyuan Teng
 
Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)
 
ApresentaçAo De Tcc Modelo
ApresentaçAo De Tcc ModeloApresentaçAo De Tcc Modelo
ApresentaçAo De Tcc Modelo
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving Cars
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and Archives
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging Challenges
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with Data
 
3 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 20173 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 2017
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 

Similar to Compute Shader and Shader Model 5.0 Features

An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAnirudhGarg35
 
Gpu computing workshop
Gpu computing workshopGpu computing workshop
Gpu computing workshopdatastack
 
Gpu programming with java
Gpu programming with javaGpu programming with java
Gpu programming with javaGary Sieling
 
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach ShoolmanRedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach ShoolmanRedis Labs
 
D3 D10 Unleashed New Features And Effects
D3 D10 Unleashed   New Features And EffectsD3 D10 Unleashed   New Features And Effects
D3 D10 Unleashed New Features And EffectsThomas Goddard
 
Lecture 6 Kernel Debugging + Ports Development
Lecture 6 Kernel Debugging + Ports DevelopmentLecture 6 Kernel Debugging + Ports Development
Lecture 6 Kernel Debugging + Ports DevelopmentMohammed Farrag
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)mistercteam
 
Example uses of gpu compute models
Example uses of gpu compute modelsExample uses of gpu compute models
Example uses of gpu compute modelsPedram Mazloom
 
Minko stage3d workshop_20130525
Minko stage3d workshop_20130525Minko stage3d workshop_20130525
Minko stage3d workshop_20130525Minko3D
 
02 direct3 d_pipeline
02 direct3 d_pipeline02 direct3 d_pipeline
02 direct3 d_pipelineGirish Ghate
 
2011.05.27 ACC 기술세미나 : Adobe Flash Builder 4.5를 환경에서 Molehill 3D를 이용한 개발 소개
2011.05.27 ACC 기술세미나 : Adobe Flash Builder 4.5를 환경에서 Molehill 3D를 이용한 개발 소개2011.05.27 ACC 기술세미나 : Adobe Flash Builder 4.5를 환경에서 Molehill 3D를 이용한 개발 소개
2011.05.27 ACC 기술세미나 : Adobe Flash Builder 4.5를 환경에서 Molehill 3D를 이용한 개발 소개Yongho Ji
 
SMB3 Offload Data Transfer (ODX)
SMB3 Offload Data Transfer (ODX)SMB3 Offload Data Transfer (ODX)
SMB3 Offload Data Transfer (ODX)gordonross
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortStefan Marr
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle CoherenceBen Stopford
 
Supercharge your Analytics with ClickHouse, v.2. By Vadim Tkachenko
Supercharge your Analytics with ClickHouse, v.2. By Vadim TkachenkoSupercharge your Analytics with ClickHouse, v.2. By Vadim Tkachenko
Supercharge your Analytics with ClickHouse, v.2. By Vadim TkachenkoAltinity Ltd
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -evechiportal
 

Similar to Compute Shader and Shader Model 5.0 Features (20)

An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
Gpu computing workshop
Gpu computing workshopGpu computing workshop
Gpu computing workshop
 
Gpu programming with java
Gpu programming with javaGpu programming with java
Gpu programming with java
 
GPU Computing with CUDA
GPU Computing with CUDAGPU Computing with CUDA
GPU Computing with CUDA
 
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach ShoolmanRedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman
 
D3 D10 Unleashed New Features And Effects
D3 D10 Unleashed   New Features And EffectsD3 D10 Unleashed   New Features And Effects
D3 D10 Unleashed New Features And Effects
 
Lecture 6 Kernel Debugging + Ports Development
Lecture 6 Kernel Debugging + Ports DevelopmentLecture 6 Kernel Debugging + Ports Development
Lecture 6 Kernel Debugging + Ports Development
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)
 
Example uses of gpu compute models
Example uses of gpu compute modelsExample uses of gpu compute models
Example uses of gpu compute models
 
Minko stage3d workshop_20130525
Minko stage3d workshop_20130525Minko stage3d workshop_20130525
Minko stage3d workshop_20130525
 
NvFX GTC 2013
NvFX GTC 2013NvFX GTC 2013
NvFX GTC 2013
 
02 direct3 d_pipeline
02 direct3 d_pipeline02 direct3 d_pipeline
02 direct3 d_pipeline
 
2011.05.27 ACC 기술세미나 : Adobe Flash Builder 4.5를 환경에서 Molehill 3D를 이용한 개발 소개
2011.05.27 ACC 기술세미나 : Adobe Flash Builder 4.5를 환경에서 Molehill 3D를 이용한 개발 소개2011.05.27 ACC 기술세미나 : Adobe Flash Builder 4.5를 환경에서 Molehill 3D를 이용한 개발 소개
2011.05.27 ACC 기술세미나 : Adobe Flash Builder 4.5를 환경에서 Molehill 3D를 이용한 개발 소개
 
SMB3 Offload Data Transfer (ODX)
SMB3 Offload Data Transfer (ODX)SMB3 Offload Data Transfer (ODX)
SMB3 Offload Data Transfer (ODX)
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low Effort
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle Coherence
 
Supercharge your Analytics with ClickHouse, v.2. By Vadim Tkachenko
Supercharge your Analytics with ClickHouse, v.2. By Vadim TkachenkoSupercharge your Analytics with ClickHouse, v.2. By Vadim Tkachenko
Supercharge your Analytics with ClickHouse, v.2. By Vadim Tkachenko
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eve
 

Recently uploaded

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Compute Shader and Shader Model 5.0 Features

  • 1.
  • 2. Shader Model 5.0 and Compute Shader Nick Thibieroz, AMD
  • 3. DX11 Basics » New API from Microsoft » Will be released alongside Windows 7 » Runs on Vista as well » Supports downlevel hardware » DX9, DX10, DX11-class HW supported » Exposed features depend on GPU » Allows the use of the same API for multiple generations of GPUs » However Vista/Windows7 required » Lots of new features…
  • 5. SM5.0 Basics » All shader types support Shader Model 5.0 » Vertex Shader » Hull Shader » Domain Shader » Geometry Shader » Pixel Shader » Some instructions/declarations/system values are shader-specific » Pull Model » Shader subroutines
  • 6. Uniform Indexing » Can now index resource inputs » Buffer and Texture resources » Constant buffers » Texture samplers » Indexing occurs on the slot number » E.g. Indexing of multiple texture arrays » E.g. indexing across constant buffer slots » Index must be a constant expression Texture2D txDiffuse[2] : register(t0); Texture2D txDiffuse1 : register(t1); static uint Indices[4] = { 4, 3, 2, 1 }; float4 PS(PS_INPUT i) : SV_Target { float4 color=txDiffuse[Indices[3]].Sample(sam, i.Tex); // float4 color=txDiffuse1.Sample(sam, i.Tex); }
  • 7. SV_Coverage » System value available to PS stage only » Bit field indicating the samples covered by the current primitive » E.g. a value of 0x09 (1001b) indicates that sample 0 and 3 are covered by the primitive » Easy way to detect MSAA edges for per- pixel/per-sample processing optimizations » E.g. for MSAA 4x: » bIsEdge=(uCovMask!=0x0F && uCovMask!=0);
  • 8. Double Precision » Double precision optionally supported » IEEE 754 format with full precision (0.5 ULP) » Mostly used for applications requiring a high amount of precision » Denormalized values support » Slower performance than single precision! » Check for support: D3D11_FEATURE_DATA_DOUBLES fdDoubleSupport; pDev->CheckFeatureSupport( D3D11_FEATURE_DOUBLES, &fdDoubleSupport, sizeof(fdDoubleSupport) ); if (fdDoubleSupport.DoublePrecisionFloatShaderOps) { // Double precision floating-point supported! }
  • 9. Gather() » Fetches 4 point-sampled values in a single texture instruction » Allows reduction of texture processing Better/faster shadow kernels » W Z » Optimized SSAO implementations » SM 5.0 Gather() more flexible X Y » Channel selection now supported » Offset support (-32..31 range) for Texture2D » Depth compare version e.g. for shadow mapping Gather[Cmp]Red() Gather[Cmp]Green() Gather[Cmp]Blue() Gather[Cmp]Alpha()
  • 10. Coarse Partial Derivatives » ddx()/ddy() supplemented by coarse version » ddx_coarse() » ddy_coarse() » Return same derivatives for whole 2x2 quad » Actual derivatives used are IHV-specific » Faster than “fine” version » Trading quality for performance ddx_coarse( ) == ddx_coarse( ) == ddx_coarse( ) == ddx_coarse( ) Same principle applies to ddy_coarse()
  • 11. Other Instructions » FP32 to/from FP16 conversion » uint f32tof16(float value); » float f16tof32(uint value); » fp16 stored in low 16 bits of uint » Bit manipulation » Returns the first occurrence of a set bit » int firstbithigh(int value); » int firstbitlow(int value); » Reverse bit ordering » uint reversebits(uint value); » Useful for packing/compression code » And more…
  • 12. Unordered Access Views » New view available in Shader Model 5.0 » UAVs allow binding of resources for arbitrary (unordered) read or write operations » Supported in PS 5.0 and CS 5.0 » Applications » Scatter operations » Order-Independent Transparency » Data binning operations » Pixel Shader limited to 8 RTVs+UAVs total » OMSetRenderTargetsAndUnorderedAccessViews() » Compute Shader limited to 8 UAVs » CSSetUnorderedAccessViews()
  • 13. Raw Buffer Views » New Buffer and View creation flag in SM 5.0 » Allows a buffer to be viewed as array of typeless 32-bit aligned values » Exception: Structured Buffers » Buffer must be created with flag D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS » Can be bound as SRV or UAV » SRV: need D3D11_BUFFEREX_SRV_FLAG_RAW flag » UAV: need D3D11_BUFFER_UAV_FLAG_RAW flag ByteAddressBuffer MyInputRawBuffer; // SRV RWByteAddressBuffer MyOutputRawBuffer; // UAV float4 MyPS(PSINPUT input) : COLOR { uint u32BitData; u32BitData = MyInputRawBuffer.Load(input.index);// Read from SRV MyOutputRawBuffer.Store(input.index, u32BitData);// Write to UAV // Rest of code ... }
  • 14. Structured Buffers » New Buffer creation flag in SM 5.0 » Ability to read or write a data structure at a specified index in a Buffer » Resource must be created with flag D3D11_RESOURCE_MISC_BUFFER_STRUCTURED » Can be bound as SRV or UAV struct MyStruct { float4 vValue1; uint uBitField; }; StructuredBuffer<MyStruct> MyInputBuffer; // SRV RWStructuredBuffer<MyStruct> MyOutputBuffer; // UAV float4 MyPS(PSINPUT input) : COLOR { MyStruct StructElement; StructElement = MyInputBuffer[input.index]; // Read from SRV MyOutputBuffer[input.index] = StructElement; // Write to UAV // Rest of code ... }
  • 15. Buffer Append/Consume » Append Buffer allows new data to be written at the end of the buffer » Raw and Structured Buffers only » Useful for building lists, stacks, etc. » Declaration Append[ByteAddress/Structured]Buffer MyAppendBuf; » Access to write counter (Raw Buffer only) uint uCounter = MyRawAppendBuf.IncrementCounter(); » Append data to buffer MyRawAppendBuf.Store(uWriteCounter, value); MyStructuredAppendBuf.Append(StructElement); » Can specify counters’ start offset » Similar API for Consume and reading back a buffer
  • 16. Atomic Operations » PS and CS support atomic operations » Can be used when multiple threads try to modify the same data location (UAV or TLS) » Avoid contention InterlockedAdd InterlockedAnd/InterlockedOr/InterlockedXor InterlockedCompareExchange InterlockedCompareStore InterlockedExchange InterlockedMax/InterlockedMin » Can optionally return original value » Potential cost in performance » Especially if original value is required » More latency hiding required
  • 18. Compute Shader Intro » A new programmable shader stage in DX11 » Independent of the graphic pipeline » New industry standard for GPGPU applications » CS enables general processing operations » Post-processing » Video filtering » Sorting/Binning » Setting up resources for rendering » Etc. » Not limited to graphic applications » E.g. AI, pathfinding, physics, compression…
  • 19. CS 5.0 Features » Supports Shader Model 5.0 instructions » Texture sampling and filtering instructions » Explicit derivatives required » Execution not limited to fixed input/output » Thread model execution » Full control on the number of times the CS runs » Read/write access to “on-cache” memory » Thread Local Storage (TLS) » Shared between threads » Synchronization support » Random access writes » At last!  Enables new possibilities (scattering)
  • 20. CS Threads » A thread is the basic CS processing element » CS declares the number of threads to operate on (the “thread group”) » [numthreads(X, Y, Z)] CS 5.0 void MyCS(…) X*Y*Z<=1024 » To kick off CS execution: Z<=64 » pDev11->Dispatch( nX, nY, nZ ); » nX, nY, nZ: number of thread groups to execute » Number of thread groups can be written out to a Buffer as pre-pass » pDev11->DispatchIndirect(LPRESOURCE *hBGroupDimensions, DWORD dwOffsetBytes); » Useful for conditional execution
  • 21. CS Threads & Groups » pDev11->Dispatch(3, 2, 1); » [numthreads(4, 4, 1)] void MyCS(…) » Total threads = 3*2*4*4 = 96
  • 22. CS Parameter Inputs » pDev11->Dispatch(nX, nY, nZ); » [numthreads(X, Y, Z)] void MyCS( uint3 groupID: SV_GroupID, uint3 groupThreadID: SV_GroupThreadID, uint3 dispatchThreadID: SV_DispatchThreadID, uint groupIndex: SV_GroupIndex); » groupID.xyz: group offsets from Dispatch() » groupID.xyz є (0..nX-1, 0..nY-1, 0..nZ-1); » Constant within a CS thread group invocation » groupThreadID.xyz: thread ID in group » groupThreadID.xyz є (0..X-1, 0..Y-1, 0..Z-1); » Independent of Dispatch() parameters » dispatchThreadID.xyz: global thread offset » = groupID.xyz*(X,Y,Z) + groupThreadID.xyz » groupIndex: flattened version of groupThreadID
  • 23. CS Bandwidth Advantage » Memory bandwidth often still a bottleneck » Post-processing, compression, etc. » Fullscreen filters often require input pixels to be fetched multiple times! » Depth of Field, SSAO, Blur, etc. » BW usage depends on TEX cache and kernel size » TLS allows reduction in BW requirements » Typical usage model » Each thread reads data from input resource » …and write it into TLS group data » Synchronize threads » Read back and process TLS group data
  • 24. Thread Local Storage » Shared between threads » Read/write access at any location » Declared in the shader » groupshared float4 vCacheMemory[1024]; » Limited to 32 KB » Need synchronization before reading back data written by other threads » To ensure all threads have finished writing » GroupMemoryBarrier(); » GroupMemoryBarrierWithGroupSync();
  • 25. CS 4.X » Compute Shader supported on DX10(.1) HW » CS 4.0 on DX10 HW, CS 4.1 on DX10.1 HW » Useful for prototyping CS on HW device before DX11 GPUs become available » Drivers available from ATI and NVIDIA » Major differences compared to CS5.0 » Max number of threads is 768 total » Dispatch Zn==1 & no DispatchIndirect() support » TLS size is 16 KB » Thread can only write to its own offset in TLS » Atomic operations not supported » Only one UAV can be bound » Only writable resource is Buffer type
  • 26. PS 5.0 vs CS 5.0 Example: Gaussian Blur » Comparison between a PS 5.0 and CS5.0 implementation of Gaussian Blur » Two-pass Gaussian Blur » High cost in texture instructions and bandwidth » Can the compute shader perform better?
  • 27. Gaussian Blur PS » Separable filter Horizontal/Vertical pass » Using kernel size of x*y » For each pixel of each line: » Fetch x texels in a horizontal segment x » Write H-blurred output pixel in RT: BH Gi Pi » For each pixel of each column: i 1 » Fetch y texels in a vertical segment from RT y » Write fully blurred output pixel: B Gi Pi » Problems: i 1 » Texels of source texture are read multiple times » This will lead to cache trashing if kernel is large » Also leads to many texture instructions used!
  • 28. Gaussian Blur PS Horizontal Pass Source texture Temp RT
  • 29. Gaussian Blur PS Vertical Pass Source texture (temp RT) Destination RT
  • 30. Gaussian Blur CS – HP(1) groupshared float4 HorizontalLine[WIDTH]; // TLS Texture2D txInput; // Input texture to read from RWTexture2D<float4> OutputTexture; // Tmp output [numthreads(WIDTH,1,1)] void GausBlurHoriz(uint3 groupID: SV_GroupID, pDevContext->Dispatch(1,HEIGHT,1); uint3 groupThreadID: SV_GroupThreadID) { // Fetch color from input texture [numthreads(WIDTH,1,1)] Dispatch(1,HEIGHT,1); float4 vColor=txInput[int2(groupThreadID.x,groupID.y)]; // Store it into TLS HorizontalLine[groupThreadID.x]=vColor; // Synchronize threads GroupMemoryBarrierWithGroupSync(); // Continued on next slide
  • 31. Gaussian Blur CS – HP(2) // Compute horizontal Gaussian blur for each pixel vColor = float4(0,0,0,0); [unroll]for (int i=-GS2; i<=GS2; i++) { // Determine offset of pixel to fetch int nOffset = groupThreadID.x + i; // Clamp offset nOffset = clamp(nOffset, 0, WIDTH-1); // Add color for pixels within horizontal filter vColor += G[GS2+i] * HorizontalLine[nOffset]; } // Store result OutputTexture[int2(groupThreadID.x,groupID.y)]=vColor; }
  • 32. Gaussian Blur BW:PS vs CS » Pixel Shader » # of reads per source pixel: 7 (H) + 7 (V) = 14 » # of writes per source pixel: 1 (H) + 1 (V) = 2 » Total number of memory operations per pixel: 16 » For a 1024x1024 RGBA8 source texture this is 64 MBytes worth of data transfer » Texture cache will reduce this number » But become less effective as the kernel gets larger » Compute Shader » # of reads per source pixel: 1 (H) + 1 (V) = 2 » # of writes per source pixel: 1 (H) + 1 (V) = 2 » Total number of memory operations per pixel: 4 » For a 1024x1024 RGBA8 source texture this is 16 MBytes worth of data transfer
  • 33. Conclusion » New Shader Model 5.0 feature set extensively powerful » New instructions » Double precision support » Scattering support through UAVs » Compute Shader » No longer limited to graphic applications » TLS memory allows considerable performance savings » DX11 SDK available for prototyping » Ask your IHV for a CS4.X-enabled driver » REF driver for full SM 5.0 support
  • 34. Questions? nicolas.thibieroz@amd.com