2. CS 354 2
Today’s material
In-class quiz
Lecture topic
Architecture of Graphics Processing Units (GPUs)
Course work
Homework #4 due today
Review textbook reading
Chapter 5, 6, and 7
Project #2 on texturing, shading, & lighting is coming
Remember: Midterm in-class on March 8
3. CS 354 3
My Office Hours
Tuesday, before class
Painter (PAI) 5.35
8:45 a.m. to 9:15
Thursday, after class
ACE 6.302
11:00 a.m. to 12
Randy’s office hours
Monday & Wednesday
11 a.m. to 12:00
Painter (PAI) 5.33
4. CS 354 4
Last time, this time
Last lecture, we discussed
Programmable shading
Graphics hardware shading languages
This lecture
How do GPUs work?
5. CS 354 5
Daily Quiz
On a sheet of paper
• Write your EID, name, and date
• Write #1, #2, #3, #4 followed by its answer
Pick the best choice: Shade trees are
a) fractal trees with shadows
b) OpenGL commands
c) hierarchical arrangements of shading computations
d) fractal patterns of all sorts
Multiple choice: The GLSL standard has built-in data types for
a) vectors
b) matrices
c) texture samplers
d) floating-point values
e) pointers to malloc’ed memory
f) a through e
g) a through d
Name one general purpose programming language that GLSL borrows from.
6. CS 354 6
Key Trend in OpenGL Evolution
[Diagram: progression from fixed-function to programmable]
Fixed-function: simple configurability, then complex configurability
Programmable: shaders, then high-level languages
Direct3D follows the same trend
Also reflects trend in GPU architecture
API and hardware co-evolving
7. CS 354 7
Programming Shaders inside GPU
Multiple programmable domains within the GPU
Can be programmed in high-level languages
Cg, HLSL, or OpenGL Shading Language (GLSL)
[Diagram: OpenGL 3.3 pipeline]
3D Application or Game → OpenGL API → CPU–GPU Boundary → GPU Front End → Vertex Assembly → Primitive Assembly → Clipping, Setup, and Rasterization → Raster Operations
Programmable stages: Vertex Shader, Geometry Program, Fragment Shader
Memory Interface paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access
Legend: programmable, fixed-function
9. CS 354 9
Six Years of GPU Architecture
Year | Product | New Features | OpenGL Version | Direct3D Version
2000 | GeForce 256 | Hardware transform & lighting, configurable fixed-point shading, cube maps, texture compression, anisotropic texture filtering | 1.3 | DX7
2001 | GeForce3 | Programmable vertex transformation, 4 texture units, dependent textures, 3D textures, shadow maps, multisampling, occlusion queries | 1.4 | DX8
2002 | GeForce4 Ti 4600 | Early Z culling, dual-monitor | 1.4 | DX8.1
2003 | GeForce FX | Vertex program branching, floating-point fragment programs, 16 texture units, limited floating-point textures, color and depth compression | 1.5 | DX9
2004 | GeForce 6800 Ultra | Vertex textures, structured fragment branching, non-power-of-two textures, generalized floating-point textures, floating-point texture filtering and blending | 2.0 | DX9c
2005 | GeForce 7800 GTX | Transparency antialiasing | 2.0 | DX9c
10. CS 354 10
GeForce Peak Vertex Processing Trends
[Chart: millions of vertices per second (axis 0 to 1,400) by GPU generation]
Note: the rate for a trivial 4x4 vertex transform exceeds peak setup rates, which allows excess vertex processing
Vertex units: GeForce2 GTS 1, GeForce3 1, GeForce4 Ti 4600 2, GeForce FX 3, GeForce 6800 Ultra 6, GeForce 7800 GTX 8
11. CS 354 11
GeForce Peak Memory Bandwidth Trends
[Chart: gigabytes per second (axis 0 to 200) by GPU generation, GeForce2 GTS through GeForce 7800 GTX]
Series: raw bandwidth; effective raw bandwidth with compression; exponential trend line for each
Annotations: 128-bit interface, 256-bit interface
12. CS 354 12
Effective GPU
Memory Bandwidth
Compression schemes
Lossless depth and color compression (when multisampling)
Lossy texture compression (S3TC / DXTC)
Typically assumes 4:1 compression
Avoidance of useless work
Early killing of fragments (Z cull)
Avoiding useless blending and texture fetches
Very clever memory controller designs
Combining memory accesses for improved coherency
Caches for texture fetches
13. CS 354 13
GeForce Core and Memory Clock Rates
[Chart: core clock and memory clock in megahertz (MHz), axis 0 to 1,400, across Riva and GeForce GPU generations]
Note: at the DDR memory transition, memory rates double the physical clock rate
14. CS 354 14
GeForce Peak Triangle Setup Trends
[Chart: millions of triangles per second (axis 0 to 300) by GPU generation, GeForce2 GTS through GeForce 7800 GTX]
Note: assumes 50% face culling
15. CS 354 15
GeForce Peak Texture Fetch Trends
[Chart: millions of texture fetches per second (axis 0 to 12,000) by GPU generation]
Note: assuming no texture cache misses
Texture units: GeForce2 GTS 2×4, GeForce3 2×4, GeForce4 Ti 4600 2×4, GeForce FX 2×4, GeForce 6800 Ultra 16, GeForce 7800 GTX 24
16. CS 354 16
GeForce Peak Depth/Stencil-only Fill
[Chart: millions of depth/stencil pixel updates per second (axis 0 to 18,000) by GPU generation]
Notes: assuming no read-modify-write; double speed depth-stencil only
Raster Op units: GeForce2 GTS 4, GeForce3 4, GeForce4 Ti 4600 4, GeForce FX 4+4, GeForce 6800 Ultra 16+16, GeForce 7800 GTX 16+16
17. CS 354 17
GeForce Transistor Count and Semiconductor Process
[Chart: millions of transistors (axis 0 to 450) by GPU generation]
Process (µm): Riva ZX 0.35, Riva TNT2 0.22, GeForce2 GTS 0.18, GeForce3 0.18, GeForce4 Ti 4600 0.15, GeForce FX 0.13, GeForce 6800 Ultra 0.13, GeForce 7800 GTX 0.11
18. CS 354 18
Hardware Unit | GeForce FX 5900 | GeForce 6800 Ultra | GeForce 7800 GTX
Vertex | 3 | 6 | 8
Fragment + 2nd Texture Fetch | 4+4 | 16 | 24
Raster Color + Raster Depth | 4+4 | 16+16 | 16+16
19. CS 354 19
GeForce 7800 GTX
Board Details
SLI connector
Single slot cooling
sVideo TV out
DVI x 2
16x PCI-Express
256MB/256-bit DDR3 at 600 MHz, 8 pieces of 8Mx32
20. CS 354 20
GeForce 7800 GTX
GPU Details
302 million transistors
430 MHz core clock
256-bit memory interface
Notable Functionality
• Non-power-of-two textures with mipmaps
• Floating-point (fp16) blending and filtering
• sRGB color space texture filtering and
frame buffer blending
• Vertex textures
• 16x anisotropic texture filtering
• Dynamic vertex and fragment branching
• Double-rate depth/stencil-only rendering
• Early depth/stencil culling
• Transparency antialiasing
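As a rough cross-check of the figures above, peak rates can be estimated from unit counts and clocks. The sketch below is an illustration, not vendor data: it assumes one texture fetch per texture unit per core clock, the 24 texture units listed for the GeForce 7800 GTX, and that the 600 MHz memory clock transfers data twice per clock (double data rate). The function names are hypothetical.

```python
def peak_texture_fetch_rate(texture_units, core_clock_hz):
    """Peak fetches per second, assuming one fetch per unit per clock
    and no texture cache misses."""
    return texture_units * core_clock_hz


def peak_memory_bandwidth(bus_width_bits, memory_clock_hz, transfers_per_clock=2):
    """Peak bytes per second; transfers_per_clock=2 models double data
    rate memory (an assumption of this sketch)."""
    return bus_width_bits // 8 * memory_clock_hz * transfers_per_clock


fetches = peak_texture_fetch_rate(24, 430_000_000)
bandwidth = peak_memory_bandwidth(256, 600_000_000)
print(f"{fetches / 1e6:,.0f} million texture fetches per second")
print(f"{bandwidth / 1e9:.1f} GB per second")
```

Under these assumptions the estimate is about 10,320 million fetches per second and 38.4 GB per second of raw memory bandwidth.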
22. CS 354 22
GeForce Graphics Pipeline
Separate dedicated units
[Diagram: CPU → Vertex Engine → Setup → Raster → Fragment Shader → Raster Ops → Frame Buffer]
Attached units: Z Cull, Texture
23. CS 354 23
GeForce Graphics Pipeline
Vertex Engine
Vertex pulling
Vector floating-point instructions
Dynamic branching
Vertex texture
Vertex stream frequency
24. CS 354 24
GeForce Graphics Pipeline
Setup
Prepare triangle for
rasterization
215M triangles/sec setup
25. CS 354 25
GeForce Graphics Pipeline
Raster
Compute coverage
Points, lines, and triangles
Rotated grid multisampling
26. CS 354 26
GeForce Graphics Pipeline
Z Cull
Discard fragments early based on Z
Up to 64 pixels/clock
Multisampled: 256 samples/clock
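The idea of discarding fragments before shading can be sketched in a few lines. This is a simplified per-pixel model under stated assumptions: smaller z means nearer, and the names `shade_with_z_cull` and `shade` are hypothetical. Hardware Z cull typically works on blocks of pixels rather than one pixel at a time.

```python
def shade_with_z_cull(fragments, depth_buffer, shade):
    """Run `shade` only for fragments that pass the depth test.

    fragments: iterable of (x, y, z); smaller z is nearer (assumed).
    depth_buffer: 2D list indexed [y][x], updated in place.
    Returns the number of fragments that were actually shaded.
    """
    shaded = 0
    for x, y, z in fragments:
        if z >= depth_buffer[y][x]:
            continue  # occluded: discard before doing any shading work
        depth_buffer[y][x] = z
        shade(x, y)
        shaded += 1
    return shaded


depth = [[1.0, 1.0], [1.0, 1.0]]
shaded_pixels = []
count = shade_with_z_cull(
    [(0, 0, 0.5), (0, 0, 0.7), (1, 1, 0.2)],
    depth,
    lambda x, y: shaded_pixels.append((x, y)),
)
```

Here the second fragment lies behind the first at the same pixel, so only two of the three fragments reach the shader.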
27. CS 354 27
GeForce Graphics Pipeline
Fragment Shader
User-programmed fragment coloring
Dynamic branching
Long shaders
Multiple render targets
fp16 and fp32 vectors
28. CS 354 28
GeForce Graphics Pipeline
Texture
fp16 and sRGB filtering
16x anisotropic filtering
Non-power-of-two mipmapping
Shadow maps, cube maps, and 3D
Floating-point textures
29. CS 354 29
GeForce Graphics Pipeline
Raster Operations
2x and 4x multisampling
fp16 and sRGB blending
Multiple render targets
Color and depth compression
Double-speed depth/stencil only
30. CS 354 30
Single GeForce 7800 Vertex Unit
[Diagram blocks: Primitive Assembly + Attribute Processing, Vertex Texture Fetch, FP32 Scalar Unit, FP32 Vector Unit, Texture Cache, Branch Unit, Viewport Processing, To Setup]
Vertex Processing Engine
• MIMD Architecture
• Dual Issue
• Low-penalty branching
• Shader Model 3.0
• 32 vector registers
• 512 static instructions per program
• Indexed input and output registers
Vertex Texture Fetch
• Non-stalling
• Up to 4 texture units
• Unlimited fetches
• Mipmapping, no filtering
31. CS 354 31
Vertex Texturing Example
[Diagram: flat tessellated mesh + height field texture → vertex program → displaced mesh]
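What the vertex program does in this example can be sketched on the CPU. The code below is a minimal illustration, assuming each vertex carries texture coordinates (u, v) in [0, 1] and that the height is read with an unfiltered nearest-texel lookup; `displace_mesh` and its parameters are hypothetical names, not part of any graphics API.

```python
def displace_mesh(vertices, height_texture, scale=1.0):
    """Return vertices displaced along +y by a nearest-texel height lookup.

    vertices: list of (x, y, z, u, v) with u, v in [0, 1].
    height_texture: 2D list of floats indexed [row][column].
    """
    rows, cols = len(height_texture), len(height_texture[0])
    displaced = []
    for x, y, z, u, v in vertices:
        col = min(int(u * cols), cols - 1)  # clamp u == 1.0 to the last texel
        row = min(int(v * rows), rows - 1)
        displaced.append((x, y + scale * height_texture[row][col], z))
    return displaced


flat = [
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0, 0.0, 1.0),
    (1.0, 0.0, 1.0, 1.0, 1.0),
]
heights = [[0.0, 0.5],
           [1.0, 0.25]]
mesh = displace_mesh(flat, heights, scale=2.0)
```

Each vertex keeps its x and z position and gains a height taken from the texel its (u, v) coordinates fall in.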
33. CS 354 33
Vertex Textures to Drive
Particle Systems
Render-to-texture
Simulation runs
in floating-point
frame buffer, also
usable as texture
Vertex textures
Determines particle
location with
vertex texture
fetch
34. CS 354 34
Single GeForce 7800
Fragment Shader Pipeline
[Diagram blocks: Input Fragment Data, Texture Data, FP32 Shader Unit 1, FP32 Shader Unit 2, Texture Processor, Texture Cache, Branch Processor, Fixed-function Fog Unit, Output Shaded Fragments]
Texture Processor
16 texture units
1 texture fetch at full speed
Bilinear or tri-linear filtering
16x anisotropic filtering
Floating-point (fp16) texture filtering
Shader Unit 1
4 MULs + RCP
Dual Issue
Texture address calculation
Fast fp16 normalize
Free: negate, abs, condition codes
Shader Unit 2
4 MADs or DP4
Dual Issue
Free: negate, abs, condition codes
35. CS 354 35
Operations Per Fragment Shader Pass
Shader Unit 1: 4 components, 1 op / component, 4 ops / fragment per pass; or 1 texture / fragment at full speed per pass
Shader Unit 2: 4 components, 1 op / component, 4 ops / fragment per pass
Total: 8 operations / fragment per pass
36. CS 354 36
Fragment Shader
Component Co-issue
Use 4 components various ways
RGBA all together
RGB and A
RG and BA
Both shader units
Two operations per shader unit
Shader Unit 1 (R G B A): Operation 1, Operation 2
Shader Unit 2 (R G B A): Operation 3, Operation 4
37. CS 354 37
Single GeForce 7800
Raster Operations Pipeline
[Diagram blocks: Input Shaded Fragment Data, Pixel Crossbar Interconnect, Multisample Antialiasing, Depth Compression, Color Compression, Depth Raster Operations, Color Raster Operations, Memory, Frame Buffer Partition]
Functionality
• OpenEXR floating-point blending
• sRGB blending
• 4x rotated grid multisampling
• Lossless color and depth compression
• Multiple render targets
39. CS 354 39
Scalable Link Interface (SLI)
Gang two GeForce 6600, 6800, or 7800
graphics boards together
Can almost double your performance
SLI
Connector
Two 6800 Ultras
pictured
40. CS 354 40
SLI Rendering Modes
Split Frame Rendering (SFR)
One GPU renders top of screen; other renders the bottom
Scales fragment processing but not vertex processing
Alternate Frame Rendering (AFR)
Scales both vertex and fragment processing
Adds frame latency
Rendering must be free of CPU synchronization
SLI Antialiasing: SLI8x and SLI16x
Better antialiasing quality rather than performance
Each card renders with slightly different sub-pixel offset
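How the first two modes divide work can be sketched as follows. This is an illustration only: it assumes an even split of scanlines for SFR (a real driver may move the split line to balance load), and the function names are hypothetical.

```python
def split_frame_rendering(height, num_gpus=2):
    """SFR: assign a contiguous band of scanlines to each GPU."""
    band = -(-height // num_gpus)  # ceiling division
    return [range(g * band, min((g + 1) * band, height)) for g in range(num_gpus)]


def alternate_frame_rendering(frame_index, num_gpus=2):
    """AFR: pick the GPU that renders an entire frame."""
    return frame_index % num_gpus


bands = split_frame_rendering(1200)
owners = [alternate_frame_rendering(f) for f in range(4)]
```

With SFR every GPU still processes all of the frame's vertices and only the fragment work is divided by screen region, which is why it scales fragment processing but not vertex processing. With AFR each GPU handles whole frames, so both kinds of work scale, at the cost of added frame latency.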
41. CS 354 41
PC Graphics Hardware Evolution
Viable economics: 650 million GeForce GPUs since 1999
1,000x complexity since 1995
Moore’s Law at work
[Chart: transistor count over time, 1997 to 2010]
RIVA 128: 3M transistors
GeForce 256: 23M transistors
GeForce 3: 60M transistors
GeForce FX: 125M transistors
GeForce 8800: 681M transistors
GeForce 580 GTX: 3B transistors
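The growth rate implied by the chart's endpoints can be worked out directly. The sketch below assumes the RIVA 128 (3M transistors) sits at 1997 and the GeForce 580 GTX (3B transistors) at 2010, as read from the chart's time axis; the function name is hypothetical.

```python
import math


def doubling_period_months(count_start, count_end, years):
    """Months per doubling for exponential growth between two counts."""
    doublings = math.log2(count_end / count_start)
    return years * 12 / doublings


months = doubling_period_months(3e6, 3e9, 2010 - 1997)
print(f"Transistor count doubled roughly every {months:.1f} months")
```

A 1,000x increase is just under ten doublings, so over 13 years the doubling period comes out between 15 and 16 months.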
45. CS 354 45
Streaming Multiprocessor (SM)
Multi-processor execution unit
32 scalar processor cores
Warp is a unit of thread execution of up to 32 threads
Two workloads
Graphics: vertex shader, tessellation, geometry shader, fragment shader
Compute
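Grouping threads into warps of up to 32 can be illustrated with a short sketch. The helper name `partition_into_warps` is hypothetical, and the code models only the grouping, not how the hardware schedules warps.

```python
WARP_SIZE = 32


def partition_into_warps(num_threads, warp_size=WARP_SIZE):
    """Return one (first_thread, thread_count) pair per warp."""
    return [
        (start, min(warp_size, num_threads - start))
        for start in range(0, num_threads, warp_size)
    ]


warps = partition_into_warps(100)
```

For 100 threads this gives three full warps and a final warp with only 4 active threads, which is why the definition says "up to" 32 threads.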
46. CS 354 46
OpenGL Pipeline Programmable
Domains run on Unified Hardware
Unified Streaming Processor Array (SPA) architecture
means same capabilities for all domains
Plus tessellation + compute (not shown below)
[Diagram: GPU Front End → Vertex Assembly → Primitive Assembly → Clipping, Setup, and Rasterization → Raster Operations]
Programmable domains: Vertex Program, Primitive Program, Fragment Program (can be unified hardware!)
Memory Interface paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access
48. CS 354 48
Shader or CUDA Core,
Same Unit but Two Personalities
Execution unit
Scalar floating-point
Scalar integer
49. CS 354 49
Levels of Caching in Fermi GPU
12 KB L1 texture cache
Per texture unit
64 KB SM cache
Split into a dedicated load/store cache (16 KB or 48 KB) and shared memory (48 KB or 16 KB)
768 KB L2 cache
Unifies the texture cache, raster operation cache, and internal buffering of the prior generation
Read / write
Fully coherent
50. CS 354 50
Cache Use Strategies
in Fermi GPU
Pipeline stages can communicate efficiently through
GPU’s L1 and L2 caches
Buffering between stages stays all on chip
Only vertex, texel, and pixel read/writes need to go to DRAM
51. CS 354 51
Vertex and Tessellation
Processing Tasks
Fixed-function graphics engines
Pull attributes and assemble vertex
Manage tessellation control and domain shader evaluation
Viewport transform
Attribute setup of plane equations for rasterization
Stream out vertices into buffers
52. CS 354 52
Rasterization Tasks
Turns primitives into fragments
Computes edge equations
Two-stage rasterization
Coarse raster finds tiles the primitive could be in
Fine raster evaluates sample positions within tiles
Zcull efficiently eliminates occluded fragments
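The edge-equation and two-stage ideas above can be sketched in software. This is a simplified illustration with several assumptions: the coarse stage rejects tiles using only the triangle's bounding box (hardware would test tiles against the edge equations), the fine stage samples one point at each pixel center, and the triangle's vertices are ordered so the edge function is non-negative inside. All names are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge equation for directed edge a->b evaluated at point p."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def inside(tri, px, py):
    """True when p is on the non-negative side of all three edges."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (edge(x0, y0, x1, y1, px, py) >= 0
            and edge(x1, y1, x2, y2, px, py) >= 0
            and edge(x2, y2, x0, y0, px, py) >= 0)


def rasterize(tri, width, height, tile=4):
    """Return the set of (x, y) pixels whose centers the triangle covers."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    covered = set()
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Coarse raster: skip tiles outside the bounding box.
            if (tx >= max(xs) or tx + tile <= min(xs)
                    or ty >= max(ys) or ty + tile <= min(ys)):
                continue
            # Fine raster: evaluate sample positions within the tile.
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    if inside(tri, x + 0.5, y + 0.5):
                        covered.add((x, y))
    return covered


pixels = rasterize([(0, 0), (8, 0), (0, 8)], 16, 16)
```

On a 16x16 grid with 4x4 tiles, the coarse stage rejects 12 of the 16 tiles for this triangle before any per-pixel test runs.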
56. CS 354 56
GPUs as Compute Nodes
The GPU architecture has evolved into a high-performance, high-bandwidth compute node
[Diagram: compute product form factors]
Small form factor
Integrated CPU-GPU servers & blades
OEM CPU server + compute 1U
Workstations with 2 to 4 Tesla GPUs
57. CS 354 57
Compute Programming Model
Cooperative Thread Array (CTA)
Single Program, Multiple Data
Organized in 1D, 2D, or 3D
Programming APIs
CUDA, OpenCL, DirectCompute
APIs + language = parallel processing system
OpenGL or Direct3D through shaders
Cg, HLSL, GLSL
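The single-program, multiple-data idea behind a CTA can be sketched without a GPU. The code below is an emulation under stated assumptions: `launch_cta` and `saxpy_kernel` are hypothetical names rather than calls from CUDA, OpenCL, or DirectCompute, and the threads run one after another here, whereas a GPU runs them in parallel.

```python
from itertools import product


def launch_cta(kernel, cta_dim, *args):
    """Run `kernel` once per thread index in a CTA of shape cta_dim.

    cta_dim is a tuple of 1, 2, or 3 sizes, matching the 1D, 2D, or 3D
    organization of a thread array.
    """
    for thread_idx in product(*(range(n) for n in cta_dim)):
        kernel(thread_idx, *args)


def saxpy_kernel(thread_idx, a, x, y, out):
    """Every thread runs this same program on its own element."""
    (i,) = thread_idx
    out[i] = a * x[i] + y[i]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
launch_cta(saxpy_kernel, (4,), 2.0, x, y, out)
```

The only thing that differs between threads is the index each one receives, which is what lets the same program cover the whole data set.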
58. CS 354 58
Now in World’s Fastest
Supercomputers
Tianhe-1A
2.507 petaflops
7,168 Tesla M2050 GPUs
National Supercomputing Center
in Tianjin
59. CS 354 59
Opposite direction:
Consumer mobile devices
60. CS 354 60
Low-power Mobile
System on a Chip (SoC)
Complete system on a chip
4 ARM cores
Integrated graphics
OpenGL ES 2.0
Power <1W
61. CS 354 61
Mid-term Next Class
Mid-term
Similar in format to the homeworks
15% of your final grade
Arrive on time
Open textbook. Open notes, including lecture slides.
Calculators allowed/encouraged.
No smart phones, no computers, no Internet access.
Show your work to justify your answer and provide a basis for partial
credit.
What to study
All material in lecture slides
Review in-class quiz questions
Study homeworks
Responsible for textbook readings
Have a relaxing spring break
Next lecture: Shadows
Come back to Project 2
Editor's notes
The technology of graphics processors has evolved amazingly over the last 15 years or so. I’ve been at NVIDIA for more than 10 years and have seen a lot of this firsthand. As the hardware increases in performance, the visual quality improves. This is driven by Moore’s law, which says that the number of transistors able to fit on a piece of silicon doubles roughly every 18 months. The great thing about graphics is that it has an insatiable appetite for computation. We’re clearly not at photo-realistic quality yet and still have a long way to go.
This is the world’s fastest known supercomputer today; the official Top500 list comes out next month. Peta = 10^15, so one petaflop is a thousand trillion floating-point operations per second.