3. GPU (Graphics Processing Unit): PC hardware dedicated to 3D graphics; a massively parallel SIMD processor whose performance is pushed by the game industry. (Image: NVIDIA SLI system)
4. GPGPU: General-Purpose computation on the GPU. Started in the computer graphics research community by mapping computational problems onto the graphics rendering pipeline. (Image courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong)
5. Why GPU for computing? The GPU is fast and massively parallel: CPU ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core) vs. GPU ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200). High memory bandwidth. Programmable: NVIDIA CUDA, DirectX Compute Shader, OpenCL. High-precision floating point: 64-bit floating point (IEEE 754). An inexpensive desktop supercomputer: NVIDIA Tesla C1060, ~1 TFLOPS @ $1000.
11. What's in a GPU? A heterogeneous chip multi-processor, highly tuned for graphics: many compute cores and texture units plus fixed-function blocks (input assembly, rasterizer, output blend, video decode) and a work distributor. (Whether the work distributor is HW or SW is an open question.)
12. CPU-"style" cores: Fetch/Decode, ALU (Execute), and Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache.
13. Slimming down. Idea #1: remove the components that help a single instruction stream run fast, keeping only Fetch/Decode, ALU (Execute), and Execution Context.
14. Two cores (two threads in parallel): each core has its own Fetch/Decode, ALU (Execute), and Execution Context, and each runs the same <diffuseShader> instruction stream on its own thread.
15. Four cores (four threads in parallel): four copies of Fetch/Decode, ALU (Execute), and Execution Context.
16. Sixteen cores (sixteen threads in parallel): 16 cores = 16 simultaneous instruction streams.
17. Instruction stream sharing. But… many threads should be able to share an instruction stream, since they all run the same program:
<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
19. Add ALUs. Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs (SIMD processing). One Fetch/Decode unit now feeds ALUs 1-8, each with its own context (Ctx), plus shared context data.
20. Modifying the code. Original compiled shader: processes one thread using scalar ops on scalar registers.
<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
21. Modifying the code. New compiled shader: processes 8 threads using vector ops on vector registers.
<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov vec_o3, l(1.0)
22. Modifying the code. The same <VEC8_diffuseShader> instruction stream now drives ALUs 1-8 in lockstep, with fragments 1-8 mapped one per ALU lane and their state held in the per-lane contexts plus shared context data.
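To make the transformation concrete outside shader ISA, here is a minimal C-style sketch of the same idea; the types, function names, and 8-wide width are illustrative assumptions, not real compiler output:

#define WIDTH 8                                  /* one lane per ALU */
typedef struct { float v[WIDTH]; } vec8;         /* "vector register": one value per lane */

/* scalar version: shades one fragment per call */
float shade_scalar(float n_dot_l, float albedo)
{
    float y = n_dot_l * albedo;
    return y < 0.0f ? 0.0f : (y > 1.0f ? 1.0f : y);   /* clamp, like clmp above */
}

/* 8-wide version: one instruction stream produces results for 8 fragments */
vec8 shade_vec8(vec8 n_dot_l, vec8 albedo)
{
    vec8 out;
    for (int lane = 0; lane < WIDTH; ++lane)     /* conceptually a single VEC8 instruction */
        out.v[lane] = shade_scalar(n_dot_l.v[lane], albedo.v[lane]);
    return out;
}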
23. 128 threads in parallel: 16 cores = 128 ALUs = 16 simultaneous instruction streams.
24-27. But what about branches? Consider a per-fragment branch inside the shared instruction stream:

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

Across ALUs 1-8, some lanes evaluate the condition as T and others as F (e.g., T T T F F F F F). Over the following clocks the hardware executes both sides of the branch, masking off the lanes that did not take the current side, and only then do all lanes resume the unconditional code together. Not all ALUs do useful work! Worst case: 1/8 performance.
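The same hazard shows up directly in CUDA. A minimal sketch (the kernel and variable names are made up for illustration): when threads of one warp evaluate the condition differently, the hardware serializes the two paths with inactive lanes masked off, so heavily divergent warps approach the worst case above.

// Illustrative CUDA kernel with a data-dependent branch inside a warp.
// Divergent lanes cause both sides of the if/else to be executed in sequence.
__global__ void shadeDivergent(const float* x, float* refl,
                               float Ks, float Ka, float expo, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi = x[i];
    if (xi > 0.0f) {                 // lanes where xi > 0 execute this path first...
        float y = powf(xi, expo) * Ks;
        refl[i] = y + Ka;
    } else {                         // ...then the remaining lanes execute this one
        refl[i] = Ka;
    }
}

Divergence costs nothing when the condition is uniform across a warp, which is why grouping work so that neighboring threads take the same path is a common optimization.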
32. In practice: NVIDIA GeForce ("SIMT" warps) and ATI Radeon architectures have 16 to 64 threads share an instruction stream.
33. Stalls! Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency is hundreds to thousands of cycles, and we have removed the fancy caches and logic that help avoid stalls.
34. But we have LOTS of independent threads. Idea #3: interleave processing of many threads on a single core to avoid stalls caused by high-latency operations.
35. Hiding stalls: while one group of threads (1-8) waits on memory, the core's Fetch/Decode unit and eight ALUs are free to work on something else.
36. Hiding stalls: the core interleaves four groups (threads 1-8, 9-16, 17-24, 25-32), switching to the next runnable group whenever the current one stalls.
40. Throughput! Each group starts, stalls on a long-latency operation, and later becomes runnable again; by switching among groups 1-4 the core trades a longer run time for any one group for maximum throughput across many groups.
41. Storing contexts: the core keeps a fixed pool of on-chip context storage (32 KB here) alongside the Fetch/Decode unit and the eight ALUs, partitioned among thread groups.
42. Twenty small contexts: maximal latency-hiding ability.
43. Twelve medium contexts.
44. Four large contexts: low latency-hiding ability.
45. GPU block diagram key: one symbol for a single "physical" instruction stream fetch/decode unit (functional unit control); one for a SIMD programmable functional unit (FU) whose control is shared with other functional units and which may contain multiple 32-bit "ALUs"; one each for a 32-bit mul-add unit, a 32-bit multiply unit, execution context storage, and a fixed-function unit.
46. Example: NVIDIA GeForce GTX 280. NVIDIA-speak: 240 stream processors, "SIMT execution" (automatic HW-managed sharing of the instruction stream). Generic speak: 30 processing cores, 8 SIMD functional units per core, 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock). Best case: 240 mul-adds + 240 muls per clock at the ~1.3 GHz shader clock, so 30 * 8 * (2 + 1) * 1.296 GHz ≈ 933 GFLOPS (936 with the clock rounded to 1.3 GHz). Mapping data-parallelism to the chip: the instruction stream is shared across 32 threads, with 8 threads running on the 8 SIMD functional units in one clock.
50. Summary: three key ideas. (1) Use many "slimmed down" cores running in parallel. (2) Pack cores full of ALUs by sharing an instruction stream across groups of threads; Option 1: explicit SIMD vector instructions, Option 2: implicit sharing managed by hardware. (3) Avoid latency stalls by interleaving execution of many groups of threads: when one group stalls, work on another group.
53. Data parallelism: run a single kernel over many elements; each element is independently updated and the same operation is applied to each. This fine-grain parallelism (many lightweight threads, cheap context switches) maps well to an ALU-heavy architecture: the GPU. (Figure: one kernel applied to data elements P1 ... Pn.)
54. GPU-friendly problems: data-parallel processing; high arithmetic intensity (keep the GPU busy all the time, so computation offsets memory latency); coherent data access (access large chunks of contiguous memory, exploit fast on-chip shared memory).
57. GPU programming languages. Using graphics APIs: GLSL, Cg, HLSL. Computing-specific APIs: DX 11 Compute Shaders, NVIDIA CUDA, OpenCL.
58. NVIDIA CUDA: a C-extension programming language with no graphics API and support for debugging tools. Extensions / API: function type qualifiers (__global__, __device__, __host__), variable type qualifiers (__shared__, __constant__), and low-level functions (cudaMalloc(), cudaFree(), cudaMemcpy(), ..., __syncthreads(), atomicAdd(), ...). Program types: the device program (kernel) runs on the GPU; the host program runs on the CPU and calls device programs. See the sketch below for the qualifiers in context.
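A minimal sketch showing the qualifiers and low-level functions together (the kernel, its buffers, and the 256-thread block size are assumptions made for illustration):

__constant__ float scale;                       // constant memory, written by the host via cudaMemcpyToSymbol()

__device__ float square(float v) { return v * v; }   // callable from device code only

__global__ void scaleAndSquare(const float* in, float* out, int n)   // kernel: launched from the host
{
    __shared__ float tile[256];                 // shared memory, visible to one block (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] * scale : 0.0f;
    __syncthreads();                            // barrier for the threads of this block
    if (i < n) out[i] = square(tile[threadIdx.x]);
}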
59. CUDA programming model. A kernel is a GPU program that runs on a thread grid. Thread hierarchy: a grid is a set of blocks, a block is a set of threads, and grid size * block size = total number of threads. (Figure: a grid of blocks 1 ... n, each block containing many threads running the kernel.) A launch-configuration sketch follows below.
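A small host-side sketch of the hierarchy (the data size N, block size, and kernel name myKernel are illustrative assumptions):

int N = 1 << 20;                               // number of elements to process
dim3 block(256);                               // threads per block
dim3 grid((N + block.x - 1) / block.x);        // blocks per grid, rounded up
myKernel<<<grid, block>>>(d_data, N);          // grid size * block size >= N threads in total

// inside the kernel, each thread computes its global index from the hierarchy:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;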
60. CUDA memory structure. Memory hierarchy: PC memory (DRAM) is off-card; GPU global memory (DRAM) is off-chip but on-card; shared memory, registers, and cache are on-chip. The host can read/write global memory, and threads within a block communicate through shared memory. (The block diagram labels the three levels with relative access costs of roughly 1, 200, and 4000 for on-chip shared memory, GPU global memory, and PC memory.)
61. Synchronization. Threads in the same block can communicate through shared memory; there is no HW global synchronization function yet. __syncthreads() is a barrier for threads within the current block only; __threadfence() flushes global memory writes to make them visible to all threads.
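A typical pattern that depends on this block-level barrier, sketched below: a per-block sum in shared memory. The 256-thread, power-of-two block size is an assumption, and combining the per-block results still needs a second kernel or atomics, since there is no global barrier.

__global__ void blockSum(const float* in, float* blockSums, int n)
{
    __shared__ float buf[256];                  // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // all loads finish before anyone reads a neighbor

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                        // barrier outside the if, so every thread reaches it
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = buf[0];         // one partial sum per block
}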
62. Example: CPU vector addition
// Pair-wise addition of vector elements
// CPU version : serial add
void vectorAdd(float* iA, float* iB, float* oC, int num)
{
    for (int i = 0; i < num; i++)
    {
        oC[i] = iA[i] + iB[i];
    }
}
63. Example: CUDA vector addition
// Pair-wise addition of vector elements
// CUDA version : one thread per addition
__global__ void vectorAdd(float* iA, float* iB, float* oC)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}
64. Example: CUDA host code
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// ... initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
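The slide stops at the launch, which also assumes N is a multiple of 256. The remaining steps would look roughly like this (a sketch; h_C is a host buffer assumed to be allocated like h_A and h_B):

// copy the result back to the host
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// release device and host memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);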
65. OpenCL (Open Computing Language): the first industry-standard computing language. Based on the C language and platform independent (NVIDIA, ATI, Intel, ...). Data- and task-parallel compute model that can use all computational resources in the system (CPU, GPU, ...). A work-item is the same as a thread / fragment / etc.; a work-group is a group of work-items. Work-items in the same work-group can communicate, and multiple work-groups execute in parallel.
66. OpenCL program structure. Host program (CPU): platform layer (query compute devices, create a context) and runtime (create memory objects, compile and create kernel program objects, issue commands such as kernel launches to the command-queue, synchronize commands, clean up OpenCL resources). Kernel (CPU, GPU): C-like code with some extensions that runs on the compute device.
67. CUDA vs. OpenCL comparison. Conceptually almost identical: work-item == thread, work-group == block; similar memory model (global, local, shared memory); both have a kernel and a host program. CUDA is highly optimized for NVIDIA GPUs only, while OpenCL can be used widely across GPUs and CPUs.
68. Implementation status of OpenCL. Specification 1.0 released by Khronos. NVIDIA released a Beta 1.2 driver and SDK, available to registered GPU computing developers. Apple will include OpenCL in Mac OS X Snow Leopard (Q3 2009), targeting NVIDIA and ATI GPUs and Intel CPUs for Mac. More companies will join.
69. GPU optimization tips: configuration. Identify the bottleneck: compute- or bandwidth-bound (use the profiler), and focus on the most expensive but parallelizable parts (Amdahl's law, stated below). Maximize parallel execution: use large inputs (many threads) and avoid divergent execution. Use limited resources efficiently: minimize shared memory and register use.
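For reference, Amdahl's law makes the "focus on the parallelizable parts" advice quantitative: if a fraction p of the runtime is parallelizable and that part is sped up by a factor s, the overall speedup is

    speedup = 1 / ((1 - p) + p / s)

so even with an arbitrarily fast GPU (s -> infinity), accelerating a kernel that covers 90% of the runtime caps the whole application at 10x.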
70. GPU optimization tips: memory. Memory access is the most important optimization. Minimize device-to-host memory overhead and overlap kernels with memory copies (asynchronous copy); a stream-based sketch follows below. Avoid shared memory bank conflicts and use coalesced global memory access. Texture or constant memory can help, since it is cached. (The memory-hierarchy diagram from slide 60 is repeated here.)
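A sketch of the asynchronous-copy tip using the earlier vectorAdd kernel: the data is split in half and each half gets its own stream, so the copy for one half can overlap the kernel for the other. The two-way split is an assumption, N is assumed to be an even multiple of 256, and the host buffers must be allocated with cudaMallocHost (pinned memory) for the copies to be truly asynchronous.

cudaStream_t stream[2];
cudaStreamCreate(&stream[0]);
cudaStreamCreate(&stream[1]);

int half = N / 2;
for (int s = 0; s < 2; s++) {
    int off = s * half;
    // work issued to the same stream runs in order; work in different streams may overlap
    cudaMemcpyAsync(d_A + off, h_A + off, half * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);
    cudaMemcpyAsync(d_B + off, h_B + off, half * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);
    vectorAdd<<< half / 256, 256, 0, stream[s] >>>(d_A + off, d_B + off, d_C + off);
    cudaMemcpyAsync(h_C + off, d_C + off, half * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[s]);
}
cudaStreamSynchronize(stream[0]);
cudaStreamSynchronize(stream[1]);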
71. GPU optimization tips: instructions. Use less expensive operators (division: 32 cycles, multiplication: 4 cycles), e.g., multiply by 0.5 instead of dividing by 2.0. Atomic operators are expensive and imply possible race conditions. Double precision is much slower than float. Use the less accurate fast-math intrinsics when possible: __sinf(), __expf(), __powf(). Save unnecessary instructions with loop unrolling. Small examples follow below.
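Small illustrative fragments of these tips (device-code snippets; the variable names and the unrolled loop are made up, and __sinf()/__expf() are the CUDA fast-math intrinsics):

// cheaper operator: multiply by the reciprocal instead of dividing
y = x * 0.5f;                 // instead of y = x / 2.0f;

// faster but less accurate intrinsics, when the precision loss is acceptable
s = __sinf(theta);            // instead of sinf(theta)
e = __expf(t);                // instead of expf(t)

// save loop overhead by letting the compiler unroll a short, fixed-size loop
#pragma unroll
for (int k = 0; k < 4; k++)
    acc += w[k] * v[k];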
73. ITK image filters implemented using CUDA. Convolution filters: mean filter, Gaussian filter, derivative filter, Hessian of Gaussian filter. Statistical filter: median filter. PDE-based filter: anisotropic diffusion filter.
74. CUDA ITK. CUDA code is integrated into ITK transparently to ITK users: no need to modify existing code that uses the ITK library. The entry point, GenerateData() or ThreadedGenerateData(), checks the environment variable ITK_CUDA: if ITK_CUDA == 0, execute the original ITK code; if ITK_CUDA == 1, execute the CUDA code.
76. Naïve C/CUDA implementation: read from the input image whenever it is needed.
int xdim, ydim;   // size of input image
float *in, *out;  // input/output image of size xdim*ydim
float w[][];      // convolution kernel of size n*m

for (x = 0; x < xdim; x++) {
    for (y = 0; y < ydim; y++) {
        // compute convolution
        for (sx = x - n/2; sx <= x + n/2; sx++) {
            for (sy = y - m/2; sy <= y + m/2; sy++) {
                wx = sx - x + n/2;
                wy = sy - y + m/2;
                out[x][y] += w[wx][wy] * in[sx][sy];
            }
        }
    }
}
For the xdim*ydim output pixels, each one loads its n*m neighborhood from global memory, so every input pixel is fetched n*m times.
77. Improved CUDA convolution filter. For an n*m filter, each pixel is reused n*m times, so staging the block in shared memory saves n*m-1 global memory loads per pixel.
__global__ cudaConvolutionFilter2DKernel(in, out, w)
{
    // copy global to shared memory: load from global memory (slow), only once
    sharedmem[] = in[][];
    __syncthreads();

    // sum neighbor pixel values: n*m loads from shared memory (fast)
    float _sum = 0;
    for (uint j = threadIdx.y; j <= threadIdx.y + m; j++) {
        for (uint i = threadIdx.x; i <= threadIdx.x + n; i++) {
            wx = i - threadIdx.x;
            wy = j - threadIdx.y;
            _sum += w[wx][wy] * sharedmem[j * sharedmemdim.x + i];
        }
    }
}
78. CUDA Gaussian filter: apply a 1D convolution filter along each axis, using temporary buffers in a ping-pong fashion.
// temp[0], temp[1] : temporary buffers to store intermediate results
void cudaDiscreteGaussianImageFilter(in, out, stddev)
{
    // create Gaussian weight
    w = ComputeGaussKernel(stddev);
    temp[0] = in;

    // call the 1D convolution CUDA kernel with the Gaussian weight, once per axis
    dim3 G, B;
    for (i = 0; i < dimension; i++) {
        cudaConvolutionFilter1DKernel<<<G, B>>>(temp[i%2], temp[(i+1)%2], w);
    }
    out = temp[i%2];
}
79. Median filter (Viola et al. [VIS 03]): find the median by bisection of histogram bins, taking log2(#bins) iterations; for an 8-bit pixel, log2(256) = 8 iterations. (Figure: example pixel block 1 4 3 1 8 2 1 0 and the intensity histogram over 0-7 being repeatedly halved.) Copy the current block from global to shared memory, then:
min = 0; max = 255;
pivot = (min + max) / 2.0f;
for (i = 0; i < 8; i++) {
    count = 0;
    for (j = 0; j < kernelsize; j++) {
        if (kernel[j] > pivot) count++;
    }
    if (count < kernelsize/2) max = floor(pivot);
    else min = ceil(pivot);
    pivot = (min + max) / 2.0f;
}
return floor(pivot);
80. Perona & Malik anisotropic diffusion Nonlinear diffusion Adaptive smoothing based on magnitude of gradient Preserves edges (high gradient) Numerical solution Euler explicit integration (iterative method) Finite difference for derivative computation 76 Input Image Linear diffusion P & M diffusion
82. CUDA ITK source code is available at http://sourceforge.net/projects/cudaitk/
83. CUDA ITK future work: an ITK GPU image class to reduce CPU-to-GPU memory I/O and support pipelining; a native interface for GPU code (similar to ThreadedGenerateData(), but for GPU threads); a numerical library (vnl); out-of-GPU-core / GPU-cluster processing of large images (10~100 terabytes); a platform-independent GPU implementation, for which OpenCL could be a solution.
84. Conclusions. GPU computing delivers high performance, and many scientific computing problems are parallelizable. There is more consistency/stability in HW/SW: the main GPU architecture is mature, an industry-wide programming standard now exists (OpenCL), and better support/tools are available (C-based language, compiler, and debugger). Issues: not every problem is suitable for GPUs, re-engineering of algorithms/software is required, and the future performance growth of GPU hardware is unclear (e.g., Intel's Larrabee).
85. thrust: a CUDA library of data-parallel algorithms & data structures with an interface similar to the C++ Standard Template Library. C++ template metaprogramming automatically chooses the fastest code path at compile time.
86. thrust::sort
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;

    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());
    return 0;
}
http://thrust.googlecode.com
Editor's Notes
Fluid flow, level set segmentation, DTI image
One of the major debates you’ll see in graphics in the coming years is whether the scheduling and work-distribution logic should be provided as highly optimized hardware or implemented as a software program on the programmable cores.
Pack the core full of ALUs. We are not going to increase our core’s ability to decode instructions: we will decode 1 instruction and execute it on all 8 ALUs.
How can we make use of all these ALUs?
Just have the shader program work on 8 fragments at a time. Replace the scalar operations with 8-wide vector ones.
So the program processes 8 fragments at a time, and all the work for each fragment is carried out by 1 of the 8 ALUs. Notice that I’ve also replicated part of the context to store execution state for the 8 fragments; for example, I’d replicate the registers.
We continue this process, moving to a new group each time we encounter a stall. If we have enough groups there will always be some work to do, and the processing core’s ALUs never go idle.
Described adding contexts. In reality there’s a fixed pool of on-chip storage that is partitioned to hold contexts. Instead of using on-chip storage as a traditional data cache, GPUs choose to use this store to hold contexts.
Shading performance relies on large-scale interleaving. Number of interleaved groups per core: ~20-30. These could be separate hardware-managed contexts or software-managed using other techniques.
Fewer contexts fit on chip; the chip can hide less latency; higher likelihood of stalls.
Lose performance when shaders use a lot of registers.
128 simultaneous threads on each core
Drive these ALUs using explicit SIMD instructions or implicitly via HW-determined sharing.