3. GPU (Graphics Processing Unit): PC hardware dedicated to 3D graphics; a massively parallel SIMD processor whose performance is pushed by the game industry. (Image: NVIDIA SLI system)
4. GPGPU: General-Purpose computation on the GPU. Started in the computer graphics research community by mapping computational problems onto the graphics rendering pipeline. (Image courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong)
5. Why GPU for computing? The GPU is fast and massively parallel: CPU ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core) vs. GPU ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200). High memory bandwidth. Programmable: NVIDIA CUDA, DirectX Compute Shader, OpenCL. High-precision floating point: 64-bit floating point (IEEE 754). An inexpensive desktop supercomputer: NVIDIA Tesla C1060, ~1 TFLOPS @ $1000.
11. What's in a GPU? A heterogeneous chip multi-processor, highly tuned for graphics: many compute cores and texture units plus fixed-function blocks (input assembly, rasterizer, output blend, video decode) and a work distributor. (Whether the work distributor is HW or SW is an open question.)
12. CPU-"style" cores: Fetch/Decode, ALU (Execute), and Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache.
13. Slimming down. Idea #1: remove the components that help a single instruction stream run fast, keeping only Fetch/Decode, ALU (Execute), and Execution Context.
14. Two cores (two threads in parallel): each core has its own Fetch/Decode, ALU (Execute), and Execution Context, and each runs the same <diffuseShader> instruction stream on its own thread.
15. Four cores (four threads in parallel): four copies of Fetch/Decode, ALU (Execute), and Execution Context.
16. Sixteen cores (sixteen threads in parallel): 16 cores = 16 simultaneous instruction streams.
17. Instruction stream sharing. But… many threads should be able to share an instruction stream, since they all run the same program:
<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
19. Add ALUs. Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs (SIMD processing). One Fetch/Decode unit now feeds ALUs 1-8, each with its own context (Ctx), plus shared context data.
20. Modifying the code. Original compiled shader: processes one thread using scalar ops on scalar registers.
<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
21. Modifying the code. New compiled shader: processes 8 threads using vector ops on vector registers.
<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov vec_o3, l(1.0)
22. Modifying the code. The same <VEC8_diffuseShader> instruction stream now drives ALUs 1-8 in lockstep, with fragments 1-8 mapped one per ALU lane and their state held in the per-lane contexts plus shared context data.
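To make the transformation concrete outside shader ISA, here is a minimal C-style sketch of the same idea; the types, function names, and 8-wide width are illustrative assumptions, not real compiler output:

#define WIDTH 8                                  /* one lane per ALU */
typedef struct { float v[WIDTH]; } vec8;         /* "vector register": one value per lane */

/* scalar version: shades one fragment per call */
float shade_scalar(float n_dot_l, float albedo)
{
    float y = n_dot_l * albedo;
    return y < 0.0f ? 0.0f : (y > 1.0f ? 1.0f : y);   /* clamp, like clmp above */
}

/* 8-wide version: one instruction stream produces results for 8 fragments */
vec8 shade_vec8(vec8 n_dot_l, vec8 albedo)
{
    vec8 out;
    for (int lane = 0; lane < WIDTH; ++lane)     /* conceptually a single VEC8 instruction */
        out.v[lane] = shade_scalar(n_dot_l.v[lane], albedo.v[lane]);
    return out;
}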
23. 128 threads in parallel: 16 cores = 128 ALUs = 16 simultaneous instruction streams.
24-27. But what about branches? Consider a per-fragment branch inside the shared instruction stream:

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

Across ALUs 1-8, some lanes evaluate the condition as T and others as F (e.g., T T T F F F F F). Over the following clocks the hardware executes both sides of the branch, masking off the lanes that did not take the current side, and only then do all lanes resume the unconditional code together. Not all ALUs do useful work! Worst case: 1/8 performance.
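The same hazard shows up directly in CUDA. A minimal sketch (the kernel and variable names are made up for illustration): when threads of one warp evaluate the condition differently, the hardware serializes the two paths with inactive lanes masked off, so heavily divergent warps approach the worst case above.

// Illustrative CUDA kernel with a data-dependent branch inside a warp.
// Divergent lanes cause both sides of the if/else to be executed in sequence.
__global__ void shadeDivergent(const float* x, float* refl,
                               float Ks, float Ka, float expo, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi = x[i];
    if (xi > 0.0f) {                 // lanes where xi > 0 execute this path first...
        float y = powf(xi, expo) * Ks;
        refl[i] = y + Ka;
    } else {                         // ...then the remaining lanes execute this one
        refl[i] = Ka;
    }
}

Divergence costs nothing when the condition is uniform across a warp, which is why grouping work so that neighboring threads take the same path is a common optimization.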
32. In practice: NVIDIA GeForce ("SIMT" warps) and ATI Radeon architectures have 16 to 64 threads share an instruction stream.
33. Stalls! Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency is hundreds to thousands of cycles, and we have removed the fancy caches and logic that help avoid stalls.
34. But we have LOTS of independent threads. Idea #3: interleave processing of many threads on a single core to avoid stalls caused by high-latency operations.
35. Hiding stalls: while one group of threads (1-8) waits on memory, the core's Fetch/Decode unit and eight ALUs are free to work on something else.
36. Hiding stalls: the core interleaves four groups (threads 1-8, 9-16, 17-24, 25-32), switching to the next runnable group whenever the current one stalls.
40. Throughput! Each group starts, stalls on a long-latency operation, and later becomes runnable again; by switching among groups 1-4 the core trades a longer run time for any one group for maximum throughput across many groups.
41. Storing contexts: the core keeps a fixed pool of on-chip context storage (32 KB here) alongside the Fetch/Decode unit and the eight ALUs, partitioned among thread groups.
42. Twenty small contexts: maximal latency-hiding ability.
43. Twelve medium contexts.
44. Four large contexts: low latency-hiding ability.
45. GPU block diagram key: one symbol for a single "physical" instruction stream fetch/decode unit (functional unit control); one for a SIMD programmable functional unit (FU) whose control is shared with other functional units and which may contain multiple 32-bit "ALUs"; one each for a 32-bit mul-add unit, a 32-bit multiply unit, execution context storage, and a fixed-function unit.
46. Example: NVIDIA GeForce GTX 280. NVIDIA-speak: 240 stream processors, "SIMT execution" (automatic HW-managed sharing of the instruction stream). Generic speak: 30 processing cores, 8 SIMD functional units per core, 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock). Best case: 240 mul-adds + 240 muls per clock at the ~1.3 GHz shader clock, so 30 * 8 * (2 + 1) * 1.296 GHz ≈ 933 GFLOPS (936 with the clock rounded to 1.3 GHz). Mapping data-parallelism to the chip: the instruction stream is shared across 32 threads, with 8 threads running on the 8 SIMD functional units in one clock.
50. Summary: three key ideas. (1) Use many "slimmed down" cores running in parallel. (2) Pack cores full of ALUs by sharing an instruction stream across groups of threads; Option 1: explicit SIMD vector instructions, Option 2: implicit sharing managed by hardware. (3) Avoid latency stalls by interleaving execution of many groups of threads: when one group stalls, work on another group.
53. Data parallelism: run a single kernel over many elements; each element is independently updated and the same operation is applied to each. This fine-grain parallelism (many lightweight threads, cheap context switches) maps well to an ALU-heavy architecture: the GPU. (Figure: one kernel applied to data elements P1 ... Pn.)
54. GPU-friendly problems: data-parallel processing; high arithmetic intensity (keep the GPU busy all the time, so computation offsets memory latency); coherent data access (access large chunks of contiguous memory, exploit fast on-chip shared memory).
57. GPU programming languages. Using graphics APIs: GLSL, Cg, HLSL. Computing-specific APIs: DX 11 Compute Shaders, NVIDIA CUDA, OpenCL.
58. NVIDIA CUDA: a C-extension programming language with no graphics API and support for debugging tools. Extensions / API: function type qualifiers (__global__, __device__, __host__), variable type qualifiers (__shared__, __constant__), and low-level functions (cudaMalloc(), cudaFree(), cudaMemcpy(), ..., __syncthreads(), atomicAdd(), ...). Program types: the device program (kernel) runs on the GPU; the host program runs on the CPU and calls device programs. See the sketch below for the qualifiers in context.
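A minimal sketch showing the qualifiers and low-level functions together (the kernel, its buffers, and the 256-thread block size are assumptions made for illustration):

__constant__ float scale;                       // constant memory, written by the host via cudaMemcpyToSymbol()

__device__ float square(float v) { return v * v; }   // callable from device code only

__global__ void scaleAndSquare(const float* in, float* out, int n)   // kernel: launched from the host
{
    __shared__ float tile[256];                 // shared memory, visible to one block (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] * scale : 0.0f;
    __syncthreads();                            // barrier for the threads of this block
    if (i < n) out[i] = square(tile[threadIdx.x]);
}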
59. CUDA programming model. A kernel is a GPU program that runs on a thread grid. Thread hierarchy: a grid is a set of blocks, a block is a set of threads, and grid size * block size = total number of threads. (Figure: a grid of blocks 1 ... n, each block containing many threads running the kernel.) A launch-configuration sketch follows below.
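A small host-side sketch of the hierarchy (the data size N, block size, and kernel name myKernel are illustrative assumptions):

int N = 1 << 20;                               // number of elements to process
dim3 block(256);                               // threads per block
dim3 grid((N + block.x - 1) / block.x);        // blocks per grid, rounded up
myKernel<<<grid, block>>>(d_data, N);          // grid size * block size >= N threads in total

// inside the kernel, each thread computes its global index from the hierarchy:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;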
60. CUDA memory structure. Memory hierarchy: PC memory (DRAM) is off-card; GPU global memory (DRAM) is off-chip but on-card; shared memory, registers, and cache are on-chip. The host can read/write global memory, and threads within a block communicate through shared memory. (The block diagram labels the three levels with relative access costs of roughly 1, 200, and 4000 for on-chip shared memory, GPU global memory, and PC memory.)
61. Synchronization. Threads in the same block can communicate through shared memory; there is no HW global synchronization function yet. __syncthreads() is a barrier for threads within the current block only; __threadfence() flushes global memory writes to make them visible to all threads.
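A typical pattern that depends on this block-level barrier, sketched below: a per-block sum in shared memory. The 256-thread, power-of-two block size is an assumption, and combining the per-block results still needs a second kernel or atomics, since there is no global barrier.

__global__ void blockSum(const float* in, float* blockSums, int n)
{
    __shared__ float buf[256];                  // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // all loads finish before anyone reads a neighbor

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                        // barrier outside the if, so every thread reaches it
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = buf[0];         // one partial sum per block
}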
62. Example: CPU vector addition
// Pair-wise addition of vector elements
// CPU version : serial add
void vectorAdd(float* iA, float* iB, float* oC, int num)
{
    for (int i = 0; i < num; i++)
    {
        oC[i] = iA[i] + iB[i];
    }
}
63. Example: CUDA vector addition
// Pair-wise addition of vector elements
// CUDA version : one thread per addition
__global__ void vectorAdd(float* iA, float* iB, float* oC)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}
64. Example: CUDA host code
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// ... initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
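The slide stops at the launch, which also assumes N is a multiple of 256. The remaining steps would look roughly like this (a sketch; h_C is a host buffer assumed to be allocated like h_A and h_B):

// copy the result back to the host
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// release device and host memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);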
65. OpenCL (Open Computing Language): the first industry-standard computing language. Based on the C language and platform independent (NVIDIA, ATI, Intel, ...). Data- and task-parallel compute model that can use all computational resources in the system (CPU, GPU, ...). A work-item is the same as a thread / fragment / etc.; a work-group is a group of work-items. Work-items in the same work-group can communicate, and multiple work-groups execute in parallel.
66. OpenCL program structure. Host program (CPU): platform layer (query compute devices, create a context) and runtime (create memory objects, compile and create kernel program objects, issue commands such as kernel launches to the command-queue, synchronize commands, clean up OpenCL resources). Kernel (CPU, GPU): C-like code with some extensions that runs on the compute device.
67. CUDA vs. OpenCL comparison. Conceptually almost identical: work-item == thread, work-group == block; similar memory model (global, local, shared memory); both have a kernel and a host program. CUDA is highly optimized for NVIDIA GPUs only, while OpenCL can be used widely across GPUs and CPUs.
68. Implementation status of OpenCL. Specification 1.0 released by Khronos. NVIDIA released a Beta 1.2 driver and SDK, available to registered GPU computing developers. Apple will include OpenCL in Mac OS X Snow Leopard (Q3 2009), targeting NVIDIA and ATI GPUs and Intel CPUs for Mac. More companies will join.
69. GPU optimization tips: configuration. Identify the bottleneck: compute- or bandwidth-bound (use the profiler), and focus on the most expensive but parallelizable parts (Amdahl's law, stated below). Maximize parallel execution: use large inputs (many threads) and avoid divergent execution. Use limited resources efficiently: minimize shared memory and register use.
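For reference, Amdahl's law makes the "focus on the parallelizable parts" advice quantitative: if a fraction p of the runtime is parallelizable and that part is sped up by a factor s, the overall speedup is

    speedup = 1 / ((1 - p) + p / s)

so even with an arbitrarily fast GPU (s -> infinity), accelerating a kernel that covers 90% of the runtime caps the whole application at 10x.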
70. GPU optimization tips: memory. Memory access is the most important optimization. Minimize device-to-host memory overhead and overlap kernels with memory copies (asynchronous copy); a stream-based sketch follows below. Avoid shared memory bank conflicts and use coalesced global memory access. Texture or constant memory can help, since it is cached. (The memory-hierarchy diagram from slide 60 is repeated here.)
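A sketch of the asynchronous-copy tip using the earlier vectorAdd kernel: the data is split in half and each half gets its own stream, so the copy for one half can overlap the kernel for the other. The two-way split is an assumption, N is assumed to be an even multiple of 256, and the host buffers must be allocated with cudaMallocHost (pinned memory) for the copies to be truly asynchronous.

cudaStream_t stream[2];
cudaStreamCreate(&stream[0]);
cudaStreamCreate(&stream[1]);

int half = N / 2;
for (int s = 0; s < 2; s++) {
    int off = s * half;
    // work issued to the same stream runs in order; work in different streams may overlap
    cudaMemcpyAsync(d_A + off, h_A + off, half * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);
    cudaMemcpyAsync(d_B + off, h_B + off, half * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);
    vectorAdd<<< half / 256, 256, 0, stream[s] >>>(d_A + off, d_B + off, d_C + off);
    cudaMemcpyAsync(h_C + off, d_C + off, half * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[s]);
}
cudaStreamSynchronize(stream[0]);
cudaStreamSynchronize(stream[1]);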
71. GPU optimization tips: instructions. Use less expensive operators (division: 32 cycles, multiplication: 4 cycles), e.g., multiply by 0.5 instead of dividing by 2.0. Atomic operators are expensive and imply possible race conditions. Double precision is much slower than float. Use the less accurate fast-math intrinsics when possible: __sinf(), __expf(), __powf(). Save unnecessary instructions with loop unrolling. Small examples follow below.
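Small illustrative fragments of these tips (device-code snippets; the variable names and the unrolled loop are made up, and __sinf()/__expf() are the CUDA fast-math intrinsics):

// cheaper operator: multiply by the reciprocal instead of dividing
y = x * 0.5f;                 // instead of y = x / 2.0f;

// faster but less accurate intrinsics, when the precision loss is acceptable
s = __sinf(theta);            // instead of sinf(theta)
e = __expf(t);                // instead of expf(t)

// save loop overhead by letting the compiler unroll a short, fixed-size loop
#pragma unroll
for (int k = 0; k < 4; k++)
    acc += w[k] * v[k];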
73. ITK image filters implemented using CUDA. Convolution filters: mean filter, Gaussian filter, derivative filter, Hessian of Gaussian filter. Statistical filter: median filter. PDE-based filter: anisotropic diffusion filter.
74. CUDA ITK. CUDA code is integrated into ITK transparently to ITK users: no need to modify existing code that uses the ITK library. The entry point, GenerateData() or ThreadedGenerateData(), checks the environment variable ITK_CUDA: if ITK_CUDA == 0, execute the original ITK code; if ITK_CUDA == 1, execute the CUDA code.
76. Naïve C/CUDA implementation: read from the input image whenever it is needed.
int xdim, ydim;   // size of input image
float *in, *out;  // input/output image of size xdim*ydim
float w[][];      // convolution kernel of size n*m

for (x = 0; x < xdim; x++) {
    for (y = 0; y < ydim; y++) {
        // compute convolution
        for (sx = x - n/2; sx <= x + n/2; sx++) {
            for (sy = y - m/2; sy <= y + m/2; sy++) {
                wx = sx - x + n/2;
                wy = sy - y + m/2;
                out[x][y] += w[wx][wy] * in[sx][sy];
            }
        }
    }
}
For the xdim*ydim output pixels, each one loads its n*m neighborhood from global memory, so every input pixel is fetched n*m times.
77. Improved CUDA convolution filter. For an n*m filter, each pixel is reused n*m times, so staging the block in shared memory saves n*m-1 global memory loads per pixel.
__global__ cudaConvolutionFilter2DKernel(in, out, w)
{
    // copy global to shared memory: load from global memory (slow), only once
    sharedmem[] = in[][];
    __syncthreads();

    // sum neighbor pixel values: n*m loads from shared memory (fast)
    float _sum = 0;
    for (uint j = threadIdx.y; j <= threadIdx.y + m; j++) {
        for (uint i = threadIdx.x; i <= threadIdx.x + n; i++) {
            wx = i - threadIdx.x;
            wy = j - threadIdx.y;
            _sum += w[wx][wy] * sharedmem[j * sharedmemdim.x + i];
        }
    }
}
78. CUDA Gaussian filter: apply a 1D convolution filter along each axis, using temporary buffers in a ping-pong fashion.
// temp[0], temp[1] : temporary buffers to store intermediate results
void cudaDiscreteGaussianImageFilter(in, out, stddev)
{
    // create Gaussian weight
    w = ComputeGaussKernel(stddev);
    temp[0] = in;

    // call the 1D convolution CUDA kernel with the Gaussian weight, once per axis
    dim3 G, B;
    for (i = 0; i < dimension; i++) {
        cudaConvolutionFilter1DKernel<<<G, B>>>(temp[i%2], temp[(i+1)%2], w);
    }
    out = temp[i%2];
}
79. Median filter (Viola et al. [VIS 03]): find the median by bisection of histogram bins, taking log2(#bins) iterations; for an 8-bit pixel, log2(256) = 8 iterations. (Figure: example pixel block 1 4 3 1 8 2 1 0 and the intensity histogram over 0-7 being repeatedly halved.) Copy the current block from global to shared memory, then:
min = 0; max = 255;
pivot = (min + max) / 2.0f;
for (i = 0; i < 8; i++) {
    count = 0;
    for (j = 0; j < kernelsize; j++) {
        if (kernel[j] > pivot) count++;
    }
    if (count < kernelsize/2) max = floor(pivot);
    else min = ceil(pivot);
    pivot = (min + max) / 2.0f;
}
return floor(pivot);
80. Perona & Malik anisotropic diffusion Nonlinear diffusion Adaptive smoothing based on magnitude of gradient Preserves edges (high gradient) Numerical solution Euler explicit integration (iterative method) Finite difference for derivative computation 76 Input Image Linear diffusion P & M diffusion
82. CUDA ITK source code is available at http://sourceforge.net/projects/cudaitk/
83. CUDA ITK future work: an ITK GPU image class to reduce CPU-to-GPU memory I/O and support pipelining; a native interface for GPU code (similar to ThreadedGenerateData(), but for GPU threads); a numerical library (vnl); out-of-GPU-core / GPU-cluster processing of large images (10~100 terabytes); a platform-independent GPU implementation, for which OpenCL could be a solution.
84. Conclusions. GPU computing delivers high performance, and many scientific computing problems are parallelizable. There is more consistency/stability in HW/SW: the main GPU architecture is mature, an industry-wide programming standard now exists (OpenCL), and better support/tools are available (C-based language, compiler, and debugger). Issues: not every problem is suitable for GPUs, re-engineering of algorithms/software is required, and the future performance growth of GPU hardware is unclear (e.g., Intel's Larrabee).
85. thrust: a CUDA library of data-parallel algorithms & data structures with an interface similar to the C++ Standard Template Library. C++ template metaprogramming automatically chooses the fastest code path at compile time.
86. thrust::sort
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;

    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());
    return 0;
}
http://thrust.googlecode.com
Editor's Notes
Fluid flow, level set segmentation, DTI image
One of the major debates you’ll see in graphics in the coming years is whether the scheduling and work-distribution logic should be provided as highly optimized hardware or implemented as a software program on the programmable cores.
Pack the core full of ALUs. We are not going to increase our core’s ability to decode instructions: we will decode 1 instruction and execute it on all 8 ALUs.
How can we make use of all these ALUs?
Just have the shader program work on 8 fragments at a time. Replace the scalar operations with 8-wide vector ones.
So the program processes 8 fragments at a time, and all the work for each fragment is carried out by 1 of the 8 ALUs. Notice that I’ve also replicated part of the context to store execution state for the 8 fragments; for example, I’d replicate the registers.
We continue this process, moving to a new group each time we encounter a stall. If we have enough groups there will always be some work to do, and the processing core’s ALUs never go idle.
Described adding contexts. In reality there’s a fixed pool of on-chip storage that is partitioned to hold contexts. Instead of using on-chip storage as a traditional data cache, GPUs choose to use this store to hold contexts.
Shading performance relies on large-scale interleaving. Number of interleaved groups per core: ~20-30. These could be separate hardware-managed contexts or software-managed using other techniques.
Fewer contexts fit on chip; the chip can hide less latency; higher likelihood of stalls.
Lose performance when shaders use a lot of registers.
128 simultaneous threads on each core
Drive these ALUs using explicit SIMD instructions or implicitly via HW-determined sharing.