General Purpose Computing using Graphics Hardware Hanspeter Pfister Harvard University
Acknowledgements Won-Ki Jeong, Harvard University Kayvon Fatahalian, Stanford University 2
GPU (Graphics Processing Unit) PC hardware dedicated for 3D graphics Massively parallel SIMD processor Performance pushed by game industry 3 NVIDIA SLI System
GPGPU General Purpose computation on the GPU Started in computer graphics research community Mapping computational problems to graphics rendering pipeline 4 Image Courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong
Why GPU for computing? GPU is fast Massively parallel CPU : ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core) GPU : ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200) High memory bandwidth Programmable NVIDIA CUDA, DirectX Compute Shader, OpenCL High precision floating point support 64-bit floating point (IEEE 754) Inexpensive desktop supercomputer NVIDIA Tesla C1060 : ~1 TFLOPS @ $1000 5
FLOPS 6 Image Courtesy NVIDIA
Memory Bandwidth 7 Image Courtesy NVIDIA
GPGPU Biomedical Examples 8 Level-Set Segmentation (Lefohn et al.) CT/MRI Reconstruction (Sumanaweera et al.) Image Registration (Strzodka et al.) EM Image Processing (Jeong et al.)
Overview GPU Architecture Overview GPU Programming Overview Programming Model NVIDIA CUDA OpenCL Application Example CUDA ITK 9
1. GPU Architecture Overview Kayvon Fatahalian Stanford University 10
What’s in a GPU? 11 Input Assembly Rasterizer Output Blend Video Decode Tex Compute Core Compute Core Compute Core Compute Core Compute Core Compute Core Compute Core Compute Core Tex Tex HW or SW? Work Distributor Tex Heterogeneous chip multi-processor (highly tuned for graphics)
CPU-“style” cores 12 Fetch/ Decode Out-of-order control logic Fancy branch predictor ALU (Execute) Memory pre-fetcher Execution Context Data Cache (A big one)
Slimming down 13 Fetch/ Decode Idea #1:  Remove components that help a single instruction stream run fast  ALU (Execute) Execution Context
Two cores   (two threads in parallel) 14 thread1 thread 2 Fetch/ Decode Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 mul  r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul  o0, r0, r3 mul  o1, r1, r3 mul  o2, r2, r3 mov  o3, l(1.0) <diffuseShader>: sample r0, v4, t0, s0 mul  r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul  o0, r0, r3 mul  o1, r1, r3 mul  o2, r2, r3 mov  o3, l(1.0) ALU (Execute) ALU (Execute) Execution Context Execution Context
Four cores   (four threads in parallel) 15 Fetch/ Decode Fetch/ Decode Fetch/ Decode Fetch/ Decode ALU (Execute) ALU (Execute) ALU (Execute) ALU (Execute) Execution Context Execution Context Execution Context Execution Context
Sixteen cores   (sixteen threads in parallel) 16 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU 16 cores = 16 simultaneous instruction streams
Instruction stream sharing 17 But… many threads should be able to share an instruction stream!  <diffuseShader>: sample r0, v4, t0, s0 mul  r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul  o0, r0, r3 mul  o1, r1, r3 mul  o2, r2, r3 mov  o3, l(1.0)
Recall: simple processing core 18 Fetch/ Decode ALU (Execute) Execution Context
Add ALUs 19 Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs Fetch/ Decode ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7 ALU 8 Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx SIMD processing Shared Ctx Data
Modifying the code 20 Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 mul  r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul  o0, r0, r3 mul  o1, r1, r3 mul  o2, r2, r3 mov  o3, l(1.0) ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7 ALU 8 Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx Original compiled shader: Shared Ctx Data  Processes one thread using scalar ops on scalar registers
Modifying the code 21 Fetch/ Decode <VEC8_diffuseShader>: VEC8_sample vec_r0, vec_v4, t0, vec_s0 VEC8_mul  vec_r3, vec_v0, cb0[0] VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3 VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3 VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0) VEC8_mul  vec_o0, vec_r0, vec_r3 VEC8_mul  vec_o1, vec_r1, vec_r3 VEC8_mul  vec_o2, vec_r2, vec_r3 VEC8_mov  vec_o3, l(1.0) ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7 ALU 8 Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx New compiled shader: Shared Ctx Data  Processes 8 threads using vector ops on vector registers
Modifying the code 22 2 3 1 4 6 7 5 8 Fetch/ Decode <VEC8_diffuseShader>: VEC8_sample vec_r0, vec_v4, t0, vec_s0 VEC8_mul  vec_r3, vec_v0, cb0[0] VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3 VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3 VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0) VEC8_mul  vec_o0, vec_r0, vec_r3 VEC8_mul  vec_o1, vec_r1, vec_r3 VEC8_mul  vec_o2, vec_r2, vec_r3 VEC8_mov  vec_o3, l(1.0) ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7 ALU 8 Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx Shared Ctx Data
128 threads in parallel  23 16 cores = 128 ALUs = 16 simultaneous instruction streams
But what about branches? 24 2 ...  1 ... 8 Time (clocks) ALU 1 ALU 2 . . .  ALU 8 . . .  <unconditional shader code> if (x> 0) { y = pow(x, exp); y *= Ks; refl = y + Ka;   } else { x = 0;  refl = Ka;   } <resume unconditional shader code>
But what about branches? 25 2 ...  1 ... 8 Time (clocks) ALU 1 ALU 2 . . .  ALU 8 . . .  <unconditional shader code> T T T F F F F F if (x> 0) { y = pow(x, exp); y *= Ks; refl = y + Ka;   } else { x = 0;  refl = Ka;   } <resume unconditional shader code>
But what about branches? 26 2 ...  1 ... 8 Time (clocks) ALU 1 ALU 2 . . .  ALU 8 . . .  <unconditional shader code> T T T F F F F F if (x> 0) { y = pow(x, exp); y *= Ks; refl = y + Ka;   } else { x = 0;  refl = Ka;   } <resume unconditional shader code> Not all ALUs do useful work!  Worst case: 1/8 performance
But what about branches? 27 2 ...  1 ... 8 Time (clocks) ALU 1 ALU 2 . . .  ALU 8 . . .  <unconditional shader code> T T T F F F F F if (x> 0) { y = pow(x, exp); y *= Ks; refl = y + Ka;   } else { x = 0;  refl = Ka;   } <resume unconditional shader code>
Clarification 28 SIMD processing does not imply SIMD instructions Option 1: Explicit SIMD vector instructions
Intel/AMD x86 SSE, Intel Larrabee
Option 2:  Scalar instructions, implicit HW vectorization
HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures. In practice: 16 to 64 threads share an instruction stream
Stalls! Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency = 100’s to 1000’s of cycles We’ve removed the fancy caches and logic that helps avoid stalls. 29
But we have  LOTS of independent threads. Idea #3: Interleave processing of many threads on a single core to avoid stalls caused by high latency operations. 30
Hiding stalls 31 Time (clocks) Thread1 … 8 ALU  ALU  ALU  ALU  ALU  ALU  ALU  ALU  Fetch/ Decode Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx SharedCtx Data
Hiding stalls 32 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 ALU  ALU  ALU  ALU  ALU  ALU  ALU  ALU  Fetch/ Decode 1 2 3 4 1 2 3 4
Hiding stalls 33 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 Stall Runnable 1 2 3 4
Hiding stalls 34 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 Stall Runnable 1 2 3 4
Hiding stalls 35 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 Stall Stall Stall Stall Runnable Runnable 1 2 3 4 Runnable
Throughput! 36 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 Start Start Stall Stall Stall Stall Start Runnable Runnable Done! Runnable Done! Runnable 2 3 4 1 Increase run time of one group to maximize throughput of many groups Done! Done!
Storing contexts 37 Fetch/ Decode ALU  ALU  ALU  ALU  ALU  ALU  ALU  ALU  Pool of context storage 32KB
Twenty small contexts 38 (maximal latency hiding ability) Fetch/ Decode ALU  ALU  ALU  ALU  ALU  ALU  ALU  ALU  10 1 2 3 4 5 6 7 8 9 11 15 12 13 14 16 20 17 18 19
Twelve medium contexts 39 Fetch/ Decode ALU  ALU  ALU  ALU  ALU  ALU  ALU  ALU  1 2 3 4 5 6 7 8 9 10 11 12
Four large contexts 40 (low latency hiding ability) Fetch/ Decode ALU  ALU  ALU  ALU  ALU  ALU  ALU  ALU  4 3 1 2
GPU block diagram key = single “physical” instruction stream fetch/decode     (functional unit control) = SIMD programmable functional unit (FU), control shared with other    functional units.  This functional unit may contain multiple 32-bit “ALUs” = 32-bit mul-add unit = 32-bit multiply unit = execution context storage  = fixed function unit 41
Example: NVIDIA GeForce GTX 280 NVIDIA-speak: 240 stream processors “SIMT execution” (automatic HW-managed sharing of instruction stream) Generic speak: 30 processing cores 8 SIMD functional units per core 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock) Best case: 240 mul-adds + 240 muls per clock 1.3 GHz clock 30 * 8 * (2 + 1) * 1.3 = 933 GFLOPS Mapping data-parallelism to chip: Instruction stream shared across 32 threads 8 threads run on 8 SIMD functional units in one clock 42
GTX 280 core 43 Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … Zcull/Clip/Rast Output Blend Work Distributor
Example: ATI Radeon 4870 AMD/ATI-speak: 800 stream processors Automatic HW-managed sharing of scalar instruction stream (like “SIMT”) Generic speak: 10 processing cores 16 SIMD functional units per core 5 mul-adds per functional unit (5 * 2 =10 flops/clock) Best case: 800 mul-adds per clock 750 MHz clock 10 * 16 * 5 * 2 * .75 = 1.2 TFLOPS Mapping data-parallelism to chip: Instruction stream shared across 64 threads 16 threads run on 16 SIMD functional units in one clock 44
ATI Radeon 4870 core … … … … … … … … … … Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Zcull/Clip/Rast Output Blend Work Distributor 45
Summary: three key ideas Use many “slimmed down cores” to run in parallel Pack cores full of ALUs (by sharing instruction stream across groups of threads) Option 1: Explicit SIMD vector instructions Option 2: Implicit sharing managed by hardware Avoid latency stalls by interleaving execution of many groups of threads When one group stalls, work on another group 46
2. GPU Programming Models Programming Model NVIDIA CUDA OpenCL 47
Task parallelism Distribute the tasks across processors based on dependency Coarse-grain parallelism 48 Task 1 Task 1 Time Task 2 Task 2 Task 3 Task 3 P1 Task 4 Task 4 P2 Task 5 Task 5 Task 6 Task 6 P3 Task 7 Task 7 Task 8 Task 8 Task 9 Task 9 Task assignment across 3 processors Task dependency graph
Data parallelism Run a single kernel over many elements Each element is independently updated Same operation is applied on each element Fine-grain parallelism Many lightweight threads, easy to switch context Maps well to ALU heavy architecture : GPU 49 Kernel ……. Data P1 P2 P3 P4 P5 Pn …….
GPU-friendly Problems Data-parallel processing High arithmetic intensity Keep GPU busy all the time Computation offsets memory latency Coherent data access Access large chunk of contiguous memory Exploit fast on-chip shared memory 50
The Algorithm Matters
Parallelizable (Jacobi-style: each update reads only the previous iteration's values):
for(int i=0; i<num; i++)
{
    v_next[i] = (v_prev[i-1] + v_prev[i+1])/2.0;
}
Not parallelizable as written (Gauss-Seidel-style: each update depends on an element just written in the same pass):
for(int i=0; i<num; i++)
{
    v[i] = (v[i-1] + v[i+1])/2.0;
}
51
Example: Reduction
Serial version (O(N))
for(int i=1; i<N; i++)
{
    v[0] += v[i];
}
Parallel version (O(log N))
width = N/2;
while(width >= 1)
{
    for(int i=0; i<width; i++)
    {
        v[i] += v[i+width]; // computed in parallel
    }
    width /= 2;
}
52
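The parallel pseudocode above maps onto CUDA almost directly; below is a minimal sketch of one reduction pass. The kernel name, the block size of 256, and the per-block partial-sum scheme are assumptions for the example, not taken from the slides.

// Hypothetical CUDA version of the parallel reduction above: each block reduces
// 256 elements in shared memory and writes one partial sum; the host (or a
// second launch) reduces the partial sums. Assumes N is a multiple of 256.
__global__ void reduceSum(const float* in, float* partial)
{
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[tid] = in[idx];
    __syncthreads();

    // same "width /= 2" bisection as the pseudocode, but within one block
    // and synchronized at every step
    for (int width = blockDim.x / 2; width >= 1; width /= 2)
    {
        if (tid < width)
            tile[tid] += tile[tid + width];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = tile[0];
}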
GPU programming languages Using graphics APIs GLSL, Cg, HLSL Computing-specific APIs DX 11 Compute Shaders NVIDIA CUDA OpenCL 53
NVIDIA CUDA C-extension programming language No graphics API Supports debugging tools Extensions / API Function type : __global__, __device__, __host__ Variable type : __shared__, __constant__ Low-level functions cudaMalloc(), cudaFree(), cudaMemcpy(),… __syncthreads(), atomicAdd(),… Program types Device program (kernel) : runs on the GPU Host program : runs on the CPU to call device programs 54
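A small illustrative fragment (not from the slides) showing how the qualifiers and API calls listed above fit together; the kernel and symbol names are made up for the example.

__constant__ float c_scale;                               // read-only constant memory, set by the host
__device__ float square(float x) { return x * x; }        // device-only helper function
__global__ void scaleKernel(float* data)                  // kernel, launched from the host
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = c_scale * square(data[idx]);
}
// host side:
//   cudaMemcpyToSymbol(c_scale, &h_scale, sizeof(float));
//   scaleKernel<<<N/256, 256>>>(d_data);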
CUDA Programming Model Kernel GPU program that runs on a thread grid Thread hierarchy Grid : a set of blocks Block : a set of threads Grid size * block size = total # of threads 55 Grid Kernel Block 2 Block n Block 1 <diffuseShader>: sample r0, v4, t0, s0 mul  r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul  o0, r0, r3 mul  o1, r1, r3 mul  o2, r2, r3 mov  o3, l(1.0) . . . . . Threads Threads Threads
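As a concrete illustration of the hierarchy (launch configuration and kernel name are assumed): 100 blocks of 256 threads give 25,600 threads in the grid, and each thread derives a unique global index from its block and thread IDs.

dim3 block(256);                  // threads per block
dim3 grid(100);                   // blocks per grid -> 100 * 256 = 25,600 threads total
myKernel<<<grid, block>>>(d_data);
// inside the kernel:
//   int globalId = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 25,599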
CUDA Memory Structure 56 Graphics card GPU Core PC Memory (DRAM) GPU Global Memory (DRAM) GPU Shared Memory (On-Chip) ALUs 1 200 4000 Memory hierarchy PC memory : off-card GPU Global : off-chip / on-card Shared/register/cache : on-chip The host can read/write global memory Each thread communicates using shared memory
Synchronization Threads in the same block can communicate using shared memory No HW global synchronization function yet __syncthreads() Barrier for threads only within the current block __threadfence() Flushes global memory writes to make them visible to all threads 57
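A minimal sketch (assumed names, block size of 256) of why the barrier matters: every thread writes one element of a shared tile, and __syncthreads() guarantees the whole tile is filled before any thread reads its neighbor's entry.

__global__ void shiftLeft(const float* in, float* out)
{
    __shared__ float tile[256 + 1];                 // one extra slot for the block boundary
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];
    if (threadIdx.x == blockDim.x - 1)              // last thread also loads the halo element
        tile[blockDim.x] = in[idx + 1];             // (ignores the out-of-range read at the very end)
    __syncthreads();                                // barrier: tile fully populated for this block
    out[idx] = tile[threadIdx.x + 1];               // now safe to read a neighbor's value
}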
Example: CPU Vector Addition 58
// Pair-wise addition of vector elements
// CPU version : serial add
void vectorAdd(float* iA, float* iB, float* oC, int num)
{
    for(int i=0; i<num; i++)
    {
        oC[i] = iA[i] + iB[i];
    }
}
Example: CUDA Vector Addition 59
// Pair-wise addition of vector elements
// CUDA version : one thread per addition
__global__ void vectorAdd(float* iA, float* iB, float* oC)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}
Example: CUDA Host Code 60
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// ... initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
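The slide stops at the kernel launch; a typical continuation (not shown on the slide) copies the result back to the host and releases the device memory.

// copy result back to host and clean up
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);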
OpenCL (Open Computing Language) First industry-standard computing language Based on C language Platform independent NVIDIA, ATI, Intel, …. Data and task parallel compute model Use all computational resources in the system CPU, GPU, … Work-item : same as thread / fragment / etc. Work-group : a group of work-items Work-items in the same work-group can communicate Execute multiple work-groups in parallel 61
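For comparison with the CUDA vectorAdd kernel above, a minimal OpenCL C work-item kernel for the same pair-wise addition; this is a sketch, and the kernel name and argument order are illustrative.

__kernel void vectorAdd(__global const float* iA,
                        __global const float* iB,
                        __global float* oC)
{
    int idx = get_global_id(0);   // work-item index, analogous to CUDA's global thread index
    oC[idx] = iA[idx] + iB[idx];
}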
OpenCL program structure Host program (CPU) Platform layer Query compute devices Create context Runtime Create memory objects Compile and create kernel program objects Issue commands (i.e., kernel launching) to command-queue Synchronization of commands Clean up OpenCL resources Kernel (CPU, GPU) C-like code with some extensions Runs on compute device 62
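A condensed host-side sketch that follows the structure above, using standard OpenCL 1.0 calls; error handling is omitted, and h_A, h_B, h_C, N, and kernelSrc (a string holding the kernel source) are assumed to exist.

// #include <CL/cl.h>
cl_platform_id platform;  cl_device_id device;  cl_int err;
clGetPlatformIDs(1, &platform, NULL);                                   // platform layer
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);         // query compute device
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);   // create context
cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

cl_mem d_A = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            N * sizeof(float), h_A, &err);              // memory objects
cl_mem d_B = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            N * sizeof(float), h_B, &err);
cl_mem d_C = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof(float), NULL, &err);

cl_program prog = clCreateProgramWithSource(ctx, 1, &kernelSrc, NULL, &err);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);                     // compile kernel program
cl_kernel k = clCreateKernel(prog, "vectorAdd", &err);

clSetKernelArg(k, 0, sizeof(cl_mem), &d_A);
clSetKernelArg(k, 1, sizeof(cl_mem), &d_B);
clSetKernelArg(k, 2, sizeof(cl_mem), &d_C);
size_t globalSize = N;                                                  // N work-items
clEnqueueNDRangeKernel(q, k, 1, NULL, &globalSize, NULL, 0, NULL, NULL);// launch
clEnqueueReadBuffer(q, d_C, CL_TRUE, 0, N * sizeof(float), h_C, 0, NULL, NULL);

clReleaseMemObject(d_A); clReleaseMemObject(d_B); clReleaseMemObject(d_C);
clReleaseKernel(k); clReleaseProgram(prog);
clReleaseCommandQueue(q); clReleaseContext(ctx);                        // clean up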
CUDA v.s. OpenCL comparison Conceptually almost identical Work-item == thread Work-group == block Similar memory model Global, local, shared memory Kernel, host program CUDA is highly optimized only for NVIDIA GPUs OpenCL can be widely used for any GPUs/CPUs 63
Implementation status of OpenCL Specification 1.0 released by Khronos NVIDIA released Beta 1.2 driver and SDK Available for registered GPU computing developers Apple will include in Mac OS X Snow Leopard Q3 2009 NVIDIA and ATI GPUs, Intel CPU for Mac More companies will join 64
GPU optimization tips: configuration Identify bottleneck Computing / bandwidth bound (use profiler) Focus on most expensive but parallelizable parts (Amdahl’s law) Maximize parallel execution Use large input (many threads) Avoid divergent execution Efficient use of limited resource Minimize shared memory / register use 65
GPU optimization tips: memory Memory access: the most important optimization Minimize device to host memory overhead Overlap kernel with memory copy (asynchronous copy) Avoid shared memory bank conflicts Coalesced global memory access Texture or constant memory can be helpful (cache) Graphics card GPU Core PC Memory (DRAM) GPU Global Memory (DRAM) GPU Shared Memory (On-Chip) ALUs 1 200 4000 66
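One way to realize the "overlap kernel with memory copy" tip is CUDA streams with cudaMemcpyAsync. The sketch below is illustrative: the chunk layout, stream count, and the process kernel are assumptions, and the host buffers are assumed to be page-locked (allocated with cudaMallocHost) so the copies can actually run asynchronously.

cudaStream_t s0, s1;
cudaStreamCreate(&s0);  cudaStreamCreate(&s1);
// work on two chunks in two streams: while one chunk's kernel runs,
// the other chunk's copy can proceed
cudaMemcpyAsync(d_in0, h_in0, chunkBytes, cudaMemcpyHostToDevice, s0);
process<<<grid, block, 0, s0>>>(d_in0, d_out0);
cudaMemcpyAsync(d_in1, h_in1, chunkBytes, cudaMemcpyHostToDevice, s1);
process<<<grid, block, 0, s1>>>(d_in1, d_out1);
cudaMemcpyAsync(h_out0, d_out0, chunkBytes, cudaMemcpyDeviceToHost, s0);
cudaMemcpyAsync(h_out1, d_out1, chunkBytes, cudaMemcpyDeviceToHost, s1);
cudaStreamSynchronize(s0);  cudaStreamSynchronize(s1);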
GPU optimization tips: instructions Use less expensive operators division: 32 cycles, multiplication: 4 cycles *0.5 instead of /2.0 Atomic operators are expensive Possible race condition Double precision is much slower than float Use less accurate floating point instructions when possible __sinf(), __expf(), __powf() Save unnecessary instructions Loop unrolling 67
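A few of these tips combined in one illustrative kernel fragment (names are assumed):

__global__ void tweak(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i] * 0.5f;      // multiply by 0.5f instead of dividing by 2.0
    v = __expf(-v * v);         // fast, less accurate intrinsic instead of expf()
    #pragma unroll              // let the compiler unroll the fixed-trip-count loop
    for (int k = 0; k < 4; k++)
        v = v * v;
    x[i] = v;
}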
3. Application Example CUDA ITK 68
ITK image filters implemented using CUDA Convolution filters Mean filter Gaussian filter Derivative filter Hessian of Gaussian filter Statistical filter Median filter PDE-based filter Anisotropic diffusion filter 69
CUDA ITK CUDA code is integrated into ITK Transparent to the ITK users No need to modify current code using ITK library Check environment variable ITK_CUDA Entry point GenerateData() or ThreadedGenerateData() If ITK_CUDA == 0 Execute original ITK code If ITK_CUDA == 1 Execute CUDA code 70
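A sketch of how the ITK_CUDA switch described above could look inside a filter's GenerateData(); this is illustrative pseudocode, not the actual CUDA ITK source, and the CudaGenerateData()/CPUGenerateData() names are hypothetical.

// requires <cstdlib> for getenv()/atoi()
void GenerateData()
{
    const char* flag = getenv("ITK_CUDA");
    if (flag != NULL && atoi(flag) == 1)
        this->CudaGenerateData();   // hypothetical CUDA code path
    else
        this->CPUGenerateData();    // hypothetical wrapper around the original ITK code
}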
Convolution filters 71
For a size-n filter, each pixel is reused n times
Non-separable filter (Anisotropic): reuse data using shared memory
Separable filter (Gaussian): N-dimensional convolution = N 1D convolutions
Naïve C/CUDA implementation: read from the input image whenever needed 72
int xdim, ydim;  // size of input image
float *in, *out; // input/output image of size xdim*ydim
float w[][];     // convolution kernel of size n*m
for(x=0; x<xdim; x++)                     // xdim*ydim pixels
{
    for(y=0; y<ydim; y++)
    {
        // compute convolution
        for(sx=x-n/2; sx<=x+n/2; sx++)    // n*m taps per pixel
        {
            for(sy=y-m/2; sy<=y+m/2; sy++)
            {
                wx = sx - x + n/2;
                wy = sy - y + m/2;
                out[x][y] += w[wx][wy]*in[sx][sy];   // load from global memory, n*m times
            }
        }
    }
}
Improved CUDA convolution filter 73
For a size n*m filter, each pixel is reused n*m times
Save n*m-1 global memory loads by using shared memory
__global__ cudaConvolutionFilter2DKernel(in, out, w)
{
    // copy global to shared memory
    sharedmem[] = in[][];       // load from global memory (slow), only once
    __syncthreads();

    // sum neighbor pixel values
    float _sum = 0;
    for(uint j=threadIdx.y; j<=threadIdx.y + m; j++)
    {
        for(uint i=threadIdx.x; i<=threadIdx.x + n; i++)
        {
            wx = i - threadIdx.x;
            wy = j - threadIdx.y;
            _sum += w[wx][wy]*sharedmem[j*sharedmemdim.x + i];   // load from shared memory (fast), n*m times
        }
    }
}
CUDA Gaussian filter 74
Apply 1D convolution filter along each axis
Use temporary buffers: ping-pong rendering
// temp[0], temp[1] : temporary buffers to store intermediate results
void cudaDiscreteGaussianImageFilter(in, out, stddev)
{
    // create Gaussian weight
    w = ComputeGaussKernel(stddev);
    temp[0] = in;
    // call the 1D convolution CUDA kernel with the Gaussian weight, once per axis
    dim3 G, B;
    for(i=0; i<dimension; i++)
    {
        cudaConvolutionFilter1DKernel<<<G,B>>>(temp[i%2], temp[(i+1)%2], w);
    }
    out = temp[i%2];
}
Median filter 75
Viola et al. [VIS 03]: find the median by bisection of histogram bins
log(# bins) iterations; 8-bit pixels: log(256) = 8 iterations
1. Copy the current block from global to shared memory
2.-4. Repeatedly halve the intensity range [min, max] around the pivot
min = 0; max = 255;
pivot = (min+max)/2.0f;
for(i=0; i<8; i++)
{
    count = 0;
    for(j=0; j<kernelsize; j++)
    {
        if(kernel[j] > pivot) count++;
    }
    if(count < kernelsize/2) max = floor(pivot);
    else min = ceil(pivot);
    pivot = (min + max)/2.0f;
}
return floor(pivot);
Perona & Malik anisotropic diffusion Nonlinear diffusion Adaptive smoothing based on magnitude of gradient Preserves edges (high gradient) Numerical solution Euler explicit integration (iterative method) Finite difference for derivative computation 76 Input Image Linear diffusion P & M diffusion
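One explicit Euler update step of Perona & Malik diffusion on a 2D image, as a hedged C sketch; the conductance function exp(-(g/K)^2), its parameter K, and the time step dt <= 0.25 are standard choices rather than values taken from the slides.

#include <math.h>

/* Conductance c(g) = exp(-(g/K)^2): small across strong edges,
   so edges (high gradient) are preserved while flat regions are smoothed. */
static float conductance(float g, float K) { return expf(-(g * g) / (K * K)); }

/* One explicit Euler step of Perona-Malik diffusion (interior pixels only). */
void pmStep(const float* in, float* out, int w, int h, float K, float dt /* <= 0.25 */)
{
    for (int y = 1; y < h - 1; y++)
    {
        for (int x = 1; x < w - 1; x++)
        {
            float c = in[y * w + x];
            /* finite differences to the four neighbors */
            float dN = in[(y - 1) * w + x] - c, dS = in[(y + 1) * w + x] - c;
            float dW = in[y * w + (x - 1)] - c, dE = in[y * w + (x + 1)] - c;
            out[y * w + x] = c + dt * (conductance(fabsf(dN), K) * dN +
                                       conductance(fabsf(dS), K) * dS +
                                       conductance(fabsf(dW), K) * dW +
                                       conductance(fabsf(dE), K) * dE);
        }
    }
}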
Performance Convolution filters Mean filter : ~140x Gaussian filter : ~60x Derivative filter Hessian of Gaussian filter Statistical filter Median filter : ~25x PDE-based filter Anisotropic diffusion filter : ~70x 77
CUDA ITK Source code available at http://sourceforge.net/projects/cudaitk/ 78

More Related Content

What's hot

SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMUSFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMULinaro
 
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモDNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモShinya Takamaeda-Y
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013Hajime Tazaki
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
Kernelvm 201312-dlmopen
Kernelvm 201312-dlmopenKernelvm 201312-dlmopen
Kernelvm 201312-dlmopenHajime Tazaki
 
VLANs in the Linux Kernel
VLANs in the Linux KernelVLANs in the Linux Kernel
VLANs in the Linux KernelKernel TLV
 
QEMU - Binary Translation
QEMU - Binary Translation QEMU - Binary Translation
QEMU - Binary Translation Jiann-Fuh Liaw
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
D itg-manual
D itg-manualD itg-manual
D itg-manualVeggax
 
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)Jyh-Miin Lin
 
An evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsAn evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsLinaro
 
Fun with Network Interfaces
Fun with Network InterfacesFun with Network Interfaces
Fun with Network InterfacesKernel TLV
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMLinaro
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
Masked Software Occlusion Culling
Masked Software Occlusion CullingMasked Software Occlusion Culling
Masked Software Occlusion CullingIntel® Software
 
Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleMarina Kolpakova
 

What's hot (20)

SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMUSFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
 
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモDNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
Linux Device Tree
Linux Device TreeLinux Device Tree
Linux Device Tree
 
Debug generic process
Debug generic processDebug generic process
Debug generic process
 
Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Kernelvm 201312-dlmopen
Kernelvm 201312-dlmopenKernelvm 201312-dlmopen
Kernelvm 201312-dlmopen
 
VLANs in the Linux Kernel
VLANs in the Linux KernelVLANs in the Linux Kernel
VLANs in the Linux Kernel
 
QEMU - Binary Translation
QEMU - Binary Translation QEMU - Binary Translation
QEMU - Binary Translation
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)
 
D itg-manual
D itg-manualD itg-manual
D itg-manual
 
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
An evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsAn evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loops
 
Fun with Network Interfaces
Fun with Network InterfacesFun with Network Interfaces
Fun with Network Interfaces
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVM
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Masked Software Occlusion Culling
Masked Software Occlusion CullingMasked Software Occlusion Culling
Masked Software Occlusion Culling
 
Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
 

Similar to General Purpose Computing using Graphics Hardware

GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra Umbra Software
 
NIOS II Processor.ppt
NIOS II Processor.pptNIOS II Processor.ppt
NIOS II Processor.pptAtef46
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]Aleksei Voitylov
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Jorisimec.archive
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSantosh Verma
 
Super scaling singleton inserts
Super scaling singleton insertsSuper scaling singleton inserts
Super scaling singleton insertsChris Adkin
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesDustin Franklin
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」Shinya Takamaeda-Y
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Andriy Berestovskyy
 
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsAkihiro Hayashi
 
23_Advanced_Processors controller system
23_Advanced_Processors controller system23_Advanced_Processors controller system
23_Advanced_Processors controller systemstellan7
 
Potapenko, vyukov forewarned is forearmed. a san and tsan
Potapenko, vyukov   forewarned is forearmed. a san and tsanPotapenko, vyukov   forewarned is forearmed. a san and tsan
Potapenko, vyukov forewarned is forearmed. a san and tsanDefconRussia
 
The n00bs guide to ovs dpdk
The n00bs guide to ovs dpdkThe n00bs guide to ovs dpdk
The n00bs guide to ovs dpdkmarkdgray
 

Similar to General Purpose Computing using Graphics Hardware (20)

GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
 
NIOS II Processor.ppt
NIOS II Processor.pptNIOS II Processor.ppt
NIOS II Processor.ppt
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
 
The Spectre of Meltdowns
The Spectre of MeltdownsThe Spectre of Meltdowns
The Spectre of Meltdowns
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 Architecture
 
Super scaling singleton inserts
Super scaling singleton insertsSuper scaling singleton inserts
Super scaling singleton inserts
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous Machines
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Arm architecture
Arm architectureArm architecture
Arm architecture
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
 
23_Advanced_Processors controller system
23_Advanced_Processors controller system23_Advanced_Processors controller system
23_Advanced_Processors controller system
 
Aes
AesAes
Aes
 
Code GPU with CUDA - SIMT
Code GPU with CUDA - SIMTCode GPU with CUDA - SIMT
Code GPU with CUDA - SIMT
 
Potapenko, vyukov forewarned is forearmed. a san and tsan
Potapenko, vyukov   forewarned is forearmed. a san and tsanPotapenko, vyukov   forewarned is forearmed. a san and tsan
Potapenko, vyukov forewarned is forearmed. a san and tsan
 
The n00bs guide to ovs dpdk
The n00bs guide to ovs dpdkThe n00bs guide to ovs dpdk
The n00bs guide to ovs dpdk
 

Recently uploaded

Call Girls Bangalore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Bangalore Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Bangalore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Bangalore Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Call Girls Gwalior Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Gwalior Just Call 8617370543 Top Class Call Girl Service AvailableCall Girls Gwalior Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Gwalior Just Call 8617370543 Top Class Call Girl Service AvailableDipal Arora
 
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...astropune
 
Top Rated Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...
Top Rated  Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...Top Rated  Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...
Top Rated Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...chandars293
 
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...Genuine Call Girls
 
Call Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel roomLucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel roomdiscovermytutordmt
 
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...parulsinha
 
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Dipal Arora
 
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...hotbabesbook
 
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Call Girls Gwalior Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Gwalior Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Gwalior Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Gwalior Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Call Girls Tirupati Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Tirupati Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Tirupati Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Tirupati Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...chandars293
 
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...jageshsingh5554
 
(Rocky) Jaipur Call Girl - 09521753030 Escorts Service 50% Off with Cash ON D...
(Rocky) Jaipur Call Girl - 09521753030 Escorts Service 50% Off with Cash ON D...(Rocky) Jaipur Call Girl - 09521753030 Escorts Service 50% Off with Cash ON D...
(Rocky) Jaipur Call Girl - 09521753030 Escorts Service 50% Off with Cash ON D...indiancallgirl4rent
 
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...perfect solution
 
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Dipal Arora
 

Recently uploaded (20)

Call Girls Bangalore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Bangalore Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Bangalore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Bangalore Just Call 9907093804 Top Class Call Girl Service Available
 
Call Girls Gwalior Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Gwalior Just Call 8617370543 Top Class Call Girl Service AvailableCall Girls Gwalior Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Gwalior Just Call 8617370543 Top Class Call Girl Service Available
 
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
 
Top Rated Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...
Top Rated  Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...Top Rated  Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...
Top Rated Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...
 
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
 
Call Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service Available
 
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel roomLucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
 
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
 
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
 
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
 
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
 
Call Girls Gwalior Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Gwalior Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Gwalior Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Gwalior Just Call 9907093804 Top Class Call Girl Service Available
 
Call Girls Tirupati Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Tirupati Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Tirupati Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Tirupati Just Call 9907093804 Top Class Call Girl Service Available
 
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
 
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
 
(Rocky) Jaipur Call Girl - 09521753030 Escorts Service 50% Off with Cash ON D...
(Rocky) Jaipur Call Girl - 09521753030 Escorts Service 50% Off with Cash ON D...(Rocky) Jaipur Call Girl - 09521753030 Escorts Service 50% Off with Cash ON D...
(Rocky) Jaipur Call Girl - 09521753030 Escorts Service 50% Off with Cash ON D...
 
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
 
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
 
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
 

General Purpose Computing using Graphics Hardware

  • 1. General Purpose Computingusing Graphics Hardware Hanspeter Pfister Harvard University
  • 2. Acknowledgements Won-Ki Jeong, Harvard University KayvonFatahalian, Stanford University 2
  • 3. GPU (Graphics Processing Unit) PC hardware dedicated for 3D graphics Massively parallel SIMD processor Performance pushed by game industry 3 NVIDIA SLI System
  • 4. GPGPU General Purpose computation on the GPU Started in computer graphics research community Mapping computational problems to graphics rendering pipeline 4 Image CourtesyJens Krueger, Aaron Lefohn, and Won-Ki Jeong
  • 5. Why GPU for computing? GPU is fast Massively parallel CPU : ~4 cores (16 SIMD lanes) @ 3.2 Ghz (Intel Quad Core) GPU : ~30 cores (240 SIMD lanes) @ 1.3 Ghz (NVIDIA GT200) High memory bandwidth Programmable NVIDIA CUDA, DirectX Compute Shader, OpenCL High precision floating point support 64bit floating point (IEEE 754) Inexpensive desktop supercomputer NVIDIA Tesla C1060 : ~1 TFLOPS @ $1000 5
  • 6. FLOPS 6 Image Courtesy NVIDIA
  • 7. Memory Bandwidth 7 Image Courtesy NVIDIA
  • 8. GPGPU Biomedical Examples 8 Level-Set Segmentation (Lefohn et al.) CT/MRI Reconstruction (Sumanaweera et al.) Image Registration (Strzodka et al.) EM Image Processing (Jeong et al.)
  • 9. Overview GPU Architecture Overview GPU Programming Overview Programming Model NVIDIA CUDA OpenCL Application Example CUDA ITK 9
  • 10. 1. GPU Architecture Overview KayvonFatahalian Stanford University 10
  • 11. What’s in a GPU? 11 Input Assembly Rasterizer Output Blend Video Decode Tex Compute Core Compute Core Compute Core Compute Core Compute Core Compute Core Compute Core Compute Core Tex Tex HW or SW? Work Distributor Tex Heterogeneous chip multi-processor (highly tuned for graphics)
  • 12. CPU-“style” cores 12 Fetch/ Decode Out-of-order control logic Fancy branch predictor ALU (Execute) Memory pre-fetcher Execution Context Data Cache (A big one)
  • 13. Slimming down 13 Fetch/ Decode Idea #1: Remove components that help a single instruction stream run fast ALU (Execute) Execution Context
  • 14. Two cores (two threads in parallel) 14 thread1 thread 2 Fetch/ Decode Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 mul r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul o0, r0, r3 mul o1, r1, r3 mul o2, r2, r3 mov o3, l(1.0) <diffuseShader>: sample r0, v4, t0, s0 mul r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul o0, r0, r3 mul o1, r1, r3 mul o2, r2, r3 mov o3, l(1.0) ALU (Execute) ALU (Execute) Execution Context Execution Context
  • 15. Four cores (four threads in parallel) 15 Fetch/ Decode Fetch/ Decode Fetch/ Decode Fetch/ Decode ALU (Execute) ALU (Execute) ALU (Execute) ALU (Execute) Execution Context Execution Context Execution Context Execution Context
  • 16. Sixteen cores (sixteen threads in parallel) 16 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU 16 cores = 16 simultaneous instruction streams
  • 17. Instruction stream sharing 17 But… many threads should be able to share an instruction stream! <diffuseShader>: sample r0, v4, t0, s0 mul r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul o0, r0, r3 mul o1, r1, r3 mul o2, r2, r3 mov o3, l(1.0)
  • 18. Recall: simple processing core 18 Fetch/ Decode ALU (Execute) Execution Context
  • 19. Add ALUs 19 Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs Fetch/ Decode ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7 ALU 8 Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx SIMD processing Shared Ctx Data
  • 20. Modifying the code 20 Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 mul r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul o0, r0, r3 mul o1, r1, r3 mul o2, r2, r3 mov o3, l(1.0) ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7 ALU 8 Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx Original compiled shader: Shared Ctx Data Processes one thread using scalar ops on scalar registers
  • 21. Modifying the code 21 Fetch/ Decode <VEC8_diffuseShader>: VEC8_sample vec_r0, vec_v4, t0, vec_s0 VEC8_mul vec_r3, vec_v0, cb0[0] VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3 VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3 VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0) VEC8_mul vec_o0, vec_r0, vec_r3 VEC8_mul vec_o1, vec_r1, vec_r3 VEC8_mul vec_o2, vec_r2, vec_r3 VEC8_mov vec_o3, l(1.0) ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7 ALU 8 Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx New compiled shader: Shared Ctx Data Processes 8 threads using vector ops on vector registers
  • 22. Modifying the code 22 2 3 1 4 6 7 5 8 Fetch/ Decode <VEC8_diffuseShader>: VEC8_sample vec_r0, vec_v4, t0, vec_s0 VEC8_mul vec_r3, vec_v0, cb0[0] VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3 VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3 VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0) VEC8_mul vec_o0, vec_r0, vec_r3 VEC8_mul vec_o1, vec_r1, vec_r3 VEC8_mul vec_o2, vec_r2, vec_r3 VEC8_mov vec_o3, l(1.0) ALU 1 ALU 2 ALU 3 ALU 4 ALU 5 ALU 6 ALU 7 ALU 8 Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx Shared Ctx Data
  • 23. 128 threads in parallel 23 16 cores = 128 ALUs = 16 simultaneous instruction streams
  • 24. But what about branches? 24 2 ... 1 ... 8 Time (clocks) ALU 1 ALU 2 . . . ALU 8 . . . <unconditional shader code> if (x> 0) { y = pow(x, exp); y *= Ks; refl = y + Ka; } else { x = 0; refl = Ka; } <resume unconditional shader code>
  • 25. But what about branches? 25 2 ... 1 ... 8 Time (clocks) ALU 1 ALU 2 . . . ALU 8 . . . <unconditional shader code> T T T F F F F F if (x> 0) { y = pow(x, exp); y *= Ks; refl = y + Ka; } else { x = 0; refl = Ka; } <resume unconditional shader code>
  • 26. But what about branches? 26 2 ... 1 ... 8 Time (clocks) ALU 1 ALU 2 . . . ALU 8 . . . <unconditional shader code> T T T F F F F F if (x> 0) { y = pow(x, exp); y *= Ks; refl = y + Ka; } else { x = 0; refl = Ka; } <resume unconditional shader code> Not all ALUs do useful work! Worst case: 1/8 performance
  • 27. But what about branches? 27 2 ... 1 ... 8 Time (clocks) ALU 1 ALU 2 . . . ALU 8 . . . <unconditional shader code> T T T F F F F F if (x> 0) { y = pow(x, exp); y *= Ks; refl = y + Ka; } else { x = 0; refl = Ka; } <resume unconditional shader code>
  • 28.
  • 29. Intel/AMD x86 SSE, Intel Larrabee
  • 30. Option 2: Scalar instructions, implicit HW vectorization
  • 31. HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
  • 32. NVIDIA GeForce (“SIMT” warps), ATI Radeon architecturesIn practice: 16 to 64 threads share an instruction stream
  • 33. Stalls! Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency = 100’s to 1000’s of cycles We’ve removed the fancy caches and logic that helps avoid stalls. 29
  • 34. But we have LOTS of independent threads. Idea #3: Interleave processing of many threads on a single core to avoid stalls caused by high latency operations. 30
  • 35. Hiding stalls 31 Time (clocks) Thread1 … 8 ALU ALU ALU ALU ALU ALU ALU ALU Fetch/ Decode Ctx Ctx Ctx Ctx Ctx Ctx Ctx Ctx SharedCtx Data
  • 36. Hiding stalls 32 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 ALU ALU ALU ALU ALU ALU ALU ALU Fetch/ Decode 1 2 3 4 1 2 3 4
  • 37. Hiding stalls 33 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 Stall Runnable 1 2 3 4
  • 38. Hiding stalls 34 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 Stall Runnable 1 2 3 4
  • 39. Hiding stalls 35 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 Stall Stall Stall Stall Runnable Runnable 1 2 3 4 Runnable
  • 40. Throughput! 36 Time (clocks) Thread9… 16 Thread17 … 24 Thread25 … 32 Thread1 … 8 Start Start Stall Stall Stall Stall Start Runnable Runnable Done! Runnable Done! Runnable 2 3 4 1 Increase run time of one group To maximum throughput of many groups Done! Done!
  • 41. Storing contexts 37 Fetch/ Decode ALU ALU ALU ALU ALU ALU ALU ALU Pool of context storage 32KB
  • 42. Twenty small contexts 38 (maximal latency hiding ability) Fetch/ Decode ALU ALU ALU ALU ALU ALU ALU ALU 10 1 2 3 4 5 6 7 8 9 11 15 12 13 14 16 20 17 18 19
  • 43. Twelve medium contexts 39 Fetch/ Decode ALU ALU ALU ALU ALU ALU ALU ALU 1 2 3 4 5 6 7 8 9 10 11 12
  • 44. Four large contexts 40 (low latency hiding ability) Fetch/ Decode ALU ALU ALU ALU ALU ALU ALU ALU 4 3 1 2
  • 45. GPU block diagram key = single “physical” instruction stream fetch/decode (functional unit control) = SIMD programmable functional unit (FU), control shared with other functional units. This functional unit may contain multiple 32-bit “ALUs” = 32-bit mul-add unit = 32-bit multiply unit = execution context storage = fixed function unit 41
  • 46. Example: NVIDIA GeForce GTX 280 NVIDIA-speak: 240 stream processors “SIMT execution” (automatic HW-managed sharing of instruction stream) Generic speak: 30 processing cores 8 SIMD functional units per core 1 mul-add (2 flops) + 1 mul per functional units (3 flops/clock) Best case: 240 mul-adds + 240 muls per clock 1.3 GHz clock 30 * 8 * (2 + 1) * 1.3 = 933 GFLOPS Mapping data-parallelism to chip: Instruction stream shared across 32 threads 8 threads run on 8 SIMD functional units in one clock 42
  • 47. GTX 280 core 43 Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … Zcull/Clip/Rast Output Blend Work Distributor
  • 48. Example: ATI Radeon 4870 AMD/ATI-speak: 800 stream processors Automatic HW-managed sharing of scalar instruction stream (like “SIMT”) Generic speak: 10 processing cores 16 SIMD functional units per core 5 mul-adds per functional unit (5 * 2 =10 flops/clock) Best case: 800 mul-adds per clock 750 MHz clock 10 * 16 * 5 * 2 * .75 = 1.2 TFLOPS Mapping data-parallelism to chip: Instruction stream shared across 64 threads 16 threads run on 16 SIMD functional units in one clock 44
  • 49. ATI Radeon 4870 core … … … … … … … … … … Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Zcull/Clip/Rast Output Blend Work Distributor 45
  • 50. Summary: three key ideas Use many “slimmed down cores” to run in parallel Pack cores full of ALUs (by sharing instruction stream across groups of threads) Option 1: Explicit SIMD vector instructions Option 2: Implicit sharing managed by hardware Avoid latency stalls by interleaving execution of many groups of threads When one group stalls, work on another group 46
  • 51. 2. GPU Programming Models Programming Model NVIDIA CUDA OpenCL 47
  • 52. Task parallelism Distribute the tasks across processors based on dependency Coarse-grain parallelism 48 Task 1 Task 1 Time Task 2 Task 2 Task 3 Task 3 P1 Task 4 Task 4 P2 Task 5 Task 5 Task 6 Task 6 P3 Task 7 Task 7 Task 8 Task 8 Task 9 Task 9 Task assignment across 3 processors Task dependency graph
  • 53. Data parallelism Run a single kernel over many elements Each element is independently updated Same operation is applied on each element Fine-grain parallelism Many lightweight threads, easy to switch context Maps well to ALU heavy architecture : GPU 49 Kernel ……. Data P1 P2 P3 P4 P5 Pn …….
  • 54. GPU-friendly Problems Data-parallel processing High arithmetic intensity Keep GPU busy all the time Computation offsets memory latency Coherent data access Access large chunk of contiguous memory Exploit fast on-chip shared memory 50
  • 55.
  • 56. Example: Reduction 52
Serial version (O(N)):
for(int i=1; i<N; i++) {
  v[0] += v[i];
}
Parallel version (O(log N)):
width = N/2;
while(width >= 1) {        // >= 1, so the final pair v[0] += v[1] is included
  for(int i=0; i<width; i++) {
    v[i] += v[i+width];    // computed in parallel
  }
  width /= 2;
}
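A minimal CUDA sketch of the parallel reduction above, written as a single-block kernel that keeps the partial sums in shared memory (the kernel name, launch configuration, and the power-of-two blockDim assumption are mine, not from the slides):

    // Sum N floats with one thread block; assumes N == blockDim.x and N is a power of two.
    __global__ void reduceSum(float* v, int N)
    {
        extern __shared__ float s[];          // shared memory, sized at launch
        int tid = threadIdx.x;
        s[tid] = v[tid];                      // each thread loads one element
        __syncthreads();

        // halve the active width every step, as in the pseudocode above
        for (int width = blockDim.x / 2; width >= 1; width /= 2) {
            if (tid < width)
                s[tid] += s[tid + width];
            __syncthreads();
        }
        if (tid == 0) v[0] = s[0];            // thread 0 writes the final sum
    }

    // host side: reduceSum<<<1, N, N * sizeof(float)>>>(d_v, N);

Large arrays would be reduced block by block and the per-block results combined, but the single-block case is enough to show the O(log N) structure.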
  • 57. GPU programming languages Using graphics APIs GLSL, Cg, HLSL Computing-specific APIs DX 11 Compute Shaders NVIDIA CUDA OpenCL 53
  • 58. NVIDIA CUDA C-extension programming language No graphics API Supports debugging tools Extensions / API Function type : __global__, __device__, __host__ Variable type : __shared__, __constant__ Low-level functions cudaMalloc(), cudaFree(), cudaMemcpy(),… __syncthreads(), atomicAdd(),… Program types Device program (kernel) : runs on the GPU Host program : runs on the CPU to call device programs 54
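A small illustrative sketch (not taken from the talk) of how these qualifiers combine: a __constant__ value set by the host, a __device__ helper, and a __global__ kernel launched from host code:

    __constant__ float c_scale;                                // constant memory, written by the host

    __device__ float scaled(float x) { return c_scale * x; }   // callable from device code only

    __global__ void scaleArray(float* data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] = scaled(data[idx]);
    }

    // host side
    void runScale(float* d_data, int n, float scale)
    {
        cudaMemcpyToSymbol(c_scale, &scale, sizeof(float));    // fill constant memory
        scaleArray<<<(n + 255) / 256, 256>>>(d_data, n);       // launch the kernel
    }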
  • 59. CUDA Programming Model Kernel GPU program that runs on a thread grid Thread hierarchy Grid : a set of blocks Block : a set of threads Grid size * block size = total # of threads 55 (Figure: a grid containing Block 1, Block 2, … Block n, each made up of many threads running the kernel)
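For example, a hypothetical 2D launch in which grid size * block size covers an image, and each thread derives its global coordinates from blockIdx, blockDim, and threadIdx:

    __global__ void process(float* img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] *= 2.0f;                        // one thread per pixel
    }

    void launchProcess(float* d_img, int width, int height)
    {
        dim3 block(16, 16);                                    // block: a set of threads
        dim3 grid((width + 15) / 16, (height + 15) / 16);      // grid: a set of blocks
        process<<<grid, block>>>(d_img, width, height);        // grid size * block size >= # pixels
    }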
  • 60. CUDA Memory Structure 56 (Figure: PC Memory (DRAM) on the host, GPU Global Memory (DRAM) on the graphics card, GPU Shared Memory (on-chip) next to the ALUs; the numbers 1 / 200 / 4000 are relative communication costs) Memory hierarchy PC memory : off-card GPU Global : off-chip / on-card Shared/register/cache : on-chip The host can read/write global memory Threads within the same block communicate using shared memory
  • 61. Synchronization Threads in the same block can communicate using shared memory No HW global synchronization function yet __syncthreads() Barrier for threads only within the current block __threadfence() Flushes global memory writes to make them visible to all threads 57
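A small sketch (illustrative names) of block-level communication: every thread stages a value in __shared__ memory, and __syncthreads() guarantees all writes are visible before any thread reads its neighbor’s slot:

    __global__ void shiftLeft(const float* in, float* out, int n)
    {
        __shared__ float tile[256];                   // one slot per thread; assumes blockDim.x <= 256
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) tile[threadIdx.x] = in[idx];
        __syncthreads();                              // barrier: all writes to tile are done

        // read the right-hand neighbor's value from shared memory (block-local example)
        if (idx < n && threadIdx.x + 1 < blockDim.x)
            out[idx] = tile[threadIdx.x + 1];
    }

Without the barrier, a thread could read a neighbor’s slot before that neighbor has written it.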
  • 62. Example: CPU Vector Addition 58
// Pair-wise addition of vector elements
// CPU version : serial add
void vectorAdd(float* iA, float* iB, float* oC, int num)
{
  for(int i=0; i<num; i++) {
    oC[i] = iA[i] + iB[i];
  }
}
  • 63. Example: CUDA Vector Addition 59
// Pair-wise addition of vector elements
// CUDA version : one thread per addition
__global__ void vectorAdd(float* iA, float* iB, float* oC)
{
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  oC[idx] = iA[idx] + iB[idx];
}
  • 64. Example: CUDA Host Code 60
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// … initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
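The slide stops at the kernel launch; a minimal continuation with the same variable names would copy the result back to the host and release the memory:

    // copy the result back to the host
    float* h_C = (float*) malloc(N * sizeof(float));
    cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

    // free device and host memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);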
  • 65. OpenCL (Open Computing Language) First industry-standard computing language Based on the C language Platform independent (NVIDIA, ATI, Intel, …) Data- and task-parallel compute model Use all computational resources in the system (CPU, GPU, …) Work-item : same as a thread / fragment / etc. Work-group : a group of work-items Work-items in the same work-group can communicate Execute multiple work-groups in parallel 61
  • 66. OpenCL program structure Host program (CPU) Platform layer Query compute devices Create context Runtime Create memory objects Compile and create kernel program objects Issue commands (i.e., kernel launching) to command-queue Synchronization of commands Clean up OpenCL resources Kernel (CPU, GPU) C-like code with some extensions Runs on compute device 62
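For comparison with the CUDA vectorAdd kernel shown earlier, a minimal OpenCL C version of the same kernel (a sketch; on the host side the runtime layer would create buffers with clCreateBuffer, set arguments with clSetKernelArg, and enqueue it with clEnqueueNDRangeKernel):

    __kernel void vectorAdd(__global const float* iA,
                            __global const float* iB,
                            __global float* oC)
    {
        int idx = get_global_id(0);   // global work-item index (CUDA: threadIdx + blockIdx*blockDim)
        oC[idx] = iA[idx] + iB[idx];
    }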
  • 67. CUDA vs. OpenCL comparison Conceptually almost identical Work-item == thread Work-group == block Similar memory model (global, local, shared memory; OpenCL “local” memory corresponds to CUDA “shared” memory) Kernel, host program CUDA is highly optimized only for NVIDIA GPUs OpenCL can be widely used for any GPUs/CPUs 63
  • 68. Implementation status of OpenCL Specification 1.0 released by Khronos NVIDIA released Beta 1.2 driver and SDK Available for registered GPU computing developers Apple will include in Mac OS X Snow Leopard Q3 2009 NVIDIA and ATI GPUs, Intel CPU for Mac More companies will join 64
  • 69. GPU optimization tips: configuration Identify the bottleneck Compute-bound or bandwidth-bound (use the profiler) Focus on the most expensive but parallelizable parts (Amdahl’s law) Maximize parallel execution Use large inputs (many threads) Avoid divergent execution Use limited resources efficiently Minimize shared memory / register use 65
  • 70. GPU optimization tips: memory Memory access: the most important optimization Minimize device to host memory overhead Overlap kernel with memory copy (asynchronous copy) Avoid shared memory bank conflict Coalesced global memory access Texture or constant memory can be helpful (cache) Graphics card GPU Core PC Memory (DRAM) GPU GlobalMemory(DRAM) GPU SharedMemory(On-Chip) ALUs 1 200 4000 66
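A hedged sketch of overlapping transfers with kernel execution: the data is processed in two halves on two CUDA streams, so the copy of one half can overlap the kernel running on the other half (requires pinned host memory from cudaMallocHost; the kernel and buffer names are illustrative):

    __global__ void myKernel(float* d_buf, int n);    // assumed kernel, defined elsewhere

    // Process the data in two halves; assumes N is divisible by 512 and h_buf is pinned.
    void copyAndRun(float* h_buf, float* d_buf, int N)
    {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);
        int half = N / 2;

        for (int i = 0; i < 2; i++) {
            // async copy and kernel for half i go to stream i; stream 0's kernel can
            // overlap stream 1's copy on hardware with a separate copy engine
            cudaMemcpyAsync(d_buf + i * half, h_buf + i * half,
                            half * sizeof(float), cudaMemcpyHostToDevice, s[i]);
            myKernel<<<half / 256, 256, 0, s[i]>>>(d_buf + i * half, half);
        }
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }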
  • 71. GPU optimization tips: instructions Use less expensive operators Division: ~32 cycles, multiplication: ~4 cycles, so multiply by 0.5 instead of dividing by 2.0 Atomic operators are expensive Possible race conditions Double precision is much slower than float Use less accurate floating point instructions when possible __sinf(), __expf(), __powf() Avoid unnecessary instructions Loop unrolling 67
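A few of these tips in code form (a sketch; the cycle counts above are rough figures for GPUs of this generation):

    __global__ void tips(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float v = x[i] * 0.5f;            // multiply by 0.5 instead of dividing by 2.0

        #pragma unroll                    // ask the compiler to unroll this small fixed loop
        for (int k = 0; k < 4; k++)
            v = __sinf(v) + __expf(v);    // fast, less accurate intrinsics instead of sinf()/expf()

        x[i] = v;
    }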
  • 72. 3. Application Example CUDA ITK 68
  • 73. ITK image filters implemented using CUDA Convolution filters Mean filter Gaussian filter Derivative filter Hessian of Gaussian filter Statistical filter Median filter PDE-based filter Anisotropic diffusion filter 69
  • 74. CUDA ITK CUDA code is integrated into ITK Transparent to the ITK users No need to modify current code using ITK library Check environment variable ITK_CUDA Entry point GenerateData() or ThreadedGenerateData() If ITK_CUDA == 0 Execute original ITK code If ITK_CUDA == 1 Execute CUDA code 70
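A hypothetical sketch of that environment-variable check (the real CUDA ITK entry points are GenerateData()/ThreadedGenerateData(); the helper name below is illustrative, not the library’s API):

    #include <cstdlib>

    // Decide at run time whether to take the CUDA path, as described above.
    static bool UseCudaPath()
    {
        const char* flag = std::getenv("ITK_CUDA");
        return flag != nullptr && std::atoi(flag) == 1;
    }

    // inside GenerateData():  if (UseCudaPath()) { /* CUDA code */ } else { /* original ITK code */ }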
  • 75.
  • 76. Naïve C/CUDA implementation 72 Read from the input image whenever needed
int xdim, ydim;   // size of input image
float *in, *out;  // input/output image of size xdim*ydim
float w[][];      // convolution kernel of size n*m
for(x=0; x<xdim; x++) {
  for(y=0; y<ydim; y++) {
    // compute convolution
    out[x][y] = 0;
    for(sx=x-n/2; sx<=x+n/2; sx++) {
      for(sy=y-m/2; sy<=y+m/2; sy++) {
        wx = sx - x + n/2; wy = sy - y + m/2;
        out[x][y] += w[wx][wy]*in[sx][sy];   // accumulate the weighted sum
      }
    }
  }
}
(Annotations: the outer loops run xdim*ydim times, the inner loops n*m times; each output pixel loads n*m values from global memory)
  • 77. Improved CUDA convolution filter 73 For a size n*m filter, each pixel is reused n*m times Save n*m-1 global memory loads by using shared memory
__global__ void cudaConvolutionFilter2DKernel(in, out, w)
{
  // copy from global to shared memory
  sharedmem[] = in[][];
  __syncthreads();
  // sum neighbor pixel values
  float _sum = 0;
  for(uint j=threadIdx.y; j<=threadIdx.y + m; j++) {
    for(uint i=threadIdx.x; i<=threadIdx.x + n; i++) {
      wx = i - threadIdx.x; wy = j - threadIdx.y;
      _sum += w[wx][wy]*sharedmem[j*sharedmemdim.x + i];
    }
  }
  // … _sum is then written to the corresponding element of out
}
(Annotations: each input pixel is loaded from global memory (slow) only once; the n*m reads per output pixel come from shared memory (fast))
  • 78. CUDA Gaussian filter 74 Apply a 1D convolution filter along each axis Use temporary buffers: ping-pong rendering
// temp[0], temp[1] : temporary buffers to store intermediate results
void cudaDiscreteGaussianImageFilter(in, out, stddev)
{
  // create Gaussian weight
  w = ComputeGaussKernel(stddev);
  temp[0] = in;
  // call the 1D convolution CUDA kernel with the Gaussian weight, once per axis
  dim3 G, B;
  for(i=0; i<dimension; i++) {
    cudaConvolutionFilter1DKernel<<<G,B>>>(temp[i%2], temp[(i+1)%2], w);
  }
  out = temp[i%2];
}
  • 79. Median filter 75 Viola et al. [VIS 03] Find the median by bisection of histogram bins log2(# bins) iterations (8-bit pixels: log2(256) = 8 iterations) Copy the current block from global to shared memory, then bisect:
min = 0; max = 255;
pivot = (min+max)/2.0f;
for(i=0; i<8; i++) {
  count = 0;
  for(j=0; j<kernelsize; j++) {
    if(kernel[j] > pivot) count++;
  }
  if(count < kernelsize/2) max = floor(pivot);
  else min = ceil(pivot);
  pivot = (min + max)/2.0f;
}
return floor(pivot);
(Figure: an example block of pixel values and its intensity histogram being bisected step by step)
  • 80. Perona & Malik anisotropic diffusion 76 Nonlinear diffusion Adaptive smoothing based on the magnitude of the gradient Preserves edges (high gradient) Numerical solution Euler explicit integration (iterative method) Finite differences for derivative computation (Figure: input image, linear diffusion, and P & M diffusion)
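A hedged CUDA sketch of one explicit Euler iteration of Perona & Malik diffusion on a 2D image, using a 4-neighbor finite-difference scheme and the exponential conductance function; the kernel name, parameters, and exact scheme are illustrative, not the CUDA ITK implementation:

    __device__ float conductance(float d, float K)
    {
        return expf(-(d * d) / (K * K));     // Perona-Malik g(d) = exp(-(d/K)^2)
    }

    // One explicit Euler step of 2D Perona & Malik diffusion.
    // lambda must be <= 0.25 for stability; K controls edge sensitivity.
    __global__ void pmDiffusionStep(const float* in, float* out,
                                    int w, int h, float K, float lambda)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        float c  = in[y * w + x];
        // clamp at the image border by reusing the center value
        float n  = (y > 0)     ? in[(y - 1) * w + x] : c;
        float s  = (y < h - 1) ? in[(y + 1) * w + x] : c;
        float e  = (x < w - 1) ? in[y * w + (x + 1)] : c;
        float wv = (x > 0)     ? in[y * w + (x - 1)] : c;

        float dn = n - c, ds = s - c, de = e - c, dw = wv - c;
        float flow = conductance(dn, K) * dn + conductance(ds, K) * ds
                   + conductance(de, K) * de + conductance(dw, K) * dw;

        out[y * w + x] = c + lambda * flow;  // forward-Euler update
    }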
  • 81. Performance Convolution filters Mean filter : ~140x Gaussian filter : ~60x Derivative filter Hessian of Gaussian filter Statistical filter Median filter : ~25x PDE-based filter Anisotropic diffusion filter : ~70x 77
  • 82. CUDA ITK Source code available at http://sourceforge.net/projects/cudaitk/ 78
  • 83. CUDA ITK Future Work ITK GPU image class Reduce CPU-to-GPU memory I/O Pipelining support Native interface for GPU code Similar to ThreadedGenerateData() for GPU threads Numerical library (vnl) Out-of-GPU-core / GPU-cluster processing for large images (10~100 terabytes) GPU-platform-independent implementation OpenCL could be a solution 79
  • 84. Conclusions GPU computing delivers high performance Many scientific computing problems are parallelizable More consistency/stability in HW/SW Main GPU architecture is mature Industry-wide programming standard now exists (OpenCL) Better support/tools available C-based language, compiler, and debugger Issues Not every problem is suitable for GPUs Re-engineering of algorithms/software required Unclear future performance growth of GPU hardware Intel’s Larrabee 80
  • 85. thrust thrust: a CUDA library of data-parallel algorithms & data structures with an interface similar to the C++ Standard Template Library (STL) C++ template metaprogramming automatically chooses the fastest code path at compile time
  • 86. thrust::sort
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
  // generate random data on the host
  thrust::host_vector<int> h_vec(1000000);
  thrust::generate(h_vec.begin(), h_vec.end(), rand);
  // transfer to device and sort
  thrust::device_vector<int> d_vec = h_vec;
  // sort: 140M 32b keys/sec on GT200
  thrust::sort(d_vec.begin(), d_vec.end());
  return 0;
}
http://thrust.googlecode.com

Editor's Notes

  1. Fluid flow, level set segmentation, DTI image
  2. One of the major debates you’ll see in graphics in the coming years, is whether the scheduling and work distribution logic should be provided as highly optimized hardware, or be implemented as a software program on the programmable cores.
  3. Pack the core full of ALUs. We are not going to increase our core’s ability to decode instructions; we will decode 1 instruction and execute it on all 8 ALUs.
  4. How can we make use of all these ALUs?
  5. Just have the shader program work on 8 fragments at a time. Replace the scalar operations with 8-wide vector ones.
  6. So the program processes 8 fragments at a time, and all the work for each fragment is carried out by 1 of the 8 ALUs. Notice that I’ve also replicated part of the context to store execution state for the 8 fragments. For example, I’d replicate the registers.
  7. We continue this process, moving to a new group each time we encounter a stall. If we have enough groups there will always be some work to do, and the processing core’s ALUs never go idle.
  8. Described adding contexts. In reality there’s a fixed pool of on-chip storage that is partitioned to hold contexts. Instead of using on-chip storage as a traditional data cache, GPUs choose to use this store to hold contexts.
  9. Shading performance relies on large-scale interleaving. Number of interleaved groups per core: ~20-30. Could be separate hardware-managed contexts or software-managed using other techniques.
  10. Fewer contexts fit on chip; the chip can hide less latency; higher likelihood of stalls.
  11. Lose performance when shaders use a lot of registers.
  12. 128 simultaneous threads on each core
  13. Drive these ALUs using explicit SIMD instructions or implicitly via HW-determined sharing.
  14. Numbers are relative cost of communication
  15. Runs on each thread – is parallel
  16. G = grid size, B = block size