3. What we will cover
• GPUs and their history
• Why use GPUs
• Architecture
• Getting Started with GPU Programming
• Challenges, Techniques & Pitfalls
• Where not to use GPUs?
• Resources
• The Future
4. What is a GPU
• Graphics Processing Unit
– Coined in 1999 by NVidia
– Specialized add‐on board
• Accelerates interactive 3D rendering
– 60 image updates per second (or more) on large data
– Solves embarrassingly parallel problem
– Game driven volume economics
• NVIDIA vs. ATI, just like Intel vs. AMD
• Demand for better effects led to
– programmable GPUs
– floating point capabilities
– this led to General-Purpose GPU (GPGPU) computation
5. History of GPUs : a GPGPU Perspective
Date | Product         | Transistors | Cores | FLOPS (SP/DP)  | Technology
1997 | RIVA 128        | 3 M         | –     | –              | Rasterization
1999 | GeForce 256     | 25 M        | –     | –              | Transform & lighting
2001 | GeForce 3       | 60 M        | –     | –              | Programmable shaders
2002 | GeForce FX      | 125 M       | –     | –              | 16- and 32-bit FP, long shaders
2004 | GeForce 6800    | 222 M       | –     | –              | Infinite-length shaders, branching
2006 | GeForce 8800    | 681 M       | 128   | –              | Unified graphics & compute, CUDA, 64-bit FP
2008 | GeForce GTX 280 | 1.4 B       | 240   | 933 G / 78 G   | IEEE FP, CUDA C, OpenCL and DirectCompute, PCI-Express Gen 2
2009 | Tesla M2050     | 3.0 B       | 512   | 1.03 T / 515 G | Improved 64-bit perf, caching, ECC memory, 64-bit unified addressing, asynchronous bidirectional data transfer, multiple kernels
Source : Nickolls J. , Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010
6. The GPU Advantage
• 30x CPU FLOPS on latest GPUs
• 10x memory bandwidth
• Add to these:
– 3x performance/$
– Energy efficient: 5x performance/Watt
All graphs from GPU4Vision: http://gpu4vision.icg.tugrz.at/
7. People use GPUs for…
Source : Nickolls J. , Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010
8. More “why to use GPUs”
• Proliferation of GPUs
– Mobile devices will have capable GPUs soon!
• Make more things possible
– Make things real‐time
• From seconds to real‐time interactive performance
– Reduce offline processing overhead
• Research Opportunities
– New & efficient algorithms
– Pairing Multi‐core CPUs and massively multi‐threaded
GPUs
12. CPU versus GPU
• CPU
– Optimized for latency
– Speedup techniques
• Vectorization (MMX, SSE, …)
• Coarse Grained Parallelism using multiple CPUs and cores
– Memory approaching a TB
• GPU
– Optimized for throughput
– Speedup techniques
• Massive multithreading
• Fine grained parallelism
– A few GBs of memory max
13. Getting Started
• Software
– CUDA (NVIDIA-specific)
– OpenCL (cross-platform, GPU/CPU)
– DirectCompute (Microsoft-specific)
• Hardware
– A system equipped with GPU
• Any OS will do
– But Windows and Red Hat Enterprise Linux seem better supported
14. CUDA
• Compute Unified Device Architecture
• Most popular GPGPU toolkit
• CUDA C extends C with constructs
– Easy to write programs
• Lower-level “driver” API is available
– Provides more control
– Use multiple GPUs in the same application
– Mix graphics & compute code
• Language bindings available
– PyCUDA, Java, .NET
• Toolkit provides conveniences
Source: NVIDIA CUDA Architecture, Introduction and Overview
15. CUDA Architecture
• 1 or more streaming multiprocessors (“cores”)
• Thread blocks
– Single Instruction, Multiple Thread (SIMT)
– Hide latency by parallelism
• Memory hierarchy
– Fermi GPUs can access system memory
• Primitives for
– Thread synchronization
– Atomic operations on memory
Source : The GPU Computing Era
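A minimal sketch of how the two primitives named above combine in practice. The kernel name, parameters, and task (counting elements above a threshold) are hypothetical, not from the slides:

```cuda
// Threads in a block cooperate via __syncthreads(); blocks combine
// results with atomicAdd(). Hypothetical example: count elements > threshold.
__global__ void CountAbove(const float *A, int N, float threshold, int *count) {
    __shared__ int blockCount;              // visible to one thread block only
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();                        // thread synchronization primitive

    int i = blockDim.x * blockIdx.x + threadIdx.x;   // SIMT: one element per thread
    if (i < N && A[i] > threshold)
        atomicAdd(&blockCount, 1);          // atomic operation on shared memory

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(count, blockCount);       // atomic operation on global memory
}
```

Accumulating per-block first and touching global memory once per block keeps contention on the global counter low.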
16. Simple Example : Vector Addition
C/C++ – serial code
void VecAdd(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
VecAdd(A, B, C, N);
C/C++ with OpenMP – thread-level parallelism
void VecAdd(const float *A, const float *B, float *C, int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
VecAdd(A, B, C, N);
17. Vector Addition using CUDA
CUDA C – element-level parallelism
__global__ void VecAdd(const float *A, const float *B, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
Invoking the function
// Allocate memory on GPU
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy arrays to GPU
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
// Invoke function
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result back to main memory
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
// Free GPU memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
Compilation
# nvcc vectorAdd.cu -I ../../common/inc
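The invocation above omits error handling for brevity. Every CUDA runtime call returns a cudaError_t, and checking it catches failures (e.g. out-of-memory) early; a common pattern is a checking macro like the sketch below (the macro name is our own, not part of the toolkit):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap any CUDA runtime call; abort with a readable message on failure.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void**)&d_A, size));
// CUDA_CHECK(cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice));
```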
18. GPU Programming Challenges
• Need high “occupancy” for best performance
• Extracting parallelism with limited resources
– Limited Registers
– Limited Shared Memory
• Preferred Approach
– Small Kernels
– Multiple Passes if needed
• Decompose Problem into Parallel Pieces
– Write once, scale and perform everywhere!
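The “small kernels, multiple passes” approach can be sketched with a sum reduction: the same small kernel runs repeatedly, each pass shrinking the data until one value remains. Kernel name and launch parameters here are illustrative, and blockDim.x is assumed to be a power of two:

```cuda
// Each block sums its tile in shared memory and writes one partial sum.
__global__ void PartialSum(const float *in, float *out, int N) {
    extern __shared__ float s[];            // one slot per thread
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    s[threadIdx.x] = (i < N) ? in[i] : 0.0f;
    __syncthreads();
    // Tree reduction within the block (assumes power-of-two blockDim.x).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = s[0];   // one result per block
}

// Host side: each pass leaves one value per block; repeat until N == 1.
// while (N > 1) {
//     int blocks = (N + 255) / 256;
//     PartialSum<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, N);
//     N = blocks;   // then swap d_in and d_out for the next pass
// }
```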
19. GPU Programming
• Use Shared Memory when possible
– Cooperation between threads in a block
– Reduce access to global memory
• Reduce Data Transfer over the Bus
• It’s still a GPU!
– Use textures to your advantage
– Use vector data types if you can
• Watch out for GPU capability differences!
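One way shared memory reduces global-memory traffic: threads in a block stage a tile of data once, then all subsequent reads hit fast on-chip memory. The 3-point smoothing kernel below is a hypothetical illustration (a block size of 256 is assumed; boundary handling is kept minimal):

```cuda
#define RADIUS 1

// Each block loads its tile plus halo cells into shared memory once,
// then computes a 3-point average entirely from shared memory.
__global__ void Smooth(const float *in, float *out, int N) {
    __shared__ float tile[256 + 2 * RADIUS];
    int g = blockDim.x * blockIdx.x + threadIdx.x;   // global index
    int t = threadIdx.x + RADIUS;                    // index into the tile

    tile[t] = (g < N) ? in[g] : 0.0f;                // cooperative load
    if (threadIdx.x < RADIUS) {                      // load halo cells
        tile[t - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        tile[t + blockDim.x] = (g + blockDim.x < N) ? in[g + blockDim.x] : 0.0f;
    }
    __syncthreads();                                 // tile is fully loaded

    if (g < N)
        out[g] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}
```

Without the tile, each input element would be fetched from global memory up to three times; with it, once.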
22. Resources
• CUDA
– Tools on NVIDIA Developer Site
http://developer.nvidia.com/object/gpucomputing.html
– CUDPP
http://code.google.com/p/cudpp/
• OpenCL
• Google Search!
23. The Future
• Better throughput
– More GPU cores, scaling by Moore’s law
– PCIe Gen 3
• Easier to program
• Arbitrary control and data access patterns