GPU: Understanding CUDA

  1. J.A.R. J.C.G. T.R.G.B. GPU: UNDERSTANDING CUDA
  2. TALK STRUCTURE • What is CUDA? • History of GPU • Hardware Presentation • How does it work? • Code Example • Examples & Videos • Results & Conclusion
  3. WHAT IS CUDA • Compute Unified Device Architecture • CUDA is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce • CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs
  4. HISTORY • 1981 – Monochrome Display Adapter • 1988 – VGA Standard (VGA Controller) – VESA founded • 1989 – SVGA • 1993 – PCI – NVidia founded • 1996 – AGP – Voodoo Graphics – Pentium • 1999 – NVidia GeForce 256 – P3 • 2004 – PCI Express – GeForce 6600 – P4 • 2006 – GeForce 8800 • 2008 – GeForce GTX280 / Core2
  5. HISTORICAL PC [Block diagram: the CPU, North Bridge with memory, and South Bridge sit on the system bus; the VGA controller with its memory buffer drives the screen from the PCI bus, alongside LAN and UART.]
  6. INTEL PC STRUCTURE
  7. NEW INTEL PC STRUCTURE
  8. VOODOO GRAPHICS SYSTEM ARCHITECTURE [Block diagram: geometry gather and geometry processing run on the CPU with core logic and system memory; triangle processing, pixel processing, and Z/blend run on the GPU (FBI with frame-buffer memory, TMU with texture memory).]
  9. GEFORCE GTX280 SYSTEM ARCHITECTURE [Block diagram: the full pipeline (geometry gather and processing, triangle processing, pixel processing, Z/blend) plus physics and AI runs on the GPU with its own GPU memory; the CPU with core logic, system memory, and scene management feeds it.]
  10. CUDA ARCHITECTURE ROADMAP
  11. SOUL OF NVIDIA’S GPU ROADMAP • Increase performance per watt • Make parallel programming easier • Run more of the application on the GPU
  12. MYTHS ABOUT CUDA • You have to port your entire application to the GPU • It is really hard to accelerate your application • There is a PCI-e bottleneck
  13. CUDA MODELS • Device Model • Execution Model
  14. DEVICE MODEL [Diagram: a single scalar processor, then many scalar processors combined with a register file and shared memory.]
  15. DEVICE MODEL [Diagram: multiprocessors assembled into a device.]
  16. DEVICE MODEL [Diagram: the host feeds an input assembler and thread execution manager; arrays of processors with parallel data caches and texture units reach global memory through load/store paths.]
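      The multiprocessor count, shared memory, and register file pictured in these diagrams can be queried at runtime. A minimal sketch using the CUDA runtime API (the values reported depend on the installed card):

          #include <stdio.h>
          #include <cuda_runtime.h>

          int main(void) {
              cudaDeviceProp prop;
              cudaGetDeviceProperties(&prop, 0);  // properties of device 0
              printf("Device:           %s\n", prop.name);
              printf("Multiprocessors:  %d\n", prop.multiProcessorCount);
              printf("Shared mem/block: %zu bytes\n", prop.sharedMemPerBlock);
              printf("Registers/block:  %d\n", prop.regsPerBlock);
              printf("Warp size:        %d\n", prop.warpSize);
              return 0;
          }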
  17. HARDWARE PRESENTATION GeForce GTS450
  18. HARDWARE PRESENTATION GeForce GTS450
  19. HARDWARE PRESENTATION GeForce GTS450 Specifications
  20. HARDWARE PRESENTATION GeForce GTX470
  21. HARDWARE PRESENTATION GeForce GTX470 Specifications
  22. HARDWARE PRESENTATION
  23. HARDWARE PRESENTATION GeForce 8600 GT/GTS Specifications
  24. EXECUTION MODEL Vocabulary: • Host: the CPU. • Device: the GPU. • Kernel: a piece of code executed on the GPU (a function or program). • SIMT: Single Instruction, Multiple Threads. • Warp: a set of 32 threads, the minimum unit of data processed in SIMT.
  25. EXECUTION MODEL A CUDA kernel is executed by an array of threads (SIMT). All threads execute the same code; each thread has a unique identifier (threadID (x,y,z)), as sketched below.
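      A minimal sketch of such a kernel (the array name data and its length n are illustrative); every thread runs the same code and is distinguished only by its ID:

          __global__ void scale(float *data, int n) {
              // Unique global index built from the block index,
              // block size, and thread index within the block
              int idx = blockIdx.x * blockDim.x + threadIdx.x;
              if (idx < n)            // threads past the end do nothing
                  data[idx] *= 2.0f;  // same instruction, different data (SIMT)
          }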
  26. EXECUTION MODEL - SOFTWARE Thread: the smallest logical unit. Block: a set of threads (max 512) with • private shared memory • a barrier for thread synchronization (see the sketch below). Grid: a set of blocks • no synchronization between blocks inside a kernel; the grid as a whole synchronizes only when the kernel completes.
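      A minimal sketch of the per-block shared memory and barrier just described, assuming the array length is a multiple of the block size (kernel name and sizes are illustrative). Note that the barrier spans only one block; there is no equivalent across blocks.

          #define BLOCK 256

          __global__ void reverse_each_block(int *d) {
              __shared__ int s[BLOCK];    // shared memory, private to this block
              int t = threadIdx.x;
              int idx = blockIdx.x * BLOCK + t;
              s[t] = d[idx];              // stage the block's data in shared memory
              __syncthreads();            // barrier: the whole block waits here
              d[idx] = s[BLOCK - 1 - t];  // safe: every slot of s was written
          }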
  27. EXECUTION MODEL Specified by the programmer at runtime: • number of blocks (gridDim) • block size (blockDim). CUDA kernel invocation: f<<<G, B>>>(a, b, c) (see the sketch below).
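      A sketch of such an invocation with the block count rounded up so that every element gets a thread (f, a, b, c as on the slide; N and B are illustrative values):

          int N = 100000;           // problem size (illustrative)
          int B = 256;              // block size (blockDim)
          int G = (N + B - 1) / B;  // number of blocks (gridDim), rounded up
          f<<<G, B>>>(a, b, c);     // launch G blocks of B threads each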
  28. EXECUTION MODEL - MEMORY ARCHITECTURE
  29. EXECUTION MODEL Each thread runs on a scalar processor. Thread blocks run on a multiprocessor. A grid runs a single CUDA kernel.
  30. SCHEDULE • Threads are grouped into blocks • IDs are assigned to blocks and threads • Blocks of threads are distributed among the multiprocessors • Threads of a block are grouped into warps • A warp is the smallest unit of scheduling and consists of 32 threads • Several warps reside on each multiprocessor, but only one executes at any moment [Diagram: over time the scheduler interleaves instructions from different warps, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96, drawn from blocks 1 through n.]
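      Because the scheduler issues one instruction for all 32 threads of a warp, a branch that splits a warp is executed serially, one path after the other. A small illustrative sketch of such divergence:

          __global__ void diverge(float *x) {
              int idx = blockIdx.x * blockDim.x + threadIdx.x;
              if (idx % 2 == 0)    // even and odd lanes of the same warp...
                  x[idx] += 1.0f;  // ...take different paths, so the warp runs
              else                 // both branches back to back with inactive
                  x[idx] -= 1.0f;  // lanes masked off
          }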
  31. CODE EXAMPLE The following program computes and prints the squares of the first 100 integers.
      // 1) Include header files
      #include <stdio.h>
      #include <cuda.h>

      // 2) Kernel that executes on the CUDA device
      __global__ void square_array(float *a, int N) {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < N)
              a[idx] = a[idx] * a[idx];
      }

      // 3) main() routine that runs on the host CPU
      int main(void) {
  32. CODE EXAMPLE
          // 3.1: Define pointers to the host and device arrays
          float *a_h, *a_d;
          // 3.2: Define the other variables used in the program
          const int N = 100;
          size_t size = N * sizeof(float);
          // 3.3: Allocate the array on the host
          a_h = (float *)malloc(size);
          // 3.4: Allocate the array on the device (the GPU's DRAM)
          cudaMalloc((void **)&a_d, size);
          // Fill the host array with the integers to square
          for (int i = 0; i < N; i++)
              a_h[i] = (float)i;
  33. CODE EXAMPLE
          // 3.5: Copy the data from the host array to the device array
          cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
          // 3.6: Kernel call with its execution configuration
          int block_size = 4;
          int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
          square_array<<<n_blocks, block_size>>>(a_d, N);
          // 3.7: Retrieve the result from the device into host memory
          cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
  34. CODE EXAMPLE
          // 3.8: Print the result
          for (int i = 0; i < N; i++)
              printf("%d\t%f\n", i, a_h[i]);
          // 3.9: Free the allocated memory on the host and device
          free(a_h);
          cudaFree(a_d);
          return 0;
      }
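      The example above omits error handling; a minimal sketch of how the kernel launch could be checked with the CUDA runtime API (not part of the original slides):

          // Immediately after the kernel launch:
          cudaError_t err = cudaGetLastError();        // configuration/launch errors
          if (err != cudaSuccess)
              fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
          if (cudaDeviceSynchronize() != cudaSuccess)  // wait for the kernel and
              fprintf(stderr, "kernel execution failed\n");  // catch execution errors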
  35. CUDA LIBRARIES
  36. TESTING
  37. TESTING
  38. TESTING
  39. EXAMPLES • Video example with an NVidia Tesla • Development environment
  40. RADIX SORT RESULTS [Chart: radix sort running times (vertical axis 0 to 1.6) for inputs of 1,000,000, 10,000,000, 51,000,000, and 100,000,000 elements on the GTS 450, GTX 470, GeForce 8600, and GTX 560M.]
  41. CONCLUSION • Easy to use and powerful, so it is worth it! • GPU computing is the future: the results confirm our theory, and the industry is giving it more and more importance. • In the coming years we will see more applications that use parallel computing.
  42. DOCUMENTATION & LINKS • http://www.nvidia.es/object/cuda_home_new_es.html • http://www.nvidia.com/docs/IO/113297/ISC-Briefing-Sumit-June11-Final.pdf • http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf • http://www.hpca.ual.es/~jmartine/CUDA/SESION3_CUDA_GPU_EMG_JAM.pdf • http://www.geforce.com/hardware/technology/cuda/supported-gpus • http://en.wikipedia.org/wiki/GeForce_256 • http://en.wikipedia.org/wiki/CUDA • https://developer.nvidia.com/technologies/Libraries • https://www.udacity.com/wiki/cs344/troubleshoot_gcc47 • http://stackoverflow.com/questions/12986701/installing-cuda-5-samples-in-ubuntu-12-10
  43. QUESTIONS?
