GPU Teaching Kit – Accelerated Computing
GPU Computing with CUDA
Naga Vydyanathan
3 Ways to Accelerate Applications

[Diagram: three paths from Applications to the GPU:]
– Libraries: easy to use, most performance
– Compiler Directives: easy to use, portable code
– Programming Languages (CUDA): most performance, most flexibility
Objective
– To understand the CUDA memory hierarchy
– Registers, shared memory, global memory
– Scope and lifetime
– To learn the basic memory API functions in CUDA host code
– Device Memory Allocation
– Host-Device Data Transfer
– To learn about CUDA threads, the main mechanism for exploiting data parallelism
– Hierarchical thread organization
– Launching parallel execution
– Thread index to data index mapping
Hardware View of CUDA Memories

[Diagram: a GPU processor (SM) contains a register file, shared memory, a control unit (PC, IR), and ALU processing units, and connects to off-chip global memory and I/O.]
Programmer View of CUDA Memories

[Diagram: the host reads and writes global and constant memory; within a grid, each block has its own shared memory, and each thread within a block has its own registers.]
Declaring CUDA Variables

Variable declaration                      | Memory   | Scope  | Lifetime
int LocalVar;                             | register | thread | thread
__device__ __shared__ int SharedVar;      | shared   | block  | block
__device__ int GlobalVar;                 | global   | grid   | application
__device__ __constant__ int ConstantVar;  | constant | grid   | application

– __device__ is optional when used with __shared__ or __constant__
– Automatic variables reside in a register
– Except per-thread arrays, which reside in global memory
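To make the table concrete, here is a minimal sketch (our own illustration; the kernel name, array sizes, and coefficients are invented for this example) declaring one variable of each kind:

__device__ __constant__ float coeff[4];  // constant memory: grid scope, application lifetime
__device__ int globalCounter;            // global memory: grid scope, application lifetime

__global__ void scaleKernel(float *out, const float *in, int n)
{
    __shared__ float tile[256];          // shared memory: one copy per thread block
                                         // (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // automatic variable: held in a register
    if (i < n) {
        tile[threadIdx.x] = in[i];       // each thread stages its own element
        out[i] = coeff[0] * tile[threadIdx.x];
    }
}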
Global Memory in CUDA
– GPU Device memory (DRAM)
– Large in size (~16/32 GB)
– Bandwidth ~700-900 GB/s
– High latency (~1000 cycles)
– Scope of access and sharing – all threads in a kernel (grid)
– Lifetime – application
– Main means of communicating data between host and GPU
Shared Memory in CUDA

– A special type of memory whose contents are explicitly defined and used in the kernel source code
– One in each SM
– Accessed at much higher speed (in both latency and throughput) than global memory
– Scope of access and sharing – thread block
– Lifetime – thread block; its contents disappear after the corresponding thread block finishes execution
– A form of scratchpad memory in computer architecture

We will revisit shared memory later.
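As a brief preview of the pattern we will revisit (a hedged sketch, not from this deck): threads in a block can stage data in shared memory and synchronize with a barrier before using it. The kernel below reverses each 256-element chunk of an array; names and sizes are our own.

__global__ void reverseBlock(float *d_out, const float *d_in, int n)
{
    __shared__ float tile[256];                      // one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes blockDim.x == 256

    if (i < n) tile[threadIdx.x] = d_in[i];          // stage global data in shared memory
    __syncthreads();                                 // barrier: all loads finish before any reads

    int j = blockDim.x - 1 - threadIdx.x;            // mirror position within the block
    if (i < n && blockIdx.x * blockDim.x + j < n)
        d_out[i] = tile[j];                          // read a neighbor's staged element
}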
Data Parallelism – Vector Addition Example

[Diagram: vectors A and B are added element-wise to produce vector C: A[i] + B[i] → C[i] for i = 0 … N-1, each addition independent of the others.]
Vector Addition – Traditional C Code

// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    …
    vecAdd(h_A, h_B, h_C, N);
}
Heterogeneous Computing: vecAdd CUDA Host Code

[Diagram: Part 1 copies the inputs from CPU host memory to GPU device memory, Part 2 runs the computation on the GPU, and Part 3 copies the result back to host memory.]
#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1
    // Allocate device memory for A, B, and C
    // Copy A and B to device memory

    // Part 2
    // Kernel launch code – the device performs the actual vector addition

    // Part 3
    // Copy C from the device memory
    // Free device vectors
}
Partial Overview of CUDA Memories

– Device code can:
– R/W per-thread registers
– R/W all-shared global memory
– Host code can:
– Transfer data to/from per-grid global memory
[Diagram: the host transfers data to and from device global memory; each thread in each block has its own registers.]
CUDA Device Memory Management API Functions

– cudaMalloc()
– Allocates an object in the device global memory
– Two parameters
– Address of a pointer to the allocated object
– Size of the allocated object in bytes
– cudaFree()
– Frees an object from device global memory
– One parameter
– Pointer to the freed object
Host-Device Data Transfer API Functions

– cudaMemcpy()
– Memory data transfer
– Requires four parameters
– Pointer to destination
– Pointer to source
– Number of bytes copied
– Type/direction of transfer
– Transfer to the device is synchronous
Vector Addition Host Code

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code – to be shown later

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
In Practice, Check for API Errors in Host Code

cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
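Checking every call inline quickly gets verbose; a common convenience (our own sketch, not part of the Teaching Kit code) wraps the check in a macro:

#include <stdio.h>
#include <stdlib.h>

// Wrap any CUDA runtime call; on failure, report where it happened and abort.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s in %s at line %d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CHECK_CUDA(cudaMalloc((void **) &d_A, size));
// CHECK_CUDA(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));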
CUDA Execution Model

– Heterogeneous host (CPU) + device (GPU) application C program
– Serial parts in host C code
– Parallel parts in device SPMD kernel code

[Execution timeline:
Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);]
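As a toy end-to-end illustration of this alternation (our own example; kernelA and kernelB are trivial stand-ins for real kernels), note that kernel launches are asynchronous, so the host synchronizes before depending on device results:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA(int *d_x) { d_x[threadIdx.x] += 1; }  // parallel part A
__global__ void kernelB(int *d_x) { d_x[threadIdx.x] *= 2; }  // parallel part B

int main()
{
    int h_x[4] = {0, 1, 2, 3};
    int *d_x;
    cudaMalloc((void **) &d_x, sizeof(h_x));                     // serial host code
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    kernelA<<<1, 4>>>(d_x);                                      // device: SPMD kernel
    cudaDeviceSynchronize();                                     // back to serial host code

    kernelB<<<1, 4>>>(d_x);                                      // device again
    cudaMemcpy(h_x, d_x, sizeof(h_x), cudaMemcpyDeviceToHost);   // this copy also synchronizes

    printf("%d %d %d %d\n", h_x[0], h_x[1], h_x[2], h_x[3]);     // prints: 2 4 6 8
    cudaFree(d_x);
    return 0;
}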
Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (Single Program Multiple Data)
– Each thread has indexes that it uses to compute memory addresses and make control decisions

i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];

[Diagram: a thread array with indices 0, 1, 2, …, 254, 255, each thread executing the two lines above.]
Thread Blocks: Scalable Cooperation

– Divide the thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
– Threads in different blocks do not interact

[Diagram: Thread Block 0, Thread Block 1, …, Thread Block N-1, each containing threads 0 … 255 and each executing:
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];]
blockIdx and threadIdx

• Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (since CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– …
[Diagram: a device grid of blocks (0, 0), (0, 1), (1, 0), (1, 1); Block (1, 1) is expanded to show its threads indexed in three dimensions, from (0,0,0) up to (1,1,3).]
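As an illustration of 2D indexing (our own example, not from the deck): a kernel that brightens a width × height grayscale image can map blockIdx/threadIdx directly to pixel coordinates:

__global__ void brightenKernel(unsigned char *img, int width, int height, int delta)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // 2D thread/block indices ...
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // ... map to a pixel coordinate

    if (col < width && row < height) {
        int idx = row * width + col;                   // row-major 2D -> 1D index
        int v = img[idx] + delta;
        img[idx] = v > 255 ? 255 : (unsigned char) v;  // clamp to 8 bits
    }
}

// Launch with a 2D grid covering the image (16 x 16 threads per block):
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brightenKernel<<<grid, block>>>(d_img, width, height, 10);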
Example: Vector Addition Kernel

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
Device Code
Example: Vector Addition Kernel Launch (Host Code)

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}

Host Code

The ceiling function makes sure that there are enough threads to cover all elements.
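Putting the pieces from the last few slides together, a complete host function (a sketch assembled from the deck's own fragments, using an integer ceiling for the block count) looks like this:

#include <cuda_runtime.h>

__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);                          // Part 1: allocate ...
    cudaMalloc((void **) &d_B, size);
    cudaMalloc((void **) &d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);        // ... and copy inputs over
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);  // Part 2: launch the kernel

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);        // Part 3: copy the result back
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);               // free device memory
}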
More on Kernel Launch (Host Code)

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    dim3 DimGrid((n - 1)/256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}

Host Code

This is an equivalent way to express the ceiling function in integer arithmetic.
24
__host__
void vecAdd(…)
{
dim3 DimGrid(ceil(n/256.0),1,1);
dim3 DimBlock(256,1,1);
vecAddKernel<<<DimGrid,DimBlock>>>(d_A,d_B
,d_C,n);
}
Kernel execution in a nutshell
24
Grid
Blk 0 Blk N-1
• • •
GPU
M0
RAM
Mk
• • •
__global__
void vecAddKernel(float *A,
float *B, float *C, int n)
{
int i = blockIdx.x * blockDim.x
+ threadIdx.x;
if( i<n ) C[i] = A[i]+B[i];
}
More on CUDA Function Declarations

− __global__ defines a kernel function
− Each “__” consists of two underscore characters
− A kernel function must return void
− __device__ and __host__ can be used together
− __host__ is optional if used alone

                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void KernelFunc()      device             host
__host__ float HostFunc()         host               host
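For instance (a minimal example of our own), a function qualified with both __host__ and __device__ is compiled for both sides and callable from either:

__host__ __device__ float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);   // same source, compiled for CPU and GPU
}

__global__ void clampKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clampf(data[i], 0.0f, 1.0f);   // device-side call
}

// Host code may also call clampf(x, 0.0f, 1.0f) directly.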
CUDA Device Query

– Enumerates the properties of the CUDA devices in your system
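A minimal query program using the standard runtime API (in the spirit of the CUDA deviceQuery sample; the fields printed here are a small subset of cudaDeviceProp):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);                 // number of CUDA devices present

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);      // fill in the property struct
        printf("Device %d: %s\n", d, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);
        printf("  SM count:           %d\n", prop.multiProcessorCount);
        printf("  Shared mem/block:   %zu KB\n", prop.sharedMemPerBlock >> 10);
        printf("  Max threads/block:  %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}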
Next Webinar

– Learn to write, compile, and run a multi-dimensional CUDA program
– Learn some CUDA optimization techniques
– Learn some CUDA optimization techniques
– Memory optimization
– Compute optimization
– Transfer optimization
GPU Teaching Kit
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
Accelerated Computing
Naga Vydyanathan: nvydyanathan@nvidia.com
NVIDIA Developer Zone: https://developer.nvidia.com/