CUDA LAB
LSALAB
OVERVIEW
Programming Environment
Compile & Run CUDA Program
CUDA Tools
Lab Tasks
CUDA Programming Tips
References
GPU SERVER
Intel E5-2670 v2 10-core CPU x 2
NVIDIA K20X GPGPU card x 2
Command to get your GPGPU HW spec:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
Device 0: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 5760 MBytes (6039339008 bytes)
  (14) Multiprocessors, (192) CUDA Cores/MP:     2688 CUDA Cores
  GPU Clock rate:                                732 MHz (0.73 GHz)
  Memory Clock rate:                             2600 MHz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size   (x,y,z):   (2147483647, 65535, 65535)
Theoretical memory bandwidth: $2600 \times 10^{6} \times (384/8) \times 2 \div 1024^{3} \approx 232\ \text{GB/s}$
(about 250 GB/s in decimal units, matching the official spec).
Official HW spec details:
http://www.nvidia.com/object/tesla-servers.html
COMPILE & RUN CUDA
Directly compile to executable code.
GPU and CPU code are compiled and linked separately.
# compile the source code to an executable file
$ nvcc a.cu -o a.out
COMPILE & RUN CUDA
The nvcc compiler translates CUDA source code into the Parallel Thread Execution (PTX) language in an intermediate phase.
# keep all intermediate-phase files
$ nvcc a.cu -keep
# or
$ nvcc a.cu --save-temps
$ nvcc a.cu -keep
$ ls
a.cpp1.ii  a.cpp4.ii    a.cudafe1.c    a.cudafe1.stub.c  a.cudafe2.stub.c  a.hash       a
a.cpp2.i   a.cu         a.cudafe1.cpp  a.cudafe2.c       a.fatbin          a.module_id  a
a.cpp3.i   a.cu.cpp.ii  a.cudafe1.gpu  a.cudafe2.gpu     a.fatbin.c        a.o          a
# clean all intermediate-phase files
$ nvcc a.cu -keep -clean
USEFUL NVCC USAGE
Print code generation statistics:
$ nvcc -Xptxas -v reduce.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6reducePiS_' for 'sm_10'
ptxas info : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]
-Xptxas
--ptxas-options
Specify options directly to ptxas, the PTX optimizing assembler.
Register count: it should stay below the number of available registers; otherwise the excess registers are spilled to (off-chip) local memory.
smem stands for shared memory.
cmem stands for constant memory. Bank #1 of constant memory holds 4 bytes of constant variables here.
CUDA TOOLS
cuda-memcheck: functional correctness checking suite.
nvidia-smi: NVIDIA System Management Interface.
CUDA-MEMCHECK
This tool checks your program for the following memory errors, and it also reports hardware exceptions encountered by the GPU. These errors may not crash the program, but they can cause unexpected behavior and memory misuse.

Table. Memcheck reported error types

| Name | Description | Location | Precision |
| --- | --- | --- | --- |
| Memory access error | Errors due to out-of-bounds or misaligned accesses to memory by a global, local, shared, or global atomic access. | Device | Precise |
| Hardware exception | Errors reported by the hardware error-reporting mechanism. | Device | Imprecise |
| Malloc/free errors | Errors that occur due to incorrect use of malloc()/free() in CUDA kernels. | Device | Precise |
| CUDA API errors | Reported when a CUDA API call in the application returns a failure. | Host | Precise |
| cudaMalloc memory leaks | Allocations of device memory using cudaMalloc() that have not been freed by the application. | Host | Precise |
| Device heap memory leaks | Allocations of device memory using malloc() in device code that have not been freed by the application. | Device | Imprecise |
CUDA-MEMCHECK EXAMPLE
Program with a double-free fault:

int main(int argc, char *argv[])
{
    const int elemNum = 1024;
    int h_data[elemNum];
    int *d_data;
    initArray(h_data);
    int arraySize = elemNum * sizeof(int);
    cudaMalloc((void **)&d_data, arraySize);
    incrOneForAll<<<1, 1024>>>(d_data);
    cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_data); // fault: d_data is freed twice
    printArray(h_data);
    return 0;
}
CUDA-MEMCHECK EXAMPLE
$ nvcc -g -G example.cu
$ cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Program hit error 17 on CUDA API call to cudaFree
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so [0x26d660]
=========     Host Frame:./a.out [0x42af6]
=========     Host Frame:./a.out [0x2a29]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./a.out [0x2769]
=========
No error is shown if the program is run directly, but cuda-memcheck can detect the error.
NVIDIA SYSTEM MANAGEMENT INTERFACE (NVIDIA-SMI)
Purpose: query and modify the state of GPU devices.
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 0000:0B:00.0     Off |                    0 |
| N/A   35C    P0    60W / 235W |      84MB /  5759MB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         On   | 0000:85:00.0     Off |                    0 |
| N/A   39C    P0    60W / 235W |      14MB /  5759MB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     33736  ./RS                                             69MB       |
+-----------------------------------------------------------------------------+
NVIDIA-SMI
You can query more specific information on temperature, memory, power, etc.:
$ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]
For example:
$ nvidia-smi -q -d POWER
============== NVSMI LOG ==============
Timestamp                   :
Driver Version              : 319.37
Attached GPUs               : 2
GPU 0000:0B:00.0
    Power Readings
        Power Management    : Supported
        Power Draw          : 60.71 W
        Power Limit         : 235.00 W
        Default Power Limit : 235.00 W
        Enforced Power Limit: 235.00 W
        Min Power Limit     : 150.00 W
        Max Power Limit     : 235.00 W
GPU 0000:85:00.0
    Power Readings
        Power Management    : Supported
        Power Draw          : 31.38 W
        Power Limit         : 235.00 W
        Default Power Limit : 235.00 W
LAB ASSIGNMENTS
1. Program #1: increase each element in an array by one.
   (You are required to rewrite a CPU program into a CUDA one.)
2. Program #2: use parallel reduction to calculate the sum of all the elements in an array.
   (You are required to fill in the blanks of a template CUDA program, and report your GPU bandwidth to the TA after you finish each assignment.)
   1. SUM CUDA programming with "multi-kernel and shared memory"
   2. SUM CUDA programming with "interleaved addressing"
   3. SUM CUDA programming with "sequential addressing"
   4. SUM CUDA programming with "first add during load"
0.2 points per task.
LAB ASSIGNMENT #1
Rewrite the following CPU function as a CUDA kernel function and complete the main function by yourself:

// increase every element by one
void incrOneForAll(int *array, const int elemNum)
{
    int i;
    for (i = 0; i < elemNum; ++i)
    {
        array[i]++;
    }
}
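One possible CUDA rewrite looks like the following (a sketch, not the only valid answer; the 1-D launch with 256 threads per block and the initialization in main are illustrative choices):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread increments one element; the bounds check handles
// a grid that is larger than the array.
__global__ void incrOneForAll(int *array, const int elemNum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elemNum)
        array[i]++;
}

int main(void)
{
    const int elemNum = 1024;
    int h_data[elemNum];
    for (int i = 0; i < elemNum; ++i) h_data[i] = i;

    int *d_data;
    int arraySize = elemNum * sizeof(int);
    cudaMalloc((void **)&d_data, arraySize);
    cudaMemcpy(d_data, h_data, arraySize, cudaMemcpyHostToDevice);

    // round up the grid size so every element is covered
    incrOneForAll<<<(elemNum + 255) / 256, 256>>>(d_data);

    cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    printf("h_data[0] = %d, h_data[%d] = %d\n", h_data[0], elemNum - 1, h_data[elemNum - 1]);
    return 0;
}
```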
LAB ASSIGNMENT #2
Fill in the CUDA kernel function. Part of the main function is given; you are required to fill in the blanks according to the comments:

__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    // TODO: load the content of global memory into shared memory
    // NOTE: synchronize all the threads after this step
    // TODO: sum calculation
    // NOTE: synchronize all the threads after each iteration
    // TODO: write the result back to the corresponding entry of global memory
    // NOTE: only one thread is enough to do the job
}

// parameters for the first kernel
// TODO: set grid and block size
// threadNum = ?
// blockNum = ?
int sMemSize = 1024 * sizeof(int);
reduce<<<blockNum, threadNum, sMemSize>>>(d_idata, d_odata);

Hint: with the "first add during global load" optimization (Assignment #2-4), the third kernel is unnecessary.
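For orientation, a filled-in version of this template using the naïve interleaved addressing scheme might look as follows (a sketch of the technique, not the graded solution; each block writes one partial sum, so the kernel is relaunched until one value remains):

```cuda
#include <cuda_runtime.h>

// Each block reduces blockDim.x elements of g_idata into one
// partial sum, written to g_odata[blockIdx.x].
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = g_idata[i];   // load global memory into shared memory
    __syncthreads();           // all loads must finish before summing

    // interleaved addressing: the stride doubles each iteration
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();       // synchronize after each iteration
    }

    if (tid == 0)              // one thread writes the block's result
        g_odata[blockIdx.x] = sdata[0];
}
```

Each kernel launch shrinks the data by a factor of blockDim.x, which is exactly why the multi-kernel scheme below is needed to synchronize across blocks.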
LAB ASSIGNMENT #2
Given $2^{22}$ ints and a maximum block size of $2^{10}$ threads,
how do we use 3 kernels to synchronize between iterations?
LAB ASSIGNMENT #2-1
Implement the naïve data-parallel version as follows:
LAB ASSIGNMENT #2-2
Reduce the number of active warps in your program:
LAB ASSIGNMENT #2-3
Prevent shared-memory bank conflicts:
LAB ASSIGNMENT #2-4
Reduce the number of blocks in each kernel:
Notice:
Only 2 kernels are needed in this case, because each kernel can now process twice the amount of data as before.
Global memory should be accessed in a sequential-addressing way.
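The "first add during load" idea can be sketched like this (an illustrative variant, assuming sequential addressing; each thread sums two elements while loading, so each block covers twice as much input and half as many blocks are needed):

```cuda
#include <cuda_runtime.h>

// Sequential addressing with "first add during load":
// each block covers 2 * blockDim.x input elements.
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // first add during the global load: two reads, one addition
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // sequential addressing: the stride halves each iteration,
    // so shared-memory accesses are bank-conflict-free
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
```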
CUDA PROGRAMMING TIPS
KERNEL LAUNCH
mykernel<<<gridSize, blockSize, sMemSize, streamID>>>(args);
gridSize: number of blocks per grid
blockSize: number of threads per block
sMemSize [optional]: shared memory size (in bytes)
streamID [optional]: stream ID; the default is 0
BUILT-IN VARIABLES FOR INDEXING IN A KERNEL FUNCTION
blockIdx.x, blockIdx.y, blockIdx.z: block index
threadIdx.x, threadIdx.y, threadIdx.z: thread index
gridDim.x, gridDim.y, gridDim.z: grid size (number of blocks per grid) per dimension
blockDim.x, blockDim.y, blockDim.z: block size (number of threads per block) per dimension
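Combining these, a unique global thread index for a 1-D launch is computed as follows (a common pattern, shown here with an illustrative kernel name; device-side printf requires compute capability 2.0+, which the K20X satisfies):

```cuda
#include <cstdio>

// Each thread derives its global index from the built-in variables.
__global__ void whoAmI(void)
{
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int totalThreads = gridDim.x * blockDim.x;
    if (globalIdx == 0)
        printf("thread 0 of a grid of %d threads\n", totalThreads);
}

int main(void)
{
    whoAmI<<<4, 256>>>();     // 4 blocks x 256 threads = 1024 threads
    cudaDeviceSynchronize();  // flush device-side printf output
    return 0;
}
```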
CUDAMEMCPY
cudaError_t cudaMemcpy(void *dst,
                       const void *src,
                       size_t count,
                       enum cudaMemcpyKind kind)
Enumerator:
cudaMemcpyHostToHost: Host -> Host
cudaMemcpyHostToDevice: Host -> Device
cudaMemcpyDeviceToHost: Device -> Host
cudaMemcpyDeviceToDevice: Device -> Device
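A typical host-to-device round trip using two of these kinds (a minimal sketch; the buffer size and contents are arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 4;
    int h_src[n] = {1, 2, 3, 4};
    int h_dst[n] = {0};

    int *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(int));

    // host -> device, then device -> host
    cudaMemcpy(d_buf, h_src, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(h_dst, d_buf, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);

    printf("%d %d %d %d\n", h_dst[0], h_dst[1], h_dst[2], h_dst[3]);
    return 0;
}
```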
SYNCHRONIZATION
__syncthreads(): synchronizes all threads in a block (used inside a kernel function).
cudaDeviceSynchronize(): blocks until the device has completed all preceding requested tasks (used between two kernel launches).

kernel1<<<gridSize, blockSize>>>(args);
cudaDeviceSynchronize();
kernel2<<<gridSize, blockSize>>>(args);
HOW TO MEASURE KERNEL EXECUTION TIME USING CUDA GPU TIMERS
Methods:
cudaEventCreate(): initialize a timer
cudaEventDestroy(): destroy a timer
cudaEventRecord(): set a timer
cudaEventSynchronize(): sync the timer after each kernel call
cudaEventElapsedTime(): returns the elapsed time in milliseconds
Example:

cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid, threads>>>(d_idata, d_odata);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
REFERENCES
1. NVIDIA CUDA Runtime API
2. Programming Guide :: CUDA Toolkit Documentation
3. Best Practices Guide :: CUDA Toolkit Documentation
4. NVCC :: CUDA Toolkit Documentation
5. CUDA-MEMCHECK :: CUDA Toolkit Documentation
6. nvidia-smi documentation
7. CUDA error types
THE END
ENJOY CUDA & HAPPY NEW YEAR!

Contenu connexe

Tendances

ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
Teddy Hsiung
 
various tricks for remote linux exploits  by Seok-Ha Lee (wh1ant)
various tricks for remote linux exploits  by Seok-Ha Lee (wh1ant)various tricks for remote linux exploits  by Seok-Ha Lee (wh1ant)
various tricks for remote linux exploits  by Seok-Ha Lee (wh1ant)
CODE BLUE
 
OOP for Hardware Verification--Demystified!
OOP for Hardware Verification--Demystified! OOP for Hardware Verification--Demystified!
OOP for Hardware Verification--Demystified!
DVClub
 

Tendances (20)

Kernel Recipes 2015: Representing device-tree peripherals in ACPI
Kernel Recipes 2015: Representing device-tree peripherals in ACPIKernel Recipes 2015: Representing device-tree peripherals in ACPI
Kernel Recipes 2015: Representing device-tree peripherals in ACPI
 
Npc14
Npc14Npc14
Npc14
 
Static analysis of C++ source code
Static analysis of C++ source codeStatic analysis of C++ source code
Static analysis of C++ source code
 
Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)
 
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
ExperiencesSharingOnEmbeddedSystemDevelopment_20160321
 
Kotlin coroutine - the next step for RxJava developer?
Kotlin coroutine - the next step for RxJava developer?Kotlin coroutine - the next step for RxJava developer?
Kotlin coroutine - the next step for RxJava developer?
 
Linux Timer device driver
Linux Timer device driverLinux Timer device driver
Linux Timer device driver
 
various tricks for remote linux exploits  by Seok-Ha Lee (wh1ant)
various tricks for remote linux exploits  by Seok-Ha Lee (wh1ant)various tricks for remote linux exploits  by Seok-Ha Lee (wh1ant)
various tricks for remote linux exploits  by Seok-Ha Lee (wh1ant)
 
Making a Process
Making a ProcessMaking a Process
Making a Process
 
Down to Stack Traces, up from Heap Dumps
Down to Stack Traces, up from Heap DumpsDown to Stack Traces, up from Heap Dumps
Down to Stack Traces, up from Heap Dumps
 
Development of hardware-based Elements for GStreamer 1.0: A case study (GStre...
Development of hardware-based Elements for GStreamer 1.0: A case study (GStre...Development of hardware-based Elements for GStreamer 1.0: A case study (GStre...
Development of hardware-based Elements for GStreamer 1.0: A case study (GStre...
 
OOP for Hardware Verification--Demystified!
OOP for Hardware Verification--Demystified! OOP for Hardware Verification--Demystified!
OOP for Hardware Verification--Demystified!
 
Book
BookBook
Book
 
Workflow story: Theory versus Practice in large enterprises by Marcin Piebiak
Workflow story: Theory versus Practice in large enterprises by Marcin PiebiakWorkflow story: Theory versus Practice in large enterprises by Marcin Piebiak
Workflow story: Theory versus Practice in large enterprises by Marcin Piebiak
 
Workflow story: Theory versus practice in Large Enterprises
Workflow story: Theory versus practice in Large EnterprisesWorkflow story: Theory versus practice in Large Enterprises
Workflow story: Theory versus practice in Large Enterprises
 
計算機性能の限界点とその考え方
計算機性能の限界点とその考え方計算機性能の限界点とその考え方
計算機性能の限界点とその考え方
 
The Ring programming language version 1.8 book - Part 54 of 202
The Ring programming language version 1.8 book - Part 54 of 202The Ring programming language version 1.8 book - Part 54 of 202
The Ring programming language version 1.8 book - Part 54 of 202
 
Accelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAccelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL Generation
 
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
 
Visual Studio를 이용한 어셈블리어 학습 part 2
Visual Studio를 이용한 어셈블리어 학습 part 2Visual Studio를 이용한 어셈블리어 학습 part 2
Visual Studio를 이용한 어셈블리어 학습 part 2
 

En vedette

Pedagogizacja rodziców
Pedagogizacja rodzicówPedagogizacja rodziców
Pedagogizacja rodziców
siwonas
 
my presentation of the paper "FAST'12 NCCloud"
my presentation of the paper "FAST'12 NCCloud"my presentation of the paper "FAST'12 NCCloud"
my presentation of the paper "FAST'12 NCCloud"
Shuai Yuan
 
Poradnik gimnazjalisty
Poradnik gimnazjalistyPoradnik gimnazjalisty
Poradnik gimnazjalisty
siwonas
 
Prezentacja typologia form zachowania sie bezrobotnych
Prezentacja   typologia form zachowania sie bezrobotnychPrezentacja   typologia form zachowania sie bezrobotnych
Prezentacja typologia form zachowania sie bezrobotnych
siwonas
 
Bezrobocie jako problem społeczny
Bezrobocie jako problem społecznyBezrobocie jako problem społeczny
Bezrobocie jako problem społeczny
siwonas
 
Bilans kompetencji w poradnictwie zawodowym 1
Bilans kompetencji w poradnictwie zawodowym 1Bilans kompetencji w poradnictwie zawodowym 1
Bilans kompetencji w poradnictwie zawodowym 1
siwonas
 
11 writing pp elaboration examples
11 writing pp elaboration examples11 writing pp elaboration examples
11 writing pp elaboration examples
kartia79
 

En vedette (14)

Marcas (flora y fauna en chimborazo)
Marcas  (flora y fauna en chimborazo)Marcas  (flora y fauna en chimborazo)
Marcas (flora y fauna en chimborazo)
 
Pedagogizacja rodziców
Pedagogizacja rodzicówPedagogizacja rodziców
Pedagogizacja rodziców
 
Soccer maddie
Soccer maddieSoccer maddie
Soccer maddie
 
14776451 reacciones-redox
14776451 reacciones-redox14776451 reacciones-redox
14776451 reacciones-redox
 
my presentation of the paper "FAST'12 NCCloud"
my presentation of the paper "FAST'12 NCCloud"my presentation of the paper "FAST'12 NCCloud"
my presentation of the paper "FAST'12 NCCloud"
 
Poradnik gimnazjalisty
Poradnik gimnazjalistyPoradnik gimnazjalisty
Poradnik gimnazjalisty
 
Prezentacja typologia form zachowania sie bezrobotnych
Prezentacja   typologia form zachowania sie bezrobotnychPrezentacja   typologia form zachowania sie bezrobotnych
Prezentacja typologia form zachowania sie bezrobotnych
 
Bezrobocie jako problem społeczny
Bezrobocie jako problem społecznyBezrobocie jako problem społeczny
Bezrobocie jako problem społeczny
 
Bilans kompetencji w poradnictwie zawodowym 1
Bilans kompetencji w poradnictwie zawodowym 1Bilans kompetencji w poradnictwie zawodowym 1
Bilans kompetencji w poradnictwie zawodowym 1
 
11 writing pp elaboration examples
11 writing pp elaboration examples11 writing pp elaboration examples
11 writing pp elaboration examples
 
Sony vegas pro 11 manual de inicio rápido
Sony vegas pro 11   manual de inicio rápidoSony vegas pro 11   manual de inicio rápido
Sony vegas pro 11 manual de inicio rápido
 
اتجاهات المعلمين نحو استخدام التعليم الالكتروني
اتجاهات المعلمين نحو استخدام التعليم الالكترونياتجاهات المعلمين نحو استخدام التعليم الالكتروني
اتجاهات المعلمين نحو استخدام التعليم الالكتروني
 
Hsp pkjr
Hsp pkjrHsp pkjr
Hsp pkjr
 
Bo rang claim perjalanan
Bo rang claim perjalananBo rang claim perjalanan
Bo rang claim perjalanan
 

Similaire à CUDA lab's slides of "parallel programming" course

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Tema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdf
pepe464163
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
mouhouioui
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
HSA Foundation
 
CUDA by Example : CUDA C on Multiple GPUs : Notes
CUDA by Example : CUDA C on Multiple GPUs : NotesCUDA by Example : CUDA C on Multiple GPUs : Notes
CUDA by Example : CUDA C on Multiple GPUs : Notes
Subhajit Sahu
 

Similaire à CUDA lab's slides of "parallel programming" course (20)

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Tema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdf
 
Roll your own toy unix clone os
Roll your own toy unix clone osRoll your own toy unix clone os
Roll your own toy unix clone os
 
Java gpu computing
Java gpu computingJava gpu computing
Java gpu computing
 
introduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedintroduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely used
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
 
Programar para GPUs
Programar para GPUsProgramar para GPUs
Programar para GPUs
 
CUDA by Example : CUDA C on Multiple GPUs : Notes
CUDA by Example : CUDA C on Multiple GPUs : NotesCUDA by Example : CUDA C on Multiple GPUs : Notes
CUDA by Example : CUDA C on Multiple GPUs : Notes
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011
 
Intro2 Cuda Moayad
Intro2 Cuda MoayadIntro2 Cuda Moayad
Intro2 Cuda Moayad
 
Cuda debugger
Cuda debuggerCuda debugger
Cuda debugger
 
Lecture 04
Lecture 04Lecture 04
Lecture 04
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloud
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
C++ amp on linux
C++ amp on linuxC++ amp on linux
C++ amp on linux
 

Dernier

valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
nirzagarg
 
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
nirzagarg
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
nilamkumrai
 
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
@Chandigarh #call #Girls 9053900678 @Call #Girls in @Punjab 9053900678
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
imonikaupta
 
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 

Dernier (20)

Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
 
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Salem Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
 
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
 
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
 
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
 
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 

CUDA lab's slides of "parallel programming" course

  • 2. OVERVIEW Programming Environment Compile &Run CUDAprogram CUDATools Lab Tasks CUDAProgramming Tips References
  • 3. GPU SERVER IntelE5-2670 V2 10Cores CPUX 2 NVIDIAK20X GPGPUCARD X 2
  • 4. Command to getyour GPGPUHW spec: $/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Device0:"TeslaK20Xm" CUDADriverVersion/RuntimeVersion 5.5/5.5 CUDACapabilityMajor/Minorversionnumber: 3.5 Totalamountofglobalmemory: 5760MBytes(6039339008bytes) (14)Multiprocessors,(192)CUDACores/MP: 2688CUDACores GPUClockrate: 732MHz(0.73GHz) MemoryClockrate: 2600Mhz MemoryBusWidth: 384-bit L2CacheSize: 1572864bytes Totalamountofconstantmemory: 65536bytes Totalamountofsharedmemoryperblock: 49152bytes Totalnumberofregistersavailableperblock:65536 Warpsize: 32 Maximumnumberofthreadspermultiprocessor: 2048 Maximumnumberofthreadsperblock: 1024 Maxdimensionsizeofathreadblock(x,y,z):(1024,1024,64) Maxdimensionsizeofagridsize (x,y,z):(2147483647,65535,65535) theoreticalmemorybandwidth:$2600 times 10^{6} times (384 /8) times 2 ÷ 1024^3 = 243 GB/s$ OfficialHW Spec details: http://www.nvidia.com/object/tesla-servers.html
  • 5. COMPILE & RUN CUDA Directlycompile to executable code GPUand CPUcode are compiled and linked separately #compilethesourcecodetoexecutablefile $nvcca.cu-oa.out
  • 6. COMPILE & RUN CUDA The nvcc compiler willtranslate CUDAsource code into Parallel Thread Execution (PTX) language in the intermediate phase. #keepallintermediatephasefiles $nvcca.cu-keep #or $nvcca.cu-save-temps $nvcca.cu-keep $ls a.cpp1.ii a.cpp4.ii a.cudafe1.c a.cudafe1.stub.c a.cudafe2.stub.c a.hash a a.cpp2.i a.cu a.cudafe1.cpp a.cudafe2.c a.fatbin a.module_id a a.cpp3.i a.cu.cpp.ii a.cudafe1.gpu a.cudafe2.gpu a.fatbin.c a.o a #cleanallintermediatephasefiles $nvcca.cu-keep-clean
  • 7. USEFUL NVCC USAGE — Print code generation statistics:
    $ nvcc -Xptxas -v reduce.cu
    ptxas info : 0 bytes gmem
    ptxas info : Compiling entry function '_Z6reducePiS_' for 'sm_10'
    ptxas info : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]
    -Xptxas / --ptxas-options: pass options directly to the PTX optimizing assembler (ptxas).
    Register count: should be less than the number of available registers; otherwise the remaining registers are spilled to local memory (off-chip).
    smem stands for shared memory; cmem stands for constant memory. Here, constant memory bank #1 stores 4 bytes of constant variables.
  • 8. CUDA TOOLS — cuda-memcheck: functional correctness checking suite. nvidia-smi: NVIDIA System Management Interface.
  • 9. CUDA-MEMCHECK — This tool checks the following memory errors in your program, and it also reports hardware exceptions encountered by the GPU. These errors may not crash the program, but they can cause unexpected behavior and memory misuse.
    Table: Memcheck reported error types (Name — Description — Location — Precision)
    Memory access error — out-of-bounds or misaligned accesses to memory by a global, local, shared or global atomic access — Device — Precise
    Hardware exception — errors reported by the hardware error reporting mechanism — Device — Imprecise
    Malloc/Free errors — incorrect use of malloc()/free() in CUDA kernels — Device — Precise
    CUDA API errors — reported when a CUDA API call in the application returns a failure — Host — Precise
    cudaMalloc memory leaks — device memory allocated with cudaMalloc() that has not been freed by the application — Host — Precise
    Device heap memory leaks — device memory allocated with malloc() in device code that has not been freed by the application — Device — Imprecise
  • 10. CUDA-MEMCHECK EXAMPLE — Program with a double-free fault:
    int main(int argc, char *argv[])
    {
        const int elemNum = 1024;
        int h_data[elemNum];
        int *d_data;
        initArray(h_data);
        int arraySize = elemNum * sizeof(int);
        cudaMalloc((void **)&d_data, arraySize);
        incrOneForAll<<<1, 1024>>>(d_data);
        cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        cudaFree(d_data); // fault: d_data is freed twice
        printArray(h_data);
        return 0;
    }
  • 11. CUDA-MEMCHECK EXAMPLE
    $ nvcc -g -G example.cu
    $ cuda-memcheck ./a.out
    ========= CUDA-MEMCHECK
    ========= Program hit error 17 on CUDA API call to cudaFree
    =========     Saved host backtrace up to driver entry point at error
    =========     Host Frame: /usr/lib64/libcuda.so [0x26d660]
    =========     Host Frame: ./a.out [0x42af6]
    =========     Host Frame: ./a.out [0x2a29]
    =========     Host Frame: /lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
    =========     Host Frame: ./a.out [0x2769]
    =========
    No error is shown if the program is run directly, but cuda-memcheck can detect the error.
  • 12. NVIDIA SYSTEM MANAGEMENT INTERFACE (NVIDIA-SMI) — Purpose: query and modify GPU devices' state.
    $ nvidia-smi
    +------------------------------------------------------+
    | NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K20Xm          On  | 0000:0B:00.0     Off |                    0 |
    | N/A   35C    P0    60W / 235W |      84MB /  5759MB  |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K20Xm          On  | 0000:85:00.0     Off |                    0 |
    | N/A   39C    P0    60W / 235W |      14MB /  5759MB  |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Compute processes:                                               GPU Memory |
    |  GPU       PID  Process name                                     Usage      |
    |=============================================================================|
    |    0     33736  ./RS                                             69MB       |
    +-----------------------------------------------------------------------------+
  • 13. NVIDIA-SMI — You can query more specific information on temperature, memory, power, etc.
    $ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]
    For example:
    $ nvidia-smi -q -d POWER
    ============== NVSMI LOG ==============
    Timestamp :
    Driver Version : 319.37
    Attached GPUs : 2
    GPU 0000:0B:00.0
        Power Readings
            Power Management : Supported
            Power Draw : 60.71 W
            Power Limit : 235.00 W
            Default Power Limit : 235.00 W
            Enforced Power Limit : 235.00 W
            Min Power Limit : 150.00 W
            Max Power Limit : 235.00 W
    GPU 0000:85:00.0
        Power Readings
            Power Management : Supported
            Power Draw : 31.38 W
            Power Limit : 235.00 W
            Default Power Limit : 235.00 W
  • 14. LAB ASSIGNMENTS
    1. Program #1: increase each element in an array by one. (You are required to rewrite a CPU program into a CUDA one.)
    2. Program #2: use parallel reduction to calculate the sum of all the elements in an array. (You are required to fill in the blanks of a template CUDA program, and report your GPU bandwidth to the TA after you finish each assignment.)
       1. SUM CUDA programming with "multi-kernel and shared memory"
       2. SUM CUDA programming with "interleaved addressing"
       3. SUM CUDA programming with "sequential addressing"
       4. SUM CUDA programming with "first add during load"
    0.2 points per task.
  • 15. LABS ASSIGNMENT #1 — Rewrite the following CPU function into a CUDA kernel function and complete the main function by yourself:
    // increase every element by one
    void incrOneForAll(int *array, const int elemNum)
    {
        int i;
        for (i = 0; i < elemNum; ++i) {
            array[i]++;
        }
    }
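One possible shape of the CUDA rewrite (a sketch, not the official solution — the launch configuration below is an assumption): each thread handles exactly one element, with a bounds guard for the last, possibly partial, block.

```cuda
// increase every element by one: one thread per element
__global__ void incrOneForAll(int *array, const int elemNum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elemNum)     // guard: the last block may be partially full
        array[i]++;
}

// A hypothetical launch covering elemNum elements:
//   int blockSize = 256;
//   int gridSize  = (elemNum + blockSize - 1) / blockSize;  // round up
//   incrOneForAll<<<gridSize, blockSize>>>(d_array, elemNum);
```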
  • 16. LABS ASSIGNMENT #2 — Fill in the CUDA kernel function. Part of the main function is given; you are required to fill in the blanks according to the comments:
    __global__ void reduce(int *g_idata, int *g_odata)
    {
        extern __shared__ int sdata[];
        // TODO: load the content of global memory into shared memory
        // NOTE: synchronize all the threads after this step
        // TODO: sum calculation
        // NOTE: synchronize all the threads after each iteration
        // TODO: write back the result into the corresponding entry of global memory
        // NOTE: only one thread is enough to do the job
    }
    // parameters for the first kernel
    // TODO: set grid and block size
    // threadNum = ?
    // blockNum = ?
    int sMemSize = 1024 * sizeof(int);
    reduce<<<blockNum, threadNum, sMemSize>>>(d_idata, d_odata);
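For orientation, here is a hedged sketch of one way the template could be filled in, using the simple interleaved-addressing scheme; it is not the official lab solution, and the optimized variants in the later assignments improve on it.

```cuda
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    // load one element per thread from global to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // tree-based sum in shared memory, interleaved addressing
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();   // all threads must finish each round
    }

    // thread 0 writes this block's partial sum back to global memory
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
```

Each kernel launch reduces every block of inputs to one partial sum, so the host launches the kernel again on the partial sums until a single value remains.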
  • 17. LABS ASSIGNMENT #2 — Given $2^{22}$ ints, and a maximum block size of $2^{10}$ threads: how to use 3 kernels to synchronize between iterations? Hint: with the "first add during global load" optimization (Assignment #2-4), the third kernel is unnecessary.
  • 18. LABS ASSIGNMENT #2-1 — Implement the naïve data parallelism assignment as follows:
  • 19. LABS ASSIGNMENT #2-2 — Reduce the number of active warps of your program:
  • 20. LABS ASSIGNMENT #2-3 — Prevent shared memory bank conflicts:
  • 21. LABS ASSIGNMENT #2-4 — Reduce the number of blocks in each kernel. Notice: only 2 kernels are needed in this case, because each kernel can now process twice the amount of data it did before. Global memory should be accessed in a sequential addressing way.
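A sketch of how the "first add during load" step might look (an assumption about the intended shape of Assignment #2-4, not the official solution): each thread reads two elements and adds them while filling shared memory, so only half as many blocks are launched.

```cuda
__global__ void reduceFirstAdd(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;

    // each block now covers 2 * blockDim.x consecutive elements;
    // the first add happens during the global load itself
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // sequential-addressing tree sum, as in Assignment #2-3
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
```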
  • 23. KERNEL LAUNCH
    mykernel<<<gridSize, blockSize, sMemSize, streamID>>>(args);
    gridSize: number of blocks per grid
    blockSize: number of threads per block
    sMemSize [optional]: shared memory size (in bytes)
    streamID [optional]: stream ID, default is 0
  • 24. BUILT-IN VARIABLES FOR INDEXING IN A KERNEL FUNCTION
    blockIdx.x, blockIdx.y, blockIdx.z: block index
    threadIdx.x, threadIdx.y, threadIdx.z: thread index
    gridDim.x, gridDim.y, gridDim.z: grid size (number of blocks per grid) per dimension
    blockDim.x, blockDim.y, blockDim.z: block size (number of threads per block) per dimension
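As a sketch, the usual pattern for turning these built-ins into a unique global index (the kernel names here are illustrative only):

```cuda
__global__ void indexDemo1D(int *out)
{
    // 1-D launch: each thread's unique global index
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = gid;
}

__global__ void indexDemo2D(int *out)
{
    // 2-D launch: compute (x, y), then flatten row-major
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int width = gridDim.x * blockDim.x;   // total threads per row
    out[y * width + x] = y * width + x;
}
```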
  • 26. SYNCHRONIZATION
    __syncthreads(): synchronizes all threads in a block (used inside the kernel function).
    cudaDeviceSynchronize(): blocks until the device has completed all preceding requested tasks (used between two kernel launches).
    kernel1<<<gridSize, blockSize>>>(args);
    cudaDeviceSynchronize();
    kernel2<<<gridSize, blockSize>>>(args);
  • 27. HOW TO MEASURE KERNEL EXECUTION TIME USING CUDA GPU TIMERS
    Methods:
    cudaEventCreate(): init timer
    cudaEventDestroy(): destroy timer
    cudaEventRecord(): set timer
    cudaEventSynchronize(): sync timer after each kernel call
    cudaEventElapsedTime(): returns the elapsed time in milliseconds
  • 28. HOW TO MEASURE KERNEL EXECUTION TIME USING CUDA GPU TIMERS — Example:
    cudaEvent_t start, stop;
    float time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    kernel<<<grid, threads>>>(d_idata, d_odata);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
  • 29. REFERENCES
    1. NVIDIA CUDA Runtime API :: CUDA Toolkit Documentation
    2. Programming Guide :: CUDA Toolkit Documentation
    3. Best Practices Guide :: CUDA Toolkit Documentation
    4. NVCC :: CUDA Toolkit Documentation
    5. CUDA-MEMCHECK :: CUDA Toolkit Documentation
    6. nvidia-smi documentation
    7. CUDA error types
  • 30. THE END ENJOY CUDA & HAPPY NEW YEAR!