CUDA – AN INTRODUCTION
Raymond Tay
CUDA - What and Why
    CUDA™ is a C/C++ SDK developed by Nvidia, released worldwide in 2006 for
     the GeForce™ 8800 graphics card. The CUDA 4.0 SDK was released in 2011.
    CUDA allows HPC developers and researchers to model complex problems and
     achieve up to 100x performance gains.
Nvidia GPUs FLOPS
    FLOPS – floating-point operations per second, a measure of how many
     floating-point operations a GPU can perform. More is better.

                                                    [Chart: GPUs beat CPUs]
Nvidia GPUs Memory Bandwidth
    With massively parallel processors in Nvidia’s GPUs, providing high memory
     bandwidth plays a big role in high performance computing.


                                                  GPUs beat CPUs
GPU vs CPU

CPU                                  GPU
-   Optimised for low-latency        -   Optimised for data-parallel,
    access to cached data sets           throughput computation
-   Control logic for out-of-order   -   Architecture tolerant of
    and speculative execution            memory latency
                                     -   More transistors dedicated to
                                         computation
I don’t know C/C++, should I leave?

    Relax, there is no need to fret.


                Your Brain Asks:
                Wait a minute, why should I learn
                the C/C++ SDK?

                CUDA Answers:
                Efficiency!!!
I’ve heard about OpenCL. What is it?

    CUDA C – entry point for developers who prefer high-level C
    OpenCL – entry point for developers who want a low-level API
    Both share the same back-end compiler and optimization technology
What do I need to begin with CUDA?

    An Nvidia CUDA-enabled graphics card, e.g. one based on Fermi
How does CUDA work?

                [Diagram: CPU and GPU memories connected via the PCI bus]

1.  Copy input data from CPU memory to GPU memory
2.  Load the GPU program and execute it,
    caching data on chip for performance
3.  Copy the results from GPU memory back to CPU memory
CUDA Kernels: Subdivide into Blocks




    Threads are grouped into blocks
    Blocks are grouped into a grid
    A kernel is executed as a grid of blocks of threads
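The hierarchy above can be sketched as a minimal launch; the kernel name, array size, and block size here are made up for illustration:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles one array element.
__global__ void add_one(int *data, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // unique thread ID
    if (tid < n)                                      // guard the tail block
        data[tid] += 1;
}

int main()
{
    const int n = 1000;
    int *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(int));

    dim3 dimBlock(256);                    // threads per block
    dim3 dimGrid((n + 255) / 256);         // enough blocks to cover n elements
    add_one<<<dimGrid, dimBlock>>>(d_data, n);  // launch a grid of blocks of threads

    cudaFree(d_data);
    return 0;
}
```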
Transparent Scalability – G80

    [Diagram: a grid of blocks 1–12; the G80 executes up to 8 blocks
     concurrently, so blocks 9–12 are queued behind blocks 1–8]

                Once the maximum number of blocks is executing on the GPU,
                blocks 9–12 will wait
Transparent Scalability – GT200

    [Diagram: the same grid of blocks 1–12; the GT200 executes all 12
     concurrently, leaving its remaining execution units idle]
Arrays of Parallel Threads
   ALL threads run the same kernel code
   Each thread has an ID that is used to compute
    addresses and make control decisions

Every block (Block 0 … Block (N-1), each with threads 0–7) runs the same
parallel code:

…
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

int shifted = input_array[tid] + shift_amount;
if ( shifted > alphabet_max )
  shifted = shifted % (alphabet_max + 1);

output_array[tid] = shifted;
…
Compiling a CUDA program

    [Diagram: a C/C++ CUDA application is compiled by NVCC into CPU code
     plus virtual PTX code; a PTX-to-target compiler then generates
     target code for a specific GPU (G80, …)]

    C/C++ CUDA source:        float4 me = gx[gtid];
                              me.x += me.y * me.z;

    Parallel Thread eXecution (PTX)‫‏‬
      –  Virtual machine and ISA
      –  Programming model
      –  Execution resources and state

    Generated PTX:            ld.global.v4.f32   {$f1,$f3,$f5,$f7}, [$r9+0];
                              mad.f32            $f1, $f5, $f3, $f1;
Example: Block Cypher

CPU Program:

void host_shift_cypher(unsigned int *input_array,
    unsigned int *output_array, unsigned int shift_amount,
    unsigned int alphabet_max, unsigned int array_length)
{
  for (unsigned int i = 0; i < array_length; i++)
  {
    int element = input_array[i];
    int shifted = element + shift_amount;
    if (shifted > alphabet_max)
    {
      shifted = shifted % (alphabet_max + 1);
    }
    output_array[i] = shifted;
  }
}

int main() {
  host_shift_cypher(input_array, output_array,
      shift_amount, alphabet_max, array_length);
}

GPU Program:

__global__ void shift_cypher(unsigned int *input_array,
    unsigned int *output_array, unsigned int shift_amount,
    unsigned int alphabet_max, unsigned int array_length)
{
  unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
  int shifted = input_array[tid] + shift_amount;
  if (shifted > alphabet_max)
    shifted = shifted % (alphabet_max + 1);
  output_array[tid] = shifted;
}

int main() {
  dim3 dimGrid(ceil((float)array_length / block_size));
  dim3 dimBlock(block_size);
  shift_cypher<<<dimGrid, dimBlock>>>(input_array,
      output_array, shift_amount, alphabet_max, array_length);
}
I see some WEIRD syntax… is it still C?

  CUDA C is an extension of C
  <<< Dg, Db, Ns, S >>> is the execution
   configuration for the call to a __global__ function; it defines
   the dimensions of the grid and blocks that will be used
     (the dynamically allocated shared memory size and stream are optional)
  __global__ declares a function as a kernel, which is
   executed on the GPU and callable from the host
   only. The call is asynchronous.
  See the CUDA C Programming Guide.
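A minimal sketch of the full execution configuration with all four arguments spelled out; my_kernel and the dimensions are hypothetical:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel; the body is irrelevant to the launch syntax.
__global__ void my_kernel(float *data) { }

int main()
{
    float *d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));

    dim3 Dg(4, 4);                    // grid dimensions: 16 blocks
    dim3 Db(8, 8);                    // block dimensions: 64 threads per block
    size_t Ns = 64 * sizeof(float);   // dynamic shared memory per block (optional)
    cudaStream_t S;                   // stream (optional)
    cudaStreamCreate(&S);

    my_kernel<<<Dg, Db, Ns, S>>>(d_data);   // asynchronous launch
    cudaStreamSynchronize(S);               // wait here for completion

    cudaStreamDestroy(S);
    cudaFree(d_data);
    return 0;
}
```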
How does the CUDA Kernel get Data?

  Allocate CPU memory for n integers, e.g. malloc(…)
  Allocate GPU memory for n integers, e.g. cudaMalloc(…)
  Copy the CPU memory to GPU memory,
   e.g. cudaMemcpy(…, cudaMemcpyHostToDevice)
  Copy the GPU memory back to CPU memory once computation is done,
   e.g. cudaMemcpy(…, cudaMemcpyDeviceToHost)
  Free the GPU & CPU memory, e.g. cudaFree(…), free(…)
Example: Block Cypher (Host Code)
#include <stdio.h>

int main() {
 unsigned int num_bytes = sizeof(int) * (1 << 22);
 unsigned int *input_array = 0;
 unsigned int *output_array = 0;
…
 cudaMalloc((void**)&input_array, num_bytes);
 cudaMalloc((void**)&output_array, num_bytes);
 cudaMemcpy(input_array, host_input_array, num_bytes, cudaMemcpyHostToDevice);
…
 // the GPU computes the kernel; transfer the results out of the GPU to the host
 cudaMemcpy(host_output_array, output_array, num_bytes,
     cudaMemcpyDeviceToHost);
…
 // free the memory
 cudaFree(input_array);
 cudaFree(output_array);
}
Compiling the Block Cypher GPU Code

    nvcc is the compiler and should be accessible from
     your PATH variable. Also set the dynamic library load path:
        UNIX: $PATH, Win: %PATH%
        UNIX: $LD_LIBRARY_PATH / $DYLD_LIBRARY_PATH

    nvcc block-cypher.cu –arch=sm_12
        Compiles the GPU code for the GPU architecture sm_12
    nvcc –g –G block-cypher.cu –arch=sm_12
        Compiles the program such that both the CPU and GPU code
        carry debug information
Debugger

    CUDA-GDB
      • Based on GDB
      • Linux
      • Mac OS X

    Parallel Nsight
      • Plugin inside Visual Studio
Visual Profiler & Memcheck

    Visual Profiler
      •  Microsoft Windows
      •  Linux
      •  Mac OS X
      •  Analyze performance

    CUDA-MEMCHECK
      •  Microsoft Windows
      •  Linux
      •  Mac OS X
      •  Detect memory access errors
Hints
    Think about producing a serial algorithm that
     executes correctly on a CPU
    Think about deriving a parallel (CUDA/OpenCL)
     algorithm from that serial algorithm
    Obtain an initial run time (call it the gold standard)
        Use the profiler to profile this initial run (typically it’s
         quite bad )
    Fine-tune your code to take advantage of shared memory, improve
     memory coalescing, reduce shared memory conflicts, etc.
     (Consult the best practices guide & SDK)
        Use the profiler to conduct cross comparisons
Hints (Not exhaustive!)
    Be aware of the trade-offs when your kernel becomes
     too complicated:
        If you notice the kernel has a lot of local (thread) variables,
         e.g. int i, float j : register spilling
        If you notice the run time is still slow EVEN AFTER you’ve
         used shared memory, re-assess the memory access patterns :
         shared memory conflicts
        TRY to reduce the number of conditionals, e.g. ifs : thread
         divergence
        TRY to unroll ANY loops in the kernel code, e.g. #pragma
         unroll n
        Don’t use thread blocks that are not a multiple of warpSize.
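The last two hints can be sketched together; scale and the unroll factor of 4 are illustrative only:

```cuda
// Hypothetical sketch: an unrolled per-thread loop, launched with a
// block size that is a multiple of warpSize (32).
__global__ void scale(float *out, const float *in, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    #pragma unroll 4               // ask nvcc to unroll 4 iterations
    for (int i = 0; i < 4; i++) {
        int idx = tid * 4 + i;     // each thread covers 4 elements
        if (idx < n)
            out[idx] = in[idx] * 2.0f;
    }
}

// Launch with e.g. 256 threads per block (a multiple of warpSize):
//   scale<<<(n/4 + 255) / 256, 256>>>(d_out, d_in, n);
```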
Other cool things in the CUDA SDK 4.0
    GPUDirect
    Unified Virtual Address Space
    Multi-GPU
         P2P Memory Access/Copy (gels with the UVA)
    Concurrent Execution
         Kernel + Data
         Streams, Events
    GPU Memories
         Shared, Texture, Surface, Constant, Registers, Portable,
          Write-combining, Page-locked/Pinned
    OpenGL, Direct3D interoperability
    Atomic functions, Fast Math Functions
    Dynamic Global Memory Allocation (in-kernel)
         Determine how much the device supports e.g. cudaDeviceGetLimit
         Set it before you launch the kernel e.g. cudaDeviceSetLimit
         Free it!
Additional Resources
    CUDA FAQ (http://tegradeveloper.nvidia.com/cuda-faq)
    CUDA Tools & Ecosystem (http://tegradeveloper.nvidia.com/cuda-tools-ecosystem)
    CUDA Downloads (http://tegradeveloper.nvidia.com/cuda-downloads)
    NVIDIA Forums (http://forums.nvidia.com/index.php?showforum=62)
    GPGPU (http://gpgpu.org )
    CUDA By Example (
     http://tegradeveloper.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-
     programming-0)
         Jason Sanders & Edward Kandrot
    GPU Computing Gems Emerald Edition (
     http://www.amazon.com/GPU-Computing-Gems-Emerald-Applications/dp/0123849888/ )
         Editor in Chief: Prof Hwu Wen-Mei
CUDA Libraries
  Visit http://developer.nvidia.com/cuda-tools-ecosystem#Libraries
  Thrust, CUFFT, CUBLAS, CUSP, NPP, OpenCV,
   GPU AI-Tree Search, GPU AI-Path Finding
  A lot of the libraries are hosted on Google Code.
   Many more gems in there too!
Questions?
THANK YOU
GPU memories: Shared

             More than 1 Tbyte/sec
              aggregate memory bandwidth
             Use it
                    As a cache
                    To reorganize global memory accesses into
                     coalesced pattern
                    To share data between threads

             16 kbytes per SM (Before Fermi)
             64 kbytes per SM (Fermi)
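A minimal sketch of the "share data between threads" use: each thread stages one element into shared memory so its neighbour can reuse the load instead of reading global memory twice. neighbour_sum is a made-up example:

```cuda
// Hypothetical sketch: stage global memory through shared memory.
__global__ void neighbour_sum(float *out, const float *in, int n)
{
    __shared__ float tile[256];               // one element per thread
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < n)
        tile[threadIdx.x] = in[tid];          // coalesced global load
    __syncthreads();                          // wait until the tile is full

    // Each thread reads its neighbour's value from fast shared memory
    // instead of issuing a second global load.
    if (tid < n - 1 && threadIdx.x < blockDim.x - 1)
        out[tid] = tile[threadIdx.x] + tile[threadIdx.x + 1];
}
```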
GPU memories: Texture

                  Texture is an object for reading data
                  Data is cached
                  Host actions
                         Allocate memory on GPU
                         Create a texture memory reference object
                         Bind the texture object to memory
                         Clean up after use
                  GPU actions
                          Fetch using texture references:
                           tex1Dfetch(), tex1D(), tex2D(), tex3D()
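The host and GPU actions above can be sketched with the texture reference API of the CUDA 4.0 era; read_tex and the buffer names are hypothetical:

```cuda
#include <cuda_runtime.h>

// Texture memory reference object (file scope, as this API requires).
texture<float, 1, cudaReadModeElementType> texRef;

// GPU action: fetch through the texture reference (cached reads).
__global__ void read_tex(float *out, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        out[tid] = tex1Dfetch(texRef, tid);
}

// Host actions, in the order listed on this slide:
//   cudaMalloc((void**)&d_in, n * sizeof(float));         // allocate
//   cudaBindTexture(0, texRef, d_in, n * sizeof(float));  // bind
//   read_tex<<<grid, block>>>(d_out, n);                  // fetch on GPU
//   cudaUnbindTexture(texRef);                            // clean up
```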
GPU memories: Constant

             Write by host, read by GPU
             Data is cached

             Useful for tables of constants

             64 kbytes
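A minimal sketch of host-written, GPU-read constant data, matching the "tables of constants" use above; lookup and the table contents are made up:

```cuda
#include <cuda_runtime.h>

// A small lookup table in cached constant memory (64 KB total available).
__constant__ float table[256];

// GPU side: every thread reads the (cached) constant table.
__global__ void lookup(float *out, const unsigned char *keys, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        out[tid] = table[keys[tid]];
}

int main()
{
    // Host side: write the table into constant memory before launching.
    float host_table[256];
    for (int i = 0; i < 256; i++)
        host_table[i] = (float)i * 0.5f;
    cudaMemcpyToSymbol(table, host_table, sizeof(host_table));
    return 0;
}
```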

TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Introduction to CUDA

  • 2. CUDA - What and Why   CUDA™ is a C/C++ SDK developed by Nvidia, released worldwide in 2006 for the GeForce™ 8800 graphics card; the CUDA 4.0 SDK was released in 2011.   CUDA allows HPC developers and researchers to model complex problems and achieve up to 100x speedups.
  • 3. Nvidia GPUs FLOPS   FLOPS – floating-point operations per second, a measure of how many floating-point operations a GPU can perform. More is better  GPUs beat CPUs
  • 4. Nvidia GPUs Memory Bandwidth   Nvidia's GPUs pair their massively parallel processors with high memory bandwidth, which plays a big role in high-performance computing. GPUs beat CPUs
  • 5. GPU vs CPU   CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution.   GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation.
  • 6. I don’t know C/C++, should I leave?   Relax, no worries. Not to fret. Your Brain Asks: Wait a minute, why should I learn the C/C++ SDK? CUDA Answers: Efficiency!!!
  • 7. I’ve heard about OpenCL. What is it? Entry point for developers who prefer high-level C Entry point for developers who want low-level API Shared back-end compiler and optimization technology
  • 8. What do I need to begin with CUDA?   An Nvidia CUDA-enabled graphics card, e.g. Fermi
  • 9. How does CUDA work PCI Bus 1.  Copy input data from CPU memory to GPU memory 2.  Load GPU program and execute, caching data on chip for performance 3.  Copy results from GPU memory to CPU memory
  • 10. CUDA Kernels: Subdivide into Blocks   Threads are grouped into blocks   Blocks are grouped into a grid   A kernel is executed as a grid of blocks of threads
  • 11. Transparent Scalability – G80   [Figure: a grid of 12 blocks scheduled on a G80; blocks 1–8 execute first, then blocks 9–12.]   As the maximum number of blocks is already executing on the GPU, blocks 9–12 wait.
  • 12. Transparent Scalability – GT200   [Figure: the same 12 blocks all execute concurrently on a GT200, leaving the remaining multiprocessors idle.]
  • 13. Arrays of Parallel Threads   ALL threads run the same kernel code   Each thread has an ID that's used to compute addresses & make control decisions   Block 0 … Block (N-1), threads 0–7 in each block, all executing:

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int shifted = input_array[tid] + shift_amount;
    if ( shifted > alphabet_max )
        shifted = shifted % (alphabet_max + 1);
    output_array[tid] = shifted;
  • 14. Compiling a CUDA program   C/C++ CUDA application → NVCC → CPU code + virtual PTX code → PTX-to-target compiler → GPU target code (G80, …).   Parallel Thread eXecution (PTX): a virtual machine and ISA that defines the programming model and the execution resources and state.   Example: the C fragment

    float4 me = gx[gtid];
    me.x += me.y * me.z;

  compiles to PTX such as

    ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32 $f1, $f5, $f3, $f1;
  • 15. Example: Block Cypher

  CPU Program:

    void host_shift_cypher(unsigned int *input_array, unsigned int *output_array,
                           unsigned int shift_amount, unsigned int alphabet_max,
                           unsigned int array_length)
    {
        for (unsigned int i = 0; i < array_length; i++)
        {
            int element = input_array[i];
            int shifted = element + shift_amount;
            if (shifted > alphabet_max) {
                shifted = shifted % (alphabet_max + 1);
            }
            output_array[i] = shifted;
        }
    }
    int main() {
        host_shift_cypher(input_array, output_array, shift_amount,
                          alphabet_max, array_length);
    }

  GPU Program:

    __global__ void shift_cypher(unsigned int *input_array, unsigned int *output_array,
                                 unsigned int shift_amount, unsigned int alphabet_max,
                                 unsigned int array_length)
    {
        unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int shifted = input_array[tid] + shift_amount;
        if (shifted > alphabet_max)
            shifted = shifted % (alphabet_max + 1);
        output_array[tid] = shifted;
    }
    int main() {
        dim3 dimGrid(ceil(array_length / (float)block_size));
        dim3 dimBlock(block_size);
        shift_cypher<<<dimGrid, dimBlock>>>(input_array, output_array,
                                            shift_amount, alphabet_max, array_length);
    }
  • 16. I see some WEIRD syntax..is it still C?   CUDA C is an extension of C   <<< Dg, Db, Ns, S>>> is the execution configuration for the call to __global__ ; defines the dimensions of the grid and blocks that’ll be used (dynamically allocated shared memory & stream is optional)   __global__ declares a function is a kernel which is executed on the GPU and callable from the host only. This call is asynchronous.   See the CUDA C Programming Guide.
  • 17. How does the CUDA Kernel get Data?   Allocate CPU memory for n integers e.g. malloc(…)   Allocate GPU memory for n integers e.g. cudaMalloc(…)   Copy the CPU memory to GPU memory for n integers e.g. cudaMemcpy(…, cudaMemcpyHostToDevice)   Copy the GPU memory to CPU once computation is done e.g. cudaMemcpy(…, cudaMemcpyDeviceToHost)   Free the GPU & CPU memory e.g. cudaFree(…)
  • 18. Example: Block Cypher (Host Code)

    #include <stdio.h>
    int main() {
        unsigned int num_bytes = sizeof(int) * (1 << 22);
        unsigned int *input_array = 0;
        unsigned int *output_array = 0;
        …
        cudaMalloc((void**)&input_array, num_bytes);
        cudaMalloc((void**)&output_array, num_bytes);
        cudaMemcpy(input_array, host_input_array, num_bytes, cudaMemcpyHostToDevice);
        …
        // the GPU computes the kernel, then the results are transferred back to the host
        cudaMemcpy(host_output_array, output_array, num_bytes, cudaMemcpyDeviceToHost);
        …
        // free the memory
        cudaFree(input_array);
        cudaFree(output_array);
    }
  • 19. Compiling the Block Cypher GPU Code   nvcc is the compiler and should be accessible from your PATH variable; also set the dynamic library load path   UNIX: $PATH, Win: %PATH%   UNIX: $LD_LIBRARY_PATH / $DYLD_LIBRARY_PATH   nvcc block-cypher.cu -arch=sm_12   Compiles the GPU code for the sm_12 architecture   nvcc -g -G block-cypher.cu -arch=sm_12   Compiles the program so that both the CPU code (-g) and the GPU code (-G) carry debug information
  • 20. Debugger CUDA-GDB • Based on GDB • Linux • Mac OS X Parallel Nsight • Plugin inside Visual Studio
  • 21. Visual Profiler & Memcheck Profiler •  Microsoft Windows •  Linux •  Mac OS X •  Analyze Performance CUDA-MEMCHECK •  Microsoft Windows •  Linux •  Mac OS X •  Detect memory access errors
  • 22. Hints   Think about producing a serial algorithm that executes correctly on a CPU   Then derive a parallel (CUDA/OpenCL) algorithm from that serial algorithm   Obtain an initial run time (call it the gold standard)   Use the profiler to profile this initial run (typically it's quite bad )   Fine-tune your code to take advantage of shared memory, improve memory coalescing, reduce shared-memory conflicts etc. (consult the best practices guide & SDK)   Use the profiler to conduct cross comparisons
  • 23. Hints (Not exhaustive!)   Be aware of the trade-offs when your kernel becomes too complicated:   If you notice the kernel has a lot of local (thread) variables e.g. int i, float j : register spilling   If you notice the run time is still slow EVEN AFTER you've used shared memory, re-assess the memory access patterns : shared-memory bank conflicts   TRY to reduce the number of conditionals e.g. if statements : thread divergence   TRY to unroll ANY loops in the kernel code e.g. #pragma unroll n   Don't use thread blocks that are not a multiple of warpSize.
  • 24. Other cool things in the CUDA SDK 4.0   GPUDirect   Unified Virtual Address Space   Multi-GPU   P2P Memory Access/Copy (gels with the UVA)   Concurrent Execution   Kernel + Data   Streams, Events   GPU Memories   Shared, Texture, Surface, Constant, Registers, Portable, Write-combining, Page-locked/ Pinned   OpenGL, Direct3D interoperability   Atomic functions, Fast Math Functions   Dynamic Global Memory Allocation (in-kernel)   Determine how much the device supports e.g. cudaDeviceGetLimit   Set it before you launch the kernel e.g. cudaDeviceSetLimit   Free it!
  • 25. Additional Resources   CUDA FAQ (http://tegradeveloper.nvidia.com/cuda-faq)   CUDA Tools & Ecosystem (http://tegradeveloper.nvidia.com/cuda-tools-ecosystem)   CUDA Downloads (http://tegradeveloper.nvidia.com/cuda-downloads)   NVIDIA Forums (http://forums.nvidia.com/index.php?showforum=62)   GPGPU (http://gpgpu.org )   CUDA By Example ( http://tegradeveloper.nvidia.com/content/cuda-example-introduction-general-purpose-gpu- programming-0)   Jason Sanders & Edward Kandrot   GPU Computing Gems Emerald Edition ( http://www.amazon.com/GPU-Computing-Gems-Emerald-Applications/dp/0123849888/ )   Editor in Chief: Prof Hwu Wen-Mei
  • 26. CUDA Libraries   Visit this site http://developer.nvidia.com/cuda-tools- ecosystem#Libraries   Thrust, CUFFT, CUBLAS, CUSP, NPP, OpenCV, GPU AI-Tree Search, GPU AI-Path Finding   A lot of the libraries are hosted in Google Code. Many more gems in there too!
  • 29. GPU memories: Shared   More than 1 Tbyte/sec aggregate memory bandwidth   Use it   As a cache   To reorganize global memory accesses into coalesced pattern   To share data between threads   16 kbytes per SM (Before Fermi)   64 kbytes per SM (Fermi)
  • 30. GPU memories: Texture   Texture is an object for reading data   Data is cached   Host actions   Allocate memory on GPU   Create a texture memory reference object   Bind the texture object to memory   Clean up after use   GPU actions   Fetch using texture references tex1Dfetch(), tex1D(), tex2D(), tex3D()
  • 31. GPU memories: Constant   Write by host, read by GPU   Data is cached   Useful for tables of constants   64 kbytes