“GPU With CUDA Architecture”
Presented By-
Dhaval Kaneria (13014061010)
Guided By-
Mr. Rajesh k Navandar
Table Of Contents
• Introduction of GPU
• Performance Factors Of GPU
• GPU Pipeline
• Block Diagram Of Pipeline Process Flow
• Introduction Of CUDA
• Thread Batching
• Simple Processing Flow
• CUDA C/C++
• Applications
• The Future Scope Of CUDA Technology
• Conclusion
• References
2
Introduction of GPU
• A Graphics Processing Unit (GPU) is a microprocessor that has been designed
specifically for the processing of 3D graphics.
• The processor is built with integrated transform, lighting, triangle setup/clipping,
and rendering engines, capable of handling millions of math-intensive processes
per second.
• GPUs form the heart of modern graphics cards, relieving the CPU (central
processing unit) of much of the graphics processing load. GPUs allow products
such as desktop PCs, portable computers, and game consoles to process real-time
3D graphics that only a few years ago were available only on high-end workstations.
• Used primarily for 3D applications, a graphics processing unit is a single-chip
processor that creates lighting effects and transforms objects every time a 3D
scene is redrawn. These are mathematically intensive tasks that would otherwise
put quite a strain on the CPU. Lifting this burden from the CPU frees up
cycles that can be used for other jobs.
3
Performance Factors Of GPU
• Fill Rate:
It is defined as the number of pixels or texels (textured pixels) rendered per second by the
GPU onto memory. It shows the true power of the GPU. Modern GPUs have fill rates as
high as 3.2 billion pixels per second. The fill rate of a GPU can be increased by raising its
clock speed (see the worked example after this list).
• Memory Bandwidth:
It is the data transfer speed between the graphics chip and its local frame buffer. More
bandwidth usually gives better performance when the image to be rendered is of high quality
and at very high resolution.
• Memory Management:
The performance of the GPU also depends on how efficiently its memory is managed,
because memory bandwidth can become the main bottleneck if it is not managed properly.
• Hidden Surface Removal:
A technique for reducing overdraw when rendering a scene by not rendering
surfaces that are not visible. This helps greatly in increasing GPU performance.
4
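As a worked illustration of fill rate with hypothetical numbers (not taken from the slides): a GPU with an 800 MHz pixel clock and 4 pixel pipelines has a theoretical fill rate of

    800 MHz × 4 pixels per clock = 3.2 billion pixels per second,

which matches the figure quoted above; raising the clock raises the theoretical fill rate proportionally.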
GPU Pipeline
• The GPU receives geometry information from the CPU as an input and provides a
picture as an output
• The host interface is the communication bridge between the CPU and the GPU
• It receives commands from the CPU and also pulls geometry information from
system memory.
• It outputs a stream of vertices in object space with all their associated information
(normals, texture coordinates, per-vertex color, etc.)
• The vertex processing stage receives vertices from the host interface in object
space and outputs them in screen space
• This may be a simple linear transformation, or a complex operation involving
morphing effects
[Pipeline diagram: host interface → vertex processing → triangle setup → pixel processing → memory interface]
Cont..
• A fragment is generated if and only if its center is inside the triangle
• Every fragment generated has its attributes computed as the perspective-correct
interpolation of the three vertices that make up the triangle
• Each fragment provided by triangle setup is fed into fragment processing as a set of
attributes (position, normal, texture coordinates, etc.), which are used to compute the
final color for this pixel
• Before the final write occurs, some fragments are rejected by the z-buffer, stencil and
alpha tests
6
Block Diagram Of Pipeline Process Flow
7
Cont..
• Vertex processing: a shader is applied to each vertex for transformation and other
per-vertex operations
• The vertex shader can also fetch texture data
• Cull/clip: per-primitive operations and data preparation for rasterization
• Rasterization: primitive-to-pixel mapping
• Z culling: quick pixel elimination based on depth
• Fragment: a candidate pixel; there is a varying number of pixel pipelines
• SIMD processing hides texture fetch latency
8
Introduction Of CUDA
9
•CUDA (Compute Unified Device Architecture) is a parallel computing platform and
programming model implemented on graphics processing units.
CUDA Programming Model:
A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
 Is a coprocessor to the CPU or host
 Has its own DRAM (device memory)
 Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels which
run in parallel on many threads
• Differences between GPU and CPU threads
 GPU threads are extremely lightweight
 Very little creation overhead
 GPU needs 1000s of threads for full efficiency
 Multi-core CPU needs only a few
Thread Batching: Grids and Blocks
•A kernel is executed as a grid of thread
blocks
–All threads share data memory
space
•A thread block is a batch of threads that
can cooperate with each other by:
–Synchronizing their execution
•For hazard-free shared memory
accesses
–Efficiently sharing data through a
low latency shared memory
•Two threads from two different blocks
cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; Grid 1 is a 3×2 array of blocks, and Block (1, 1) is expanded into a 5×3 array of threads. Courtesy: NVIDIA]
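As a minimal, hypothetical sketch (not from the slides) of how the threads of one block cooperate through low-latency shared memory and barrier synchronization, a kernel that reverses one 128-element tile per block could look like this:

__global__ void reverseTile(int *data)              // illustrative sketch; assumes blockDim.x == 128
{
    __shared__ int tile[128];                       // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global element
    tile[threadIdx.x] = data[i];                    // each thread loads one element
    __syncthreads();                                // barrier: the whole tile is now loaded
    data[i] = tile[blockDim.x - 1 - threadIdx.x];   // safe to read a neighbour's element
}
// launched e.g. as reverseTile<<<numTiles, 128>>>(devData);
// threads in different blocks never see each other's tile, matching the last bullet above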
Block and Thread IDs
•Threads and blocks have IDs
–So each thread can decide what data to work on
–Block ID: 1D or 2D
–Thread ID: 1D, 2D, or 3D
•Simplifies memory addressing when processing multidimensional data
–Image processing
–Solving PDEs on volumes
[Figure: the device executes Grid 1 as a 3×2 array of blocks; Block (1, 1) is expanded into a 5×3 array of threads, each with its own (x, y) thread ID. Courtesy: NVIDIA]
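A minimal sketch (hypothetical kernel, not from the slides) of how 2D block and thread IDs simplify addressing in image processing:

__global__ void invert(unsigned char *img, int width, int height)   // illustrative sketch
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column from the 2D IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row from the 2D IDs
    if (x < width && y < height)                     // guard threads that fall outside the image
        img[y * width + x] = 255 - img[y * width + x];
}
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// invert<<<grid, block>>>(devImg, width, height);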
CUDA Device Memory Space Overview
•Each thread can:
–R/W per-thread registers
–R/W per-thread local memory
–R/W per-block shared memory
–R/W per-grid global memory
–Read only per-grid constant memory
–Read only per-grid texture memory
[Figure: device memory spaces — the grid-level global, constant and texture memories; per-block shared memory; and per-thread registers and local memory.]
•The host can R/W global, constant, and texture memories
Global, Constant, and Texture Memories
•Global memory
–The main means of communicating R/W data between host and device
–Contents visible to all threads
•Texture and constant memories
–Constants initialized by the host
–Contents visible to all threads
[Figure: the same device memory diagram — global, constant and texture memories at grid scope, shared memory per block, registers and local memory per thread, all connected to the host. Courtesy: NVIDIA]
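A minimal sketch (hypothetical names such as coeff and setup, not from the slides; compile with nvcc) of the host initializing constant and global memory that every thread of the grid can then read:

__constant__ float coeff[16];                        // per-grid constant memory, read-only on the device

void setup(const float *hostCoeff, const float *hostData, float **devData, size_t n)
{
    cudaMemcpyToSymbol(coeff, hostCoeff, 16 * sizeof(float));       // host writes constant memory
    cudaMalloc((void **)devData, n * sizeof(float));                // allocate per-grid global memory
    cudaMemcpy(*devData, hostData, n * sizeof(float),
               cudaMemcpyHostToDevice);                             // host writes global memory
}
// afterwards, any thread of any block may read coeff[] and the array behind *devData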
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. The CPU instructs the GPU to process the data
3. Load the GPU program and execute it, caching data on chip for performance
4. Copy results from GPU memory back to CPU memory
15
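A minimal sketch of the four steps above, built around a hypothetical square() kernel (the kernel and its names are illustrative, not from the slides; compile with nvcc):

__global__ void square(float *d, int n)              // illustrative kernel: squares each element
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

void run(float *h, int n)
{
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // 1. copy input to GPU memory
    square<<<(n + 255) / 256, 256>>>(d, n);                        // 2-3. CPU instructs GPU to run the program
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // 4. copy results back to CPU memory
    cudaFree(d);
}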
CUDA C/C++
16
• CUDA Language:
C with Minimal Extensions
• Philosophy: provide minimal set of extensions necessary to expose power
• Declaration specifiers to indicate where things live
__global__ void KernelFunc(...); // kernel function, runs on device
__device__ int GlobalVar; // variable in device memory
__shared__ int SharedVar; // variable in per-block shared memory
• Extend function invocation syntax for parallel kernel launch
KernelFunc<<<500, 128>>>(...); // launch 500 blocks w/ 128 threads each
• Special variables for thread identification in kernels
dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;
• Intrinsics that expose specific operations in kernel code
__syncthreads(); // barrier synchronization within kernel
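To see all four built-in variables together, here is a hedged sketch (not from the slides) of a grid-stride loop, a common idiom when the data is larger than the launched grid:

__global__ void scale(float *d, float k, int n)      // illustrative sketch
{
    int stride = gridDim.x * blockDim.x;             // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
         i < n;
         i += stride)                                // then jump one whole grid ahead
        d[i] *= k;
}
// scale<<<128, 256>>>(devData, 2.0f, n);   // 32768 threads cover any n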
Applications
17
•Military (lots)
•Mine planning
•Molecular dynamics
•MRI reconstruction
•Network processing
•Neural network
•Protein folding
•Quantum chemistry
•Ray tracing
•Radar
•Reservoir simulation
•Robotic vision/AI
•Robotic surgery
•Satellite data analysis
•Seismic imaging
•Surgery simulation
•3D image analysis
•Adaptive radiation therapy
•Astronomy
•Automobile vision
•Bioinformatics
•Biological simulation
•Broadcast
•Computational Fluid Dynamics
•Computer Vision
•Cryptography
•CT reconstruction
•Data Mining
•Electromagnetic simulation
•Equity trading
•Financial - lots of areas
•Mathematics research
Simulation Result
18
•If the CUDA software is installed and configured correctly, the output of the deviceQuery sample should look similar to the screenshot shown on this slide
19
•Valid results from the bandwidthTest CUDA sample are shown on this slide
20
• Create an array of size BLOCKS, allocate space for the array on the device, and call
generateArray<<<BLOCKS,1>>>( deviceArray );
•The kernel then runs in BLOCKS parallel blocks, creating the entire array in one call.
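A hedged sketch of what such a generateArray kernel could look like (the slides do not show its body; this version simply stores each block's index, and the host-side calls are the assumed usage):

#define BLOCKS 64                                    // illustrative value

__global__ void generateArray(int *deviceArray)
{
    deviceArray[blockIdx.x] = blockIdx.x;            // the single thread of each block fills one element
}

// int hostArray[BLOCKS]; int *deviceArray;
// cudaMalloc((void **)&deviceArray, BLOCKS * sizeof(int));
// generateArray<<<BLOCKS, 1>>>(deviceArray);
// cudaMemcpy(hostArray, deviceArray, BLOCKS * sizeof(int), cudaMemcpyDeviceToHost);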
The Future Scope Of CUDA Technology
• Currently, most research focuses on general-purpose computation on the GPU. Because
GPUs offer highly efficient and flexible parallel programmability, a growing number of
researchers and business organizations have started using GPUs for non-graphical
calculations, creating a new field of study: GPGPU (General-Purpose computation on
GPU), whose objective is to use the GPU for more extensive scientific computing. GPGPU
has been successfully used in algebra, fluid simulation, database applications, spectrum
analysis, and other non-graphical applications
• Region-based Software Virtual Memory (RSVM): a software virtual memory running
on both the CPU and the GPU in a distributed and cooperative way.
• Size reduction
• Cooling technique
21
Conclusion
• CUDA is a powerful parallel programming model
Heterogeneous - mixed serial-parallel programming
Scalable - hierarchical thread execution model
Accessible - minimal but expressive changes to C
• CUDA on GPUs can achieve great results on data parallel computations with a
few simple performance optimization strategies:
• Structure your application and select execution configurations to maximize
exploitation of the GPU’s parallel capabilities.
• Minimize CPU ↔ GPU data transfers.
• Coalesce global memory accesses.
• Take advantage of shared memory.
• Minimize divergent warps.
• Minimize use of low-throughput instructions.
22
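As one hedged illustration of the "coalesce global memory accesses" advice (hypothetical kernels, not from the slides): consecutive threads should touch consecutive addresses.

__global__ void copyGood(float *out, const float *in, int n)        // coalesced: thread i touches element i
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

__global__ void copyBad(float *out, const float *in, int n, int stride)  // strided: many more memory transactions
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = in[i * stride];
}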
References
1. Xiao Yang, Shamik K. Valia, Michael J. Schulte, Ruby B. Lee, "Exploration and Evaluation
of PLX Floating-point Instructions and Implementations for 3D Graphics", IEEE, 2004
2. Lei Wang, Yong-zhong Huang, Xin Chen, Chun-yan Zhang, "Task Scheduling of Parallel
Processing in CPU-GPU Collaborative Environment", CSIT, 2008
3. Feng Ji, Heshan Lin, Xiaosong Ma, "RSVM: a Region-based Software Virtual Memory
for GPU", IEEE, 2013
4. "CUDA_Architecture_Overview" by Nathan Whitehead, Alex Fit-Florea, NVIDIA Corporation
5. "CUDA C/C++ Basics" by Cyril Zeller, NVIDIA Corporation
6. "Optimizing Parallel Reduction in CUDA" by Mark Harris, NVIDIA Developer Technology
23
Thank-You
24