GPU programming

Roberto Bonvallet

CPU and GPU architectures

[Figure: block diagrams of a CPU and a GPU; labels: Control, ALU, Cache, DRAM]

Task and data parallelism

Task parallelism:
distributed processing
distributed memory
message passing
Data parallelism:
same instruction on different data
shared memory

Thread and memory hierarchies

Thread hierarchy:
grid of blocks
blocks of threads
Memory hierarchy:
global memory (large, slow)
shared memory (per-block, small, fast)
registers (per-thread, small, fast)
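
The two hierarchies can be seen together in one small kernel. The following is a minimal sketch, not taken from the slides: the names `block_sum`, `sum_on_gpu` and `BLOCK_SIZE` are hypothetical, and the block size of 256 is an assumption. Each block of threads stages its slice of the input from global memory into shared memory, reduces it there, and writes one partial sum back to global memory; the per-thread index `i` lives in a register.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define BLOCK_SIZE 256  /* threads per block (assumed; must be a power of two here) */

/* Each block sums BLOCK_SIZE consecutive elements of `in` into partial[blockIdx.x].
   Must be launched with blockDim.x == BLOCK_SIZE. */
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float tile[BLOCK_SIZE];               /* shared memory: one copy per block */

    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* register: private to this thread */
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;      /* global -> shared */
    __syncthreads();

    /* Tree reduction inside the block, entirely in shared memory. */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];               /* shared -> global */
}

/* Hypothetical host wrapper: sums n > 0 floats. Returns 0 on success, -1 on error. */
int sum_on_gpu(const float *host_in, int n, float *result)
{
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;  /* grid of blocks */
    float *dev_in = NULL, *dev_partial = NULL;
    float *host_partial = (float *) malloc(blocks * sizeof(float));
    if (host_partial == NULL)
        return -1;

    int ok = cudaMalloc((void **) &dev_in, n * sizeof(float)) == cudaSuccess
          && cudaMalloc((void **) &dev_partial, blocks * sizeof(float)) == cudaSuccess
          && cudaMemcpy(dev_in, host_in, n * sizeof(float),
                        cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_sum<<<blocks, BLOCK_SIZE>>>(dev_in, dev_partial, n);
        ok = cudaMemcpy(host_partial, dev_partial, blocks * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    if (ok) {
        *result = 0.0f;                              /* add the per-block sums on the host */
        for (int b = 0; b < blocks; ++b)
            *result += host_partial[b];
    }

    cudaFree(dev_in);
    cudaFree(dev_partial);
    free(host_partial);
    return ok ? 0 : -1;
}
```

Finishing the reduction on the host keeps the sketch short; a second kernel pass over the partial sums would be the usual alternative for large inputs.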

Matrix-matrix multiplication

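Element form:
$c_{ij} = \sum_k a_{ik} b_{kj}$

Block form, where $C_{ij}$, $A_{ik}$ and $B_{kj}$ are submatrices:
$C_{ij} = \sum_k A_{ik} B_{kj}$

Multiplication kernel:
    initialize element of $C_{ij}$ to 0
    for each k:
        fetch element of $A_{ik}$, $B_{kj}$ into shared memory
        synchronize
        compute element of $C_{ij} = C_{ij} + A_{ik} B_{kj}$
        synchronize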
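
The kernel outlined above can be sketched in CUDA as follows. This is an illustration under stated assumptions, not the author's code: square row-major n × n matrices, a tile width `TILE` of 16, and the hypothetical names `matmul_tiled` and `matmul_on_gpu`. Each thread block computes one tile of C, and each thread accumulates one element of that tile.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

#define TILE 16  /* tile width (assumed); one block computes a TILE x TILE block of C */

/* C = A * B for row-major n x n matrices. Launch with TILE x TILE threads per block. */
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                                  /* initialize element of C_ij */

    for (int k = 0; k < (n + TILE - 1) / TILE; ++k) {  /* for each tile index k */
        int a_col = k * TILE + threadIdx.x;
        int b_row = k * TILE + threadIdx.y;

        /* Fetch one element of A_ik and B_kj into shared memory (zero-pad at the edges). */
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();                               /* both tiles are fully loaded */

        for (int t = 0; t < TILE; ++t)                 /* C_ij = C_ij + A_ik B_kj */
            acc += As[threadIdx.y][t] * Bs[t][threadIdx.x];
        __syncthreads();                               /* finish reading before the next load */
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}

/* Hypothetical host wrapper. Returns 0 on success, -1 on a CUDA error. */
int matmul_on_gpu(const float *A, const float *B, float *C, int n)
{
    size_t bytes = (size_t) n * n * sizeof(float);
    float *dA = NULL, *dB = NULL, *dC = NULL;

    int ok = cudaMalloc((void **) &dA, bytes) == cudaSuccess
          && cudaMalloc((void **) &dB, bytes) == cudaSuccess
          && cudaMalloc((void **) &dC, bytes) == cudaSuccess
          && cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice) == cudaSuccess
          && cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok ? 0 : -1;
}
```

The second synchronization matters as much as the first: without it, a fast thread could overwrite a tile that slower threads in the same block are still reading.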

Nvidia C1060

Core clock                          602 MHz
Multiprocessors                     30
Thread processors                   240 = 30 × 8
Memory size                         4 GB
Memory bandwidth                    102.4 GB/s
Single-precision peak performance   933.12 Gflop/s
Double-precision peak performance   77.76 Gflop/s
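
The two peak figures can be reproduced with a little arithmetic, under assumptions the slide does not state: a processor (shader) clock of 1.296 GHz, distinct from the 602 MHz core clock listed; 3 single-precision flops per thread processor per cycle (a multiply-add plus a multiply); and one double-precision unit per multiprocessor performing a fused multiply-add (2 flops) per cycle.

```latex
\begin{align*}
\text{single precision:}\quad
  & 240 \times 3\ \tfrac{\text{flop}}{\text{cycle}} \times 1.296\ \text{GHz}
    = 933.12\ \text{Gflop/s} \\
\text{double precision:}\quad
  & 30 \times 2\ \tfrac{\text{flop}}{\text{cycle}} \times 1.296\ \text{GHz}
    = 77.76\ \text{Gflop/s}
\end{align*}
```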

CUDA programming

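Array allocation and copying
/* dev_p: device pointer, host_p: host pointer */
cudaMalloc((void **) &dev_p, mem_size);

/* cudaMemcpy takes (destination, source, size, direction) */
cudaMemcpy(dev_p, host_p, mem_size,
           cudaMemcpyHostToDevice);

[...]

cudaMemcpy(host_p, dev_p, mem_size,
           cudaMemcpyDeviceToHost);

cudaFree(dev_p);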

Kernel definition
__global__ void vector_sum(float *a, float *b, float *c)
{
    /* global index of this thread across all blocks */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

Kernel launch
vector_sum<<<grid_size, block_size, sh_mem_size>>>(a, b, c);
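
Putting allocation, copying, kernel definition and launch together gives a complete vector sum. The sketch below makes a few assumptions that are not in the slides: the kernel takes an extra length parameter `n` so that surplus threads in the last block do nothing, the block size is 256, and the host wrapper `vector_sum_on_gpu` is a hypothetical name. The third launch parameter is omitted because no dynamic shared memory is needed, so it defaults to 0.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* One thread per element; the bounds check guards the last, partially filled block. */
__global__ void vector_sum(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

/* Hypothetical host wrapper: c = a + b for n floats.
   Returns 0 on success, -1 on a CUDA error. */
int vector_sum_on_gpu(const float *a, const float *b, float *c, int n)
{
    size_t mem_size = (size_t) n * sizeof(float);
    float *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;

    int ok = cudaMalloc((void **) &dev_a, mem_size) == cudaSuccess
          && cudaMalloc((void **) &dev_b, mem_size) == cudaSuccess
          && cudaMalloc((void **) &dev_c, mem_size) == cudaSuccess
          && cudaMemcpy(dev_a, a, mem_size, cudaMemcpyHostToDevice) == cudaSuccess
          && cudaMemcpy(dev_b, b, mem_size, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int block_size = 256;                                /* threads per block (assumed) */
        int grid_size = (n + block_size - 1) / block_size;   /* enough blocks to cover n */
        vector_sum<<<grid_size, block_size>>>(dev_a, dev_b, dev_c, n);
        ok = cudaGetLastError() == cudaSuccess               /* launch configuration errors */
          && cudaMemcpy(c, dev_c, mem_size, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return ok ? 0 : -1;
}
```

Rounding the grid size up and checking `i < n` in the kernel is a common way to handle lengths that are not a multiple of the block size.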

Vortex Methods

Fluid discretized as vortices $(x, y, \alpha)$

Vortex interaction:
$K(x, y) = -\frac{1}{2\pi \lVert \mathbf{x} \rVert^2} (-y, x)$

Biot-Savart law:
$\mathbf{u}(\mathbf{x}) = \sum_p \alpha_p K(\mathbf{x} - \mathbf{x}_p)$
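
The Biot-Savart sum maps naturally onto one GPU thread per evaluation point. The sketch below is an assumption-laden illustration rather than the author's implementation: it reads the interaction kernel as K(x, y) = −(−y, x) / (2π‖x‖²), evaluates the velocity at the vortex positions themselves, skips the singular self-interaction term, and reads every vortex straight from global memory. The names `velocity`, `to_device` and `velocity_on_gpu` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

#define PI_F 3.14159265358979f

/* Thread i computes (u[i], v[i]) = sum over p of alpha[p] * K(x_i - x_p). */
__global__ void velocity(const float *x, const float *y, const float *alpha,
                         float *u, float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float ux = 0.0f, uy = 0.0f;
    for (int p = 0; p < n; ++p) {
        float dx = x[i] - x[p];
        float dy = y[i] - y[p];
        float r2 = dx * dx + dy * dy;
        if (r2 > 0.0f) {                       /* skip coincident points (assumption) */
            float s = -alpha[p] / (2.0f * PI_F * r2);
            ux += s * (-dy);                   /* K = -(1 / (2 pi r^2)) * (-dy, dx) */
            uy += s * dx;
        }
    }
    u[i] = ux;
    v[i] = uy;
}

/* Hypothetical helper: allocate a device array and copy `host` into it; NULL on failure. */
static float *to_device(const float *host, size_t bytes)
{
    float *dev = NULL;
    if (cudaMalloc((void **) &dev, bytes) != cudaSuccess)
        return NULL;
    if (cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return NULL;
    }
    return dev;
}

/* Hypothetical host wrapper. Returns 0 on success, -1 on a CUDA error. */
int velocity_on_gpu(const float *x, const float *y, const float *alpha,
                    float *u, float *v, int n)
{
    size_t bytes = (size_t) n * sizeof(float);
    float *d_x = to_device(x, bytes);
    float *d_y = to_device(y, bytes);
    float *d_alpha = to_device(alpha, bytes);
    float *d_u = NULL, *d_v = NULL;

    int ok = d_x && d_y && d_alpha
          && cudaMalloc((void **) &d_u, bytes) == cudaSuccess
          && cudaMalloc((void **) &d_v, bytes) == cudaSuccess;
    if (ok) {
        int block_size = 128;                               /* assumed */
        int grid_size = (n + block_size - 1) / block_size;
        velocity<<<grid_size, block_size>>>(d_x, d_y, d_alpha, d_u, d_v, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(u, d_u, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(v, d_v, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_alpha);
    cudaFree(d_u); cudaFree(d_v);
    return ok ? 0 : -1;
}
```

A tuned version would likely stage blocks of vortices through shared memory, since every thread reads the same (x, y, α) triples; that optimization is left out here to keep the direct sum readable.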

1. GPU Programming. Roberto Bonvallet, Departamento de Informática, Universidad Técnica Federico Santa María. June 2010.

2. CPU vs GPU peak performance

6. Nvidia Tesla architecture

25. GPU velocity evaluation
