GPU Programming

       Roberto Bonvallet
       Departamento de Informática
       Universidad Técnica Federico Santa María

       June 2010
CPU vs GPU peak performance
CPU and GPU architectures

    [Diagram: CPU (control unit, four ALUs, cache, DRAM) beside GPU (DRAM)]
Nvidia Tesla architecture
Task and data parallelism

    Task parallelism:
        distributed processing
        distributed memory
        message passing
    Data parallelism:
        same instruction on different data
        shared memory
Thread and memory hierarchies

    Thread hierarchy:
        grid of blocks
        blocks of threads
    Memory hierarchy:
        global memory (large, slow)
        shared memory (per-block, small, fast)
        registers (per-thread, small, fast)
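
As a rough illustration of how the two hierarchies meet in code, the sketch below sums an array one block at a time. The kernel name `block_sum`, the host helper `gpu_sum` and the block size of 256 are assumptions made for this example and do not come from the slides. Each thread holds its value in a register, each block reduces its values in shared memory, and the per-block results are written to global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // threads per block (assumed; must be a power of two)

// Each block sums BLOCK consecutive inputs and writes one partial sum.
__global__ void block_sum(const float *in, float *partial, int n) {
    __shared__ float tile[BLOCK];                    // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // position in the grid of blocks
    float v = (i < n) ? in[i] : 0.0f;                // register: private to this thread
    tile[threadIdx.x] = v;
    __syncthreads();                                 // wait until the whole tile is loaded
    for (int s = blockDim.x / 2; s > 0; s /= 2) {    // tree reduction in shared memory
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];  // result goes to global memory
}

// Hypothetical host helper: returns the sum of `data`.
float gpu_sum(const std::vector<float> &data) {
    int n = (int)data.size();
    if (n == 0) return 0.0f;
    int blocks = (n + BLOCK - 1) / BLOCK;            // enough blocks to cover n elements
    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<blocks, BLOCK>>>(d_in, d_partial, n);
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    float total = 0.0f;
    for (float p : partial) total += p;              // finish the sum on the host
    return total;
}
```

The final pass over the partial sums runs on the host only to keep the sketch short; a second kernel launch would be the usual alternative for large inputs.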
Matrix-matrix multiplication

    $c_{ij} = \sum_k a_{ik} b_{kj}$

    $C_{ij} = \sum_k A_{ik} B_{kj}$

    Multiplication kernel:
        initialize element of C_ij = 0
        for each k:
            fetch element of A_ik, B_kj into shared memory
            synchronize
            compute element of C_ij = C_ij + A_ik B_kj
            synchronize
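
The kernel outline above could be written in CUDA roughly as follows. This is a minimal sketch, not the author's implementation: the names `matmul_tiled` and `gpu_matmul`, the tile width of 16, and the restriction to square row-major matrices are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // tile width (assumed)

// C = A * B for square n x n row-major matrices; n need not be a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];   // tile of block A_ik
    __shared__ float Bs[TILE][TILE];   // tile of block B_kj
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                                   // initialize element of C_ij
    for (int k = 0; k < (n + TILE - 1) / TILE; ++k) {   // for each k
        int a_col = k * TILE + threadIdx.x;
        int b_row = k * TILE + threadIdx.y;
        // fetch one element of A_ik and B_kj into shared memory (zero outside the matrix)
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();                                // tiles fully loaded
        for (int t = 0; t < TILE; ++t)                  // C_ij = C_ij + A_ik B_kj
            acc += As[threadIdx.y][t] * Bs[t][threadIdx.x];
        __syncthreads();                                // done reading before the next fetch
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: multiplies two n x n matrices on the GPU.
std::vector<float> gpu_matmul(const std::vector<float> &A,
                              const std::vector<float> &B, int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    std::vector<float> C((size_t)n * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

The second synchronization matters as much as the first: without it, a fast thread could overwrite a tile that slower threads in the same block are still reading.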
Nvidia C1060

 Core clock                          602 MHz
 Multiprocessors                     30
 Thread processors                   240 = 30 × 8
 Memory size                         4 GB
 Memory bandwidth                    102.4 GB/s
 Single-precision peak performance   933.12 GFLOP/s
 Double-precision peak performance   77.76 GFLOP/s
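
The peak figures in the table can be reproduced with a short calculation. The inputs below are not on the slide and should be read as assumptions taken from Nvidia's published C1060 specifications: a 1.296 GHz processor (shader) clock, 3 single-precision operations per thread processor per cycle (a multiply-add issued together with a multiply), one double-precision multiply-add unit per multiprocessor, and a 512-bit memory interface running at 800 MHz with double data rate.

```latex
\begin{align*}
\text{single precision:}\quad & 240 \times 3\ \text{flop/cycle} \times 1.296\ \text{GHz} = 933.12\ \text{GFLOP/s} \\
\text{double precision:}\quad & 30 \times 2\ \text{flop/cycle} \times 1.296\ \text{GHz} = 77.76\ \text{GFLOP/s} \\
\text{memory bandwidth:}\quad & \frac{512\ \text{bit}}{8\ \text{bit/byte}} \times 2 \times 0.8\ \text{GHz} = 102.4\ \text{GB/s}
\end{align*}
```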
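CUDA programming

    Array allocation and copying
    // allocate mem_size bytes of device memory
    cudaMalloc((void **) &dev_p, mem_size);

    // cudaMemcpy(dst, src, count, kind): host to device
    cudaMemcpy(dev_p, host_p, mem_size,
               cudaMemcpyHostToDevice);

    [...]

    // device to host
    cudaMemcpy(host_p, dev_p, mem_size,
               cudaMemcpyDeviceToHost);

    cudaFree(dev_p);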
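
A fuller version of the allocate, copy, copy back, free sequence might look like the sketch below. `cudaMemcpy` takes the destination pointer first, then the source. The helper names `ok` and `round_trip` are hypothetical and exist only for this example. Every runtime call returns a `cudaError_t`, which the sketch checks instead of ignoring.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical helper: reports a failed runtime call; returns true on success.
static bool ok(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Copies `src` to the device and back again.
// Returns the round-tripped data, or an empty vector if any call failed.
std::vector<float> round_trip(const std::vector<float> &src) {
    size_t mem_size = src.size() * sizeof(float);
    std::vector<float> out(src.size());
    float *dev_p = nullptr;
    if (!ok(cudaMalloc((void **)&dev_p, mem_size), "cudaMalloc")) return {};
    bool good =
        ok(cudaMemcpy(dev_p, src.data(), mem_size, cudaMemcpyHostToDevice),
           "host-to-device copy") &&
        ok(cudaMemcpy(out.data(), dev_p, mem_size, cudaMemcpyDeviceToHost),
           "device-to-host copy");
    cudaFree(dev_p);   // release device memory on success and on failure
    return good ? out : std::vector<float>{};
}
```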
CUDA programming

    Kernel definition
    __global__ void
    vector_sum(float *a, float *b, float *c) {
        // one thread per element: global index of this thread
        int i = blockIdx.x * blockDim.x +
                   threadIdx.x;
        c[i] = a[i] + b[i];
    }

    Kernel launch
    vector_sum<<<grid_size, block_size,
                 sh_mem_size>>>(a, b, c);
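
The kernel on the slide has no bounds check, so it is only safe when grid_size × block_size equals the number of elements. The sketch below shows one common way to handle arbitrary lengths: pass the length in and round the grid size up. The names `vector_sum_n` and `gpu_vector_sum`, the extra parameter `n`, and the block size of 256 are assumptions for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Variant of vector_sum with an explicit length, so surplus threads do nothing.
__global__ void vector_sum_n(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host helper: returns a + b; assumes a.size() == b.size().
std::vector<float> gpu_vector_sum(const std::vector<float> &a,
                                  const std::vector<float> &b) {
    int n = (int)a.size();
    std::vector<float> c(n);
    if (n == 0) return c;                      // a grid of zero blocks is not a valid launch
    size_t bytes = n * sizeof(float);
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);
    int block_size = 256;
    int grid_size = (n + block_size - 1) / block_size;   // round up to cover all n elements
    vector_sum_n<<<grid_size, block_size>>>(da, db, dc, n);
    cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return c;
}
```

The third launch parameter from the slide (the dynamic shared memory size) is omitted here because this kernel uses no shared memory.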
Vortex Methods

    Fluid discretized as vortices $(x, y, \alpha)$

    Vortex interaction:
        $K(x, y) = -\frac{1}{2\pi \|\mathbf{x}\|^2} (-y, x)$

    Biot-Savart law:
        $\mathbf{u}(\mathbf{x}) = \sum_p \alpha_p K(\mathbf{x} - \mathbf{x}_p)$
GPU velocity evaluation
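
The content of this slide did not survive extraction, so what follows is only a sketch of how a direct Biot-Savart sum could be evaluated on the GPU, not the author's implementation. It assumes the interaction kernel is K(x, y) = −(−y, x) / (2π|x|²), with |x|² = x² + y², and it skips pairs at zero distance, where the kernel is singular (practical vortex methods usually regularize the kernel instead). The names `Vortex`, `velocity` and `gpu_velocity` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Vortex { float x, y, alpha; };   // position and strength of one vortex

// One thread per vortex i: u_i = sum over p of alpha_p * K(x_i - x_p).
__global__ void velocity(const Vortex *v, float2 *u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float inv_two_pi = 0.15915494f;            // 1 / (2 pi)
    float ux = 0.0f, uy = 0.0f;
    for (int p = 0; p < n; ++p) {
        float dx = v[i].x - v[p].x;
        float dy = v[i].y - v[p].y;
        float r2 = dx * dx + dy * dy;
        if (r2 > 0.0f) {                             // skip the singular self-interaction
            float s = -v[p].alpha * inv_two_pi / r2; // alpha_p * (-1 / (2 pi |x|^2))
            ux += s * (-dy);                         // first component of (-y, x)
            uy += s * dx;                            // second component of (-y, x)
        }
    }
    u[i] = make_float2(ux, uy);
}

// Hypothetical host helper: returns the induced velocity at every vortex position.
std::vector<float2> gpu_velocity(const std::vector<Vortex> &v) {
    int n = (int)v.size();
    std::vector<float2> u(n);
    if (n == 0) return u;
    Vortex *dv = nullptr;
    float2 *du = nullptr;
    cudaMalloc((void **)&dv, n * sizeof(Vortex));
    cudaMalloc((void **)&du, n * sizeof(float2));
    cudaMemcpy(dv, v.data(), n * sizeof(Vortex), cudaMemcpyHostToDevice);
    int block_size = 128;
    int grid_size = (n + block_size - 1) / block_size;
    velocity<<<grid_size, block_size>>>(dv, du, n);
    cudaMemcpy(u.data(), du, n * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(dv);
    cudaFree(du);
    return u;
}
```

Every thread reads all n vortices from global memory, so the work grows as n². Staging the vortices through shared memory tile by tile, as in the matrix multiplication kernel, is the usual next step to reduce global memory traffic.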
