[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)


http://cs264.org

Abstract:

High-level scripting languages are in many ways polar opposites to
GPUs. GPUs are highly parallel, subject to hardware subtleties, and
designed for maximum throughput; they offer a tremendous advance in
the achievable performance for a significant number of computational
problems. Scripting languages such as Python, on the other hand,
favor ease of use over computational speed and do not generally
emphasize parallelism. PyOpenCL and PyCUDA are two packages that
attempt to join the two. Through concrete examples, both at the toy
and the whole-application level, this talk aims to demonstrate that
combining these opposites yields a programming environment that is
greater than the sum of its parts.

Speaker biography:

Andreas Klöckner obtained his PhD working with Jan Hesthaven at the
Department of Applied Mathematics at Brown University. He worked on a
variety of topics, all aiming to broaden the utility of discontinuous
Galerkin (DG) methods. This included their use in the simulation of
plasma physics and the demonstration of their particular suitability
for computation on throughput-oriented graphics processors (GPUs). He
also worked on multirate time-stepping methods and shock-capturing
schemes for DG.

In the fall of 2010, he joined the Courant Institute of Mathematical
Sciences at New York University as a Courant Instructor. There, he is
working on problems in computational electromagnetics with Leslie
Greengard.

His research interests include:

- Discontinuous Galerkin and integral equation methods for wave
propagation

- Programming tools for parallel architectures

- High-order unstructured particle-in-cell methods for plasma simulation

[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

  1. 1. Intro PyOpenCL RTCG Perspectives Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA Andreas Klöckner, Courant Institute of Mathematical Sciences, New York University, March 31, 2011 Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  2. 2. Intro PyOpenCL RTCG Perspectives Thanks Jan Hesthaven (Brown) Tim Warburton (Rice) Leslie Greengard (NYU) PyOpenCL, PyCUDA contributors Nvidia Corp., AMD Corp. Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  3. 3. Intro PyOpenCL RTCG Perspectives Outline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  4. 4. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL Outline 1 Introduction A Common Theme Intro to OpenCL 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  5. 5. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL Outline 1 Introduction A Common Theme Intro to OpenCL 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  6. 6. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL How are High-Performance Codes constructed? “Traditional” Construction of High-Performance Codes: C/C++/Fortran Libraries “Alternative” Construction of High-Performance Codes: Scripting for ‘brains’ GPUs for ‘inner loops’ Play to the strengths of each programming environment. Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  7. 7. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL Outline 1 Introduction A Common Theme Intro to OpenCL 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  8. 8. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL What is OpenCL? OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec] Device-neutral (Nv GPU, AMD GPU, Intel/AMD CPU) Vendor-neutral Comes with RTCG Defines: Host-side programming interface (library) Device-side programming language (!) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  9. 9. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL What is OpenCL? OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec] Device-neutral (Nv GPU, AMD GPU, Big deal? Intel/AMD CPU) Vendor-neutral Comes with RTCG Defines: Host-side programming interface (library) Device-side programming language (!) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  10. 10. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL What is OpenCL? OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec] Big deal! Device-neutral (Nv GPU, AMD GPU, Big deal? Intel/AMD CPU) Vendor-neutral Comes with RTCG Defines: Host-side programming interface (library) Device-side programming language (!) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  11. 11. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL Who? OpenCL Working Group • Diverse industry participation - Processor vendors, system OEMs, middleware vendors, application developers • Many industry-leading experts involved in OpenCL’s design - A healthy diversity of industry perspectives • Apple made initial proposal and is very active in the working group - Serving as specification editor © Copyright Khronos Group, 2010 - Page 4 Credit: Khronos Group Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  12. 12. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL When? OpenCL Timeline • Six months from proposal to released OpenCL 1.0 specification - Due to a strong initial proposal and a shared commercial incentive • Multiple conformant implementations shipping - Apple’s Mac OS X Snow Leopard now ships with OpenCL • 18 month cadence between OpenCL 1.0 and OpenCL 1.1 - Backwards compatibility protect software investment Khronos publicly Multiple conformant releases OpenCL 1.0 as implementations ship royalty-free across diverse OS specification and platforms Jun08 May09 Jun10 Dec08 2H09 Apple proposes OpenCL Khronos releases OpenCL OpenCL 1.1 working group and 1.0 conformance tests to Specification released and contributes draft specification ensure high-quality first implementations ship to Khronos implementations © Copyright Khronos Group, 2010 - Page 5 Credit: Khronos Group Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  13. 13. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL Why? Processor Parallelism CPUs GPUs Multiple cores driving Emerging Increasingly general performance increases purpose data-parallel Intersection computing Multi- Heterogeneous Graphics processor Computing APIs and programming Shading – e.g. OpenMP Languages OpenCL is a programming framework for heterogeneous compute resources © Copyright Khronos Group, 2010 - Page 3 Credit: Khronos Group Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  14. 14. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL CL vs CUDA side-by-side CUDA source code: OpenCL source code: global void transpose( void transpose( float ∗A t, float ∗A, global float ∗a t, global float ∗a, int a width, int a height ) unsigned a width, unsigned a height ) { { int base idx a = int base idx a = blockIdx .x ∗ BLK SIZE + get group id (0) ∗ BLK SIZE + blockIdx .y ∗ A BLOCK STRIDE; get group id (1) ∗ A BLOCK STRIDE; int base idx a t = int base idx a t = blockIdx .y ∗ BLK SIZE + get group id (1) ∗ BLK SIZE + blockIdx .x ∗ A T BLOCK STRIDE; get group id (0) ∗ A T BLOCK STRIDE; int glob idx a = int glob idx a = base idx a + threadIdx.x base idx a + get local id (0) + a width ∗ threadIdx.y; + a width ∗ get local id (1); int glob idx a t = int glob idx a t = base idx a t + threadIdx.x base idx a t + get local id (0) + a height ∗ threadIdx .y; + a height ∗ get local id (1); shared float A shared[BLK SIZE][BLK SIZE+1]; local float a local [BLK SIZE][BLK SIZE+1]; A shared[ threadIdx .y ][ threadIdx .x] = a local [ get local id (1)∗BLK SIZE+get local id(0)] = A[ glob idx a ]; a[ glob idx a ]; syncthreads (); barrier (CLK LOCAL MEM FENCE); A t[ glob idx a t ] = a t [ glob idx a t ] = A shared[ threadIdx .x ][ threadIdx .y ]; a local [ get local id (0)∗BLK SIZE+get local id(1)]; } } Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
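
The side-by-side CUDA/OpenCL transpose listing on the slide above does not
survive the flat transcript well, so here is a minimal, runnable PyOpenCL
sketch of the OpenCL column. It is a reconstruction, not the exact slide
code: the tile size, the array shape, and the +1 padding column on the local
tile are choices made here.

    # Simplified reconstruction of the slide's OpenCL transpose kernel,
    # driven from Python with PyOpenCL. BLK_SIZE is injected at build time.
    import numpy as np
    import pyopencl as cl

    BLK_SIZE = 16
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a = np.random.rand(1024, 768).astype(np.float32)
    mf = cl.mem_flags
    a_dev = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    a_t_dev = cl.Buffer(ctx, mf.WRITE_ONLY, size=a.nbytes)

    prg = cl.Program(ctx, """
        __kernel void transpose(
            __global float *a_t, __global const float *a,
            unsigned a_width, unsigned a_height)
        {
            // +1 column of padding avoids local-memory bank conflicts
            __local float tile[BLK_SIZE][BLK_SIZE + 1];

            unsigned col = get_group_id(0)*BLK_SIZE + get_local_id(0);
            unsigned row = get_group_id(1)*BLK_SIZE + get_local_id(1);
            tile[get_local_id(1)][get_local_id(0)] = a[row*a_width + col];

            barrier(CLK_LOCAL_MEM_FENCE);

            // Write the tile back transposed: swap the block indices.
            unsigned t_col = get_group_id(1)*BLK_SIZE + get_local_id(0);
            unsigned t_row = get_group_id(0)*BLK_SIZE + get_local_id(1);
            a_t[t_row*a_height + t_col] = tile[get_local_id(0)][get_local_id(1)];
        }
        """).build(options=["-DBLK_SIZE=%d" % BLK_SIZE])

    prg.transpose(queue, (a.shape[1], a.shape[0]), (BLK_SIZE, BLK_SIZE),
                  a_t_dev, a_dev, np.uint32(a.shape[1]), np.uint32(a.shape[0]))

    a_t = np.empty((a.shape[1], a.shape[0]), dtype=np.float32)
    cl.enqueue_copy(queue, a_t, a_t_dev)
    assert np.allclose(a_t, a.T)

The padded local tile is the standard trick for this kernel: the tile is
written row-wise and read back column-wise, and the extra column keeps those
column reads from landing in the same local-memory bank.
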
  15. 15. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL ↔ CUDA: A dictionary OpenCL CUDA Grid Grid Work Group Block Work Item Thread kernel global global device local shared private local imagend t texture<type, n, ...> barrier(LMF) syncthreads() get local id(012) threadIdx.xyz get group id(012) blockIdx.xyz get global id(012) – (reimplement) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  16. 16. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Execution Model nD Grid Group Group Group (0, 0) (1, 0) (2, 0) Two-tiered Parallelism Group Group Group (0, 1) (1, 1) (2, 1) Grid = Nx × Ny × Nz work groups Work group = Sx × Sy × Sz work items Total: i∈{x,y ,z} Si Ni work items Work Group (1, 0) Comm/Sync only within work group Item Item Item Item Work group maps to compute unit (0, 0) (1, 0) (2, 0) (3, 0) Grid/Group ≈ outer loops in an algorithm Item Item Item Item (0, 1) (1, 1) (2, 1) (3, 1) Device Language: Item (0, 2) Item (1, 2) Item (2, 2) Item (3, 2) get {global,group,local} {id,size} Item Item Item Item (axis) (0, 3) (1, 3) (2, 3) (3, 3) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  17. 17. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Host (CPU) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  18. 18. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  19. 19. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· Memory ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  20. 20. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Memory Compute Device 1 (Platform 0) ··· Host ··· ··· Memory Compute Device 0 (Platform 1) (CPU) ··· Memory ··· ··· Memory Compute Device 1 (Platform 1) ··· ··· ··· Memory Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  21. 21. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  22. 22. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Platform 0 (e.g. CPUs) Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  23. 23. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Platform 1 (e.g. GPUs) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  24. 24. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  25. 25. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service (think “chip”, Compute Device 0 (Platform 0) has memory ··· interface) ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  26. 26. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service (think “chip”, Compute Device 0 (Platform 0) has memory ··· interface) ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) Compute Unit ··· ··· (think “processor”, ··· has insn. fetch) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  27. 27. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service (think “chip”, Compute Device 0 (Platform 0) has memory ··· interface) ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) Compute Unit ··· ··· (think “processor”, ··· has insn. fetch) Processing Element (think “SIMD lane”) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  28. 28. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  29. 29. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) Python ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  30. 30. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) Python ··· ··· ··· Device Language: ∼ C99 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  31. 31. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL OpenCL Object Diagram Figure 2.1 - OpenCL UML Class Diagram Credit: Khronos Group Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  32. 32. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL Why do Scripting for GPUs? GPUs are everything that scripting languages are not. Highly parallel Very architecture-sensitive Built for maximum FP/memory throughput → complement each other CPU: largely restricted to control tasks (∼1000/sec) Scripting fast enough Python + CUDA = PyCUDA Python + OpenCL = PyOpenCL Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  33. 33. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL Outline 1 Introduction 2 Programming with PyOpenCL First Contact About PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  34. 34. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL Outline 1 Introduction 2 Programming with PyOpenCL First Contact About PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  35. 35. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL Dive into PyOpenCL 1 import pyopencl as cl , numpy 2 3 a = numpy.random.rand(256∗∗3).astype(numpy.float32) 4 5 ctx = cl. create some context () 6 queue = cl.CommandQueue(ctx) 7 8 a dev = cl. Buffer (ctx , cl .mem flags.READ WRITE, size=a.nbytes) 9 cl . enqueue write buffer (queue, a dev, a) 10 11 prg = cl.Program(ctx, ””” 12 kernel void twice( global float ∗a) 13 { a[ get global id (0)] ∗= 2; } 14 ”””). build () 15 16 prg. twice(queue, a.shape, (1,), a dev) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  36. 36. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL Dive into PyOpenCL 1 import pyopencl as cl , numpy 2 3 a = numpy.random.rand(256∗∗3).astype(numpy.float32) 4 5 ctx = cl. create some context () 6 queue = cl.CommandQueue(ctx) 7 8 a dev = cl. Buffer (ctx , cl .mem flags.READ WRITE, size=a.nbytes) 9 cl . enqueue write buffer (queue, a dev, a) 10 11 prg = cl.Program(ctx, ””” 12 kernel void twice( global float ∗a) 13 { a[ get global id (0)] ∗= 2; } Compute kernel 14 ”””). build () 15 16 prg. twice(queue, a.shape, (1,), a dev) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  37. 37. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL Dive into PyOpenCL 8 a dev = cl. Buffer (ctx , cl .mem flags.READ WRITE, size=a.nbytes) 9 cl . enqueue write buffer (queue, a dev, a) 10 11 prg = cl.Program(ctx, ””” 12 kernel void twice( global float ∗a) 13 { a[ get local id (0)+ get local size (0)∗ get group id (0)] ∗= 2; } 14 ”””). build () 15 16 prg. twice(queue, a.shape, (256,), a dev) 17 18 result = numpy.empty like(a) 19 cl . enqueue read buffer (queue, a dev, result ). wait() 20 import numpy.linalg as la 21 assert la .norm(result − 2∗a) == 0 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
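
For reference, here is a cleaned-up, runnable version of the two "Dive into
PyOpenCL" listings above. One assumption: it uses cl.enqueue_copy, which in
current PyOpenCL replaces the enqueue_write_buffer / enqueue_read_buffer
calls shown on these 2011-era slides.

    import numpy as np
    import pyopencl as cl

    a = np.random.rand(256**3).astype(np.float32)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
    cl.enqueue_copy(queue, a_dev, a)

    prg = cl.Program(ctx, """
        __kernel void twice(__global float *a)
        { a[get_global_id(0)] *= 2; }
        """).build()

    prg.twice(queue, a.shape, (256,), a_dev)   # 256 work items per work group

    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, a_dev)
    assert np.linalg.norm(result - 2*a) == 0
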
  38. 38. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL Outline 1 Introduction 2 Programming with PyOpenCL First Contact About PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  39. 39. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL PyOpenCL: Completeness PyOpenCL exposes all of OpenCL. For example: Every GetInfo() query Images and Samplers Memory Maps Profiling and Synchronization GL Interop Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
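
As a small illustration of the "every GetInfo() query" point, the following
sketch walks all platforms and devices and prints a few of their properties
through PyOpenCL's attribute-style info access.

    import pyopencl as cl

    for platform in cl.get_platforms():
        print(platform.name, platform.version)
        for dev in platform.get_devices():
            print("  ", dev.name,
                  dev.global_mem_size // (1024**2), "MB global memory,",
                  dev.max_compute_units, "compute units")
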
  40. 40. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL PyOpenCL: Completeness PyOpenCL supports (nearly) every OS that has an OpenCL implementation. Linux OS X Windows Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  41. 41. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL Automatic Cleanup Reachable objects (memory, streams, . . . ) are never destroyed. Once unreachable, released at an unspecified future time. Scarce resources (memory) can be explicitly freed. (obj.release()) Correctly deals with multiple contexts and dependencies. (based on OpenCL’s reference counting) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
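
A minimal sketch of the explicit-freeing escape hatch mentioned above: device
memory is normally reclaimed once the Buffer becomes unreachable, but
release() returns it immediately. The buffer size here is arbitrary.

    import pyopencl as cl

    ctx = cl.create_some_context()
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=1 << 20)  # 1 MB on the device
    # ... use buf in kernels and copies ...
    buf.release()  # free the device memory now, not at garbage-collection time
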
  42. 42. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL PyOpenCL: Documentation Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  43. 43. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL PyOpenCL Philosophy Provide complete access Automatically manage resources Provide abstractions Allow interactive use Check for and report errors automatically Integrate tightly with numpy Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  44. 44. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL PyOpenCL, PyCUDA: Vital Information http://mathema.tician.de/ software/pyopencl (or /pycuda) Complete documentation X Consortium License (no warranty, free for all use) Convenient abstractions Arrays, Elementwise op., Reduction, Scan Require: numpy, Python 2.4+ (Win/OS X/Linux) Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, . . . ) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  45. 45. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL Capturing Dependencies A f B = f(A) B C = g(B) g p E = f(C) C P q F = h(C) G = g(E,F) f h P = p(B) E F Q Q = q(B) g g r R = r(G,P,Q) G r r R Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  46. 46. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL Capturing Dependencies A Switch queue to out-of-order mode! f B = f(A) Specify as list of events using B C = g(B) for= optional keyword to wait g p E = f(C) enqueue XXX. C P q F = h(C) also enqueue barrier. Can G = g(E,F) f h Common use case: P = p(B) Transmit/receive from other MPI E F Q Q = q(B) ranks. g g r R = r(G,P,Q) Possible on Nv Fermi: Submit G r parallel work to increase machine r use. R Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
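
Here is a minimal sketch of the event mechanism described above: an
out-of-order queue plus explicit wait_for lists. The kernels f and g are
placeholders invented for this example, and not every OpenCL implementation
supports out-of-order queues, so the queue creation may fail on some
platforms.

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx,
        properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)

    a = np.arange(1024, dtype=np.float32)
    mf = cl.mem_flags
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    prg = cl.Program(ctx, """
        __kernel void f(__global float *x) { x[get_global_id(0)] += 1; }
        __kernel void g(__global float *x) { x[get_global_id(0)] *= 3; }
        """).build()

    evt_f = prg.f(queue, a.shape, None, buf)                    # B = f(A)
    evt_g = prg.g(queue, a.shape, None, buf, wait_for=[evt_f])  # C = g(B), after f
    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, buf, wait_for=[evt_g])
    assert np.allclose(result, (a + 1)*3)
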
  47. 47. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Outline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation The Idea RTCG in Action 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  48. 48. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Outline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation The Idea RTCG in Action 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  49. 49. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Metaprogramming In GPU scripting, GPU code does not need to be a compile-time constant. Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  50. 50. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Metaprogramming In GPU scripting, GPU code does not need to be a compile-time constant. (Key: Code is data–it wants to be reasoned about at run time) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  51. 51. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Metaprogramming Idea In GPU scripting, GPU code does not need to be a compile-time constant. (Key: Code is data–it wants to be reasoned about at run time) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  52. 52. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Metaprogramming Idea In GPU scripting, Python Code GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  53. 53. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Metaprogramming Idea In GPU scripting, Python Code GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary Machine (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  54. 54. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Metaprogramming Idea Human In GPU scripting, Python Code GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  55. 55. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Metaprogramming Idea Good for code In GPU scripting, Python Code generation GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  56. 56. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Metaprogramming Idea Good for code In GPUyCUDA P scripting, Python Code generation GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  57. 57. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Metaprogramming Idea Good for code PyOp UDA In GPUyCenCL P scripting, Python Code generation GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  58. 58. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Machine-generated Code Why machine-generate code? Automated Tuning (cf. ATLAS, FFTW) Data types Specialize code for given problem Constants faster than variables (→ register pressure) Loop Unrolling Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  59. 59. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action PyOpenCL: Support for Metaprogramming Three (main) ways of generating code: Simple %-operator substitution Combine with C preprocessor: simple, often sufficient Use a templating engine (Mako works very well) codepy: Build C syntax trees from Python Generates readable, indented C Many ways of evaluating code–most important one: Exact device timing via events Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  60. 60. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action Outline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation The Idea RTCG in Action 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  61. 61. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action PyOpenCL Arrays: General Usage Remember your first PyOpenCL program? Abstraction is good: 1 import numpy 2 import pyopencl as cl 3 import pyopencl.array as cl array 4 5 ctx = cl. create some context () 6 queue = cl.CommandQueue(ctx) 7 8 a gpu = cl array . to device ( 9 ctx , queue, numpy.random.randn(4,4).astype(numpy.float32)) 10 a doubled = (2∗a gpu).get() 11 print a doubled 12 print a gpu Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
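
A cleaned-up, runnable version of the array demo above. One difference from
the slide: in current PyOpenCL, pyopencl.array.to_device takes the queue
only; the (ctx, queue, ...) form is from an older API.

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a_gpu = cl_array.to_device(queue,
        np.random.randn(4, 4).astype(np.float32))
    a_doubled = (2*a_gpu).get()
    print(a_doubled)
    print(a_gpu)   # prints the device array's contents, handy for debugging
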
  62. 62. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action pyopencl.array: Simple Linear Algebra pyopencl.array.Array: Meant to look and feel just like numpy. p.a.to device(ctx, queue, numpy array) numpy array = ary.get() +, -, ∗, /, fill, sin, arange, exp, rand, . . . Mixed types (int32 + float32 = float64) print cl array for debugging. Allows access to raw bits Use as kernel arguments, memory maps Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  63. 63. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action pyopencl.elementwise: Elementwise expressions Avoiding extra store-fetch cycles for elementwise math: n = 10000 a gpu = cl array . to device ( ctx , queue, numpy.random.randn(n).astype(numpy.float32)) b gpu = cl array . to device ( ctx , queue, numpy.random.randn(n).astype(numpy.float32)) from pyopencl.elementwise import ElementwiseKernel lin comb = ElementwiseKernel(ctx, ” float a, float ∗x, float b, float ∗y, float ∗z”, ”z[ i ] = a∗x[i ] + b∗y[i]”) c gpu = cl array . empty like (a gpu) lin comb(5, a gpu, 6, b gpu, c gpu) import numpy.linalg as la assert la .norm((c gpu − (5∗a gpu+6∗b gpu)).get()) < 1e−5 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
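
The same linear-combination kernel as above, cleaned up into runnable form.
The scalars are passed as numpy.float32 so they match the declared "float"
arguments; the kernel name "lin_comb" is just a label chosen here.

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.elementwise import ElementwiseKernel

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    n = 10000
    x_gpu = cl_array.to_device(queue, np.random.randn(n).astype(np.float32))
    y_gpu = cl_array.to_device(queue, np.random.randn(n).astype(np.float32))

    lin_comb = ElementwiseKernel(ctx,
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]",
        "lin_comb")

    z_gpu = cl_array.empty_like(x_gpu)
    lin_comb(np.float32(5), x_gpu, np.float32(6), y_gpu, z_gpu)

    assert np.linalg.norm((z_gpu - (5*x_gpu + 6*y_gpu)).get()) < 1e-5
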
  64. 64. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action RTCG via Substitution source = (””” kernel void %(name)s(%(arguments)s) { unsigned lid = get local id (0); unsigned gsize = get global size (0); unsigned work item start = get local size (0)∗ get group id (0); for (unsigned i = work item start + lid ; i < n; i += gsize) { %(operation)s; } } ””” % { ”arguments”: ”, ”. join (arg . declarator () for arg in arguments), ”operation”: operation , ”name”: name, ”loop prep”: loop prep , }) prg = cl.Program(ctx, source ). build () Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
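
The %-substitution snippet above is a fragment of a larger generator:
arguments, operation, name, and loop_prep are filled in elsewhere. Below is a
self-contained sketch of the same pattern with concrete values chosen for the
example, generating and running a simple scaling kernel with a grid-stride
loop.

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    name = "scale"
    arguments = "__global float *y, __global const float *x, unsigned n"
    operation = "y[i] = 2*x[i]"

    # Generate the kernel source by filling the placeholders, then build it.
    source = """
        __kernel void %(name)s(%(arguments)s)
        {
            unsigned lid = get_local_id(0);
            unsigned gsize = get_global_size(0);
            unsigned work_item_start = get_local_size(0)*get_group_id(0);
            for (unsigned i = work_item_start + lid; i < n; i += gsize)
            {
                %(operation)s;
            }
        }
        """ % {"name": name, "arguments": arguments, "operation": operation}

    prg = cl.Program(ctx, source).build()

    n = 100000
    x = np.random.rand(n).astype(np.float32)
    mf = cl.mem_flags
    x_dev = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
    y_dev = cl.Buffer(ctx, mf.WRITE_ONLY, size=x.nbytes)

    prg.scale(queue, (256*32,), (256,), y_dev, x_dev, np.uint32(n))
    y = np.empty_like(x)
    cl.enqueue_copy(queue, y, y_dev)
    assert np.allclose(y, 2*x)
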
  65. 65. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action RTCG via Templates from mako.template import Template tpl = Template(””” kernel void add( global ${ type name } ∗tgt, global const ${ type name } ∗op1, global const ${ type name } ∗op2) { int idx = get local id (0) + ${ local size } ∗ ${ thread strides } ∗ get group id (0); % for i in range( thread strides ): <% offset = i∗ local size %> tgt [ idx + ${ offset }] = op1[idx + ${ offset }] + op2[idx + ${ offset } ]; % endfor }”””) rendered tpl = tpl . render(type name=”float”, local size = local size , thread strides = thread strides ) knl = cl.Program(ctx, str ( rendered tpl )). build (). add Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
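
A runnable rendering of the Mako-templated kernel above (assumes the mako
package is installed). local_size and thread_strides are tuning knobs baked
into the generated source; the values and the array size below are chosen for
the example, and the global size is set so that each work item handles
thread_strides elements.

    import numpy as np
    import pyopencl as cl
    from mako.template import Template

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    local_size = 256
    thread_strides = 4
    n = local_size * thread_strides * 64   # 64 work groups' worth of data

    tpl = Template("""
        __kernel void add(
            __global ${type_name} *tgt,
            __global const ${type_name} *op1,
            __global const ${type_name} *op2)
        {
            int idx = get_local_id(0)
                + ${local_size} * ${thread_strides} * get_group_id(0);
            % for i in range(thread_strides):
                <% offset = i*local_size %>
                tgt[idx + ${offset}] = op1[idx + ${offset}] + op2[idx + ${offset}];
            % endfor
        }""")

    rendered = tpl.render(type_name="float",
                          local_size=local_size,
                          thread_strides=thread_strides)
    knl = cl.Program(ctx, str(rendered)).build().add

    op1 = np.random.rand(n).astype(np.float32)
    op2 = np.random.rand(n).astype(np.float32)
    mf = cl.mem_flags
    op1_dev = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=op1)
    op2_dev = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=op2)
    tgt_dev = cl.Buffer(ctx, mf.WRITE_ONLY, size=op1.nbytes)

    knl(queue, (n // thread_strides,), (local_size,), tgt_dev, op1_dev, op2_dev)
    result = np.empty_like(op1)
    cl.enqueue_copy(queue, result, tgt_dev)
    assert np.allclose(result, op1 + op2)

Unrolling the per-work-item loop at template-render time is the point of the
exercise: the strided offsets become compile-time constants in the generated
OpenCL source.
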
  66. 66. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action pyopencl.reduction: Reduction made easy Example: A dot product calculation from pyopencl.reduction import ReductionKernel dot = ReductionKernel(ctx, dtype out=numpy.float32, neutral=”0”, reduce expr=”a+b”, map expr=”x[i]∗y[i]”, arguments=” global const float ∗x, global const float ∗y”) import pyopencl.clrandom as cl rand x = cl rand .rand(ctx , queue, (1000∗1000), dtype=numpy.float32) y = cl rand .rand(ctx , queue, (1000∗1000), dtype=numpy.float32) x dot y = dot(x, y ). get() x dot y cpu = numpy.dot(x.get(), y. get ()) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
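
The dot-product ReductionKernel above, cleaned up. One deviation from the
slide: the random inputs are generated with numpy and copied over, since the
pyopencl.clrandom call signature shown there has since changed.

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.reduction import ReductionKernel

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    dot = ReductionKernel(ctx, dtype_out=np.float32, neutral="0",
            reduce_expr="a+b", map_expr="x[i]*y[i]",
            arguments="__global const float *x, __global const float *y")

    x = cl_array.to_device(queue, np.random.rand(1000*1000).astype(np.float32))
    y = cl_array.to_device(queue, np.random.rand(1000*1000).astype(np.float32))

    x_dot_y = dot(x, y).get()
    x_dot_y_cpu = np.dot(x.get(), y.get())
    assert abs(x_dot_y - x_dot_y_cpu) / abs(x_dot_y_cpu) < 1e-4
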
  67. 67. Intro PyOpenCL RTCG Perspectives Idea RTCG in Action pyopencl.scan: Scan made easy Example: A cumulative sum computation from pyopencl.scan import InclusiveScanKernel knl = InclusiveScanKernel(ctx , np.int32 , ”a+b”) n = 2∗∗20−2∗∗18+5 host data = np.random.randint(0, 10, n). astype(np.int32) dev data = cl array . to device (queue, host data) knl(dev data) assert (dev data.get() == np.cumsum(host data, axis=0)).all() Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
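
The cumulative-sum example above in runnable form; InclusiveScanKernel scans
the device array in place.

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.scan import InclusiveScanKernel

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    knl = InclusiveScanKernel(ctx, np.int32, "a+b")

    n = 2**20 - 2**18 + 5
    host_data = np.random.randint(0, 10, n).astype(np.int32)
    dev_data = cl_array.to_device(queue, host_data)

    knl(dev_data)   # scans in place
    assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
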
  68. 68. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Outline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  69. 69. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Outline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  70. 70. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Whetting your appetite 1 import pycuda.driver as cuda 2 import pycuda.autoinit , pycuda.compiler 3 import numpy 4 5 a = numpy.random.randn(4,4).astype(numpy.float32) 6 a gpu = cuda.mem alloc(a.nbytes) 7 cuda.memcpy htod(a gpu, a) [This is examples/demo.py in the PyCUDA distribution.] Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  71. 71. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Whetting your appetite 1 mod = pycuda.compiler.SourceModule(””” 2 global void twice( float ∗a) 3 { 4 int idx = threadIdx.x + threadIdx.y∗4; 5 a[ idx ] ∗= 2; 6 } 7 ”””) 8 9 func = mod.get function(”twice”) 10 func(a gpu, block=(4,4,1)) 11 12 a doubled = numpy.empty like(a) 13 cuda.memcpy dtoh(a doubled, a gpu) 14 print a doubled 15 print a Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  72. 72. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Whetting your appetite 1 mod = pycuda.compiler.SourceModule(””” 2 global void twice( float ∗a) 3 { 4 int idx = threadIdx.x + threadIdx.y∗4; 5 a[ idx ] ∗= 2; 6 } Compute kernel 7 ”””) 8 9 func = mod.get function(”twice”) 10 func(a gpu, block=(4,4,1)) 11 12 a doubled = numpy.empty like(a) 13 cuda.memcpy dtoh(a doubled, a gpu) 14 print a doubled 15 print a Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
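
The PyCUDA demo spread across the three slides above (examples/demo.py in the
PyCUDA distribution), cleaned up into one runnable script.

    import numpy as np
    import pycuda.driver as cuda
    import pycuda.autoinit          # creates a context on the first available GPU
    from pycuda.compiler import SourceModule

    a = np.random.randn(4, 4).astype(np.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

    mod = SourceModule("""
        __global__ void twice(float *a)
        {
            int idx = threadIdx.x + threadIdx.y*4;
            a[idx] *= 2;
        }
        """)

    func = mod.get_function("twice")
    func(a_gpu, block=(4, 4, 1))    # one 4x4 thread block covers the matrix

    a_doubled = np.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print(a_doubled)
    print(a)
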
  73. 73. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions PyOpenCL ↔ PyCUDA: A (rough) dictionary PyOpenCL PyCUDA Context Context CommandQueue Stream Buffer mem alloc / DeviceAllocation Program SourceModule Kernel Function Event (eg. enqueue marker) Event Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  74. 74. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Outline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  75. 75. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Discontinuous Galerkin Method Let Ω := i Dk ⊂ Rd . Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  76. 76. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Discontinuous Galerkin Method Let Ω := i Dk ⊂ Rd . Goal Solve a conservation law on Ω: ut + · F (u) = 0 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  77. 77. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Discontinuous Galerkin Method Let Ω := i Dk ⊂ Rd . Goal Solve a conservation law on Ω: ut + · F (u) = 0 Example Maxwell’s Equations: EM field: E (x, t), H(x, t) on Ω governed by 1 j 1 ∂t E − ×H =− , ∂t H + × E = 0, ε ε µ ρ ·E = , · H = 0. ε Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  78. 78. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms ˆ ˆ 0= ut ϕ + [ · F (u)]ϕ dx − [ˆ · F − (ˆ · F )∗ ]ϕ dSx n n Dk ∂Dk Flux term Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  79. 79. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms ˆ ˆ 0= ut ϕ + [ · F (u)]ϕ dx − [ˆ · F − (ˆ · F )∗ ]ϕ dSx n n Dk ∂Dk Flux term Flux terms: vary by problem expression specified by user evaluated pointwise Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  80. 80. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms Example Example: Fluxes for Maxwell’s Equations 1 n · (F − F ∗ )E := ˆ [ˆ × ( H − αˆ × E )] n n 2 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  81. 81. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms Example Example: Fluxes for Maxwell’s Equations 1 n · (F − F ∗ )E := ˆ [ˆ × ( H − αˆ × E )] n n 2 User writes: Vectorial statement in math. notation flux = 1/2∗cross(normal, h. int −h.ext −alpha∗cross(normal, e. int −e.ext)) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  82. 82. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms Example Example: Fluxes for Maxwell’s Equations 1 n · (F − F ∗ )E := ˆ [ˆ × ( H − αˆ × E )] n n 2 We generate: Scalar evaluator in C (6×) a flux += ( ((( val a field5 − val b field5 )∗ fpair −>normal[2] − ( val a field4 − val b field4 )∗ fpair −>normal[0]) + val a field0 − val b field0 )∗ fpair −>normal[0] − ((( val a field4 − val b field4 ) ∗ fpair −>normal[1] − ( val a field1 − val b field1 )∗ fpair −>normal[2]) + val a field3 − val b field3 ) ∗ fpair −>normal[1] )∗ value type (0.5); Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  83. 83. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Loop Slicing for element-local parts of GPU DG Per Block: KL element-local mat.mult. + matrix load Preparation Question: How should one assign work to threads? Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  84. 84. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Loop Slicing for element-local parts of GPU DG Per Block: KL element-local mat.mult. + matrix load Preparation Question: How should one assign work to threads? ws : in sequence wi : “inline-parallel” wp : in parallel Thread Thread Thread t t t (amortize preparation) (exploit register space) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  85. 85. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Loop Slicing for Differentiation 2.2 3.0 Local differentiation, matrix-in-shared, order 4, with microblocking 2.8 2.0 point size denotes wi ∈ 1, ,4 2.6 1.8 2.4 Execution time [ms] 1.6 2.2 2.0 ws 1.4 1.8 1.2 1.6 1.4 1.0 1.2 0.8 15 20 25 30 1.0 wp Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  86. 86. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400 300 GPU 250 CPU 200 GFlops/s 150 100 50 00 2 4 6 8 10 Polynomial Order N Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  87. 87. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Memory Bandwidth on a GTX 280 200 Gather 180 Lift Global Memory Bandwidth [GB/s] Diff 160 Assy. Peak 140 120 100 80 60 40 201 2 3 4 5 6 7 8 9 Polynomial Order N Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  88. 88. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions GPU DG Showcase Electromagnetism Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  89. 89. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions GPU DG Showcase Electromagnetism Poisson Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  90. 90. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions GPU DG Showcase Electromagnetism Poisson CFD Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  91. 91. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Outline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  92. 92. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Automating GPU Programming GPU programming can be time-consuming, unintuitive and error-prone. Obvious idea: Let the computer do it. One way: Smart compilers Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  93. 93. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Automating GPU Programming GPU programming can be time-consuming, unintuitive and error-prone. Obvious idea: Let the computer do it. One way: Smart compilers GPU programming requires complex tradeoffs Tradeoffs require heuristics Heuristics are fragile Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  94. 94. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Automating GPU Programming GPU programming can be time-consuming, unintuitive and error-prone. Obvious idea: Let the computer do it. One way: Smart compilers GPU programming requires complex tradeoffs Tradeoffs require heuristics Heuristics are fragile Another way: Dumb enumeration Enumerate loop slicings Enumerate prefetch options Choose by running resulting code on actual hardware Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  95. 95. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Loo.py Example Empirical GPU loop optimization: a, b, c, i , j , k = [var(s) for s in ” abcijk ”] n = 500 k = make loop kernel([ LoopDimension(”i”, n), LoopDimension(”j”, n), LoopDimension(”k”, n), ], [ (c[ i +n∗j], a[ i +n∗k]∗b[k+n∗j]) ]) gen kwargs = { ”min threads”: 128, ”min blocks”: 32, } → Ideal case: Finds 160 GF/s kernel without human intervention. Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  96. 96. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Loo.py Status Limited scope: Require input/output separation Kernels must be expressible using “loopy” model (i.e. indices decompose into “output” and “reduction”) Enough for DG, LA, FD, . . . Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  97. 97. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Loo.py Status Limited scope: Require input/output separation Kernels must be expressible using “loopy” model (i.e. indices decompose into “output” and “reduction”) Enough for DG, LA, FD, . . . Kernel compilation limits trial rate Non-Goal: Peak performance Good results currently for dense linear algebra and (some) DG subkernels Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  98. 98. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Outline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  99. 99. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Where to from here? PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/ GPU RTCG AK, N. Pinto et al. PyCUDA: GPU Run-Time Code Generation for High-Performance Computing, submitted. GPU-DG Article AK, T. Warburton, J. Bridge, J.S. Hesthaven, “Nodal Discontinuous Galerkin Methods on Graphics Processors”, J. Comp. Phys., 228 (21), 7863–7882. Also: Intro in GPU Computing Gems Vol 2 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  100. 100. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Conclusions GPUs to me: architecture choice now widely available Fun time to be in computational science GPUs and scripting work surprisingly well together Exploit a natural task decomposition in computational codes RTCG: Crucial tool GPU Scripting great for prototyping . . . and just as suitable for production code Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  101. 101. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Questions? ? Thank you for your attention! http://www.cims.nyu.edu/~kloeckner/ image credits Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  102. 102. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Image Credits Dictionary: sxc.hu/topfer C870 GPU: Nvidia Corp. OpenCL Logo: Apple Corp./Ars Technica OS Platforms: flickr.com/aOliN.Tk Old Books: flickr.com/ppdigital Floppy disk: flickr.com/ethanhein Machine: flickr.com/13521837@N00 Adding Machine: flickr.com/thomashawk Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  103. 103. Implementations Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs Flop Rates: 16 GPUs vs 64 CPU cores 4000 GPU CPU 3000 GFlops/s 2000 1000 00 2 4 6 8 10 Polynomial Order N Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  104. 104. Implementations Outline 5 OpenCL implementations Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  105. 105. Implementations The Nvidia CL implementation Targets only GPUs Notes: Nearly identical to CUDA No native C-level JIT in CUDA (→ PyCUDA) Page-locked memory: Use CL MEM ALLOC HOST PTR. Careful: double meaning Need page-locked memory for genuinely overlapped transfers. No linear memory texturing CUDA device emulation mode deprecated → Use AMD CPU CL (faster, too!) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  106. 106. Implementations The Apple CL implementation Targets CPUs and GPUs General notes: Different header name OpenCL/cl.h instead of CL/cl.h Use -framework OpenCL for C access. Beware of imperfect compiler cache implementation (ignores include files) CPU notes: One work item per processor GPU similar to hardware vendor implementation. (New: Intel w/ Sandy Bridge) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  107. 107. Implementations The AMD CL implementation Targets CPUs and GPUs (from both AMD and Nvidia) GPU notes: Wide SIMD groups (64) Native 4/5-wide vectors But: very flop-heavy machine, may ignore vectors for memory-bound workloads → Both implicit and explicit SIMD CPU notes: Many work items per processor (emulated) General: cl amd printf Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
