Dynamic Compilation for Massively Parallel Processors

Gregory Diamos
PhD candidate
Georgia Institute of Technology and NVIDIA Research

April 14, 2011




                  Gregory Diamos   CS264 - Dynamic Compilation   1/62
What is an execution model?




Goals of programming languages
  Programming languages are designed for productivity.




  Efficiency is measured in terms of:
    1   cost - hardware investment, power consumption, area requirement
    2   complexity - application development effort
    3   speed - amount of work performed per unit time
Goals of processor architecture




  Hardware is designed for speed and efficiency.




Goals of processor architecture - 2




  [1] M. Koyanagi, T. Fukushima, and T. Tanaka. "High-Density Through Silicon Vias for 3-D LSIs."
  [2] Novoselov et al. "Electric Field Effect in Atomically Thin Carbon Films."
  [3] Intel Corp. 22nm test chip.




  It is constrained by the limitations of physical devices.




Execution models bridge the gap




Goals of execution models



  Execution models provide impedance matching between applications and
  hardware.

  Goals:
      leverage common optimizations across multiple applications.
      limit the impact of hardware changes on software.


  ISAs have traditionally been effective execution models.




Programming challenges of heterogeneity

  The introduction of heterogeneous and multi-core processors changes the
  hardware/software interface:




  [Figure: Intel Nehalem, IBM PowerEN, AMD Fusion, NVIDIA Fermi.]




    1   multi-core creates multiple interfaces.
    2   heterogeneity creates different interfaces.
    3   these increase software complexity.



Program the entire processor, not individual cores.
        (new execution model abstractions are needed)




Emerging execution models




Bulk-synchronous parallel (BSP)




  [1] Leslie Valiant. A bridging model for parallel computing.


The Parallel Thread eXecution (PTX) Model




  PTX defines a kernel as a 2-level grid of bulk-synchronous tasks.
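As a sketch (1-D indices for brevity; `launch` and `saxpy` are illustrative stand-ins, not NVIDIA's API), the 2-level hierarchy amounts to a grid of CTAs, each a block of threads, with the global index derived from both levels:

```python
# Sketch (not NVIDIA's implementation): emulate a PTX-style launch as a
# 2-level grid -- a grid of CTAs (thread blocks), each a block of threads.
def launch(kernel, grid_dim, block_dim, *args):
    for cta in range(grid_dim):          # CTAs may run in any order
        for tid in range(block_dim):     # threads within one CTA
            kernel(cta, tid, block_dim, *args)

def saxpy(cta, tid, block_dim, a, x, y, out):
    i = cta * block_dim + tid            # global thread index
    if i < len(x):                       # guard the tail of the last CTA
        out[i] = a * x[i] + y[i]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
out = [0.0] * 5
launch(saxpy, 2, 4, 2.0, x, y, out)      # -> out == [12.0, 14.0, 16.0, 18.0, 20.0]
```

Because CTAs may execute in any order, the model exposes bulk-synchronous parallelism that a compiler is free to reschedule.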

Dynamically translating PTX




  Dynamic compilers can transform this parallelism to fit the hardware.

Beyond PTX - Data distributions




Beyond PTX - Memory hierarchies




  [1] Leslie Valiant. A bridging model for multi-core.
  [2] Fatahalian et al. Sequoia: Programming the memory hierarchy.
Dynamic compilation / binary translation




Binary translation




Binary translators are everywhere




  If you are running a browser, you are using dynamic compilation.

x86 binary translation




Low Level Virtual Machines

  Compile all programs to a common virtual machine representation (LLVM
  IR) and keep it around.




  Perform common optimizations on this IR.
      Target various machines by lowering the IR to an ISA, either
      statically or via JIT compilation.


Execution model translation




Execution model translation


  Extend binary translation to execution model translation.




  Dynamic compilers can map threads/tasks to the HW.




Different core architectures




  Can we target these from the same execution model?
      What about efficiency?


Ocelot




  Enables thread-aware compiler transformations.




Mapping CTAs to cores - thread fusion



  [Figure: original PTX code vs. transformed PTX code. A scheduler block wraps the kernel body; registers are spilled before the barrier and restored after it.]




  Transform threads into loops over the program.
       Distribute loops to handle barriers.
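A minimal sketch of this transformation, assuming a hypothetical kernel with one barrier: each side of the barrier becomes its own loop over the CTA's threads, and a value live across the barrier is spilled to a per-thread array (the analogue of the register spill/restore above):

```python
# Sketch of thread fusion: a CTA whose kernel contains one barrier is split
# into two loops over threads; a value live across the barrier is "spilled"
# to a per-thread array and restored in the second loop.
def fused_cta(num_threads, data):
    spilled = [0] * num_threads
    # Subkernel 1: everything before the barrier.
    for tid in range(num_threads):
        spilled[tid] = data[tid] * 2           # value live across the barrier
    # -- barrier: loop distribution guarantees every thread reached it --
    # Subkernel 2: everything after the barrier.
    for tid in range(num_threads):
        # Each thread reads its neighbour's value, safe only after the barrier.
        data[tid] = spilled[(tid + 1) % num_threads]
    return data

fused_cta(4, [1, 2, 3, 4])   # -> [4, 6, 8, 2]
```

Running the second loop only after the first completes is exactly what makes the serialized barrier correct.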

Mapping CTAs to cores - vectorization




  Pack adjacent threads into vector instructions.
      Speculate that divergence never occurs, check in case it does.
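The speculation above can be sketched in plain Python, with lists standing in for vector registers (the predicate and arithmetic are made up for illustration):

```python
# Sketch: speculatively execute 4 adjacent threads as one "vector" operation,
# first checking that all threads take the same branch; on divergence, fall
# back to a scalar loop over the threads.
def vectorized_step(xs):
    conds = [x > 0 for x in xs]
    if all(conds):                 # uniform taken: one vector instruction
        return [x * 2 for x in xs]
    if not any(conds):             # uniform not-taken: also one vector op
        return [-x for x in xs]
    # Divergence detected: abandon the vector form, run threads one by one.
    return [x * 2 if x > 0 else -x for x in xs]

vectorized_step([1, 2, 3, 4])     # uniform, vector path    -> [2, 4, 6, 8]
vectorized_step([1, -2, 3, -4])   # divergent, scalar path  -> [2, 2, 6, 4]
```

The check costs one reduction per branch; when warps rarely diverge, the vector path dominates.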
Mapping CTAs to cores - multiple instruction streams





  Instructions from different threads are independent.
      Merge instruction streams and statically schedule them onto functional units.
PTX analysis




Divergence analysis




Subkernels






Thread frontier analysis


  Supporting control flow on SIMD processors requires finding divergent
  branches and potential re-converge points.
  [Figure: a compound conditional, if ((cond1() || cond2()) && (cond3() || cond4())), lowered to short-circuit control flow; a table of per-block thread frontiers (e.g. B2: {B2 - B3}, B3: {B3 - Exit}, B4: {B4 - Exit}, B5: {B5 - Exit}); and a stack-based execution trace of threads T0-T3 contrasting thread-frontier re-convergence with immediate post-dominator re-convergence.]




  Compiler analysis can identify immediate post-dominators or
  thread frontiers as re-convergence points.
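A minimal sketch of the post-dominator half of this analysis: iterative dataflow over a toy CFG (node names are hypothetical), where the immediate post-dominator of a divergent branch is the earliest safe re-convergence point:

```python
# Sketch: compute post-dominator sets for a small CFG given as a dict of
# successor lists; a SIMD machine can re-converge a divergent branch at its
# immediate post-dominator.
def postdominators(succ, exit_node):
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}      # start from "everything"
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:                             # iterate to a fixed point
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

# Diamond CFG: "entry" branches to B1/B2, both fall through to "exit".
cfg = {"entry": ["B1", "B2"], "B1": ["exit"], "B2": ["exit"], "exit": []}
pd = postdominators(cfg, "exit")
# "exit" post-dominates the branch at "entry", so re-converge there.
```

Thread-frontier analysis refines this by allowing earlier re-convergence when block layout permits, as the figure's trace illustrates.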




Consequences of architecture differences




Degraded performance portability




  [Figure: two charts of SGEMM performance (GFLOPS) vs. matrix size N from 0 to 6000, each plotting Fermi SGEMM and AMD SGEMM, with peaks near 600 GFLOPS on one chart and 1600 GFLOPS on the other.]
  Performance of two OpenCL applications, one tuned for AMD, the other
  for NVIDIA.

Memory traversal patterns




  [Figure: memory addresses touched by Warp(4) and Warp(1) in cycles 1 and 2.]

  Thread loops change row-major memory accesses into column-major
  accesses.
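The change in traversal pattern can be seen by listing which addresses the four "threads" touch at each step (a small sketch with made-up sizes):

```python
# Sketch: addresses touched "at the same time" by 4 threads over 2 steps.
N, W = 8, 4   # 8 elements, 4 threads

# GPU-style (warp-synchronous): at step i, thread t touches t + i*W.
warp_step = [[t + i * W for t in range(W)] for i in range(N // W)]
# step 0 -> [0, 1, 2, 3]  (consecutive addresses: one coalesced transaction)

# After thread fusion, each thread loops over a contiguous chunk of N//W
# elements: at step i, thread t touches t*(N//W) + i.
loop_step = [[t * (N // W) + i for t in range(W)] for i in range(N // W)]
# step 0 -> [0, 2, 4, 6]  (strided: simultaneous accesses no longer adjacent)
```

The same data, visited in the same total order per thread, produces a cross-thread stride that defeats coalescing and hardware prefetchers.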
Reduced memory bandwidth on CPUs
  [Figure: memory access patterns optimized for a single-threaded CPU vs. optimized for SIMD (GPU).]


  This reduces memory bandwidth by 10x for a memory microbenchmark
  running on a 4-core CPU.
The good news




Scaling across three decades of processors

  Many existing applications still scale.



  [Figure: application scaling across processors, annotated with 12x and 480x speedups.]




  A GTX 280 has 40x more peak FLOPS than a Phenom, and 480x more than
  an Atom.


Questions?




Databases on GPUs




Who cares about databases?




What do applications look like?








Gobs of data




Distributed systems




Lots of parallelism




What do CPU algorithms look like?







B-trees




Sequential algorithms




  [Figure: a sequential join compares tuples from relation 1 and relation 2 (<, =, >) and emits matches to the result.]



It doesn’t look good




  Outlook not so good...




Or does it?




  Where is the parallelism?


Flattened trees




Relational algebra




Inner Join








1. Recursive partitioning




2. Block streaming




  Blocking into pages, shared-memory buffers, and transaction-sized
  chunks makes memory accesses efficient.


3. Shared memory merging network




  A network for join can be constructed, similar to a sorting network.
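As a stand-in for the network itself, here is the sequential operation it parallelizes: a two-pointer merge join over two sorted key lists (unique keys assumed for brevity; this is an illustrative sketch, not the shared-memory implementation):

```python
# Sketch: merge-based inner join of two SORTED key lists. The shared-memory
# merging network performs many of these comparisons in parallel, the way a
# sorting network parallelizes merge sort.
def merge_join(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1                 # advance the smaller side
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])       # matching key: emit a join result
            i += 1
            j += 1
    return out

merge_join([1, 3, 4, 7], [2, 3, 5, 7, 8])   # -> [3, 7]
```

Each comparison is independent given the partition boundaries, which is what makes the network formulation possible.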




4. Data chunking




  Stream compaction packs result data into chunks that can be streamed
  out of shared memory efficiently.
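A sketch of stream compaction using an exclusive prefix sum, the standard formulation (function names here are illustrative):

```python
# Sketch: stream compaction. An exclusive prefix sum over match flags gives
# each kept element its dense output slot, so results pack contiguously --
# the way join results are chunked before streaming out of shared memory.
def compact(values, keep):
    flags = [1 if keep(v) else 0 for v in values]
    # Exclusive prefix sum: offsets[k] = number of kept elements before k.
    offsets, total = [], 0
    for f in flags:
        offsets.append(total)
        total += f
    out = [None] * total
    for v, f, o in zip(values, flags, offsets):
        if f:
            out[o] = v             # scatter to the dense position
    return out

compact([5, 0, 7, 0, 2], lambda v: v != 0)   # -> [5, 7, 2]
```

On a GPU the prefix sum itself is parallel, so every thread learns its output offset without serialization.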
Operator fusion




Will it blend?




Yes it blends.




          Operator         NVIDIA C2050                 Phenom 9570
          inner-join       26.4-32.3 GB/s               0.11-0.63 GB/s
          select           104.2 GB/s                   2.55 GB/s
          set operators    45.8 GB/s                    0.72 GB/s
          projection       54.3 GB/s                    2.34 GB/s
          cross product    98.8 GB/s                    2.67 GB/s




Questions?




Conclusions



  Emerging heterogeneous architectures need matching execution model
  abstractions.
      dynamic compilation can enable portability.

  When writing massively parallel codes, consider:
      data structures and algorithms.
      mapping onto the execution model.
      transformations in the compiler/runtime.
      processor micro-architecture.




Thoughts on open source software




Questions?



                                Contact Me:

                      gregory.diamos@gatech.edu



             Contribute to Harmony, Ocelot, and Vanaheimr:

         http://code.google.com/p/harmonyruntime/

               http://code.google.com/p/gpuocelot/

               http://code.google.com/p/vanaheimr/



[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patternsnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)npinto
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...npinto
 

Plus de npinto (16)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
 

Dernier

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxElton John Embodo
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxruthvilladarez
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
The Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsThe Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsRommel Regala
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 

Dernier (20)

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docx
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docx
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
The Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsThe Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World Politics
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 

[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (Gregory Diamos, Georgia Tech)

  • 8. Programming challenges of heterogeneity. The introduction of heterogeneous and multi-core processors changes the hardware/software interface (Intel Nehalem, IBM PowerEN, AMD Fusion, NVIDIA Fermi): (1) multi-core creates multiple interfaces; (2) heterogeneity creates different interfaces; (3) both increase software complexity.
  • 9. Program the entire processor, not individual cores. (New execution model abstractions are needed.)
  • 10. Emerging execution models.
  • 11. Bulk-synchronous parallel (BSP). [1] Leslie Valiant. A bridging model for parallel computation.
  • 12. The Parallel Thread eXecution (PTX) model. PTX defines a kernel as a 2-level grid of bulk-synchronous tasks.
  • 13. Dynamically translating PTX. Dynamic compilers can transform this parallelism to fit the hardware.
  • 14. Beyond PTX - data distributions.
  • 15. Beyond PTX - memory hierarchies. [1] Leslie Valiant. A bridging model for multi-core. [2] Fatahalian et al. Sequoia: Programming the memory hierarchy.
  • 16. Dynamic compilation / binary translation.
  • 17. Binary translation.
  • 18. Binary translators are everywhere. If you are running a browser, you are using dynamic compilation.
  • 19. x86 binary translation.
  • 20. Low Level Virtual Machines. Compile all programs to a common virtual machine representation (LLVM IR) and keep it around. Perform common optimizations on this IR. Target various machines by lowering it to an ISA, either statically or via JIT compilation.
  • 21. Execution model translation.
  • 22. Execution model translation. Extend binary translation to execution model translation; dynamic compilers can map threads/tasks to the hardware.
  • 23. Different core architectures. Can we target these from the same execution model? What about efficiency?
  • 24. Ocelot. Enables thread-aware compiler transformations.
  • 25. Mapping CTAs to cores - thread fusion. Transform threads into loops over the program body, and distribute (fission) the loops to handle barriers. The slide shows the original PTX code next to the transformed PTX code, with a scheduler block and register spill/restore around the barrier.
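The thread-fusion transformation on slide 25 can be sketched in plain Python. This is a hypothetical illustration, not Ocelot's actual code: a CTA's threads become loop iterations, the loop is split (fissioned) at the barrier, and per-thread values that are live across the barrier are spilled to an array, mirroring the spill/restore blocks in the transformed PTX.

```python
# Sketch: serialize a barrier-containing kernel into two loops.
# `data` plays the role of global memory; `spilled` plays the role of
# the spilled registers that carry per-thread state across the barrier.

def kernel_as_loops(num_threads, data):
    # Region before the barrier: one loop iteration per thread.
    spilled = [0] * num_threads
    for tid in range(num_threads):
        spilled[tid] = data[tid] * 2      # per-thread work before the barrier

    # The barrier is now implicit: the first loop has finished for all threads.

    # Region after the barrier: a second loop, restoring spilled values.
    out = [0] * num_threads
    for tid in range(num_threads):
        neighbor = spilled[(tid + 1) % num_threads]  # safe: all writes done
        out[tid] = spilled[tid] + neighbor
    return out
```

Loop fission is what makes the barrier semantics hold on a single core: every iteration of the first loop completes before any iteration of the second begins.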
  • 26. Mapping CTAs to cores - vectorization. Pack adjacent threads into vector instructions; speculate that divergence never occurs, and check in case it does.
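The speculative scheme on slide 26 can be sketched as follows. This is a hypothetical illustration (the function name and fallback policy are assumptions, not from the slide): adjacent threads run as lanes of one "vector" operation under the assumption that they all take the same branch, with a runtime check that falls back to a per-lane scalar path when they diverge.

```python
# Sketch: speculative vectorization with a divergence check.

def vectorized_step(values, threshold):
    # Evaluate the branch condition for every lane.
    taken = [v > threshold for v in values]
    if all(taken) or not any(taken):
        # Uniform: the speculation held, one vector op covers all lanes.
        if taken and taken[0]:
            return [v - threshold for v in values]
        return list(values)
    # Divergence detected: replay each lane on the scalar path.
    return [v - threshold if t else v for v, t in zip(values, taken)]
```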
  • 27. Mapping CTAs to cores - multiple instruction streams. Instructions from different threads (T0-T3) are independent: merge the instruction streams and statically schedule them on functional units.
  • 28. PTX analysis.
  • 29. Divergence analysis.
  • 30. Subkernels.
  • 31. Thread frontier analysis. Supporting control flow on SIMD processors requires finding divergent branches and potential re-convergence points. The slide walks a compound, short-circuit conditional, if ((cond1() || cond2()) && (cond3() || cond4())), through its control-flow graph, comparing immediate post-dominator re-convergence with re-convergence at thread frontiers: compiler analysis can identify either immediate post-dominators or thread frontiers as re-convergence points, and thread frontiers can re-converge divergent threads earlier.
  • 32. Consequences of architecture differences.
  • 33. Degraded performance portability. Performance (GFLOPS vs. matrix size N) of two OpenCL SGEMM applications, one tuned for AMD and the other for NVIDIA (Fermi), each run on both platforms.
  • 34. Memory traversal patterns. Thread loops change row-major memory accesses into column-major accesses (compare a warp of 4 threads to a warp of 1 across successive cycles).
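The pattern change on slide 34 is easy to see by writing out the two address orders. This is a hypothetical sketch (function names are mine): on the GPU, adjacent threads touch adjacent addresses each cycle, so the warp sweeps memory contiguously; after thread fusion, each serialized thread walks its whole stride-separated range before the next thread starts.

```python
# Sketch: address order before and after thread fusion.

def gpu_order(num_threads, elems_per_thread):
    # Cycle-by-cycle: thread tid accesses step*num_threads + tid,
    # so each cycle the warp touches a contiguous run of addresses.
    order = []
    for step in range(elems_per_thread):
        for tid in range(num_threads):
            order.append(step * num_threads + tid)
    return order

def fused_cpu_order(num_threads, elems_per_thread):
    # Each serialized "thread" finishes before the next starts,
    # producing a strided (column-major) walk over the same addresses.
    order = []
    for tid in range(num_threads):
        for step in range(elems_per_thread):
            order.append(step * num_threads + tid)
    return order
```

The strided walk defeats CPU cache lines and hardware prefetchers, which is the mechanism behind the bandwidth loss on the next slide.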
  • 35. Reduced memory bandwidth on CPUs. Running code optimized for SIMD (GPU) instead of code optimized for a single-threaded CPU reduces memory bandwidth by 10x for a memory microbenchmark on a 4-core CPU.
  • 36. The good news.
  • 37. Scaling across three decades of processors. Many existing applications still scale: a GTX 280 has 40x more peak FLOPS than a Phenom, and 480x more than an Atom.
  • 38. Questions?
  • 39. Databases on GPUs.
  • 40. Who cares about databases?
  • 41. What do applications look like?
  • 42. Gobs of data.
  • 43. Distributed systems.
  • 44. Lots of parallelism.
  • 45. What do CPU algorithms look like?
  • 46. B-trees.
  • 47. Sequential algorithms. (A merge walk over relation 1 and relation 2, comparing elements with <, =, > to build the result.)
  • 48. It doesn't look good. Outlook not so good...
  • 49. Or does it? Where is the parallelism?
  • 50. Flattened trees.
  • 51. Relational algebra.
  • 52. A case study: inner join.
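As context for the case study, the scalar core of a merge-based inner join over two sorted relations can be sketched like this. This is a hypothetical, sequential sketch; the GPU version described on the following slides partitions the sorted inputs recursively and joins blocks in shared memory, but the underlying <, =, > comparison walk is the same.

```python
# Sketch: sort-merge inner join over two key-sorted relations of
# (key, value) pairs, emitting (key, left_value, right_value) tuples.

def inner_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Join the left row with every right row sharing this key.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out
```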
  • 53. 1. Recursive partitioning.
  • 54. 2. Block streaming. Blocking into pages, shared memory buffers, and transaction-sized chunks makes memory accesses efficient.
  • 55. 3. Shared memory merging network. A network for join can be constructed, similar to a sorting network.
  • 56. 4. Data chunking. Stream compaction packs result data into chunks that can be streamed out of shared memory efficiently.
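The compaction step on slide 56 can be sketched with an exclusive prefix sum, the standard data-parallel way to pack a sparse result buffer into a dense one. This is a hypothetical scalar sketch (names are mine); on the GPU each step maps to a parallel scan and scatter in shared memory.

```python
# Sketch: stream compaction via exclusive prefix sum and scatter.

def compact(values, keep):
    # keep[i] is 1 if values[i] is a result element, else 0.
    # An exclusive prefix sum over `keep` gives each kept element
    # its slot in the dense output.
    slots, total = [], 0
    for k in keep:
        slots.append(total)
        total += k
    out = [None] * total
    for v, k, s in zip(values, keep, slots):
        if k:
            out[s] = v                    # scatter into the dense chunk
    return out
```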
  • 57. Operator fusion.
  • 58. Will it blend?
  • 59. Yes, it blends.

        Operator        NVIDIA C2050      Phenom 9570
        inner-join      26.4-32.3 GB/s    0.11-0.63 GB/s
        select          104.2 GB/s        2.55 GB/s
        set operators   45.8 GB/s         0.72 GB/s
        projection      54.3 GB/s         2.34 GB/s
        cross product   98.8 GB/s         2.67 GB/s
  • 60. Questions?
  • 61. Conclusions. Emerging heterogeneous architectures need matching execution model abstractions; dynamic compilation can enable portability. When writing massively parallel codes, consider: data structures and algorithms; the mapping onto the execution model; transformations in the compiler/runtime; and the processor micro-architecture.
  • 62. Thoughts on open source software.
  • 63. Questions? Contact me: gregory.diamos@gatech.edu. Contribute to Harmony, Ocelot, and Vanaheimr: http://code.google.com/p/harmonyruntime/ http://code.google.com/p/gpuocelot/ http://code.google.com/p/vanaheimr/