Massively Parallel Computing
                          CS 264 / CSCI E-292
Lecture #2: Architecture, Theory & Patterns | February 1st, 2011




                Nicolas Pinto (MIT, Harvard)
                       pinto@mit.edu
Objectives

• introduce important computational thinking
  skills for massively parallel computing
• understand hardware limitations
• understand algorithm constraints
• identify common patterns
During this course,

we’ll try to

          “                         ”

and use existing material ;-)

                                        (adapted for CS264)
Outline

• Thinking Parallel
• Architecture
• Programming Model
• Bits of Theory
• Patterns
Motivation

•   The most economic number of components in an IC will double every year

•   Historically: CPUs get faster

    •   Hardware reaching frequency limitations

•   Now: CPUs get wider

                                             slide by Matthew Bolitho
Motivation

•   Rather than expecting CPUs to get twice as
    fast, expect to have twice as many!

•   Parallel processing for the masses

•   Unfortunately, parallel programming is hard!

    •   Algorithms and Data Structures must be
        fundamentally redesigned

                                           slide by Matthew Bolitho
Thinking Parallel
Getting your feet wet

• Common scenario: “I want to make the
  algorithm X run faster, help me!”


• Q: How do you approach the problem?
How?
• Option 1: wait
• Option 2: gcc -O3 -msse4.2
• Option 3: xlc -O5
• Option 4: use parallel libraries (e.g. (cu)blas)
• Option 5: hand-optimize everything!
• Option 6: wait more
What else ?
How about
 analysis ?
Getting your feet wet
           Algorithm X v1.0 Profiling Analysis on Input 10x10x10

                          time (s)
             load_data()       29    (sequential in nature)
             foo()             10    (sequential in nature)
             bar()             11    (sequential in nature)
             yey()             50    (100% parallelizable)

             Q: What is the maximum speed up ?
Getting your feet wet
           Algorithm X v1.0 Profiling Analysis on Input 10x10x10
           (same profile as above)

                                       A: 2X ! :-(
Getting your feet wet
           Algorithm X v1.0 Profiling Analysis on Input 100x100x100

                          time (s)
             load_data()      350    (sequential in nature)
             foo()            250    (sequential in nature)
             bar()            300    (sequential in nature)
             yey()          9,000    (100% parallelizable)

                                      Q: and now?
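
Working the numbers: a minimal C sketch of the Amdahl's-law arithmetic behind
these two questions (max_speedup is a helper of my own, not course code):

           #include <stdio.h>

           /* Amdahl's law: with unlimited processors the parallelizable
              part shrinks toward zero, so the serial part bounds the
              achievable speedup. */
           static double max_speedup(double serial, double parallel)
           {
             return (serial + parallel) / serial;
           }

           int main(void)
           {
             /* 10x10x10: 29+10+11 s sequential, 50 s parallelizable. */
             printf("10x10x10:    %.1fx\n", max_speedup(29 + 10 + 11, 50));
             /* 100x100x100: 900 s sequential, 9,000 s parallelizable. */
             printf("100x100x100: %.1fx\n", max_speedup(350 + 250 + 300, 9000));
             return 0;
           }

The parallelizable part dominates at the larger input, so the ceiling rises
from 2X to 11X: profile on realistic input sizes.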
You need to...
• ... understand the problem (duh!)
• ... study the current (sequential?) solutions and
  their constraints
• ... know the input domain
• ... profile accordingly
• ... “refactor” based on new constraints (hw/sw)
A better way ?

                                  ...

                        (doesn’t scale!)

Speculation: (input) domain-aware optimization using
some sort of probabilistic modeling ?
Some Perspective
The “problem tree” for scientific problem solving

                               Technical Problem to be Analyzed
                                           |      (consultation with experts)
          Scientific Model "A"                               Model "B"
                                           |      (theoretical analysis, experiments)
          Discretization "A"            Discretization "B"
                                           |
          Iterative equation solver           Direct elimination equation solver
                                           |
          Parallel implementation        Sequential implementation

  Figure 11: The “problem tree” for scientific problem solving. There are many
  options to try to achieve the same goal.
                                                                        from Scott et al. “Scientific Parallel Computing” (2005)
Computational Thinking

• translate/formulate domain problems into
  computational models that can be solved
  efficiently by available computing resources


• requires a deep understanding of their
  relationships


                                        adapted from Hwu & Kirk (PASI 2011)
Getting ready...

   Architecture · Programming Models · Algorithms · Patterns · Languages · Compilers
                                        ↓
                                Parallel Thinking
                                        ↓
                                Parallel Computing
                                        ↓
                                 APPLICATIONS

                                                       adapted from Scott et al. “Scientific Parallel Computing” (2005)
Fundamental Skills

• Computer architecture
• Programming models and compilers
• Algorithm techniques and patterns
• Domain knowledge
Computer Architecture
critical for understanding tradeoffs between algorithms




 • memory organization, bandwidth and latency;
   caching and locality (memory hierarchy)
 • floating-point precision vs. accuracy
 • SISD, SIMD, MISD, MIMD vs. SIMT, SPMD
Programming models
 for optimal data structure and code execution




• parallel execution models (threading hierarchy)
• optimal memory access patterns
• array data layout and loop transformations
Algorithms and patterns
• toolbox for designing good parallel algorithms
• it is critical to understand their scalability and
  efficiency
• many have been identified and documented
• sometimes hard to “extract”
• ... but keep trying!
Domain Knowledge

• abstract modeling
• mathematical properties
• accuracy requirements
• coming back to the drawing board to expose
  more/better parallelism ?
You can do it!


• thinking parallel is not as hard as you may think
• many techniques have been thoroughly explained...
• ... and are now “accessible” to non-experts !
Architecture
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
What’s in a computer?




adapted from Berger & Klöckner (NYU 2010)
What’s in a computer?



             Processor




             Intel Q6600 Core2 Quad, 2.4 GHz

adapted from Berger & Klöckner (NYU 2010)
What’s in a computer?
                                                          Die

             Processor

             Intel Q6600 Core2 Quad, 2.4 GHz
             (2×) 143 mm², 2 × 2 cores
             582,000,000 transistors
             ∼ 100 W

adapted from Berger & Klöckner (NYU 2010)
What’s in a computer?
                                                      Memory




adapted from Berger & Klöckner (NYU 2010)
Architecture

• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
A Basic Processor
                                                                                   Memory Interface
                                              Address ALU                                 Address Bus


                                                                                          Data Bus
                                            Register File
                                                   Flags


                                                            Internal Bus

             Insn.
             fetch                                PC
                                                                           Data ALU
                                        Control Unit

           (loosely based on Intel 8086)

adapted from Berger & Klöckner (NYU 2010)
How all of this fits together

      Everything synchronizes to the Clock.
      Control Unit (“CU”): The brains of the
      operation. Everything connects to it.
      Bus entries/exits are gated and
      (potentially) buffered.
      CU controls gates, tells other units
      about ‘what’ and ‘how’:
            • What operation?
            • Which register?
            • Which addressing mode?

adapted from Berger & Klöckner (NYU 2010)
What is. . . an ALU?
      Arithmetic Logic Unit
      One or two operands A, B
      Operation selector (Op):
            • (Integer) Addition, Subtraction
            • (Logical) And, Or, Not
            • (Bitwise) Shifts (equivalent to
                 multiplication by power of two)
            • (Integer) Multiplication, Division
      Specialized ALUs:
            • Floating Point Unit (FPU)
            • Address ALU
      Operates on binary representations of
      numbers. Negative numbers represented
      by two’s complement.

adapted from Berger & Klöckner (NYU 2010)
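
As a small aside on two’s complement, a runnable sketch of my own (not from
the slides): negate by inverting the bits and adding one.

           #include <stdio.h>
           #include <stdint.h>

           int main(void)
           {
             /* -5 in 8-bit two's complement: invert the bits of 5, add 1. */
             uint8_t five = 0x05;                  /* 0000 0101 */
             uint8_t neg5 = (uint8_t)(~five + 1);  /* 1111 1011 = 0xFB */
             printf("0x%02X -> %d\n", neg5, (int8_t)neg5); /* 0xFB -> -5 */
             return 0;
           }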
What is. . . a Register File?


      Registers are On-Chip Memory
                                                                               %r0
            • Directly usable as operands in                                   %r1
                 Machine Language                                              %r2
            • Often “general-purpose”                                          %r3
            • Sometimes special-purpose: Floating                              %r4
                 point, Indexing, Accumulator                                  %r5
• Small: x86_64: 16×64 bit GPRs                                    %r6
            • Very fast (near-zero latency)                                    %r7




adapted from Berger & Klöckner (NYU 2010)
How does computer memory work?
           One (reading) memory transaction (simplified):

                                                  D0..15
                    Processor                     A0..15                Memory
                                                  R/W̄
                                                  CLK

           Observation: Access (and addressing) happens
           in bus-width-size “chunks”.
adapted from Berger & Klöckner (NYU 2010)
What is. . . a Memory Interface?


      Memory Interface gets and stores binary
      words in off-chip memory.
      Smallest granularity: Bus width
      Tells outside memory
            • “where” through address bus
            • “what” through data bus
      Computer main memory is “Dynamic RAM”
      (DRAM): Slow, but small and cheap.




adapted from Berger & Klöckner (NYU 2010)
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
A Very Simple Program


                                              4:   c7   45   f4 05 00 00 00 movl   $0x5,−0xc(%rbp)
                                              b:   c7   45   f8 11 00 00 00 movl   $0x11,−0x8(%rbp)
           int a = 5;
                                             12:   8b   45   f4             mov    −0xc(%rbp),%eax
           int b = 17;
                                             15:   0f   af   45 f8          imul   −0x8(%rbp),%eax
           int z = a * b;                    19:   89   45   fc             mov    %eax,−0x4(%rbp)
                                             1c:   8b   45   fc             mov    −0x4(%rbp),%eax

           Things to know:
                 • Addressing modes (Immediate, Register, Base plus Offset)
                 • 0xHexadecimal
                 • “AT&T Form”: (we’ll use this)
                      <opcode><size> <source>, <dest>




adapted from Berger & Klöckner (NYU 2010)
A Very Simple Program: Intel Form


                4:          c7     45       f4 05 00 00 00   mov    DWORD PTR [rbp−0xc],0x5
                b:          c7     45       f8 11 00 00 00   mov    DWORD PTR [rbp−0x8],0x11
               12:          8b     45       f4               mov    eax,DWORD PTR [rbp−0xc]
               15:          0f     af       45 f8            imul   eax,DWORD PTR [rbp−0x8]
               19:          89     45       fc               mov    DWORD PTR [rbp−0x4],eax
               1c:          8b     45       fc               mov    eax,DWORD PTR [rbp−0x4]


                 • “Intel Form”: (you might see this on the net)
                      <opcode> <sized dest>, <sized source>
                 • Goal: Reading comprehension.
                 • Don’t understand an opcode?
                      Google “<opcode> intel instruction”.




adapted from Berger & Klöckner (NYU 2010)
Machine Language Loops
                                               0:   55                          push     %rbp
                                               1:   48   89   e5                mov      %rsp,%rbp
    int main()                                 4:   c7   45   f8 00 00 00 00    movl     $0x0,−0x8(%rbp)
    {                                          b:   c7   45   fc 00 00 00 00    movl     $0x0,−0x4(%rbp)
      int y = 0, i ;                          12:   eb   0a                     jmp      1e <main+0x1e>
                                              14:   8b   45   fc                mov      −0x4(%rbp),%eax
      for ( i = 0;
                                              17:   01   45   f8                add      %eax,−0x8(%rbp)
          y < 10; ++i)                        1a:   83   45   fc 01             addl     $0x1,−0x4(%rbp)
        y += i;                               1e:   83   7d   f8 09             cmpl     $0x9,−0x8(%rbp)
      return y;                               22:   7e   f0                      jle     14 <main+0x14>
                                              24:   8b   45   f8                mov      −0x8(%rbp),%eax
    }                                         27:   c9                          leaveq
                                              28:   c3                          retq

           Things to know:
                 • Condition Codes (Flags): Zero, Sign, Carry, etc.
                 • Call Stack: Stack frame, stack pointer, base pointer
                 • ABI: Calling conventions



adapted from Berger & Klöckner (NYU 2010)
Machine Language Loops

           Want to make those yourself?
           Write myprogram.c.
           $ cc -c myprogram.c
           $ objdump --disassemble myprogram.o

adapted from Berger & Klöckner (NYU 2010)
We know how a computer works!


           All of this can be built in about 4000 transistors.
           (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600)
           So what exactly is Intel doing with the other 581,996,000
           transistors?
           Answer: Make things go faster!
           Goal now:
           Understand sources of slowness, and how they get addressed.
           Remember: High Performance Computing




adapted from Berger & Klöckner (NYU 2010)
The High-Performance Mindset
                                                   Writing high-performance codes
                                                  Mindset: What is going to be the limiting
                                                  factor?
                                                    • ALU?
                                                    • Memory?
                                                    • Communication? (if multi-machine)

                                                  Benchmark the assumed limiting factor right
                                                  away.
                                                  Evaluate
                                                    • Know your peak throughputs (roughly)
                                                    • Are you getting close?
                                                    • Are you tracking the right limiting factor?

adapted from Berger & Klöckner (NYU 2010)
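
For the memory case, a minimal benchmarking sketch of my own (assumes POSIX
clock_gettime; sizes are illustrative, and the result is only a rough
effective bandwidth, not a rigorous benchmark):

           #include <stdio.h>
           #include <stdlib.h>
           #include <time.h>

           int main(void)
           {
             const size_t n = 64 * 1024 * 1024;           /* 256 MB of ints */
             int *a = malloc(n * sizeof(int));
             if (!a) return 1;
             for (size_t i = 0; i < n; ++i) a[i] = 1;     /* touch all pages */

             struct timespec t0, t1;
             clock_gettime(CLOCK_MONOTONIC, &t0);
             long long sum = 0;
             for (size_t i = 0; i < n; ++i) sum += a[i];  /* read-dominated pass */
             clock_gettime(CLOCK_MONOTONIC, &t1);

             double secs = (t1.tv_sec - t0.tv_sec)
                         + (t1.tv_nsec - t0.tv_nsec) / 1e9;
             /* Printing sum keeps the compiler from deleting the loop. */
             printf("sum=%lld, %.2f GB/s\n", sum, n * sizeof(int) / secs / 1e9);
             free(a);
             return 0;
           }

Compare the printed number against your machine’s peak memory bandwidth to
see how close you are to the limiting factor.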
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
Source of Slowness: Memory
           Memory is slow.
           Distinguish two different versions of “slow”:
             • Bandwidth
             • Latency
           → Memory has long latency, but can have large bandwidth.




                                                              Idea:
                                                              Put a look-up table of
                                                              recently-used data onto
                                                              the chip.
           Size of die vs. distance to memory: big!
                                                              → “Cache”
           Dynamic RAM: long intrinsic latency!
adapted from Berger & Klöckner (NYU 2010)
The Memory Hierarchy
           Hierarchy of increasingly bigger, slower memories:
    faster
                                               Registers       1 kB, 1 cycle

                                              L1 Cache         10 kB, 10 cycles

                                              L2 Cache         1 MB, 100 cycles

                                                DRAM           1 GB, 1000 cycles

                                            Virtual Memory
                                                               1 TB, 1 M cycles
                                              (hard drive)
                                                                                            bigger



adapted from Berger & Klöckner (NYU 2010)
Performance of computer system

[Figure: performance of the computer system versus size of the problem being
solved. Performance steps down each time the entire problem no longer fits
within registers, then within cache, then within main memory; it collapses
once the problem requires secondary (disk) memory, and eventually the problem
is too big for the system.]

from Scott et al. “Scientific Parallel Computing” (2005)
The Memory Hierarchy
           Hierarchy of increasingly bigger, slower memories:

                                               Registers       1 kB, 1 cycle

                                              L1 Cache         10 kB, 10 cycles

                                              L2 Cache         1 MB, 100 cycles

                                                DRAM           1 GB, 1000 cycles

                                            Virtual Memory
                                              (hard drive)     1 TB, 1 M cycles

                                                              How might data locality
                                                              factor into this?
                                                              What is a working set?

adapted from Berger & Klöckner (NYU 2010)
Cache: Actual Implementation
         Demands on cache implementation:
               • Fast, small, cheap, low-power
               • Fine-grained
               • High “hit”-rate (few “misses”)

           Problem:
            Goals at odds with each other: Access matching logic expensive!

           Solution 1: More data per unit of access matching logic
                                                     → Larger “Cache Lines”

           Solution 2: Simpler/less access matching logic
                                             → Less than full “Associativity”

           Other choices: Eviction strategy, size


adapted from Berger & Klöckner (NYU 2010)
Cache: Associativity

      Direct Mapped: each memory address maps to exactly one cache slot.
      2-way set associative: each memory address may occupy either of the
      two slots in its set.

      [Figure: memory locations 0, 1, 2, ... mapped onto a 4-entry cache,
      direct mapped vs. 2-way set associative; miss rate versus cache size
      on the Integer portion of SPEC CPU2000 (Cantin, Hill 2003)]

adapted from Berger & Klöckner (NYU 2010)
Cache Example: Intel Q6600/Core2 Quad

           --- L1 data cache ---
           fully associative cache    =     false
           threads sharing this cache =     0x0 (0)
           processor cores on this die=     0x3 (3)
           system coherency line size =     0x3f (63)
           ways of associativity      =     0x7 (7)
           number of sets - 1 (s)     =     63

           --- L1 instruction ---
           fully associative cache    =     false       --- L2 unified cache ---
           threads sharing this cache =     0x0 (0)     fully associative cache    = false
           processor cores on this die=     0x3 (3)     threads sharing this cache = 0x1 (1)
           system coherency line size =     0x3f (63)   processor cores on this die= 0x3 (3)
           ways of associativity      =     0x7 (7)     system coherency line size = 0x3f (63)
           number of sets - 1 (s)     =     63          ways of associativity      = 0xf (15)
                                                        number of sets - 1 (s)     = 4095

           More than you care to know about your CPU:
           http://www.etallen.com/cpuid.html

adapted from Berger & Klöckner (NYU 2010)
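
As a sanity check on those numbers, my own arithmetic (assuming the usual
size = line size × ways × sets, with each cpuid field reported minus one):

           #include <stdio.h>

           int main(void)
           {
             /* Fields above are encoded as (value - 1). */
             unsigned line = 63 + 1;                      /* 64-byte lines */
             unsigned ways_l1 = 7 + 1,  sets_l1 = 63 + 1;
             unsigned ways_l2 = 15 + 1, sets_l2 = 4095 + 1;
             printf("L1 data: %u KB\n", line * ways_l1 * sets_l1 / 1024); /* 32 KB */
             printf("L2:      %u KB\n", line * ways_l2 * sets_l2 / 1024); /* 4096 KB */
             return 0;
           }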
Measuring the Cache I

           void go(unsigned count, unsigned stride)
           {
             const unsigned arr_size = 64 * 1024 * 1024;
             int *ary = (int *) malloc(sizeof(int) * arr_size);

             /* Touch every stride-th int, count passes over the array. */
             for (unsigned it = 0; it < count; ++it)
             {
               for (unsigned i = 0; i < arr_size; i += stride)
                 ary[i] *= 17;
             }

             free(ary);
           }

adapted from Berger & Klöckner (NYU 2010)
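
A possible driver for the go() above (a sketch; the timing method and stride
values are my own choices). Expect the time per pass to stay roughly flat
while the stride remains within one cache line, since every line must be
fetched anyway, and then to halve with each further doubling:

           #include <stdio.h>
           #include <time.h>

           void go(unsigned count, unsigned stride);   /* the function above */

           int main(void)
           {
             for (unsigned stride = 1; stride <= 1024; stride *= 2)
             {
               clock_t t0 = clock();
               go(1, stride);
               printf("stride %4u: %.3f s\n", stride,
                      (double)(clock() - t0) / CLOCKS_PER_SEC);
             }
             return 0;
           }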
Measuring the Cache II


           void go(unsigned array_size, unsigned steps)
           {
             int *ary = (int *) malloc(sizeof(int) * array_size);
             /* Mask for cheap wrap-around: assumes array_size is a power of two. */
             unsigned asm1 = array_size - 1;

             /* Hop 16 ints (64 bytes, one cache line) per step. */
             for (unsigned i = 0; i < steps; ++i)
               ary[(i * 16) & asm1] ++;

             free(ary);
           }

adapted from Berger & Klöckner (NYU 2010)
Measuring the Cache III

           void go(unsigned array_size, unsigned stride, unsigned steps)
           {
             char *ary = (char *) malloc(sizeof(int) * array_size);

             /* Walk the array with a fixed byte stride, wrapping at the end. */
             unsigned p = 0;
             for (unsigned i = 0; i < steps; ++i)
             {
               ary[p] ++;
               p += stride;
               if (p >= array_size)
                 p = 0;
             }

             free(ary);
           }

adapted from Berger & Klöckner (NYU 2010)
Mike Bauer (Stanford)
http://sequoia.stanford.edu/




Tue 4/5/11: Guest Lecture by Mike Bauer (Stanford)
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
Source of Slowness: Sequential Operation




                                    IF Instruction fetch
                                   ID Instruction Decode
                                 EX Execution
                           MEM Memory Read/Write
                               WB Result Writeback




adapted from Berger & Klöckner (NYU 2010)
Solution: Pipelining




adapted from Berger & Klöckner (NYU 2010)
Pipelining




           (MIPS, 110,000 transistors)

adapted from Berger & Klöckner (NYU 2010)
Issues with Pipelines


      Pipelines generally help
      performance, but not always.
      Possible issues:
            • Stalls
            • Dependent Instructions
            • Branches (+Prediction)
            • Self-Modifying Code
      “Solution”: Bubbling, extra
      circuitry

adapted from Berger & Klöckner (NYU 2010)
Intel Q6600 Pipeline
                                                            New concept:
                                                            Instruction-level
                                                            parallelism
                                                            (“Superscalar”)




adapted from Berger & Klöckner (NYU 2010)
Programming for the Pipeline



           How to upset a processor pipeline:

           for (int i = 0; i < 1000; ++i)
             for (int j = 0; j < 1000; ++j)
             {
               if (j % 2 == 0)
                 do_something(i, j);
             }

                                                                     . . . why is this bad?

adapted from Berger & Klöckner (NYU 2010)
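
One way out, as a sketch of my own (do_something here is a hypothetical
stand-in for the slide's placeholder): fold the even-j test into the loop
increment, so the pipeline never sees that per-iteration branch.

           #include <stdio.h>

           static long acc = 0;
           /* Hypothetical stand-in for the slide's do_something(i, j). */
           static void do_something(int i, int j) { acc += i + j; }

           int main(void)
           {
             /* Same work as the loop above, but stepping j by 2 replaces
                the branchy "if (j % 2 == 0)" test entirely. */
             for (int i = 0; i < 1000; ++i)
               for (int j = 0; j < 1000; j += 2)
                 do_something(i, j);
             printf("%ld\n", acc);
             return 0;
           }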
A Puzzle


           int steps = 256 * 1024 * 1024;
           int[] a = new int[2];

           // Loop 1
           for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

           // Loop 2
           for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

           Which is faster?

                                                                       . . . and why?

adapted from Berger & Klöckner (NYU 2010)
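
To try the puzzle yourself, a minimal C translation with timing (the slide's
snippet is Java/C#-style; volatile is used here so the compiler keeps every
increment):

           #include <stdio.h>
           #include <time.h>

           int main(void)
           {
             const int steps = 256 * 1024 * 1024;
             volatile int a[2] = {0, 0};

             clock_t t0 = clock();
             for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }
             double t_loop1 = (double)(clock() - t0) / CLOCKS_PER_SEC;

             t0 = clock();
             for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }
             double t_loop2 = (double)(clock() - t0) / CLOCKS_PER_SEC;

             printf("loop 1: %.2f s, loop 2: %.2f s\n", t_loop1, t_loop2);
             return 0;
           }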
Two useful Strategies

           Loop unrolling:

           for (int i = 0; i < 1000; ++i)
             do_something(i);

           becomes

           for (int i = 0; i < 1000; i += 2)
           {
             do_something(i);
             do_something(i + 1);
           }

           Software pipelining:

           for (int i = 0; i < 1000; ++i)
           {
             do_a(i);
             do_b(i);
           }

           becomes

           for (int i = 0; i < 1000; i += 2)
           {
             do_a(i);
             do_a(i + 1);
             do_b(i);
             do_b(i + 1);
           }

adapted from Berger & Klöckner (NYU 2010)
SIMD
           Control Units are large and expensive.                         SIMD        Instruction Pool

           Functional Units are simple and cheap.
           → Increase the Function/Control ratio:




                                                                          Data Pool
           Control several functional units with
           one control unit.
           All execute same operation.

           GCC vector extensions:

           typedef int v4si __attribute__ ((vector_size (16)));

           v4si a, b, c;
           c = a + b;
           // +, −, *, /, unary minus, ^, |, &, ~, %

           Will revisit for OpenCL, GPUs.

adapted from Berger & Klöckner (NYU 2010)
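
A self-contained version of that snippet, as a sketch (vector subscripting
as used here needs a reasonably recent GCC):

           #include <stdio.h>

           /* Four int additions issued as one SIMD operation. */
           typedef int v4si __attribute__ ((vector_size (16)));

           int main(void)
           {
             v4si a = {1, 2, 3, 4};
             v4si b = {10, 20, 30, 40};
             v4si c = a + b;              /* element-wise, one SSE add */
             for (int i = 0; i < 4; ++i)
               printf("%d ", c[i]);       /* prints: 11 22 33 44 */
             printf("\n");
             return 0;
           }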
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
GPUs ?
•   Designed for math-intensive, parallel problems

•   More transistors dedicated to ALU than flow control and data cache
“CPU-style” Cores


                              Fetch/                    Out-of-order control logic
                              Decode
                                                          Fancy branch predictor
                                ALU
                              (Execute)
                                                             Memory pre-fetcher
                            Execution
                             Context
                                                                    Data cache
                                                                      (A big one)




      SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
   Credit: Kayvon Fatahalian (Stanford)
Slimming down


                             Fetch/
                             Decode
                                                    Idea #1:
                               ALU                  Remove components that
                             (Execute)
                                                    help a single instruction
                           Execution                stream run fast
                            Context




      SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
   Credit: Kayvon Fatahalian (Stanford)
   slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
More Space: Double the Number of Cores
Two cores (two fragments in parallel)

     fragment 1                              fragment 2

     Fetch/Decode                            Fetch/Decode
     ALU (Execute)                           ALU (Execute)
     Execution Context                       Execution Context

   Each core runs the same fragment shader (shown under each core in the
   original figure):

     <diffuseShader>:
     sample r0, v4, t0, s0
     mul  r3, v0, cb0[0]
     madd r3, v1, cb0[1], r3
     madd r3, v2, cb0[2], r3
     clmp r3, r3, l(0.0), l(1.0)
     mul  o0, r0, r3
     mul  o1, r1, r3
     mul  o2, r2, r3
     mov  o3, l(1.0)

   SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
   Credit: Kayvon Fatahalian (Stanford)
   slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
. . . again
Four cores (four fragments in parallel)


                                                          Fetch/                  Fetch/
                                                          Decode                  Decode

                                                            ALU                     ALU
                                                         (Execute)               (Execute)

                                                         Execution               Execution
                                                          Context                 Context




                                                          Fetch/                  Fetch/
                                                          Decode                  Decode

                                                            ALU                     ALU
                                                         (Execute)               (Execute)

                                                         Execution               Execution
                                                          Context                 Context




SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
             Credit: Kayvon Fatahalian (Stanford)
             slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
. . . and again
Sixteen cores (sixteen fragments in parallel)
                                                ALU          ALU         ALU       ALU




                                                ALU          ALU         ALU       ALU




                                                ALU          ALU         ALU       ALU




                                                ALU          ALU         ALU       ALU




                                 16 cores = 16 simultaneous instruction streams

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
             Credit: Kayvon Fatahalian (Stanford)
             slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
. . . and again

                      → 16 independent instruction streams

                      Reality: instruction streams not actually
                      very different/independent

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
             Credit: Kayvon Fatahalian (Stanford)
             slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
Saving Yet More Space
Recall: simple processing core

               Fetch/
               Decode


                ALU
               (Execute)



            Execution
             Context




    Credit: Kayvon Fatahalian (Stanford)
    slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
Saving Yet More Space

                                                   Idea #2
                                                   Amortize cost/complexity of
                                                   managing an instruction stream
                                                   across many ALUs
                                                   → SIMD

    Credit: Kayvon Fatahalian (Stanford)
    slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
Saving Yet More Space
Add ALUs

      Fetch/Decode
      ALU 1   ALU 2   ALU 3   ALU 4
      ALU 5   ALU 6   ALU 7   ALU 8
      Ctx     Ctx     Ctx     Ctx
      Ctx     Ctx     Ctx     Ctx
      Shared Ctx Data

                                                   Idea #2:
                                                   Amortize cost/complexity of
                                                   managing an instruction stream
                                                   across many ALUs
                                                   → SIMD processing

    Credit: Kayvon Fatahalian (Stanford)
    slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
Gratuitous Amounts of Parallelism!
(128 fragments in parallel)

                  Example:
                  128 instruction streams in parallel
                  16 independent groups of 8 synchronized streams

                        16 cores = 128 ALUs
                                 = 16 simultaneous instruction streams

http://www.youtube.com/watch?v=1yH_j8-VVLo
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
             Credit: Kayvon Fatahalian (Stanford)
             slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
Remaining Problem: Slow Memory


 Problem
 Memory still has very high latency. . .
 . . . but we’ve removed most of the
 hardware that helps us deal with that.

 We’ve removed
     caches
     branch prediction                              Idea #3
     out-of-order execution                                 Even more parallelism
 So what now?                                         +     Some extra memory
                                                      =     A solution!


                    slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
Remaining Problem: Slow Memory

[Figure: the SIMD core (Fetch/Decode plus eight ALUs) stores several
execution contexts (1-4) in its on-chip context storage and interleaves
their execution to hide memory latency.]

                    slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
Intro PyOpenCL      What and Why? OpenCL
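A hedged CUDA sketch of Idea #3 in practice (added here, not from the original deck): launch far more threads than there are ALUs, so that when one SIMD group stalls on a load the core can switch to another resident group. The stored contexts above are what make that switch cheap.

#include <cuda_runtime.h>

__global__ void axpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // memory-bound: latency hidden by parallelism
}

int main() {
    const int n = 1 << 24;        // millions of threads oversubscribe the cores
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    axpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}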


GPU Architecture Summary


 Core Ideas:

   1   Many slimmed down cores
       → lots of parallelism

   2   More ALUs, Fewer Control Units

   3   Avoid memory stalls by interleaving
       execution of SIMD groups
       (“warps”)



   Credit: Kayvon Fatahalian (Stanford)

                       slide by Andreas Klöckner (GPU-Python with PyOpenCL and PyCUDA)
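All three core ideas are visible directly in what the device reports; a small CUDA sketch, assuming device 0, added for illustration. The "cores" surface as multiprocessors, and warpSize is the width of one synchronized SIMD group.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("device: %s\n", p.name);
    printf("multiprocessors (slimmed-down cores): %d\n", p.multiProcessorCount);
    printf("warp size (SIMD group width):         %d\n", p.warpSize);
    printf("registers per block (private ctx):    %d\n", p.regsPerBlock);
    printf("shared memory per block (shared ctx): %zu bytes\n", p.sharedMemPerBlock);
    return 0;
}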
Is it free?
!   What are the consequences?
!   Programs must be more predictable!
    !   Data access coherency
    !   Program flow



                                        slide by Matthew Bolitho
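A short CUDA sketch of the "program flow" consequence (added for this write-up): threads in a warp share one instruction stream, so a branch that splits the warp serializes both sides. The neighboring-address indexing also hints at the "data access coherency" point: adjacent threads should touch adjacent memory.

#include <cuda_runtime.h>

__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = 0.0f;   // even lanes execute while odd lanes sit idle...
    else
        out[i] = 1.0f;   // ...then odd lanes execute while even lanes idle
}

int main() {
    float *out;
    cudaMalloc(&out, 64 * sizeof(float));
    divergent<<<2, 32>>>(out);   // 32 threads = one full warp per block
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}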
                Some terminology

 [diagram: “distributed memory” (left): each processor P has its own
 private memory M, exchanging data over an interconnection network;
 “shared memory” (right): processors P access shared memory modules M
 through an interconnection network]

      approach increasingly common; now: mostly hybrid
                 Some More Terminology

One way to classify machines distinguishes between:

shared memory: global memory can be accessed by all processors or
cores. Information is exchanged between threads using shared variables
written by one thread and read by another. Need to coordinate access to
shared variables.

distributed memory: private memory for each processor, only accessible
by this processor, so no synchronization for memory accesses is needed.
Information is exchanged by sending data from one processor to another
via an interconnection network using explicit communication operations.

[diagram: as on the previous slide, processors P and memories M joined
by an interconnection network]
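In CUDA terms the same distinction shows up inside a single GPU; a hedged sketch, added here for illustration: threads in one block communicate through a __shared__ variable and must coordinate with barriers and atomics, whereas separate processors would have to exchange explicit messages (the distributed-memory model, e.g. MPI).

#include <cuda_runtime.h>

__global__ void block_sum(const float *x, float *out) {
    __shared__ float acc;                 // shared variable, one copy per block
    if (threadIdx.x == 0) acc = 0.0f;
    __syncthreads();                      // coordinate: all threads see acc = 0
    atomicAdd(&acc, x[blockIdx.x * blockDim.x + threadIdx.x]);
    __syncthreads();                      // coordinate: every add has landed
    if (threadIdx.x == 0) out[blockIdx.x] = acc;
}

int main() {
    const int blocks = 4, threads = 64;
    float *x, *out;
    cudaMalloc(&x, blocks * threads * sizeof(float));
    cudaMalloc(&out, blocks * sizeof(float));
    cudaMemset(x, 0, blocks * threads * sizeof(float));
    block_sum<<<blocks, threads>>>(x, out);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(out);
    return 0;
}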
Programming Model
      (Overview)
GPU Architecture




CUDA Programming Model
Connection: Hardware ↔ Programming Model

 [diagram: the GPU as a grid of cores (3 × 3 in the figure), each with
 its own Fetch/Decode unit, 32 kiB of private context (“registers”), and
 16 kiB of shared context]

                    slide by Andreas Klöckner (GPU-Python with PyOpenCL and PyCUDA)
Connection: Hardware ↔ Programming Model

      Why so many cores?

      Idea:
          Program as if there were
          “infinitely” many cores

          Program as if there were
          “infinitely” many ALUs per core

      Consider: Which is easy to do automatically?

          Parallel program → sequential hardware

      or

          Sequential program → parallel hardware?

 [diagram: the same grid of cores, each with Fetch/Decode, 32 kiB private
 context (“registers”), and 16 kiB shared context]

                    slide by Andreas Klöckner (GPU-Python with PyOpenCL and PyCUDA)
Connection: Hardware ↔ Programming Model

      Software representation:

          a 2D Grid, indexed along Axis 0 and Axis 1
          Grid   (Kernel: Function on Grid)
          (Work) Group, or “Block”
          (Work) Item, or “Thread”

      Hardware:

          [diagram: the same grid of cores, each with Fetch/Decode,
          32 kiB private context (“registers”), and 16 kiB shared context]

                    slide by Andreas Klöckner (GPU-Python with PyOpenCL and PyCUDA)
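A minimal CUDA sketch of this mapping (added for this write-up; names and sizes are illustrative): the kernel is the function applied over the grid, each block is a (work) group, and each thread, i.e. (work) item, locates its own cell from indices along Axis 0 and Axis 1.

#include <cuda_runtime.h>

__global__ void on_grid(float *grid, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // Axis 0
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // Axis 1
    if (x < w && y < h)
        grid[y * w + x] = (float)(x + y);            // the function on the grid
}

int main() {
    const int w = 256, h = 256;
    float *d;
    cudaMalloc(&d, w * h * sizeof(float));
    dim3 items(16, 16);                              // one (work) group / block
    dim3 groups((w + 15) / 16, (h + 15) / 16);       // the grid of groups
    on_grid<<<groups, items>>>(d, w, h);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}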
Connection: Hardware ↔ Programming Model

 [diagram: the software-side Grid (Axis 0 × Axis 1) of groups and items
 again, now with a “?” over the link to the hardware side: which core
 will run which (work) group?]

      Software representation                               Hardware

                    slide by Andreas Klöckner (GPU-Python with PyOpenCL and PyCUDA)
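In CUDA the hardware scheduler, not the programmer, assigns groups to cores, in no guaranteed order. A hedged sketch (added here) of the defensive style this implies, the grid-stride loop, so correctness never depends on which core runs which group or on the grid size chosen at launch.

#include <cuda_runtime.h>

__global__ void scale_any(int n, float a, float *x) {
    int stride = gridDim.x * blockDim.x;             // total items in flight
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] = a * x[i];                             // correct for any grid size
}

int main() {
    const int n = 1000000;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    scale_any<<<64, 256>>>(n, 2.0f, x);   // 64 groups still cover all n items
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}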

[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 

Plus de npinto (20)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 

Dernier

Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxAmita Gupta
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxcallscotland1987
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 

Dernier (20)

Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 

[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

  • 18. You need to... • ... understand the problem (duh!) • ... study the current (sequential?) solutions and their constraints • ... know the input domain • ... profile accordingly • ... “refactor” based on new constraints (hw/sw)
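Slide 18's advice to profile is worth making quantitative. A minimal C sketch of Amdahl's law; the fractions below are hypothetical profile numbers, not measurements from the lecture:

    #include <stdio.h>

    /* Amdahl's law: overall speedup when a fraction p of the runtime
       is sped up by a factor s and the rest stays sequential. */
    static double amdahl(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void) {
        /* hypothetical profile: half the runtime is parallelizable */
        printf("p = 0.50:  at most %.2fx\n", amdahl(0.50, 1e9));
        /* hypothetical larger input: 99.9% parallelizable */
        printf("p = 0.999: at most %.2fx\n", amdahl(0.999, 1e9));
        return 0;
    }

The sequential remainder, not the parallel part, sets the ceiling, which is why profiling on realistic input sizes matters so much.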
  • 19. A better way ? (“doesn’t scale!”) Speculation: (input) domain-aware optimization using some sort of probabilistic modeling ?
  • 20. Some Perspective: the “problem tree” for scientific problem solving. Technical Problem to be Analyzed → Consultation with experts → Scientific Model “A” / Model “B” → Theoretical analysis / Discretization “A” / Discretization “B” / Experiments → Iterative equation solver / Direct elimination equation solver → Parallel implementation / Sequential implementation. Figure caption: There are many options to try to achieve the same goal. from Scott et al. “Scientific Parallel Computing” (2005)
  • 21. Computational Thinking • translate/formulate domain problems into computational models that can be solved efficiently by available computing resources • requires a deep understanding of their relationships adapted from Hwu & Kirk (PASI 2011)
  • 22. Getting ready... Programming Models, Architecture, Algorithms, Languages, Patterns, Compilers: Parallel Thinking and Parallel Computing, feeding APPLICATIONS. adapted from Scott et al. “Scientific Parallel Computing” (2005)
  • 23. Fundamental Skills • Computer architecture • Programming models and compilers • Algorithm techniques and patterns • Domain knowledge
  • 24. Computer Architecture: critical in understanding tradeoffs between algorithms • memory organization, bandwidth and latency; caching and locality (memory hierarchy) • floating-point precision vs. accuracy • SISD, SIMD, MISD, MIMD vs. SIMT, SPMD
  • 25. Programming models for optimal data structure and code execution • parallel execution models (threading hierarchy) • optimal memory access patterns • array data layout and loop transformations
  • 26. Algorithms and patterns • toolbox for designing good parallel algorithms • it is critical to understand their scalability and efficiency • many have been exposed and documented • sometimes hard to “extract” • ... but keep trying!
  • 27. Domain Knowledge • abstract modeling • mathematical properties • accuracy requirements • coming back to the drawing board to expose more/better parallelism ?
  • 28. You can do it! • thinking parallel is not as hard as you may think • many techniques have been thoroughly explained... • ... and are now “accessible” to non-experts !
  • 30. Architecture • What’s in a (basic) computer? • Basic Subsystems • Machine Language • Memory Hierarchy • Pipelines • CPUs to GPUs
  • 32. What’s in a computer? adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 33. What’s in a computer? Processor Intel Q6600 Core2 Quad, 2.4 GHz adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 34. What’s in a computer? Processor: Intel Q6600 Core2 Quad, 2.4 GHz. Die (2×): 143 mm², 2 × 2 cores, 582,000,000 transistors, ∼ 100 W. adapted from Berger & Klöckner (NYU 2010)
  • 35. What’s in a computer? adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 36. What’s in a computer? Memory adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 37. Architecture • What’s in a (basic) computer? • Basic Subsystems • Machine Language • Memory Hierarchy • Pipelines • CPUs to GPUs
  • 38. A Basic Processor (block diagram, loosely based on Intel 8086): Memory Interface, Address ALU, Address Bus, Data Bus, Register File, Flags, Internal Bus, Insn. fetch, PC, Data ALU, Control Unit. adapted from Berger & Klöckner (NYU 2010)
  • 39. How all of this fits together: Everything synchronizes to the Clock. Control Unit (“CU”): the brains of the operation, everything connects to it. Bus entries/exits are gated and (potentially) buffered. The CU controls the gates and tells other units about ‘what’ and ‘how’: • What operation? • Which register? • Which addressing mode? adapted from Berger & Klöckner (NYU 2010)
  • 40. What is. . . an ALU? Arithmetic Logic Unit One or two operands A, B Operation selector (Op): • (Integer) Addition, Subtraction • (Logical) And, Or, Not • (Bitwise) Shifts (equivalent to multiplication by power of two) • (Integer) Multiplication, Division Specialized ALUs: • Floating Point Unit (FPU) • Address ALU Operates on binary representations of numbers. Negative numbers represented by two’s complement. adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
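To make the ALU bullet points concrete, a small self-contained C example (arbitrary values) of shift-as-multiplication and of two’s complement negation:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        int32_t x = 13;

        /* a left shift by k is a multiplication by 2^k */
        printf("13 << 3 = %d (13 * 8 = %d)\n", x << 3, x * 8);

        /* two's complement: negation is bitwise-not plus one */
        int32_t neg = (int32_t)(~(uint32_t)x + 1u);
        printf("~13 + 1 = %d\n", neg);   /* prints -13 */
        return 0;
    }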
  • 41. What is. . . a Register File? Registers (%r0 through %r7 in the diagram) are On-Chip Memory • Directly usable as operands in Machine Language • Often “general-purpose” • Sometimes special-purpose: Floating point, Indexing, Accumulator • Small: x86_64: 16 × 64-bit GPRs • Very fast (near-zero latency) adapted from Berger & Klöckner (NYU 2010)
  • 48. How does computer memory work? One (reading) memory transaction (simplified): D0..15 Processor Memory A0..15 ¯ R/W CLK Observation: Access (and addressing) happens in bus-width-size “chunks”. adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 49. What is. . . a Memory Interface? Memory Interface gets and stores binary words in off-chip memory. Smallest granularity: Bus width Tells outside memory • “where” through address bus • “what” through data bus Computer main memory is “Dynamic RAM” (DRAM): Slow, but small and cheap. adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 50. Architecture • What’s in a (basic) computer? • Basic Subsystems • Machine Language • Memory Hierarchy • Pipelines • CPUs to GPUs
  • 51. A Very Simple Program
    int a = 5;
    int b = 17;
    int z = a * b;
    compiles to:
    4:  c7 45 f4 05 00 00 00   movl $0x5,-0xc(%rbp)
    b:  c7 45 f8 11 00 00 00   movl $0x11,-0x8(%rbp)
    12: 8b 45 f4               mov  -0xc(%rbp),%eax
    15: 0f af 45 f8            imul -0x8(%rbp),%eax
    19: 89 45 fc               mov  %eax,-0x4(%rbp)
    1c: 8b 45 fc               mov  -0x4(%rbp),%eax
    Things to know: • Addressing modes (Immediate, Register, Base plus Offset) • 0xHexadecimal • “AT&T Form” (we’ll use this): <opcode><size> <source>, <dest>
    adapted from Berger & Klöckner (NYU 2010)
  • 52. A Very Simple Program: Intel Form
    4:  c7 45 f4 05 00 00 00   mov DWORD PTR [rbp-0xc],0x5
    b:  c7 45 f8 11 00 00 00   mov DWORD PTR [rbp-0x8],0x11
    12: 8b 45 f4               mov eax,DWORD PTR [rbp-0xc]
    15: 0f af 45 f8            imul eax,DWORD PTR [rbp-0x8]
    19: 89 45 fc               mov DWORD PTR [rbp-0x4],eax
    1c: 8b 45 fc               mov eax,DWORD PTR [rbp-0x4]
    • “Intel Form” (you might see this on the net): <opcode> <sized dest>, <sized source> • Goal: Reading comprehension. • Don’t understand an opcode? Google “<opcode> intel instruction”.
    adapted from Berger & Klöckner (NYU 2010)
  • 53. Machine Language Loops
    int main()
    {
        int y = 0, i;
        for (i = 0; y < 10; ++i)
            y += i;
        return y;
    }
    compiles to:
    0:  55                      push %rbp
    1:  48 89 e5                mov  %rsp,%rbp
    4:  c7 45 f8 00 00 00 00    movl $0x0,-0x8(%rbp)
    b:  c7 45 fc 00 00 00 00    movl $0x0,-0x4(%rbp)
    12: eb 0a                   jmp  1e <main+0x1e>
    14: 8b 45 fc                mov  -0x4(%rbp),%eax
    17: 01 45 f8                add  %eax,-0x8(%rbp)
    1a: 83 45 fc 01             addl $0x1,-0x4(%rbp)
    1e: 83 7d f8 09             cmpl $0x9,-0x8(%rbp)
    22: 7e f0                   jle  14 <main+0x14>
    24: 8b 45 f8                mov  -0x8(%rbp),%eax
    27: c9                      leaveq
    28: c3                      retq
    Things to know: • Condition Codes (Flags): Zero, Sign, Carry, etc. • Call Stack: Stack frame, stack pointer, base pointer • ABI: Calling conventions
    adapted from Berger & Klöckner (NYU 2010)
  • 54. Machine Language Loops (same listing and notes as slide 53). Want to make those yourself? Write myprogram.c, then: $ cc -c myprogram.c $ objdump --disassemble myprogram.o adapted from Berger & Klöckner (NYU 2010)
  • 57. We know how a computer works! All of this can be built in about 4000 transistors. (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600) So what exactly is Intel doing with the other 581,996,000 transistors? Answer: Make things go faster! Goal now: Understand sources of slowness, and how they get addressed. Remember: High Performance Computing adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 58. The High-Performance Mindset Writing high-performance Codes Mindset: What is going to be the limiting factor? • ALU? • Memory? • Communication? (if multi-machine) Benchmark the assumed limiting factor right away. Evaluate • Know your peak throughputs (roughly) • Are you getting close? • Are you tracking the right limiting factor? adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
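A minimal sketch of “benchmark the assumed limiting factor right away” for a memory-bound loop, assuming the POSIX clock_gettime timer; the 512 MiB array size is an arbitrary choice that dwarfs every cache level:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const size_t n = 64 * 1024 * 1024;      /* 64 Mi doubles = 512 MiB */
        double *a = malloc(n * sizeof *a);
        for (size_t i = 0; i < n; ++i) a[i] = 1.0;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; ++i)          /* streaming read + write */
            a[i] = 2.0 * a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        /* 16 bytes of traffic per element: one 8-byte read, one 8-byte write */
        printf("%.2f GB/s (a[0] = %f)\n", 16.0 * n / sec / 1e9, a[0]);
        free(a);
        return 0;
    }

If the printed figure sits near the machine’s rated memory bandwidth, the loop is memory-bound and ALU-level tuning will not move it.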
  • 59. Architecture • What’s in a (basic) computer? • Basic Subsystems • Machine Language • Memory Hierarchy • Pipelines • CPUs to GPUs
  • 61. Source of Slowness: Memory. Memory is slow. Distinguish two different versions of “slow”: • Bandwidth • Latency → Memory has long latency, but can have large bandwidth. Size of die vs. distance to memory: big! Dynamic RAM: long intrinsic latency! Idea: Put a look-up table of recently-used data onto the chip → “Cache”. adapted from Berger & Klöckner (NYU 2010)
  • 62. The Memory Hierarchy. Hierarchy of increasingly bigger, slower memories (faster at the top, bigger at the bottom): Registers: 1 kB, 1 cycle. L1 Cache: 10 kB, 10 cycles. L2 Cache: 1 MB, 100 cycles. DRAM: 1 GB, 1000 cycles. Virtual Memory (hard drive): 1 TB, 1 M cycles. adapted from Berger & Klöckner (NYU 2010)
  • 63. Impact on Performance (figure from Scott et al. “Scientific Parallel Computing” (2005)): performance of the computer system vs. size of problem being solved; performance steps down each time the entire problem no longer fits within registers, then cache, then main memory, until the problem requires secondary (disk) memory: problem too big for system!
  • 64. The Memory Hierarchy (same table as slide 62). How might data locality factor into this? What is a working set? adapted from Berger & Klöckner (NYU 2010)
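To connect the working-set question to the table above, a small C sketch (the 4096×4096 matrix is an arbitrary example, far larger than the caches) contrasting a cache-friendly sweep with a strided one over the same data; time the two loops separately to see the gap:

    #include <stdio.h>

    #define N 4096

    static double m[N][N];   /* ~128 MiB, zero-initialized */

    int main(void) {
        double sum = 0.0;

        /* row-major sweep: consecutive addresses, one cache line serves 8 doubles */
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += m[i][j];

        /* column-major sweep: 32 KiB stride, a fresh cache line on every access */
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                sum += m[i][j];

        printf("%f\n", sum);
        return 0;
    }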
  • 65. Cache: Actual Implementation Demands on cache implementation: • Fast, small, cheap, low-power • Fine-grained • High “hit”-rate (few “misses”) Problem: Goals at odds with each other: Access matching logic expensive! Solution 1: More data per unit of access matching logic → Larger “Cache Lines” Solution 2: Simpler/less access matching logic → Less than full “Associativity” Other choices: Eviction strategy, size adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 67. Cache: Associativity (diagrams): Direct Mapped vs. 2-way set associative placement of memory blocks into cache slots. Caption: Miss rate versus cache size on the Integer portion of SPEC CPU2000 [Cantin, Hill 2003]. adapted from Berger & Klöckner (NYU 2010)
  • 68. Cache Example: Intel Q6600/Core2 Quad (cpuid output):
    --- L1 data cache ---    fully associative cache = false; threads sharing this cache = 0x0 (0); processor cores on this die = 0x3 (3); system coherency line size = 0x3f (63); ways of associativity = 0x7 (7); number of sets - 1 (s) = 63
    --- L1 instruction ---   fully associative cache = false
    --- L2 unified cache --- fully associative cache = false; threads sharing this cache = 0x1 (1); processor cores on this die = 0x3 (3); system coherency line size = 0x3f (63); ways of associativity = 0xf (15); number of sets - 1 (s) = 4095
    More than you care to know about your CPU: http://www.etallen.com/cpuid.html adapted from Berger & Klöckner (NYU 2010)
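Decoded, that dump describes a 32 KiB, 8-way set-associative L1 data cache with 64-byte lines and 64 sets (64 × 64 × 8 = 32 KiB). A tiny C sketch, using those constants, of how an address selects its set:

    #include <stdio.h>
    #include <stdint.h>

    /* Q6600 L1D geometry from the cpuid dump above */
    #define LINE_BYTES 64
    #define NUM_SETS   64

    static unsigned set_index(uintptr_t addr) {
        return (addr / LINE_BYTES) % NUM_SETS;   /* address bits 6..11 */
    }

    int main(void) {
        int a[2048];
        /* addresses LINE_BYTES * NUM_SETS = 4096 bytes apart share a set */
        printf("set of a[0]    = %u\n", set_index((uintptr_t)&a[0]));
        printf("set of a[1024] = %u\n", set_index((uintptr_t)&a[1024]));
        return 0;
    }

With only 8 ways per set, a stride-4096 access pattern can keep at most 8 of the cache’s 512 lines live, exactly the kind of pathology the measuring code on the next slides exposes.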
  • 69. Measuring the Cache I
    void go(unsigned count, unsigned stride) {
        const unsigned arr_size = 64 * 1024 * 1024;
        int *ary = (int *) malloc(sizeof(int) * arr_size);
        for (unsigned it = 0; it < count; ++it) {
            for (unsigned i = 0; i < arr_size; i += stride)
                ary[i] *= 17;
        }
        free(ary);
    }
    adapted from Berger & Klöckner (NYU 2010)
  • 71. Measuring the Cache II
    void go(unsigned array_size, unsigned steps) {
        int *ary = (int *) malloc(sizeof(int) * array_size);
        unsigned asm1 = array_size - 1;
        for (unsigned i = 0; i < steps; ++i)
            ary[(i * 16) & asm1]++;
        free(ary);
    }
    adapted from Berger & Klöckner (NYU 2010)
  • 73. Measuring the Cache III
    void go(unsigned array_size, unsigned stride, unsigned steps) {
        char *ary = (char *) malloc(sizeof(int) * array_size);
        unsigned p = 0;
        for (unsigned i = 0; i < steps; ++i) {
            ary[p]++;
            p += stride;
            if (p >= array_size)
                p = 0;
        }
        free(ary);
    }
    adapted from Berger & Klöckner (NYU 2010)
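The slides plot timings for these kernels; here is a minimal driver in the same spirit (the sizes, the stride sweep, and the POSIX clock_gettime timer are illustrative choices of mine; the prototype matches Measuring the Cache III above):

    #include <stdio.h>
    #include <time.h>

    /* link against the Measuring-the-Cache-III definition of go() */
    void go(unsigned array_size, unsigned stride, unsigned steps);

    int main(void) {
        for (unsigned stride = 1; stride <= 1024; stride *= 2) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            go(64u * 1024 * 1024, stride, 256u * 1024 * 1024);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double sec = (t1.tv_sec - t0.tv_sec)
                       + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
            printf("stride %4u: %.3f s\n", stride, sec);
        }
        return 0;
    }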
  • 76. http://sequoia.stanford.edu/ Tue 4/5/11: Guest Lecture by Mike Bauer (Stanford)
  • 77. Architecture • What’s in a (basic) computer? • Basic Subsystems • Machine Language • Memory Hierarchy • Pipelines • CPUs to GPUs
  • 78. Source of Slowness: Sequential Operation IF Instruction fetch ID Instruction Decode EX Execution MEM Memory Read/Write WB Result Writeback adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 79. Solution: Pipelining adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 80. Pipelining (MIPS, 110,000 transistors) adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 81. Issues with Pipelines Pipelines generally help performance–but not always. Possible issues: • Stalls • Dependent Instructions • Branches (+Prediction) • Self-Modifying Code “Solution”: Bubbling, extra circuitry adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 83. Intel Q6600 Pipeline New concept: Instruction-level parallelism (“Superscalar”) adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  • 84. Programming for the Pipeline. How to upset a processor pipeline:
    for (int i = 0; i < 1000; ++i)
        for (int j = 0; j < 1000; ++j) {
            if (j % 2 == 0)
                do_something(i, j);
        }
    . . . why is this bad? adapted from Berger & Klöckner (NYU 2010)
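It is bad because the inner loop runs a compare-and-branch a million times even though the outcome is a trivial function of j. A small C sketch of the usual fix, folding the condition into the loop bounds (do_something stands in for the slide’s placeholder):

    #include <stdio.h>

    static long acc = 0;
    static void do_something(int i, int j) { acc += i + j; }  /* stand-in work */

    int main(void) {
        /* same work as the slide's loop, but j now walks the even
           values directly, so the hot loop carries no if at all */
        for (int i = 0; i < 1000; ++i)
            for (int j = 0; j < 1000; j += 2)
                do_something(i, j);
        printf("%ld\n", acc);
        return 0;
    }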
  • 85. A Puzzle
    int steps = 256 * 1024 * 1024;
    int[] a = new int[2];
    // Loop 1
    for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }
    // Loop 2
    for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }
    Which is faster? . . . and why? adapted from Berger & Klöckner (NYU 2010)
  • 86. Two useful Strategies
    Loop unrolling:
        for (int i = 0; i < 1000; ++i) do_something(i);
    becomes
        for (int i = 0; i < 500; i += 2) { do_something(i); do_something(i + 1); }
    Software pipelining:
        for (int i = 0; i < 1000; ++i) { do_a(i); do_b(i); }
    becomes
        for (int i = 0; i < 500; i += 2) { do_a(i); do_a(i + 1); do_b(i); do_b(i + 1); }
    adapted from Berger & Klöckner (NYU 2010)
  • 87. SIMD. Control Units are large and expensive. Functional Units are simple and cheap. → Increase the Function/Control ratio: control several functional units with one control unit; all execute the same operation (one SIMD Instruction Pool feeding a Data Pool). GCC vector extensions:
    typedef int v4si __attribute__((vector_size(16)));
    v4si a, b, c;
    c = a + b;  // +, -, *, /, unary minus, ^, |, &, ~, %
    Will revisit for OpenCL, GPUs. adapted from Berger & Klöckner (NYU 2010)
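A self-contained example of those vector extensions in use (GCC or Clang; the values are arbitrary):

    #include <stdio.h>

    typedef int v4si __attribute__((vector_size(16)));  /* 4 ints per vector */

    int main(void) {
        v4si a = {1, 2, 3, 4};
        v4si b = {10, 20, 30, 40};
        v4si c = a + b;              /* 4 element-wise additions at once */
        for (int i = 0; i < 4; ++i)
            printf("%d ", c[i]);     /* GCC allows array-style element access */
        printf("\n");                /* prints: 11 22 33 44 */
        return 0;
    }

On x86-64 the single c = a + b typically compiles to one SSE add instruction: the Function/Control ratio increase from the slide, in miniature.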
  • 88. Architecture • What’s in a (basic) computer? • Basic Subsystems • Machine Language • Memory Hierarchy • Pipelines • CPUs to GPUs
  • 89. GPUs ? ! Designed for math-intensive parallel problems ! More transistors dedicated to ALU than flow control and data cache
  • 90. Intro PyOpenCL What and Why? OpenCL. CPU-“style” cores: Fetch/Decode, ALU (Execute), Execution Context, plus out-of-order control logic, fancy branch predictor, memory pre-fetcher, and a data cache (a big one). SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ Credit: Kayvon Fatahalian (Stanford)
  • 91. Intro PyOpenCL What and Why? OpenCL. Slimming down. Idea #1: Remove components that help a single instruction stream run fast (keep Fetch/Decode, ALU (Execute), Execution Context). SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 92. Intro PyOpenCL What and Why? OpenCL. Two cores (two fragments in parallel): fragment 1 and fragment 2 each get a Fetch/Decode, ALU (Execute), and Execution Context (the figure shows the same shader assembly running on both). SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 93. Intro PyOpenCL What and Why? OpenCL. Four cores (four fragments in parallel): four Fetch/Decode + ALU (Execute) + Execution Context blocks. SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 95. Intro PyOpenCL What and Why? OpenCL. Sixteen cores (sixteen fragments in parallel): 16 cores = 16 simultaneous instruction streams → 16 independent instruction streams. Reality: the instruction streams are not actually very different/independent. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 97. Recall: simple processing core (Fetch/Decode, ALU (Execute), Execution Context). Intro PyOpenCL What and Why? OpenCL. Saving Yet More Space. Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs → SIMD. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 99. Add ALUs. Intro PyOpenCL What and Why? OpenCL. Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs: one Fetch/Decode drives ALU 1–8, each with its own Ctx, plus shared Ctx data → SIMD processing. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 101. http://www.youtube.com/watch?v=1yH_j8-VVLo Intro PyOpenCL What and Why? OpenCL. Gratuitous Amounts of Parallelism! 128 fragments in parallel. Example: 128 instruction streams in parallel = 16 independent groups of 8 synchronized streams; 16 cores = 128 ALUs = 16 simultaneous instruction streams. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 103. Intro PyOpenCL What and Why? OpenCL. Remaining Problem: Slow Memory. Memory still has very high latency... but we’ve removed most of the hardware that helps us deal with that: caches, branch prediction, out-of-order execution. So what now? Idea #3: Even more parallelism + some extra memory = a solution! slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 104–105. Intro PyOpenCL What and Why? OpenCL. Remaining Problem: Slow Memory (figures): the core’s context storage is split into several groups (shown as 1 2 3 4); while one SIMD group waits on memory, another executes. slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 106. Intro PyOpenCL What and Why? OpenCL. GPU Architecture Summary. Core Ideas: 1. Many slimmed-down cores → lots of parallelism. 2. More ALUs, fewer Control Units. 3. Avoid memory stalls by interleaving execution of SIMD groups (“warps”). Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 107. Is it free? ! What are the consequences? ! Program must be more predictable: data access coherency, program flow. slide by Matthew Bolitho
  • 108. Some terminology (diagrams): “shared memory” (processors P all reaching memories M through one interconnection network) vs. “distributed memory” (each processor P paired with its own memory M, exchanging data over an interconnection network); the distributed-memory approach increasingly common; now: mostly hybrid
  • 109. Some More Terminology. One way to classify machines distinguishes between: shared memory: global memory can be accessed by all processors or cores; information exchanged between threads using shared variables written by one thread and read by another; need to coordinate access to shared variables. distributed memory: private memory for each processor, only accessible by this processor, so no synchronization for memory accesses needed; information exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.
  • 110. Programming Model (Overview)
  • 112. Intro PyOpenCL What and Why? OpenCL. Connection: Hardware ↔ Programming Model (figure: a 3 × 3 grid of cores, each with Fetch/Decode, 32 kiB Ctx Private (“Registers”), and 16 kiB Ctx Shared). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 113. Intro PyOpenCL What and Why? OpenCL. Connection: Hardware ↔ Programming Model. How many cores are shown? Idea: Program as if there were “infinitely” many cores; program as if there were “infinitely” many ALUs per core. slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 114. Intro PyOpenCL What and Why? OpenCL. Connection: Hardware ↔ Programming Model. Consider: Which is easy to do automatically? Parallel program → sequential hardware, or sequential program → parallel hardware? slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 116. Intro PyOpenCL What and Why? OpenCL. Connection: Hardware ↔ Programming Model. Software representation ↔ Hardware: the Grid (Kernel: Function on Grid) spans Axis 0 × Axis 1 of (Work) Groups; a (Work) Group or “Block” maps onto one core with its shared context; a (Work) Item or “Thread” maps onto a per-item private context. slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
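A plain-C sketch of that mapping (the names grid_dim, block_dim, and fake_kernel are mine, chosen to mirror the Grid / Block / Thread vocabulary; this only emulates the indexing, sequentially):

    #include <stdio.h>

    /* emulate a 1D grid: every (block, thread) pair is one work item */
    static void fake_kernel(int block_idx, int block_dim, int thread_idx) {
        int global_id = block_idx * block_dim + thread_idx;  /* the usual GPU idiom */
        printf("block %d, thread %d -> item %d\n", block_idx, thread_idx, global_id);
    }

    int main(void) {
        const int grid_dim = 3, block_dim = 4;   /* 3 blocks of 4 threads = 12 items */
        /* on the hardware above these two loops run in parallel
           across cores (blocks) and ALU lanes (threads) */
        for (int b = 0; b < grid_dim; ++b)
            for (int t = 0; t < block_dim; ++t)
                fake_kernel(b, block_dim, t);
        return 0;
    }

On a real GPU the two loops disappear: the runtime launches every (block, thread) pair, and each work item recovers its global index from built-in counters exactly as fake_kernel does here.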