Analysis-Driven Optimization
         Presented by Cliff Woolley for Harvard CS264

© NVIDIA 2010
Performance Optimization Process
                 Use appropriate performance metric for each kernel
                     For example, Gflop/s is not a meaningful metric for a bandwidth-bound kernel
                 Determine what limits kernel performance
                     Memory throughput
                     Instruction throughput
                     Latency
                     Combination of the above
                 Address the limiters in the order of importance
                     Determine how close resource usage is to the HW limits
                     Analyze for possible inefficiencies
                     Apply optimizations
                        Often these follow directly from how the HW operates

Presentation Outline
                 Identifying performance limiters
                 Analyzing and optimizing:
                     Memory-bound kernels
                     Instruction (math) bound kernels
                     Kernels with poor latency hiding

                 For each:
                     Brief background
                     How to analyze
                     How to judge whether particular issue is problematic
                     How to optimize

                 Note: Most information is for Fermi GPUs
New in CUDA Toolkit 4.0: Automated Performance Analysis using Visual Profiler
                Summary analysis & hints
                     Per-Session
                     Per-Device
                     Per-Context
                     Per-Kernel


                New UI for kernel analysis
                     Identify the limiting factor
                     Analyze instruction throughput
                     Analyze memory throughput
                     Analyze kernel occupancy


Notes on profiler
                 Most counters are reported per Streaming Multiprocessor (SM)
                     Not entire GPU
                     Exceptions: L2 and DRAM counters
                 A single run can collect only a few counters
                     Multiple runs are needed when profiling more counters
                        Done automatically by the Visual Profiler
                        Must be done manually with the command-line profiler
                 Counter values may not be exactly the same for repeated runs
                     Threadblocks and warps are scheduled at run-‐time

                 See the profiler documentation for more information


Identifying Performance Limiters


Limited by Bandwidth or Arithmetic?
                 Perfect instructions:bytes ratio for Fermi C2050:
                     ~4.5 : 1 with ECC on
                     ~3.6 : 1 with ECC off
                     These assume fp32 instructions, throughput for other instructions varies
                 Algorithmic analysis:
                     Rough estimate of the arithmetic-to-bytes ratio (a worked sketch follows below)
                 Code likely uses more instructions and bytes than algorithm analysis
                 suggests:
                     Instructions for loop control, pointer math, etc.
                     Address pattern may result in more memory fetches
                     Two ways to investigate:
                        Use the profiler (quick, but approximate)
                        Use source code modification (more accurate, but more work)
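
                 A worked sketch of such an algorithmic estimate (not from the original
                 deck; a hypothetical SAXPY-style kernel is assumed):

                     // Hypothetical example: estimating the instructions:bytes ratio of SAXPY.
                     __global__ void saxpy(int n, float a, const float *x, float *y)
                     {
                         int i = blockIdx.x * blockDim.x + threadIdx.x;
                         if (i < n)
                             y[i] = a * x[i] + y[i];   // ~1 fp32 FMA per element
                     }
                     // Per element: ~1 FMA (plus indexing/branch overhead) vs. 12 bytes
                     // moved (8 read + 4 written), i.e., roughly 1:12 -- far below the
                     // ~4.5:1 balance point of a C2050, so this kernel is bandwidth-bound.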
A Note on Counting Global Memory Accesses
                 Load/store instruction count can be lower than the number of actual
                 memory transactions
                     Address pattern, different word sizes

                 Counting requests from L1 to the rest of the memory system makes the
                 most sense
                     Caching loads: count L1 misses
                     Non-caching loads and stores: count L2 read requests
                         Note that L2 counters are for the entire chip, L1 counters are per SM



                     For fully coalesced accesses, per warp (32 threads):
                        One 32-bit access instruction   -> one 128-byte transaction
                        One 64-bit access instruction   -> two 128-byte transactions
                        One 128-bit access instruction  -> four 128-byte transactions
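
                     A minimal sketch of that word-size effect (hypothetical copy
                     kernels, fully coalesced):

                        // 32-bit words: a warp reads 32 consecutive floats = 128 bytes
                        // -> one transaction per warp-level load.
                        __global__ void copy32(const float *in, float *out)
                        {
                            int i = blockIdx.x * blockDim.x + threadIdx.x;
                            out[i] = in[i];
                        }

                        // 128-bit words: a warp reads 32 consecutive float4s = 512 bytes
                        // -> four 128-byte transactions per warp-level load.
                        __global__ void copy128(const float4 *in, float4 *out)
                        {
                            int i = blockIdx.x * blockDim.x + threadIdx.x;
                            out[i] = in[i];
                        }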

Analysis with Profiler
                 Profiler counters:
                     instructions_issued, instructions_executed
                          Both incremented by 1 per warp


                     gld_request, gst_request
                          Incremented by 1 per warp for each load/store instruction


                     l1_global_load_miss, l1_global_load_hit, global_store_transaction
                          Incremented by 1 per L1 line (line is 128B)
                     uncached_global_load_transaction
                          Incremented by 1 per group of 1, 2, 3, or 4 transactions
                          Better to look at L2_read_request counter (incremented by 1 per 32B transaction; per GPU, not per SM)
                 Compare (this gives the achieved instructions:bytes ratio):
                       32  * instructions_issued                 /* 32 = warp size */
                      128B * (global_store_transaction + l1_global_load_miss)
Analysis with Modified Source Code
                 Time memory-only and math-only versions of the kernel
                     Easier for codes without data-dependent control-flow or addressing
                     Gives you good estimates for:
                        Time spent accessing memory
                        Time spent in executing instructions

                 Compare the times taken by the modified kernels
                     Helps decide whether the kernel is mem or math bound

                 Compare the sum of mem-only and math-only times to the full-kernel time
                     Shows how well memory operations are overlapped with arithmetic
                     Can reveal latency bottleneck
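
                 A minimal host-side timing sketch for this comparison (CUDA events are
                 the real API; the kernel variants launched by the callback are
                 hypothetical names):

                     // Returns the elapsed milliseconds of whatever the callback launches.
                     float time_ms(void (*launch)())
                     {
                         cudaEvent_t start, stop;
                         cudaEventCreate(&start);
                         cudaEventCreate(&stop);
                         cudaEventRecord(start);
                         launch();                        // e.g., launches kernel_mem_only
                         cudaEventRecord(stop);
                         cudaEventSynchronize(stop);
                         float ms = 0.0f;
                         cudaEventElapsedTime(&ms, start, stop);
                         cudaEventDestroy(start);
                         cudaEventDestroy(stop);
                         return ms;
                     }
                     // If t_full >> max(t_mem, t_math), overlap is poor: suspect latency.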



Some Example Scenarios
         (Conceptual bar charts comparing mem-only, math-only, and full-kernel times.)

         Memory-bound:  good mem-math overlap; latency likely not a problem
                        (assuming memory throughput is not low compared to HW theory)
         Math-bound:    good mem-math overlap; latency likely not a problem
                        (assuming instruction throughput is not low compared to HW theory)
         Balanced:      good mem-math overlap; latency likely not a problem
                        (assuming memory/instruction throughput is not low compared to HW theory)
         Memory and latency bound: poor mem-math overlap; latency is a problem
Source Modification
                 Memory-only:
                     Remove as much arithmetic as possible (a sketch follows below)
                         Without changing the access pattern
                         Use the profiler to verify that the load/store instruction count is the same
                 Store-only:
                     Also remove the loads (to compare read time vs. write time)
                 Math-only:
                     Remove the global memory accesses
                     Need to trick the compiler:
                         The compiler throws away all code that it detects as not contributing to stores
                         Put stores inside conditionals that always evaluate to false
                                The condition should depend on the value about to be stored (prevents other optimizations)
                                The condition outcome should not be known to the compiler
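
                 A hedged sketch of the memory-only variant (a hypothetical 1D stand-in
                 for a real kernel; the point is that every load and store survives
                 while the arithmetic is gone):

                     // Memory-only: same loads/stores and access pattern, arithmetic removed.
                     __global__ void kernel_mem_only(const float *g_in, float *g_out, int n)
                     {
                         int i = blockIdx.x * blockDim.x + threadIdx.x;
                         if (i < n)
                             g_out[i] = g_in[i];   // pass the loaded value straight through
                     }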

Source Modification for Math-only

                     __global__ void fwd_3D( ..., int flag )
                     {
                          ...
                          value = temp + coeff * vsq;
                          if( 1 == value * flag )
                              g_output[out_idx] = value;
                     }

                     Note: if the condition tested only the flag, the compiler could move
                     the computation inside the conditional as well.
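
                     For completeness, a hedged usage sketch: pass flag = 0 at run time,
                     from a source the compiler cannot see through, so the stores never
                     execute but the arithmetic cannot be eliminated:

                     // Host side: flag comes from argv, so its value is unknown at
                     // compile time. With flag == 0, value * flag != 1, so the store
                     // never happens, yet 'value' must still be computed.
                     fwd_3D<<< grid, block >>>( ..., /*flag=*/atoi(argv[1]) );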




Source Modification and Occupancy

                Removing pieces of code is likely to affect register count
                   This could increase occupancy, skewing the results
                   See "Concurrent Accesses and Performance" below for how occupancy can affect throughput
                Make sure to keep the same occupancy
                   Check the occupancy with the profiler before modifications
                   After modifications, if necessary add dynamic shared memory to match
                   the original occupancy (a sketch follows below):

                    kernel<<< grid, block, smem, ...>>>(...)
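
                   A hedged sketch of that occupancy-matching trick (the padding size is
                   hypothetical; tune it until the profiler reports the original occupancy):

                    // Reserve dynamic smem purely to limit blocks per SM; the bytes are
                    // never touched, they only consume SM resources.
                    size_t smem_pad = 8 * 1024;   // hypothetical value, tuned empirically
                    kernel<<< grid, block, smem_pad >>>(...)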
Case Study: Limiter Analysis
                 3DFD of the wave equation, fp32

                 Time (ms):
                     Full-kernel:  35.39
                     Mem-only:     33.27
                     Math-only:    16.25
                 Instructions issued:
                     Full-kernel:  18,194,139
                     Mem-only:      7,497,296
                     Math-only:    16,839,792
                 Memory access transactions:
                     Full-kernel:   1,708,032
                     Mem-only:      1,708,032
                     Math-only:             0

                 Analysis:
                     Instr:byte ratio = ~2.66
                         32 * 18,194,139 / ( 128 * 1,708,032 )
                     Good overlap between math and mem:
                         2.12 ms of math-only time (13%) are not overlapped with mem
                     App memory throughput: 62 GB/s
                         HW theory is 114 GB/s

                 Conclusion:
                     Code is memory-bound
                     Latency could be an issue too
                     Optimizations should focus on memory throughput first
                         Math contributes very little to total time (2.12 out of 35.39 ms)
Summary: Limiter Analysis
                 Rough algorithmic analysis:
                    How many bytes needed, how many instructions
                 Profiler analysis:
                    Instruction count, memory request/transaction count
                 Analysis with source modification:
                    Memory-only version of the kernel
                    Math-only version of the kernel
                    Examine how these times relate and overlap


Optimizations for Global Memory


Memory Throughput Analysis
                 Two points of view on throughput:
                     From the app's point of view: count bytes requested by the application
                     From the HW's point of view: count bytes moved by the hardware
                     The two can differ
                        Scattered/misaligned pattern: not all transaction bytes are utilized
                        Broadcast: the same small transaction serves many requests

                 Two aspects to analyze for performance impact:
                    Addressing pattern
                    Number of concurrent accesses in flight




Memory Throughput Analysis
                 Determining that access pattern is problematic:
                    App throughput is much smaller than HW throughput
                       Use profiler to get HW throughput
                    Profiler counters: access instruction count is significantly smaller than
                    transaction count
                       gld_request < ( l1_global_load_miss + l1_global_load_hit ) * ( word_size / 4B )
                       gst_request < 4 * l2_write_requests * ( word_size / 4B )
                       Make sure to adjust the transaction counters for word size
                       (see "A Note on Counting Global Memory Accesses" above)

                 Determining that the number of concurrent accesses is insufficient:
                    Throughput from HW point of view is much lower than theoretical



Concurrent Accesses and Performance
                 Increment a 64M element array
                     Two accesses per thread (load then store, but they are dependent)
                         Thus, each warp (32 threads) has one outstanding transaction at a time
                 Tesla C2050, ECC on, max achievable bandwidth: ~114 GB/s



                 (Chart of achieved bandwidth vs. concurrent accesses omitted from this transcript.)

                 Several independent smaller accesses have the same effect as one larger
                 one. For example: four 32-bit ~= one 128-bit
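
                 A hedged sketch of that equivalence (hypothetical kernels): four
                 independent 32-bit accesses per thread keep as many bytes in flight as
                 one 128-bit access:

                     // One 128-bit load/store per thread: 1 instruction, 4 transactions/warp.
                     __global__ void inc128(float4 *a)
                     {
                         int i = blockIdx.x * blockDim.x + threadIdx.x;
                         float4 v = a[i];
                         v.x += 1.0f; v.y += 1.0f; v.z += 1.0f; v.w += 1.0f;
                         a[i] = v;
                     }

                     // Four independent 32-bit accesses per thread: 4 instructions, but
                     // the loads do not depend on each other, so all four are in flight at once.
                     __global__ void inc32x4(float *a, int n)
                     {
                         int i = blockIdx.x * blockDim.x + threadIdx.x;
                         a[i]         += 1.0f;
                         a[i + n]     += 1.0f;
                         a[i + 2 * n] += 1.0f;
                         a[i + 3 * n] += 1.0f;
                     }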




Optimization: Address Pattern
                 Coalesce the address pattern
                    128-byte lines for caching loads
                    32-byte segments for non-caching loads and stores
                    Coalesce to maximize utilization of bus transactions
                    Refer to the CUDA C Programming Guide / Best Practices Guide
                 Try using non-caching loads
                    Smaller transactions (32B instead of 128B)
                       More efficient for scattered or partially-filled patterns
                 Try fetching data from texture
                    Smaller transactions and different caching
                    Cache not polluted by other gmem loads
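
                 A minimal sketch of what a coalesced pattern looks like (hypothetical
                 kernel):

                     // Coalesced: thread k of a warp reads word k of one 128-byte line.
                     __global__ void coalesced_copy(const float *in, float *out)
                     {
                         int i = blockIdx.x * blockDim.x + threadIdx.x;
                         out[i] = in[i];
                     }
                     // A strided read such as in[i * 32] would instead touch 32 different
                     // lines per warp, wasting most of each transaction. Non-caching loads
                     // can be selected at compile time with: nvcc -Xptxas -dlcm=cg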
Optimizing Access Concurrency
                 Have enough concurrent accesses to saturate the bus
                    Need (mem latency) x (bandwidth) bytes in flight (Little's law)
                    Fermi C2050 global memory:
                       400-800 cycle latency, 1.15 GHz clock, 144 GB/s bandwidth, 14 SMs
                       Need 30-50 128-byte transactions in flight per SM (a worked estimate follows below)

                 Ways to increase concurrent accesses:
                    Increase occupancy
                       Adjust threadblock dimensions
                           To maximize occupancy at given register and smem requirements
                       Reduce register count (-maxrregcount option, or __launch_bounds__)
                    Modify code to process several elements per thread
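
                 A worked estimate of the transaction count above (assuming a mid-range
                 ~600-cycle latency within the 400-800 range):

                    600 cycles / 1.15 GHz          ~= 520 ns of latency to cover
                    520 ns * 144 GB/s              ~= 75 KB that must be in flight
                    75 KB / 128 B per transaction  ~= 590 transactions in flight
                    590 transactions / 14 SMs      ~= 42 per SM, within the 30-50 range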

Case Study: Access Pattern 1
                 Same 3DFD code as in the previous study
                 Using caching loads (compiler default):
                     Memory throughput: 62 / 74 GB/s for app / hw
                     Different enough to be interesting
                 Loads are coalesced:
                     gld_request == ( l1_global_load_miss + l1_global_load_hit )
                 There are halo loads that use only 4 threads out of 32
                     For these transactions only 16 bytes out of 128 are useful
                 Solution: try non-caching loads ( -Xptxas -dlcm=cg compiler option)
                     Performance increase of 7%
                        Not bad for just trying a compiler flag, with no code change
                     Memory throughput: 66 / 67 GB/s for app / hw

Case Study: Accesses in Flight
                 Continuing with the FD code
                     Throughput from both app and hw point of view is 66-‐67 GB/s
                     Now 30.84 out of 33.71 ms are due to mem
                     1024 concurrent threads per SM
                        Due to register count (24 per thread)
                        Simple copy kernel reaches ~80% of achievable mem throughput at this thread count
                 Solution: increase accesses per thread
                     Modified the code so that each thread is responsible for 2 output
                     points (a sketch follows below)
                        Doubles the load and store count per thread, saves some indexing math
                        Doubles the tile size -> reduces bandwidth spent on halos
                     Further 25% increase in performance
                        App and HW throughputs are now 82 and 84 GB/s, respectively
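
                 A hedged sketch of the transformation (a hypothetical 1D analogue of
                 the FD kernel):

                     // Before: one output point per thread.
                     __global__ void stencil_1x(const float *in, float *out, int n)
                     {
                         int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
                         if (i < n - 1)
                             out[i] = 0.5f * (in[i - 1] + in[i + 1]);
                     }

                     // After: two output points per thread; the index math is shared and
                     // the two loads/stores are independent, so more bytes are in flight.
                     __global__ void stencil_2x(const float *in, float *out, int n)
                     {
                         int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + 1;
                         if (i + 1 < n - 1) {
                             out[i]     = 0.5f * (in[i - 1] + in[i + 1]);
                             out[i + 1] = 0.5f * (in[i]     + in[i + 2]);
                         }
                     }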


Case Study: Access Pattern 2
                 Kernel from climate simulation code
                     Mostly fp64 (so, at least 2 transactions per mem access)
                 Profiler results:
                     gld_request:                           72,704
                     l1_global_load_hit:                   439,072
                     l1_global_load_miss:                  724,192
                 Analysis:
                     L1 hit rate: 37.7%
                     16 transactions per load instruction
                         Indicates a bad access pattern (2 are expected, due to 64-bit words)
                         Of the 16, 10 miss in L1 and contribute to mem bus traffic
                         So we fetch 5x more bytes than the app needs


Case Study: Access Pattern 2
                 Looking closer at the access pattern:
                    Each thread linearly traverses a contiguous memory region
                    Expecting CPU-like L1 caching
                       But with thousands of resident threads there are only a few bytes of
                       cache per thread, so lines are evicted before they can be reused
                    One of the worst access patterns for GPUs
                 Solution:
                    Transpose the indexing so that each warp accesses a contiguous
                    memory region (a sketch follows below)
                    2.17 transactions per load instruction after the change
                    This and some other changes improved performance by 3x
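
                 A hedged before/after sketch of such a transposition (a hypothetical
                 reduction over per-thread data of length len):

                     // Before: each thread walks its own contiguous region, so a warp's
                     // 32 threads touch 32 different 128-byte lines every iteration.
                     __global__ void per_thread_rows(const double *data, double *out, int len)
                     {
                         int t = blockIdx.x * blockDim.x + threadIdx.x;
                         double acc = 0.0;
                         for (int j = 0; j < len; ++j)
                             acc += data[t * len + j];      // strided across the warp
                         out[t] = acc;
                     }

                     // After: consecutive threads read consecutive words, so each warp
                     // iteration maps to a couple of transactions instead of 32.
                     __global__ void per_warp_cols(const double *data, double *out,
                                                   int len, int nthreads)
                     {
                         int t = blockIdx.x * blockDim.x + threadIdx.x;
                         double acc = 0.0;
                         for (int j = 0; j < len; ++j)
                             acc += data[j * nthreads + t]; // contiguous across the warp
                         out[t] = acc;
                     }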

Optimizing with Compression
                 When all else has been optimized and kernel is limited by the number of
                 bytes needed, consider compression
                 Approaches:
                     Int: conversion between 8-, 16-, and 32-bit integers is 1 instruction (64-bit requires a
                     couple)
                     FP: conversion between fp16, fp32, and fp64 is one instruction
                         fp16 (1s5e10m) is storage only; there are no fp16 math instructions
                     Range-based:
                         Lower and upper limits are kernel arguments
                         Data is an index for interpolation between the limits

                 Application in practice:
                     Clark et al., "Solving Lattice QCD systems of equations using mixed
                     precision solvers on GPUs"
                     http://arxiv.org/abs/0911.3191
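
                 A minimal sketch of the fp16 storage approach (the __float2half_rn and
                 __half2float device intrinsics are real; on Fermi-era CUDA the 16-bit
                 value is carried in an unsigned short):

                     // Store results as fp16 to halve bytes moved; compute stays in fp32.
                     __global__ void store_fp16(const float *in, unsigned short *out, int n)
                     {
                         int i = blockIdx.x * blockDim.x + threadIdx.x;
                         if (i < n)
                             out[i] = __float2half_rn(in[i]);  // fp32 -> fp16, one instruction
                     }

                     __global__ void load_fp16(const unsigned short *in, float *out, int n)
                     {
                         int i = blockIdx.x * blockDim.x + threadIdx.x;
                         if (i < n)
                             out[i] = __half2float(in[i]);     // fp16 -> fp32, one instruction
                     }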
Summary: Memory Analysis and Optimization
                 Analyze:
                     Access pattern:
                        Compare counts of access instructions and transactions
                        Compare throughput from app and hw point of view
                     Number of accesses in flight
                        Look at occupancy and independent accesses per thread
                        Compare achieved throughput to theoretical throughput
                             Also to simple memcpy throughput at the same occupancy

                 Optimizations:
                     Coalesce address patterns per warp (nothing new here), consider texture
                     Process more words per thread (if there are insufficient accesses in flight to saturate the bus)
                     Try the 4 combinations of L1 size and load type (caching and non-caching)
                     Consider compression
Optimizations for Instruction Throughput



Possible Limiting Factors
                 Raw instruction throughput
                      Know the kernel instruction mix
                      fp32, fp64, int, mem, transcendentals, etc. have different throughputs
                          Refer to the CUDA C Programming Guide
                      Can examine the assembly, if needed:
                          Look at SASS (machine assembly) by disassembling the .cubin with cuobjdump

                 Instruction serialization
                      Occurs when threads in a warp issue the same instruction in sequence
                          As opposed to the entire warp issuing the instruction at once


                      Some causes:
                          Shared memory bank conflicts
                          Constant memory bank conflicts

Instruction Throughput: Analysis
                 Profiler counters (both incremented by 1 per warp):
                    instructions_executed: counts instructions encountered during execution
                    instructions_issued:   also includes additional issues due to serialization
                    Difference between the two: issues that happened due to serialization,
                    instruction cache misses, etc.
                 Compare achieved throughput to HW capabilities
                    Peak instruction throughput is documented in the Programming Guide
                    Profiler also reports throughput:
                       GT200: as a fraction of theoretical peak for fp32 instructions
                       Fermi: as IPC (instructions per clock)


Instruction Throughput: Optimization
                 Use intrinsics where possible ( __sinf(), __sincosf(), __expf(), etc.)
                     Available for a number of math.h functions
                     Less accuracy, much higher throughput
                          Refer to the CUDA C Programming Guide for details
                     Often a single instruction, whereas the non-intrinsic is a SW sequence
                 Additional compiler flags that also help (select GT200-level precision):
                     -ftz=true           : flush denormals to 0
                     -prec-div=false     : faster fp division instruction sequence (some precision loss)
                     -prec-sqrt=false    : faster fp sqrt instruction sequence (some precision loss)
                 Make sure you do fp64 arithmetic only where you mean it:
                     fp64 throughput is lower than fp32
                     fp literals without an 'f' suffix (e.g., 1.0) are fp64 and force fp64 arithmetic
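
                 A hedged before/after sketch combining both points (hypothetical
                 kernels):

                     // Slow: sinf() expands to a software sequence, and the fp64
                     // literal 0.5 silently promotes the multiply to double precision.
                     __global__ void slow(float *x) { x[0] = sinf(x[0]) * 0.5;  }

                     // Fast: __sinf() maps to the SFU hardware, and 0.5f keeps
                     // everything in fp32.
                     __global__ void fast(float *x) { x[0] = __sinf(x[0]) * 0.5f; }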

Serialization: Profiler Analysis
                 Serialization is significant if
                     instructions_issued is significantly higher than instructions_executed
                 Warp divergence
                     Profiler counters: divergent_branch, branch
                     Compare the two to see what percentage diverges
                         However, this only counts the branches, not the rest of serialized instructions
                 SMEM bank conflicts
                     Profiler counters:
                         l1_shared_bank_conflict: incremented by 1 per warp for each replay
                              Double counts for 64-bit accesses
                         shared_load, shared_store: incremented by 1 per warp per smem LD/ST instruction
                     Bank conflicts are significant if both are true:
                         Instruction throughput affects performance
                         l1_shared_bank_conflict is significant compared to instructions_issued and to shared_load + shared_store
Serialization: Analysis with Modified Code
                 Modify kernel code to assess performance improvement
                 if serialization were removed
                   Helps decide whether optimizations are worth pursuing
                 Shared memory bank conflicts:
                   Change indexing to be either broadcasts or just threadIdx.x (a sketch follows below)
                   Should also declare the smem variables as volatile

                 Warp divergence:
                   Change the condition so that all threads take the same path
                   Time both paths to see what each costs
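
                 A hedged sketch of the bank-conflict experiment (hypothetical kernel;
                 'idx' stands for whatever indexing the real code uses):

                   __global__ void probe(float *out, int idx)
                   {
                       __shared__ volatile float s[1024];  // volatile: stop the compiler
                                                           // from caching smem in registers
                       s[threadIdx.x] = (float)threadIdx.x;
                       __syncthreads();
                       // Original, possibly conflicting access:  float v = s[idx];
                       float v = s[threadIdx.x];            // conflict-free variant
                       // float v = s[0];                   // broadcast variant
                       out[threadIdx.x] = v;
                   }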
Serialization: Optimization

                 Shared memory bank conflicts:
                   Pad SMEM arrays (a sketch follows below)
                      See the CUDA Best Practices Guide and the Transpose SDK whitepaper
                   Rearrange data in SMEM
                 Warp serialization:
                   Try grouping threads that take the same path
                      Rearrange the data, pre-process the data
                      Rearrange how threads index data (may affect memory perf)
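
                 A minimal padding sketch (assuming Fermi's 32 four-byte banks): with a
                 32x32 tile, a warp reading a column hits one bank 32 times; one extra
                 column shifts each row to a different bank:

                   // tile[threadIdx.y][threadIdx.x] row accesses are conflict-free either
                   // way; the +1 column makes column accesses conflict-free too.
                   __shared__ float tile[32][33];   // 33, not 32: one float of padding per row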


Case Study: SMEM Bank Conflicts
                 A different climate simulation code kernel, fp64
                 Profiler values:
                     Instructions:
                         Executed / issued:          2,406,426 / 2,756,140
                         Difference:                   349,714 (12.7% of issued)
                     GMEM:
                         Total load and store transactions: 170,263
                         Instr:byte ratio: 4
                                Suggests that instructions are a significant limiter (especially since there is a lot of fp64 math)
                     SMEM:
                         Load / store:              421,785 / 95,172
                         Bank conflicts:            674,856      (really 337,428, because of double-counting for fp64)
                               That makes 854,385 SMEM access issues in total (421,785 + 95,172 + 337,428), i.e., 39% replays

                 Solution:
                     Pad the shared memory array: performance increased by 15%
                         Replayed instructions reduced down to 1%

Instruction Throughput: Summary

                 Analyze:
                    Check achieved instruction throughput
                    Compare to HW peak (but must take instruction mix into
                    consideration)
                    Check percentage of instructions due to serialization
                 Optimizations:
                    Intrinsics, compiler options for expensive operations
                    Group threads that are likely to follow same execution path
                    Avoid SMEM bank conflicts (pad, rearrange data)

Optimizations for Latency



Latency: Analysis
                 Suspect latency if:
                     Neither memory nor instruction throughput rates are close to HW theoretical
                     rates
                     Poor overlap between mem and math
                         Full-kernel time is significantly larger than max{mem-only time, math-only time}

                 Two possible causes:
                     Insufficient concurrent threads per multiprocessor to hide latency
                         Occupancy too low
                         Too few threads in kernel launch to load the GPU


                     Too few concurrent threadblocks per SM when using __syncthreads()
                         __syncthreads()   can prevent overlap between math and mem within the same threadblock


Simplified View of Latency and Syncs

                 Kernel where most math cannot be executed until all data is loaded by
                 the threadblock.

                 [Timeline figure, bars not reproduced. It compares:]
                     Memory-only time
                     Math-only time
                     Full-kernel time, one large threadblock per SM
                     Full-kernel time, two threadblocks per SM
                      (each half the size of one large one)

                 With two smaller threadblocks per SM, one block's math can overlap the
                 other block's loads, so the full-kernel time shrinks.
Latency: Optimization
                 Insufficient threads or workload:
                     Increase the level of parallelism (more threads)
                     If occupancy is already high but latency is not being hidden:
                         Process several output elements per thread; this gives more independent
                         memory and arithmetic instructions (which get pipelined)
                 Barriers:
                     Can assess the impact on perf by commenting out __syncthreads()
                         Incorrect result, but gives an upper bound on the improvement
                     Try running several smaller threadblocks
                         In some cases that costs extra bandwidth due to halos

                 Check out Vasily Volkov's talk 2238 from GTC 2010 ("Better Performance
                 at Lower Occupancy") for a detailed treatment


Register Spilling



Register Spilling

                     Registers spill to local memory (lmem) when the per-thread limit is exceeded
                         Fermi HW limit is 63 registers per thread
                     Spills are also possible when the register limit is programmer-specified
                         Common when trying to achieve a certain occupancy with the -maxrregcount
                         compiler flag or __launch_bounds__ in source
                     lmem is like gmem, except that writes are cached in L1
                         lmem load hit in L1 -> no bus traffic
                         lmem load miss in L1 -> bus traffic (128 bytes per miss)
                     The compiler flag -Xptxas -v gives the register and lmem usage per thread

                 Potential impact on performance
                     Additional bandwidth pressure if evicted from L1
                     Additional instructions
                     Not always a problem, easy to investigate with quick profiler analysis
Register Spilling: Analysis
                 Profiler counters: l1_local_load_hit, l1_local_load_miss
                 Impact on instruction count:
                    Compare to total instructions issued
                 Impact on memory throughput:
                    Misses add 128 bytes per warp
                    Compare 2 * l1_local_load_miss count to the gmem access count
                    (stores + loads)
                       Multiply lmem load misses by 2: the missed line must have been evicted ->
                       stored across the bus
                       When comparing with caching loads: count only gmem misses in L1
                       When comparing with non-caching loads: count all loads
Optimization for Register Spilling
                 Try increasing the limit of registers per thread
                     Use a higher limit in -maxrregcount, or a lower thread count for
                     __launch_bounds__
                     Will likely decrease occupancy, potentially making gmem accesses less
                     efficient
                     However, it may still be an overall win: fewer total bytes are accessed in
                     gmem

                 Non-caching loads for gmem
                     Potentially less contention with spilled registers in L1

                 Increase L1 size to 48KB (a sketch follows below)
                     The default is 16KB L1 / 48KB smem
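
                 A hedged sketch of both knobs (cudaFuncSetCacheConfig and
                 __launch_bounds__ are real CUDA APIs; the kernel and the numbers are
                 hypothetical):

                     // Cap registers at compile time so 3 blocks of 256 threads fit per SM.
                     __global__ void __launch_bounds__(256, 3) my_kernel(float *data)
                     {
                         // ...
                     }

                     // Host side, before the first launch: prefer 48KB L1 / 16KB smem
                     // so spills are more likely to stay cached on chip. Alternatively,
                     // raise the register budget globally: nvcc -maxrregcount=42 ...
                     cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);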

Register Spilling: Case Study
                 FD kernel (3D-cross stencil)
                     fp32, so all gmem accesses are 4-byte words
                        Need higher occupancy to saturate memory bandwidth
                     Coalesced, non-caching loads
                        One gmem request = 128 bytes
                        All gmem loads result in bus traffic
                     Larger threadblocks mean lower gmem pressure
                        Halos (ghost cells) are smaller as a percentage

                 Aiming to have 1024 concurrent threads per SM
                     Means no more than 32 registers per thread
                     Compiled with -maxrregcount=32

Case Study: Register Spilling 1
                 10th-order-in-space kernel (31-point stencil)
                      32 registers per thread : 68 bytes of lmem per thread : up to 1024 threads per SM
                 Profiled counters:
                      l1_local_load_miss              =      36           inst_issued       = 8,308,582
                      l1_local_load_hit               = 70,956            gld_request       =   595,200
                      local_store                     = 64,800            gst_request       =   128,000
                 Conclusion: spilling is not a problem in this case
                      The ratio of gmem to lmem bus traffic is approx 10,044 : 1 (hardly any bus traffic is due to spills)
                           L1 contains most of the spills (99.9% hit rate for lmem loads)
                      Only 1.6% of all instructions are due to spills
                 Comparison:
                      42 registers per thread : no spilling : up to 768 threads per SM
                           Single 512-thread block per SM : 24% perf decrease
                           Three 256-thread blocks per SM : 7% perf decrease

Case Study: Register Spilling 2
                 12th-order-in-space kernel (37-point stencil)
                      32 registers per thread : 80 bytes of lmem per thread : up to 1024 threads per SM
                 Profiled counters:
                      l1_local_load_miss              = 376,889           inst_issued      = 10,154,216
                      l1_local_load_hit               = 36,931            gld_request      =   550,656
                      local_store                     = 71,176            gst_request      =   115,200
                 Conclusion: spilling is a problem in this case
                      The ratio of gmem to lmem bus traffic is approx 6 : 7 (53% of bus traffic is due to spilling)
                           L1 does not contain the spills (8.9% hit rate for lmem loads)
                      4.1% of all instructions are due to spills
                 Solution: increase the register limit per thread
                      42 registers per thread : no spilling : up to 768 threads per SM
                           Single 512-thread block per SM : 13% perf increase
                           Three 256-thread blocks per SM : 37% perf increase
Register Spilling: Summary

                 Spilling can hurt performance due to:
                    Increased pressure on the memory bus
                    Increased instruction count
                 Use the profiler to examine the impact by comparing:
                    2 * l1_local_load_miss to the total gmem bus traffic
                    Local access count to total instructions issued
                 Impact is significant if:
                    Memory-‐bound code: lmem misses are a significant percentage of
                    total bus traffic
                    Instruction-‐bound code: lmem accesses are a significant
                    percentage of all instructions
Summary
                Determining what limits your kernel most:
                    Arithmetic, memory bandwidth, latency
                Address the bottlenecks in the order of importance
                    Analyze for inefficient use of hardware
                    Estimate the impact on overall performance
                    Optimize to most efficiently use hardware
                More resources:
                    Fundamental Optimizations talk
                    Prior CUDA tutorials at Supercomputing
                       http://gpgpu.org/{sc2007,sc2008,sc2009}
                    CUDA C Programming Guide, CUDA C Best Practices Guide
                    CUDA webinars

Questions?




 

Similaire à [Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bauer, Stanford)

Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operationniallmilton
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community
 
Implementing data and databases on K8s within the Dutch government
Implementing data and databases on K8s within the Dutch governmentImplementing data and databases on K8s within the Dutch government
Implementing data and databases on K8s within the Dutch governmentDoKC
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsIntel® Software
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopDataWorks Summit
 
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSAccelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSCeph Community
 
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...Nagios
 
SQLintersection keynote a tale of two teams
SQLintersection keynote a tale of two teamsSQLintersection keynote a tale of two teams
SQLintersection keynote a tale of two teamsSumeet Bansal
 
Os Madsen Block
Os Madsen BlockOs Madsen Block
Os Madsen Blockoscon2007
 
Shak larry-jeder-perf-and-tuning-summit14-part1-final
Shak larry-jeder-perf-and-tuning-summit14-part1-finalShak larry-jeder-perf-and-tuning-summit14-part1-final
Shak larry-jeder-perf-and-tuning-summit14-part1-finalTommy Lee
 
Tips and Tricks for SAP Sybase IQ
Tips and Tricks for SAP  Sybase IQTips and Tricks for SAP  Sybase IQ
Tips and Tricks for SAP Sybase IQDon Brizendine
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Steve Staso
 
Presentation architecting a cloud infrastructure
Presentation   architecting a cloud infrastructurePresentation   architecting a cloud infrastructure
Presentation architecting a cloud infrastructurexKinAnx
 
Presentation architecting a cloud infrastructure
Presentation   architecting a cloud infrastructurePresentation   architecting a cloud infrastructure
Presentation architecting a cloud infrastructuresolarisyourep
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxScyllaDB
 
Stream Processing
Stream ProcessingStream Processing
Stream Processingarnamoy10
 
Hardware planning & sizing for sql server
Hardware planning & sizing for sql serverHardware planning & sizing for sql server
Hardware planning & sizing for sql serverDavide Mauri
 

Similaire à [Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bauer, Stanford) (20)

Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operation
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
 
Implementing data and databases on K8s within the Dutch government
Implementing data and databases on K8s within the Dutch governmentImplementing data and databases on K8s within the Dutch government
Implementing data and databases on K8s within the Dutch government
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for Hadoop
 
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSAccelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
 
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at ...
 
SQLintersection keynote a tale of two teams
SQLintersection keynote a tale of two teamsSQLintersection keynote a tale of two teams
SQLintersection keynote a tale of two teams
 
CLFS 2010
CLFS 2010CLFS 2010
CLFS 2010
 
Os Madsen Block
Os Madsen BlockOs Madsen Block
Os Madsen Block
 
Shak larry-jeder-perf-and-tuning-summit14-part1-final
Shak larry-jeder-perf-and-tuning-summit14-part1-finalShak larry-jeder-perf-and-tuning-summit14-part1-final
Shak larry-jeder-perf-and-tuning-summit14-part1-final
 
Tips and Tricks for SAP Sybase IQ
Tips and Tricks for SAP  Sybase IQTips and Tricks for SAP  Sybase IQ
Tips and Tricks for SAP Sybase IQ
 
Concept of thread
Concept of threadConcept of thread
Concept of thread
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09
 
Presentation architecting a cloud infrastructure
Presentation   architecting a cloud infrastructurePresentation   architecting a cloud infrastructure
Presentation architecting a cloud infrastructure
 
Presentation architecting a cloud infrastructure
Presentation   architecting a cloud infrastructurePresentation   architecting a cloud infrastructure
Presentation architecting a cloud infrastructure
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
 
Stream Processing
Stream ProcessingStream Processing
Stream Processing
 
Hardware planning & sizing for sql server
Hardware planning & sizing for sql serverHardware planning & sizing for sql server
Hardware planning & sizing for sql server
 

Plus de npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patternsnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)npinto
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...npinto
 

Plus de npinto (12)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
 

Dernier

ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 

Dernier (20)

ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 

[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bauer, Stanford)

  • 1. Analysis-Driven Optimization. Presented by Cliff Woolley for Harvard CS264. © NVIDIA 2010
  • 2. Performance Optimization Process. Use an appropriate performance metric for each kernel: for example, Gflops/s is not a meaningful metric for a bandwidth-bound kernel. Determine what limits kernel performance: memory throughput, instruction throughput, latency, or a combination of the above. Address the limiters in order of importance: determine how close to the HW limits the resource is being used, analyze for possible inefficiencies, and apply optimizations; often these will just fall out from how the HW operates.
  • 3. Presentation Outline. Identifying performance limiters. Analyzing and optimizing: memory-bound kernels, instruction (math) bound kernels, and kernels with poor latency hiding. For each: brief background, how to analyze, how to judge whether a particular issue is problematic, and how to optimize. Note: most information is for Fermi GPUs.
  • 4. New in CUDA Toolkit 4.0: Automated Performance Analysis using Visual Profiler. Summary analysis & hints: per-session, per-device, per-context, per-kernel. New UI for kernel analysis: identify the limiting factor; analyze instruction throughput; analyze memory throughput; analyze kernel occupancy.
  • 5. Notes on profiler. Most counters are reported per Streaming Multiprocessor (SM), not for the entire GPU; exceptions: L2 and DRAM counters. A single run can collect only a few counters, so multiple runs are needed when profiling more counters (done automatically by the Visual Profiler; must be done manually with the command-line profiler). Counter values may not be exactly the same for repeated runs: threadblocks and warps are scheduled at run-time. See the profiler documentation for more information.
  • 7. Limited by Bandwidth or Arithmetic? Perfect instructions:bytes ratio for a Fermi C2050: ~4.5 : 1 with ECC on, ~3.6 : 1 with ECC off; these assume fp32 instructions (throughput for other instructions varies). Algorithmic analysis gives a rough estimate of the arithmetic-to-bytes ratio. Code likely uses more instructions and bytes than the algorithmic analysis suggests: instructions for loop control, pointer math, etc.; the address pattern may result in more memory fetches. Two ways to investigate: use the profiler (quick, but approximate), or use source code modification (more accurate, but more work).
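A minimal sketch of the algorithmic analysis (kernel and names hypothetical): a SAXPY-style kernel performs one fp32 FMA per 12 bytes moved, far below the balance point above, so it should be memory-bound:

    // Hypothetical kernel, used only to illustrate the instruction:byte estimate.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // 1 FMA per element
    }
    // Per element: 1 fp32 instruction vs. 12 bytes (8 read + 4 written),
    // i.e. ~0.08 instructions per byte -- far below the ~3.6-4.5 : 1
    // balance point, so memory bandwidth will be the limiter.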
  • 8. A Note on Counting Global Memory Accesses. The load/store instruction count can be lower than the number of actual memory transactions (address pattern, different word sizes). Counting requests from L1 to the rest of the memory system makes the most sense: for caching loads, count L1 misses; for non-caching loads and stores, count L2 read requests. Note that L2 counters are for the entire chip, while L1 counters are per SM. One 32-bit access instruction -> one 128-byte transaction per warp; one 64-bit access instruction -> two 128-byte transactions per warp; one 128-bit access instruction -> four 128-byte transactions per warp.
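A sketch of how word size maps to transactions (assuming fully coalesced, 128-byte-aligned accesses):

    // Each warp (32 threads) accessing consecutive elements:
    //   float  (32-bit):  32 * 4B  = 128B -> 1 transaction per warp
    //   double (64-bit):  32 * 8B  = 256B -> 2 transactions per warp
    //   float4 (128-bit): 32 * 16B = 512B -> 4 transactions per warp
    __global__ void word_size_demo(const float *a, const double *b,
                                   const float4 *c, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = a[i] + (float)b[i] + c[i].x;
    }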
  • 9. Analysis with Profiler. Profiler counters: instructions_issued and instructions_executed (both incremented by 1 per warp); gld_request and gst_request (incremented by 1 per warp for each load/store instruction); l1_global_load_miss, l1_global_load_hit, and global_store_transaction (incremented by 1 per L1 line, where a line is 128B); uncached_global_load_transaction (incremented by 1 per group of 1, 2, 3, or 4 transactions; better to look at the l2_read_request counter instead, which is incremented by 1 per 32B transaction and counted per GPU, not per SM). Compare: 32 * instructions_issued (32 = warp size) versus 128B * (global_store_transaction + l1_global_load_miss).
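A small host-side helper, as a sketch, that performs this comparison (the arguments are whatever counter values the profiler reported for one launch):

    // Instruction:byte ratio from profiler counters, to compare against the
    // ~3.6-4.5 : 1 balance point for a C2050 (slide 7).
    double instr_to_byte_ratio(double instructions_issued,       // per-warp counter
                               double l1_global_load_miss,       // 128B lines
                               double global_store_transaction)  // 128B lines
    {
        double thread_instructions = 32.0 * instructions_issued; // 32 = warp size
        double bytes_moved = 128.0 * (l1_global_load_miss + global_store_transaction);
        return thread_instructions / bytes_moved;
    }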
  • 10. Analysis with Modified Source Code. Time memory-only and math-only versions of the kernel (easiest when the kernel has no data-dependent control-flow or addressing). This gives good estimates for the time spent accessing memory and the time spent executing instructions. Compare the times taken by the modified kernels: helps decide whether the kernel is memory or math bound. Compare the sum of the mem-only and math-only times to the full-kernel time: this shows how well memory operations are overlapped with arithmetic and can reveal a latency bottleneck.
  • 11-14. Some Example Scenarios. [Timeline figures comparing mem-only, math-only, and full-kernel times.] Memory-bound: full-kernel time is close to the mem-only time; good mem-math overlap, so latency is likely not a problem (assuming memory throughput is not low compared to the HW theoretical). Math-bound: full-kernel time is close to the math-only time; good mem-math overlap, latency likely not a problem (assuming instruction throughput is not low compared to the HW theoretical). Balanced: full-kernel time is close to the larger of the two; good mem-math overlap, latency likely not a problem (assuming memory/instruction throughput is not low compared to the HW theoretical). Memory and latency bound: full-kernel time is much larger than either modified-kernel time; poor mem-math overlap means latency is a problem.
  • 15. Source Modification. Memory-only: remove as much arithmetic as possible without changing the access pattern; use the profiler to verify that the load/store instruction count is the same. Store-only: also remove the loads (to compare read time vs. write time). Math-only: remove the global memory accesses; you need to trick the compiler, since it throws away all code that it detects as not contributing to stores. Put stores inside conditionals that always evaluate to false; the condition should depend on the value about to be stored (prevents other optimizations), and the condition's outcome should not be known to the compiler.
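A sketch of a memory-only variant (heavily simplified, kernel and names hypothetical): the arithmetic is gone but the loads/stores and their addressing are unchanged, which the profiler can confirm via the load/store instruction count:

    __global__ void fwd_3D_mem_only(const float *g_input, float *g_output,
                                    int out_idx_offset)
    {
        int out_idx = blockIdx.x * blockDim.x + threadIdx.x + out_idx_offset;
        float value = g_input[out_idx];   // same loads, same addresses
        g_output[out_idx] = value;        // same store, all math removed
    }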
  • 16. Source Modification for Math-only. Example (the condition depends on the value and never holds at run-time when flag == 0, but the compiler cannot prove it):

    __global__ void fwd_3D( ..., int flag )
    {
        ...
        value = temp + coeff * vsq;
        if( 1 == value * flag )          // never true for flag == 0, and the
            g_output[out_idx] = value;   // outcome depends on the stored value
        ...
    }

If you compare only the flag, the compiler may move the computation into the conditional as well.
  • 17. Source Modification and Occupancy. Removing pieces of code is likely to affect the register count; this could increase occupancy, skewing the results (see slide 23 for how that can affect throughput). Make sure to keep the same occupancy: check occupancy with the profiler before the modifications; after the modifications, if necessary, add unused dynamic shared memory to the launch: kernel<<< grid, block, smem, ...>>>(...).
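For example (a sketch; kernel name and byte count hypothetical), the dynamic smem argument can pin the modified kernel back to the original occupancy:

    // If removing math lowered the register count and raised occupancy,
    // request enough dynamic smem per block to restore the original
    // blocks-per-SM count (tune the byte count against the profiler's
    // occupancy report for the unmodified kernel).
    size_t smem_limiter = 4096;
    fwd_3D_mem_only<<< grid, block, smem_limiter >>>(d_in, d_out, 0);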
  • 18. Case Study: Limiter Analysis. 3DFD of the wave equation, fp32. Measurements — time (ms): full-kernel 35.39, mem-only 33.27, math-only 16.25; instructions issued: full-kernel 18,194,139, mem-only 7,497,296, math-only 16,839,792; memory access transactions: full-kernel 1,708,032, mem-only 1,708,032, math-only 0. Analysis: instr:byte ratio = ~2.66 (32 * 18,194,139 / (128 * 1,708,032)); good overlap between math and mem: only 2.12 ms of the math-only time (13%) is not overlapped with mem; app memory throughput is 62 GB/s (HW theoretical is 114 GB/s). Conclusion: the code is memory-bound (latency could be an issue too); optimizations should focus on memory throughput first, since math contributes very little to total time (2.12 out of 35.39 ms).
  • 19. Summary: Limiter Analysis. Rough algorithmic analysis: how many bytes are needed, how many instructions. Profiler analysis: instruction count, memory request/transaction count. Analysis with source modification: memory-only and math-only versions of the kernel; examine how these times relate and overlap.
  • 20. Optimizations for Global Memory
  • 21. Memory Throughput Analysis. Throughput can be measured from two points of view: from the app's point of view, count the bytes requested by the application; from the HW's point of view, count the bytes actually moved by the hardware. The two can differ: in a scattered/misaligned pattern, not all transaction bytes are utilized; with a broadcast, the same small transaction serves many requests. Two aspects to analyze for performance impact: the addressing pattern, and the number of concurrent accesses in flight.
  • 22. Memory Throughput Analysis. Determining that the access pattern is problematic: app throughput is much smaller than HW throughput (use the profiler to get HW throughput); the profiler counters show an access instruction count significantly smaller than the transaction count: gld_request < (l1_global_load_miss + l1_global_load_hit) * (word_size / 4B), and gst_request < 4 * l2_write_requests * (word_size / 4B); make sure to adjust the transaction counters for word size (see slide 8). Determining that the number of concurrent accesses is insufficient: throughput from the HW point of view is much lower than theoretical.
  • 23. Concurrent Accesses and Performance. Experiment: increment a 64M-element array; two accesses per thread (a load then a store, and they are dependent), so each warp (32 threads) has one outstanding transaction at a time. On a Tesla C2050 with ECC on, the max achievable bandwidth is ~114 GB/s. Several independent smaller accesses have the same effect as one larger one: for example, four 32-bit accesses ~= one 128-bit access.
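A sketch of the two kernels behind this experiment (assuming n is a multiple of 4 for the vector version): the scalar version has one dependent load-store pair per thread, while the float4 version keeps four words in flight per thread with the same instruction count:

    __global__ void inc_scalar(float *a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = a[i] + 1.0f;           // one dependent 4B load + store
    }

    __global__ void inc_vec4(float4 *a, int n4)  // n4 = n / 4
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = a[i];                     // one 16B load per thread
            v.x += 1.0f; v.y += 1.0f; v.z += 1.0f; v.w += 1.0f;
            a[i] = v;                            // one 16B store per thread
        }
    }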
  • 24. Optimization: Address Pattern. Coalesce the address pattern: 128-byte lines for caching loads, 32-byte segments for non-caching loads and stores; coalesce to maximize utilization of bus transactions (refer to the CUDA C Programming Guide / Best Practices Guide). Try using non-caching loads: smaller transactions (32B instead of 128B) are more efficient for scattered or partially-filled patterns. Try fetching data through texture: smaller transactions and different caching, and the texture cache is not polluted by other gmem loads.
  • 25. Optimizing Access Concurrency. Have enough concurrent accesses to saturate the bus: need roughly (mem_latency x bandwidth) bytes in flight (Little's law). Fermi C2050 global memory: 400-800 cycle latency, 1.15 GHz clock, 144 GB/s bandwidth, 14 SMs; this works out to roughly 30-50 128-byte transactions in flight per SM (a worked estimate follows below). Ways to increase concurrent accesses: increase occupancy (adjust threadblock dimensions to maximize occupancy at the given register and smem requirements; reduce register count via the -maxrregcount option or __launch_bounds__), or modify the code to process several elements per thread.
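A rough worked estimate with the numbers above, assuming a mid-range 600-cycle latency:

    \text{bytes in flight} \approx \text{latency} \times \text{bandwidth}
        = \frac{600~\text{cycles}}{1.15~\text{GHz}} \times 144~\text{GB/s}
        \approx 75~\text{KB}

    \frac{75~\text{KB}}{14~\text{SMs} \times 128~\text{B}}
        \approx 42~\text{transactions in flight per SM}

Sweeping the latency over the 400-800 cycle range roughly reproduces the 30-50 transactions quoted above.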
  • 26. Case Study: Access Pattern 1. Same 3DFD code as in the previous study. Using caching loads (the compiler default): memory throughput is 62 / 74 GB/s for app / hw — different enough to be interesting. The loads are coalesced: gld_request == (l1_global_load_miss + l1_global_load_hit). But there are halo loads that use only 4 threads out of 32; for these transactions, only 16 bytes out of 128 are useful. Solution: try non-caching loads (the -Xptxas -dlcm=cg compiler option). Result: a 7% performance increase — not bad for just trying a compiler flag, with no code change; memory throughput is now 66 / 67 GB/s for app / hw.
  • 27. Case Study: Accesses in Flight. Continuing with the FD code. Throughput from both the app and hw points of view is 66-67 GB/s; 30.84 of the 33.71 ms are now due to mem. There are 1024 concurrent threads per SM, limited by the register count (24 per thread); a simple copy kernel reaches only ~80% of achievable mem throughput at this thread count. Solution: increase accesses per thread. The code was modified so that each thread is responsible for 2 output points (see the sketch below): this doubles the load and store count per thread, saves some indexing math, and doubles the tile size, which reduces bandwidth spent on halos. Result: a further 25% increase in performance; app and HW throughputs are now 82 and 84 GB/s, respectively.
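A sketch of the transformation (stencil math elided; indexing hypothetical): each thread now produces two output points, doubling the independent accesses in flight and sharing the index arithmetic:

    __global__ void fd_two_points(const float *g_in, float *g_out,
                                  int n_half, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_half) {
            float a = g_in[i];               // two independent loads are
            float b = g_in[i + stride];      // in flight simultaneously
            g_out[i]          = a;           // (real code would apply the
            g_out[i + stride] = b;           // stencil before storing)
        }
    }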
  • 28. Case Study: Access Pattern 2. Kernel from a climate simulation code, mostly fp64 (so at least 2 transactions per mem access). Profiler results: gld_request: 72,704; l1_global_load_hit: 439,072; l1_global_load_miss: 724,192. Analysis: L1 hit rate is 37.7%; 16 transactions per load instruction, which indicates a bad access pattern (only 2 are expected for 64-bit words); of the 16, 10 miss in L1 and contribute to mem bus traffic — so we fetch 5x more bytes than the app needs.
  • 29. Case Study: Access Pattern 2 (continued). Looking closer at the access pattern: each thread linearly traverses a contiguous memory region, expecting CPU-like L1 caching; on a GPU there is far too little L1 and L2 per thread for this to work, making it one of the worst access patterns for GPUs. Solution: the code was transposed so that each warp accesses a contiguous memory region (see the sketch below); now 2.17 transactions per load instruction. This and some other changes improved performance by 3x.
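A sketch of the two layouts (simplified to a 1D copy): in the first, each thread walks its own contiguous region, so a warp touches 32 different 128-byte lines per iteration; in the second, consecutive threads touch consecutive words, so a warp's accesses coalesce:

    __global__ void per_thread_contiguous(const double *in, double *out, int len)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        for (int k = 0; k < len; ++k)               // bad on GPUs
            out[t * len + k] = in[t * len + k];
    }

    __global__ void warp_contiguous(const double *in, double *out,
                                    int len, int nthreads)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        for (int k = 0; k < len; ++k)               // coalesced
            out[k * nthreads + t] = in[k * nthreads + t];
    }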
  • 30. Optimizing with Compression. When all else has been optimized and the kernel is limited by the number of bytes needed, consider compression. Approaches — Int: conversion between 8-, 16-, and 32-bit integers costs 1 instruction (64-bit requires a couple). FP: conversion between fp16, fp32, and fp64 is one instruction; fp16 (1s5e10m) is storage-only, with no math instructions. Range-based: the lower and upper limits are kernel arguments, and the data is an index for interpolation between the limits (see the sketch below). Application in practice: Clark et al., http://arxiv.org/abs/0911.3191
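A sketch of the range-based idea (names hypothetical): the stored datum is a single byte, expanded on the fly from the kernel-argument limits, so the bus moves 1 byte instead of 4 per value:

    __global__ void expand_range_compressed(const unsigned char *idx, float *out,
                                            float lo, float hi, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                                 // 1B fetched per fp32 produced
            out[i] = lo + (hi - lo) * (idx[i] * (1.0f / 255.0f));
    }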
  • 31. Summary: Memory Analysis and Optimization. Analyze the access pattern: compare counts of access instructions and transactions; compare throughput from the app and hw points of view. Analyze the number of accesses in flight: look at occupancy and independent accesses per thread; compare achieved throughput to theoretical throughput, and also to simple memcpy throughput at the same occupancy. Optimizations: coalesce address patterns per warp (nothing new here) and consider texture; process more words per thread (if there are insufficient accesses in flight to saturate the bus); try the 4 combinations of L1 size and load type (caching and non-caching); consider compression.
  • 32. Optimizations for Instruction Throughput
  • 33. Possible Limiting Factors. Raw instruction throughput: know the kernel's instruction mix; fp32, fp64, int, mem, transcendentals, etc. have different throughputs (refer to the CUDA C Programming Guide); you can examine the assembly if needed by looking at the SASS (machine assembly) obtained by disassembling the .cubin with cuobjdump. Instruction serialization: occurs when threads in a warp issue the same instruction in sequence, as opposed to the entire warp issuing the instruction at once; some causes: shared memory bank conflicts, constant memory bank conflicts.
  • 34. Instruction Throughput: Analysis. Profiler counters (both incremented by 1 per warp): instructions_executed counts instructions encountered during execution; instructions_issued also includes the additional issues due to serialization. The difference between the two covers issues caused by serialization, instruction cache misses, etc. Compare achieved throughput to HW capabilities: peak instruction throughput is documented in the Programming Guide; the profiler also reports throughput (GT200: as a fraction of theoretical peak for fp32 instructions; Fermi: as IPC, instructions per clock).
  • 35. Instruction Throughput: Optimization. Use intrinsics where possible (__sinf(), __sincosf(), __expf(), etc.): available for a number of math.h functions; less accuracy, much higher throughput (refer to the CUDA C Programming Guide for details); often a single instruction, whereas the non-intrinsic version is a SW sequence. Additional compiler flags also help (selecting GT200-level precision): -ftz=true (flush denormals to 0), -prec-div=false (faster fp division instruction sequence, some precision loss), -prec-sqrt=false (faster fp sqrt instruction sequence, some precision loss). Make sure you do fp64 arithmetic only where you mean it: fp64 throughput is lower than fp32, and fp literals without an 'f' suffix are fp64 by default (see the sketch below).
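A sketch contrasting the two (assuming fp32 data): the intrinsic maps to a single special-function instruction, and the 'f' suffix keeps the multiply in fp32:

    __global__ void intrinsics_demo(const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float accurate = expf(x[i]) * 2.5;   // SW exp sequence; 2.5 is fp64,
                                                 // promoting the multiply
            float fast = __expf(x[i]) * 2.5f;    // single-instruction exp; fp32
                                                 // throughout
            y[i] = accurate - fast;              // difference = accuracy cost
        }
    }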
  • 36. Serialization: Profiler Analysis. Serialization is significant if instructions_issued is significantly higher than instructions_executed. Warp divergence: profiler counters divergent_branch and branch; compare the two to see what percentage diverges; however, this only counts the branches, not the rest of the serialized instructions. SMEM bank conflicts: profiler counters l1_shared_bank_conflict (incremented by 1 per warp for each replay; double-counts for 64-bit accesses) and shared_load / shared_store (incremented by 1 per warp per smem load/store instruction). Bank conflicts are significant only if both are true: instruction throughput affects performance, and l1_shared_bank_conflict is significant compared to instructions_issued and to shared_load + shared_store.
  • 37. Serialization: Analysis with Modified Code. Modify the kernel code to assess the performance improvement if serialization were removed; this helps decide whether the optimizations are worth pursuing. Shared memory bank conflicts: change the indexing to be either broadcasts or just threadIdx.x; you should also declare the smem variables as volatile so the compiler does not optimize the accesses away. Warp divergence: change the condition so it always takes the same path; time both paths to see what each costs.
  • 38. Serialization: Optimization. Shared memory bank conflicts: pad SMEM arrays (see the CUDA Best Practices Guide and the Transpose SDK whitepaper, and the padding sketch below); rearrange the data in SMEM. Warp serialization: try grouping threads that take the same path; rearrange or pre-process the data; rearrange how threads index data (may affect memory performance).
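A sketch of the padding fix from the transpose whitepaper (assuming 32x32 thread blocks and a width that is a multiple of 32): the extra column gives a row stride of 33, so column-wise smem accesses fall into 32 different banks:

    __global__ void transpose_padded(const float *in, float *out, int width)
    {
        __shared__ float tile[32][32 + 1];   // +1 column avoids 32-way
                                             // bank conflicts on column reads
        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();
        x = blockIdx.y * 32 + threadIdx.x;   // swap block coordinates
        y = blockIdx.x * 32 + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];
    }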
  • 39. Case Study: SMEM Bank Conflicts. A different climate simulation code kernel, fp64. Profiler values — instructions executed / issued: 2,406,426 / 2,756,140; difference: 349,714 (12.7% of issued instructions are replays). GMEM: total load and store transactions: 170,263; the instr:byte ratio of ~4 suggests that instructions are a significant limiter (especially since there is a lot of fp64 math). SMEM: load / store: 421,785 / 95,172; bank conflicts: 674,856 (really 337,428, because of the double-counting for fp64). This means a total of 854,385 SMEM access issues (421,785 + 95,172 + 337,428), i.e. 39% replays. Solution: pad the shared memory array; performance increased by 15%, and replayed instructions were reduced to 1%.
  • 40. Instruction Throughput: Summary. Analyze: check achieved instruction throughput and compare to the HW peak (taking the instruction mix into consideration); check the percentage of instructions due to serialization. Optimizations: intrinsics and compiler options for expensive operations; group threads that are likely to follow the same execution path; avoid SMEM bank conflicts (pad, rearrange data).
  • 41. Optimizations for Latency
  • 42. Latency: Analysis. Suspect latency if neither the memory nor the instruction throughput rate is close to the HW theoretical rate, or if there is poor overlap between mem and math (full-kernel time significantly larger than max{mem-only, math-only}). Two possible causes: insufficient concurrent threads per multiprocessor to hide latency (occupancy too low, or too few threads in the kernel launch to load the GPU); or too few concurrent threadblocks per SM when using __syncthreads(), since __syncthreads() can prevent overlap between math and mem within the same threadblock.
  • 43-44. Simplified View of Latency and Syncs. [Timeline figures.] For a kernel where most math cannot be executed until all data is loaded by the threadblock, compare: the memory-only time; the math-only time; the full-kernel time with one large threadblock per SM (the math waits for the whole load, so the times add up); and the full-kernel time with two threadblocks per SM, each half the size of the large one (one block's math can overlap the other block's loads, shortening the total time).
  • 45. Latency: Optimization. Insufficient threads or workload: increase the level of parallelism (more threads); if occupancy is already high but latency is not being hidden, process several output elements per thread — this gives more independent memory and arithmetic instructions, which get pipelined. Barriers: you can assess the impact on performance by commenting out __syncthreads() (incorrect results, but gives an upper bound on the improvement); try running several smaller threadblocks (in some cases that costs extra bandwidth due to halos). Check out Vasily Volkov's talk 2238 from GTC 2010 for a detailed treatment.
  • 46. Register Spilling
  • 47. Register Spilling. The Fermi HW limit is 63 registers per thread; values that do not fit spill to local memory (lmem). Spills are also possible when the register limit is programmer-specified: common when trying to achieve a certain occupancy with the -maxrregcount compiler flag or __launch_bounds__ in source (see the sketch below). lmem is like gmem, except that writes are cached in L1: an lmem load that hits in L1 causes no bus traffic; an lmem load that misses in L1 causes bus traffic (128 bytes per miss). The compiler flag -Xptxas -v reports the register and lmem usage per thread. Potential impact on performance: additional bandwidth pressure if spills are evicted from L1, and additional instructions. Not always a problem, and easy to investigate with a quick profiler analysis.
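A sketch of the source-side limit (kernel hypothetical, body elided); building with nvcc -Xptxas -v then prints the resulting registers and lmem bytes per thread:

    // Ask for 256-thread blocks with at least 4 resident blocks per SM;
    // the compiler caps registers per thread accordingly, spilling to
    // lmem if it runs out.
    __global__ void __launch_bounds__(256, 4)
    capped_kernel(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];   // stencil body elided
    }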
  • 48. Register Spilling: Analysis. Profiler counters: l1_local_load_hit, l1_local_load_miss. Impact on instruction count: compare the local access count to total instructions issued. Impact on memory throughput: misses add 128 bytes of traffic per warp; compare 2 * l1_local_load_miss to the gmem access count (stores + loads). Multiply lmem load misses by 2 because the missed line must have been evicted earlier, which caused a store across the bus. When comparing against caching loads, count only the gmem loads that miss in L1; when comparing against non-caching loads, count all loads.
  • 49. Optimization for Register Spilling. Try increasing the register-per-thread limit: use a higher limit in -maxrregcount, or a lower thread count for __launch_bounds__; this will likely decrease occupancy, potentially making gmem accesses less efficient, but it may still be an overall win because fewer total bytes are accessed in gmem. Try non-caching loads for gmem: potentially fewer contentions with spilled registers in L1. Increase the L1 size to 48KB (the default is 16KB L1 / 48KB smem), as in the sketch below.
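A sketch of the L1 preference on the host side (kernel name carried over from the hypothetical sketch above):

    // Prefer 48KB L1 / 16KB smem for this kernel so spilled registers are
    // more likely to stay in L1 (the Fermi default is 16KB L1 / 48KB smem).
    cudaFuncSetCacheConfig(capped_kernel, cudaFuncCachePreferL1);
    capped_kernel<<< grid, block >>>(d_in, d_out);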
  • 50. Register Spilling: Case Study Setup. FD kernel (3D-cross stencil), fp32, so all gmem accesses are 4-byte words. Higher occupancy is needed to saturate memory bandwidth. Coalesced, non-caching loads: one gmem request = 128 bytes, and all gmem loads result in bus traffic. Larger threadblocks mean lower gmem pressure, since the halos (ghost cells) are a smaller percentage of the tile. Aiming for 1024 concurrent threads per SM means no more than 32 registers per thread; compiled with -maxrregcount=32.
  • 51. Case Study: Register Spilling 1. 10th-order-in-space kernel (31-point stencil); 32 registers per thread: 68 bytes of lmem per thread, up to 1024 threads per SM. Profiled counters: l1_local_load_miss = 36; l1_local_load_hit = 70,956; local_store = 64,800; inst_issued = 8,308,582; gld_request = 595,200; gst_request = 128,000. Conclusion: spilling is not a problem in this case. The ratio of gmem to lmem bus traffic is approx 10,044 : 1 (hardly any bus traffic is due to spills); L1 contains most of the spills (99.9% hit rate for lmem loads); only 1.6% of all instructions are due to spills. Comparison with 42 registers per thread (no spilling, up to 768 threads per SM): a single 512-thread block per SM gives a 24% performance decrease; three 256-thread blocks per SM give a 7% performance decrease.
  • 52. Case Study: Register Spilling 2. 12th-order-in-space kernel (37-point stencil); 32 registers per thread: 80 bytes of lmem per thread, up to 1024 threads per SM. Profiled counters: l1_local_load_miss = 376,889; l1_local_load_hit = 36,931; local_store = 71,176; inst_issued = 10,154,216; gld_request = 550,656; gst_request = 115,200. Conclusion: spilling is a problem in this case. The ratio of gmem to lmem bus traffic is approx 6 : 7 (53% of bus traffic is due to spilling); L1 does not contain the spills (an 8.9% hit rate for lmem loads); 4.1% of all instructions are due to spills. Solution: increase the register limit to 42 per thread (no spilling, up to 768 threads per SM): a single 512-thread block per SM gives a 13% performance increase; three 256-thread blocks per SM give a 37% performance increase.
  • 53. Register Spilling: Summary. Spilling can affect performance due to increased pressure on the memory bus and an increased instruction count. Use the profiler to examine the impact by comparing: 2 * l1_local_load_miss to all gmem bus traffic, and the local access count to total instructions issued. The impact is significant if: in memory-bound code, lmem misses are a significant percentage of total bus traffic; in instruction-bound code, lmem accesses are a significant percentage of all instructions.
  • 54. Summary. Determine what limits your kernel most: arithmetic, memory bandwidth, or latency. Address the bottlenecks in order of importance: analyze for inefficient use of hardware, estimate the impact on overall performance, and optimize to use the hardware most efficiently. More resources: the Fundamental Optimizations talk; prior CUDA tutorials at Supercomputing (http://gpgpu.org/{sc2007,sc2008,sc2009}); the CUDA C Programming Guide and CUDA C Best Practices Guide; CUDA webinars.
  • 55. Questions?