4. GPU MEMORY TYPES
On-chip memory resides on the SM:
Register file (RF)
Shared (SMEM)
Off-chip memory resides in the GPU's DRAM:
Global (GMEM)
Constant (CMEM)
Texture (TEX)
Local (LMEM)
5. VECTOR TRANSACTION
SM has dedicated LD/ST units to handle memory access
Global memory accesses are serviced on a warp basis
6. COALESCED MEMORY ACCESS
First, sm_10 defined a coalesced access as an affine access aligned to a 128-byte line
The other obsolete sm_1x targets also have strict coalescing rules
Modern GPUs have more relaxed requirements and define a coalesced transaction as one
that fits in a cache line
7. COALESCED MEMORY ACCESS (CONT)
A request is coalesced if the warp loads only the bytes it needs
The fewer cache lines a request needs, the more coalesced the access is
Address alignment by cache line size is still preferred
9. CACHE CHARACTERISTICS
Cache L1 L2
generation Fermi Kepler Fermi Kepler
sizes, KB 16/48 16/32/48 up to 768 up to 1536
line width 128B 32B
latency 56 clock - 282 158
mode R, n-c - R&W, c, WB
associativity 2x64/6x64 - ? ?
usage gmem, sys sys gmem, sys, tex
10. MEMORY REQUEST TRAJECTORY: LD.E
Fermi: fully-cached load
LD/ST units compute the physical address and the number of cache lines the warp
requests (the L1 line is 128 B)
On an L1 hit, return the line; otherwise go to L2
L2 subdivides the 128 B line into 4 x 32 B sectors (the L2 line size). If all required 32 B
sectors are found in L2, return the result; otherwise fetch from gmem
Kepler
discrete GPUs: like Fermi, but bypassing L1
integrated GPUs: the same as Fermi
11. DUALITY OF CACHE LINE
The following requests are equivalent from the gmem point of view.
The 32 B granularity is useful when the access pattern is close to random.
12. LOAD CACHING CONFIGURATIONS
LD
Default (cache all): no special suffix
Cache only in L2 (cache global): LD.CG
Bypass caches (cache volatile): LD.CV
Cache streaming: LD.CS
LD R8, [R6];
LD.CG R4, [R16];
LD.CV R14, [R14];
13. MEMORY REQUEST TRAJECTORY: ST.E
A store invalidates the cache line in the L1 of every SM where it is present (since L1s are
per-SM and non-coherent)
The request goes directly to L2. The default write strategy is write-back; it can be
configured as write-through
An L2 hit costs ~160 clocks when no write-back is needed
On an L2 miss the request goes to gmem (penalty > 350 clocks)
L2 is multi-ported
14. WIDE & NARROW TYPES
Wide
The GPU supports wide memory transactions
Only 64- and 128-bit transactions are supported, since they map onto 2 (respectively 4)
32-bit registers
Narrow
Example: an SoA store of uchar2 results in 2 store transactions
/*1618*/ LD.E.128 R8, [R14];
/*1630*/ ST.E.128 [R18], R8;

struct uchar2 {
    unsigned char x;
    unsigned char y;
};

/*02c8*/ ST.E.U8 [R6+0x1], R0;
/*02d0*/ ST.E.U8 [R6], R3;
15. GMEM ATOMIC OPERATIONS
Performed in L2 per 32 B cache line.
throughput     | Fermi, per clock | Kepler, per clock
shared address |      1/9th       |         1
independent    |       24         |        64
"Same address" here means the same 32 B cache line.
ATOM: atomic operation returning the old value
RED: reduction, no value returned
ATOM.E.INC R4, [R6], R8;
RED.E.ADD [R2], R0;
16. TEXTURE HARDWARE
Legacy from graphics
Read-only. Loads always go through the interpolation hardware
Two-level: Dedicated L1, shared L2 for texture and global loads
property               | Fermi   | sm_30 | sm_35
L1 hit latency, clocks | no data |  104  |  108
line size, B           | no data |  128  |  128
size, KB               |    8    |   12  | 4 sub-partitions x 12
(sets) x (ways)        | no data |  4x24 |  4x24
L2 hit latency, clocks | no data |  212  |  229
penalty, clocks        | no data |  316  |  351
17. READ-ONLY DATA CACHE
The L1 texture cache is opened up for global loads, bypassing the interpolation hardware.
Supported since sm_35.
/*0288*/ TEXDEPBAR 0x0;
/*0290*/ LDG.E.64 R8, [R4];
/*0298*/ TEXDEPBAR 0x0;
/*02a0*/ LDG.E.64 R4, [R8];
/*02a8*/ IADD R6, R6, 0x4;
/*02b0*/ TEXDEPBAR 0x0;
/*02b8*/ LDG.E.64 R8, [R4];
/*02c8*/ ISETP.LT.AND P0, PT, R6, R7, PT;
/*02d0*/ TEXDEPBAR 0x0;
Size is 48 KB (4 sub-partitions x 12 KB)
Different warps go through different sub-partitions
Single warp can use up to 12 KB
18. CONSTANT MEMORY
Optimized for uniform access from the warp.
Compile time constants
Kernel parameters and configurations
2–3 layers of caches. Latency: 4–800 clocks
19. LOAD UNIFORM
The LDU instruction can employ the constant cache hierarchy for any global memory
location. LDU = load a (block-)uniform variable from memory.
The variable resides in global memory
Prefix the pointer with the const keyword
The memory access must be uniform across all threads in the block (i.e., not dependent
on threadIdx)
__global__ void kernel(test_t *g_dst, const test_t *g_src)
{
    const int tid = /* */;
    g_dst[tid] = g_src[0] + g_src[blockIdx.x];
}

/*0078*/ LDU.E R0, [R4];
/*0080*/ LDU.E R2, [R2];
20. SHARED MEMORY
Banked: successive 4-byte words are placed in successive banks
sm_1x – 16 banks x 4 B, sm_2x – 32 x 4 B, sm_3x – 32 x 8 B
Atomic operations are implemented in a lock/unlock manner
(void)atomicAdd(&smem[0], src[threadIdx.x]);

/*0050*/ SSY 0x80;
/*0058*/ LDSLK P0, R3, [RZ];
/*0060*/ @P0 IADD R3, R3, R0;
/*0068*/ @P0 STSUL [RZ], R3;
/*0070*/ @!P0 BRA 0x58;
/*0078*/ NOP.S;
21. REGISTER SPILLING
Local memory is the memory where spilled registers live
It physically resides in gmem, but is likely cached
A spilled local variable occupies a full cache line, because spilling is done per warp
Addressing is resolved by the compiler
Stores are cached in L1
Analogy with CPU stack variables
22. LDL/STL ACCESS OPERATION
A store writes the line to L1
If evicted, the line is written to L2
The line can also be evicted from L2; in that case it is written to DRAM
A load requests the line from L1
On a hit, the operation is complete
On a miss, the line is requested from L2
On an L2 miss, the line is requested from DRAM
23. FINAL WORDS
SM has dedicated LD/ST units to handle memory access
Global memory accesses are serviced on a warp basis
A coalesced transaction is one that fits in a cache line
GPU memory has 2 levels of caches
One L1 cache line consists of 4 L2 lines. The coalescing unit manages the number of L2
lines actually required
64-bit and 128-bit memory transactions are natively supported
Atomic operations on global memory are performed in L2
Register spilling is fully cached for both reads and writes