CODE GPU WITH CUDA
MEMORY SUBSYSTEM
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
GPU memory types
Vector transaction
Coalesced memory access
Memory hierarchy
Request trajectory
Hardware supported atomics
Texture, constant, shared memory types
Register spilling
OUT OF SCOPE
Computer graphics capabilities
Organization of texture interpolation HW
GPU MEMORY TYPES
On-chip memory resides on the SM:
Register file (RF)
Shared (SMEM)
Off-chip memory resides in the GPU’s RAM:
Global (GMEM)
Constant (CMEM)
Texture (TEX)
Local (LMEM)
VECTOR TRANSACTION
SM has dedicated LD/ST units to handle memory access
Global memory accesses are serviced on a per-warp basis
COALESCED MEMORY ACCESS
The first architecture, sm_10, defines a coalesced access as an affine access aligned to a 128-byte line
The other obsolete sm_1x architectures also have strict coalescing rules
Modern GPUs have more relaxed requirements and define a coalesced transaction as a transaction that fits a cache line
COALESCED MEMORY ACCESS (CONT)
A request is coalesced if the warp loads only the bytes it needs
The fewer cache lines a request needs, the more coalesced the access is
Aligning addresses to the cache line size is still preferred
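The rule above can be checked with simple arithmetic. The following is a minimal host-side C++ sketch, not a CUDA API: `lines_touched` is a hypothetical helper that assumes lane i of a 32-thread warp accesses `base + i * stride_bytes` and counts the distinct cache lines covered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper (illustration only, not part of CUDA): counts the
// distinct cache lines touched when each of the 32 lanes of a warp reads
// elem_bytes bytes starting at base + lane * stride_bytes.
inline std::size_t lines_touched(std::uint64_t base, std::size_t stride_bytes,
                                 std::size_t elem_bytes,
                                 std::size_t line_bytes = 128) {
    assert(line_bytes > 0 && elem_bytes > 0);
    std::set<std::uint64_t> lines;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        const std::uint64_t first = base + lane * stride_bytes;
        const std::uint64_t last = first + elem_bytes - 1;
        // An element may straddle a line boundary, so insert every line it covers.
        for (std::uint64_t line = first / line_bytes; line <= last / line_bytes; ++line)
            lines.insert(line);
    }
    return lines.size();
}
```

Under this model, an aligned unit-stride 4-byte access (`lines_touched(0, 4, 4)`) needs one 128 B line, the same access shifted by 4 bytes needs two, and a 128-byte stride needs 32 lines, one per lane.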
MEMORY HIERARCHY
GPU memory has 2 levels of caches.
CACHE CHARACTERISTICS
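Cache            L1                        L2
generation       Fermi        Kepler       Fermi        Kepler
size, KB         16/48        16/32/48     up to 768    up to 1536
line width, B    128          128          32           32
latency, clocks  56           -            282          158
mode             R, n-c       -            R&W, c, WB   R&W, c, WB
associativity    2x64/6x64    -            ?            ?
usage            gmem, sys    sys          gmem, sys, tex  gmem, sys, tex
Mode: R = read-only, R&W = read and write, n-c = non-coherent, c = coherent, WB = write-back.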
MEMORY REQUEST TRAJECTORY: LD.E
Fermi: fully-cached load
LD/ST units compute the physical address and the number of cache lines the warp requests (an L1 line is 128 B)
On an L1 hit, return the line; otherwise go to L2
L2 subdivides the 128 B line into 4 x 32 B lines (the L2 line size). If all required 32 B lines are found in L2, return the result; otherwise go to gmem
Kepler
discrete GPUs: like Fermi but bypass L1
integrated GPUs: the same as Fermi
DUALITY OF CACHE LINE
The following requests are equivalent from the gmem point of view.
32 B granularity is useful if the access pattern is close to random.
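To see why the finer granularity helps scattered accesses, the sketch below compares the bytes moved at 128 B and 32 B granularity. It is a simplified host-side C++ model under stated assumptions: `bytes_fetched` is a hypothetical helper, every load is 4 bytes and 4-byte aligned (so it never straddles a segment), and every touched segment is transferred whole.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical model (illustration only): total bytes transferred when every
// distinct segment of `granularity` bytes touched by the given 4-byte,
// 4-byte-aligned loads is fetched in full.
inline std::size_t bytes_fetched(const std::vector<std::uint64_t>& addrs,
                                 std::size_t granularity) {
    assert(granularity >= 4 && granularity % 4 == 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addrs)
        segments.insert(a / granularity);  // index of the segment holding this load
    return segments.size() * granularity;
}
```

For 32 contiguous loads both granularities move the same 128 bytes, which is the duality named in the slide title. For 32 loads spaced 4096 bytes apart, the model moves 4096 bytes at 128 B granularity but only 1024 bytes at 32 B granularity, for the same 128 useful bytes.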
LOAD CACHING CONFIGURATIONS
Default (cache all): no special suffix
LD R8, [R6];
Cache only in L2 (cache global): LD.CG
LD.CG R4, [R16];
Bypass caches (cache volatile): LD.CV
LD.CV R14, [R14];
Cache streaming
MEMORY REQUEST TRAJECTORY: ST.E
A store instruction invalidates the cache line in L1 on all SMs, if present (since L1s are per-SM and non-coherent)
The request goes directly to L2. The default write strategy is write-back; it can be configured as write-through.
A hit in L2 costs ~160 clocks if no write-back is needed.
An L2 miss goes to gmem (penalty > 350 clocks)
L2 is multi-ported
WIDE & NARROW TYPES
Wide
GPU supports wide memory transactions
Only 64-bit and 128-bit transactions are supported, since they can be mapped to 2 or 4 32-bit registers
/*1618*/ LD.E.128 R8, [R14];
/*1630*/ ST.E.128 [R18], R8;
Narrow
Example: uchar2 SOA store results in 2 store transactions
struct uchar2 {
    unsigned char x;
    unsigned char y;
};
/*02c8*/ ST.E.U8 [R6+0x1], R0;
/*02d0*/ ST.E.U8 [R6], R3;
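A likely reason for the two byte-wide stores above is alignment: a struct of two unsigned char members has 1-byte alignment, so the compiler cannot assume that a single 16-bit store would be aligned. CUDA's built-in uchar2 is declared with 2-byte alignment and float4 with 16-byte alignment. The host-side C++ sketch below shows the alignment arithmetic; `uchar2_plain`, `uchar2_aligned`, `float4_like`, and `is_aligned` are hypothetical names for illustration, and the SASS a given compiler version actually emits may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same layout as the struct on the slide: no alignment requirement beyond 1 byte.
struct uchar2_plain { unsigned char x; unsigned char y; };

// Hypothetical counterpart with an explicit 2-byte alignment.
struct alignas(2) uchar2_aligned { unsigned char x; unsigned char y; };

// Hypothetical 128-bit type: four 32-bit values, 16-byte aligned.
struct alignas(16) float4_like { float x, y, z, w; };

static_assert(alignof(uchar2_plain) == 1, "plain struct is only byte-aligned");
static_assert(alignof(uchar2_aligned) == 2, "one 16-bit access is possible");
static_assert(sizeof(float4_like) == 16 && alignof(float4_like) == 16,
              "one 128-bit access is possible");

// True if p is aligned for a single access of `bytes` bytes (a power of two).
inline bool is_aligned(const void* p, std::size_t bytes) {
    assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (bytes - 1)) == 0;
}
```

Any object of `uchar2_aligned` or `float4_like` type passes `is_aligned` for its full size, while an object of `uchar2_plain` may or may not, depending on where it is placed.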
GMEM ATOMIC OPERATIONS
Performed in L2 per 32 B cache line.
throughput        Fermi, per clock   Kepler, per clock
shared address    1/9                1
independent       24                 64
Same address means the same cache line
ATOM
ATOM.E.INC R4, [R6], R8;
RED
RED.E.ADD [R2], R0;
TEXTURE HARDWARE
Legacy from graphics
Read-only. Always loads through interpolation hardware
Two-level: Dedicated L1, shared L2 for texture and global loads
property                 Fermi     sm_30   sm_35
L1 hit latency, clocks   No data   104     108
L1 line size, B          No data   128     128
L1 size, KB              8         12      4 sub-partitions x 12
L1 (set)x(way)           No data   4x24    4x24
L2 hit latency, clocks   No data   212     229
L2 penalty, clocks       No data   316     351
READ-ONLY DATA CACHE
The L1 texture cache is opened to global loads that bypass the interpolation hardware. Supported by sm_35.
/*0288*/ TEXDEPBAR 0x0;
/*0290*/ LDG.E.64 R8, [R4];
/*0298*/ TEXDEPBAR 0x0;
/*02a0*/ LDG.E.64 R4, [R8];
/*02a8*/ IADD R6, R6, 0x4;
/*02b0*/ TEXDEPBAR 0x0;
/*02b8*/ LDG.E.64 R8, [R4];
/*02c8*/ ISETP.LT.AND P0, PT, R6, R7, PT;
/*02d0*/ TEXDEPBAR 0x0;
Size is 48 KB (4 sub-partitions x 12 KB)
Different warps go through different sub-partitions
Single warp can use up to 12 KB
CONSTANT MEMORY
Optimized for uniform access from the warp.
Compile time constants
Kernel parameters and configurations
2–3 layers of caches. Latency: 4–800 clocks
LOAD UNIFORM
The LDU instruction can use the constant cache hierarchy for any global memory location. LDU = load (block-)uniform variable from memory.
The variable resides in global memory
Prefix the pointer with the const keyword
The memory access must be uniform across all threads in the block (not dependent on threadIdx)
__global__ void kernel(test_t* g_dst, const test_t* g_src)
{
    const int tid = /* */;
    g_dst[tid] = g_src[0] + g_src[blockIdx.x];
}
/*0078*/ LDU.E R0, [R4];
/*0080*/ LDU.E R2, [R2];
SHARED MEMORY
Banked: successive 4-byte words are placed in successive banks
sm_1x: 16 banks x 4 B; sm_2x: 32 banks x 4 B; sm_3x: 32 banks x 64 bit (8 B)
Atomic operations are done in a lock/unlock manner
(void)atomicAdd(&smem[0], src[threadIdx.x]);
/*0050*/ SSY 0x80;
/*0058*/ LDSLK P0, R3, [RZ];
/*0060*/ @P0 IADD R3, R3, R0;
/*0068*/ @P0 STSUL [RZ], R3;
/*0070*/ @!P0 BRA 0x58;
/*0078*/ NOP.S;
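The banking rule above implies that the access stride determines how many words collide in one bank. The following host-side C++ sketch models the sm_2x-style layout only (32 banks, 4 bytes wide); `conflict_degree` is a hypothetical helper, and it assumes lane i of a warp accesses the 4-byte word at index `i * stride_words`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical model (illustration only): worst-case number of distinct
// 4-byte words that map to the same bank when lane i accesses word index
// i * stride_words. A result of 1 means the access is conflict-free.
inline std::size_t conflict_degree(std::size_t stride_words) {
    constexpr std::size_t kBanks = 32;  // sm_2x-style: 32 banks x 4 B
    constexpr std::size_t kLanes = 32;  // threads per warp
    std::array<std::set<std::size_t>, kBanks> words_in_bank;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        const std::size_t word = lane * stride_words;
        // Successive words go to successive banks, wrapping every 32 words.
        words_in_bank[word % kBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_in_bank)
        worst = std::max(worst, words.size());
    assert(worst >= 1 && worst <= kLanes);
    return worst;
}
```

In this model a stride of 1 is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 32 puts all 32 words in one bank. A stride of 33 is conflict-free again, which is why padding a 32-column array by one column is a common remedy. Lanes that read the same word are counted once, matching a broadcast.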
REGISTER SPILLING
Local memory refers to the memory where registers are spilled
It physically resides in gmem, but is likely cached
A local variable requires a cache line for spilling because spilling is done per warp
Addressing is resolved by the compiler
Stores are cached in L1
Analogous to CPU stack variables
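The cache-line claim follows from the arithmetic: 32 threads x 4 bytes = 128 B, exactly one L1 line. The host-side C++ sketch below assumes an interleaved layout in which consecutive 32-bit words of local memory belong to consecutive threads of the warp. `spill_offset` and its constants are hypothetical names for illustration; the real addressing is generated by the compiler.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;   // threads per warp
constexpr std::size_t kWordBytes = 4;   // one 32-bit spilled register

// Hypothetical layout model (illustration only): byte offset, within one
// warp's local-memory area, of 32-bit spill slot `slot` for thread `lane`.
// The copies of one slot for all 32 lanes are contiguous.
inline std::size_t spill_offset(std::size_t slot, std::size_t lane) {
    assert(lane < kWarpSize);
    return (slot * kWarpSize + lane) * kWordBytes;
}
```

Under this assumption, slot 0 occupies bytes 0 to 127 of the warp's area and slot 1 starts at byte 128, so spilling one 32-bit variable for the whole warp fills exactly one 128 B line.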
LDL/STL ACCESS OPERATION
Store writes line to L1
If evicted, then line is written to L2
The line could also be evicted from L2; in that case it is written to DRAM
Load requests line from L1
If a hit, operation is complete
If a miss, then request the line from L2
If L2 miss, then request the line from DRAM
FINAL WORDS
SM has dedicated LD/ST units to handle memory access
Global memory accesses are serviced on a per-warp basis
A coalesced transaction is a transaction that fits a cache line
GPU memory has 2 levels of caches
One L1 cache line consists of 4 L2 lines. The coalescing unit manages the number of L2 lines that are actually required
64-bit and 128-bit memory transactions are natively supported
Atomic operations on global memory are performed in L2
Register spilling is fully cached for both reads and writes
THE END
BY CUDA.GEEK / 2013–2015
