Memory is organized in a hierarchy from fast but small CPU caches (SRAM) to larger but slower main memory (DRAM), with the Linux kernel using zone, buddy, and slab allocation to manage physical memory and handle fragmentation across these storage and processing layers. Applications rely on standard library allocators such as malloc, while the kernel uses slab allocators (SLAB, SLUB, SLOB) to manage memory for internal data structures.
5. HW : Data Storage - Hierarchy
DRAM Chip : contains banks, the count depending on chip width:
- x4 ⇒ 16 banks
- x8 ⇒ 8 banks
- x16 ⇒ 4 banks
Bank : an array with a Row Decoder, a Row Buffer and a Column Decoder.
Array cell : stores 1 bit.
6. HW : Data Storage - {S,D,V}RAM
RAM Random Access Memory
SRAM Static RAM
DRAM Dynamic RAM
VRAM Video RAM
                        SRAM            DRAM
Speed                   Fast (CMOS)     Slower (2T + capacitor)
Power consumption       Constant        Low + bursts
Production cost         Expensive       Cheap
Production complexity   5 transistors   1 transistor
Read operation          Stable          Destructive
7. HW : Data Storage - DRAM Refresh
DRAM must be refreshed, even if not accessed.
Done every 4 - 64 ms.
Refresh can be done by:
- Burst refresh : stop all operations, refresh all memory
- Distributed refresh : refresh one (or more) row at a time
- Self-Refresh (low-power mode) : turn off the memory controller and let the DRAM refresh its capacitors itself
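As a hedged worked example (the row count is an assumption, not from the slide): with distributed refresh, a chip with 8192 rows and a 64 ms refresh window must refresh one row every 64 ms / 8192 ≈ 7.8 µs.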
8. HW : Data Storage - Bank
- The Row Buffer holds read data
- A read brings the entire row into the buffer
- Once read, the capacitors are empty, but the value remains in the buffer
- Bits must be written back before doing another read
This process is called “opening” and “closing” a row.
(Diagram: a row of bits is copied into the Row Buffer; after the destructive read, the array row reads all zeros until the bits are written back.)
9. HW : Data Storage - Rank
Rank : set of DRAM modules on a DIMM connected to the same Chip-Select pin (accessed simultaneously).
A rank has a 64-bit wide data bus (72 on DIMMs with ECC) and is noted <rank>Rx<dram-width>:
- 1Rx16 : Single Rank, 16-bit wide chips (4 DRAMs to reach 64 bits)
- 2Rx8 : Dual Rank, 8-bit wide chips (8 DRAMs to reach 64 bits)
- 4Rx8 : Quad Rank, 8-bit wide chips (8 DRAMs to reach 64 bits)
10. HW : Data Storage - {U,R,LR,FB}DIMM
DIMM     Dual Inline Memory Module ( != SIMM)
SO-DIMM  Small Outline DIMM (for laptops & embedded)
UDIMM    Unregistered DIMM (standard end-user)
RDIMM    Registered DIMM or Buffered DIMM (most servers)
FB-DIMM  Fully Buffered DIMM : buffer for address & data buses
LR-DIMM  Load-Reduced DIMM : like FB, but only serializes data, not addresses
ECC      Error Correcting Code : parity-value checking (like RAID5)
11. HW : Data Storage - {U,R,LR,FB}DIMM

          Latency           Max Size (1 DIMM)  @ Bus     Data Bus  Implementation Details
UDIMM     Low               8 GB               Parallel  Parallel  Input command and output data buses are directly connected to the bus.
RDIMM     UDIMM + 1 cycle   32 GB              Parallel  Parallel  Same as UDIMM, but input commands are stabilized through a register (cost of 1 cycle).
FB-DIMM   -                 8 GB               Serial    Serial    Adds a big buffer for both commands and data, but the serial implementation causes hyperfrequency and signal-stability issues.
LR-DIMM   Similar to RDIMM  128 GB             Parallel  Parallel  Fixes the issues of FB-DIMM: based on RDIMM, but also buffers the data lines.
12. HW : Data Storage - RAM Timings
Row and Column addresses are sent on the same (address) bus, via a multiplexer on the memory DIMM.
Notation : w-x-y-z-T
- w : CAS Latency (CL)
- x : RAS to CAS delay (TRCD)
- y : RAS Precharge (TRP)
- z : Active to Precharge delay (TRAS)
- T : Command Rate
Timings are in cycles (see the conversion sketch below) :
CL   : column select → data available on the bus
TRCD : row select → column select
TRP  : new row activation (opening)
TRAS : row deactivation (closing)
T    : delay between 2 commands
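A hedged worked example in C (DDR4-3200 and the 16-18-18-36 timings are assumptions, not from the slide) converting timing cycles into nanoseconds; it relies on the fact that DDR's I/O clock runs at half the transfer rate.

/* Convert a RAM timing expressed in clock cycles into nanoseconds.
 * DDR performs 2 transfers per clock, so DDR4-3200 (3200 MT/s) runs
 * its I/O clock at 1600 MHz. */
#include <stdio.h>

static double cycles_to_ns(unsigned cycles, double transfer_rate_mts)
{
    double clock_mhz = transfer_rate_mts / 2.0;  /* DDR: 2 transfers/cycle */
    return cycles * 1000.0 / clock_mhz;          /* 1 cycle = 1000/MHz ns  */
}

int main(void)
{
    /* Hypothetical DDR4-3200 module with 16-18-18-36 timings */
    printf("CL   : %.2f ns\n", cycles_to_ns(16, 3200));  /* 10.00 ns */
    printf("TRCD : %.2f ns\n", cycles_to_ns(18, 3200));  /* 11.25 ns */
    printf("TRAS : %.2f ns\n", cycles_to_ns(36, 3200));  /* 22.50 ns */
    return 0;
}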
13. HW : Data Storage - RAM Timings (advanced)
TRRD : RAS to RAS. Time to activate the next bank of memory.
TWTR : Write to Read. Delay between a write command and the next read command.
TWR  : Write Recovery. Time between a valid write operation and precharge.
TRFC : Row Refresh Cycle. Time to refresh a row on a memory bank.
TRTP : Read to Precharge. Delay between a read command and a row-precharge command to the same rank.
TRTW : Read to Write Delay. Cycles needed for a write command to be executed once received.
TRC  : Row Cycle. The minimum time in cycles for a row to complete a full cycle; tRC = tRAS + tRP.
14. HW : Data Storage - RAM Timings (advanced)
TREF : Time before a cell is refreshed so it does not lose its charge and corrupt data.
TWCL : Write CAS Latency. Write to whatever bank is open to be written to.
CPC  : Command Per Clock. Chip select is executed, then commands are issued.
TRD  : Static tREAD.
15. HW : Data Storage - DIMM Assembly
x8 ⇒ each DRAM chip outputs 8 bits (eight x8 chips form the 64-bit rank)
16. HW : Data Storage - 3D X-Point
Public name : Optane
- Low Latency (< 10us)
- High Density
- No memory controller
- Voltage Variation
18. HW : Data Access - {S,D,Q}DR
The amount of usable data the memory can transfer per clock cycle.
Introduced the concept of “Tps” : Transfers per second.
Original DRAM : specify RAS + CAS for every operation
FPM (1990) : multiple reads from the same row without RAS
EDO (1995) : allows selecting the next column while reading the current one
SDR (1997) : single selection, then burst transfers follow
DDR (2000) : transfers data on both rising and falling edges of the clock
DDR2 (2003) : 2 internal channels
DDR3 (2007) : doubled transfer speed
DDR4 (2013) : increased frequency
DDR5 (2020) : JEDEC released specs…
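A hedged worked example (DDR4-3200 is an assumption, not from the slide): 3200 MT/s on a 64-bit (8-byte) bus gives 3200 × 8 ≈ 25.6 GB/s of peak bandwidth per channel.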
19. HW : Data Access - UMA / NUMA
Uniform Memory Access : central Northbridge.
Non-Uniform Memory Access : one MMU per socket.
20. HW : Data Access - Memory Channels
DIMMs are used in parallel to increase bandwidth : single / dual / triple channel.
Channels must be balanced.
21. HW : Data Access - Direct Memory Access
Bypass CPU processing :
- PCI-E
- Thunderbolt
- Firewire
- Cardbus
- Expresscard
The DMA controller advertises RAM changes to the caches : Direct Cache Access.
24. HW : Data Processing - Cache
DRAM is slow relative to CPU cycles ⇒ let's use caches.
Caches can be used for reads (prefetch) and writes (write-back).
Eviction is done by a tracking algorithm (a sketch follows this list) :
- LRU Least Recently Used
- LFU Least Frequently Used
- FIFO First In First Out
- ARC Adaptive Replacement Cache
The hit ratio measures the usefulness of a cache.
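A minimal LRU sketch in C (illustrative, not from the slides): a fixed-size cache stamps each entry on access and, on a miss with no free slot, evicts the entry with the oldest stamp. The slot count and the linear scan are simplifications.

/* Minimal LRU cache: each slot carries a logical timestamp; the
 * least recently used slot is the eviction victim. */
#include <stdio.h>

#define NSLOTS 4

struct slot { int key, value; unsigned long stamp; int valid; };
static struct slot cache[NSLOTS];
static unsigned long tick;

static int lru_get(int key, int *value)
{
    for (int i = 0; i < NSLOTS; i++)
        if (cache[i].valid && cache[i].key == key) {
            cache[i].stamp = ++tick;        /* refresh recency */
            *value = cache[i].value;
            return 1;                       /* hit */
        }
    return 0;                               /* miss */
}

static void lru_put(int key, int value)
{
    int victim = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (!cache[i].valid) { victim = i; break; }
        if (cache[i].stamp < cache[victim].stamp)
            victim = i;                     /* oldest stamp = LRU */
    }
    cache[victim] = (struct slot){ key, value, ++tick, 1 };
}

int main(void)
{
    int v;
    for (int k = 0; k < 6; k++)
        lru_put(k, k * 10);                 /* keys 0 and 1 get evicted */
    printf("key 1: %s\n", lru_get(1, &v) ? "hit" : "miss (evicted)");
    printf("key 5: %s\n", lru_get(5, &v) ? "hit" : "miss");
    return 0;
}

Hardware caches typically use cheaper approximations (e.g. pseudo-LRU), since exact LRU bookkeeping is too costly at high associativity.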
25. HW : Data Processing - Cache
Distribution policy :
- Fully Associative : all blocks checked simultaneously (heavy hardware)
- Direct Mapped : fast but needs a balanced spread (rare)
- Set-Associative : a mix of the two previous (see the sketch below)
Addresses can be :
- Virtual : fast access but not unique. Used by L1 and the TLB
- Physical : calculation needed but unique. Used for the other caches
Programmers : avoid mapping the same @Phys onto multiple @Virt.
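A hedged sketch of how a set-associative cache splits an address; the geometry is an assumption (64-byte lines, 64 sets, i.e. a typical 32 KB 8-way L1), not from the slide.

/* Derive the set index and tag a set-associative cache would use.
 * Geometry: 64-byte lines, 64 sets (32 KB / (64 B * 8 ways)). */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64
#define NUM_SETS  64

static unsigned set_index(uintptr_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;             /* middle bits */
}

static uintptr_t tag_of(uintptr_t addr)
{
    return addr / (LINE_SIZE * (uintptr_t)NUM_SETS);  /* upper bits */
}

int main(void)
{
    uintptr_t addr = 0x12345678;          /* arbitrary example address */
    printf("set=%u tag=%#lx\n", set_index(addr), (unsigned long)tag_of(addr));
    return 0;
}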
26. HW : Data Processing - Cache Coherency
Multiple CPUs ⇒ multiple caches ⇒ they must stay in SYNC.
MOESI (Modified Owned Exclusive Shared Invalid) on NUMA systems.
Processors use cache snooping.
Request For Ownership, issued when a CPU changes data already present in another CPU's cache ⇒ very costly (illustrated below).
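An illustrative C sketch of the RFO cost (not from the slides): two threads hammer separate counters. With aligned(64) each counter sits on its own cache line; removing the attribute puts both on one line and triggers constant ownership ping-pong between cores. The 64-byte line size is an assumption.

/* False-sharing demo. Build with: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000UL

/* One counter per 64-byte cache line; drop aligned(64) to force both
 * counters onto the same line and watch the runtime grow. */
struct counter { volatile unsigned long v; } __attribute__((aligned(64)));
static struct counter c[2];

static void *worker(void *arg)
{
    struct counter *ctr = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        ctr->v++;                 /* each ++ needs the line in M state */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    pthread_create(&t1, NULL, worker, &c[0]);
    pthread_create(&t2, NULL, worker, &c[1]);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);

    printf("elapsed: %.2fs\n",
           (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
    return 0;
}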
27. HW : Data Processing - Writing
Write policies on memory zones are set by the MTRRs (Memory Type Range Registers) :
Write Through : all data written to the cache is also written to memory
Write Back : delay the memory write as long as possible
Write Combining : force writes to be grouped in bulk
Uncacheable : for some I/O and HW, like BIOS, ACPI, IOAPIC...
28. HW : Data Processing - Memory Management Unit
Switching between @Virtual and @Physical ⇒ translation done by the CPU (MMU).
Not directly mapped : addressing the 16 EB of a 64-bit space with a direct array would be huge!
We only use 48 bits (up to 256 TB) and Page Tables to avoid management waste.
4 cascading tables : Page {Global,Upper,Middle} Directory and Page Table.
29. HW : Data Processing - Memory Management Unit
Page Walking (sketched in C below) :
1) @Base for L4 (from CR3)
2) Add offset from bits 39-47 ⇒ @Base for L3
3) Add offset from bits 30-38 ⇒ @Base for L2
4) Add offset from bits 21-29 ⇒ @Base for L1
5) Add offset from bits 12-20 ⇒ @Base for the Page
6) Add offset from bits 0-11 ⇒ @Physical
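A minimal C sketch of the index extraction above (the virtual address is an arbitrary example; real page-table entries also carry flag bits that are ignored here):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t vaddr = 0x00007f1234567890ULL;    /* example address */
    unsigned pgd = (vaddr >> 39) & 0x1ff;      /* bits 39-47: L4 index */
    unsigned pud = (vaddr >> 30) & 0x1ff;      /* bits 30-38: L3 index */
    unsigned pmd = (vaddr >> 21) & 0x1ff;      /* bits 21-29: L2 index */
    unsigned pte = (vaddr >> 12) & 0x1ff;      /* bits 12-20: L1 index */
    unsigned off = vaddr & 0xfff;              /* bits 0-11 : in-page  */

    printf("PGD=%u PUD=%u PMD=%u PTE=%u offset=%#x\n",
           pgd, pud, pmd, pte, off);
    return 0;
}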
30. HW : Data Processing - Memory Management Unit
An empty Page Directory stores 512 (2^9) entries ⇒ 64 bits × 512 = 4 KB
For 32 KB of tables (4 × 512 entries) ⇒ addresses 2 MB
For 2 MB of tables (3×512 + 512×512 entries) ⇒ addresses 1 GB
For 128 MB of tables (2×512 + 512^3 entries) ⇒ addresses 550 GB
⇒ Low overhead for storage... but requires 4 reads.
⇒ Let's cache these translations : Translation Lookaside Buffer (TLB).
Limit TLB flushes upon context switching by adding the page-table ID to each TLB entry.
32. SW : Linux Internals - Zones

Name      Size (x86)    Size (x86_64)   Description
DMA       < 16 MB       < 16 MB         For very old devices (24-bit addressing)
DMA32     N/A           16 MB - 4 GB    For devices addressing up to 32 bits (4 GB)
NORMAL    16 - 896 MB   > 4 GB          Memory directly mapped by the kernel
HIGHMEM   > 896 MB      N/A             Memory not permanently mapped by the kernel (see next slide)
33. SW : Linux Internals - Zones
32 bits : 3/1 split (or 2/2 or 1/3) between userspace & kernel.
Of this 1 GB of kernel space, 128 MB is used to map higher pages : 1024 - 128 = 896.
Low memory : directly addressable by the kernel.
High memory : must go through the 128 MB indirection table to be addressed.
64 bits : the whole space is directly addressable by the MMU.
34. SW : Linux Internals - Zones
Jul 12 22:13:12 [server] kernel: swapper: page allocation failure. order:2, mode:0x4020
Jul 12 22:46:46 [server] kernel: [app_name]: page allocation failure. order:4, mode:0xd0
include/linux/gfp.h : zone usage per flag (“order:N” means a request for 2^N contiguous pages)
* bit result
* =================
* 0x0 => NORMAL
* 0x1 => DMA or NORMAL
* 0x2 => HIGHMEM or NORMAL
* 0x3 => BAD (DMA+HIGHMEM)
* 0x4 => DMA32 or DMA or NORMAL
* 0x5 => BAD (DMA+DMA32)
* 0x6 => BAD (HIGHMEM+DMA32)
* 0x7 => BAD (HIGHMEM+DMA32+DMA)
* 0x8 => NORMAL (MOVABLE+0)
* 0x9 => DMA or NORMAL (MOVABLE+DMA)
* 0xa => MOVABLE (Movable is valid only if HIGHMEM is set too)
* 0xb => BAD (MOVABLE+HIGHMEM+DMA)
* 0xc => DMA32 (MOVABLE+DMA32)
* 0xd => BAD (MOVABLE+DMA32+DMA)
* 0xe => BAD (MOVABLE+DMA32+HIGHMEM)
* 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
35. SW : Linux Internals - Buddy Allocator
Goal : limit External Fragmentation and Internal Fragmentation
36. SW : Linux Internals - Buddy Allocator
4 KB pages are grouped into 2^9 (2 MB) or 2^10 (4 MB) blocks.
Blocks are then cut in half (2 “buddies”) to service the request.
Upon release, the allocator tries to merge buddy pages back together.
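A small sketch mirroring (not copying) the kernel's get_order() logic: it maps a request size to the buddy order, which is also what the "order:N" in the earlier page-allocation-failure logs refers to.

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Smallest order such that (PAGE_SIZE << order) >= size */
static unsigned order_for(unsigned long size)
{
    unsigned order = 0;
    while ((PAGE_SIZE << order) < size)
        order++;
    return order;
}

int main(void)
{
    printf("order(4 KB)  = %u\n", order_for(4096));   /* 0: one page    */
    printf("order(20 KB) = %u\n", order_for(20480));  /* 3: 32 KB block */
    printf("order(64 KB) = %u\n", order_for(65536));  /* 4: 16 pages    */
    return 0;
}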
37. SW : Linux Internals - Buddy Allocator
Unmovable : locked in memory
Reclaimable : reusable after cleaning
Movable : immediately available
Reserve : last-resort reserve
Isolate : kept on the local NUMA node
CMA : Contiguous Memory Allocator, for DMA devices that need large contiguous buffers
cat /proc/pagetypeinfo
Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order      0      1     2  3  4  5  6  7  8  9  10
Node 0, zone    DMA, type Unmovable             1      1     0  0  2  1  1  0  1  1   0
Node 0, zone    DMA, type Reclaimable           0      0     0  0  0  0  0  0  0  0   0
Node 0, zone    DMA, type Movable               0      0     0  0  0  0  0  0  0  0   2
Node 0, zone    DMA, type Reserve               0      0     0  0  0  0  0  0  0  0   1
Node 0, zone  DMA32, type Unmovable             8      6     0  0  0  0  0  0  0  0   0
Node 0, zone  DMA32, type Reclaimable         376   2817     5  0  0  0  0  0  0  0   0
Node 0, zone  DMA32, type Movable            6323  12025   287  0  0  0  0  0  0  0   0
Node 0, zone  DMA32, type Reserve               0      0     1  4  6  2  0  0  1  1   0
Node 0, zone Normal, type Unmovable          2611    137     0  0  0  0  0  0  0  0   0
Node 0, zone Normal, type Reclaimable       33847   4321   144  0  0  0  0  0  0  0   0
Node 0, zone Normal, type Movable           37312   9849  1097  0  0  0  0  0  0  0   0
Node 0, zone Normal, type Reserve               0      0     5  1  0  1  1  2  1  0   1

Number of blocks type    Unmovable  Reclaimable  Movable  Reserve
Node 0, zone    DMA              1            0        2        1
Node 0, zone  DMA32             13           18      796        1
When no space is available, the buddy allocator wakes kswapd.
38. SW : Linux Internals - SLAB
SLAB = allocator for kernel objects.
Uses caches to avoid fragmentation.
Each kernel object type is stored in a slab.
SLAB : 1 queue per NUMA node
SLUB : 1 queue per CPU
SLOB : as compact as possible
Most servers use SLUB (see the sketch below) :
- defrag
- debug
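A hedged kernel-module sketch of the slab API (kmem_cache_create/alloc/free and kmem_cache_destroy are the real kernel functions; struct foo and the module boilerplate are illustrative):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/slab.h>

struct foo {                        /* hypothetical kernel object */
    int id;
    char name[32];
};

static struct kmem_cache *foo_cache;

static int __init foo_init(void)
{
    /* One cache per object type: every slab holds same-sized objects,
     * which is how the allocator avoids fragmentation. */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                  0, SLAB_HWCACHE_ALIGN, NULL);
    if (!foo_cache)
        return -ENOMEM;

    struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);
    if (f)
        kmem_cache_free(foo_cache, f);
    return 0;
}

static void __exit foo_exit(void)
{
    kmem_cache_destroy(foo_cache);
}

module_init(foo_init);
module_exit(foo_exit);
MODULE_LICENSE("GPL");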
45. SW : Application Memory Allocator
malloc is NOT A SYSCALL (it is built on mmap + brk).
Goals : speed & minimal fragmentation.
Multiple implementations :
- dlmalloc : Doug Lea (generic)
- ptmalloc2 : current glibc
- jemalloc : Jason Evans (FreeBSD, Firefox, FB)
- tcmalloc : Thread-Caching (Google)
- libumem : Solaris
- €€€€€ : lockless, hoard, smartheap...
46. SW : Application Memory Allocator - ptmalloc2
Uses brk or mmap to do allocations :
- brk / sbrk for the main thread and if the request is < 128 KB
- mmap otherwise (see the sketch after this slide)
Maintains arenas : main & per-thread.
Each arena is composed of heaps.
Upon free(), ptmalloc adds the freed region to a “bin” to be reused for later allocations :
- Fast : 16 - 80 bytes
- Unsorted : no size limit, latest freed
- Small : < 512 bytes
- Large : >= 512 bytes
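A hedged userspace sketch of the brk/mmap split: it assumes glibc's default M_MMAP_THRESHOLD of 128 KB (tunable via mallopt()); exact addresses and behavior vary by glibc version.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *brk_now = sbrk(0);            /* current program break     */
    void *small = malloc(64 * 1024);    /* < 128 KB: served via brk  */
    void *large = malloc(512 * 1024);   /* >= 128 KB: anonymous mmap */

    printf("program break : %p\n", brk_now);
    printf("small (64 KB) : %p  (should be near the break)\n", small);
    printf("large (512 KB): %p  (separate mmap region)\n", large);

    free(small);                        /* goes back into a bin      */
    free(large);                        /* munmap'ed immediately     */
    return 0;
}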
47. SW : Application Memory Allocator - ptmalloc2
Internal structures :
- malloc_state : Arena header. Has multiple heaps, except for the Main Arena (which just grows its single heap)
- heap_info : Heap header. Has multiple chunks
- malloc_chunk : Chunk header. Result of malloc()
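A simplified view of the chunk header (field names follow glibc's malloc/malloc.c, with the large-bin nextsize pointers omitted; this is a sketch, not the full glibc definition):

#include <stddef.h>

/* The pointer malloc() returns points just past mchunk_size; the
 * fd/bk links only exist while the chunk sits free in a bin. */
struct malloc_chunk {
    size_t mchunk_prev_size;      /* size of previous chunk, if free     */
    size_t mchunk_size;           /* this chunk's size + A|M|P flag bits */
    struct malloc_chunk *fd;      /* next chunk in the bin's list        */
    struct malloc_chunk *bk;      /* previous chunk in the bin's list    */
};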