Memory is organized in a hierarchy from fast but small CPU caches (SRAM) to larger but slower main memory (DRAM), with the Linux kernel using zone, buddy, and slab allocation to manage physical memory and handle fragmentation across these storage and processing layers. Applications rely on standard library allocators such as malloc, while the kernel uses slab allocators (SLAB, SLUB, SLOB) to manage memory for internal data structures.
5. HW : Data Storage - Hierarchy
DRAM Chip : contains banks, the count depending on chip width:
- x4 ⇒ 16 banks
- x8 ⇒ 8 banks
- x16 ⇒ 4 banks
Bank : an array with a Row Decoder, a Row Buffer and a Column Decoder.
Array cell : stores 1 bit.
6. HW : Data Storage - {S,D,V}RAM
RAM Random Access Memory
SRAM Static RAM
DRAM Dynamic RAM
VRAM Video RAM
                        SRAM            DRAM
Speed                   Fast (CMOS)     Slower (2T + capacitor)
Power consumption       Constant        Low + bursts
Production cost         Expensive       Cheap
Production complexity   5 transistors   1 transistor
Read operation          Stable          Destructive
7. HW : Data Storage - DRAM Refresh
DRAM must be refreshed, even if not accessed.
Done every 4 - 64 ms.
Refresh can be done by:
- Burst refresh : stop all operations, refresh all memory
- Distributed refresh : refresh one (or more) row at a time
- Self-Refresh (low-power mode) : turn off the memory controller and let the DRAM refresh its capacitors itself
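As a hedged worked example (the row count is an assumption, not from the slide): with distributed refresh, a chip with 8192 rows and a 64 ms refresh window must refresh one row every 64 ms / 8192 ≈ 7.8 µs.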
8. HW : Data Storage - Bank
- The Row Buffer holds read data
- A read brings the entire row into the buffer
- Once read, the capacitors are empty, but the value remains in the buffer
- Bits must be written back before doing another read
This process is called “opening” and “closing” a row.
(Diagram: a row of bits is copied into the Row Buffer; after the destructive read, the array row reads all zeros until the bits are written back.)
9. HW : Data Storage - Rank
Rank : set of DRAM modules on a DIMM connected to the same Chip-Select pin (accessed simultaneously).
A rank has a 64-bit wide data bus (72 on DIMMs with ECC) and is noted <rank>Rx<dram-width>:
- 1Rx16 : Single Rank, 16-bit wide chips (4 DRAMs to reach 64 bits)
- 2Rx8 : Dual Rank, 8-bit wide chips (8 DRAMs to reach 64 bits)
- 4Rx8 : Quad Rank, 8-bit wide chips (8 DRAMs to reach 64 bits)
10. HW : Data Storage - {U,R,LR,FB}DIMM
DIMM     Dual Inline Memory Module ( != SIMM)
SO-DIMM  Small Outline DIMM (for laptops & embedded)
UDIMM    Unregistered DIMM (standard end-user)
RDIMM    Registered DIMM or Buffered DIMM (most servers)
FB-DIMM  Fully Buffered DIMM : buffer for address & data buses
LR-DIMM  Load-Reduced DIMM : like FB, but only serializes data, not addresses
ECC      Error Correcting Code : parity-value checking (like RAID5)
11. HW : Data Storage - {U,R,LR,FB}DIMM

          Latency           Max Size (1 DIMM)  @ Bus     Data Bus  Implementation Details
UDIMM     Low               8 GB               Parallel  Parallel  Input command and output data buses are directly connected to the bus.
RDIMM     UDIMM + 1 cycle   32 GB              Parallel  Parallel  Same as UDIMM, but input commands are stabilized through a register (cost of 1 cycle).
FB-DIMM   -                 8 GB               Serial    Serial    Adds a big buffer for both commands and data, but the serial implementation causes hyperfrequency and signal-stability issues.
LR-DIMM   Similar to RDIMM  128 GB             Parallel  Parallel  Fixes the issues of FB-DIMM: based on RDIMM, but also buffers the data lines.
12. HW : Data Storage - RAM Timings
Row and Column addresses are sent on the same (address) bus, via a multiplexer on the memory DIMM.
Notation : w-x-y-z-T
- w : CAS Latency (CL)
- x : RAS to CAS delay (TRCD)
- y : RAS Precharge (TRP)
- z : Active to Precharge delay (TRAS)
- T : Command Rate
Timings are in cycles (see the conversion sketch below) :
CL   : column select → data available on the bus
TRCD : row select → column select
TRP  : new row activation (opening)
TRAS : row deactivation (closing)
T    : delay between 2 commands
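A hedged worked example in C (DDR4-3200 and the 16-18-18-36 timings are assumptions, not from the slide) converting timing cycles into nanoseconds; it relies on the fact that DDR's I/O clock runs at half the transfer rate.

/* Convert a RAM timing expressed in clock cycles into nanoseconds.
 * DDR performs 2 transfers per clock, so DDR4-3200 (3200 MT/s) runs
 * its I/O clock at 1600 MHz. */
#include <stdio.h>

static double cycles_to_ns(unsigned cycles, double transfer_rate_mts)
{
    double clock_mhz = transfer_rate_mts / 2.0;  /* DDR: 2 transfers/cycle */
    return cycles * 1000.0 / clock_mhz;          /* 1 cycle = 1000/MHz ns  */
}

int main(void)
{
    /* Hypothetical DDR4-3200 module with 16-18-18-36 timings */
    printf("CL   : %.2f ns\n", cycles_to_ns(16, 3200));  /* 10.00 ns */
    printf("TRCD : %.2f ns\n", cycles_to_ns(18, 3200));  /* 11.25 ns */
    printf("TRAS : %.2f ns\n", cycles_to_ns(36, 3200));  /* 22.50 ns */
    return 0;
}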
13. HW : Data Storage - RAM Timings (advanced)
TRRD : RAS to RAS. Time to activate the next bank of memory.
TWTR : Write to Read. Delay between a write command and the next read command.
TWR  : Write Recovery. Time between a valid write operation and precharge.
TRFC : Row Refresh Cycle. Time to refresh a row on a memory bank.
TRTP : Read to Precharge. Delay between a read command and a row-precharge command to the same rank.
TRTW : Read to Write Delay. Cycles needed for a write command to be executed once received.
TRC  : Row Cycle. The minimum time in cycles for a row to complete a full cycle; tRC = tRAS + tRP.
14. HW : Data Storage - RAM Timings (advanced)
TREF : Time before a cell is refreshed so it does not lose its charge and corrupt data.
TWCL : Write CAS Latency. Write to whatever bank is open to be written to.
CPC  : Command Per Clock. Chip select is executed, then commands are issued.
TRD  : Static tREAD.
15. HW : Data Storage - DIMM Assembly
x8 ⇒ each DRAM chip outputs 8 bits (eight x8 chips form the 64-bit rank)
16. HW : Data Storage - 3D X-Point
Public name : Optane
- Low Latency (< 10us)
- High Density
- No memory controller
- Voltage Variation
18. HW : Data Access - {S,D,Q}DR
The amount of usable data the memory can transfer per clock cycle.
Introduced the concept of “Tps” : Transfers per second.
Original DRAM : specify RAS + CAS for every operation
FPM (1990) : multiple reads from the same row without RAS
EDO (1995) : allows selecting the next column while reading the current one
SDR (1997) : single selection, then burst transfers follow
DDR (2000) : transfers data on both rising and falling edges of the clock
DDR2 (2003) : 2 internal channels
DDR3 (2007) : doubled transfer speed
DDR4 (2013) : increased frequency
DDR5 (2020) : JEDEC released specs…
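A hedged worked example (DDR4-3200 is an assumption, not from the slide): 3200 MT/s on a 64-bit (8-byte) bus gives 3200 × 8 ≈ 25.6 GB/s of peak bandwidth per channel.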
19. HW : Data Access - UMA / NUMA
Uniform Memory Access : central Northbridge.
Non-Uniform Memory Access : one MMU per socket.
20. HW : Data Access - Memory Channels
DIMMs are used in parallel to increase bandwidth : single / dual / triple channel.
Channels must be balanced.
21. HW : Data Access - Direct Memory Access
Bypass CPU processing :
- PCI-E
- Thunderbolt
- Firewire
- Cardbus
- Expresscard
The DMA controller advertises RAM changes to the caches : Direct Cache Access.
24. HW : Data Processing - Cache
DRAM is slow relative to CPU cycles ⇒ let's use caches.
Caches can be used for reads (prefetch) and writes (write-back).
Eviction is done by a tracking algorithm (a sketch follows this list) :
- LRU Least Recently Used
- LFU Least Frequently Used
- FIFO First In First Out
- ARC Adaptive Replacement Cache
The hit ratio measures the usefulness of a cache.
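A minimal LRU sketch in C (illustrative, not from the slides): a fixed-size cache stamps each entry on access and, on a miss with no free slot, evicts the entry with the oldest stamp. The slot count and the linear scan are simplifications.

/* Minimal LRU cache: each slot carries a logical timestamp; the
 * least recently used slot is the eviction victim. */
#include <stdio.h>

#define NSLOTS 4

struct slot { int key, value; unsigned long stamp; int valid; };
static struct slot cache[NSLOTS];
static unsigned long tick;

static int lru_get(int key, int *value)
{
    for (int i = 0; i < NSLOTS; i++)
        if (cache[i].valid && cache[i].key == key) {
            cache[i].stamp = ++tick;        /* refresh recency */
            *value = cache[i].value;
            return 1;                       /* hit */
        }
    return 0;                               /* miss */
}

static void lru_put(int key, int value)
{
    int victim = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (!cache[i].valid) { victim = i; break; }
        if (cache[i].stamp < cache[victim].stamp)
            victim = i;                     /* oldest stamp = LRU */
    }
    cache[victim] = (struct slot){ key, value, ++tick, 1 };
}

int main(void)
{
    int v;
    for (int k = 0; k < 6; k++)
        lru_put(k, k * 10);                 /* keys 0 and 1 get evicted */
    printf("key 1: %s\n", lru_get(1, &v) ? "hit" : "miss (evicted)");
    printf("key 5: %s\n", lru_get(5, &v) ? "hit" : "miss");
    return 0;
}

Hardware caches typically use cheaper approximations (e.g. pseudo-LRU), since exact LRU bookkeeping is too costly at high associativity.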
25. HW : Data Processing - Cache
Distribution policy :
- Fully Associative : all blocks checked simultaneously (heavy hardware)
- Direct Mapped : fast but needs a balanced spread (rare)
- Set-Associative : a mix of the two previous (see the sketch below)
Addresses can be :
- Virtual : fast access but not unique. Used by L1 and the TLB
- Physical : calculation needed but unique. Used for the other caches
Programmers : avoid mapping the same @Phys onto multiple @Virt.
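A hedged sketch of how a set-associative cache splits an address; the geometry is an assumption (64-byte lines, 64 sets, i.e. a typical 32 KB 8-way L1), not from the slide.

/* Derive the set index and tag a set-associative cache would use.
 * Geometry: 64-byte lines, 64 sets (32 KB / (64 B * 8 ways)). */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64
#define NUM_SETS  64

static unsigned set_index(uintptr_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;             /* middle bits */
}

static uintptr_t tag_of(uintptr_t addr)
{
    return addr / (LINE_SIZE * (uintptr_t)NUM_SETS);  /* upper bits */
}

int main(void)
{
    uintptr_t addr = 0x12345678;          /* arbitrary example address */
    printf("set=%u tag=%#lx\n", set_index(addr), (unsigned long)tag_of(addr));
    return 0;
}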
26. HW : Data Processing - Cache Coherency
Multiple CPUs ⇒ multiple caches ⇒ they must stay in SYNC.
MOESI (Modified Owned Exclusive Shared Invalid) on NUMA systems.
Processors use cache snooping.
Request For Ownership, issued when a CPU changes data already present in another CPU's cache ⇒ very costly (illustrated below).
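An illustrative C sketch of the RFO cost (not from the slides): two threads hammer separate counters. With aligned(64) each counter sits on its own cache line; removing the attribute puts both on one line and triggers constant ownership ping-pong between cores. The 64-byte line size is an assumption.

/* False-sharing demo. Build with: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000UL

/* One counter per 64-byte cache line; drop aligned(64) to force both
 * counters onto the same line and watch the runtime grow. */
struct counter { volatile unsigned long v; } __attribute__((aligned(64)));
static struct counter c[2];

static void *worker(void *arg)
{
    struct counter *ctr = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        ctr->v++;                 /* each ++ needs the line in M state */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    pthread_create(&t1, NULL, worker, &c[0]);
    pthread_create(&t2, NULL, worker, &c[1]);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);

    printf("elapsed: %.2fs\n",
           (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
    return 0;
}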
27. HW : Data Processing - Writing
Write policies on memory zones are set by the MTRRs (Memory Type Range Registers) :
Write Through : all data written to the cache is also written to memory
Write Back : delay the memory write as long as possible
Write Combining : force writes to be grouped in bulk
Uncacheable : for some I/O and HW, like BIOS, ACPI, IOAPIC...
28. HW : Data Processing - Memory Management Unit
Switching between @Virtual and @Physical ⇒ translation done by the CPU (MMU).
Not directly mapped : addressing the 16 EB of a 64-bit space with a direct array would be huge!
We only use 48 bits (up to 256 TB) and Page Tables to avoid management waste.
4 cascading tables : Page {Global,Upper,Middle} Directory and Page Table.
29. HW : Data Processing - Memory Management Unit
Page Walking (sketched in C below) :
1) @Base for L4 (from CR3)
2) Add offset from bits 39-47 ⇒ @Base for L3
3) Add offset from bits 30-38 ⇒ @Base for L2
4) Add offset from bits 21-29 ⇒ @Base for L1
5) Add offset from bits 12-20 ⇒ @Base for the Page
6) Add offset from bits 0-11 ⇒ @Physical
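A minimal C sketch of the index extraction above (the virtual address is an arbitrary example; real page-table entries also carry flag bits that are ignored here):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t vaddr = 0x00007f1234567890ULL;    /* example address */
    unsigned pgd = (vaddr >> 39) & 0x1ff;      /* bits 39-47: L4 index */
    unsigned pud = (vaddr >> 30) & 0x1ff;      /* bits 30-38: L3 index */
    unsigned pmd = (vaddr >> 21) & 0x1ff;      /* bits 21-29: L2 index */
    unsigned pte = (vaddr >> 12) & 0x1ff;      /* bits 12-20: L1 index */
    unsigned off = vaddr & 0xfff;              /* bits 0-11 : in-page  */

    printf("PGD=%u PUD=%u PMD=%u PTE=%u offset=%#x\n",
           pgd, pud, pmd, pte, off);
    return 0;
}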
30. HW : Data Processing - Memory Management Unit
An empty Page Directory stores 512 (2^9) entries ⇒ 64 bits × 512 = 4 KB
For 32 KB of tables (4 × 512 entries) ⇒ addresses 2 MB
For 2 MB of tables (3×512 + 512×512 entries) ⇒ addresses 1 GB
For 128 MB of tables (2×512 + 512^3 entries) ⇒ addresses 550 GB
⇒ Low overhead for storage... but requires 4 reads.
⇒ Let's cache these translations : Translation Lookaside Buffer (TLB).
Limit TLB flushes upon context switching by adding the page-table ID to each TLB entry.
32. SW : Linux Internals - Zones

Name      Size (x86)    Size (x86_64)   Description
DMA       < 16 MB       < 16 MB         For very old devices (24-bit addressing)
DMA32     N/A           16 MB - 4 GB    For devices addressing up to 32 bits (4 GB)
NORMAL    16 - 896 MB   > 4 GB          Memory directly mapped by the kernel
HIGHMEM   > 896 MB      N/A             Memory not permanently mapped by the kernel (see next slide)
33. SW : Linux Internals - Zones
32 bits : 3/1 split (or 2/2 or 1/3) between userspace & kernel.
Of this 1 GB of kernel space, 128 MB is used to map higher pages : 1024 - 128 = 896.
Low memory : directly addressable by the kernel.
High memory : must go through the 128 MB indirection table to be addressed.
64 bits : the whole space is directly addressable by the MMU.
34. SW : Linux Internals - Zones
Jul 12 22:13:12 [server] kernel: swapper: page allocation failure. order:2, mode:0x4020
Jul 12 22:46:46 [server] kernel: [app_name]: page allocation failure. order:4, mode:0xd0
include/linux/gfp.h : zone usage per flag (“order:N” means a request for 2^N contiguous pages)
* bit result
* =================
* 0x0 => NORMAL
* 0x1 => DMA or NORMAL
* 0x2 => HIGHMEM or NORMAL
* 0x3 => BAD (DMA+HIGHMEM)
* 0x4 => DMA32 or DMA or NORMAL
* 0x5 => BAD (DMA+DMA32)
* 0x6 => BAD (HIGHMEM+DMA32)
* 0x7 => BAD (HIGHMEM+DMA32+DMA)
* 0x8 => NORMAL (MOVABLE+0)
* 0x9 => DMA or NORMAL (MOVABLE+DMA)
* 0xa => MOVABLE (Movable is valid only if HIGHMEM is set too)
* 0xb => BAD (MOVABLE+HIGHMEM+DMA)
* 0xc => DMA32 (MOVABLE+DMA32)
* 0xd => BAD (MOVABLE+DMA32+DMA)
* 0xe => BAD (MOVABLE+DMA32+HIGHMEM)
* 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
35. SW : Linux Internals - Buddy Allocator
Goal : limit External Fragmentation and Internal Fragmentation
36. SW : Linux Internals - Buddy Allocator
4 KB pages are grouped into 2^9 (2 MB) or 2^10 (4 MB) blocks.
Blocks are then cut in half (2 “buddies”) to service the request.
Upon release, the allocator tries to merge buddy pages back together.
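A small sketch mirroring (not copying) the kernel's get_order() logic: it maps a request size to the buddy order, which is also what the "order:N" in the earlier page-allocation-failure logs refers to.

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Smallest order such that (PAGE_SIZE << order) >= size */
static unsigned order_for(unsigned long size)
{
    unsigned order = 0;
    while ((PAGE_SIZE << order) < size)
        order++;
    return order;
}

int main(void)
{
    printf("order(4 KB)  = %u\n", order_for(4096));   /* 0: one page    */
    printf("order(20 KB) = %u\n", order_for(20480));  /* 3: 32 KB block */
    printf("order(64 KB) = %u\n", order_for(65536));  /* 4: 16 pages    */
    return 0;
}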
37. SW : Linux Internals - Buddy Allocator
Unmovable : locked in memory
Reclaimable : reusable after cleaning
Movable : immediately available
Reserve : last-resort reserve
Isolate : kept on the local NUMA node
CMA : Contiguous Memory Allocator, for DMA devices that need large contiguous buffers
cat /proc/pagetypeinfo
Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order      0      1     2  3  4  5  6  7  8  9  10
Node 0, zone    DMA, type Unmovable             1      1     0  0  2  1  1  0  1  1   0
Node 0, zone    DMA, type Reclaimable           0      0     0  0  0  0  0  0  0  0   0
Node 0, zone    DMA, type Movable               0      0     0  0  0  0  0  0  0  0   2
Node 0, zone    DMA, type Reserve               0      0     0  0  0  0  0  0  0  0   1
Node 0, zone  DMA32, type Unmovable             8      6     0  0  0  0  0  0  0  0   0
Node 0, zone  DMA32, type Reclaimable         376   2817     5  0  0  0  0  0  0  0   0
Node 0, zone  DMA32, type Movable            6323  12025   287  0  0  0  0  0  0  0   0
Node 0, zone  DMA32, type Reserve               0      0     1  4  6  2  0  0  1  1   0
Node 0, zone Normal, type Unmovable          2611    137     0  0  0  0  0  0  0  0   0
Node 0, zone Normal, type Reclaimable       33847   4321   144  0  0  0  0  0  0  0   0
Node 0, zone Normal, type Movable           37312   9849  1097  0  0  0  0  0  0  0   0
Node 0, zone Normal, type Reserve               0      0     5  1  0  1  1  2  1  0   1

Number of blocks type    Unmovable  Reclaimable  Movable  Reserve
Node 0, zone    DMA              1            0        2        1
Node 0, zone  DMA32             13           18      796        1
When no space is available, the buddy allocator wakes kswapd.
38. SW : Linux Internals - SLAB
SLAB = allocator for kernel objects.
Uses caches to avoid fragmentation.
Each kernel object type is stored in a slab.
SLAB : 1 queue per NUMA node
SLUB : 1 queue per CPU
SLOB : as compact as possible
Most servers use SLUB (see the sketch below) :
- defrag
- debug
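A hedged kernel-module sketch of the slab API (kmem_cache_create/alloc/free and kmem_cache_destroy are the real kernel functions; struct foo and the module boilerplate are illustrative):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/slab.h>

struct foo {                        /* hypothetical kernel object */
    int id;
    char name[32];
};

static struct kmem_cache *foo_cache;

static int __init foo_init(void)
{
    /* One cache per object type: every slab holds same-sized objects,
     * which is how the allocator avoids fragmentation. */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                  0, SLAB_HWCACHE_ALIGN, NULL);
    if (!foo_cache)
        return -ENOMEM;

    struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);
    if (f)
        kmem_cache_free(foo_cache, f);
    return 0;
}

static void __exit foo_exit(void)
{
    kmem_cache_destroy(foo_cache);
}

module_init(foo_init);
module_exit(foo_exit);
MODULE_LICENSE("GPL");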
45. SW : Application Memory Allocator
malloc is NOT A SYSCALL (it is built on mmap + brk).
Goals : speed & minimal fragmentation.
Multiple implementations :
- dlmalloc : Doug Lea (generic)
- ptmalloc2 : current glibc
- jemalloc : Jason Evans (FreeBSD, Firefox, FB)
- tcmalloc : Thread-Caching (Google)
- libumem : Solaris
- €€€€€ : lockless, hoard, smartheap...
46. SW : Application Memory Allocator - ptmalloc2
Uses brk or mmap to do allocations :
- brk / sbrk for the main thread and if the request is < 128 KB
- mmap otherwise (see the sketch after this slide)
Maintains arenas : main & per-thread.
Each arena is composed of heaps.
Upon free(), ptmalloc adds the freed region to a “bin” to be reused for later allocations :
- Fast : 16 - 80 bytes
- Unsorted : no size limit, latest freed
- Small : < 512 bytes
- Large : >= 512 bytes
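A hedged userspace sketch of the brk/mmap split: it assumes glibc's default M_MMAP_THRESHOLD of 128 KB (tunable via mallopt()); exact addresses and behavior vary by glibc version.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *brk_now = sbrk(0);            /* current program break     */
    void *small = malloc(64 * 1024);    /* < 128 KB: served via brk  */
    void *large = malloc(512 * 1024);   /* >= 128 KB: anonymous mmap */

    printf("program break : %p\n", brk_now);
    printf("small (64 KB) : %p  (should be near the break)\n", small);
    printf("large (512 KB): %p  (separate mmap region)\n", large);

    free(small);                        /* goes back into a bin      */
    free(large);                        /* munmap'ed immediately     */
    return 0;
}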
47. SW : Application Memory Allocator - ptmalloc2
Internal structures :
- malloc_state : Arena header. Has multiple heaps, except for the Main Arena (which just grows its single heap)
- heap_info : Heap header. Has multiple chunks
- malloc_chunk : Chunk header. Result of malloc()
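A simplified view of the chunk header (field names follow glibc's malloc/malloc.c, with the large-bin nextsize pointers omitted; this is a sketch, not the full glibc definition):

#include <stddef.h>

/* The pointer malloc() returns points just past mchunk_size; the
 * fd/bk links only exist while the chunk sits free in a bin. */
struct malloc_chunk {
    size_t mchunk_prev_size;      /* size of previous chunk, if free     */
    size_t mchunk_size;           /* this chunk's size + A|M|P flag bits */
    struct malloc_chunk *fd;      /* next chunk in the bin's list        */
    struct malloc_chunk *bk;      /* previous chunk in the bin's list    */
};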