Memory Management
From Silicon to Algorithm
Sysadmin #7
Adrien Mahieux - Sysadmin & microsecond hunter
gh: github.com/Saruspete
tw: @Saruspete
em: adrien.mahieux@gmail.com
Agenda
1) HW : Data Storage
2) HW : Data access
3) HW : Data processing
4) SW : Linux Internals
5) SW : Application allocator
6) SEC : Attacks
{S,D,V}RAM, Bank, Rank, {u,r,lr,fb}dimm
CAS Timing, ECC
{S,D,Q}DR, NUMA, Channels, CPU (Cache, Associativity)
CPU Pipeline, Branch prediction, MMU
Zones, Buddy allocator, fragmentation, sl[auo]b
Stack / heap, regions, memory allocator..
Malloc implementation, ptmalloc2 details
Where are we ?
How long ?
How often ?
What ratio ?
What path ?
Hardware : Data Storage
HW : Data Storage - Hierarchy
DRAM Chip
x4 = 16 Banks
x8 = 8 Banks
x16 = 4 Banks
Bank
Array
Row Decoder
Row Buffer
Column Decoder
Array Cell
1 bit
HW : Data Storage - {S,D,V}RAM
RAM Random Access Memory
SRAM Static RAM
DRAM Dynamic RAM
VRAM Video RAM
                        SRAM        DRAM
Speed                   CMOS 2T     Cond.
Power Consumption       Constant    Low + burst
Production Cost         Expensive   Cheap
Production complexity   5 trans.    1 trans.
Read Operation          Stable      Destructive
(SRAM cells are CMOS transistors; DRAM cells are capacitors ("Cond."), hence the speed difference.)
HW : Data Storage - DRAM Refresh
DRAM must be refreshed, even if not accessed.
Done every 4 - 64ms
Refresh can be done by :
- Burst refresh : stop all operations, refresh all memory
- Distributed refresh : refresh one (or more) row at a time
- Self-Refresh (low-power mode) : turn off the memory controller; the DRAM refreshes its capacitors itself
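A rough worked example (assuming a 64 ms retention window and 8192 rows per bank, typical JEDEC values rather than figures from this deck) : with distributed refresh, the controller must issue one row refresh every 64 ms / 8192 ≈ 7.8 µs.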
HW : Data Storage - Bank
- Row Buffer holds read data
- A read gets the entire row into the buffer
- Once read, the capacitors are empty, but the value is in the buffer
- Bits must be written back before doing another read

This process is called "opening" or "closing" a row.

[Figure : the row bits (0 1 0 1 0 0 1 0 0 0 1 1 1 0 1 0) are copied into the Row Buffer, leaving the array cells at 0.]
HW : Data Storage - Rank
Rank : set of DRAM modules on a DIMM connected to the same Chip-Select pin (accessed simultaneously).
A rank has a 64-bit-wide data bus (72 bits on DIMMs with ECC) and is noted <rank>Rx<dram-width>:
- 1Rx16 : Single Rank, 16bits width (4 DRAM to have 64bits)
- 2Rx8 : Dual Rank, 8bits width (8 DRAM to have 64bits)
- 4Rx8 : Quad Rank, 8bits width (8 DRAM to have 64bits)
DIMM Dual Inline Memory Module ( != SIMM)
SO-DIMM Small Outline DIMM (for laptop & embedded)
UDIMM Unregistered DIMM (standard end-user)
RDIMM Registered DIMM or Buffered DIMM (most servers).
FB-DIMM Fully Buffered DIMM : Buffer for Addr & Data bus.
LR-DIMM Load-Reduced DIMM : based on RDIMM, but also buffers the data lines
ECC Error Correcting Code : Parity value checking (like RAID5)
HW : Data Storage - {U,R,LR,FB}DIMM

UDIMM   Latency : low. Max size : 8 GB / DIMM. Address & data bus : parallel.
        Input command and output data buses are directly connected to the bus.
RDIMM   Latency : UDIMM + 1 cycle. Max size : 32 GB / DIMM. Address & data bus : parallel.
        Same as UDIMM, but input commands are stabilized through a register (cost of 1 cycle).
FB-DIMM Max size : 8 GB / DIMM. Address & data bus : serial.
        Adds a big buffer for both commands and data, but the serial implementation generates hyperfrequency and signal-stability issues.
LR-DIMM Latency : similar to RDIMM. Max size : 128 GB / DIMM. Address & data bus : parallel.
        Fixes the issues of FB-DIMM : based on RDIMM, but also buffers the data lines.
HW : Data Storage - RAM Timings
Row and Column addresses are sent on the same (address) bus; a multiplexer on the memory DIMM separates them.
Notation : w-x-y-z-T
- w : CAS Latency (CL)
- x : RAS to CAS delay (TRCD)
- y : RAS precharge (TRP)
- z : Active to Precharge delay (TRAS)
- T : Command Rate
Timings are in cycles
CL : Column select → Data avail. on bus
TRCD : Row select → Column select
TRP : New line activation (opening)
TRAS : Line deactivation (closing)
T : Between 2 commands
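A quick sketch to turn these cycle counts into time (illustrative numbers, assuming a DDR4-3200 CL22 module; neither figure comes from this deck) :

#include <stdio.h>

/* Convert a timing given in I/O-clock cycles to nanoseconds.
 * For DDR, the I/O clock runs at half the transfer rate,
 * since data moves on both clock edges. */
static double cycles_to_ns(double cycles, double megatransfers)
{
    double io_clock_mhz = megatransfers / 2.0; /* 3200 MT/s -> 1600 MHz */
    return cycles * 1000.0 / io_clock_mhz;     /* one cycle = 1000/MHz ns */
}

int main(void)
{
    /* CL22 at 3200 MT/s : 22 * 1000 / 1600 = 13.75 ns to get data on the bus */
    printf("CL22 @ 3200 MT/s = %.2f ns\n", cycles_to_ns(22, 3200));
    return 0;
}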
HW : Data Storage - RAM Timings (advanced)
TRRD RAS to RAS. Time to activate the next bank of memory.
TWTR Write to Read. Time between a write command and the next read command.
TWR Write Recovery. Time between a valid write operation and a precharge.
TRFC Row Refresh Cycle. Time to refresh a row on a memory bank.
TRTP Read To Precharge. Time between a read command and a row-precharge command to the same rank.
TRTW Read to Write Delay. Cycles for a write command to be executed once received.
TRC Row Cycle. Minimum time in cycles for a row to complete a full cycle : tRC = tRAS + tRP.
TREF Refresh period. Time before a cell is refreshed, so it does not lose its charge and corrupt data.
TWCL Write CAS Latency. Write delay to whichever bank is open to be written to.
CPC Command Per Clock. Chip select is executed, then commands are issued.
TRD Static tREAD.
HW : Data Storage - DIMM Assembly
x8 ⇒ Each DRAM outputs 8 bits
HW : Data Storage - 3D X-Point
Public name : Optane
- Low Latency (< 10us)
- High Density
- No memory controller
- Voltage Variation
Hardware : Data Access
HW : Data Access - {S,D,Q}DR
The amount of usable data the memory handles on every clock cycle.
Introduced the concept of "Tps" : Transfers per second.
Original DRAM : specify RAS + CAS for every operation.
FPM (1990) : multiple reads from the same row without RAS.
EDO (1995) : allows selecting the next column while the previous one is being read
SDR (1997) : single selection, then burst transfers follow
DDR (2000) : transfers data on both rising and falling edge of the clock
DDR2 (2003) : 2 internal channels
DDR3 (2007) : Doubled transfer speed
DDR4 (2013) : Increased frequency
DDR5 (2020) : JEDEC released specs…
HW : Data Access - UMA / NUMA
Uniform Memory Access
Central Northbridge
Non Uniform Memory Access
One memory controller per socket
HW : Data Access - Memory Channels
DIMMs used in parallel to increase the
bandwidth : single / dual / triple channel
Channels must be balanced
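A worked example of why channels matter (illustrative numbers, not from this deck) : each channel moves 8 bytes (64 bits) per transfer, so dual-channel DDR4-3200 peaks at 3200 MT/s * 8 B * 2 channels = 51.2 GB/s, twice the single-channel figure.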
HW : Data Access - Direct Memory Access
Bypass CPU processing
- PCI-E
- Thunderbolt
- Firewire
- Cardbus
- Expresscard
The DMA controller notifies the CPU caches of RAM changes : Direct Cache Access
Hardware : Data Processing
HW : Data Processing - CPU Pipeline
HW : Data Processing - Cache
DRAM is slow relative to CPU cycles
⇒ Let's use caches
Used for both reads (prefetch) and writes (write-back)
Eviction is driven by a tracking algorithm :
- LRU Least Recently Used
- LFU Least Frequently Used
- FIFO First In First Out
- ARC Adaptive Replacement Cache
The hit ratio measures how useful the cache is
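A minimal sketch of LRU eviction on a single cache set (assumptions : 4 ways, a linear scan, and a global access counter standing in for real hardware replacement bits) :

#include <stdio.h>

#define WAYS 4

struct line { long tag; unsigned long last_used; int valid; };

static unsigned long now; /* global access counter, stands in for LRU bits */

/* Look a tag up in one set; on a miss, evict the least-recently-used way.
 * Zero-initialized (invalid) lines have last_used == 0, so they win first. */
static int access_set(struct line set[WAYS], long tag)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = ++now;           /* hit : refresh recency */
            return 1;
        }
        if (set[w].last_used < set[victim].last_used)
            victim = w;                          /* track the LRU (or free) way */
    }
    set[victim] = (struct line){ tag, ++now, 1 }; /* miss : fill the victim */
    return 0;
}

int main(void)
{
    struct line set[WAYS] = {0};
    long trace[] = {1, 2, 3, 4, 1, 5};  /* tag 5 evicts tag 2, the LRU entry */
    for (int i = 0; i < 6; i++)
        printf("tag %ld: %s\n", trace[i], access_set(set, trace[i]) ? "hit" : "miss");
    return 0;
}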
HW : Data Processing - Cache
Distribution policy :
- Fully Associative : all blocks checked simultaneously (heavy hardware)
- Direct Mapped : fast, but needs a balanced spread (rare)
- Set-Associative : a mix of the two previous

Addresses can be :
- Virtual : fast access but not unique. Used by L1 and the TLB
- Physical : calculation needed but unique. Used by the other caches

Programmers : avoid mapping the same @Phys on multiple @Virt
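How a set-associative cache splits an address, as a sketch (assumed geometry : 64 B lines and 64 sets, illustrative rather than taken from this deck) :

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64   /* bytes per cache line */
#define NUM_SETS  64   /* sets in the cache    */

int main(void)
{
    uint64_t addr   = 0x7f7412a7b123;               /* arbitrary example */
    uint64_t offset = addr % LINE_SIZE;              /* byte within the line */
    uint64_t set    = (addr / LINE_SIZE) % NUM_SETS; /* which set to probe   */
    uint64_t tag    = addr / LINE_SIZE / NUM_SETS;   /* compared in each way */
    printf("offset=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}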
HW : Data Processing - Cache Coherency
Multiple CPUs ⇒ multiple caches ⇒ they must stay in SYNC
MOESI (Modified Owned Exclusive Shared Invalid) on NUMA systems
Processors use cache snooping
Request For Ownership ⇒ very costly : issued when a CPU modifies data already held in another CPU's cache
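A userspace sketch of that RFO cost : two threads bump adjacent counters sitting in the same cache line, so ownership ping-pongs between cores (the 64 B line size and the padded variant are illustrative assumptions; build with -lpthread) :

#include <pthread.h>
#include <stdio.h>

/* Both counters share one 64 B cache line : every increment on one core
 * triggers a Request For Ownership against the other core's copy. */
static struct { long a, b; } shared_line;

/* Padded variant : each counter gets its own line, no ping-pong. */
static struct { long v; char pad[56]; } padded[2];

static void *bump_a(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared_line.a++;
    return NULL;
}

static void *bump_b(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared_line.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Time this run, then switch the counters to padded[0].v / padded[1].v
     * and compare : the padded version is typically several times faster. */
    printf("a=%ld b=%ld pad=%ld\n", shared_line.a, shared_line.b, padded[0].v);
    return 0;
}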
HW : Data Processing - Writing
Write policy on memory zones is set by the MTRR (Memory Type Range Register) :
Write Through : All data written in cache is also written in memory
Write Back : Delay memory access as long as possible
Write Combining : Force writes to be grouped in bulk
Uncacheable : For some I/O and HW, like BIOS, ACPI, IOAPIC...
HW : Data Processing - Memory Management Unit
Switching between @Virtual and @Physical ⇒ translation done by the CPU (MMU)
Not directly mapped : addressing the 16 EB of a 64-bit space with a direct array would be huge!
Only 48 bits are used (up to 256 TB), with Page Tables, to avoid management waste
4 cascading tables : Page {Global,Upper,Middle} Directory and Page Table
HW : Data Processing - Memory Management Unit
Page Walking :
1) @Base for L4
2) Add offset from bits 39-47
⇒ Got @Base for L3
3) Add offset from bits 30-38
⇒ Got @Base for L2
4) Add offset from bits 21-29
⇒ Got @Base for L1
5) Add offset from bits 12-20
⇒ Got @Base for Page
6) Add offset from bits 0-11
⇒ @Physical
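The same walk expressed as bit manipulation, a sketch (the example address is arbitrary; the shifts and masks follow the bit ranges above : 9 bits per level, 12-bit page offset) :

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va  = 0x00007f7412a7b123ULL; /* example 48-bit virtual address */
    unsigned l4  = (va >> 39) & 0x1ff;    /* bits 39-47 : PGD index   */
    unsigned l3  = (va >> 30) & 0x1ff;    /* bits 30-38 : PUD index   */
    unsigned l2  = (va >> 21) & 0x1ff;    /* bits 21-29 : PMD index   */
    unsigned l1  = (va >> 12) & 0x1ff;    /* bits 12-20 : PTE index   */
    unsigned off = va & 0xfff;            /* bits 0-11  : page offset */
    printf("L4=%u L3=%u L2=%u L1=%u offset=0x%x\n", l4, l3, l2, l1, off);
    return 0;
}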
HW : Data Processing - Memory Management Unit
An empty Page Directory stores 512 (2^9) entries ⇒ 64 bits * 512 = 4 KB
For 32 KB of tables (4 * 512) ⇒ addresses 2 MB
For 2 MB of tables (3*512 + 512*512) ⇒ addresses 1 GB
For 128 MB of tables (2*512 + 512^3) ⇒ addresses 550 GB
⇒ Low overhead for storage... but each translation requires 4 reads.
⇒ Let's cache these translations : Translation Lookaside Buffer (TLB)
Limit TLB flushes upon context switching by adding the page-table ID to the TLB entry
Software : Linux Internals
SW : Linux Internals - Zones

Name     Size (x86)   Size (x86_64)  Description
DMA      < 16MB       < 16MB         For very old devices (24-bit addressing)
DMA32    N/A          16 - 4096MB    For devices addressing up to 32 bits (4GB)
NORMAL   16 - 896MB   > 4096MB       Memory directly mapped by the kernel
HIGHMEM  > 896MB      N/A            Not directly mapped; reached via the 128MB indirection area (see below)
32 bits : 3/1 split (or 2/2 or 1/3) between Userspace & Kernel
Of this 1 GB of kernel space, 128 MB is used to map higher pages : 1024 - 128 = 896
Low memory : directly addressable by the kernel
High memory : must go through the 128 MB indirection area to be addressed
64bit : all space directly addressable by MMU
SW : Linux Internals - Zones
Jul 12 22:13:12 [server] kernel: swapper: page allocation failure. order:2, mode:0x4020
Jul 12 22:46:46 [server] kernel: [app_name]: page allocation failure. order:4, mode:0xd0
include/linux/gfp.h : Zone usage per flag
* bit result
* =================
* 0x0 => NORMAL
* 0x1 => DMA or NORMAL
* 0x2 => HIGHMEM or NORMAL
* 0x3 => BAD (DMA+HIGHMEM)
* 0x4 => DMA32 or DMA or NORMAL
* 0x5 => BAD (DMA+DMA32)
* 0x6 => BAD (HIGHMEM+DMA32)
* 0x7 => BAD (HIGHMEM+DMA32+DMA)
* 0x8 => NORMAL (MOVABLE+0)
* 0x9 => DMA or NORMAL (MOVABLE+DMA)
* 0xa => MOVABLE (Movable is valid only if HIGHMEM is set too)
* 0xb => BAD (MOVABLE+HIGHMEM+DMA)
* 0xc => DMA32 (MOVABLE+DMA32)
* 0xd => BAD (MOVABLE+DMA32+DMA)
* 0xe => BAD (MOVABLE+DMA32+HIGHMEM)
* 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
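A toy decoder for the low, zone-selection nibble of those mode values, simply transcribing the table above (the table is the kernel's; the helper itself is only an illustration) :

#include <stdio.h>

/* One entry per value of the low 4 bits of the allocation mode,
 * copied from the include/linux/gfp.h comment above. */
static const char *zone_of[16] = {
    "NORMAL", "DMA or NORMAL", "HIGHMEM or NORMAL", "BAD",
    "DMA32 or DMA or NORMAL", "BAD", "BAD", "BAD",
    "NORMAL (MOVABLE)", "DMA or NORMAL (MOVABLE)", "MOVABLE", "BAD",
    "DMA32 (MOVABLE)", "BAD", "BAD", "BAD",
};

int main(void)
{
    unsigned mode = 0xd0; /* from the second log line above */
    printf("mode 0x%x -> zone %s\n", mode, zone_of[mode & 0xf]);
    return 0;
}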
SW : Linux Internals - Buddy Allocator
Goal : limit External Fragmentation and Internal Fragmentation
SW : Linux Internals - Buddy Allocator
4K pages are grouped in 2^9 (2 MB) or 2^10 (4 MB) blocks
Blocks are then cut in half (2 buddies) to service the request
Upon release, tries to merge buddy pages back together
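Finding a block's buddy is a single bit-flip of its page-frame number, which is what makes the merge cheap; a minimal sketch of the idea (my own illustration, not the kernel's code) :

#include <stdio.h>

/* Buddy of a block of 2^order pages : flip bit 'order' of its PFN. */
static unsigned long buddy_pfn(unsigned long pfn, unsigned order)
{
    return pfn ^ (1UL << order);
}

int main(void)
{
    /* Order-2 block (4 pages) starting at PFN 8 : its buddy starts at PFN 12,
     * and if both are free they merge into one order-3 block at PFN 8. */
    printf("buddy of PFN 8 at order 2: %lu\n", buddy_pfn(8, 2));
    return 0;
}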
SW : Linux Internals - Buddy Allocator
Unmovable : Locked in memory
Reclaimable : Reusable after clean
Movable : Immediately available
Reserve : Last resort reserve
Isolate : keep on local NUMA node
CMA : Contiguous Memory Allocator, for DMA devices needing large contiguous buffers
cat /proc/pagetypeinfo
Page block order: 10
Pages per block: 1024
Free pages count per migrate type at order    0     1     2  3  4  5  6  7  8  9  10
Node 0, zone DMA, type Unmovable              1     1     0  0  2  1  1  0  1  1   0
Node 0, zone DMA, type Reclaimable            0     0     0  0  0  0  0  0  0  0   0
Node 0, zone DMA, type Movable                0     0     0  0  0  0  0  0  0  0   2
Node 0, zone DMA, type Reserve                0     0     0  0  0  0  0  0  0  0   1
Node 0, zone DMA32, type Unmovable            8     6     0  0  0  0  0  0  0  0   0
Node 0, zone DMA32, type Reclaimable        376  2817     5  0  0  0  0  0  0  0   0
Node 0, zone DMA32, type Movable           6323 12025   287  0  0  0  0  0  0  0   0
Node 0, zone DMA32, type Reserve              0     0     1  4  6  2  0  0  1  1   0
Node 0, zone Normal, type Unmovable        2611   137     0  0  0  0  0  0  0  0   0
Node 0, zone Normal, type Reclaimable     33847  4321   144  0  0  0  0  0  0  0   0
Node 0, zone Normal, type Movable         37312  9849  1097  0  0  0  0  0  0  0   0
Node 0, zone Normal, type Reserve             0     0     5  1  0  1  1  2  1  0   1

Number of blocks type     Unmovable  Reclaimable  Movable  Reserve
Node 0, zone DMA                  1            0        2        1
Node 0, zone DMA32               13           18      796        1
When no space is available, the buddy allocator wakes kswapd
SW : Linux Internals - SLAB
SLAB = allocator for kernel objects
Uses caches to avoid fragmentation
Each kernel object is stored in a slab

SLAB 1 queue / NUMA node
SLUB 1 queue / CPU
SLOB as compact as possible

Most servers use SLUB :
- defragmentation support
- debugging support
SW : Linux Internals - SLAB
# grep -v '# name' /proc/slabinfo | tr -d '#' | column -t
slabinfo - version: 2.1
[ . . . ]
ext4_inode_cache 294150 294210 1072 30 8 : tunables 0 0 0 : slabdata 9807 9807 0
ext4_allocation_context 224 224 128 32 1 : tunables 0 0 0 : slabdata 7 7 0
ext4_io_end 1472 1472 64 64 1 : tunables 0 0 0 : slabdata 23 23 0
ext4_extent_status 143365 221442 40 102 1 : tunables 0 0 0 : slabdata 2171 2171 0
[ . . . ]
inode_cache 13493 15596 584 28 4 : tunables 0 0 0 : slabdata 557 557 0
dentry 449376 450471 192 21 1 : tunables 0 0 0 : slabdata 21451 21451 0
buffer_head 650327 840294 104 39 1 : tunables 0 0 0 : slabdata 21546 21546 0
task_struct 1299 1336 7936 4 8 : tunables 0 0 0 : slabdata 334 334 0
cred_jar 88986 89271 192 21 1 : tunables 0 0 0 : slabdata 4251 4251 0
[ . . . ]
task_struct (large object)
Object size : 7936 B
4 objects / slab = 31744 B
8 pages / slab = 32768 B
Loss : 32768 - 31744 = 1024 B / slab
334 slabs ⇒ 334 KB lost / 10688 KB (3.125%)

ext4_extent_status (small and compact)
Object size : 40 B
102 objects / slab = 4080 B
1 page / slab = 4096 B
Loss : 4096 - 4080 = 16 B / slab (0.39%)
2171 slabs ⇒ ~34 KB lost / 8684 KB (0.39%)
SW : Linux Internals - /proc

cat /proc/meminfo
MemTotal: 16328616 kB
MemFree: 4021720 kB
MemAvailable: 6653544 kB
Buffers: 380220 kB
Cached: 4688968 kB
SwapCached: 0 kB
Active: 7703764 kB
Inactive: 3890964 kB
Active(anon): 6546524 kB
Inactive(anon): 2354396 kB
Active(file): 1157240 kB
Inactive(file): 1536568 kB
Unevictable: 20988 kB
Mlocked: 20988 kB
SwapTotal: 8191996 kB
SwapFree: 8191996 kB
Dirty: 92 kB
Writeback: 0 kB
AnonPages: 6546580 kB
Mapped: 1763640 kB
Shmem: 2368152 kB
Slab: 383576 kB
SReclaimable: 280636 kB
SUnreclaim: 102940 kB
KernelStack: 19632 kB
PageTables: 115868 kB
[ ... ]
SW : Linux Internals - /proc
pmap -X $PID # read /proc/$PID/smaps
21797: /bin/bash
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked Mapping
5576c6639000 r-xp 00000000 fd:04 268557 992 912 80 912 0 0 0 0 0 0 0 bash
5576c6930000 r--p 000f7000 fd:04 268557 16 16 16 16 16 0 0 0 0 0 0 bash
5576c6934000 rw-p 000fb000 fd:04 268557 36 36 36 36 36 0 0 0 0 0 0 bash
5576c693d000 rw-p 00000000 00:00 0 20 20 20 20 20 0 0 0 0 0 0
5576c806d000 rw-p 00000000 00:00 0 1508 1472 1472 1472 1472 0 0 0 0 0 0 [heap]
7f740b9ea000 r-xp 00000000 fd:04 267260 44 44 0 44 0 0 0 0 0 0 0 libnss_files-2.22.so
7f740b9f5000 ---p 0000b000 fd:04 267260 2044 0 0 0 0 0 0 0 0 0 0 libnss_files-2.22.so
7f740bbf4000 r--p 0000a000 fd:04 267260 4 4 4 4 4 0 0 0 0 0 0 libnss_files-2.22.so
7f740bbf5000 rw-p 0000b000 fd:04 267260 4 4 4 4 4 0 0 0 0 0 0 libnss_files-2.22.so
7f740bbf6000 rw-p 00000000 00:00 0 24 0 0 0 0 0 0 0 0 0 0
7f740bbfc000 r--p 00000000 fd:04 267226 109328 504 19 504 0 0 0 0 0 0 0 locale-archive
7f74126c0000 r-xp 00000000 fd:04 267234 1756 1536 10 1536 0 0 0 0 0 0 0 libc-2.22.so
7f7412877000 ---p 001b7000 fd:04 267234 2048 0 0 0 0 0 0 0 0 0 0 libc-2.22.so
7f7412a77000 r--p 001b7000 fd:04 267234 16 16 16 16 16 0 0 0 0 0 0 libc-2.22.so
7f7412a7b000 rw-p 001bb000 fd:04 267234 8 8 8 8 8 0 0 0 0 0 0 libc-2.22.so
7f7412a7d000 rw-p 00000000 00:00 0 16 12 12 12 12 0 0 0 0 0 0
7f7412a81000 r-xp 00000000 fd:04 267240 12 12 0 12 0 0 0 0 0 0 0 libdl-2.22.so
7f7412a84000 ---p 00003000 fd:04 267240 2044 0 0 0 0 0 0 0 0 0 0 libdl-2.22.so
7f7412c83000 r--p 00002000 fd:04 267240 4 4 4 4 4 0 0 0 0 0 0 libdl-2.22.so
7f7412c84000 rw-p 00003000 fd:04 267240 4 4 4 4 4 0 0 0 0 0 0 libdl-2.22.so
7f7412c85000 r-xp 00000000 fd:04 270666 152 152 12 152 0 0 0 0 0 0 0 libtinfo.so.5.9
7f7412cab000 ---p 00026000 fd:04 270666 2044 0 0 0 0 0 0 0 0 0 0 libtinfo.so.5.9
7f7412eaa000 r--p 00025000 fd:04 270666 16 16 16 16 16 0 0 0 0 0 0 libtinfo.so.5.9
7f7412eae000 rw-p 00029000 fd:04 270666 4 4 4 4 4 0 0 0 0 0 0 libtinfo.so.5.9
7f7412eaf000 r-xp 00000000 fd:04 267194 132 132 0 132 0 0 0 0 0 0 0 ld-2.22.so
7f74130a3000 rw-p 00000000 00:00 0 20 20 20 20 20 0 0 0 0 0 0
7f74130a8000 r--p 00000000 fd:04 1442509 124 64 9 64 0 0 0 0 0 0 0 bash.mo
7f74130c7000 r--s 00000000 fd:04 527264 28 28 0 28 0 0 0 0 0 0 0 gconv-modules.cache
7f74130ce000 rw-p 00000000 00:00 0 4 4 4 4 4 0 0 0 0 0 0
7f74130cf000 r--p 00020000 fd:04 267194 4 4 4 4 4 0 0 0 0 0 0 ld-2.22.so
7f74130d0000 rw-p 00021000 fd:04 267194 4 4 4 4 4 0 0 0 0 0 0 ld-2.22.so
7f74130d1000 rw-p 00000000 00:00 0 4 4 4 4 4 0 0 0 0 0 0
7ffc4fbbe000 rw-p 00000000 00:00 0 136 32 32 32 32 0 0 0 0 0 0 [stack]
7ffc4fbe8000 r--p 00000000 00:00 0 8 0 0 0 0 0 0 0 0 0 0 [vvar]
7ffc4fbea000 r-xp 00000000 00:00 0 8 4 0 4 0 0 0 0 0 0 0 [vdso]
ffffffffff600000 r-xp 00000000 00:00 0 4 0 0 0 0 0 0 0 0 0 0 [vsyscall]
====== ==== ==== ========== ========= ============== ============== =============== ==== ======= ======
122620 5072 1814 5072 1684 0 0 0 0 0 0 KB
Software : Application Memory Allocator
SW : Application Memory Allocator
malloc is NOT A SYSCALL (it sits on top of mmap + brk)
Goals : speed & minimal fragmentation
Multiple implementations :
- dlmalloc Doug-Lea (Generic)
- ptmalloc2 Current glibc
- jemalloc Jason Evans (FreeBSD, Firefox, FB)
- tcmalloc Thread-Caching (Google)
- libumem Solaris
- €€€€€ lockless, hoard, smartheap...
SW : Application Memory Allocator - ptmalloc2
Uses brk or mmap to allocate :
- brk / sbrk for the main thread, when the request is < 128 KB
- mmap otherwise

Maintains arenas : main & per-thread
Each arena is composed of shards

Upon free(), ptmalloc adds the freed region to a "bin" to be reused for later allocations :
- Fast 16 - 80 bytes
- Unsorted No size limit. Latest freed.
- Small < 512 bytes
- Large >= 512 bytes
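A way to watch the brk/mmap split from userspace, as a sketch (the 128 KB cutoff is glibc's default M_MMAP_THRESHOLD and can move at runtime) :

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    void *small = malloc(100);          /* served from the brk-grown heap */
    void *large = malloc(1024 * 1024);  /* >= 128 KB : served by mmap     */
    printf("small=%p large=%p\n", small, large);
    /* Run under "strace -e brk,mmap ./a.out" to see which syscall
     * backs each allocation. */
    free(small);
    free(large);
    return 0;
}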
SW : Application Memory Allocator - ptmalloc2
Internal structures :
- malloc_state : arena header. Has multiple heaps, except for the main arena (which just grows its single heap)
- heap_info : heap header. Has multiple chunks
- malloc_chunk : chunk header. What malloc() returns points into one
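A simplified view of malloc_chunk (field names follow older glibc sources; offsets and corner cases are omitted, so treat this as a sketch) :

#include <stddef.h>

struct malloc_chunk {
    size_t prev_size;          /* size of the previous chunk, valid only if it is free   */
    size_t size;               /* this chunk's size; low bits carry the PREV_INUSE,
                                  IS_MMAPPED and NON_MAIN_ARENA flags                    */
    struct malloc_chunk *fd;   /* forward link, used only while the chunk sits in a bin  */
    struct malloc_chunk *bk;   /* backward link, idem                                    */
};

/* malloc() returns a pointer just past the size field; for an allocated
 * chunk the user data overlaps fd/bk, which matter only once freed. */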
Security : Attacks
Security : Some Attacks
Rowhammer Bitflip on DRAM Rows
Evil Maid DMA using physical access
Stack Clash Huge stack usage to overlap onto the heap
Bibliography
Bibliography - Hardware
https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp15/cse502/slides/06-main_mem.pdf
http://www.eng.utah.edu/~cs7810/pres/11-7810-12.pdf
http://slideplayer.fr/slide/3279682/
https://forums.tweaktown.com/gigabyte/27283-memory-timings-explained-suggested-timings-memset-vs-bios.html
http://www.masterslair.com/memory-ram-timings-latency-cas-ras-tcl-trcd-trp-tras
https://www.slideshare.net/abhilash128/lec-21-16642228
https://en.wikichip.org
http://www.overclockingmadeinfrance.com/quest-ce-que-les-cas-latency/
https://stackoverflow.com/questions/29522431/does-a-branch-misprediction-flush-the-entire-pipeline-even-for-very-short-if-st
https://kshitizdange.github.io/418CacheSim/final-report
http://www.lifl.fr/~marquet/cnl/ssam/ssam-c3.pdf
http://www.toves.org/books/cache/
http://www.simmtester.com/page/news/showpubnews.asp?num=168
http://wiki.osdev.org/X86-64
Bibliography - Software
https://sploitfun.wordpress.com/2015/02/10/understanding-glibc-malloc/
http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/
http://duartes.org/gustavo/blog/post/memory-translation-and-segmentation/
http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/
http://www.memorymanagement.org/mmref/index.html#mmref-intro
http://events.linuxfoundation.org/sites/events/files/slides/slaballocators.pdf
http://iarchsys.com/?p=764
Questions ?