3. What’s Page Frame
❖ page frame = a page-sized, page-aligned piece of RAM
❖ struct page = a kernel structure that exists one-to-one for each page frame
❖ mem_map
❖ A single array of struct page entries that covers all the RAM a kernel manages.
❖ In a CONFIG_SPARSEMEM environment, however:
❖ There is no single mem_map.
❖ Instead, there is a list of 2MB-sized arrays of struct page entries.
❖ You must use __pfn_to_page(), __page_to_pfn(), or wrappers around them.
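To make the sparse layout concrete, here is a minimal user-space sketch of a pfn-to-page lookup when struct page arrays are kept per section. Every name in it (sketch_pfn_to_page, section_mem_map, PAGES_PER_SECTION and so on) is hypothetical and the sizes are tiny illustration values; it is not the kernel's implementation of __pfn_to_page() or __page_to_pfn().

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct page. */
struct page {
    unsigned long section;   /* which section this page belongs to (assumed field) */
};

#define PAGES_PER_SECTION 8UL   /* assumed tiny value, for illustration only */
#define NR_SECTIONS       4UL

/* One array of struct page per populated section, not one global mem_map. */
static struct page section_a[PAGES_PER_SECTION];
static struct page section_c[PAGES_PER_SECTION];

/* Sections 1 and 3 model holes in physical memory: no array exists for them. */
static struct page *section_mem_map[NR_SECTIONS] = {
    section_a, NULL, section_c, NULL
};

/* Record in each page which section it belongs to. */
void sketch_init(void)
{
    for (unsigned long nr = 0; nr < NR_SECTIONS; nr++) {
        if (section_mem_map[nr] == NULL)
            continue;
        for (unsigned long i = 0; i < PAGES_PER_SECTION; i++)
            section_mem_map[nr][i].section = nr;
    }
}

/* pfn -> page: pick the section first, then index into its array. */
struct page *sketch_pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn / PAGES_PER_SECTION;

    if (nr >= NR_SECTIONS || section_mem_map[nr] == NULL)
        return NULL;
    return &section_mem_map[nr][pfn % PAGES_PER_SECTION];
}

/* page -> pfn: the section number gives the base, the array offset the rest. */
unsigned long sketch_page_to_pfn(const struct page *page)
{
    unsigned long nr = page->section;

    return nr * PAGES_PER_SECTION
         + (unsigned long)(page - section_mem_map[nr]);
}
```

The point of the sketch is that plain pointer arithmetic on one global array no longer works: the section has to be resolved first, which is why the conversion helpers must be used.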
4. What’s NUMA
❖ NUMA (Non-Uniform Memory Access)
❖ The system is composed of nodes.
❖ Each node is defined by a set of CPUs and one physical memory range.
❖ Memory access latency differs depending on the source and destination nodes.
❖ NUMA configuration
❖ ACPI provides the NUMA configuration:
❖ SRAT (System Resource Affinity Table)
❖ Tells which CPUs and memory ranges belong to which NUMA node.
❖ SLIT (System Locality Information Table)
❖ Tells how far one NUMA node is from another.
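The kind of information a SLIT carries can be pictured as a small distance matrix. The sketch below assumes a three-node machine with made-up distances; the table contents and the function names are hypothetical, and only the convention that a node's distance to itself is 10 comes from ACPI.

```c
#define SKETCH_NR_NODES 3

/* Assumed example values; a real table is supplied by firmware. */
static const unsigned char sketch_distance[SKETCH_NR_NODES][SKETCH_NR_NODES] = {
    { 10, 20, 30 },
    { 20, 10, 20 },
    { 30, 20, 10 },
};

/* Relative distance from one node to another, or -1 for an invalid node. */
int sketch_node_distance(int from, int to)
{
    if (from < 0 || from >= SKETCH_NR_NODES ||
        to < 0 || to >= SKETCH_NR_NODES)
        return -1;
    return sketch_distance[from][to];
}

/* Closest node other than 'from'; ties go to the lowest node number. */
int sketch_nearest_other_node(int from)
{
    int best = -1;

    if (from < 0 || from >= SKETCH_NR_NODES)
        return -1;
    for (int n = 0; n < SKETCH_NR_NODES; n++) {
        if (n == from)
            continue;
        if (best < 0 || sketch_distance[from][n] < sketch_distance[from][best])
            best = n;
    }
    return best;
}
```

A lookup like sketch_nearest_other_node() is the sort of question an allocator asks when the local node has no free memory.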
5. What’s Memory Zone
❖ Physical memory is divided into zones by address range:
❖ ZONE_DMA: <16MB
❖ ZONE_DMA32: <4GB
❖ ZONE_NORMAL: the rest
❖ ZONE_MOVABLE: none by default.
❖ This is used to define a hot-removable physical memory range.
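The address boundaries above can be written down as a small classifier. This is only a sketch using the limits listed on this slide; the enum and function names are hypothetical, and ZONE_MOVABLE is left out because it is empty by default.

```c
#include <stdint.h>

/* Hypothetical names; the limits are the ones listed on the slide. */
enum sketch_zone_type {
    SKETCH_ZONE_DMA,      /* below 16MB */
    SKETCH_ZONE_DMA32,    /* below 4GB  */
    SKETCH_ZONE_NORMAL    /* the rest   */
};

/* Return the zone that a physical address falls into. */
enum sketch_zone_type sketch_zone_of(uint64_t phys_addr)
{
    if (phys_addr < (16ULL << 20))
        return SKETCH_ZONE_DMA;
    if (phys_addr < (4ULL << 30))
        return SKETCH_ZONE_DMA32;
    return SKETCH_ZONE_NORMAL;
}
```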
6. Memory node, zone
[Figure: physical address ranges Range1 and Range2, CPUs CPU1 to CPU4, and two NUMA nodes (node1, node2), each with its own struct pglist_data.]
struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];
	…
};
❖ Every pglist_data has a struct zone for each zone (DMA through MOVABLE), although some of them may be empty.
7. Memory Allocation
1. First, the kernel checks the thresholds for each zone (threshold = watermark and dirty ratio).
❖ If every zone fails the check, the kernel enters the page reclaim path (= today’s topic).
2. If some zone passes, the kernel allocates a page from that zone’s buddy system.
❖ An order-0 page is allocated from the per-cpu cache.
❖ A higher-order page is obtained from the per-order lists of pages.
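The two steps above can be sketched with plain counters. This is a deliberately simplified model, not the kernel's allocator: the struct, its fields and sketch_alloc() are hypothetical, only the watermark part of the threshold is modelled, block splitting is ignored, and free_pages is treated as one combined count.

```c
#define SKETCH_MAX_ORDER 11   /* orders 0..10, as on the buddy system slide */

/* Hypothetical per-zone state, reduced to counters. */
struct sketch_alloc_zone {
    unsigned long free_pages;                   /* simplified total of free pages  */
    unsigned long watermark;                    /* threshold that must stay free   */
    unsigned long pcp_count;                    /* order-0 pages in per-cpu cache  */
    unsigned long free_area[SKETCH_MAX_ORDER];  /* free blocks on each order list  */
};

/*
 * Try each zone in turn. Returns the index of the zone that satisfied the
 * request, or -1 when no zone could, which is where page reclaim would start.
 */
int sketch_alloc(struct sketch_alloc_zone *zones, int nr_zones,
                 unsigned int order)
{
    unsigned long nr;

    if (order >= SKETCH_MAX_ORDER)
        return -1;
    nr = 1UL << order;

    for (int i = 0; i < nr_zones; i++) {
        struct sketch_alloc_zone *z = &zones[i];

        /* Step 1: threshold check for this zone. */
        if (z->free_pages < z->watermark + nr)
            continue;

        /* Step 2: order 0 prefers the per-cpu cache, others use order lists. */
        if (order == 0 && z->pcp_count > 0)
            z->pcp_count--;
        else if (z->free_area[order] > 0)
            z->free_area[order]--;
        else
            continue;

        z->free_pages -= nr;
        return i;
    }
    return -1;
}
```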
8. Memory Deallocation
❖ The page is returned to the buddy system.
❖ An order-0 page is returned to the per-cpu cache via free_hot_cold_page().
❖ Cold page: a page estimated not to be in the CPU cache
❖ It is linked to the tail of the per-cpu cache's LRU list.
❖ Hot page: a page estimated to be in the CPU cache
❖ It is linked to the head of the per-cpu cache's LRU list.
❖ A higher-order page is returned directly to the per-order lists of pages.
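The hot/cold placement can be illustrated with a small double-ended list. The sketch below is hypothetical throughout (names, fixed capacity, integer page ids) and is not free_hot_cold_page() itself. It additionally assumes that an ordinary allocation takes its page from the head, so recently freed hot pages are reused first.

```c
#define SKETCH_PCP_CAPACITY 8

/* Hypothetical per-cpu cache: a ring buffer used as a double-ended list. */
struct sketch_pcp {
    int pages[SKETCH_PCP_CAPACITY];  /* page ids, for illustration */
    int head;                        /* index of the first (hottest) entry */
    int count;                       /* number of cached pages */
};

/* Free a page into the cache: hot pages go to the head, cold to the tail. */
int sketch_pcp_free(struct sketch_pcp *pcp, int page_id, int cold)
{
    if (pcp->count == SKETCH_PCP_CAPACITY)
        return -1;                   /* full; a real cache would drain here */

    if (cold) {
        int tail = (pcp->head + pcp->count) % SKETCH_PCP_CAPACITY;
        pcp->pages[tail] = page_id;
    } else {
        pcp->head = (pcp->head + SKETCH_PCP_CAPACITY - 1) % SKETCH_PCP_CAPACITY;
        pcp->pages[pcp->head] = page_id;
    }
    pcp->count++;
    return 0;
}

/* Allocate from the head (assumed policy); returns -1 when the cache is empty. */
int sketch_pcp_alloc(struct sketch_pcp *pcp)
{
    int page_id;

    if (pcp->count == 0)
        return -1;
    page_id = pcp->pages[pcp->head];
    pcp->head = (pcp->head + 1) % SKETCH_PCP_CAPACITY;
    pcp->count--;
    return page_id;
}
```

Freeing page 1 as hot, page 2 as cold and page 3 as hot leaves the list ordered 3, 1, 2 from head to tail, so the cold page is the last one to be handed out again.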
9. Buddy System
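[Figure: the per-zone buddy system keeps one list of free blocks per order: order0 (4k blocks), order1 (8k blocks), ・・・ order10 (4m blocks). Order-0 pages are (de)allocated through a per-cpu cache of 4k pages that has a HOT end and a COLD end.]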
15. do_try_to_free_pages()
❖ Core function for page reclaim, called from three different paths:
❖ try_to_free_pages() → Global reclaim path via __alloc_pages_nodemask()
❖ try_to_free_mem_cgroup_pages() → Per-memcg reclaim path
❖ Right before per-memcg slab allocation
❖ Right before per-memcg file page allocation
❖ Right before per-memcg anon page allocation
❖ Right before per-memcg swapin allocation
❖ shrink_all_memory() → Hibernation path
❖ Arguments: (1) struct zonelist *zonelist (2) struct scan_control *sc
16. struct scan_control
struct scan_control {
	unsigned long nr_scanned;
	unsigned long nr_reclaimed;
	unsigned long nr_to_reclaim;
	…
	int swappiness; // 0..100
	…
	struct mem_cgroup *target_mem_cgroup;
	…
	nodemask_t *nodemask;
};
20. shrink_list()
❖ Calls shrink_{active or inactive}_list; however, the active list is shrunk only when it is larger than its paired inactive list.
1. If an ACTIVE list is specified:
❖ if size of lru(ACTIVE) > size of lru(INACTIVE):
❖ shrink_active_list
2. else:
❖ shrink_inactive_list
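The decision above fits in a few lines of C. This sketch follows the slide's description literally; the enum and function names are hypothetical, and the plain size comparison is the slide's simplification rather than the kernel's exact test.

```c
/* Hypothetical names; the logic mirrors the pseudocode on this slide. */
enum sketch_lru_kind { SKETCH_LRU_INACTIVE, SKETCH_LRU_ACTIVE };

enum sketch_action {
    SKETCH_DO_NOTHING,
    SKETCH_SHRINK_ACTIVE,
    SKETCH_SHRINK_INACTIVE
};

/* Decide which shrink function would run for the given list. */
enum sketch_action sketch_shrink_list(enum sketch_lru_kind lru,
                                      unsigned long nr_active,
                                      unsigned long nr_inactive)
{
    if (lru == SKETCH_LRU_ACTIVE) {
        /* Only shrink the active list when it outgrows its inactive pair. */
        if (nr_active > nr_inactive)
            return SKETCH_SHRINK_ACTIVE;
        return SKETCH_DO_NOTHING;
    }
    return SKETCH_SHRINK_INACTIVE;
}
```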
21. shrink_{active,inactive}_list
❖ shrink_active_list()
1. Traverse the pages in an active list.
2. Find the pages that are no longer active and move them to an inactive list.
❖ shrink_inactive_list()
❖ foreach page:
1. page_mapped(page) => try_to_unmap(page)
2. if PageDirty(page) => pageout(page)
23. try_to_unmap()
❖ Unmaps the specified page from all of its mappings.
1. Set up struct rmap_walk_control.
2. rmap_walk_{file, anon, or ksm}
❖ An rmap walk iterates over the VMAs that map the page and unmaps the page from each of them.
A. file: traverse the address_space::i_mmap tree
B. anon: traverse the anon_vma tree
C. ksm: traverse all merged anon_vma trees
❖ Each operation is similar to the anon case.
28. kswapd
❖ Processing overview
1. Wake up
2. balance_pgdat()
3. Sleep
❖ balance_pgdat()
❖ Works until all zones of the pgdat are at or above their high watermark.
❖ Reclaim function: kswapd_shrink_zone()
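The loop condition of balance_pgdat() can be pictured as follows. This is a toy model: the struct, the fixed batch size and sketch_balance_pgdat() are hypothetical, and "reclaiming" is reduced to adding a batch of pages to a zone's free count where the kernel would call kswapd_shrink_zone().

```c
/* Hypothetical per-zone state, reduced to the two numbers that matter here. */
struct sketch_kswapd_zone {
    unsigned long free_pages;
    unsigned long high_wmark;
};

/*
 * Keep making passes over the zones until every one of them is at or above
 * its high watermark. Returns the total number of pages "reclaimed".
 */
unsigned long sketch_balance_pgdat(struct sketch_kswapd_zone *zones,
                                   int nr_zones, unsigned long batch)
{
    unsigned long reclaimed = 0;
    int balanced;

    if (batch == 0)
        return 0;               /* nothing can make progress */

    do {
        balanced = 1;
        for (int i = 0; i < nr_zones; i++) {
            if (zones[i].free_pages >= zones[i].high_wmark)
                continue;       /* this zone is already balanced */
            /* Stand-in for the per-zone reclaim step. */
            zones[i].free_pages += batch;
            reclaimed += batch;
            balanced = 0;
        }
    } while (!balanced);

    return reclaimed;
}
```

Once the function returns, every zone meets its high watermark and the kswapd thread can go back to sleep until it is woken again.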