Linux Kernel
Page Reclaim
吉田雅徳 (@siburu)
2014/7/27 (Sun)
1. Recap of the Previous Session
What’s a Page Frame
❖ page frame = a page-sized, page-aligned piece of RAM
❖ struct page = a kernel structure that corresponds one-to-one to a page frame
❖ mem_map
  ❖ A single array of struct page entries that covers all RAM the kernel manages.
❖ In a CONFIG_SPARSEMEM environment, however:
  ❖ There is no single mem_map.
  ❖ Instead, there is a list of 2MB-sized arrays of struct page.
  ❖ You must use __pfn_to_page(), __page_to_pfn(), or their wrappers.
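The reason a plain array index no longer works under CONFIG_SPARSEMEM can be sketched with a small user-space model. Everything below is an assumption made for illustration: the `toy_` functions, `struct section`, and the tiny section sizes are hypothetical and are not the kernel's definitions. The sketch only shows the two-step lookup: pick the section, then index into that section's own struct page array.

```c
#include <assert.h>
#include <stddef.h>

/* Toy sizes, chosen for illustration only (not the kernel's values). */
#define PAGES_PER_SECTION 8UL
#define NR_SECTIONS       4UL

struct page {
    unsigned long section;   /* which section this page frame belongs to */
};

/* One struct page array per section; map == NULL models a memory hole. */
struct section {
    struct page *map;
};

static struct section sections[NR_SECTIONS];

/* Register a per-section struct page array and tag each entry. */
void section_online(unsigned long nr, struct page *map)
{
    assert(nr < NR_SECTIONS);
    for (unsigned long i = 0; i < PAGES_PER_SECTION; i++)
        map[i].section = nr;
    sections[nr].map = map;
}

/* pfn -> struct page: select the section, then index into its array. */
struct page *toy_pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn / PAGES_PER_SECTION;

    if (nr >= NR_SECTIONS || sections[nr].map == NULL)
        return NULL;
    return &sections[nr].map[pfn % PAGES_PER_SECTION];
}

/* struct page -> pfn: the page records its section, so no search is needed. */
unsigned long toy_page_to_pfn(const struct page *page)
{
    const struct section *s = &sections[page->section];

    return page->section * PAGES_PER_SECTION + (unsigned long)(page - s->map);
}
```

With a single mem_map the conversion would be one pointer addition; here a pfn whose section was never registered simply has no struct page.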
What’s NUMA
❖ NUMA (Non-Uniform Memory Access)
  ❖ A system is composed of nodes.
  ❖ Each node is defined by a set of CPUs and one physical memory range.
  ❖ Memory access latency differs depending on the source and destination nodes.
❖ NUMA configuration
  ❖ ACPI provides the NUMA configuration:
    ❖ SRAT (System Resource Affinity Table)
      ❖ Tells which CPUs and which memory range belong to which NUMA node.
    ❖ SLIT (System Locality Information Table)
      ❖ Tells how far one NUMA node is from another.
What’s a Memory Zone
❖ Physical memory is divided by address range:
  ❖ ZONE_DMA: <16MB
  ❖ ZONE_DMA32: <4GB
  ❖ ZONE_NORMAL: the rest
  ❖ ZONE_MOVABLE: empty by default
    ❖ Used to define a hot-removable physical memory range.
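As a minimal sketch of the address-range split above, the function below maps a physical address to a zone using only the two boundaries listed on the slide (16MB and 4GB). The function name `zone_of_phys_addr` and the local enum are hypothetical; ZONE_MOVABLE is left out because it is empty by default, and real zone boundaries depend on the architecture and configuration.

```c
#include <stdint.h>

/* Local stand-in for the kernel's zone types (ZONE_MOVABLE omitted). */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL };

#define DMA_LIMIT   (16ULL << 20)   /* 16MB */
#define DMA32_LIMIT (4ULL << 30)    /* 4GB  */

/* Classify a physical address by the ranges listed on the slide. */
enum zone_type zone_of_phys_addr(uint64_t paddr)
{
    if (paddr < DMA_LIMIT)
        return ZONE_DMA;
    if (paddr < DMA32_LIMIT)
        return ZONE_DMA32;
    return ZONE_NORMAL;
}
```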
Memory node, zone
[Figure: the physical address space (Range1, Range2) and CPU1 to CPU4 are divided between NUMA node1 and NUMA node2; each node has its own struct pglist_data.]
struct pglist_data {
        struct zone node_zones[MAX_NR_ZONES];
        …
};
❖ Every pglist_data has a struct zone for each zone type (ZONE_DMA through ZONE_MOVABLE), although some of them may be empty.
Memory Allocation
1. First, check each zone against its thresholds (threshold = watermark and dirty ratio).
  ❖ If every zone fails the check, the kernel enters the page reclaim path (= today’s topic).
2. If some zone passes, allocate a page from that zone’s buddy system.
  ❖ An order-0 page is allocated from the per-cpu cache.
  ❖ A higher-order page is taken from the per-order lists of pages.
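The zone check in step 1 can be sketched roughly as follows. This is a simplified user-space model, not the kernel's watermark logic: `struct toy_zone` and `pick_zone` are hypothetical names, the dirty-ratio check is omitted, and a single watermark per zone is assumed.

```c
#include <stddef.h>

struct toy_zone {
    const char *name;
    unsigned long free_pages;
    unsigned long watermark;   /* pages that must remain free after allocating */
};

/*
 * Walk the zones in preference order and return the first one that
 * would still be at or above its watermark after handing out 2^order
 * pages. NULL means every zone failed, i.e. the caller would have to
 * enter the page reclaim path.
 */
struct toy_zone *pick_zone(struct toy_zone *zones, size_t n, unsigned int order)
{
    unsigned long want = 1UL << order;

    for (size_t i = 0; i < n; i++) {
        struct toy_zone *z = &zones[i];

        if (z->free_pages >= want && z->free_pages - want >= z->watermark)
            return z;
    }
    return NULL;
}
```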
Memory Deallocation
❖ The page is returned to the buddy system.
  ❖ An order-0 page is returned to the per-cpu cache via free_hot_cold_page().
    ❖ Cold page: a page presumed not to be in the CPU cache
      ❖ It is linked to the tail of the LRU list of the per-cpu cache.
    ❖ Hot page: a page presumed to be in the CPU cache
      ❖ It is linked to the head of the LRU list of the per-cpu cache.
  ❖ A higher-order page is returned directly to the per-order lists of pages.
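The head/tail placement described above can be modelled with a small circular doubly linked list. The names below (`pcp_list`, `pcp_free`, `pcp_alloc`) are hypothetical. The allocation side, where a normal request takes from the head and a request for a cold page takes from the tail, is an assumption added to show why the placement matters; the slide itself only describes the free side.

```c
#include <stdbool.h>
#include <stddef.h>

struct pcp_page {
    unsigned long pfn;
    struct pcp_page *prev, *next;
};

/* Circular list with a sentinel node: head.next is the hottest page. */
struct pcp_list {
    struct pcp_page head;
    int count;
};

void pcp_init(struct pcp_list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->count = 0;
}

static void insert_between(struct pcp_page *p,
                           struct pcp_page *prev, struct pcp_page *next)
{
    p->prev = prev;
    p->next = next;
    prev->next = p;
    next->prev = p;
}

/* Free an order-0 page: hot pages go to the head, cold pages to the tail. */
void pcp_free(struct pcp_list *l, struct pcp_page *p, bool cold)
{
    if (cold)
        insert_between(p, l->head.prev, &l->head);
    else
        insert_between(p, &l->head, l->head.next);
    l->count++;
}

/* Allocate: take from the head by default, from the tail if cold is wanted. */
struct pcp_page *pcp_alloc(struct pcp_list *l, bool cold)
{
    if (l->count == 0)
        return NULL;

    struct pcp_page *p = cold ? l->head.prev : l->head.next;

    p->prev->next = p->next;
    p->next->prev = p->prev;
    l->count--;
    return p;
}
```

Under this model the most recently freed hot page is the first one reused, which is the point of keeping it at the head.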
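Buddy System
[Figure: the per-zone buddy system keeps one free list per order, from order0 (4k blocks) and order1 (8k blocks) up to order10 (4m blocks). Order-0 allocation and deallocation go through the per-cpu cache of 4k pages, which has a HOT end and a COLD end.]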
2. Page Reclaim
2.1 Direct reclaim
2.2 Daemon reclaim
Review of the Page Allocation Flow
❖ __alloc_pages_nodemask (the basic page allocation function)
  ❖ get_page_from_freelist (1st: local zones, low wmark) → get_page_from_freelist (2nd: all zones)
  ❖ __alloc_pages_slowpath
    1. wake_all_kswapds (wake up the kswapd threads)
    2. get_page_from_freelist (3rd: all zones, min wmark)
    3. if {__GFP,PF}_MEMALLOC → __alloc_pages_high_priority
    4. __alloc_pages_direct_compact (asynchronous)
    5. __alloc_pages_direct_reclaim (reclaim pages directly in this context)
    6. if not did_some_progress → __alloc_pages_may_oom
    7. Retry (go to 2.) or __alloc_pages_direct_compact (synchronous)
2.1 Direct Reclaim
(reclaim performed by the task that requested the allocation)
__alloc_pages_direct_reclaim()
❖ __perform_reclaim
  ❖ current->flags |= PF_MEMALLOC
    ❖ So that the emergency reserves can be used if reclaim itself needs to allocate pages
  ❖ try_to_free_pages
    ❖ throttle_direct_reclaim
      ❖ if !pfmemalloc_watermark_ok → wait until kswapd makes it ok
    ❖ do_try_to_free_pages
  ❖ current->flags &= ~PF_MEMALLOC
❖ get_page_from_freelist
❖ drain_all_pages
❖ get_page_from_freelist
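pfmemalloc_watermark_ok()
❖ ARGS
  ❖ pgdat (type: struct pglist_data)
❖ RETURN
  ❖ type: bool
  ❖ node’s free_pages > 0.5 * node’s min_wmark
❖ DESC
  ❖ Compares the amount of free pages with half of the min watermark, per node (not per zone); the result is OK if the free pages exceed it.
  ❖ If they fall below it, returns false and also wakes up that node's kswapd.
  ❖ This function sets the threshold at which a node under memory pressure stops doing direct reclaim and leaves the work to kswapd.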
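The check described above can be sketched in user space as follows. This is a model under stated assumptions, not the kernel function: `struct toy_node`, `struct toy_zone`, and the `kswapd_woken` flag are hypothetical stand-ins, and the node-wide figures are obtained by summing over every zone of the node.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_zone {
    unsigned long free_pages;
    unsigned long min_wmark;
};

struct toy_node {
    struct toy_zone *zones;
    size_t nr_zones;
    bool kswapd_woken;   /* stand-in for waking the node's kswapd */
};

/*
 * Per-node check: sum free pages and min watermarks over the node's
 * zones and require free > half of the summed min watermark.
 * On failure, record that kswapd should be woken and return false.
 */
bool toy_pfmemalloc_watermark_ok(struct toy_node *node)
{
    unsigned long free_pages = 0;
    unsigned long reserve = 0;

    for (size_t i = 0; i < node->nr_zones; i++) {
        free_pages += node->zones[i].free_pages;
        reserve += node->zones[i].min_wmark;
    }

    bool ok = free_pages > reserve / 2;

    if (!ok)
        node->kswapd_woken = true;
    return ok;
}
```

A caller modelled on throttle_direct_reclaim would sleep while this returns false and proceed to do_try_to_free_pages once it returns true.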
do_try_to_free_pages()
❖ Core function of page reclaim, called from three different paths
  ❖ try_to_free_pages() → global reclaim path via __alloc_pages_nodemask()
  ❖ try_to_free_mem_cgroup_pages() → per-memcg reclaim path
    ❖ Right before per-memcg slab allocation
    ❖ Right before per-memcg file page allocation
    ❖ Right before per-memcg anon page allocation
    ❖ Right before per-memcg swapin allocation
  ❖ shrink_all_memory() → hibernation path
❖ Arguments: (1) struct zonelist *zonelist (2) struct scan_control *sc
struct scan_control
struct scan_control {
        unsigned long nr_scanned;
        unsigned long nr_reclaimed;
        unsigned long nr_to_reclaim;
        …
        int swappiness; // 0..100
        …
        struct mem_cgroup *target_mem_cgroup;
        …
        nodemask_t *nodemask;
};
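What do_try_to_free_pages Does
❖ It loops over the following two steps:
  ❖ shrink_zones()
    ❖ Described later
  ❖ wakeup_flusher_threads()
    ❖ Called each time shrink_zones has scanned at least 1.5 times the reclaim target (scan_control::nr_to_reclaim).
    ❖ Asks all block devices (bdi) to write back at most as many pages as were scanned.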
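The loop above can be sketched as a small simulation. The code is a user-space model under assumptions: `toy_shrink_zones` and `struct toy_workload` are hypothetical stand-ins that scan and reclaim a fixed number of pages per pass, the loop is driven by a decreasing priority value passed in by the caller, and waking the flusher threads is reduced to counting wakeups.

```c
struct toy_scan_control {
    unsigned long nr_scanned;
    unsigned long nr_reclaimed;
    unsigned long nr_to_reclaim;
};

/* Hypothetical workload: what one shrink pass scans and reclaims. */
struct toy_workload {
    unsigned long scanned_per_pass;
    unsigned long reclaimed_per_pass;
};

/* Stand-in for shrink_zones(): accounts one pass into the scan control. */
static void toy_shrink_zones(struct toy_scan_control *sc,
                             const struct toy_workload *w)
{
    sc->nr_scanned += w->scanned_per_pass;
    sc->nr_reclaimed += w->reclaimed_per_pass;
}

/*
 * Repeat shrink passes until the reclaim target is met or the priority
 * runs out. Once the total scanned exceeds 1.5x the target, each further
 * unsuccessful pass requests writeback. Returns the number of times the
 * flusher threads would have been woken.
 */
unsigned int toy_do_try_to_free_pages(struct toy_scan_control *sc,
                                      const struct toy_workload *w,
                                      int start_priority)
{
    unsigned long total_scanned = 0;
    unsigned long writeback_threshold =
        sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
    unsigned int flusher_wakeups = 0;

    for (int priority = start_priority; priority >= 0; priority--) {
        sc->nr_scanned = 0;
        toy_shrink_zones(sc, w);
        total_scanned += sc->nr_scanned;

        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
            break;
        if (total_scanned > writeback_threshold)
            flusher_wakeups++;   /* would ask for total_scanned pages */
    }
    return flusher_wakeups;
}
```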
shrink_zones()
1. for_each_zone_zonelist_nodemask:
  1. mem_cgroup_soft_limit_reclaim
    ❖ while mem_cgroup_largest_soft_limit_node:
      ❖ mem_cgroup_soft_reclaim
    ❖ Before moving on to shrink_zone, this reclaims pages from the memcgs that use this zone and exceed their soft limit
  2. shrink_zone
    ❖ foreach mem_cgroup_iter:
      ❖ shrink_lruvec
    ❖ In global reclaim, this iteration reclaims starting from the root memcg
2. shrink_slab
  ❖ Slab will be covered in a later session...
shrink_lruvec()
❖ per-zone page freer
1. get_scan_count
  ❖ Decides the target number of pages to reclaim
2. while the target has not been reached:
  ❖ shrink_list(LRU_INACTIVE_ANON)
  ❖ shrink_list(LRU_ACTIVE_ANON)
  ❖ shrink_list(LRU_INACTIVE_FILE)
  ❖ shrink_list(LRU_ACTIVE_FILE)
3. if inactive anonymous memory alone is insufficient:
  ❖ shrink_active_list
shrink_list()
❖ Calls shrink_{active or inactive}_list; however, an active list is shrunk only when it is larger than its paired inactive list
1. if an ACTIVE list is specified:
  ❖ if size of lru(ACTIVE) > size of lru(INACTIVE):
    ❖ shrink_active_list
2. else:
  ❖ shrink_inactive_list
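The dispatch rule on this slide can be written down as a small decision function. It follows the slide's simplified size comparison only; the enum `shrink_action`, the function `toy_shrink_list`, and the `size` array are hypothetical names introduced for the sketch.

```c
#include <stdbool.h>

enum lru_list {
    LRU_INACTIVE_ANON,
    LRU_ACTIVE_ANON,
    LRU_INACTIVE_FILE,
    LRU_ACTIVE_FILE,
    NR_LRU
};

enum shrink_action { SHRINK_NOTHING, SHRINK_ACTIVE, SHRINK_INACTIVE };

static bool is_active_lru(enum lru_list lru)
{
    return lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE;
}

/*
 * Decide which shrinker would run for the given list.
 * size[] holds the current number of pages on each LRU list.
 */
enum shrink_action toy_shrink_list(enum lru_list lru,
                                   const unsigned long size[NR_LRU])
{
    if (is_active_lru(lru)) {
        enum lru_list inactive = (lru == LRU_ACTIVE_ANON)
                                     ? LRU_INACTIVE_ANON
                                     : LRU_INACTIVE_FILE;

        /* An active list is shrunk only if it outgrew its inactive pair. */
        if (size[lru] > size[inactive])
            return SHRINK_ACTIVE;
        return SHRINK_NOTHING;
    }
    return SHRINK_INACTIVE;
}
```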
shrink_{active,inactive}_list
❖ shrink_active_list()
  1. Traverse the pages in an active list
  2. Find inactive pages in the list and move them to an inactive list
❖ shrink_inactive_list()
  ❖ foreach page:
    1. page_mapped(page) => try_to_unmap(page)
    2. if PageDirty(page) => pageout(page)
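What Is an Inactive Page
❖ When laptop_mode is off
  ❖ Simply takes the specified number of pages from the tail of the active LRU list as inactive pages
❖ When laptop_mode is on
  ❖ Takes the specified number of clean pages from the tail of the active LRU list as inactive pages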
try_to_unmap()
❖ Unmaps the specified page from every mapping that references it
1. Set up struct rmap_walk_control.
2. rmap_walk_{file, anon, or ksm}
  ❖ An rmap walk iterates over the VMAs and unmaps the page from each of them
  A. file: traverse the address_space::i_mmap tree
  B. anon: traverse the anon_vma tree
  C. ksm: traverse all merged anon_vma trees
    ❖ each operation is similar to that for anon
A. rmap_walk_file
[Figure: page → address_space (inode) → i_mmap (type: rb_root) → several vma → the pgtbl of each vma, where the page is unmapped]
B. rmap_walk_anon
[Figure: page → anon_vma → rb_root (type: rb_root) → several vma → the pgtbl of each vma, where the page is unmapped]
C. rmap_walk_ksm
[Figure: page → stable_node → hlist → several anon_vma → several vma → the pgtbl of each vma]
2.2 Daemon Reclaim
(reclaim performed by kswapd on behalf of the allocating tasks)
kswapd
❖ Processing overview
  1. Wake up
  2. balance_pgdat()
  3. Sleep
❖ balance_pgdat()
  ❖ Works until all zones of the pgdat are at or above the high watermark.
  ❖ reclaim function: kswapd_shrink_zone()
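The balance_pgdat() loop can be pictured with the following user-space sketch. It is a model under assumptions, not the kernel code: `toy_kswapd_shrink_zone` is a hypothetical stand-in that always frees a fixed batch of pages, and `max_passes` is an added safety bound so the simulation terminates even if the zones can never be balanced.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_zone {
    unsigned long free_pages;
    unsigned long high_wmark;
};

static bool all_zones_balanced(const struct toy_zone *zones, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (zones[i].free_pages < zones[i].high_wmark)
            return false;
    return true;
}

/* Stand-in for kswapd_shrink_zone(): pretend one batch was reclaimed. */
static void toy_kswapd_shrink_zone(struct toy_zone *z, unsigned long batch)
{
    z->free_pages += batch;
}

/*
 * Keep reclaiming from every zone that is below its high watermark
 * until all zones of the node are at or above it.
 * Returns the number of passes that were needed.
 */
unsigned int toy_balance_pgdat(struct toy_zone *zones, size_t n,
                               unsigned long batch, unsigned int max_passes)
{
    unsigned int passes = 0;

    while (passes < max_passes && !all_zones_balanced(zones, n)) {
        for (size_t i = 0; i < n; i++)
            if (zones[i].free_pages < zones[i].high_wmark)
                toy_kswapd_shrink_zone(&zones[i], batch);
        passes++;
    }
    return passes;
}
```

In this model kswapd would call toy_balance_pgdat() after each wakeup and go back to sleep once it returns.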
