SlideShare une entreprise Scribd logo
1  sur  75
Télécharger pour lire hors ligne
Memory Mapping Implementation (mmap) in
Linux Kernel
Adrian Huang | May, 2022
* Based on kernel 5.11 (x86_64) – QEMU
* SMP (4 CPUs) and 8GB memory
* Kernel parameter: nokaslr norandmaps
* Userspace: ASLR is disabled
* Legacy BIOS
Agenda
• Three different IO types
• Process Address Space – mm_struct & VMA
• Four types of memory mappings
• mmap system call implementation
• Demand page: page fault handling for four types of memory mappings
• fork()
• COW mapping configuration: set ‘write protect’ when calling fork()
• COW fault call path
Three different IO types
• Buffered IO
o Leverage page cache
• Direct IO
o Bypass page cache
• Memory-mapped IO (file-based mapping,
memory-mapped file)
• File content can be accessed by operations on the
bytes in the corresponding memory region.
Process Address Space – mm_struct & VMA (1/2)
Process Address Space – mm_struct & VMA (2/2)
Four types of memory mappings
Reference from: Chapter 49, The Linux Programming Interface
File Anonymous
Private
(Modification is not visible to other processes)
Initializing memory from contents of file
Example: Process's .text and .data segements
(Changes are not carried through to the underlying file)
Memory Allocation
Shared
(Modification is visible to other processes)
1. Memory-mapped IO: Changes are carried through to
the underlying file
2. Sharing memory between processes (IPC)
Sharing memory between processes (IPC)
Visibility of Modification
Mapping Type
mmap system call implementation (1/2)
mmap system call implementation (2/2)
1. thp_get_unmapped_area() eventually calls arch_get_unmapped_area_topdown()
2. unmapped_area_topdown() is the key point for allocating an userspace address
inode_operations & file_operations: file
thp_get_unmapped_area(): DAX supported for huge page
inode_operations & file_operations: directory
vma
vm_start =
0x400000
vm_end =
0x401000
vma
vm_start =
0x401000
vm_end =
0x496000
vma
vm_start =
0x496000
vm_end =
0x4bd000
vma
vm_start =
0x4be000
vm_end =
0x4c1000
vma
vm_start =
0x4c1000
vm_end =
0x4c4000
vma (heap)
vm_start =
0x4c4000
vm_end =
0x4e8000
vma (vvar)
vm_start =
0x7ffff7ffa000
vm_end =
0x7ffff7ffe000
vma (vdso)
vm_start =
0x7ffff7ffe000
vm_end =
0x7ffff7fff000
vma (stack)
vm_start =
0x7ffffffde000
vm_end =
0x7ffffffff000
GAP
GAP
G
A
P
GAP
user space address
0 0x7ffffffff000
Process address space: Doubly linked-list
Process address space: VMA rbtree
Process address space: VMA rbtree -> rb_subtree_gap
• Maximum of the following items:
o gap between this vma and previous one
o rb_left.subtree_gap
o rb_right.subtree_gap
rb_subtree_gap calculation
Process address space: VMA rbtree -> rb_subtree_gap
• Maximum of the following items:
o gap between this vma and previous one
o rb_left.subtree_gap
o rb_right.subtree_gap
rb_subtree_gap calculation
max(0x7ffff7ffa000 – 0x4e8000, 0, 0) =
0x7fff7b12000
Process address space: VMA rbtree -> rb_subtree_gap
• VM_GROWSDOWN is set in vma->vm_flags
• max(0x7ffffffde000 – 0x7ffff7fff000 -
stack_guard_gap, 0, 0) = 0x7edf000
• stack_guard_gap = 256UL<<PAGE_SHIFT
unmapped_area_topdown()
unmapped_area_topdown() - Principle
vm_unmapped_area_info
length = 0x1000
low_limit = PAGE_SIZE
high_limit = mm->mmap_base
= 0x7ffff7fff000
gap_end = info->high_limit (0x7ffff7fff000)
high_limit = gap_end – info->length = 0x7ffff7ffe000
gap_start = mm->highest_vm_end = 0x7ffffffff000
unmapped_area_topdown(): Preparation
[Efficiency] Traverse vma rbtree to find gap space instead of traversing vma linked list
unmapped_area_topdown()
unmapped_area_topdown()
unmapped_area_topdown()
unmapped_area_topdown()
Stack guard gap
unmapped_area_topdown()
Stack guard gap
gap_end = 0x7ffffffde000 – 0x100000 = 0x7fffffede000 * stack_guard_gap = 256UL<<PAGE_SHIFT = 0x100000
unmapped_area_topdown()
Not found yet
unmapped_area_topdown()
1
2
unmapped_area_topdown()
unmapped_area_topdown()
Found
return 0x7ffff7ffa000 – 0x1000 = 0x7ffff7ff9000
vma_merge() – Possible ways to merge
Note: VMAs must have the same attributes
mmap_region()->munmap_vma_range()->find_vma_links
find_vma_links(…, 0x7ffff7ff9000, …)
Iterate rbtree to find an empty VMA link (rb_link)
• rb_parent: The parent of rb_link
• rb_prev: Previous vma of rb_parent
• rb_prev is used in vma_merge()
• rb_link, rb_parent and rb_prev
are used in vma_link().
• rb_prev is the previous VMA
after the new VMA is inserted.
Note
vm_area_alloc() & vma_link()
__vma_link_file(): i_mmap for recording all memory mappings
(vma) of the memory-mapped file
task_struct
mm
files_struct
fd_array[]
file
f_inode
f_pos
f_mapping
.
.
file
inode
*i_mapping
i_atime
i_mtime
i_ctime
mnt
dentry
f_path
address_space
i_data
host
page_tree
i_mmap
page
mapping
index
radix_tree_root
height = 2
rnode
radix_tree_node
count = 2
63
0 1 …
page
2 3
radix_tree_node
count = 1
63
0 1 …
2 3
page page
slots[0]
slots[3]
slots[1] slots[3] slots[2]
index = 1 index = 3 index = 194
radix_tree_node
count = 1
63
0 1 …
2 3
Interval tree implemented
via red-black tree
Radix Tree (v4.19 or earlier)
Xarray (v4.20 or later)
files
mm
struct vm_area_struct *mmap
get_unmapped_area
mmap_base
page cache
vm_area_struct
vm_mm
vm_ops
vm_file
vm_area_struct
vm_file
.
.
pgd
1. i_mmap: Reverse mapping (RMAP) for the memory-mapped file (page cache) → check __vma_link_file ()
2. anon_vma: Reverse mapping for anonymous pages -> anon_vma_prepare() invoked during page fault
RMAP
mm_populate()
Anonymous Page: Page Fault – Discussion List
• Private (MAP_PRIATE)
• Write
• Read before a write
• Share (MAP_SHARED)
Demand Paging: page fault – Anonymous page (MAP_PRIVATE)
Demand Paging: page fault → Page table configuration - Anonymous page
(MAP_PRIVATE)
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PDE #511
PML4E #255
PML4E for
kernel
PDPTE #511
Physical Memory
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
Legend
Allocated pages or page table entry
Will be allocated if page fault occurs
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PDE #447
PTE #505 mmap
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Demand Paging: page fault → Page table configuration
[anonymous page: MAP_PRIVATE] page fault flow
man mmap
[gdb] backtrace
Demand Paging: page fault → Page table configuration - Anonymous page
(MAP_PRIVATE) [anonymous page: MAP_PRIVATE] page fault flow
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PDE #511
PML4E #255
PML4E for
kernel
PDPTE #511
Physical Memory
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
Legend
Allocated pages or page table entry
Will be allocated if page fault occurs
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PDE #447
PTE #505 mmap
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Demand Paging: page fault → Page table configuration - Anonymous page
(MAP_PRIVATE)
Demand Paging: page fault → call trace - Anonymous page (MAP_PRIVATE)
Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE)
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PDE #511
PML4E #255
PML4E for
kernel
PDPTE #511
Physical Memory
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PDE #447
PTE #505 mmap
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PDE #511
PML4E #255
PML4E for
kernel
PDPTE #511
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
Legend
Allocated pages or page table entry
Will be allocated if page fault occurs
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PDE #447
PTE #505
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE) –
Read before writing data (read fault)
Pre-allocated
zero page
Physical Memory
read fault
1. A pre-allocated zero page is initialized during system init -> Check init_zero_pfn()
2. Anonymous private page + read fault -> Link to the pre-allocated zero page
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PDE #511
PML4E #255
PML4E for
kernel
PDPTE #511
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
Legend
Allocated pages or page table entry
Will be allocated if page fault occurs
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PDE #447
PTE #505
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE) –
Read before writing data (read fault)
Pre-allocated
zero page
Physical Memory
read fault
1. A pre-allocated zero page is initialized during system init -> Check init_zero_pfn()
2. Anonymous private page + read fault -> Link to the pre-allocated zero page
Link to the pre-allocated zero page
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PDE #511
PML4E #255
PML4E for
kernel
PDPTE #511
Physical Memory
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
Legend
Allocated pages or page table entry
Will be allocated if page fault occurs
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PDE #447
PTE #505
Dedicated zero
page
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE) –
Read before writing data (read fault) -> Write fault
mmap
(anonymous)
1
Allocate a
zeroed page
2
Link to the newly
allocated page
write fault
Page fault handler for userspace address
Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE &
MAP_SHARED): first-time access
Anonymous page +
MAP_PRIVATE
Anonymous page +
MAP_SHARED + WRITE
Anonymous page +
MAP_SHARED + READ
[do_anonymous_page]
• Read fault: Apply the pre-allocated zero page
Demand Paging: page fault → VMAs – Anonymous page (MAP_SHARED):
first-time write
Demand Paging: page fault → VMAs – Anonymous page (MAP_SHARED):
first-time write
Anonymous page + MAP_SHARED
Types of memory mappings: update
Reference from: Chapter 49, The Linux Programming Interface
Description Backing file
Private
(Modification is not visible to other processes)
Initializing memory from contents of file
Example: Process's .text and .data segements
(Changes are not carried through to the underlying file)
Memory Allocation No
Shared
(Modification is visible to other processes)
1. Memory-mapped IO: Changes are carried through to
the underlying file
2. Sharing memory between processes (IPC)
Sharing memory between processes (IPC) /dev/zero
Mapping Type
Anonymous
File
Visibility of Modification
Memory-mapped file
vma->vm_ops: invoke vm_ops callbacks in page fault
handler
• struct vm_operations_struct
o fault : Invoked by page fault handler to read
the corresponding data into a physical page.
o map_pages : Map the page if it is in page
cache (check function ‘do_fault_around’:
warm/cold page cache).
o page_mkwrite: Notification that a previously
read-only page is about to become writable.
Memory-mapped file: related data structures
NULL: anonymous page
Page fault handler for userspace address: Memory-mapped file
Look at this firstly
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PML4E #255
PML4E for
kernel
PDPTE #511
Physical Memory
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
Legend
Allocated pages or page table entry
Will be allocated if page fault occurs
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PTE #505 mmap
(memory-mapped file
= page cache)
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Disk
PDE #511
PDE #447
Already in page cache:
file->f_mapping
1
2
Demand Paging: page fault → do_read_fault(): warm page cache
read page fault
Memory-mapped file
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PML4E #255
PML4E for
kernel
PDPTE #511
Physical Memory
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
Legend
Allocated pages or page table entry
Will be allocated if page fault occurs
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PTE #505 mmap
(memory-mapped file
= page cache)
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Disk
PDE #511
PDE #447
Already in page cache:
file->f_mapping
1
2
Demand Paging: page fault → do_read_fault(): warm page cache
read page fault
Memory-mapped file
Demand Paging: page fault → filemap_map_pages(): warm page cache
read page fault
1
2
map_pages callback: map a page cache (already available) to process address space
(process page table) → warm page cache
Memory-mapped file
Demand Paging: page fault → call path for warm/cold page cache
do_read_fault
alloc_set_pte
filemap_map_pages
vmf->vma->vm_ops->map_pages
do_fault_around
Find the page cache from
address_space (page cache pool)
__do_fault
vmf->vma->vm_ops->fault
ext4_filemap_fault
filemap_fault
do_sync_mmap_readahead
finish_fault
alloc_set_pte
warm page cache
cold page cache
find_get_page
do_async_mmap_readahead
No page in page cache
page in page cache
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PML4E #255
PML4E for
kernel
PDPTE #511
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PTE #505 mmap
(memory-mapped file
= page cache)
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Disk
PDE #511
PDE #447
Already in page cache:
file->f_mapping
1
2
warm page cache: make sure “page cache = mmap physical page” (1/3)
Physical Memory
Page Cache - Verification
page->mapping
• [bit 0] = 1: anonymous page → mapping field = anon_vma descriptor
• [bit 0] = 0: page cache → mapping field = address_space descriptor
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PML4E #255
PML4E for
kernel
PDPTE #511
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PTE #505 mmap
(memory-mapped file
= page cache)
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Disk
PDE #511
PDE #447
Already in page cache:
file->f_mapping
1
2
After alloc_set_pte()
After alloc_set_pte()
Physical Memory
<< PAGE_SHIFT
warm page cache: make sure “page cache = mmap physical page” (2/3)
Page Map
Level-4 Table
Sign-extend
Page Map
Level-4 Offset
30 21
39 38 29
47
48
63
Page Directory
Pointer Offset
Page Directory
Offset
Page Directory
Pointer Table
Page Directory
Table
PML4E #255
PML4E for
kernel
PDPTE #511
PTE #510
stack
task_struct
pgd
mm
mm_struct
mmap
PTE #509
…
PTE #478
PML4E #0
PDE #2
PDPTE #0
PTE #0
.text, .data, …
Linear Address: 0x7ffff7ff9000
PTE #188
…
PTE #190
PTE #199
…
heap
PTE #505 mmap
(memory-mapped file
= page cache)
12
20 11 0
Page Table
Page Directory
Pointer Offset
Page Directory Offset
Disk
PDE #511
PDE #447
Already in page cache:
file->f_mapping
1
2
After alloc_set_pte()
After alloc_set_pte()
Physical Memory
will generate a write fault for next write
warm page cache: make sure “page cache = mmap physical page” (3/3)
write fault (write-protected fault)
write fault
Memory-mapped file
do_wp_page
wp_page_copy
new_page = alloc_page_vma(…)
cow_user_page
copy_user_highpage
maybe_mkwrite
wp_page_shared
[MAP_PRIVATE] COW: Copy On Write
[MAP_SHARED] vma is (VM_WRITE|VM_SHARED)
write fault (write-protected fault): do_wp_page
write fault
[MAP_PRIVATE] COW: Quote from `man mmap`
Memory-mapped file
write fault (write-protected fault) – call path
write fault without previously reading (MAP_SHARED): do_shared_fault
write fault
Memory-mapped file
write fault without previously reading (MAP_SHARED): do_shared_fault
do_shared_fault
__do_fault
vmf->vma->vm_ops->fault
ext4_filemap_fault
filemap_fault
do_sync_mmap_readahead
finish_fault
alloc_set_pte
find_get_page
do_async_mmap_readahead
No page in page cache
page in page cache
do_page_mkwrite
fault_dirty_shared_page
write fault without previously reading (MAP_SHARED): do_shared_fault
do_shared_fault
__do_fault
vmf->vma->vm_ops->fault
ext4_filemap_fault
filemap_fault
do_sync_mmap_readahead
finish_fault
alloc_set_pte
find_get_page
do_async_mmap_readahead
No page in page cache
page in page cache: COW
do_page_mkwrite
fault_dirty_shared_page
write fault without previously reading (MAP_PRIVATE): do_cow_fault
write fault
Memory-mapped file
write fault without previously reading (MAP_PRIVATE): do_cow_fault
do_cow_fault
vmf->cow_page = alloc_page_vma(…)
__do_fault
vmf->vma->vm_ops->fault
ext4_filemap_fault
filemap_fault
do_sync_mmap_readahead
finish_fault
alloc_set_pte
find_get_page
do_async_mmap_readahead
No page in page cache
page in page cache
copy_user_highpage(vmf->cow_page, vmf->page, …)
1. Apply vmf->page if not a COW page
2. Apply vmf->cow_page if a COW page
Quote from `man mmap`
do_cow_fault
vmf->cow_page = alloc_page_vma(…)
__do_fault
vmf->vma->vm_ops->fault
ext4_filemap_fault
filemap_fault
do_sync_mmap_readahead
finish_fault
alloc_set_pte
find_get_page
do_async_mmap_readahead
No page in page cache
page in page cache: COW
copy_user_highpage(vmf->cow_page, vmf->page, …)
1. Apply vmf->page if not a COW page
2. Apply vmf->cow_page if a COW page
1
2
3
4
breakpoint
write fault without previously reading (MAP_PRIVATE): do_cow_fault
do_cow_fault
vmf->cow_page = alloc_page_vma(…)
__do_fault
vmf->vma->vm_ops->fault
ext4_filemap_fault
filemap_fault
do_sync_mmap_readahead
finish_fault
alloc_set_pte
find_get_page
do_async_mmap_readahead
No page in page cache
page in page cache: COW
copy_user_highpage(vmf->cow_page, vmf->page, …)
1. Apply vmf->page if not a COW page
2. Apply vmf->cow_page if a COW page
1
2
3
4
breakpoint
write fault without previously reading (MAP_PRIVATE): do_cow_fault
2
do_cow_fault
vmf->cow_page = alloc_page_vma(…)
__do_fault
vmf->vma->vm_ops->fault
ext4_filemap_fault
filemap_fault
do_sync_mmap_readahead
finish_fault
alloc_set_pte
find_get_page
do_async_mmap_readahead
No page in page cache
page in page cache: COW
copy_user_highpage(vmf->cow_page, vmf->page, …)
1. Apply vmf->page if not a COW page
2. Apply vmf->cow_page if a COW page
1
2
3
4
breakpoint
write fault without previously reading (MAP_PRIVATE): do_cow_fault
[Recap] Think about this: #1
write fault
read fault
1
1
2
2
Memory-mapped file
Think about this: #2
fork(): Which function (do_cow_fault or do_wp_page)
is called when COW is triggered?
Think about this…
this one?
or, this one?
fork(): COW mapping – set ‘write protect’ for src/dst PTEs
fork(): COW mapping – set ‘write protect’ for src/dst PTEs
fork(): COW Fault Call Path
Backup
vdso & vvar
• vsyscall (Virtual System Call)
o The context switch overhead (user <-> kernel) of some system calls (gettimeofday, time, getcpu) is greater than
execution time of those functions: Built on top of the fixed-mapped address
o Machine code format
o Core dump: debugger cannot provide the debugging info because symbols of this area are unavailable
• vDSO (Virtual Dynamic Shared Object)
• ELF format
• A small shared library that the kernel automatically maps into the address space of all userspace applications
• VVAR (vDSO Variable)

Contenu connexe

Tendances

Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory Management
Ni Zo-Ma
 

Tendances (20)

malloc & vmalloc in Linux
malloc & vmalloc in Linuxmalloc & vmalloc in Linux
malloc & vmalloc in Linux
 
Slab Allocator in Linux Kernel
Slab Allocator in Linux KernelSlab Allocator in Linux Kernel
Slab Allocator in Linux Kernel
 
Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introduction
 
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
 
Anatomy of the loadable kernel module (lkm)
Anatomy of the loadable kernel module (lkm)Anatomy of the loadable kernel module (lkm)
Anatomy of the loadable kernel module (lkm)
 
semaphore & mutex.pdf
semaphore & mutex.pdfsemaphore & mutex.pdf
semaphore & mutex.pdf
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File System
 
Memory Compaction in Linux Kernel.pdf
Memory Compaction in Linux Kernel.pdfMemory Compaction in Linux Kernel.pdf
Memory Compaction in Linux Kernel.pdf
 
Vmlinux: anatomy of bzimage and how x86 64 processor is booted
Vmlinux: anatomy of bzimage and how x86 64 processor is bootedVmlinux: anatomy of bzimage and how x86 64 processor is booted
Vmlinux: anatomy of bzimage and how x86 64 processor is booted
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)
 
Linux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKBLinux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKB
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)
 
Memory management in Linux kernel
Memory management in Linux kernelMemory management in Linux kernel
Memory management in Linux kernel
 
COSCUP 2020 RISC-V 32 bit linux highmem porting
COSCUP 2020 RISC-V 32 bit linux highmem portingCOSCUP 2020 RISC-V 32 bit linux highmem porting
COSCUP 2020 RISC-V 32 bit linux highmem porting
 
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtKernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory Management
 
Linux memory-management-kamal
Linux memory-management-kamalLinux memory-management-kamal
Linux memory-management-kamal
 
Network Drivers
Network DriversNetwork Drivers
Network Drivers
 

Similaire à Memory Mapping Implementation (mmap) in Linux Kernel

Os8 2
Os8 2Os8 2
Os8 2
issbp
 
Windows memory manager internals
Windows memory manager internalsWindows memory manager internals
Windows memory manager internals
Sisimon Soman
 
Chapter 04
Chapter 04Chapter 04
Chapter 04
Google
 
Assignment of SOS operating systemThe file lmemman.c has one incom.pdf
Assignment of SOS operating systemThe file lmemman.c has one incom.pdfAssignment of SOS operating systemThe file lmemman.c has one incom.pdf
Assignment of SOS operating systemThe file lmemman.c has one incom.pdf
sktambifortune
 

Similaire à Memory Mapping Implementation (mmap) in Linux Kernel (20)

memory_mapping.ppt
memory_mapping.pptmemory_mapping.ppt
memory_mapping.ppt
 
Memory
MemoryMemory
Memory
 
Experience on porting HIGHMEM and KASAN to RISC-V at COSCUP 2020
Experience on porting HIGHMEM and KASAN to RISC-V at COSCUP 2020Experience on porting HIGHMEM and KASAN to RISC-V at COSCUP 2020
Experience on porting HIGHMEM and KASAN to RISC-V at COSCUP 2020
 
Linux memory
Linux memoryLinux memory
Linux memory
 
Os8 2
Os8 2Os8 2
Os8 2
 
Windows memory manager internals
Windows memory manager internalsWindows memory manager internals
Windows memory manager internals
 
15 bufferand records
15 bufferand records15 bufferand records
15 bufferand records
 
Memory Management
Memory ManagementMemory Management
Memory Management
 
memory.ppt
memory.pptmemory.ppt
memory.ppt
 
Page Cache in Linux 2.6.pdf
Page Cache in Linux 2.6.pdfPage Cache in Linux 2.6.pdf
Page Cache in Linux 2.6.pdf
 
Chapter 04
Chapter 04Chapter 04
Chapter 04
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
 
INFLOW-2014-NVM-Compression
INFLOW-2014-NVM-CompressionINFLOW-2014-NVM-Compression
INFLOW-2014-NVM-Compression
 
Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS Scheduler
 
Assignment of SOS operating systemThe file lmemman.c has one incom.pdf
Assignment of SOS operating systemThe file lmemman.c has one incom.pdfAssignment of SOS operating systemThe file lmemman.c has one incom.pdf
Assignment of SOS operating systemThe file lmemman.c has one incom.pdf
 
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
 
Linux Huge Pages
Linux Huge PagesLinux Huge Pages
Linux Huge Pages
 
cPanelCon 2015: InnoDB Alchemy
cPanelCon 2015: InnoDB AlchemycPanelCon 2015: InnoDB Alchemy
cPanelCon 2015: InnoDB Alchemy
 
XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...
XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...
XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 

Dernier

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Dernier (20)

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 

Memory Mapping Implementation (mmap) in Linux Kernel

  • 1. Memory Mapping Implementation (mmap) in Linux Kernel Adrian Huang | May, 2022 * Based on kernel 5.11 (x86_64) – QEMU * SMP (4 CPUs) and 8GB memory * Kernel parameter: nokaslr norandmaps * Userspace: ASLR is disabled * Legacy BIOS
  • 2. Agenda • Three different IO types • Process Address Space – mm_struct & VMA • Four types of memory mappings • mmap system call implementation • Demand page: page fault handling for four types of memory mappings • fork() • COW mapping configuration: set ‘write protect’ when calling fork() • COW fault call path
  • 3. Three different IO types • Buffered IO o Leverage page cache • Direct IO o Bypass page cache • Memory-mapped IO (file-based mapping, memory-mapped file) • File content can be accessed by operations on the bytes in the corresponding memory region.
  • 4. Process Address Space – mm_struct & VMA (1/2)
  • 5. Process Address Space – mm_struct & VMA (2/2)
  • 6. Four types of memory mappings Reference from: Chapter 49, The Linux Programming Interface File Anonymous Private (Modification is not visible to other processes) Initializing memory from contents of file Example: Process's .text and .data segements (Changes are not carried through to the underlying file) Memory Allocation Shared (Modification is visible to other processes) 1. Memory-mapped IO: Changes are carried through to the underlying file 2. Sharing memory between processes (IPC) Sharing memory between processes (IPC) Visibility of Modification Mapping Type
  • 7. mmap system call implementation (1/2)
  • 8. mmap system call implementation (2/2) 1. thp_get_unmapped_area() eventually calls arch_get_unmapped_area_topdown() 2. unmapped_area_topdown() is the key point for allocating an userspace address
  • 9. inode_operations & file_operations: file thp_get_unmapped_area(): DAX supported for huge page
  • 11. vma vm_start = 0x400000 vm_end = 0x401000 vma vm_start = 0x401000 vm_end = 0x496000 vma vm_start = 0x496000 vm_end = 0x4bd000 vma vm_start = 0x4be000 vm_end = 0x4c1000 vma vm_start = 0x4c1000 vm_end = 0x4c4000 vma (heap) vm_start = 0x4c4000 vm_end = 0x4e8000 vma (vvar) vm_start = 0x7ffff7ffa000 vm_end = 0x7ffff7ffe000 vma (vdso) vm_start = 0x7ffff7ffe000 vm_end = 0x7ffff7fff000 vma (stack) vm_start = 0x7ffffffde000 vm_end = 0x7ffffffff000 GAP GAP G A P GAP user space address 0 0x7ffffffff000 Process address space: Doubly linked-list
  • 13. Process address space: VMA rbtree -> rb_subtree_gap • Maximum of the following items: o gap between this vma and previous one o rb_left.subtree_gap o rb_right.subtree_gap rb_subtree_gap calculation
  • 14. Process address space: VMA rbtree -> rb_subtree_gap • Maximum of the following items: o gap between this vma and previous one o rb_left.subtree_gap o rb_right.subtree_gap rb_subtree_gap calculation max(0x7ffff7ffa000 – 0x4e8000, 0, 0) = 0x7fff7b12000
  • 15. Process address space: VMA rbtree -> rb_subtree_gap • VM_GROWSDOWN is set in vma->vm_flags • max(0x7ffffffde000 – 0x7ffff7fff000 - stack_guard_gap, 0, 0) = 0x7edf000 • stack_guard_gap = 256UL<<PAGE_SHIFT
  • 17. unmapped_area_topdown() - Principle vm_unmapped_area_info length = 0x1000 low_limit = PAGE_SIZE high_limit = mm->mmap_base = 0x7ffff7fff000 gap_end = info->high_limit (0x7ffff7fff000) high_limit = gap_end – info->length = 0x7ffff7ffe000 gap_start = mm->highest_vm_end = 0x7ffffffff000 unmapped_area_topdown(): Preparation [Efficiency] Traverse vma rbtree to find gap space instead of traversing vma linked list
  • 22. unmapped_area_topdown() Stack guard gap gap_end = 0x7ffffffde000 – 0x100000 = 0x7fffffede000 * stack_guard_gap = 256UL<<PAGE_SHIFT = 0x100000
  • 27. vma_merge() – Possible ways to merge Note: VMAs must have the same attributes
  • 28. mmap_region()->munmap_vma_range()->find_vma_links find_vma_links(…, 0x7ffff7ff9000, …) Iterate rbtree to find an empty VMA link (rb_link) • rb_parent: The parent of rb_link • rb_prev: Previous vma of rb_parent • rb_prev is used in vma_merge() • rb_link, rb_parent and rb_prev are used in vma_link(). • rb_prev is the previous VMA after the new VMA is inserted. Note
  • 30. __vma_link_file(): i_mmap for recording all memory mappings (vma) of the memory-mapped file task_struct mm files_struct fd_array[] file f_inode f_pos f_mapping . . file inode *i_mapping i_atime i_mtime i_ctime mnt dentry f_path address_space i_data host page_tree i_mmap page mapping index radix_tree_root height = 2 rnode radix_tree_node count = 2 63 0 1 … page 2 3 radix_tree_node count = 1 63 0 1 … 2 3 page page slots[0] slots[3] slots[1] slots[3] slots[2] index = 1 index = 3 index = 194 radix_tree_node count = 1 63 0 1 … 2 3 Interval tree implemented via red-black tree Radix Tree (v4.19 or earlier) Xarray (v4.20 or later) files mm struct vm_area_struct *mmap get_unmapped_area mmap_base page cache vm_area_struct vm_mm vm_ops vm_file vm_area_struct vm_file . . pgd 1. i_mmap: Reverse mapping (RMAP) for the memory-mapped file (page cache) → check __vma_link_file () 2. anon_vma: Reverse mapping for anonymous pages -> anon_vma_prepare() invoked during page fault RMAP
  • 32. Anonymous Page: Page Fault – Discussion List • Private (MAP_PRIATE) • Write • Read before a write • Share (MAP_SHARED)
  • 33. Demand Paging: page fault – Anonymous page (MAP_PRIVATE)
  • 34. Demand Paging: page fault → Page table configuration - Anonymous page (MAP_PRIVATE) Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PDE #511 PML4E #255 PML4E for kernel PDPTE #511 Physical Memory PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 Legend Allocated pages or page table entry Will be allocated if page fault occurs PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PDE #447 PTE #505 mmap 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset
  • 35. Demand Paging: page fault → Page table configuration [anonymous page: MAP_PRIVATE] page fault flow
  • 36. man mmap [gdb] backtrace Demand Paging: page fault → Page table configuration - Anonymous page (MAP_PRIVATE) [anonymous page: MAP_PRIVATE] page fault flow
  • 37. Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PDE #511 PML4E #255 PML4E for kernel PDPTE #511 Physical Memory PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 Legend Allocated pages or page table entry Will be allocated if page fault occurs PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PDE #447 PTE #505 mmap 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset Demand Paging: page fault → Page table configuration - Anonymous page (MAP_PRIVATE)
  • 38. Demand Paging: page fault → call trace - Anonymous page (MAP_PRIVATE)
  • 39. Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE) Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PDE #511 PML4E #255 PML4E for kernel PDPTE #511 Physical Memory PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PDE #447 PTE #505 mmap 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset
  • 40. Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PDE #511 PML4E #255 PML4E for kernel PDPTE #511 PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 Legend Allocated pages or page table entry Will be allocated if page fault occurs PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PDE #447 PTE #505 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE) – Read before writing data (read fault) Pre-allocated zero page Physical Memory read fault 1. A pre-allocated zero page is initialized during system init -> Check init_zero_pfn() 2. Anonymous private page + read fault -> Link to the pre-allocated zero page
  • 41. Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PDE #511 PML4E #255 PML4E for kernel PDPTE #511 PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 Legend Allocated pages or page table entry Will be allocated if page fault occurs PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PDE #447 PTE #505 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE) – Read before writing data (read fault) Pre-allocated zero page Physical Memory read fault 1. A pre-allocated zero page is initialized during system init -> Check init_zero_pfn() 2. Anonymous private page + read fault -> Link to the pre-allocated zero page Link to the pre-allocated zero page
  • 42. Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PDE #511 PML4E #255 PML4E for kernel PDPTE #511 Physical Memory PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 Legend Allocated pages or page table entry Will be allocated if page fault occurs PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PDE #447 PTE #505 Dedicated zero page 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE) – Read before writing data (read fault) -> Write fault mmap (anonymous) 1 Allocate a zeroed page 2 Link to the newly allocated page write fault
  • 43. Page fault handler for userspace address
  • 44. Demand Paging: page fault → VMAs – Anonymous page (MAP_PRIVATE & MAP_SHARED): first-time access Anonymous page + MAP_PRIVATE Anonymous page + MAP_SHARED + WRITE Anonymous page + MAP_SHARED + READ [do_anonymous_page] • Read fault: Apply the pre-allocated zero page
  • 45. Demand Paging: page fault → VMAs – Anonymous page (MAP_SHARED): first-time write
  • 46. Demand Paging: page fault → VMAs – Anonymous page (MAP_SHARED): first-time write Anonymous page + MAP_SHARED
  • 47. Types of memory mappings: update Reference from: Chapter 49, The Linux Programming Interface Description Backing file Private (Modification is not visible to other processes) Initializing memory from contents of file Example: Process's .text and .data segements (Changes are not carried through to the underlying file) Memory Allocation No Shared (Modification is visible to other processes) 1. Memory-mapped IO: Changes are carried through to the underlying file 2. Sharing memory between processes (IPC) Sharing memory between processes (IPC) /dev/zero Mapping Type Anonymous File Visibility of Modification
  • 48. Memory-mapped file vma->vm_ops: invoke vm_ops callbacks in page fault handler • struct vm_operations_struct o fault : Invoked by page fault handler to read the corresponding data into a physical page. o map_pages : Map the page if it is in page cache (check function ‘do_fault_around’: warm/cold page cache). o page_mkwrite: Notification that a previously read-only page is about to become writable.
  • 49. Memory-mapped file: related data structures NULL: anonymous page
  • 50. Page fault handler for userspace address: Memory-mapped file Look at this firstly
  • 51. Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PML4E #255 PML4E for kernel PDPTE #511 Physical Memory PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 Legend Allocated pages or page table entry Will be allocated if page fault occurs PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PTE #505 mmap (memory-mapped file = page cache) 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset Disk PDE #511 PDE #447 Already in page cache: file->f_mapping 1 2 Demand Paging: page fault → do_read_fault(): warm page cache read page fault Memory-mapped file
  • 52. Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PML4E #255 PML4E for kernel PDPTE #511 Physical Memory PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 Legend Allocated pages or page table entry Will be allocated if page fault occurs PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PTE #505 mmap (memory-mapped file = page cache) 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset Disk PDE #511 PDE #447 Already in page cache: file->f_mapping 1 2 Demand Paging: page fault → do_read_fault(): warm page cache read page fault Memory-mapped file
  • 53. Demand Paging: page fault → filemap_map_pages(): warm page cache read page fault 1 2 map_pages callback: map a page cache (already available) to process address space (process page table) → warm page cache Memory-mapped file
  • 54. Demand Paging: page fault → call path for warm/cold page cache do_read_fault alloc_set_pte filemap_map_pages vmf->vma->vm_ops->map_pages do_fault_around Find the page cache from address_space (page cache pool) __do_fault vmf->vma->vm_ops->fault ext4_filemap_fault filemap_fault do_sync_mmap_readahead finish_fault alloc_set_pte warm page cache cold page cache find_get_page do_async_mmap_readahead No page in page cache page in page cache
  • 55. Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PML4E #255 PML4E for kernel PDPTE #511 PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PTE #505 mmap (memory-mapped file = page cache) 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset Disk PDE #511 PDE #447 Already in page cache: file->f_mapping 1 2 warm page cache: make sure “page cache = mmap physical page” (1/3) Physical Memory Page Cache - Verification page->mapping • [bit 0] = 1: anonymous page → mapping field = anon_vma descriptor • [bit 0] = 0: page cache → mapping field = address_space descriptor
  • 56. Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PML4E #255 PML4E for kernel PDPTE #511 PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PTE #505 mmap (memory-mapped file = page cache) 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset Disk PDE #511 PDE #447 Already in page cache: file->f_mapping 1 2 After alloc_set_pte() After alloc_set_pte() Physical Memory << PAGE_SHIFT warm page cache: make sure “page cache = mmap physical page” (2/3)
  • 57. Page Map Level-4 Table Sign-extend Page Map Level-4 Offset 30 21 39 38 29 47 48 63 Page Directory Pointer Offset Page Directory Offset Page Directory Pointer Table Page Directory Table PML4E #255 PML4E for kernel PDPTE #511 PTE #510 stack task_struct pgd mm mm_struct mmap PTE #509 … PTE #478 PML4E #0 PDE #2 PDPTE #0 PTE #0 .text, .data, … Linear Address: 0x7ffff7ff9000 PTE #188 … PTE #190 PTE #199 … heap PTE #505 mmap (memory-mapped file = page cache) 12 20 11 0 Page Table Page Directory Pointer Offset Page Directory Offset Disk PDE #511 PDE #447 Already in page cache: file->f_mapping 1 2 After alloc_set_pte() After alloc_set_pte() Physical Memory will generate a write fault for next write warm page cache: make sure “page cache = mmap physical page” (3/3)
  • 58. write fault (write-protected fault) write fault Memory-mapped file
  • 59. do_wp_page wp_page_copy new_page = alloc_page_vma(…) cow_user_page copy_user_highpage maybe_mkwrite wp_page_shared [MAP_PRIVATE] COW: Copy On Write [MAP_SHARED] vma is (VM_WRITE|VM_SHARED) write fault (write-protected fault): do_wp_page write fault [MAP_PRIVATE] COW: Quote from `man mmap` Memory-mapped file
  • 60. write fault (write-protected fault) – call path
  • 61. write fault without previously reading (MAP_SHARED): do_shared_fault write fault Memory-mapped file
  • 62. write fault without previously reading (MAP_SHARED): do_shared_fault do_shared_fault __do_fault vmf->vma->vm_ops->fault ext4_filemap_fault filemap_fault do_sync_mmap_readahead finish_fault alloc_set_pte find_get_page do_async_mmap_readahead No page in page cache page in page cache do_page_mkwrite fault_dirty_shared_page
  • 63. write fault without previously reading (MAP_SHARED): do_shared_fault do_shared_fault __do_fault vmf->vma->vm_ops->fault ext4_filemap_fault filemap_fault do_sync_mmap_readahead finish_fault alloc_set_pte find_get_page do_async_mmap_readahead No page in page cache page in page cache: COW do_page_mkwrite fault_dirty_shared_page
  • 64. write fault without previously reading (MAP_PRIVATE): do_cow_fault write fault Memory-mapped file
  • 65. write fault without previously reading (MAP_PRIVATE): do_cow_fault do_cow_fault vmf->cow_page = alloc_page_vma(…) __do_fault vmf->vma->vm_ops->fault ext4_filemap_fault filemap_fault do_sync_mmap_readahead finish_fault alloc_set_pte find_get_page do_async_mmap_readahead No page in page cache page in page cache copy_user_highpage(vmf->cow_page, vmf->page, …) 1. Apply vmf->page if not a COW page 2. Apply vmf->cow_page if a COW page Quote from `man mmap`
  • 66. do_cow_fault vmf->cow_page = alloc_page_vma(…) __do_fault vmf->vma->vm_ops->fault ext4_filemap_fault filemap_fault do_sync_mmap_readahead finish_fault alloc_set_pte find_get_page do_async_mmap_readahead No page in page cache page in page cache: COW copy_user_highpage(vmf->cow_page, vmf->page, …) 1. Apply vmf->page if not a COW page 2. Apply vmf->cow_page if a COW page 1 2 3 4 breakpoint write fault without previously reading (MAP_PRIVATE): do_cow_fault
  • 67. do_cow_fault vmf->cow_page = alloc_page_vma(…) __do_fault vmf->vma->vm_ops->fault ext4_filemap_fault filemap_fault do_sync_mmap_readahead finish_fault alloc_set_pte find_get_page do_async_mmap_readahead No page in page cache page in page cache: COW copy_user_highpage(vmf->cow_page, vmf->page, …) 1. Apply vmf->page if not a COW page 2. Apply vmf->cow_page if a COW page 1 2 3 4 breakpoint write fault without previously reading (MAP_PRIVATE): do_cow_fault 2
  • 68. do_cow_fault vmf->cow_page = alloc_page_vma(…) __do_fault vmf->vma->vm_ops->fault ext4_filemap_fault filemap_fault do_sync_mmap_readahead finish_fault alloc_set_pte find_get_page do_async_mmap_readahead No page in page cache page in page cache: COW copy_user_highpage(vmf->cow_page, vmf->page, …) 1. Apply vmf->page if not a COW page 2. Apply vmf->cow_page if a COW page 1 2 3 4 breakpoint write fault without previously reading (MAP_PRIVATE): do_cow_fault
  • 69. [Recap] Think about this: #1 write fault read fault 1 1 2 2 Memory-mapped file
  • 70. Think about this: #2 fork(): Which function (do_cow_fault or do_wp_page) is called when COW is triggered? Think about this… this one? or, this one?
  • 71. fork(): COW mapping – set ‘write protect’ for src/dst PTEs
  • 72. fork(): COW mapping – set ‘write protect’ for src/dst PTEs
  • 73. fork(): COW Fault Call Path
  • 75. vdso & vvar • vsyscall (Virtual System Call) o The context switch overhead (user <-> kernel) of some system calls (gettimeofday, time, getcpu) is greater than execution time of those functions: Built on top of the fixed-mapped address o Machine code format o Core dump: debugger cannot provide the debugging info because symbols of this area are unavailable • vDSO (Virtual Dynamic Shared Object) • ELF format • A small shared library that the kernel automatically maps into the address space of all userspace applications • VVAR (vDSO Variable)