Copyright © 2017 Red Hat Inc.
20 years of Linux Virtual Memory
Red Hat, Inc.
Andrea Arcangeli <aarcange at redhat.com>
Kernel Recipes, Paris
29 Sep 2017
Topics
● Milestones in the evolution of the Virtual Memory subsystem
● Latest Virtual Memory innovations
– Automatic NUMA balancing
– THP developments
– KSMscale
– userfaultfd
● Postcopy live migration, etc.
Virtual Memory (simplified)
[Diagram: virtual pages cost “nothing” and are practically unlimited on 64bit archs; physical pages cost money, they are the RAM; the arrows between them are the pagetables, the virtual to physical mapping]
PageTables
[Diagram: the pagetable tree, pgd → pud → pmd → pte, with 2^9 = 512 arrows out of each node, not just 2]
● Common code and x86 pagetable format is a tree
● All pagetables are 4KB in size
● Total: grep PageTables /proc/meminfo
● (((2**9)**4)*4096)>>48 = 1 → 48 bits → 256 TiB
● 5 levels in v4.13 → (48+9) bits → 57 bits → 128 PiB
– Chosen at build time to avoid slowdown
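The pagetable arithmetic above can be checked directly. A minimal sketch (the 2**9 fan-out falls out of 4KiB tables holding 512 eight-byte entries):

```python
ENTRIES = 2 ** 9        # 512 eight-byte entries per 4KiB pagetable
PAGE_SIZE = 4096

def address_space_bits(levels):
    """9 bits of index per pagetable level + 12 bits of page offset."""
    return levels * 9 + 12

assert ENTRIES * 8 == PAGE_SIZE                 # one pagetable fills one page
assert (ENTRIES ** 4) * PAGE_SIZE == 2 ** 48    # 4 levels -> 256 TiB
assert address_space_bits(4) == 48
assert address_space_bits(5) == 57              # 5 levels in v4.13 -> 128 PiB
assert 2 ** address_space_bits(5) == 128 * 2 ** 50
```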
The Fabric of the Virtual Memory
● The fabric is all those data structures that connect to the hardware-constrained structures like pagetables, and that collectively create all the software abstractions we're accustomed to
– tasks, processes, virtual memory areas, mmap (glibc malloc) ...
● The fabric is the most black and white part of the Virtual Memory
● The algorithms doing the computations on those data structures are the Virtual Memory heuristics
– They need to solve hard problems with no guaranteed perfect solution
– i.e. when it's the right time to start to unmap pages (swappiness)
● Some of the design didn't change: we still measure how hard it is to free memory while we're trying to free it
● All free memory is used as cache and we overcommit by default (though not excessively by default)
– Android uses: echo 1 > /proc/sys/vm/overcommit_memory
Physical page and struct page
[Diagram: every PAGE_SIZE (4KiB) physical page is described by a 64-byte struct page in the mem_map array covering physical RAM: 64/4096 = 1.56% of the RAM]
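The 1.56% figure is just the ratio of the struct page size to the page it describes. A quick check (the 256 GiB machine is an arbitrary example):

```python
STRUCT_PAGE = 64        # bytes per struct page in mem_map
PAGE_SIZE = 4096        # bytes per physical page

overhead = STRUCT_PAGE / PAGE_SIZE
print(f"{overhead:.2%} of RAM")                 # → 1.56% of RAM

# e.g. on a 256 GiB machine:
ram = 256 * 2 ** 30
pages = ram // PAGE_SIZE
print(pages, "struct pages =", pages * STRUCT_PAGE // 2 ** 30, "GiB of mem_map")
```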
MM & VMA
● mm_struct aka MM
– Memory of the process
● Shared by threads
● vm_area_struct aka VMA
– Virtual Memory Area
● Created and torn down by mmap and munmap
● Defines the virtual address space of an “MM”
Original bitmap...
Active and Inactive list LRU
● The active page LRU preserves the active memory working set
– only the inactive LRU loses information as fast as use-once I/O goes
– Introduced in 2001, it works well enough even with an arbitrary balance
– The optimum active/inactive balancing algorithm was solved in 2012-2014
● shadow radix tree nodes that detect re-faults
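The split-list behavior described above can be sketched in a toy model (caps and promotion policy are simplifications, not the kernel's actual heuristics): pages enter the inactive list on first touch, and only a re-reference while still on the inactive list promotes them, so a use-once stream cannot evict the working set.

```python
from collections import OrderedDict

def simulate_lru(accesses, inactive_cap=4, active_cap=4):
    """Toy model of the split active/inactive LRU."""
    inactive, active = OrderedDict(), OrderedDict()
    for page in accesses:
        if page in active:
            active.move_to_end(page)          # still hot
        elif page in inactive:
            del inactive[page]
            active[page] = True               # promotion on re-reference
            if len(active) > active_cap:      # demote the oldest active page
                demoted, _ = active.popitem(last=False)
                inactive[demoted] = True
        else:
            inactive[page] = True             # first touch: inactive list only
        if len(inactive) > inactive_cap:      # reclaim from the inactive tail
            inactive.popitem(last=False)
    return set(active), set(inactive)

# A working set (A, B) survives a use-once streaming scan of 20 pages:
active, inactive = simulate_lru(["A", "B", "A", "B"] + list(range(20)))
```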
Active and Inactive list LRU
$ grep -i active /proc/meminfo
Active: 3555744 kB
Inactive: 2511156 kB
Active(anon): 2286400 kB
Inactive(anon): 1472540 kB
Active(file): 1269344 kB
Inactive(file): 1038616 kB
Active LRU working-set detection
● From Johannes Weiner
fault ------------------------+
|
+--------------+ | +-------------+
reclaim <- T inactive H <-+-- demotion T active H <--+
+--------------+ +-------------+ |
| |
+-------------- promotion ------------------+
H = Head, T = Tail
+-memory available to cache-+
| |
+-inactive------+-active----+
a b | c d e f g h i | J K L M N |
+---------------+-----------+
lru→inactive_age and radix tree shadow entries
[Diagram: per-inode/file radix trees; the radix_tree root and nodes point down to the pages currently in the page cache]
Reclaim saving inactive_age
[Diagram: on reclaim from the memcg LRU, the page pointer in the radix tree is replaced with an exceptional/shadow entry recording the current inactive_age, and inactive_age++]
Object-based reverse mapping
[Diagram: each physical page points to its anon_vma (anonymous memory) or inode (filesystem memory, objrmap); from there rmap items and a prio_tree of VMAs lead back to every virtual mapping; KSM pages have their own reverse mapping]
Active & inactive + rmap
[Diagram: mem_map struct pages sit on the inactive_lru (use-once pages to trash) or the active_lru (use-many pages); rmap connects each page back to the process virtual memory mapping it]
Many more LRUs
● Separate LRUs for anon and file backed mappings
● Memcg (memory cgroups) introduced per-memcg LRUs
● Removal of unfreeable pages from the LRUs
– anonymous memory with no swap
– mlocked memory
● Transparent Hugepages in the LRU increase scalability further (LRU size decreased 512 times)
Recent Virtual Memory trends
● Optimizing the workloads for you, without manual tuning
– NUMA hard bindings (numactl) → Automatic NUMA Balancing
– Hugetlbfs → Transparent Hugepage
– Programs or Virtual Machines duplicating memory → KSM
– Page pinning (RDMA/KVM shadow MMU) → MMU notifier
– Private device memory managed by hand and pinned → HMM/UVM (unified virtual memory) for GPUs seamlessly computing in GPU memory
● The optimizations can be optionally disabled
Automatic NUMA Balancing benchmark
Intel SandyBridge (Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz)
2 Sockets – 32 Cores with Hyperthreads
256G Memory
RHEV 3.6
Host bare metal – 3.10.0-327.el7 (RHEL7.2)
VM guest – 3.10.0-324.el7 (RHEL7.2)
VM – 32P, 160G (Optimized for Server)
Storage – Violin 6616 – 16G Fibre Channel
Oracle – 12c, 128G SGA
Test – Running an Oracle OLTP workload with increasing user count, measuring Trans/min for each run as the comparison metric
[Chart: OLTP Trans/min vs user count (10U–100U) for 4 VMs with different NUMA options: Auto NUMA off, Auto NUMA on with no pinning, Auto NUMA on with NUMA pinning]
Automatic NUMA balancing configuration
● https://tinyurl.com/zupp9v3
https://access.redhat.com/
● In RHEL7 Automatic NUMA balancing is enabled when:
# numactl --hardware shows multiple nodes
● To disable automatic NUMA balancing:
# echo 0 > /proc/sys/kernel/numa_balancing
● To enable automatic NUMA balancing:
# echo 1 > /proc/sys/kernel/numa_balancing
● At boot:
numa_balancing=enable|disable
Hugepages
● Traditionally x86 hardware gave us 4KiB pages
● The more memory, the bigger the overhead of managing 4KiB pages
● What if you had bigger pages?
– 512 times bigger → 2MiB
PageTables
[Diagram repeated: the pagetable tree, pgd → pud → pmd → pte, with 2^9 = 512 arrows out of each node, not just 2]
Benefit of hugepages
● Improve CPU performance
– Enlarge TLB reach (essential for KVM)
– Speed up TLB misses (essential for KVM)
● Need 3 accesses to memory instead of 4 to refill the TLB
– Faster to allocate memory initially (minor)
– Page colouring inside the hugepage (minor)
– Higher scalability of the page LRUs
● Cons
– clear_page/copy_page less cache friendly
● “Clear faulting subpage last” included in v4.13, from Andi Kleen and Ying Huang
– 28.3% increase in vm-scalability anon-w-seq
– Sometimes higher memory footprint
– Direct compaction takes time
TLB miss cost: number of accesses to memory
[Chart, 0 to 25 accesses per TLB miss: bare metal (novirt) with THP on/off vs virtualization with each host/guest THP on/off combination]
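The chart's scale is consistent with the standard radix-walk arithmetic. A sketch, assuming the usual model for two-dimensional walks under nested paging (each of the guest's pagetable accesses, plus the final guest-physical address, must itself be translated through the host pagetables, giving n*m + n + m accesses); the exact per-platform counts are an assumption:

```python
def tlb_miss_cost(levels):
    """Bare metal: one memory access per pagetable level."""
    return levels

def nested_tlb_miss_cost(guest_levels, host_levels):
    """Nested paging (EPT/NPT): a two-dimensional pagetable walk."""
    return guest_levels * host_levels + guest_levels + host_levels

assert tlb_miss_cost(4) == 4                 # 4-level walk, 4KiB pages
assert tlb_miss_cost(3) == 3                 # THP: the walk ends at the pmd
assert nested_tlb_miss_cost(4, 4) == 24      # virt, THP off in host and guest
assert nested_tlb_miss_cost(3, 3) == 15      # virt, THP on in host and guest
```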
Transparent Hugepage design
● How do we get the benefits of hugetlbfs without having to configure anything?
● Any Linux process will receive 2M pages
– if the mmap region is 2M naturally aligned
– if compaction succeeds in producing hugepages
● Entirely transparent to userland
THP sysfs enabled
● /sys/kernel/mm/transparent_hugepage/enabled
– [always] madvise never
● Always use THP if vma start/end permits
– always [madvise] never
● Use THP only inside MADV_HUGEPAGE
● Applies to khugepaged too
– always madvise [never]
● Never use THP
● khugepaged quits
● Default selected at build time
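These multi-choice sysfs files mark the active value with brackets; a tiny helper (hypothetical, not a kernel interface) extracts it:

```python
def active_choice(sysfs_text):
    """Return the bracketed (active) value from a sysfs multi-choice
    file such as transparent_hugepage/enabled or /defrag."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active choice found")

print(active_choice("[always] madvise never"))                       # → always
print(active_choice("always defer [defer+madvise] madvise never"))   # → defer+madvise
```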
THP defrag - compaction control
● /sys/kernel/mm/transparent_hugepage/defrag
– [always] defer defer+madvise madvise never
● Always use direct compaction (ideal for long lived allocations)
– always [defer] defer+madvise madvise never
● Defer compaction asynchronously (kswapd/kcompactd)
– always defer [defer+madvise] madvise never
● Direct compaction in MADV_HUGEPAGE, async otherwise
– always defer defer+madvise [madvise] never
● Use direct compaction only inside MADV_HUGEPAGE
● khugepaged stays enabled in the background
– always defer defer+madvise madvise [never]
● Never use compaction, and quit khugepaged too
● Disabling THP is excessive if only direct compaction is too expensive
● KVM uses MADV_HUGEPAGE
– MADV_HUGEPAGE will still use direct compaction
Priming the VM
● To achieve maximum THP utilization you can run the two commands below
# cat /proc/buddyinfo
Node 0, zone DMA 0 0 1 1 1 1 1 1 0 1 3
Node 0, zone DMA32 8357 4904 2670 472 126 82 8 1 0 3 5
Node 0, zone Normal 5364 38097 608 63 11 1 1 55 0 0 0
# echo 3 >/proc/sys/vm/drop_caches
# cat /proc/buddyinfo
Node 0, zone DMA 0 0 1 1 1 1 1 1 0 1 3
Node 0, zone DMA32 5989 5445 5235 5022 4420 3583 2424 1331 471 119 25
Node 0, zone Normal 71579 58338 36733 19523 6431 1572 559 282 115 29 2
# echo >/proc/sys/vm/compact_memory
# cat /proc/buddyinfo
Node 0, zone DMA 0 0 1 1 1 1 1 1 0 1 3
Node 0, zone DMA32 1218 1161 1227 1240 1135 925 688 502 342 305 369
Node 0, zone Normal 7479 5147 5124 4240 2768 1959 1391 814 389 270 149
4k 8k 16k 32k 64k 2M 4M
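Each /proc/buddyinfo column counts free blocks of one order, from order 0 (4KiB) up to order 10 (4MiB); a THP needs an order-9 (2MiB) block on x86. A small parser, as a sketch (field positions assume the standard buddyinfo layout):

```python
def parse_buddyinfo_line(line):
    """Parse one /proc/buddyinfo row into (node, zone, counts),
    where counts[order] is the number of free blocks of 2**order pages."""
    fields = line.split()
    node = int(fields[1].rstrip(","))
    zone = fields[3]
    counts = [int(f) for f in fields[4:]]
    return node, zone, counts

def free_thp_candidates(counts, thp_order=9):
    """Free blocks already large enough to back a 2MiB THP."""
    return sum(counts[thp_order:])

node, zone, counts = parse_buddyinfo_line(
    "Node 0, zone Normal 7479 5147 5124 4240 2768 1959 1391 814 389 270 149")
print(zone, "has", free_thp_candidates(counts), "free blocks of >= 2MiB")
```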
THP on tmpfs
● From Kirill A. Shutemov and Hugh Dickins
– Merged in the v4.8 kernel
● # mount -o remount,huge=always none /dev/shm
– Including small files (i.e. < 2MB in size)
● Very high memory usage on small files
● # mount -o remount,huge=never none /dev/shm
– Current default
● # mount -o remount,huge=within_size none /dev/shm
– Allocate THP if the inode i_size can contain hugepages, or if there's a madvise/fadvise hint
● Ideal for apps benefiting from THP on tmpfs!
● # mount -o remount,huge=advise none /dev/shm
– Only if requested through madvise/fadvise
THP on shmem
● shmem uses an internal tmpfs mount for: SysV SHM, memfd, MAP_SHARED of /dev/zero or MAP_ANONYMOUS, DRM objects, ashmem
● The internal mount is controlled by:
$ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
always within_size advise [never] deny force
● deny
– Disable THP for all tmpfs mounts
● For debugging
● force
– Always use THP for all tmpfs mounts
● For testing
KSMscale
● The virtual memory to dedup is practically unlimited (even on 32bit, if you deduplicate across multiple processes)
● The same page content can be deduplicated an unlimited number of times
[Diagram: a PageKSM's stable_node holds rmap_items, each pointing to a virtual address in a userland process: (mm, addr1), (mm, addr2), ...]
KSMscale
$ cat /sys/kernel/mm/ksm/max_page_sharing
256
● KSMscale limits the maximum deduplication for each physical PageKSM allocated
– Limiting “compression” to 256x is enough
– Higher maximum deduplication generates diminishing returns
● To alter the max page sharing:
$ echo 2 > /sys/kernel/mm/ksm/run
$ echo 512 > /sys/kernel/mm/ksm/max_page_sharing
● Included in v4.13
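The sharing cap can be sketched in a toy model (a simplification of KSMscale's stable-node chains, not the kernel's implementation): identical page contents share one physical KSM page, but once a KSM page reaches max_page_sharing mappings a duplicate KSM page is allocated, keeping reverse-mapping walks bounded.

```python
from collections import defaultdict

MAX_PAGE_SHARING = 256  # /sys/kernel/mm/ksm/max_page_sharing

def deduplicate(pages, max_sharing=MAX_PAGE_SHARING):
    """Return the number of physical KSM pages needed to back the
    given sequence of page contents, with at most max_sharing
    mappings per physical page."""
    ksm_pages = defaultdict(list)   # content -> sharer count per KSM page
    for content in pages:
        chains = ksm_pages[content]
        if not chains or chains[-1] == max_sharing:
            chains.append(0)        # allocate a fresh duplicate KSM page
        chains[-1] += 1
    return sum(len(chains) for chains in ksm_pages.values())

# 1000 identical pages with a cap of 256 need ceil(1000/256) = 4 KSM pages
print(deduplicate(["same content"] * 1000))     # → 4
```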
Why: Memory Externalization
● Memory externalization is about running a program with part (or all) of its memory residing on a remote node
● Memory is transferred from the memory node to the compute node on access
● Memory can be transferred back from the compute node to the memory node, under memory pressure, if it's not frequently used
[Diagram: a compute node (local memory) and a memory node (remote memory); a userfault pulls pages from the memory node to the compute node, and memory pressure pushes them back]
Postcopy live migration
● Postcopy live migration is a form of memory externalization
● When the QEMU compute node (destination) faults on a missing page that resides in the memory node (source), the kernel has no way to fetch the page
– Solution: let QEMU in userland handle the pagefault
Partially funded by the Orbit European Union project
[Diagram: the QEMU source (memory node, remote memory) serves userfaults raised by the QEMU destination (compute node, local memory)]
userfaultfd latency
[Histogram: number of userfaults per latency bucket in milliseconds, during postcopy live migration over 10Gbit; qemu 2.5+, RHEL7.2+, stressapptest running in the guest]
Userfaults triggered on pages that were already in network-flight are instantaneous. The background transfer seeks to the last userfault address.
KVM precopy live migration
[Chart: database TPM (0–900000) before, during and after precopy; 10Gbit NIC, 120GiB guest, ~120sec to transfer the RAM over the network]
Precopy never completes until the database benchmark completes
KVM postcopy live migration
[Chart: database TPM (0–900000) over time (00:10 to 16:50) before, during and after postcopy; precopy runs from 5m to 7m, postcopy runs from 7m to about ~9m (deterministic), then khugepaged collapses THPs]
# virsh migrate .. --postcopy --timeout <sec> --timeout-postcopy
# virsh migrate .. --postcopy --postcopy-after-precopy
All available upstream
● userfaultfd() syscall in Linux kernel >= v4.3
● Postcopy live migration in:
– QEMU >= v2.5.0
● Author: David Gilbert @ Red Hat Inc.
– Libvirt >= 1.3.4
– OpenStack Nova >= Newton
● In production since RHEL 7.3
Live migration total time
[Chart: total time in seconds (0–500): autoconverge vs postcopy]
Live migration max perceived downtime latency
[Chart: max UDP latency in seconds (0–25): precopy timeout vs autoconverge vs postcopy]
userfaultfd v4.13 features
● The current v4.13 upstream kernel supports:
– Missing faults (i.e. missing pages, EVENT_PAGEFAULT)
● Anonymous (UFFDIO_COPY, UFFDIO_ZEROPAGE, UFFDIO_WAKE)
● Hugetlbfs (UFFDIO_COPY, UFFDIO_WAKE)
● Shmem (UFFDIO_COPY, UFFDIO_ZEROPAGE, UFFDIO_WAKE)
– UFFD_FEATURE_THREAD_ID (to collect vcpu statistics)
– UFFD_FEATURE_SIGBUS (raises SIGBUS instead of a userfault)
– Non cooperative events
● EVENT_REMOVE (MADV_REMOVE/DONTNEED/FREE)
● EVENT_UNMAP (munmap notification to stop the background transfer)
● EVENT_REMAP (the non-cooperative process called mremap)
● EVENT_FORK (create a new uffd over fork())
● UFFDIO_COPY returns -ESRCH after exit()
userfaultfd in-progress features
● The v4.13-rebased aa.git, for anonymous memory, supports:
– UFFDIO_WRITEPROTECT
– UFFDIO_COPY_MODE_WP
● Same as UFFDIO_COPY but wrprotected
– UFFDIO_REMAP
● The monitor can remove memory atomically
● Floating patchset for synchronous EVENT_REMOVE, to allow a multithreaded non cooperative manager
struct uffd_msg
/* read() structure */
struct uffd_msg {
	__u8	event;
	__u8	reserved1;
	__u16	reserved2;
	__u32	reserved3;
	union {
		struct {
			__u64	flags;
			__u64	address;
			union {
				__u32	ptid;
			} feat;
		} pagefault;
		struct {
			__u32	ufd;
		} fork;
		struct {
			__u64	from;
			__u64	to;
			__u64	len;
		} remap;
		struct {
			__u64	start;
			__u64	end;
		} remove;
		struct {
			/* unused reserved fields */
			__u64	reserved1;
			__u64	reserved2;
			__u64	reserved3;
		} reserved;
	} arg;
} __packed;
● sizeof(struct uffd_msg): 32bit/64bit ABI enforcement
● Zeros in the reserved fields: can extend with UFFD_FEATURE flags
● UFFD_EVENT_* tells which part of the union is valid
● Default cooperative support tracking pagefaults
– UFFD_EVENT_PAGEFAULT
● Non cooperative support tracking MM syscalls
– UFFD_EVENT_FORK
– UFFD_EVENT_REMAP
– UFFD_EVENT_REMOVE|UNMAP
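To illustrate the __packed layout, the 32-byte message can be decoded with Python's struct module: an 8-byte header (event + reserved fields) followed by the 24-byte union. The UFFD_EVENT_PAGEFAULT value is taken from linux/userfaultfd.h; the little-endian x86 byte order is an assumption of this sketch.

```python
import struct

UFFD_EVENT_PAGEFAULT = 0x12  # from linux/userfaultfd.h

def parse_uffd_msg(buf):
    """Decode a 32-byte __packed struct uffd_msg read() from a uffd."""
    assert len(buf) == 32
    event, _r1, _r2, _r3 = struct.unpack_from("<BBHI", buf, 0)
    if event == UFFD_EVENT_PAGEFAULT:
        flags, address = struct.unpack_from("<QQ", buf, 8)
        return {"event": "pagefault", "flags": flags, "address": address}
    return {"event": hex(event)}

# Build a sample pagefault message (header + flags + address + union tail)
msg = struct.pack("<BBHIQQQ", UFFD_EVENT_PAGEFAULT, 0, 0, 0,
                  0, 0x7F0000001000, 0)
print(parse_uffd_msg(msg))
```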
userfaultfd potential use cases
● Efficient snapshotting (drop fork())
● JIT/to-native compilers (drop write bits)
● Add robustness to hugetlbfs/tmpfs
● Host enforcement for virtualization memory ballooning
● Distributed shared memory projects: at Berkeley and the University of Colorado
● Tracking of anon pages written by other threads: at llnl.gov
● Obsoletes soft-dirty
● Obsoletes volatile pages SIGBUS
Git userfaultfd branch
● https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/log/?h=userfault
Virtual Memory evolution
● Amazing to see how much room for further innovation there was back then
– Things constantly look pretty mature
● They may actually have been, considering that my hardware back then was much less powerful than, and no more complex than, my cellphone
● It would be unthinkable to maintain the current level of mission critical complexity by reinventing the wheel in a non Open Source way
– It can perhaps still be done for a limited set of laptop and cellphone models, but for how long?
● Innovation in the Virtual Memory space is probably one of the many factors that contributed to Linux's success, and to KVM's lead in the OpenStack user base too
– KVM (unlike the preceding hypervisor designs) leverages the power of the Linux Virtual Memory in its entirety
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelKernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelAnne Nicolas
 
Kernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are moneyKernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are moneyAnne Nicolas
 
Kernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureKernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureAnne Nicolas
 
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...Anne Nicolas
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataAnne Nicolas
 
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...Anne Nicolas
 
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and BareboxEmbedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and BareboxAnne Nicolas
 
Embedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less specialEmbedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less specialAnne Nicolas
 
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre SiliconEmbedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre SiliconAnne Nicolas
 
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) pictureEmbedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) pictureAnne Nicolas
 
Embedded Recipes 2019 - Testing firmware the devops way
Embedded Recipes 2019 - Testing firmware the devops wayEmbedded Recipes 2019 - Testing firmware the devops way
Embedded Recipes 2019 - Testing firmware the devops wayAnne Nicolas
 
Embedded Recipes 2019 - Herd your socs become a matchmaker
Embedded Recipes 2019 - Herd your socs become a matchmakerEmbedded Recipes 2019 - Herd your socs become a matchmaker
Embedded Recipes 2019 - Herd your socs become a matchmakerAnne Nicolas
 
Embedded Recipes 2019 - LLVM / Clang integration
Embedded Recipes 2019 - LLVM / Clang integrationEmbedded Recipes 2019 - LLVM / Clang integration
Embedded Recipes 2019 - LLVM / Clang integrationAnne Nicolas
 
Embedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debuggingEmbedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debuggingAnne Nicolas
 
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimediaEmbedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimediaAnne Nicolas
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedAnne Nicolas
 
Kernel Recipes 2019 - Suricata and XDP
Kernel Recipes 2019 - Suricata and XDPKernel Recipes 2019 - Suricata and XDP
Kernel Recipes 2019 - Suricata and XDPAnne Nicolas
 
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)Anne Nicolas
 

More from Anne Nicolas (20)

Kernel Recipes 2019 - Driving the industry toward upstream first
Kernel Recipes 2019 - Driving the industry toward upstream firstKernel Recipes 2019 - Driving the industry toward upstream first
Kernel Recipes 2019 - Driving the industry toward upstream first
 
Kernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMI
Kernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMIKernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMI
Kernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMI
 
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelKernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
 
Kernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are moneyKernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are money
 
Kernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureKernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and future
 
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
 
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
 
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and BareboxEmbedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
 
Embedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less specialEmbedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less special
 
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre SiliconEmbedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
 
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) pictureEmbedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
 
Embedded Recipes 2019 - Testing firmware the devops way
Embedded Recipes 2019 - Testing firmware the devops wayEmbedded Recipes 2019 - Testing firmware the devops way
Embedded Recipes 2019 - Testing firmware the devops way
 
Embedded Recipes 2019 - Herd your socs become a matchmaker
Embedded Recipes 2019 - Herd your socs become a matchmakerEmbedded Recipes 2019 - Herd your socs become a matchmaker
Embedded Recipes 2019 - Herd your socs become a matchmaker
 
Embedded Recipes 2019 - LLVM / Clang integration
Embedded Recipes 2019 - LLVM / Clang integrationEmbedded Recipes 2019 - LLVM / Clang integration
Embedded Recipes 2019 - LLVM / Clang integration
 
Embedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debuggingEmbedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debugging
 
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimediaEmbedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
 
Kernel Recipes 2019 - Suricata and XDP
Kernel Recipes 2019 - Suricata and XDPKernel Recipes 2019 - Suricata and XDP
Kernel Recipes 2019 - Suricata and XDP
 
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
 

Recently uploaded

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Recently uploaded (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli

  • 1. 20 years of Linux Virtual Memory. Red Hat, Inc. Andrea Arcangeli <aarcange at redhat.com>. Kernel Recipes, Paris, 29 Sep 2017
  • 2. Topics ● Milestones in the evolution of the Virtual Memory subsystem ● Virtual Memory latest innovations – Automatic NUMA balancing – THP developments – KSMscale – userfaultfd ● Postcopy live migration, etc.
  • 3. Virtual Memory (simplified). Virtual pages: they cost “nothing”, practically unlimited on 64-bit archs. Physical pages: they cost money! This is the RAM. Arrows = pagetables, the virtual-to-physical mapping.
  • 4. PageTables (diagram: pgd → pud → pmd → pte, with 2^9 = 512 arrows at each level, not just 2) ● Common code and the x86 pagetable format is a tree ● All pagetables are 4KB in size ● Total: grep PageTables /proc/meminfo ● (((2**9)**4)*4096)>>48 = 1 → 48 bits → 256 TiB ● 5 levels in v4.13 → (48+9) bits → 57 bits → 128 PiB – Selected at build time to avoid slowing down the common case
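The 48-bit and 57-bit figures on this slide follow directly from the tree shape; a minimal sketch of the arithmetic (illustrative only, not kernel code):

```python
# Each 4 KiB page table holds 512 (2^9) entries, so every level of the
# tree translates 9 more bits of the virtual address; the in-page offset
# itself covers the low 12 bits.
PAGE_SHIFT = 12
BITS_PER_LEVEL = 9

def va_bits(levels):
    """Virtual address bits covered by a pagetable tree of `levels` levels."""
    return PAGE_SHIFT + levels * BITS_PER_LEVEL

print(va_bits(4), "bits =", 2 ** va_bits(4) // 2 ** 40, "TiB")  # 48 bits = 256 TiB
print(va_bits(5), "bits =", 2 ** va_bits(5) // 2 ** 50, "PiB")  # 57 bits = 128 PiB
```

The same formula explains why a 2M hugepage mapping stops one level early: dropping the pte level leaves 12 + 9 = 21 offset bits, i.e. 2 MiB.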
  • 5. The Fabric of the Virtual Memory ● The fabric is all the data structures that connect to the hardware-constrained structures like pagetables and that collectively create all the software abstractions we're accustomed to – tasks, processes, virtual memory areas, mmap (glibc malloc) ... ● The fabric is the most black-and-white part of the Virtual Memory ● The algorithms doing the computations on those data structures are the Virtual Memory heuristics – They need to solve hard problems with no guaranteed perfect solution – i.e. when it's the right time to start to unmap pages (swappiness) ● Some of the design didn't change: we still measure how hard it is to free memory while we're trying to free it ● All free memory is used as cache and we overcommit by default (not excessively by default) – Android uses: echo 1 >/proc/sys/vm/overcommit_memory
  • 6. Physical page and struct page (diagram: each PAGE_SIZE, 4KiB, physical page is described by a 64-byte struct page in the mem_map array; 64/4096 = 1.56% of the RAM)
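The 1.56% figure scales linearly with installed RAM; a quick sketch (illustrative arithmetic, with the 64-byte descriptor size the slide quotes for x86-64 at the time):

```python
STRUCT_PAGE_SIZE = 64   # bytes of mem_map metadata per physical page
PAGE_SIZE = 4096        # bytes per physical page

def mem_map_overhead(ram_bytes):
    """Bytes consumed by the struct page array describing `ram_bytes` of RAM."""
    return (ram_bytes // PAGE_SIZE) * STRUCT_PAGE_SIZE

ram = 256 << 30  # a 256 GiB machine, like the benchmark box later in the deck
print(mem_map_overhead(ram) >> 30, "GiB of struct pages")   # 4 GiB
print(100 * STRUCT_PAGE_SIZE / PAGE_SIZE, "% of RAM")       # 1.5625 %
```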
  • 7. MM & VMA ● mm_struct aka MM – Memory of the process ● Shared by threads ● vm_area_struct aka VMA – Virtual Memory Area ● Created and torn down by mmap and munmap ● Defines the virtual address space of an “MM”
  • 8. Active and Inactive list LRU (originally a bitmap...) ● The active page LRU preserves the active memory working set – Only the inactive LRU loses information as fast as use-once I/O goes – Introduced in 2001, it works well enough even with an arbitrary balance – The optimum active/inactive balancing algorithm was solved in 2012-2014 ● Shadow radix tree nodes that detect re-faults
  • 9. Active and Inactive list LRU $ grep -i active /proc/meminfo Active: 3555744 kB Inactive: 2511156 kB Active(anon): 2286400 kB Inactive(anon): 1472540 kB Active(file): 1269344 kB Inactive(file): 1038616 kB
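The anon and file sub-counters on this slide always sum to the totals; a small parsing sketch (the helper name is hypothetical, the sample is the slide's own output):

```python
sample = """\
Active:          3555744 kB
Inactive:        2511156 kB
Active(anon):    2286400 kB
Inactive(anon):  1472540 kB
Active(file):    1269344 kB
Inactive(file):  1038616 kB
"""

def parse_meminfo(text):
    """Map each /proc/meminfo-style key to its value in kB."""
    fields = {}
    for line in text.splitlines():
        key, rest = line.split(":")
        fields[key] = int(rest.split()[0])
    return fields

info = parse_meminfo(sample)
# anon + file sub-counters add up to the combined totals
assert info["Active(anon)"] + info["Active(file)"] == info["Active"]
assert info["Inactive(anon)"] + info["Inactive(file)"] == info["Inactive"]
print(info["Active"] + info["Inactive"], "kB on the LRU lists")
```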
  • 10. Active LRU working-set detection ● From Johannes Weiner (diagram: faults enter at the head of the inactive list, pages are demoted from the active tail to the inactive head, promoted from the inactive list to the active head, and reclaimed from the inactive tail; the memory available to cache splits into an inactive part, pages a-i, and an active part, pages J-N)
  • 11. lru→inactive_age and radix tree shadow entries (diagram: per-inode/file radix tree, root → nodes → pages)
  • 12. Reclaim saving inactive_age (diagram: on LRU reclaim the page's radix tree slot is replaced by an exceptional/shadow entry recording the memcg LRU's inactive_age++)
  • 13. Object-based reverse mapping (diagram: anonymous physical pages are reached through an anon_vma linking vmas; filesystem pages through the inode's prio_tree of vmas, i.e. objrmap; KSM pages through rmap items pointing at vmas)
  • 14. Active & inactive + rmap (diagram: mem_map pages page[0]..page[N] sit on the active_lru, use-many pages, or on the inactive_lru, use-once pages to trash, while rmap links each page back into the process virtual memory)
  • 15. Many more LRUs ● Separate LRUs for anon and file backed mappings ● Memcg (memory cgroups) introduced per-memcg LRUs ● Removal of unfreeable pages from LRUs – anonymous memory with no swap – mlocked memory ● Transparent Hugepages in the LRU increase scalability further (LRU size decreased 512 times)
  • 16. Recent Virtual Memory trends ● Optimizing the workloads for you, without manual tuning – NUMA hard bindings (numactl) → Automatic NUMA Balancing – Hugetlbfs → Transparent Hugepage – Programs or Virtual Machines duplicating memory → KSM – Page pinning (RDMA/KVM shadow MMU) → MMU notifier – Private device memory managed by hand and pinned → HMM/UVM (unified virtual memory) for GPUs seamlessly computing in GPU memory ● The optimizations can be optionally disabled
  • 18. Automatic NUMA Balancing benchmark: Intel SandyBridge (Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz), 2 sockets, 32 cores with hyperthreads, 256G memory. RHEV 3.6 host bare metal: 3.10.0-327.el7 (RHEL7.2). VM guest: 3.10.0-324.el7 (RHEL7.2). VM: 32P, 160G (Optimized for Server). Storage: Violin 6616, 16G Fibre Channel. Oracle: 12c, 128G SGA. Test: running an Oracle OLTP workload with increasing user count, measuring Trans/min for each run as the comparison metric.
  • 19. (chart) OLTP workload, 4 VMs with different NUMA options: Trans/min (0 to 1,400,000) vs. user count (10U to 100U), comparing automatic NUMA balancing off, on with no pinning, and on with NUMA pinning
  • 20. Automatic NUMA balancing configuration ● https://tinyurl.com/zupp9v3 https://access.redhat.com/ ● In RHEL7 Automatic NUMA balancing is enabled when # numactl --hardware shows multiple nodes ● To disable automatic NUMA balancing: # echo 0 > /proc/sys/kernel/numa_balancing ● To enable automatic NUMA balancing: # echo 1 > /proc/sys/kernel/numa_balancing ● At boot: numa_balancing=enable|disable
  • 22. Hugepages ● Traditionally x86 hardware gave us 4KiB pages ● The more memory, the bigger the overhead of managing 4KiB pages ● What if you had bigger pages? – 512 times bigger → 2MiB
  • 23. PageTables (repeat of the earlier pgd → pud → pmd → pte diagram: 2^9 = 512 arrows, not just 2)
  • 24. Benefit of hugepages ● Improve CPU performance – Enlarge TLB size (essential for KVM) – Speed up TLB misses (essential for KVM) ● Need 3 accesses to memory instead of 4 to refill the TLB – Faster to allocate memory initially (minor) – Page colouring inside the hugepage (minor) – Higher scalability of the page LRUs ● Cons – clear_page/copy_page less cache friendly ● Clearing the faulting subpage last included in v4.13 from Andi Kleen and Ying Huang – 28.3% increase in vm-scalability anon-w-seq – Sometimes higher memory footprint – Direct compaction takes time
  • 25. (chart) TLB miss cost: number of memory accesses per refill (0 to 25 scale) for each host/guest THP on/off combination under virtualization, and for THP on/off on bare metal
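The chart's worst case of roughly 24 accesses under virtualization follows from the standard two-dimensional (nested) page-walk arithmetic; a sketch, assuming 4-level walks without THP and 3-level walks with 2M pages (illustrative model, not from the slides):

```python
def native_walk(levels):
    """Bare-metal TLB refill: one memory access per pagetable level."""
    return levels

def nested_walk(guest_levels, host_levels):
    """Nested refill: each guest pagetable level, plus the final
    guest-physical address, must itself be walked through the host's
    tables: (g + 1) * (h + 1) - 1 accesses in the worst case."""
    return (guest_levels + 1) * (host_levels + 1) - 1

print(native_walk(4))        # 4, THP off on bare metal
print(native_walk(3))        # 3, THP on (2M pages skip the pte level)
print(nested_walk(4, 4))     # 24, host and guest THP off
print(nested_walk(3, 3))     # 15, host and guest THP on
```

This is why the slide calls hugepages "essential for KVM": shortening both dimensions of the walk compounds.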
  • 26. Transparent Hugepage design ● How do we get the benefits of hugetlbfs without having to configure anything? ● Any Linux process will receive 2M pages – if the mmap region is 2M naturally aligned – if compaction succeeds in producing hugepages ● Entirely transparent to userland
  • 27. THP sysfs enabled ● /sys/kernel/mm/transparent_hugepage/enabled – [always] madvise never ● Always use THP if vma start/end permits – always [madvise] never ● Use THP only inside MADV_HUGEPAGE – Applies to khugepaged too – always madvise [never] ● Never use THP – khugepaged quits ● Default selected at build time
  • 28. THP defrag - compaction control ● /sys/kernel/mm/transparent_hugepage/defrag – [always] defer defer+madvise madvise never ● Always use direct compaction (ideal for long lived allocations) – always [defer] defer+madvise madvise never ● Defer compaction asynchronously (kswapd/kcompactd) – always defer [defer+madvise] madvise never ● Direct compaction in MADV_HUGEPAGE, async otherwise – always defer defer+madvise [madvise] never ● Use direct compaction only inside MADV_HUGEPAGE ● khugepaged enabled in the background – always defer defer+madvise madvise [never] ● Never use compaction, quit khugepaged too ● Disabling THP is excessive if only direct compaction is too expensive ● KVM uses MADV_HUGEPAGE – MADV_HUGEPAGE will still use direct compaction
  • 29. Priming the VM ● To achieve maximum THP utilization you can run the two commands below (buddyinfo columns run from order 0, 4k, up to order 10, 4M; order 9 = 2M) # cat /proc/buddyinfo Node 0, zone DMA 0 0 1 1 1 1 1 1 0 1 3 Node 0, zone DMA32 8357 4904 2670 472 126 82 8 1 0 3 5 Node 0, zone Normal 5364 38097 608 63 11 1 1 55 0 0 0 # echo 3 >/proc/sys/vm/drop_caches # cat /proc/buddyinfo Node 0, zone DMA 0 0 1 1 1 1 1 1 0 1 3 Node 0, zone DMA32 5989 5445 5235 5022 4420 3583 2424 1331 471 119 25 Node 0, zone Normal 71579 58338 36733 19523 6431 1572 559 282 115 29 2 # echo 1 >/proc/sys/vm/compact_memory # cat /proc/buddyinfo Node 0, zone DMA 0 0 1 1 1 1 1 1 0 1 3 Node 0, zone DMA32 1218 1161 1227 1240 1135 925 688 502 342 305 369 Node 0, zone Normal 7479 5147 5124 4240 2768 1959 1391 814 389 270 149
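After compaction, the order-9-and-up buddyinfo columns are what THP can immediately consume; a parsing sketch (hypothetical helper, fed the slide's post-compaction zone Normal line):

```python
sample = "Node 0, zone Normal 7479 5147 5124 4240 2768 1959 1391 814 389 270 149"

def free_2m_hugepages(buddyinfo_line):
    """Count 2 MiB hugepages available from one /proc/buddyinfo line.
    With 4 KiB base pages, order 9 is 2 MiB; each higher-order block
    contains 2**(order - 9) such hugepages."""
    counts = [int(n) for n in buddyinfo_line.split()[4:]]  # orders 0..10
    return sum(c * 2 ** (order - 9)
               for order, c in enumerate(counts) if order >= 9)

print(free_2m_hugepages(sample))   # 270 + 149*2 = 568 potential 2M pages
```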
  • 30. Copyright © 2017 Red Hat Inc. 43 ● From Kirill A. Shutemov and Hugh Dickins – Merged in the v4.8 kernel ● # mount -o remount,huge=always none /dev/shm – Including small files (i.e. < 2MB in size) ● Very high memory usage on small files ● # mount -o remount,huge=never none /dev/shm – Current default ● # mount -o remount,huge=within_size none /dev/shm – Allocate THP if inode i_size can contain hugepages, or if there’s a madvise/fadvise hint ● Ideal for apps benefiting from THP on tmpfs! ● # mount -o remount,huge=advise none /dev/shm – Only if requested through madvise/fadvise THP on tmpfs
  • 31. Copyright © 2017 Red Hat Inc. 44 ● shmem uses an internal tmpfs mount for: SysV SHM, memfd, MAP_SHARED of /dev/zero or MAP_ANONYMOUS, DRM objects, ashmem ● Internal mount is controlled by: $ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled always within_size advise [never] deny force ● deny – Disable THP for all tmpfs mounts ● For debugging ● force – Always use THP for all tmpfs mounts ● For testing THP on shmem
  • 33. Copyright © 2017 Red Hat Inc. 46 ● Virtual memory to dedup is practically unlimited (even on 32bit if you deduplicate across multiple processes) ● The same page content can be deduplicated an unlimited number of times KSMscale [Diagram: one PageKSM referenced by a stable_node, whose rmap_items point to virtual addresses (mm, addr1) and (mm, addr2) in different userland processes]
  • 34. Copyright © 2017 Red Hat Inc. 47 $ cat /sys/kernel/mm/ksm/max_page_sharing 256 ● KSMscale limits the maximum deduplication for each physical PageKSM allocated – Limiting “compression” to x256 is enough – Higher maximum deduplication generates diminishing returns ● To alter the max page sharing: $ echo 2 > /sys/kernel/mm/ksm/run $ echo 512 > /sys/kernel/mm/ksm/max_page_sharing ● Included in v4.13 KSMscale
  • 36. Copyright © 2017 Red Hat Inc. 49 Why: Memory Externalization ● Memory externalization is about running a program with part (or all) of its memory residing on a remote node ● Memory is transferred from the memory node to the compute node on access ● Memory can be transferred from the compute node to the memory node if it's not frequently used during memory pressure [Diagram: a compute node with local memory pulls pages from a memory node's remote memory via userfault, and pushes them back under memory pressure]
  • 37. Copyright © 2017 Red Hat Inc. 50 Postcopy live migration ● Postcopy live migration is a form of memory externalization ● When the QEMU compute node (destination) faults on a missing page that resides in the memory node (source) the kernel has no way to fetch the page – Solution: let QEMU in userland handle the pagefault Partially funded by the Orbit European Union project [Diagram: the QEMU destination on the compute node userfaults on pages still residing in the QEMU source's remote memory]
  • 38. Copyright © 2017 Red Hat Inc. 51 userfaultfd latency [Chart: number of userfaults vs latency in milliseconds, during postcopy live migration - 10Gbit - qemu 2.5+ - RHEL7.2+ - stressapptest running in guest] Userfaults triggered on pages that were already in network-flight are instantaneous. Background transfer seeks at the last userfault address.
  • 39. Copyright © 2017 Red Hat Inc. 52 KVM precopy live migration [Chart: database TPM before, during and after precopy - 10Gbit NIC - 120GiB guest - ~120sec time to transfer the RAM over the network] Precopy never completes until the database benchmark completes
  • 40. Copyright © 2017 Red Hat Inc. 53 KVM postcopy live migration [Chart: database TPM before, during and after postcopy; precopy runs from 5m to 7m, postcopy runs from 7m to about ~9m and completes deterministically; khugepaged collapses THPs after postcopy ends] # virsh migrate .. --postcopy --timeout <sec> --timeout-postcopy # virsh migrate .. --postcopy --postcopy-after-precopy
  • 41. Copyright © 2017 Red Hat Inc. 54 All available upstream ● Userfaultfd() syscall in Linux Kernel >= v4.3 ● Postcopy live migration in: – QEMU >= v2.5.0 ● Author: David Gilbert @ Red Hat Inc. – Postcopy in Libvirt >= 1.3.4 – OpenStack Nova >= Newton ● In production since RHEL 7.3
  • 42. Copyright © 2017 Red Hat Inc. 55 Live migration total time [Chart: total migration time in seconds, autoconverge vs postcopy]
  • 43. Copyright © 2017 Red Hat Inc. 56 Live migration max perceived downtime latency [Chart: max UDP latency in seconds for precopy timeout, autoconverge and postcopy]
  • 44. Copyright © 2017 Red Hat Inc. 57 userfaultfd v4.13 features ● The current v4.13 upstream kernel supports: – Missing faults (i.e. missing pages EVENT_PAGEFAULT) ● Anonymous (UFFDIO_COPY, UFFDIO_ZEROPAGE, UFFDIO_WAKE) ● Hugetlbfs (UFFDIO_COPY, UFFDIO_WAKE) ● Shmem (UFFDIO_COPY, UFFDIO_ZEROPAGE, UFFDIO_WAKE) – UFFD_FEATURE_THREAD_ID (to collect vcpu statistics) – UFFD_FEATURE_SIGBUS (raises SIGBUS instead of userfault) – Non cooperative Events ● EVENT_REMOVE (MADV_REMOVE/DONTNEED/FREE) ● EVENT_UNMAP (munmap notification to stop background transfer) ● EVENT_REMAP (non-cooperative process called mremap) ● EVENT_FORK (create new uffd over fork()) ● UFFDIO_COPY returns -ESRCH after exit()
  • 45. Copyright © 2017 Red Hat Inc. 58 userfaultfd in-progress features ● v4.13 rebased aa.git for anonymous memory supports: – UFFDIO_WRITEPROTECT – UFFDIO_COPY_MODE_WP ● Same as UFFDIO_COPY but wrprotected – UFFDIO_REMAP ● monitor can remove memory atomically ● Floating patchset for synchronous EVENT_REMOVE to allow multithreaded non cooperative manager
  • 46. Copyright © 2017 Red Hat Inc. 59 struct uffd_msg
/* read() structure */
struct uffd_msg {
	__u8 event;		/* UFFD_EVENT_* tells which part of the union is valid */
	__u8 reserved1;
	__u16 reserved2;
	__u32 reserved3;	/* zeros here, can extend with UFFD_FEATURE flags */
	union {
		struct {	/* default cooperative support tracking pagefaults: */
			__u64 flags;	/* UFFD_EVENT_PAGEFAULT */
			__u64 address;
			union {
				__u32 ptid;
			} feat;
		} pagefault;
		struct {	/* non cooperative support tracking MM syscalls: */
			__u32 ufd;	/* UFFD_EVENT_FORK */
		} fork;
		struct {	/* UFFD_EVENT_REMAP */
			__u64 from;
			__u64 to;
			__u64 len;
		} remap;
		struct {	/* UFFD_EVENT_REMOVE|UNMAP */
			__u64 start;
			__u64 end;
		} remove;
		struct {
			/* unused reserved fields */
			__u64 reserved1;
			__u64 reserved2;
			__u64 reserved3;
		} reserved;
	} arg;
} __packed;	/* sizeof(struct uffd_msg): 32bit/64bit ABI enforcement */
  • 47. Copyright © 2017 Red Hat Inc. 60 userfaultfd potential use cases ● Efficient snapshotting (drop fork()) ● JIT/to-native compilers (drop write bits) ● Add robustness to hugetlbfs/tmpfs ● Host enforcement for virtualization memory ballooning ● Distributed shared memory projects: at Berkeley and University of Colorado ● Tracking of anon pages written by other threads: at llnl.gov ● Obsoletes soft-dirty ● Obsoletes volatile pages SIGBUS
  • 48. Copyright © 2017 Red Hat Inc. 61 Git userfaultfd branch ● https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/log/?h=userfault
  • 50. Copyright © 2017 Red Hat Inc. 63 Virtual Memory evolution ● Amazing to see how much room for further innovation there was back then – Things constantly look pretty mature ● They may actually have been, considering my hardware back then was much less powerful, and no more complex, than my cellphone ● Unthinkable to maintain the current level of mission critical complexity by reinventing the wheel in a non Open Source way – Can perhaps still be done for a limited set of laptop and cellphone models, but for how long? ● Innovation in the Virtual Memory space is probably one of the many factors that contributed to Linux's success, and to the KVM lead in the OpenStack user base too – KVM (unlike the preceding Hypervisor designs) leverages the power of the Linux Virtual Memory in its entirety