The document describes the Turtles Project, which implements nested virtualization on x86 CPUs. The key aspects are:
1) It runs multiple guest hypervisors (KVM, VMware, Linux, Windows) simultaneously on a single x86 CPU using nested virtualization, which is not directly supported in hardware.
2) It addresses the challenges of nested virtualization through techniques like multi-dimensional paging to map multiple virtual-to-physical address spaces, and multi-level device assignment to allow direct device access across multiple hypervisor layers.
3) Micro-optimizations are used to reduce the performance overhead of moving between hypervisor layers, such as minimizing data copying during VM exits. Experiments show each additional level of virtualization adds approximately the same overhead, with total nested overhead currently at 6-14%.
1. The Turtles Project:
Design and Implementation of Nested Virtualization
Muli Ben-Yehuda† Michael D. Day‡ Zvi Dubitzky† Michael Factor†
Nadav Har’El† Abel Gordon† Anthony Liguori‡ Orit Wasserman†
Ben-Ami Yassour†
† IBM Research—Haifa
‡ IBM Linux Technology Center
Technion—Israel Institute of Technology
2. What is nested x86 virtualization?
Running multiple unmodified hypervisors
With their associated unmodified VMs
Simultaneously
On the x86 architecture
Which does not support nesting in hardware. . .
. . . but does support a single level of virtualization
[Figure: guest OSes and a guest hypervisor (with its own guest OSes) all running on a single hypervisor on the hardware]
3. Why?
Operating systems are already hypervisors (Windows 7 with XP mode, Linux/KVM)
Security: attack via or defend against hypervisor-level rootkits such as Blue Pill
To be able to run other hypervisors in clouds
Co-design of x86 hardware and system software
Testing, demonstrating, debugging, live migration of hypervisors
4. Related work
First models for nested virtualization [PopekGoldberg74, BelpaireHsu75, LauerWyeth73]
First implementation in the IBM z/VM; relies on architectural support for nested virtualization (sie)
Microkernels meet recursive VMs [FordHibler96]: assumes we can modify software at all levels
x86 software based approaches (slow!) [Berghmans10]
KVM [KivityKamay07] with AMD SVM [RoedelGraf09]
Early Xen prototype [He09]
Blue Pill rootkit hiding from other hypervisors [Rutkowska06]
5. What is the Turtles project?
Efficient nested virtualization for Intel x86 based on KVM
Runs multiple guest hypervisors and VMs: KVM, VMware, Linux, Windows, . . .
Code publicly available
6. What is the Turtles project? (cont.)
Nested VMX virtualization for nested CPU virtualization
Multi-dimensional paging for nested MMU virtualization
Multi-level device assignment for nested I/O virtualization
Micro-optimizations to make it go fast
7. Theory of nested CPU virtualization
Trap and emulate [PopekGoldberg74] ⇒ it’s all about the traps
Single-level (x86) vs. multi-level (e.g., z/VM)
Single level ⇒ one hypervisor, many guests
Turtles approach: L0 multiplexes the hardware between L1 and L2 ,
running both as guests of L0 —without either being aware of it
(Scheme generalizes to n levels; our focus is n=2)
[Figure: multiple logical levels (L2 guests on L1 guest hypervisors on the L0 host hypervisor) vs. L0 multiplexing L1 and the L2 guests on the single level the hardware provides]
8. Nested VMX virtualization: flow
L0 runs L1 with VMCS0→1
L1 prepares VMCS1→2 and executes vmlaunch
vmlaunch traps to L0
L0 merges VMCSs: VMCS0→1 merged with VMCS1→2 is VMCS0→2
L0 launches L2
L2 causes a trap
L0 handles the trap itself or forwards it to L1
. . . eventually, L0 resumes L2
repeat
[Figure: L1 (guest hypervisor) and L2 (guest OS) above the L0 host hypervisor; L0 keeps 0-1, 1-2, and 0-2 VMCS and memory-table state]
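A minimal sketch of this loop in C. It is an illustration only: the vmcs_t struct, merge_vmcs, and run_guest are hypothetical stand-ins for the real KVM nVMX structures and the VMX instructions, and the exit reasons are invented.

/* nested_vmx_flow.c: illustrative model of the merge-and-multiplex loop.   */
#include <stdio.h>

typedef struct {
    unsigned long guest_state;   /* state loaded into the CPU on VM entry   */
    unsigned long host_state;    /* state restored when the guest exits     */
    unsigned long exit_ctls;     /* which events cause a VM exit            */
} vmcs_t;

/* L0 merges VMCS0->1 and VMCS1->2 into VMCS0->2. */
static void merge_vmcs(const vmcs_t *vmcs01, const vmcs_t *vmcs12, vmcs_t *vmcs02)
{
    vmcs02->guest_state = vmcs12->guest_state;            /* run L2 as L1 described it */
    vmcs02->host_state  = vmcs01->host_state;             /* every exit lands in L0    */
    vmcs02->exit_ctls   = vmcs01->exit_ctls | vmcs12->exit_ctls; /* trap for either    */
}

/* Stand-in for vmlaunch/vmresume of L2; returns a pretend exit reason. */
static int run_guest(const vmcs_t *vmcs02)
{
    (void)vmcs02;
    static const int reasons[] = { 1 /* external interrupt */, 10 /* CPUID */ };
    static int i;
    return reasons[i++ % 2];
}

int main(void)
{
    vmcs_t vmcs01 = { .guest_state = 1, .host_state = 0, .exit_ctls = 0x1 };
    vmcs_t vmcs12 = { .guest_state = 2, .host_state = 1, .exit_ctls = 0x2 };
    vmcs_t vmcs02;

    for (int round = 0; round < 4; round++) {      /* "repeat"                  */
        merge_vmcs(&vmcs01, &vmcs12, &vmcs02);     /* L0 builds VMCS0->2        */
        int reason = run_guest(&vmcs02);           /* L0 launches/resumes L2    */
        if (reason == 1)
            printf("exit %d: L0 handles it itself\n", reason);
        else
            printf("exit %d: L0 forwards it to L1 as an L2 exit\n", reason);
    }
    return 0;
}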
9. Exit multiplication makes angry turtle angry
To handle a single L2 exit, L1 does many things: read and write
the VMCS, disable interrupts, . . .
Those operations can trap, leading to exit multiplication
Exit multiplication: a single L2 exit can cause 40-50 L1 exits!
Optimize: make a single exit fast and reduce frequency of exits
[Figure: exits multiply as levels are added: single level, two levels, three levels]
10. Introduction to x86 MMU virtualization
x86 does page table walks in hardware
MMU has one currently active hardware page table
Bare metal ⇒ only needs one logical translation,
(virtual → physical)
Virtualization ⇒ needs two logical translations
1 Guest page table: (guest virt → guest phys)
2 Host page table: (guest phys → host phys)
. . . but MMU only knows to walk a single table!
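A toy model of the problem in C, using flat page-number arrays instead of real multi-level x86 page tables (the arrays and values are made up): the guest needs two translations composed, but the hardware walker can only follow one table.

/* two_translations.c: guest virt -> guest phys -> host phys, schematically. */
#include <stdio.h>

#define PAGES 8

static unsigned guest_pt[PAGES] = { 3, 1, 4, 1, 5, 2, 6, 7 }; /* guest virt -> guest phys */
static unsigned host_pt[PAGES]  = { 2, 7, 1, 0, 6, 3, 5, 4 }; /* guest phys -> host phys  */

/* The composition the hardware would have to perform, but cannot:
 * it only walks a single page table at a time.                              */
static unsigned translate(unsigned guest_virt_page)
{
    unsigned guest_phys = guest_pt[guest_virt_page];
    return host_pt[guest_phys];
}

int main(void)
{
    for (unsigned v = 0; v < PAGES; v++)
        printf("guest virt %u -> guest phys %u -> host phys %u\n",
               v, guest_pt[v], translate(v));
    return 0;
}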
11. Software MMU virtualization: shadow paging
[Figure: the shadow page table (SPT) folds GV→GP (the guest page table) and GP→HP (the host mapping) into one guest-virtual to host-physical table]
Two logical translations compressed onto the shadow page table [DevineBugnion02]
Unmodified guest OS updates its own table
Hypervisor traps OS page table updates
Hypervisor propagates updates to the hardware table
MMU walks the table
Problem: traps are expensive
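A toy sketch of that compression, assuming flat page-number arrays and a hypothetical trap hook named guest_writes_pte; it illustrates the idea of folding both translations into the single table the MMU walks, not the real KVM MMU code.

/* shadow_paging.c: propagate a trapped guest PTE write into the shadow table. */
#include <stdio.h>

#define PAGES 8

static unsigned guest_pt[PAGES];   /* guest virt -> guest phys, owned by the guest OS  */
static unsigned host_pt[PAGES];    /* guest phys -> host phys, owned by the hypervisor */
static unsigned shadow_pt[PAGES];  /* guest virt -> host phys, walked by the MMU       */

/* The hypervisor traps the guest's page-table update...                      */
static void guest_writes_pte(unsigned virt_page, unsigned guest_phys_page)
{
    guest_pt[virt_page] = guest_phys_page;            /* the guest's own table */
    /* ...and propagates the composed translation into the hardware table.     */
    shadow_pt[virt_page] = host_pt[guest_phys_page];
}

int main(void)
{
    for (unsigned p = 0; p < PAGES; p++)
        host_pt[p] = PAGES - 1 - p;                   /* some host mapping      */

    guest_writes_pte(0, 5);                  /* guest maps virt 0 -> guest phys 5 */
    printf("MMU walks the shadow table: virt 0 -> host phys %u\n", shadow_pt[0]);
    return 0;
}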
13. Nested MMU virt. via multi-dimensional paging
Three logical translations: L2 virt → phys, L2 → L1 , L1 → L0
Only two tables in hardware with EPT:
virt → phys and guest physical → host physical
L0 compresses three logical translations onto two hardware tables
[Figure: the three options side by side, from baseline to best: shadow on top of shadow, shadow on top of EPT, and multi-dimensional paging; each is detailed on one of the next three slides]
14. Baseline: shadow-on-shadow
[Figure: GPT maps L2 virtual to L2 physical; SPT12 compresses that to L1 physical; SPT02 compresses all three translations into one L2-virtual to L0-physical table, the only table in hardware]
Assume no EPT table; all hypervisors use shadow paging
Useful for old machines and as a baseline
Maintaining shadow page tables is expensive
Compress: three logical translations ⇒ one table in hardware
15. Better: shadow-on-EPT
[Figure: GPT maps L2 virtual to L2 physical; L1's SPT12 compresses that to L1 physical; the hardware EPT01 maps L1 physical to L0 physical]
Instead of one hardware table we have two
Compress: three logical translations ⇒ two in hardware
Simple approach: L0 uses EPT, L1 uses shadow paging for L2
Every L2 page fault leads to multiple L1 exits
16. Best: multi-dimensional paging
[Figure: GPT maps L2 virtual to L2 physical in hardware; EPT12 (L2 physical → L1 physical) and EPT01 (L1 physical → L0 physical) are compressed by L0 into the hardware EPT02 (L2 physical → L0 physical)]
EPT table rarely changes; guest page table changes a lot
Again, compress three logical translations ⇒ two in hardware
L0 emulates EPT for L1
L0 uses EPT0→1 and EPT1→2 to construct EPT0→2
End result: a lot fewer exits!
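A toy model of that construction, with flat arrays standing in for real EPT structures; handle_ept_violation is a hypothetical name for the L0-side handler that composes EPT1→2 with EPT0→1 on demand.

/* multidim_paging.c: fold EPT1->2 and EPT0->1 into the hardware EPT0->2.    */
#include <stdio.h>

#define PAGES    8
#define UNMAPPED (-1)

static int ept12[PAGES];   /* L2 phys -> L1 phys, written by L1 into the EPT that L0 emulates */
static int ept01[PAGES];   /* L1 phys -> L0 phys, maintained by L0                            */
static int ept02[PAGES];   /* L2 phys -> L0 phys, the table the hardware actually walks       */

/* Called by L0 when the hardware reports an EPT violation for an L2 page.   */
static int handle_ept_violation(int l2_phys)
{
    int l1_phys = ept12[l2_phys];
    if (l1_phys == UNMAPPED)
        return UNMAPPED;                 /* L1 never mapped it: reflect the violation to L1 */
    ept02[l2_phys] = ept01[l1_phys];     /* compose the two mappings and cache in EPT0->2   */
    return ept02[l2_phys];
}

int main(void)
{
    for (int p = 0; p < PAGES; p++) {
        ept12[p] = (p < 4) ? p + 1 : UNMAPPED;   /* L1 mapped only half of L2's pages */
        ept01[p] = (p * 2) % PAGES;
        ept02[p] = UNMAPPED;
    }

    printf("L2 phys 2 -> L0 phys %d\n", handle_ept_violation(2));
    if (handle_ept_violation(6) == UNMAPPED)
        printf("L2 phys 6: unmapped, reflect the EPT violation to L1\n");
    return 0;
}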
17. Introduction to I/O virtualization
From the hypervisor’s perspective, what is I/O?
(1) PIO (2) MMIO (3) DMA (4) interrupts
Device emulation [Sugerman01]
Para-virtualized drivers [Barham03, Russell08]
Direct device assignment [Levasseur04, Yassour08]
[Figure: the three models: the guest's device driver trapping into a device-emulation layer in the host; a para-virtual front-end driver in the guest paired with a back-end driver in the host; and the guest's device driver accessing the physical device directly]
Direct assignment best performing option
Direct assignment requires IOMMU for safe DMA bypass
18. Multi-level device assignment
With nesting there are 3x3 options for I/O virtualization (L2 ⇔ L1 ⇔ L0)
Multi-level device assignment means giving an L2 guest direct
access to L0 ’s devices, safely bypassing both L0 and L1
[Figure: the L2 device driver's MMIOs and PIOs bypass the L1 and L0 hypervisors and reach the physical device directly; device DMA reaches L2 memory via the platform IOMMU, onto which L0 folds the L1 IOMMU and L0 IOMMU mappings]
How? L0 emulates an IOMMU for L1 [Amit11]
L0 compresses multiple IOMMU translations onto the single
hardware IOMMU page table
L2 programs the device directly
Device DMA’s into L2 memory space directly
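A toy sketch of that IOMMU compression, with flat page-number arrays and hypothetical names (l1_maps_iommu stands in for L0 trapping L1's write to the IOMMU that L0 emulates for it); it shows only the folding of the two mappings into the one hardware IOMMU table.

/* multilevel_assignment.c: compress L1's IOMMU mapping with L0's memory map. */
#include <stdio.h>

#define PAGES 8

static int ept01[PAGES];      /* L1 phys -> L0 phys, how L0 maps L1's memory          */
static int hw_iommu[PAGES];   /* device DMA addr -> L0 phys, walked by the real IOMMU */

/* L0 traps L1's update to the emulated IOMMU and installs the composed entry. */
static void l1_maps_iommu(int dma_page, int l1_phys_page)
{
    hw_iommu[dma_page] = ept01[l1_phys_page];
}

int main(void)
{
    for (int p = 0; p < PAGES; p++)
        ept01[p] = (p + 2) % PAGES;

    /* L1 has assigned the device to L2 and maps DMA page 3 to L1 phys 5,
     * which is where L2's buffer lives from L1's point of view.            */
    l1_maps_iommu(3, 5);
    printf("device DMA to page 3 lands in L0 phys %d\n", hw_iommu[3]);
    return 0;
}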
19. Micro-optimizations
Goal: reduce world switch overheads
Reduce the cost of a single exit by focusing on VMCS merges:
Keep VMCS fields in processor encoding
Partial updates instead of wholesale copying
Copy multiple fields at once
Some optimizations not safe according to spec
Reduce frequency of exits—focus on vmread and vmwrite
Avoid the exit multiplier effect
Loads/stores vs. architected trapping instructions
Binary patching?
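A sketch of the vmread side of this, with an exit counter standing in for the cost of trapping to L0; the direct-load path models letting L1 read the VMCS-backing memory without trapping (which, as noted above, is not guaranteed safe by the VMX spec). The vmcs12 struct and field names are illustrative only.

/* drw_sketch.c: trapping vs. direct access to guest VMCS fields.            */
#include <stdio.h>

static struct { unsigned long exit_reason, guest_rip; } vmcs12;
static unsigned long exits_to_l0;

/* Trapping path: a real vmread executed by L1 exits to L0, which emulates it. */
static unsigned long vmread_trapping(const unsigned long *field)
{
    exits_to_l0++;                 /* one L1 exit per field access              */
    return *field;                 /* L0 emulates the read and resumes L1       */
}

/* Optimized path: L1 loads straight from the VMCS-backing memory, no exit.    */
static unsigned long vmread_direct(const unsigned long *field)
{
    return *field;
}

int main(void)
{
    vmcs12.exit_reason = 10;
    vmcs12.guest_rip   = 0x1000;

    for (int i = 0; i < 25; i++)   /* an L1 exit handler touches many fields... */
        (void)vmread_trapping(&vmcs12.guest_rip);
    printf("trapping accesses: %lu exits to L0\n", exits_to_l0);

    exits_to_l0 = 0;
    for (int i = 0; i < 25; i++)
        (void)vmread_direct(&vmcs12.guest_rip);
    printf("direct accesses:   %lu exits to L0\n", exits_to_l0);
    return 0;
}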
20. Nested VMX support in KVM
Date: Mon, 16 May 2011 22:43:54 +0300
From: Nadav Har’El <nyh (at) il.ibm.com>
To: kvm [at] vger.kernel.org
Cc: gleb [at] redhat.com, avi [at] redhat.com
Subject: [PATCH 0/31] nVMX: Nested VMX, v10
Hi,
This is the tenth iteration of the nested VMX patch set. Improvements in this
version over the previous one include:
* Fix the code which did not fully maintain a list of all VMCSs loaded on
each CPU. (Avi, this was the big thing that bothered you in the previous
version).
* Add nested-entry-time (L1->L2) verification of control fields of vmcs12 -
procbased, pinbased, entry, exit and secondary controls - compared to the
capability MSRs which we advertise to L1.
[many other changes trimmed]
This new set of patches applies to the current KVM trunk (I checked with
6f1bd0daae731ff07f4755b4f56730a6e4a3c1cb).
If you wish, you can also check out an already-patched version of KVM from
branch "nvmx10" of the repository:
git://github.com/nyh/kvm-nested-vmx.git
21. Windows XP on KVM L1 on KVM L0
22. Linux on VMware L1 on KVM L0
23. Experimental Setup
Running Linux, Windows, KVM, VMware, SMP, . . .
Macro workloads:
kernbench
SPECjbb
netperf
Multi-dimensional paging?
Multi-level device assignment?
KVM as L1 vs. VMware as L1?
See paper for full experimental details and more benchmarks and analysis
24. Macro: SPECjbb and kernbench
kernbench
                         Host    Guest   Nested   NestedDRW
Run time                 324.3   355     406.3    391.5
% overhead vs. host      -       9.5     25.3     20.7
% overhead vs. guest     -       -       14.5     10.3

SPECjbb
                         Host    Guest   Nested   NestedDRW
Score                    90493   83599   77065    78347
% degradation vs. host   -       7.6     14.8     13.4
% degradation vs. guest  -       -       7.8      6.3

Table: kernbench and SPECjbb results
Exit multiplication effect not as bad as we feared
Direct vmread and vmwrite (DRW) give an immediate boost
Take-away: each level of virtualization adds approximately the same
overhead!
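The percentages in the table follow directly from the raw numbers, assuming overhead is the relative increase in run time and degradation the relative drop in score; a quick check in C:

/* overhead_calc.c: reproduce the table's percentages from the raw results.  */
#include <stdio.h>

static double overhead(double measured, double baseline)     /* run time: higher is worse */
{
    return (measured - baseline) / baseline * 100.0;
}

static double degradation(double measured, double baseline)  /* score: lower is worse */
{
    return (baseline - measured) / baseline * 100.0;
}

int main(void)
{
    printf("kernbench nested vs. host:  %.1f%%\n", overhead(406.3, 324.3));     /* 25.3 */
    printf("kernbench nested vs. guest: %.1f%%\n", overhead(406.3, 355.0));     /* 14.5 */
    printf("SPECjbb nested vs. host:    %.1f%%\n", degradation(77065, 90493));  /* 14.8 */
    printf("SPECjbb nested vs. guest:   %.1f%%\n", degradation(77065, 83599));  /*  7.8 */
    return 0;
}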
25. Macro: multi-dimensional paging
[Figure: improvement ratio (y-axis 0 to 3.5) for kernbench, SPECjbb, and netperf, comparing shadow-on-EPT with multi-dimensional paging]
Impact of multi-dimensional paging depends on the rate of page faults
Shadow-on-EPT: every L2 page fault causes multiple L1 exits
Multi-dimensional paging: only EPT violations cause L1 exits
EPT table rarely changes: #(EPT violations) << #(page faults)
Multi-dimensional paging is a huge win for page-fault intensive kernbench
26. Macro: multi-level device assignment
[Figure: netperf throughput (Mbps, left axis, up to 1,000) and %CPU (right axis) for native and for single-level and nested guests using emulation, virtio, and direct device access at each level]
Benchmark: netperf TCP_STREAM (transmit)
Multi-level device assignment best performing option
But: native at 20% CPU, multi-level device assignment at 60% (3x!)
Interrupts considered harmful, cause exit multiplication
27. Macro: multi-level device assignment (sans interrupts)
[Figure: throughput (Mbps) vs. message size (netperf -m, 16 to 512 bytes) for L0 (bare metal), L2 (direct/direct), and L2 (direct/virtio)]
What if we could deliver device interrupts directly to L2 ?
Only 7% difference between native and nested guest!
28. Micro: synthetic worst case CPUID loop
[Figure: CPU cycles per iteration (up to 60,000), split into time in L1, time in L0, and CPU mode switches, for: 1. single-level guest; 2. nested guest, no optimizations; 3. nested guest, optimizations 3.5.1; 4. nested guest, optimizations 3.5.2; 5. nested guest, optimizations 3.5.1 & 3.5.2]
CPUID running in a tight loop is not a real-world workload!
Went from 30x worse to “only” 6x worse
A nested exit is still expensive—minimize both single exit
cost and frequency of exits
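For reference, the benchmark itself is simple to reproduce; this sketch (GCC inline assembly, x86-64, arbitrary iteration count) times CPUID in a tight loop with RDTSC, since CPUID unconditionally exits to the hypervisor and so isolates exit cost.

/* cpuid_loop.c: synthetic worst case, cycles per CPUID measured with RDTSC. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static inline void cpuid(uint32_t leaf)
{
    uint32_t a, b, c, d;
    __asm__ volatile ("cpuid"
                      : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                      : "a"(leaf), "c"(0));
}

int main(void)
{
    const uint64_t iters = 1000000;
    uint64_t start = rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        cpuid(0);                        /* traps to the hypervisor every iteration */
    uint64_t cycles = rdtsc() - start;
    printf("%.0f cycles per CPUID on average\n", (double)cycles / iters);
    return 0;
}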
29. Conclusions
Efficient nested x86 virtualization is challenging but feasible
A whole new ballpark opening up many exciting applications—security, cloud, architecture, . . .
Current overhead of 6-14%
Negligible for some workloads, not yet for others
Work in progress—expect at most 5% eventually
Code is available
Why Turtles?
It’s turtles all the way down