Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs

VEE’16
04/02/2016
Shoot4U: Using VMM Assists to Optimize
TLB Operations on Preempted vCPUs
Jiannan Ouyang, John Lange
University of Pittsburgh
Haoqiang Zheng
VMware Inc.

CPU Consolidation in the Cloud
2
CPU Consolidation: multiple virtual CPUs (vCPUs) share the
same physical CPU (pCPU).
Motivation: Improve datacenter utilization.
Figure 1. Average activity distribution of a typical shared Google clusters
including Online Services, each containing over 20,000 servers, over a
period of 3 months [Barroso 13].

Problems with Preempted vCPUs
3
PP P P
A
B
A
B
B
A
A
B
A vCPU of VM-A B vCPU ofVM-B
Preempted
Running
pCPUP
Performance problems:
Busy-waiting based kernel synchronization operations
—  Lock Holder Preemption problem
—  Lock Waiter Preemption problem
—  TLB Shootdown Preemption problem

Lock Holder Preemption
4
Lock holder preemption [Uhlig 04, Friebel 08]
—  A preempted vCPU is holding a spinlock
—  Causes dramatically longer lock waiting time
—  context switch latency + CPU shares allocated to other vCPUs
SchedulingTechniques
—  co-scheduling, relaxed co-scheduling [VMware 10]
—  Adaptive co-scheduling [Weng HPDC11]
—  Balanced scheduling [Sukwong EuroSys11]
—  Demand-based coordinated scheduling [Kim ASPLOS13]
HardwareAssistedTechniques
—  Intel Pause-Loop Exiting (PLE) [Riel 11]

Lock Waiter Preemption [Ouyang VEE13]
5
Linux uses a FIFO order fair spinlock, named ticket spinlock
i i+1 i+2 i+3
Lock waiter preemption
—  A lock waiter is preempted, and blocks the queue
—  P(waiter preemption) > P(holder preemption)
0 T 2TTimeout: 3T
PreemptableTicket Spinlock
—  Key idea: proportional timeout

TLB Shootdown Preemption
6
KVM Paravirt Remote FlushTLB [kvmtlb 12]
—  VMM maintains vCPU preemption states and shares with the
guest.
—  Use conventional approach if the remote vCPU is running.
—  DeferTLB flush if the remote vCPU is preempted.
—  Cons: preemption state may change after checking.
TLB shootdown IPIs as scheduling heuristics [Kim ASPLOS13]
Shoot4U
—  Goal: eliminate the problem
—  Key idea: invalidate guestTLB entries from theVMM

Contributions
7
—  An analysis of the impact that various low level
synchronization operations have on system benchmark
performance.
—  Shoot4U:A novel virtualizedTLB architecture that ensures
consistently low latencies for synchronizedTLB operations.
—  An evaluation of the performance benefits achieved by
Shoot4U over current state-of-art software and hardware
assisted approaches.

Overhead of CPU Consolidation
9
0
2
4
6
8
10
12
14
16
18
20
blackscholes
bodytrack
canneal
dedup
ferret
freqmine
raytrace
streamcluster
swaptions
vips
x264
Slowdown
70.6
max ideal slowdown
PARSEC Runtime with co-locatedVM over running alone
(12-coreVMs, measured on Linux/KVM, with PLE disabled)

CPU Usage Profiling (perf)
10
0
10
20
30
40
50
60
70
80
90
100
blackscholes
bodytrack
canneal
dedup
ferret
freqmine
raytrace
streamcluster
swaptions
vips
x264
blackscholes
bodytrack
canneal
dedup
ferret
freqmine
raytrace
streamcluster
swaptions
vips
x264
Percentage(%)
1VM 2VM
k:lock k:tlb k:other u:*

CDF of TLB Shootdown Latency (ktap)
11
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
100
101
102
103
104
105
106
CumulativePercent
Latency (us)
1VM
2VM

How TLB Shootdown Works in VMs
12
—  TLB (Translation Lookaside Buffer)
—  a per-core hardware cache for page table translation results
—  TLB coherence is managed by the OS
—  TLB shootdown operations: IPI + invlpg
—  LinuxTLB shootdown is busy-waiting based
VMM
PP P P
A A B A
1. vIPIs
3. pIPIs
2. trap 4. inject virtual interrupts
5. vCPU is scheduled
(TLB Shootdown Preemption)
B B A B
6. invalidation & ACK

Shoot4U
14
Observation: modern hardware allows theVMM to invalidate guest
TLB entries (e.g. Intel invpid)
Key idea: invalidate guestTLB entries from theVMM
—  Tell theVMM whatTLB entries and vCPUs to invalidate (hypervall)
—  TheVMM invalidates and returns, no interrupt injection and waiting
2. pIPIs
1.  hypercall
<vcpu set, addr range> 3. invalidation andACKVMM
PP P P
A A B A
B B A B

Implementation
15
KVM/Linux 3.16, ~200 LOC (~50 LOC guest side)
—  https://github.com/ouyangjn/shoot4u
Guest
—  use hypercall forTLB shootdowns
VMM
—  hypercall handler: vCPU set => pCPU set, and send IPIs
—  IPI handler: invalidate guestTLB entries with invpid
kvm_hypercall3(unsigned long KVM_HC_SHOOT4U, unsigned long vcpu_bitmap,
unsigned long start_addr, unsigned long end_addr);
Shoot4U API

Evaluation
16
Dual-socket Dell R450 server
—  6-core Intel “Ivy-Bridge” Xeon processors with hyperthreading
—  24 GB RAM split across two NUMA domains.
—  CentOS 7 (Linux 3.16)
Virtual Machines
—  12 vCPUs, 4G RAM on the same socket
—  Fedora 19 (Linux 4.0)
—  VM1: PARSEC Benchmark Suite,VM2 sysbench CPU test
Schemes
—  baseline: unmodified Linux kernel
—  kvmtlb [kvmtlb 12]
—  Shoot4U
—  Pause-Loop Exiting (PLE) [Riel 11]
—  PreemptableTicket Spinlock (PMT) [OuyangVEE ’13]

TLB Shootdown Latency (Cycles)
17
Order of magnitude lower latency

TLB Shootdown Latency (CDF)
18
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
100
101
102
103
104
105
106
CumulativePercent
Latency (us)
shoot4u-1VM
shoot4u-2VM
kvmtlb-1VM
kvmtlb-2VM
baseline-1VM
baseline-2VM

Parsec Performance (2-VMs)
19
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
blackscholes
bodytrackcannealdedup
ferret
freqmineraytracestreamcluster
swaptionsvips
x264
NormalizedExecutionTime
baseline
ple
pmt
pmt+kvmtlb
pmt+shoot4u
ple+pmt+shoot4u

Revisiting Performance Slowdown
20
0
2
4
6
8
10
12
14
16
18
20
blackscholes
bodytrack
canneal
dedup
ferret
freqmine
raytrace
streamcluster
swaptions
vips
x264
Slowdown
70.6
baseline ple+pmt+shoot4u

Revisiting CPU Usage Profiling
21
0
10
20
30
40
50
60
70
80
90
100
blackscholes
bodytrack
canneal
dedup
ferret
freqmine
raytrace
streamcluster
swaptions
vips
x264
blackscholes
bodytrack
canneal
dedup
ferret
freqmine
raytrace
streamcluster
swaptions
vips
x264
Percentage(%) baseline 2VM ple+pmt+shoot4u 2VM
k:lock k:tlb k:other u:*

Conclusions
22
—  We conducted a set of experiments in order to provide a
breakdown of overheads caused by preempted virtual CPU cores,
showing thatTLB operations can have a significant impact on
performance with certain workloads.
—  We Shoot4U, an optimization forTLB shootdown operations that
internalizesTLB shootdowns in theVMM and so no longer
requires the involvement of a guest’s vCPUs.
—  Our evaluation demonstrates the effectiveness of our approach,
and illustrates how under certain workloads our approach is
dramatically better than state-of-the-art techniques.

23
https://github.com/ouyangjn/shoot4u

Q & A
24
Jiannan Ouyang
Ph.D. Candidate
ouyang@cs.pitt.edu
http://www.cs.pitt.edu/~ouyang/

The Prognostic Lab
http://www.prognosticlab.org

Pisces Co-Kernel
Kitten Lightweight Kernel
PalaciosVMM

References
25
—  [Ouyang 13] Jiannan Ouyang and John R. Lange. PreemptableTicket
Spinlocks: Improving Consolidated Performance in the Cloud. In Proc.
9thACM SIGPLAN/SIGOPS International Conference onVirtual
Execution Environments (VEE), 2013.
—  [Uhlig 04]Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and
Uwe Dannowski.Towards scalable multiprocessor virtual
machines. In Proceedings of the 3rd conference onVirtual
Machine ResearchAndTechnology Symposium -Volume 3,
VM’04, 2004.
—  [Friebel 08]Thomas Friebel. How to deal with lock-holder
preemption. Presented at the Xen Summit NorthAmerica, 2008.
—  [Kim ASPLOS’13] H. Kim, S. Kim, J. Jeong, J. Lee, and S. Maeng.
Demand- based Coordinated Scheduling for SMPVMs. In Proc.
Inter- national Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), 2013.

26
—  [VMware 10]VMware(r) vSphere(tm):The cpu scheduler in
vmware esx(r) 4.1.Technical report,VMware, Inc, 2010.
—  [Barroso 13] L.A. Barroso, J. Clidaras, and U. Holzle.The
Datacenter as a Computer:An Introduction to the Design of
Warehouse- Scale Machines. Synthesis Lectures on Computer Architec-
ture, 2013.
—  [Weng HPDC’11] C.Weng, Q. Liu, L.Yu, and M. Li. Dynamic
Adaptive Scheduling forVirtual Machines. In Proc.20th
International Symposium on High Performance Parallel and Distributed
Computing (HPDC), 2011.
—  [Sukwong EuroSys’11] O. Sukwong and H. S. Kim. Is Co-
schedulingToo Expensive for SMPVMs? In Proc.6th European
Conference on Com- puter Systems (EuroSys), 2011.

27
—  [Riel 11] R. v. Riel. Directed yield for pause loop exiting, 2011.
URL http://lwn.net/Articles/424960/.
—  [kvmtlb 12] KVM Paravirt Remote FlushTLB. https://lwn.net/
Articles/500188/.

Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (20)

Similaire à Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs

Similaire à Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs (20)

Dernier

Dernier (20)

Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs