In the last few years, multi-core processors have entered the domain of embedded systems; together with virtualization techniques, this allows multiple applications to easily run on the same System-on-Chip (SoC). As power consumption remains one of the most significant costs of any digital system, several approaches have been explored in the literature to cope with power caps while trying to maximize the performance of the hosted applications. In this paper, we present some preliminary results and opportunities towards a performance-aware power capping orchestrator for the Xen hypervisor. The proposed solution, called XeMPUPiL, uses the Intel Running Average Power Limit (RAPL) hardware interface to set a strict limit on the processor’s power consumption, while a software-level Observe-Decide-Act (ODA) loop explores the available resource allocations to find the most power-efficient one for the running workload. We show how XeMPUPiL achieves higher performance under different power caps for almost all the classes of benchmarks analyzed (CPU-, memory- and IO-bound).
Full paper: http://ceur-ws.org/Vol-1697/EWiLi16_17.pdf
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xen hypervisor
1. Towards a performance-aware power capping orchestrator for the Xen hypervisor
Marco Arnaboldi, Matteo Ferroni, Marco D. Santambrogio
EWiLi’16, 10/06/2016, Pittsburgh PA, USA.
Co-located with the Embedded Systems Week.
2. Outline
• Introduction and system requirements
• Problem definition and proposed solution
• Related work
• XeMPUPiL Goals
• System design and implementation
• Experimental evaluation and results
• Conclusion and future work
3. Introduction
• Computing systems have changed considerably in the last few decades
– multi-core processors have entered the domain of embedded systems
• Wide range of application fields
– automotive, Internet TV, mobile, …
– other embedded use cases like low-power
microservers for lightweight scale-out
workloads
• Fog Computing brings computation “to the edge of the Cloud” by exploiting fog nodes
– use case: latency-sensitive and security-critical applications
[Figure: the Servers – Fog – IoT continuum]
4. Requirement 1: Portability
• Applications need to be PORTABLE between the Cloud and the Fog
– Hardware-assisted and software
virtualization enter the context of
embedded systems
• Features:
– applications do not need to be changed
– physical resources shared between
applications
– strong security and isolation guarantees
5. Requirement 2: Power consumption
• Nodes may be POWER CONSTRAINED
– Power management techniques to
control power consumption
• Limit power consumption of a machine to a
fixed “cap”, with the following features:
– timeliness: the ability of the system to enforce a new cap rapidly
– efficiency: maximize the performance
delivered by the applications under a
fixed power cap
6. Problem definition and Proposed solution
• One problem, two points of view:
– minimize power consumption given a minimum
performance requirement
– maximize performance given a limit on the maximum
power consumption
• Proposed solution:
– XeMPUPiL, a performance-aware power capping
orchestrator for the Xen hypervisor
7. Power capping approaches
Hardware power capping (e.g., Intel RAPL [1])
– Description: exploits DVFS to control power consumption
– PRO: very fast (~350ms [1])
– CONS: no control over the performance of applications

Software-level resource management
– Description: manages resource allocation to achieve the desired power consumption
– PRO: it is possible to tune the performance of applications
– CONS: slow compared to RAPL (double-digit degradation)
8. Power capping approaches
[Taxonomy of power capping approaches]
• Software approaches (✓ efficiency, ✖ timeliness): model-based monitoring [3], thread migration [2], resource management (CPU quota)
• Hardware approaches (✖ efficiency, ✓ timeliness): DVFS [4], RAPL [1]
[1] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le. RAPL: Memory power estimation and capping. In International Symposium on Low Power Electronics and Design (ISLPED), 2010.
[2] R. Cochran, C. Hankendi, A. K. Coskun, and S. Reda. Pack & Cap: adaptive DVFS and thread packing under power caps. In International Symposium on Microarchitecture (MICRO), 2011.
[3] M. Ferroni, A. Cazzola, D. Matteo, A. A. Nacci, D. Sciuto, and M. D. Santambrogio. MPower: gain back your Android battery life! In Proceedings of the 2013 ACM Conference on Pervasive and Ubiquitous Computing Adjunct Publication, pages 171–174. ACM, 2013.
[4] T. Horvath, T. Abdelzaher, K. Skadron, and X. Liu. Dynamic voltage scaling in multitier web servers with end-to-end delay control. IEEE Transactions on Computers, 2007.
9. Power capping approaches
[Same taxonomy as the previous slide]
• Hybrid approach (✓ efficiency, ✓ timeliness): combines the timeliness of hardware capping with the efficiency of software resource management
10. Related work: PUPiL [5]
[5] H. Zhang and H. Hoffmann. Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques. In International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), 2016.
• PUPiL, a hybrid power capping framework that aims to deliver both the timeliness of hardware techniques and the efficiency of software ones
• Proposed approach:
– combines both hardware (i.e., the Intel RAPL interface [1]) and software (i.e., resource partitioning and allocation) techniques
– exploits a canonical ODA control loop, one of the main building blocks of
self-aware computing
• Limitations
– applications running on the system need to be instrumented with the Heartbeat framework, to provide a uniform throughput metric
– applications must run bare-metal on Linux
• These conditions might not hold in the context of a multi-tenant
virtualized environment
11. Goals
• We want to extend this approach to:
– work in a virtualized environment, based on the Xen
hypervisor
– avoid instrumentation of the guest workloads, as each
tenant is seen as a “black box”
• We then need to:
1. identify a performance metric for all the hosted tenants
2. improve the decision phase, to deal with the requirements
of a virtualized environment
3. extend the hypervisor to provide the right knobs to work
with our orchestrating logic
12. The Xen Hypervisor
Slides from: http://www.slideshare.net/xen_com_mgr/xpds16-porting-xen-on-arm-to-a-new-soc-julien-grall-arm
13. 1. Performance metric identification
• Hardware event counters as low-level metrics of performance
• We exploit the Intel Performance Monitoring Unit (PMU) to monitor the number of Instructions Retired (IR) accounted to each domain in a certain time window
– an insight into how many instructions were completely executed (i.e., successfully reached the end of the pipeline)
– it represents a reasonable indicator of performance, as the manufacturer itself suggests [6] (see the access sketch below)
[6] Clockticks per instructions retired (cpi). https://software.intel.com/en-us/node/544403. Accessed: 2016-06-01.
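For concreteness, the following is a minimal sketch of what kernel-level access to this counter can look like on recent Intel cores. The MSR addresses come from the Intel SDM; rdmsrl/wrmsrl are the MSR helpers the Xen kernel already provides. This illustrates the mechanism only, not the actual XeMPower code.

```c
/* Minimal sketch (not the actual XeMPower code): program the Intel
 * fixed-function counter that counts INST_RETIRED.ANY and read it.
 * MSR addresses are from the Intel SDM; rdmsrl/wrmsrl are Xen's
 * kernel-level MSR helpers. */
#define MSR_IA32_FIXED_CTR0        0x309  /* INST_RETIRED.ANY        */
#define MSR_IA32_FIXED_CTR_CTRL    0x38D  /* per-counter enable bits */
#define MSR_IA32_PERF_GLOBAL_CTRL  0x38F  /* global enable           */

static void enable_inst_retired(void)
{
    uint64_t v;

    rdmsrl(MSR_IA32_FIXED_CTR_CTRL, v);
    v |= 0x3;                    /* count fixed ctr 0 in rings 0 and 3 */
    wrmsrl(MSR_IA32_FIXED_CTR_CTRL, v);

    rdmsrl(MSR_IA32_PERF_GLOBAL_CTRL, v);
    v |= 1ULL << 32;             /* globally enable fixed ctr 0 */
    wrmsrl(MSR_IA32_PERF_GLOBAL_CTRL, v);
}

static uint64_t read_inst_retired(void)
{
    uint64_t ir;

    rdmsrl(MSR_IA32_FIXED_CTR0, ir);
    return ir;
}
```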
14. 1. Performance monitoring
[Figure: XeMPower architecture — vCPUs of different domains scheduled on the pCPUs over time; at every context switch the Xen kernel traces hardware events per core (and energy per socket), which XeMPowerDaemon and XeMPowerCLI collect in Dom0]
XeMPower: tracing the Domains’ behavior
Collect and account hardware events to virtual tenants in two steps (sketched below):
1. In the Xen scheduler (kernel-level)
• at every context switch, trace the interesting hardware events (e.g., INST_RET)
2. In Domain 0 (privileged tenant)
• periodically acquire the event traces and aggregate them on a per-domain basis
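A minimal sketch of the kernel-level half of this scheme, under the assumption that the scheduler charges the instructions retired since the last switch to the domain being descheduled. All identifiers are illustrative, not the actual XeMPower ones; DEFINE_PER_CPU/this_cpu are Xen's per-CPU helpers, and read_inst_retired() is the reader sketched earlier.

```c
/* Illustrative accounting hook (hypothetical names): on every context
 * switch, attribute the IR delta since the previous switch to the
 * outgoing domain, then reset the per-pCPU baseline. */
struct dom_stats {
    uint64_t inst_retired;            /* IR accumulated for this domain */
};

static DEFINE_PER_CPU(uint64_t, last_ir);  /* per-pCPU counter snapshot */

static void trace_context_switch(struct dom_stats *prev_dom)
{
    uint64_t now = read_inst_retired();    /* from the earlier sketch */
    uint64_t *last = &this_cpu(last_ir);

    prev_dom->inst_retired += now - *last; /* charge the outgoing domain */
    *last = now;                           /* baseline for the next one  */
}
```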
16. 2. Decision phase and virtualization
• Evaluation criterion: the average IR rate over a certain time window
– the time window allows the workload to adapt to the current configuration
– comparing the IR rates of different configurations highlights which one makes the workload perform better (see the sketch after this slide)
• Resource allocation granularity: core-level
– each domain owns a set of virtual CPUs (vCPUs)
– a set of physical CPUs (pCPUs) is present on the machine
– each vCPU can be mapped on a pCPU for a certain amount of time, while multiple vCPUs can be mapped on the same pCPU
• We want our allocation to cover the whole set of pCPUs, if possible
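A sketch of this criterion under the assumptions above: each explored configuration is summarized by its IR count over the observation window, and the one with the higher average rate wins. Structure and function names are ours, not XeMPUPiL's.

```c
#include <stdint.h>

/* One explored configuration, summarized over its observation window
 * (illustrative layout, not XeMPUPiL's internal state). */
struct config_sample {
    unsigned int ncores;      /* pCPUs assigned in this configuration */
    uint64_t ir_delta;        /* instructions retired over the window */
    uint64_t window_us;       /* window length, in microseconds       */
};

/* Average IR rate: instructions retired per microsecond. */
static double ir_rate(const struct config_sample *s)
{
    return (double)s->ir_delta / (double)s->window_us;
}

/* The configuration with the higher IR rate is the one that makes the
 * workload perform better under the same power cap. */
static const struct config_sample *
better_config(const struct config_sample *a, const struct config_sample *b)
{
    return ir_rate(a) >= ir_rate(b) ? a : b;
}
```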
17. 3. Extending the hypervisor - RAPL
• Working with the Intel RAPL interface:
– it acts by sharply cutting the frequency and the voltage of the whole CPU socket
• On a bare-metal operating system:
– reading and writing the right Model Specific Registers (MSRs) (see the sketch below)
• MSR_RAPL_POWER_UNIT: read the processor-specific time, energy and power units, used to scale each value read or written
• MSR_PKG_RAPL_POWER_LIMIT: written to set a limit on the power consumption of the whole socket
• In a virtualized environment:
– the Xen hypervisor does not natively support the RAPL interface
– we developed custom hypercalls, with kernel callback functions and
memory buffers
– we developed a CLI tool that performs some checks on the input parameters and instantiates and invokes the Xen command interface to launch the hypercalls
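For illustration, a hedged sketch of the bare-metal MSR programming described above, following the register layout in the Intel SDM (Sandy Bridge and later); the time-window bits and error handling are omitted. In XeMPUPiL the equivalent accesses live inside the hypervisor, behind the custom hypercalls, since Dom0 cannot touch these MSRs directly.

```c
/* Hedged sketch of bare-metal RAPL programming (register layout from
 * the Intel SDM; kernel context assumed for rdmsrl/wrmsrl). */
#define MSR_RAPL_POWER_UNIT       0x606
#define MSR_PKG_RAPL_POWER_LIMIT  0x610

static void set_pkg_power_cap(unsigned int watts)
{
    uint64_t units, limit;
    unsigned int pu;             /* power expressed in (1/2^pu) W units */

    rdmsrl(MSR_RAPL_POWER_UNIT, units);
    pu = units & 0xF;                      /* bits 3:0 hold the power unit */

    rdmsrl(MSR_PKG_RAPL_POWER_LIMIT, limit);
    limit &= ~0x7FFFULL;                   /* clear power limit #1 (bits 14:0) */
    limit |= (uint64_t)watts << pu;        /* cap, converted to RAPL units     */
    limit |= 1ULL << 15;                   /* bit 15 enables the limit         */
    wrmsrl(MSR_PKG_RAPL_POWER_LIMIT, limit);
}
```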
18. 3. Extending the hypervisor - Resources
• cpupool tool:
– allows clustering the physical CPUs into different pools
– the pool scheduler schedules the domain’s vCPUs only on the pCPUs that are part of that cluster
– as a new resource allocation is chosen by the decide phase, we increase or decrease the number of pCPUs in the pool
– and pin the domain’s vCPUs to them, to increase workload stability (see the sketch below)
• No xenpm:
– xenpm can set a maximum and minimum frequency for each pCPU
– but it may interfere with the actuation performed by RAPL
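A sketch of what this actuation can look like when driven through the standard xl toolstack: grow the workload's pool one pCPU at a time and pin vCPUs 1:1. The pool name, the 1:1 pinning policy and the system() shortcut are illustrative simplifications; the actual orchestrator invokes the Xen command interface directly.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical act-phase step: resize the pool backing the workload
 * and pin its vCPUs 1:1 onto the pool's pCPUs. "workload-pool" is an
 * illustrative name; in practice the pCPUs must first be freed from
 * Pool-0 and all return codes should be checked. */
static void apply_allocation(int domid, int ncpus)
{
    char cmd[128];

    for (int cpu = 0; cpu < ncpus; cpu++) {
        /* Grow the pool by one pCPU... */
        snprintf(cmd, sizeof(cmd),
                 "xl cpupool-cpu-add workload-pool %d", cpu);
        system(cmd);
        /* ...and pin the matching vCPU onto it. */
        snprintf(cmd, sizeof(cmd),
                 "xl vcpu-pin %d %d %d", domid, cpu, cpu);
        system(cmd);
    }
}
```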
22. System Design
• The Instructions Retired (IR) metric is gathered and accounted to each domain, thanks to XeMPower
• The aggregation is done over a time window of 1 second (sketched below)
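A minimal observe-phase sketch on the Dom0 side, assuming a hypothetical read_domain_ir() accessor for the per-domain counters that XeMPower exports:

```c
#include <stdint.h>
#include <unistd.h>

extern uint64_t read_domain_ir(int domid);   /* hypothetical accessor */

/* Sample the per-domain IR counter across a 1-second window and
 * return the rate (instructions retired per second). */
static uint64_t observe_ir_rate(int domid)
{
    uint64_t before = read_domain_ir(domid);

    sleep(1);                                /* 1-second aggregation window */
    return read_domain_ir(domid) - before;
}
```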
24. System Design
– given a workload with M virtual resources and an assignment of N physical resources, to each pCPUi we assign a share of the vCPUs (the formula appeared as an image in the original slide; a plausible reconstruction follows):
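One plausible balanced assignment, stated as our assumption rather than the paper's exact formula: spread the M vCPUs as evenly as possible over the N pCPUs.

```latex
% Assumed reconstruction, not the paper's verbatim formula.
\[
  v_i =
  \begin{cases}
    \lceil M/N \rceil  & \text{if } i < M \bmod N,\\
    \lfloor M/N \rfloor & \text{otherwise,}
  \end{cases}
  \qquad i = 0, \dots, N-1
\]
```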
25. System Design
• Hybrid actuation:
– enforce power cap via RAPL
– define a CPU pool for the workload and pin workload’s vCPUs over pCPUs
28. Experimental Setup
• Server setup (aka Sandy)
– 2.8-GHz quad-core Intel Xeon E5-1410 processor, HT disabled (4 cores)
– 32GB of RAM
– Xen hypervisor version 4.4
– paravirtualized instance of Ubuntu 14.04 as Dom0, pinned on the first 4 cores and with 4GB of RAM
• Benchmarking
– Embarrassingly Parallel (EP) [1]
– IOzone [3]
– cachebench [2]
– Block Tri-diagonal solver (BT) [1]
              EP    IOzone   cachebench   BT
CPU-bound     YES   NO       NO           YES
IO-bound      NO    YES      NO           YES
memory-bound  NO    NO       YES          YES

[1] NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html#url. Accessed: 2016-06-01.
[2] OpenBenchmarking.org. https://openbenchmarking.org/test/pts/cachebench. Accessed: 2016-06-01.
[3] IOzone filesystem benchmark. http://www.iozone.org. Accessed: 2016-06-01.
29. Experimental evaluation
• Experimental evaluation:
1. how do different workloads perform under a power cap?
2. can we achieve higher efficiency w.r.t. the RAPL power cap?
• Three power caps explored: 40W, 30W and 20W
– in idle state, the entire socket consumes around 17W
– the maximum power consumption we measured was
around 43W
• Results are normalized with respect to the performance
obtained with no power caps
30. Experimental Results
[Plot: performance of EP, cachebench, IOzone and BT, normalized to the uncapped run, under NO RAPL, RAPL 40, RAPL 30 and RAPL 20]
• Preliminary evaluation: how do the workloads perform under a power cap?
• For CPU-bound benchmarks (i.e., EP and BT), the differences are significant w.r.t. benchmarks where the bottleneck is on IO and/or memory accesses
31. Experimental Results
[Same plot as the previous slide]
• With IO- and/or memory-bound workloads, the performance
degradation is less significant between different power caps
32. Experimental Results
[Plots: normalized performance of EP, cachebench, IOzone and BT — XeMPUPiL (PUPiL) vs. RAPL at 40W, 30W and 20W caps]
• Performance of the workloads with XeMPUPiL, for different power caps:
– higher performance than RAPL, in general
– not always true on a pure CPU-bound benchmark (i.e., EP)
34. Experimental Results
[Same plots as the previous slide]
• XeMPUPiL improves the performance of the IO-bound, the memory-bound and the mixed benchmarks w.r.t. the system with no constraints:
– just one core assigned for IOzone and cachebench
– two cores for the BT benchmark
• These allocations are more power-efficient, as they reduce memory and IO contention for workloads that are not strictly CPU-bound
35. Conclusion and Future Work
• Conclusions
– performance tuning through an ODA controller under a power cap improves performance
• Future work
– improving the decide phase
• a better algorithm to reduce convergence time
• a more general approach to improve portability
– improving the act phase
• implementation of a custom fine-grained tool for resource management in Xen