In the last few years, multi-core processors have entered the domain of embedded systems; together with virtualization techniques, this allows multiple applications to easily run on the same System-on-Chip (SoC). As power consumption remains one of the most significant costs of any digital system, several approaches have been explored in the literature to cope with power caps while trying to maximize the performance of the hosted applications. In this paper, we present some preliminary results and opportunities towards a performance-aware power capping orchestrator for the Xen hypervisor. The proposed solution, called XeMPUPiL, uses the Intel Running Average Power Limit (RAPL) hardware interface to set a strict limit on the processor’s power consumption, while a software-level Observe-Decide-Act (ODA) loop explores the available resource allocations to find the most power-efficient one for the running workload. We show how XeMPUPiL achieves higher performance under different power caps for almost all the classes of benchmarks analyzed (CPU-, memory- and IO-bound).
Full paper: http://ceur-ws.org/Vol-1697/EWiLi16_17.pdf
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xen hypervisor
1. Towards a performance-aware power capping orchestrator for the Xen hypervisor
Marco Arnaboldi, Matteo Ferroni, Marco D. Santambrogio
EWiLi’16, 10/06/2016, Pittsburgh PA, USA.
Co-located with the Embedded Systems Week.
2. Outline
• Introduction and system requirements
• Problem definition and proposed solution
• Related work
• XeMPUPiL Goals
• System design and implementation
• Experimental evaluation and results
• Conclusion and future work
3. Introduction
• Computing systems have changed considerably in the last few decades
– multi-core processors have entered the domain of embedded systems
• Wide range of application fields
– automotive, Internet TV, mobile, …
– other embedded use cases like low-power
microservers for lightweight scale-out
workloads
• Fog Computing brings computation “to the edge of the Cloud” by exploiting fog nodes
– use case: latency-sensitive and security-critical applications
[Figure: the Servers – Fog – IoT continuum]
4. Requirement 1: Portability
• Applications need to be PORTABLE between the Cloud and the Fog
– Hardware-assisted and software
virtualization enter the context of
embedded systems
• Features:
– applications do not need to be changed
– physical resources shared between
applications
– strong security and isolation guarantees
5. Requirement 2: Power consumption
• Nodes may be POWER CONSTRAINED
– Power management techniques to
control power consumption
• Limit power consumption of a machine to a
fixed “cap”, with the following features:
– timeliness: the ability of the system to enforce a new cap rapidly
– efficiency: maximize the performance
delivered by the applications under a
fixed power cap
6. Problem definition and Proposed solution
• One problem, two points of view:
– minimize power consumption given a minimum
performance requirement
– maximize performance given a limit on the maximum
power consumption
• Proposed solution:
– XeMPUPiL, a performance-aware power capping
orchestrator for the Xen hypervisor
7. Power capping approaches
Hardware power capping (e.g., Intel RAPL [1])
– Description: exploits DVFS to control power consumption
– PRO: very fast (~350ms [1])
– CONS: no control over the performance of applications

Software-level resource management
– Description: manages resource allocation to achieve the desired power consumption
– PRO: it is possible to tune the performance of applications
– CONS: slow compared to RAPL (double-digit degradation)
8. Power capping approaches
[Taxonomy of power capping approaches]
• Software approaches (✓ efficiency, ✖ timeliness): model-based monitoring [3], thread migration [2], resource management (CPU quota)
• Hardware approaches (✖ efficiency, ✓ timeliness): DVFS [4], RAPL [1]
[1] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le. RAPL: Memory power estimation and capping. In International Symposium on Low Power Electronics and Design (ISLPED), 2010.
[2] R. Cochran, C. Hankendi, A. K. Coskun, and S. Reda. Pack & Cap: adaptive DVFS and thread packing under power caps. In International Symposium on Microarchitecture (MICRO), 2011.
[3] M. Ferroni, A. Cazzola, D. Matteo, A. A. Nacci, D. Sciuto, and M. D. Santambrogio. MPower: gain back your Android battery life! In Proceedings of the 2013 ACM Conference on Pervasive and Ubiquitous Computing Adjunct Publication, pages 171–174. ACM, 2013.
[4] T. Horvath, T. Abdelzaher, K. Skadron, and X. Liu. Dynamic voltage scaling in multitier web servers with end-to-end delay control. IEEE Transactions on Computers, 2007.
9. Power capping approaches
[Same taxonomy as the previous slide]
• Hybrid approach (✓ efficiency, ✓ timeliness): combines the timeliness of hardware capping with the efficiency of software resource management
10. Related work: PUPiL [5]
[5] H. Zhang and H. Hoffmann. Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques. In International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), 2016.
• PUPiL, a hybrid power capping framework that aims to deliver both the timeliness of hardware techniques and the efficiency of software ones
• Proposed approach:
– combines both hardware (i.e., the Intel RAPL interface [1]) and software (i.e., resource partitioning and allocation) techniques
– exploits a canonical ODA control loop, one of the main building blocks of
self-aware computing
• Limitations
– applications running on the system need to be instrumented with the Heartbeat framework, to provide a uniform throughput metric
– applications must run bare-metal on Linux
• These conditions might not hold in the context of a multi-tenant
virtualized environment
11. Goals
• We want to extend this approach to:
– work in a virtualized environment, based on the Xen
hypervisor
– avoid instrumentation of the guest workloads, as each
tenant is seen as a “black box”
• We then need to:
1. identify a performance metric for all the hosted tenants
2. improve the decision phase, to deal with the requirements
of a virtualized environment
3. extend the hypervisor to provide the right knobs to work
with our orchestrating logic
12. The Xen Hypervisor
Slides from: http://www.slideshare.net/xen_com_mgr/xpds16-porting-xen-on-arm-to-a-new-soc-julien-grall-arm
13. 1. Performance metric identification
• Hardware event counters as low-level metrics of performance
• We exploit the Intel Performance Monitoring Unit (PMU) to monitor the number of Instructions Retired (IR) accounted to each domain in a certain time window
– an insight into how many instructions were completely executed (i.e., successfully reached the end of the pipeline)
– it represents a reasonable indicator of performance, as the manufacturer itself suggests [6] (see the access sketch below)
[6] Clockticks per instructions retired (cpi). https://software.intel.com/en-us/node/544403. Accessed: 2016-06-01.
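For concreteness, the following is a minimal sketch of what kernel-level access to this counter can look like on recent Intel cores. The MSR addresses come from the Intel SDM; rdmsrl/wrmsrl are the MSR helpers the Xen kernel already provides. This illustrates the mechanism only, not the actual XeMPower code.

```c
/* Minimal sketch (not the actual XeMPower code): program the Intel
 * fixed-function counter that counts INST_RETIRED.ANY and read it.
 * MSR addresses are from the Intel SDM; rdmsrl/wrmsrl are Xen's
 * kernel-level MSR helpers. */
#define MSR_IA32_FIXED_CTR0        0x309  /* INST_RETIRED.ANY        */
#define MSR_IA32_FIXED_CTR_CTRL    0x38D  /* per-counter enable bits */
#define MSR_IA32_PERF_GLOBAL_CTRL  0x38F  /* global enable           */

static void enable_inst_retired(void)
{
    uint64_t v;

    rdmsrl(MSR_IA32_FIXED_CTR_CTRL, v);
    v |= 0x3;                    /* count fixed ctr 0 in rings 0 and 3 */
    wrmsrl(MSR_IA32_FIXED_CTR_CTRL, v);

    rdmsrl(MSR_IA32_PERF_GLOBAL_CTRL, v);
    v |= 1ULL << 32;             /* globally enable fixed ctr 0 */
    wrmsrl(MSR_IA32_PERF_GLOBAL_CTRL, v);
}

static uint64_t read_inst_retired(void)
{
    uint64_t ir;

    rdmsrl(MSR_IA32_FIXED_CTR0, ir);
    return ir;
}
```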
14. 1. Performance monitoring
[Figure: XeMPower architecture — vCPUs of different domains scheduled on the pCPUs over time; at every context switch the Xen kernel traces hardware events per core (and energy per socket), which XeMPowerDaemon and XeMPowerCLI collect in Dom0]
XeMPower: tracing the Domains’ behavior
Collect and account hardware events to virtual tenants in two steps (sketched below):
1. In the Xen scheduler (kernel-level)
• at every context switch, trace the interesting hardware events (e.g., INST_RET)
2. In Domain 0 (privileged tenant)
• periodically acquire the event traces and aggregate them on a per-domain basis
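A minimal sketch of the kernel-level half of this scheme, under the assumption that the scheduler charges the instructions retired since the last switch to the domain being descheduled. All identifiers are illustrative, not the actual XeMPower ones; DEFINE_PER_CPU/this_cpu are Xen's per-CPU helpers, and read_inst_retired() is the reader sketched earlier.

```c
/* Illustrative accounting hook (hypothetical names): on every context
 * switch, attribute the IR delta since the previous switch to the
 * outgoing domain, then reset the per-pCPU baseline. */
struct dom_stats {
    uint64_t inst_retired;            /* IR accumulated for this domain */
};

static DEFINE_PER_CPU(uint64_t, last_ir);  /* per-pCPU counter snapshot */

static void trace_context_switch(struct dom_stats *prev_dom)
{
    uint64_t now = read_inst_retired();    /* from the earlier sketch */
    uint64_t *last = &this_cpu(last_ir);

    prev_dom->inst_retired += now - *last; /* charge the outgoing domain */
    *last = now;                           /* baseline for the next one  */
}
```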
16. 2. Decision phase and virtualization
• Evaluation criterion: the average IR rate over a certain time window
– the time window allows the workload to adapt to the current configuration
– comparing the IR rates of different configurations highlights which one makes the workload perform better (see the sketch after this slide)
• Resource allocation granularity: core-level
– each domain owns a set of virtual CPUs (vCPUs)
– a set of physical CPUs (pCPUs) is present on the machine
– each vCPU can be mapped on a pCPU for a certain amount of time, while multiple vCPUs can be mapped on the same pCPU
• We want our allocation to cover the whole set of pCPUs, if possible
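A sketch of this criterion under the assumptions above: each explored configuration is summarized by its IR count over the observation window, and the one with the higher average rate wins. Structure and function names are ours, not XeMPUPiL's.

```c
#include <stdint.h>

/* One explored configuration, summarized over its observation window
 * (illustrative layout, not XeMPUPiL's internal state). */
struct config_sample {
    unsigned int ncores;      /* pCPUs assigned in this configuration */
    uint64_t ir_delta;        /* instructions retired over the window */
    uint64_t window_us;       /* window length, in microseconds       */
};

/* Average IR rate: instructions retired per microsecond. */
static double ir_rate(const struct config_sample *s)
{
    return (double)s->ir_delta / (double)s->window_us;
}

/* The configuration with the higher IR rate is the one that makes the
 * workload perform better under the same power cap. */
static const struct config_sample *
better_config(const struct config_sample *a, const struct config_sample *b)
{
    return ir_rate(a) >= ir_rate(b) ? a : b;
}
```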
17. 3. Extending the hypervisor - RAPL
• Working with the Intel RAPL interface:
– it acts by sharply cutting the frequency and the voltage of the whole CPU socket
• On a bare-metal operating system:
– reading and writing the right Model Specific Registers (MSRs) (see the sketch below)
• MSR_RAPL_POWER_UNIT: read the processor-specific time, energy and power units, used to scale each value read or written
• MSR_PKG_RAPL_POWER_LIMIT: written to set a limit on the power consumption of the whole socket
• In a virtualized environment:
– the Xen hypervisor does not natively support the RAPL interface
– we developed custom hypercalls, with kernel callback functions and
memory buffers
– we developed a CLI tool that performs some checks on the input parameters and instantiates and invokes the Xen command interface to launch the hypercalls
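For illustration, a hedged sketch of the bare-metal MSR programming described above, following the register layout in the Intel SDM (Sandy Bridge and later); the time-window bits and error handling are omitted. In XeMPUPiL the equivalent accesses live inside the hypervisor, behind the custom hypercalls, since Dom0 cannot touch these MSRs directly.

```c
/* Hedged sketch of bare-metal RAPL programming (register layout from
 * the Intel SDM; kernel context assumed for rdmsrl/wrmsrl). */
#define MSR_RAPL_POWER_UNIT       0x606
#define MSR_PKG_RAPL_POWER_LIMIT  0x610

static void set_pkg_power_cap(unsigned int watts)
{
    uint64_t units, limit;
    unsigned int pu;             /* power expressed in (1/2^pu) W units */

    rdmsrl(MSR_RAPL_POWER_UNIT, units);
    pu = units & 0xF;                      /* bits 3:0 hold the power unit */

    rdmsrl(MSR_PKG_RAPL_POWER_LIMIT, limit);
    limit &= ~0x7FFFULL;                   /* clear power limit #1 (bits 14:0) */
    limit |= (uint64_t)watts << pu;        /* cap, converted to RAPL units     */
    limit |= 1ULL << 15;                   /* bit 15 enables the limit         */
    wrmsrl(MSR_PKG_RAPL_POWER_LIMIT, limit);
}
```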
18. 3. Extending the hypervisor - Resources
• cpupool tool:
– allows clustering the physical CPUs into different pools
– the pool scheduler schedules the domain’s vCPUs only on the pCPUs that are part of that cluster
– as a new resource allocation is chosen by the decide phase, we increase or decrease the number of pCPUs in the pool
– and pin the domain’s vCPUs to them, to increase workload stability (see the sketch below)
• No xenpm:
– xenpm can set a maximum and minimum frequency for each pCPU
– but it may interfere with the actuation performed by RAPL
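A sketch of what this actuation can look like when driven through the standard xl toolstack: grow the workload's pool one pCPU at a time and pin vCPUs 1:1. The pool name, the 1:1 pinning policy and the system() shortcut are illustrative simplifications; the actual orchestrator invokes the Xen command interface directly.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical act-phase step: resize the pool backing the workload
 * and pin its vCPUs 1:1 onto the pool's pCPUs. "workload-pool" is an
 * illustrative name; in practice the pCPUs must first be freed from
 * Pool-0 and all return codes should be checked. */
static void apply_allocation(int domid, int ncpus)
{
    char cmd[128];

    for (int cpu = 0; cpu < ncpus; cpu++) {
        /* Grow the pool by one pCPU... */
        snprintf(cmd, sizeof(cmd),
                 "xl cpupool-cpu-add workload-pool %d", cpu);
        system(cmd);
        /* ...and pin the matching vCPU onto it. */
        snprintf(cmd, sizeof(cmd),
                 "xl vcpu-pin %d %d %d", domid, cpu, cpu);
        system(cmd);
    }
}
```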
22. System Design
• The Instructions Retired (IR) metric is gathered and accounted to each domain, thanks to XeMPower
• The aggregation is done over a time window of 1 second (sketched below)
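A minimal observe-phase sketch on the Dom0 side, assuming a hypothetical read_domain_ir() accessor for the per-domain counters that XeMPower exports:

```c
#include <stdint.h>
#include <unistd.h>

extern uint64_t read_domain_ir(int domid);   /* hypothetical accessor */

/* Sample the per-domain IR counter across a 1-second window and
 * return the rate (instructions retired per second). */
static uint64_t observe_ir_rate(int domid)
{
    uint64_t before = read_domain_ir(domid);

    sleep(1);                                /* 1-second aggregation window */
    return read_domain_ir(domid) - before;
}
```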
24. System Design
– given a workload with M virtual resources and an assignment of N physical resources, to each pCPUi we assign a share of the vCPUs (the formula appeared as an image in the original slide; a plausible reconstruction follows):
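One plausible balanced assignment, stated as our assumption rather than the paper's exact formula: spread the M vCPUs as evenly as possible over the N pCPUs.

```latex
% Assumed reconstruction, not the paper's verbatim formula.
\[
  v_i =
  \begin{cases}
    \lceil M/N \rceil  & \text{if } i < M \bmod N,\\
    \lfloor M/N \rfloor & \text{otherwise,}
  \end{cases}
  \qquad i = 0, \dots, N-1
\]
```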
25. System Design
• Hybrid actuation:
– enforce power cap via RAPL
– define a CPU pool for the workload and pin workload’s vCPUs over pCPUs
28. Experimental Setup
• Server setup (aka Sandy)
– 2.8-GHz quad-core Intel Xeon E5-1410 processor, HT disabled (4 cores)
– 32GB of RAM
– Xen hypervisor version 4.4
– paravirtualized instance of Ubuntu 14.04 as Dom0, pinned on the first 4 cores and with 4GB of RAM
• Benchmarking
– Embarrassingly Parallel (EP) [1]
– IOzone [3]
– cachebench [2]
– Block Tri-diagonal solver (BT) [1]
              EP    IOzone   cachebench   BT
CPU-bound     YES   NO       NO           YES
IO-bound      NO    YES      NO           YES
memory-bound  NO    NO       YES          YES

[1] NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html#url. Accessed: 2016-06-01.
[2] OpenBenchmarking.org. https://openbenchmarking.org/test/pts/cachebench. Accessed: 2016-06-01.
[3] IOzone filesystem benchmark. http://www.iozone.org. Accessed: 2016-06-01.
29. Experimental evaluation
• Experimental evaluation:
1. how do different workloads perform under a power cap?
2. can we achieve higher efficiency w.r.t. the RAPL power cap?
• Three power caps explored: 40W, 30W and 20W
– in idle state, the entire socket consumes around 17W
– the maximum power consumption we measured was
around 43W
• Results are normalized with respect to the performance
obtained with no power caps
30. Experimental Results
[Plot: performance of EP, cachebench, IOzone and BT, normalized to the uncapped run, under NO RAPL, RAPL 40, RAPL 30 and RAPL 20]
• Preliminary evaluation: how do the workloads perform under a power cap?
• For CPU-bound benchmarks (i.e., EP and BT), the differences are significant w.r.t. benchmarks where the bottleneck is on IO and/or memory accesses
31. Experimental Results
[Same plot as the previous slide]
• With IO- and/or memory-bound workloads, the performance
degradation is less significant between different power caps
32. Experimental Results
[Plots: normalized performance of EP, cachebench, IOzone and BT — XeMPUPiL (PUPiL) vs. RAPL at 40W, 30W and 20W caps]
• Performance of the workloads with XeMPUPiL, for different power caps:
– higher performance than RAPL, in general
– not always true on a pure CPU-bound benchmark (i.e., EP)
34. Experimental Results
[Same plots as the previous slide]
• XeMPUPiL improves the performance of the IO-bound, the memory-bound and the mixed benchmarks w.r.t. the system with no constraints:
– just one core assigned for IOzone and cachebench
– two cores for the BT benchmark
• These allocations are more power-efficient, as they reduce memory and IO contention for workloads that are not strictly CPU-bound
35. Conclusion and Future Work
• Conclusions
– performance tuning through an ODA controller under a power cap improves performance
• Future work
– improving the decide phase
• a better algorithm to reduce convergence time
• a more general approach to improve portability
– improving the act phase
• implementation of a custom fine-grained tool for resource management in Xen