The document discusses power-efficient scheduling in the Linux kernel. It proposes moving power management capabilities from cpufreq and cpuidle into the scheduler to allow it to make more informed decisions. Key points include:
- The scheduler currently lacks power/energy information to optimize task placement.
- Cpufreq and cpuidle are not well coordinated with the scheduler.
- A power driver would provide power/topology data to the scheduler.
- Feedback from a kernel summit highlighted the need for use cases and benchmarks to evaluate proposals.
- Patches have been prepared to implement task placement based on CPU suitability.
2. Topics Overview
Timeline
Towards a unified scheduler driven power policy
Task placement based on CPU suitability
Kernel Summit Feedback
Status
Questions?
3. Timeline
May – Ingo's response to the task packing patches from VincentG reignited discussions on power-aware scheduling
Early July – Posted proposed patches for a power-aware scheduler based on a power driver running in conjunction with the current scheduler
Avoid big changes to the already complex current scheduler
Migrate functionality back into the scheduler once the kinks have been worked out
Sept – At Plumbers there was relatively broad agreement with the approach
October – Morten reposted the patchset with refined APIs between the power driver and the scheduler
LKS – Reopened the discussion; more on this later
4. Unified scheduler-driven power policy … why?
big.LITTLE MP patches are tested, stable and performant
Take the principles learnt during the implementation and apply to
an upstream solution
Existing power management frameworks (cpufreq, cpuidle) are not coordinated with the scheduler
E.g. the scheduler decides which cpu to wake up or idle without
having any knowledge about C-states. cpuidle is left to do its best
based on these uninformed choices.
The scheduler is the most obvious place to coordinate power
management, as it has the best view of the overall system load.
The scheduler knows when tasks are scheduled and decides the
load balance. cpufreq has to wait until it can see the result of the
scheduler decisions before it can react.
Task packing in the scheduler needs P and C-state information to
make informed decisions.
5. Existing Power Policies
Frequency scaling: cpufreq
Generic governor + platform specific driver
Decides target frequency based on overall cpu load.
Idle state selection: cpuidle
Generic governor + platform specific driver
Attempts to predict idle time when cpus enter idle.
Scheduler:
Completely generic and unaware of cpufreq and cpuidle policies.
Determines when and where a task runs, i.e. on which cpu.
Task placement that considers CPU suitability is required.
6. Existing Power Policies
[Diagram: per-CPU run queues (cpu0, cpu1) with tasks, each acted on independently by the scheduler policy (load balance), the cpufreq policy (frequency from load), and the cpuidle policy (idle states); annotated with the current-load signal pre-3.11 vs. 3.11.]
No coordination between power policies to avoid
conflicting/suboptimal decisions.
Is it a problem?
7. Issues
Scheduler->cpufreq->scheduler cpu load feedback loop
From 3.11 the scheduler uses tracked load for load-balancing.
Tracked load is impacted by frequency scaling. Lower frequency
leads to higher tracked load for the same task.
Hindering new power-aware scheduling features
Task packing: Needs feedback from cpufreq to determine when cpus
are full.
Topology-aware task placement: Needs topology information inside
the scheduler to determine the best cpus to use when the
system is partially loaded.
Heterogeneous systems (big.LITTLE): Needs topology information
and accurate load tracking.
Thermal also needs to be considered
8. Power scheduler proposal
[Architecture diagram, reconstructed as text:]
Scheduler (fair.c): sched_domain hierarchy (generic topology, plus new generic info for packing and heterogeneous systems); load balance algorithms (existing policy algorithms, plus packing, P- and C-state awareness, heterogeneous support); load tracking (plus scale invariance); "important tasks" cgroup
Power framework (power.c): helper function library; driver registration; abstract power driver/topology interface; library code in drivers/power/?.c
Power driver (drivers/*/?.c): detailed platform topology; platform HW driver; platform performance and energy monitoring; performance state selection; sleep state selection
9. Task placement based on CPU suitability
Part of the power scheduler proposal
sched_domain hierarchy
Load balance algorithm (Heterogeneous)
Existing big.LITTLE MP Patches
Definition: CFS scheduler optimization for heterogeneous platforms.
Attempts to select task affinity to optimize power and performance
based on task load and CPU type
Hosted at
http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git
Co-exists with existing (CFS) scheduler code
Guarded by CONFIG_SCHED_HMP
Set up HMP domains as a dependency of the topology code
Implement big.LITTLE MP functionality inside scheduler mainline code
10. Task placement scheduler architectural bricks
1) Additional sched domain data structures
2) Specify sched domain level for task placement
3) Unweighted instantaneous load signal
4) Task placement hook in select task
5) Task placement hook in load balance
6) Task placement idle pull
11. Brick 1: Additional sched domain data structures
big.LITTLE MP:
struct hmp_domain {
	struct cpumask cpus;
	struct cpumask possible_cpus;
	struct list_head hmp_domains;
};
Task placement based on CPU suitability:
Use the existing sched groups in CPU sched domain level
Add task load ranges into CPU, sched domain and group
12. Brick 2: Specify sched domain level
big.LITTLE MP:
No additional sched domain flag
Deletes SD_LOAD_BALANCE flag in CPU level
Task placement based on CPU suitability:
Adds SD_SUITABILITY flag to CPU level
13. Brick 3: Unweighted instantaneous load signal
big.LITTLE MP & Task placement based on CPU suitability:
For sched entity and cfs_rq
struct sched_avg {
u32 runnable_avg_sum, runnable_avg_period;
u64 last_runnable_update;
s64 decay_count;
unsigned long load_avg_contrib;
unsigned long load_avg_ratio;
};
sched entity: runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1)
cfs_rq: set in [update/enqueue/dequeue]_entity_load_avg()
14. Brick 4: Task placement hook in select task
big.LITTLE MP:
Force new non-kernel tasks onto big CPUs until
load stabilises
Least loaded CPU of big cluster is used
Task placement based on CPU suitability:
Use task load ranges of previous CPU and
(initialized) task load ratio to set new CPU
15. Brick 5: Task placement hook in load balance
big.LITTLE MP:
Completely bypasses load_balance() in CPU level
hmp_force_up_migration() in run_rebalance_domains()
Calls hmp_up_migration() for migration to faster CPU
Calls hmp_offload_down() for using little CPUs when idle
Does not use env->imbalance or an equivalent
Task placement based on CPU suitability:
Happens inside load_balance()
Find most unsuitable queue (i.e. find source run-queue)
Move unsuitable tasks (counterpart to load balance)
Move one unsuitable task (counterpart to active load balance)
Cannot use env->imbalance to control load balance
Using grp_load_avg_ratio/(NICE_0_LOAD * sg->group_weight) <= THRESHOLD
Falls back to the mainline load balance when the condition is not met (destination
group is already overloaded)
16. Brick 6: Task placement idle pull
big.LITTLE MP:
Big CPU pulls running task above the threshold from little CPU
Task placement based on CPU suitability:
Not necessary, because idle_balance()->load_balance() is not
suppressed at the CPU level by a missing SD_LOAD_BALANCE flag
Idle pull happens inside load_balance()
17. Kernel Summit Feedback
Good to get active discussion
First time with everybody in the same room
LWN article - “The power-aware scheduling mini-summit”
Key points made
Power benchmarks are needed for evaluation
Use-case descriptions are needed to define common ground.
The scheduler needs energy/power information to make power-aware
scheduling decisions.
Power-awareness should be moved into the scheduler.
cpufreq is not fit for its purpose and should go away.
cpuidle will be integrated into the scheduler, possibly supported by
new per-task properties, such as latency constraints.
Are there ways to replay energy scenarios?
Linsched or perf sched
18. Kernel Summit feedback observations
All part of the open-source process
Discussions have raised awareness of the issues
Maintainers recognise the need for improved power management
Iterative approach necessary but the steps are clear
Maintainers have a clear server/desktop background
ARM community can help educate this audience on embedded
requirements
Benchmarking for power could be hard to do in a simple way
Cyclictest and sysbench-style tests are unlikely to yield realistic results on real
systems
However, full accuracy not required
Power models are necessarily complex and often closely guarded
secrets
Collection and reporting of meaningful metrics is probably sufficient
19. Status
Latest Power-aware scheduling patches on LKML
https://lkml.org/lkml/2013/10/11/547
Task placement based on CPU suitability patches prepared
Proof of concept done
Waiting for right time to post to lists
Feedback from the Linux Kernel Summit needs to be discussed