1. Load-Balancing for Improving User Responsiveness
on Multicore Embedded Systems
Jul-11, 2012
Geunsik Lim
Samsung Electronics Co., Ltd.
Sungkyunkwan University
2. 2/24
Who am I ?
• Full name: Geunsik Lim
• E-Mail : geunsik.lim@samsung.com, leemgs@gmail.com
• Current : Senior software engineer at Samsung Electronics (http://www.samsung.com)
• Android localization: Korea Android community (http://www.kandroid.org)
• Past: S/W membership manager at Samsung Electronics
Senior engineer at ROBOST company
Systems administrator at Daegu Bank, Ltd.
(South Korea → Ottawa)
4. 4/24
Introduction – Linux Features for Multicore
SMP scheduler (load-balancing): scheduler( ), load_balance( ), migration_thread( )
Synchronization: semaphore, spin-lock, FUTEX, atomic ops, per-CPU variables, RCU, work-queue
Interrupt load-balancing (or the user-space irqbalance daemon)
Affinity (interface that lets the system administrator prevent tasks from migrating to another CPU)
• CPU affinity (shielded CPU)
• I/O affinity
• IRQ affinity
CPUSET (with process containers; cgroups): assigns CPUs and memory on NUMA systems
CPU isolation: isolates a specific CPU (if you don't need load-balancing)
5. 5/24
Introduction – SMP Linux
Change-logs of the Linux kernel for SMP and ARM.

Recent Linux kernels have mature SMP features:
• 2.6.0  SMP scalability (per-CPU data structures)
• 2.6.16 SMP IRQ affinity
• 2.6.24 CPU isolation
• 2.6.28 Block: add support for I/O CPU affinity
• 2.6.32 Enable rq CPU completion affinity by default (significantly speeds up databases)
• 2.6.33 Includes full support for ARM9 MPCore
• 2.6.37 Big Kernel Lock (BKL) retired
• 2.6.38 Improve cpu-cgroup performance for SMP systems significantly by rewriting tg_shares_up
• 2.6.39 Ext4 SMP scalability – SMP speed-ups
• 3.1.0  Block: dynamic writeback throttling – SMP scaling problem fixed; strict CPU affinity
• 3.4.0  Memory resource controller (with cgroups)

The major features for ARM are merged into the mainline kernel:
• 2.6.15 SMP support for ARM11 MPCore
• 2.6.18 SMPnice
• 2.6.36 Support for S5PV310 (ARM Cortex-A9 multi-core)
6. 6/24
Introduction – Considerable Factors for SMP Environment

Problem → Solution
• Corruption of shared resources by concurrent workers (e.g., writers) → use a locking mechanism (e.g., kernel lock facilities, an application-level thread library).
• Synchronization overhead → increase or decrease the level of parallelism suitably.
• Task migration → adjust affinity manually (an ideal OS would schedule tasks automatically).
• Resource contention → run well-programmed software on a well-designed OS scheduler (e.g., cgroups); utilize sched_yield( ).
• False sharing → align allocated data to the cache-line size (via the compiler, where possible).
• Routines used by many agents → implement thread-safe and re-entrant software.
• Cache-line-dependent task migration (e.g., the ping-pong effect) → affinitize tasks to a specific CPU.
• Unfair cache requests → affinitize tasks to a specific CPU.
7. 7/24
Related work - CPU Affinity Policy
This technique affinitizes specific tasks to particular CPUs to avoid load-balancing operations.
• Apparatus and method for improved CPU affinity in a multiprocessor system, R. A. Alfieri, US Patent 5,745,778, 1998. Citations: 167
• Affinity scheduling of processes on symmetric multiprocessing systems, K. D. Abramson, H. B. Butts Jr. et al., US Patent 5,506,987, 1996
• Migration policies for multi-core fair-share scheduling, D. Choffnes, M. Astley, ACM SIGOPS Operating Systems Review, 2008
8. 8/24
Related work - Classification of RT & NRT tasks
This technique physically isolates time-critical tasks onto a specific CPU.
• Shielded CPUs: real-time performance in standard Linux, S. Brosky, Linux Journal, 2004. Citations: 11
• Shielded processors: guaranteeing sub-millisecond response in standard Linux, S. Brosky, Parallel and Distributed Processing, 2003
• A real-time Linux, V. Yodaiken, Proceedings of the Linux Applications, 1997. Citations: 167
9. 9/24
Related work - A Partitioning Method for Multi-processors
These techniques schedule tasks in kernel space by grouping/partitioning them according to their goals.
• Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors, S. Soltesz, H. Pötzl, M. E. Fiuczynski, ACM SIGOPS, 2007. Citations: 169
• Task partitioning: an innovation process variable, Eric von Hippel, MIT Sloan School of Management, Cambridge, MA, 2002
• Process partitioning for distributed embedded systems, CODES '96: Proceedings of the 4th International Workshop on Hardware/Software Co-Design, 1996
10. 10/24
Related work: Load-balancing on Linux for multicore systems
• The load-balancing operation runs periodically whenever the load becomes imbalanced, to keep CPU utilization optimal.
• The problem with this mechanism is that it performs task migration unnecessarily even when the CPUs are not fully (100%) utilized.
• Real-time performance and middleware on multi-core Linux platforms, Yuanfang Zhang, Washington University, 2008
• Load balancing control method for a loosely coupled multi-processor system and a device for realizing same, Toshio Hirosawa, Hitachi, Japan, Patent No. 4,748,558, May 1, 1986
• Improve load balancing when tasks have large weight differential, Nikhil Rao, Google, http://lwn.net/Articles/409860
11. 11/24
Problems of the existing load-balancer
In general, higher CPU load leads to more frequent task migration and thus incurs higher cost. The cost can be broken down into direct, indirect, and latency costs as follows:
1. Direct cost
• The cost of the load-balancing operation itself: checking the load imbalance of the CPUs for utilization and scalability in the multicore system
2. Indirect cost
• Cache invalidation
• Power consumption
3. Latency cost
• Scheduling latency
• Longer non-preemptable period
12. 12/24
Operation zone based load-balancer: Task migration time
The figure shows the points in time, (1)-(3), at which the kernel must inspect whether task migration is needed to keep the CPU load fair.
13. 13/24
Operation zone based load-balancer : Load-balancing operation zone
• The load-balancing operation zone consists of three scheduling-aware control areas.
• The "Cold zone" policy may execute the load-balancing operation loosely, for systems with low CPU utilization.
• The "Hot zone" policy always executes the load-balancing operation eagerly, like the existing mechanism.
• The "Warm zone" policy sits at the middle level, between the "Cold zone" and the "Hot zone".
[Figure: CPU usage (%) from 0 to 100, divided into the Cold zone (no load-balancing), the Warm zone with low/mid/high spots and a fluctuation spot, and the Hot zone (always load-balancing).]
14. 14/24
Operation zone based load-balancer : Calculating CPU utilization
• The Warm zone consists of three spots, managed by a scoring system.
• Controlling tasks is not simple because CPU utilization under the "Warm zone" policy fluctuates; therefore, weight-based score management is supported (see the paper for details).
• Weight-based score management for the Warm zone can be based on the local CPU (the default policy) or on the average across all CPUs.
15. 15/24
Latency factors in kernel-space
• The major factors that cause latency damage in kernel-space: interrupt latency, preemption latency, switching latency, and wake-up latency, together with hardware latency and miscellaneous per-CPU latency, make up the scheduling latency.
16. 16/24
Evaluation environment
[Figure: timeline of an RT task that goes to sleep for 1,000 µs while NRT/lower-priority tasks run for 5,000 µs; on wake-up, the RT task experiences wake-up, preemption, switching, and interrupt latency before it runs again.]
17. 17/24
Evaluation scenario for worst-case
FOREGROUND - evaluate the scheduling latency of an urgent task (cyclictest, http://rt.wiki.kernel.org):
# Evaluate latency of 1 user-space thread with static priority 99
# ps -eo comm,pid,tid,class,rtprio,wchan:35 | grep 99 | awk '{print $2}'
time ./cyclictest ( -a 0 ) -t1 -p 99 -i 1000 -n -l 1000000

BACKGROUND - stress conditions:
# Create 50 threads as background tasks
time ./cyclictest -t50 -p 80 -i 10000 -n -l 100000

# Maximize I/O load
cd /opt
tar cvzf test1.tgz ./linux-2.6.X &
tar cvzf test2.tgz ./linux-2.6.X &
tar cvzf test3.tgz ./linux-2.6.X &
tar cvzf test4.tgz ./linux-2.6.X &

# Maximize CPU load
/bin/ping -l 100000 -q -s 10 -f localhost &
/bin/ping -l 100000 -q -s 10 -f localhost &
/bin/ping -l 100000 -q -s 10 -f localhost &
/bin/ping -l 100000 -q -s 10 -f localhost &
/bin/ping -l 100000 -q -s 10 -f localhost &

# Calculate disk usage for combined CPU & I/O load
/bin/du / &

# Get the highest CPU stress with Ingo Molnar's dohell
#!/bin/sh
while true; do /bin/dd if=/dev/zero of=bigfile bs=1024000 count=1024; done &
while true; do /usr/bin/killall hackbench; sleep 5; done &
while true; do /sbin/hackbench 20; done &
( cd ./ltp-full-20120401; while true; do ./runalltests.sh -x 40; done & )
18. 18/24
Evaluation on a CPU-affinity-based system (1/2)
• Test scenario: the foreground task is pinned to CPU0; background stress is pinned to CPU1-3.
• Test environment: Intel Q9400, Linux 2.6.32
• Test utilities: LTP-FULL-20120401, cyclictest from the rt-tests package
• Load-balancer setting: Warm zone (high spot) policy
The scheduling latency of our test thread is reduced by more than a factor of three: from 53 µs to 16 µs on average.
19. 19/24
Evaluation on a CPU-non-affinity-based system (2/2)
• Test scenario: the foreground task is pinned to CPU0; background stress has no affinity.
• Test environment: Intel Q9400, Linux 2.6.32
• Test utilities: LTP-FULL-20120401, cyclictest from the rt-tests package
• Load-balancer setting: Warm zone (high spot) policy
The scheduling latency of our test thread is reduced by more than a factor of two: from 72 µs to 31 µs on average.
21. 21/24
Evaluation – Migration handling of a single-threaded application
• Test environment: Android device, Linux 2.6.32
• Test scenario: scheduling of a CPU-intensive, single-threaded application
• Test example: tar xvf *** ./
• System interface: /proc/sys/kernel/balance_one_threaded_app (ON=1, OFF=0)
[Figure: Before - the single-threaded task (84-97% CPU usage) bounces among CPU0-CPU3 over time while the other cores sit idle. After - the task stays on CPU0 at about 92% usage while CPU1-CPU3 remain idle.]
22. 22/24
Further work
• If guaranteeing deadlines under worst-case conditions is critical for a real-time system, this approach has a technical limitation: it cannot bound the maximum latency of running tasks at all times.
• We need to find a better method, such as a hybrid design mixing our technique with the physical CPU-shielding technique.
• To achieve low power consumption on mobile devices, further experimental research is needed to design an ideal algorithm for migrating vital tasks as CPUs go on-line and off-line.
• We have to evaluate various scenarios, covering direct cost, indirect cost, and latency cost, to improve our load-balancer into a next-generation SMP scheduler.
23. 23/24
Conclusion
• No modification of user-space is needed, because this approach is implemented entirely inside the operating system.
• Our design reduces the non-preemptible intervals that the double-locking required for task migration among CPUs always generates.
• Our approach suppresses the "task migration" kernel thread, which executes inefficient CPU instructions to move a task to another CPU.
• Our idea aggressively reduces the cost of CPU cache invalidation and of the synchronization caused by updates to the local cache.