1. Energy Conservation and Thermal
Management in High-Performance
Server Architectures
Adam Lewis
The Center for Advanced Computer Studies
The University of Louisiana at Lafayette
Tuesday, March 15, 2011
2. Agenda
• Background and Related Work
• System Modeling
• Effective Prediction
• Initial Evaluation and Results
• Thermally-Aware Scheduling
• Status, Plans, and Summary
3. What does this picture tell us?
(c) The New York Times, June 14, 2006
Source: McKinsey & Company 2008 Source: EPA 2008
A 20% projected increase Only ~50%
in data center of power consumed
emissions over next 5 years from IT equipment
4. Current Practice
Linux: Completely Fair Scheduler
• Domain-based load balancing
• Power-state aware run-queue scheduling
Solaris 11
• Domain-based load balancing
• Power-state aware run-queue scheduling
• Interface w/ power manager?
5. Thread Scheduling & Power Management
DVFS (e.g., Intel SpeedStep): P = CV²f
• Performance issues [LLBL 2007]: lack of slack; high load = no gain
• Reliability issues [Bircher 2008]: under-clocking & MTBF
• Reactive rather than proactive
Multi-core/Many-core:
• Cache affinity
• Load balancing
• Opportunity to turn off the lights?
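As a quick illustration of the dynamic-power equation P = CV²f above, a toy sketch (the function name and all numeric values are mine, chosen only for illustration):

```python
def dynamic_power(c, v, f):
    """Dynamic power of CMOS switching: P = C * V^2 * f.
    c: effective switched capacitance (F), v: supply voltage (V), f: clock (Hz)."""
    return c * v ** 2 * f

# Scaling voltage and frequency together (as DVFS does) cuts power roughly
# cubically: halving both V and f yields 1/8 of the original dynamic power.
p_full = dynamic_power(1e-9, 1.2, 2e9)   # 2.88 W at full V and f
p_half = dynamic_power(1e-9, 0.6, 1e9)   # 0.36 W at half V and f
```

This is why DVFS can save energy when slack exists, and why the slide notes "high load = no gain": at full utilization there is no slack in which to lower f.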
6. Proactively Avoid Thermal Emergencies
A Full-System Energy Model + Effective Prediction + Thermally-Aware Scheduling
• Possible approaches
• Heat-and-Run and related approaches
[Gomaa2004] [Coskun2009] [Zhou2010]
• Memory-resource focused approaches
[Merkel2010]
• Control-theoretic techniques
8. Model: Inputs & Components
E_system = E_proc + E_mem + E_hdd + E_board + E_em, with E_dc = E_system
• Processor
• Memory
• Hard disk & storage devices
• Motherboard & peripherals
• Electrical & electromechanical components
• Three DC voltage domains: 12Vdc, 5.5Vdc, 3.3Vdc
• 5.5V and 3.3V domains limited to 20% of rated voltage
9. Model: Processor
E_proc = ∫_{t1}^{t2} P_proc(t) dt
[Block diagrams: board-level power consumers in the AMD Opteron and Intel Nehalem (cores, caches, memory controllers, HyperTransport/QPI buses, SouthBridge, USB, VGA, HDD, DVD, Ethernet, graphics, PCI Express)]
• Bus transactions: reflect the amount of data processed
• Die temperature
• Computation per core: power = f(workload), manifests as heat
• Processor as a black box: processor system metrics
10. Model: Memory
E_mem = ∫_{t1}^{t2} [ ( Σ_{i=1}^{N} CM_i(t) + DB(t) ) × P_DR + P_ab ] dt
• DRAM read/write power + background power = known quantities
• Performance counters exist for measuring the count of highest-level cache misses and bus transactions
• Combine these to compute the energy consumed
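The memory model above can be discretized over counter samples. A minimal sketch, assuming per-interval counter reads, where I treat P_DR as an energy cost per counted event and P_ab as the constant background power (the function and parameter names are mine):

```python
def memory_energy(samples, p_dr, p_ab, dt):
    """Discretized E_mem = sum over intervals of [(sum_i CM_i + DB) * P_DR + P_ab * dt].
    samples: list of (per_core_cache_misses, bus_transactions) per interval;
    p_dr: energy per DRAM read/write event (J), assumed known from the datasheet;
    p_ab: background power (W); dt: sampling interval (s). Returns joules."""
    energy = 0.0
    for misses, bus in samples:
        events = sum(misses) + bus            # sum_i CM_i + DB for this interval
        energy += events * p_dr + p_ab * dt   # event energy + background energy
    return energy
```

In practice the `samples` would come from the hardware performance counters named on the slide (highest-level cache misses and bus transactions).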
11. Model: Storage
E_hdd = P_spin-up × T_su + P_read × N_r × T_r + P_write × N_w × T_w + P_idle × T_id
Parameter Value
Interface Serial ATA
Capacity 250 GB
Rotational speed 7200 rpm
Power (spin up) 5.25 W (max)
Power (Random read, write) 9.4 W (typical)
Power (Silent read, write) 7 W (typical)
Power (idle) 5 W (typical)
Power (low RPM idle) 2.3 W (typical for 4500 RPM)
Power (standby) 0.8 W (typical)
Power (sleep) 0.6 W (typical)
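A minimal sketch of evaluating E_hdd with the typical powers from the table above (the function name, default arguments, and the sample durations in the usage line are mine):

```python
def hdd_energy(t_spinup, n_reads, t_read, n_writes, t_write, t_idle,
               p_spinup=5.25, p_read=9.4, p_write=9.4, p_idle=5.0):
    """E_hdd = P_spin-up*T_su + P_read*N_r*T_r + P_write*N_w*T_w + P_idle*T_id.
    Powers (W) default to the drive's datasheet values from the table;
    times are in seconds, N_r/N_w are operation counts. Returns joules."""
    return (p_spinup * t_spinup + p_read * n_reads * t_read
            + p_write * n_writes * t_write + p_idle * t_idle)

# e.g., one 2 s spin-up, 10 reads and 5 writes of 10 ms each, 60 s idle
e = hdd_energy(t_spinup=2, n_reads=10, t_read=0.01,
               n_writes=5, t_write=0.01, t_idle=60)
```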
12. Model: Board
E_board = V_power-line × I_power-line × t_interval
• System components that support the operation of the machine
• Typically in the 5.5Vdc and 3.3Vdc power domains
• Measured by current probe
13. Model: Electromechanical
E_em = ∫_0^{T_p} [ V(t) · I(t) + Σ_{i=1}^{N} P_fan,i(t) ] dt
• Need to account for the energy required to cool the system
• No performance counters exist for these components
• Can measure power drawn by the fans
• Derived from log data collected by the OS
15. Linear AR Time Series: A good idea?
• Linear regression: easy, simple
• Odd mis-predictions
• Corrective methods required

Linear AR Model: AMD Opteron server
(from Table III: model errors for CAP, AR(1), and MARS on an AMD Opteron server)
Benchmark   Avg Err %   Max Err %   RMSE
astar       3.1%        8.9%        2.26
gamess      2.2%        9.3%        2.06
gobmk       1.7%        9.0%        2.30
zeusmp      2.8%        8.1%        2.14

Linear AR Model: Intel Nehalem server
(from Table IV: model errors for AR on an Intel Nehalem server)
Benchmark   Avg Err %   Max Err %   RMSE
astar       5.9%        28.5%       4.94
gamess      5.6%        44.3%       5.54
gobmk       5.3%        27.8%       4.83
zeusmp      7.7%        31.8%       7.24
16. Prediction w/ Chaotic Time Series
Chaotic Time Series
• Time-delay reconstructed state space
• Uses Takens' Embedding Theorem: a time-delayed partition of observations builds a function that preserves the topological and dynamical properties of our original chaotic system
• Find nearest neighbors on the attractor to our observations
• Perform a least-squares curve fit to find a polynomial that approximates the attractor

Table 4: Indications of chaotic behavior in power time series (AMD, Intel)
Benchmark   Hurst Parameter (H)   Average Lyapunov Exponent
bzip2       (0.96, 0.93)          (0.28, 0.35)
cactusADM   (0.95, 0.97)          (0.01, 0.04)
gromacs     (0.94, 0.95)          (0.02, 0.03)
leslie3d    (0.93, 0.94)          (0.05, 0.11)
omnetpp     (0.96, 0.97)          (0.05, 0.06)
perlbench   (0.98, 0.95)          (0.06, 0.04)

The Lyapunov exponent can be calculated using:
λ = lim_{N→∞} (1/N) Σ_{n=0}^{N−1} ln|f′(X_n)|
We found a positive Lyapunov exponent when performing this calculation on our data set, ranging from 0.01 to 0.28 (or 0.03 to 0.35) on the AMD (or Intel) test server, as listed in Table 4, where each pair indicates the (AMD, Intel) values.
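The time-delay reconstruction and nearest-neighbor lookup described above can be sketched as follows. This is a simplified stand-in for the idea, not the authors' MATLAB code; the function names, the embedding dimension `p`, and the delay `tau` are my choices:

```python
import numpy as np

def delay_embed(x, p, tau=1):
    """Time-delay reconstruction of the state space (Takens' theorem):
    row t is the delay vector [x_t, x_{t-tau}, ..., x_{t-(p-1)tau}]."""
    n = len(x) - (p - 1) * tau
    return np.column_stack([x[(p - 1) * tau - j * tau : len(x) - j * tau]
                            for j in range(p)])

def nn_predict(x, p=3, tau=1):
    """Forecast the next sample: embed the series, find the nearest earlier
    point on the reconstructed attractor to the latest delay vector, and
    return the sample that followed that neighbor."""
    x = np.asarray(x, float)
    X = delay_embed(x, p, tau)
    query, history = X[-1], X[:-1]                 # latest vector vs. the past
    k = int(np.argmin(np.linalg.norm(history - query, axis=1)))
    return x[(p - 1) * tau + k + 1]                # sample after the neighbor
```

The CAP predictor refines this nearest-neighbor step with a least-squares polynomial fit around the neighbors, as the next slides describe.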
18. Forward prediction
• Start with a Taylor series expansion:
  f̂(X) = f̂(x) + f̂′(x)ᵀ(X − x)
• Find the coefficients of the polynomial by solving the linear least-squares problem for a and b:
  Σ_{t=p+1}^{n+p} [X_t − a − bᵀ(X_{t−1} − x)]² · K_β(X_{t−1} − x)
• Explicit solution for our linear least-squares problem:
  f̂(x) = (1/n) Σ_{t=p+1}^{n+p} (s₂ − s₁(x − X_{t−1}))² · K_β((x − X_{t−1})/β)
  where s_i = (1/n) Σ_{t=p+1}^{n+p} (x − X_{t−1})^i · K_β((x − X_{t−1})/β)
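The weighted least-squares fit above is a kernel-weighted local linear regression. A sketch for the scalar case, assuming a Gaussian kernel K_β and treating the bandwidth β as given (this is my simplified illustration, not the authors' MATLAB implementation):

```python
import numpy as np

def local_linear_predict(series, x, beta=1.0):
    """One-step forecast by fitting X_t ≈ a + b*(X_{t-1} - x) with kernel
    weights K_beta centred at x, then returning f̂(x) = a.
    series: observed samples; x: point on the attractor to predict from."""
    X_prev = np.asarray(series[:-1], float)     # X_{t-1}
    X_next = np.asarray(series[1:], float)      # X_t
    u = X_prev - x
    w = np.exp(-0.5 * (u / beta) ** 2)          # Gaussian kernel weights
    # weighted least squares for [a, b]: minimise sum w*(X_t - a - b*u)^2
    A = np.column_stack([np.ones_like(u), u])
    sw = np.sqrt(w)
    a, b = np.linalg.lstsq(A * sw[:, None], X_next * sw, rcond=None)[0]
    return a                                    # prediction at X_{t-1} = x
```

Because the weights decay with distance from x, only the nearest neighbors on the attractor effectively contribute to the fit, matching the nearest-neighbor step on the previous slide.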
19. Time Complexity
n future observations, p past observations
Creating a CAP: O(n²)
Predicting with a CAP: O(p)
21. Initial Evaluation and Results
Benchmarks were selected using two criteria: sufficient coverage of the functional units in the processor and reasonable applicability to the problem space. Components of the processor affect the thermal envelope in different ways [40]; this is addressed by balancing the selection between integer and floating-point benchmarks from the SPEC CPU2006 suite.

Table 6: SPEC CPU2006 benchmarks used for model calibration and evaluation
Training Benchmarks (Integer)
bzip2      C        Compression
mcf        C        Combinatorial Optimization
omnetpp    C++      Discrete Event Simulation
Training Benchmarks (FP)
gromacs    C/F90    Biochemistry/Molecular Dynamics
cactusADM  C/F90    Physics/General Relativity
leslie3d   F90      Fluid Dynamics
lbm        C        Fluid Dynamics
Evaluation Benchmarks (Integer)
astar      C++      Path Finding
gobmk      C        Artificial Intelligence: Go
Evaluation Benchmarks (FP)
calculix   C++/F90  Structural Mechanics
zeusmp     F90      Computational Fluid Dynamics

Table 7: Test hardware configuration
                 Sun Fire 2200      Dell PowerEdge R610
CPU              2 AMD Opteron      2 Intel Xeon (Nehalem) 5500
CPU L2 cache     2x2MB              4MB
Memory           8GB                9GB
Internal disk    2060GB             500GB
Network          2x1000Mbps        1x1000Mbps
Video            On-board           NVIDIA Quadro FX4600
Height           1 rack unit        1 rack unit

The purpose of the first experiment was to confirm the time complexity of CAP. Its behavior was simulated using MATLAB on the hardware described in Table 7 with varying values of n future observations and p past observations. Fig. 6 illustrates the behavior of CAP as the value of n is varied and confirms the O(n²) behavior of the predictor; Fig. 7 shows the behavior as p is varied and supports the claim of linear behavior.
22. Results: AMD Opteron f10h
[Fig. 8: Actual power versus predicted power for the AMD Opteron; panels: (a) Astar/CAP, (b) Astar/AR(1), (c) Zeusmp/CAP, (d) Zeusmp/AR(1)]
23. Results: Intel Nehalem
[Fig. 9: Actual power versus predicted power for an Intel Nehalem server; panels: (a) Astar/CAP, (b) Astar/AR(1), (c) Zeusmp/CAP, (d) Zeusmp/AR(1)]
25. Observations and Analysis
• Where does maximum error occur?
• Choice of performance counters
• Difference in behavior between
processors?
• The right set of performance counters
• Benchmark selection
27. Problem nature
• Scheduling...
• in time: who runs next
• in space: who runs where
• Optimization problem
• Who runs next: least use of energy with the best performance and quality of service
• Who runs where: best utilization of
resources with least increase in
processor and/or ambient temperature
28. Thermal Extensions to System Model
1. Applications have a length L(A, D_A, t) and generate workload:
   U(A, D_A, t) = lim_{n→k_e} n × W(p_i, d_i, t) × L_n(A_n, D_A_n, t), 1 ≤ i ≤ p
2. For which we define the Thermal Equivalent of an Application:
   Θ_A(A, D_A, T, t) = lim_{T→T_th} U(A, D_A, t) / (J_e × (T − T_nominal))
3. Which is used to generate the Thermal Efficiency to Completion:
   η(A, D_A, T, t) = Θ_A(A, D_A, T, t) / Θ_A(A_e, D_A_e, T_me, L_e)
4. That is used to compute the Cost of Performance per Unit Power:
   C_θ(A, D_A, T, t) = Θ_A(A, D_A, T, t) / E_sys(A, D_A, t)
29. Extending CAP for Thermal Prediction
• Thermal Chaotic Attractor Predictor
(TCAP)
• Extends CAP to thermal domain
• Created and used in similar manner to
CAP
• Matching TCAP for each thermal metric
30. Reducing Processor Temperatures
• Premise: Processor die temperature can be
managed by controlling what threads
execute over time
• Predict the next thread to run on a logical CPU using TCAP for processor die temperature
31. Reducing Ambient Temperature
• Premise: Control system ambient
temperature by managing load on logical
CPUs so that overheated resources have
time to recover
• Partition resources into categories based
on predicted change in temperature
• Move workload from “HOT” resources
towards “COLD” resources
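The HOT/COLD migration policy above can be sketched as follows. This is a simplified model of the idea, not the OpenSolaris prototype; the threshold value, data shapes, and function name are my assumptions:

```python
def partition_and_migrate(predicted_delta_t, load, hot_threshold=2.0):
    """Partition logical CPUs by predicted temperature change (assumed to come
    from TCAP) and move work from HOT resources towards COLD ones.
    predicted_delta_t: cpu -> predicted temperature change (degrees);
    load: cpu -> list of runnable threads (mutated in place).
    Returns the migrations performed as (thread, src_cpu, dst_cpu) tuples."""
    hot = [c for c, dt in predicted_delta_t.items() if dt >= hot_threshold]
    cold = sorted((c for c in predicted_delta_t if c not in hot),
                  key=lambda c: predicted_delta_t[c])        # coldest first
    migrations = []
    for src in hot:
        # drain the HOT cpu down to one thread so it has time to recover
        while len(load[src]) > 1 and cold:
            dst = cold[0]
            thread = load[src].pop()
            load[dst].append(thread)
            migrations.append((thread, src, dst))
            # re-rank COLD cpus so we spread work rather than pile it up
            cold = sorted(cold, key=lambda c: (len(load[c]),
                                               predicted_delta_t[c]))
    return migrations
```

Leaving one thread on each HOT CPU and re-ranking COLD CPUs after every move keeps the migrated load spread out, so no single COLD resource is immediately driven HOT in turn.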
33. Current Status
A Full-System Energy Model + Effective Prediction + Thermally-Aware Scheduling
Energy model & prediction:
• Development complete
• Evaluation complete (Intel + AMD processors)
• Peer-reviewed publications: Conference/Workshop: 3; Journal: 1
Thermally-aware scheduling:
• Design complete
• Prototype under development on OpenSolaris (Solaris 11)
34. Plan for Completion
ID Task
1 Respond to review comments for [Lewis 2011]
2 Implement scheduler prototype in FreeBSD
3 Evaluate scheduler performance using parallel benchmarks
4 Document results and submit to archival journal
5 Create dissertation from Prospectus + output from previous task
6 Defend dissertation
7 Respond to comments from committee and Graduate School editor
8 Submit final version of document
35. Future Directions
• Extend beyond a single blade
• Cluster, Grid, and Cloud Scheduling
• MPI, OpenMP, and other environments
• Impact of operating system virtualization
• Extension of the thermal model in terms of
the thermodynamics of computation
38. This work was supported in part by the U.S.
Department of Energy and by the Louisiana
Board of Regents
39. Publications List
Lewis, A., Ghosh, S., and Tzeng, N.-F. 2008. Run-
time energy consumption estimation based on
workload in server systems. Proceedings of the
2008 conference on Power aware computing and
systems.
Lewis, A., Simon, J., and Tzeng, N.-F. 2010.
Chaotic attractor prediction for server run-time
energy consumption. Proc. of the 2010 Workshop on
Power Aware Computing and Systems (HotPower '10).
Lewis, A., Tzeng, N.-F., and Ghosh, S. 2011. Time
series approximation of run-time energy
consumption based on server workload. Under
review for publication in ACM Transactions on
Architecture and Code Optimization.