Dr. Bruce Worthington presented at the CMG'08 International conference on power management in Windows Server. He discussed how data center electricity usage has risen significantly due to increasing server power demands. Windows Server 2008 and 2008 R2 include improvements to processor performance states (P-states) and idle states (C-states) to better manage power usage with minimal performance impact. Server hardware must also support these power management features for the operating system to effectively control power.
2. CMG‘08 INTERNATIONAL
conference
Server Power Ground Rules
TANSTAAFL (“There Ain’t No Such Thing As A Free Lunch”): Everything is a trade-off
Performance, Power, Functionality, Capacity,
Cost, Reliability, Availability, Manageability,
Maintainability, Usability, Environmental
Impact, Lifetime, Footprint, Security, Morale
Saving Power vs. Power Efficiency
More work at fixed power level, or
Less power at fixed work level
Shifting component power efficiencies
Rising Cost of Ownership
From 2000 to 2006
Computing performance: 25x
Energy efficiency: 8x
US electricity cost: 1.35x
Power per $1K of server: 4x
Server(+) world electricity: >2x
○ >1% of total world production
Datacenters use 2% of all US electricity
Scale: Kilowatts → Megawatts
Idle high-performance servers
50-80% of max power draw
2-sockets ~ 250 W
4-sockets ~ 500 W
8-sockets ~ 1000 W
25 × 15K RPM 2.5” disks + SAN = 3U
~ 300/450 W (idle/active)
10,000 2-socket 1U servers ~ 1-3 MW
Datacenter “container” ~ 0.5 MW
~1500 servers + storage + infrastructure
Datacenter Energy Demand
Data centers are energy intensive facilities
Server racks now designed to carry 25 kW load
Surging demand for data storage
Typical facility ~ 1MW, can be > 20 MW (even 200 MW)
Nationally 1.5% of US Electricity consumption in 2006
○ Doubling every 5 years
Significant data center building boom
Power and cooling constraints in existing facilities
Growing demand for compute cycles
Growing computing performance
Commoditized hardware
Declining cost of computing
15 MW Datacenter Monthly Costs
“Good” (PUE=1.7) Internet-scale datacenter with DAS
Servers: $3,000,000
Infrastructure: $1,800,000
Power: $1,000,000
3 yr server and 15 yr infrastructure amortization
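The slide's breakdown can be reproduced with back-of-the-envelope arithmetic. The electricity rate below (~$0.09/kWh) is an assumption chosen to land near the $1M/month figure; the capital totals are simply what the monthly figures imply under the stated amortization periods:

```python
# Monthly costs for a 15 MW facility (rate is an assumed value, not from the talk).
facility_mw = 15.0
hours_per_month = 730
rate_per_kwh = 0.09                          # assumed $/kWh

power_bill = facility_mw * 1000 * hours_per_month * rate_per_kwh
servers_monthly = 108_000_000 / (3 * 12)     # $3M/mo implies ~$108M of servers over 3 yr
infra_monthly = 324_000_000 / (15 * 12)      # $1.8M/mo implies ~$324M infra over 15 yr

print(round(power_bill), round(servers_monthly), round(infra_monthly))
# 985500 3000000 1800000
```

Note how dominant amortized capital is relative to the power bill itself, which is why PUE and utilization matter so much.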
Environmental Impact
Governments, businesses, and
organizations are trying to reduce the
production of greenhouse gases
New EPA Energy Star mandates for
enterprise server power efficiencies
ACPI Power State Definitions
Performance states (P-states)
Dynamic voltage and frequency scaling
More than linear savings (cubic function)
Throttle states (T-states)
Linear scaling of CPU clock
“Power” states (C-states)
Low-power idle (CPU “sleep”) states
Turn off increasing amounts of silicon in package
System sleep states (S-states)
On, standby, hibernate, off
MS has not encouraged S-state support for servers
○ Changing with the increased focus on power
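The "more than linear savings (cubic function)" claim for P-states above can be sketched numerically. Assuming, as a rough model rather than a measured figure, that dynamic power scales as P ∝ f·V² and voltage scales roughly linearly with frequency, power falls with the cube of frequency:

```python
# Rough DVFS model: dynamic power ~ C * f * V^2, with V scaling ~linearly with f,
# so P ~ f^3. Illustrative only; real CPUs have voltage floors and static leakage.
def relative_power(freq_fraction: float) -> float:
    """Power at a reduced frequency, as a fraction of max power (cubic model)."""
    return freq_fraction ** 3

# Dropping to 80% frequency costs at most ~20% performance under this model,
# but roughly halves dynamic power.
print(relative_power(0.8))   # ~0.512
```

T-states, by contrast, reduce only the duty cycle (linear), which is why the deck treats them as a thermal fallback rather than an efficiency mechanism.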
ACPI Power State State Machine
• For the entire system
○ Global System States (G-States)
○ Sleeping States (S-States)
Standby (S1–S3), Hibernate (S4), …
• For the processor only
Processor Performance States (P-States)
○ Different processor frequency and voltage
Processor Throttling States (T-States)
○ Processor clock throttling to reduce processor utilization (and capacity)
Processor Power States (C-States)
○ Processor is executing instructions (C0)
○ Processor is idle (C1, C2, …)
• For other devices
Device Power States (D-States)
○ Similar to C-States, but for devices other than processors
[Diagram: ACPI global state machine. G0 (S0) Working; G1 Sleeping (S1–S4); G2 (S5) Soft Off; G3 Mechanical Off, with transitions for legacy wake events, power failure / power off, and BIOS routines. Within C0: performance states (Px) and throttling; idle states C1–Cn. D0–D3 device states shown for a modem, HDD, and CDROM.]
ACPI Specification Versions
WS03 complies with ACPI 2.0
WS08 complies with ACPI 3.0
Multiprocessor
○ Dependent (ganged) and independent control
○ Independent control w/ dependent behavior
(may transition or not based on other
processors’ states)
MS has some ideas for ACPI 3.5
ACPI Power State Dependencies
Dependency Domains for ACPI power states (assumes
S0)
Logical processors in the same domain should have the same
C-state, P-state, or T-state
No dependence between a processor’s C-state domain, P-state
domain, or T-state domain
OS control mechanisms based on dependency
relationships
Dependent control: Transitioning one processor to a new state
causes other processor(s) to transition to the same state
Independent control: Transitioning one processor to a new
P-state or T-state does not affect other processors’ power states
Independent control, dependent behavior: Transitioning one processor to a new P-state or T-state may or may not transition other processor(s) to the same state, based on the current state of the other processor(s) that share this relationship
P-States
Windows processor performance states are
enabled by default
Power policy allows flexible use of
performance states
Values for min / max processor speed
Expressed as a percentage of maximum
processor frequency
Windows will round up to the nearest available state
Processor- and workload-dependent impact
E.g., one system configuration was determined to
have insignificant perf impact from capping P-states
at P1, but significant power savings
Power policy will always use DBS (Demand-Based Switching) within the range defined by min / max frequency
Full range or subset of available P-states
Policy may be set to use only one performance
state (min / max / intermediate)
Will not include linear clock throttle states
Example: Processor state power policy
Note: This is the default policy in WS08
Intended to minimize performance hit
State Freq (MHz) % Type
0 2800 100 Performance
1 2520 90 Performance
2 2142 85 Performance
3 1607 75 Performance
4 964 60 Performance
5 482 50 Performance
Maximum Processor State
Minimum Processor State
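The round-up behavior mentioned on this slide can be sketched. The state table is the example policy above; the selection logic is an illustrative guess at the rounding rule, not Windows kernel code:

```python
# Available P-states from the example policy: (frequency MHz, percent of max).
P_STATES = [(2800, 100), (2520, 90), (2142, 85), (1607, 75), (964, 60), (482, 50)]

def state_for_minimum(min_percent: int) -> int:
    """Pick the P-state index for a minimum-processor-state setting,
    rounding *up* to the nearest available state (illustrative sketch)."""
    # Walk from slowest to fastest; the first state meeting the floor wins.
    for index in range(len(P_STATES) - 1, -1, -1):
        if P_STATES[index][1] >= min_percent:
            return index
    return 0

# 70% is not an available state, so it rounds up to 75% (state 3).
print(state_for_minimum(70))  # 3
```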
P-State Policy Settings
Example: Processor state power policy
Using a subset of available states
Can use any contiguous range
Some performance loss (may not be significant) unless P0
included (targets minimal perf hit)
State Freq (MHz) % Type
0 2800 100 Performance
1 2520 90 Performance
2 2142 85 Performance
3 1607 75 Performance
4 964 60 Performance
5 482 50 Performance
Maximum Processor State
Minimum Processor State
Example: Processor state power policy
Locking processor at one state
Any available state may be selected
Some performance loss (may not be significant) unless P0 is
the state chosen (a la High Perf mode)
State Freq (MHz) % Type
0 2800 100 Performance
1 2520 90 Performance
2 2142 85 Performance
3 1607 75 Performance
4 964 60 Performance
5 482 50 Performance
Min & Max Processor State
Linear clock throttle states (T-states)
Compared to P-states, T-states do not save
energy when performing identical workloads
However, throttle states may be useful for
some scenarios (thermal overload)
By default, WS08 uses T-states only if P-
states are unavailable or in case of thermal
overload
No DBS: only the Maximum Processor State
parameter is used
Default use of linear throttle states
Performance is directly affected by throttling
State Freq (MHz) % Type
0 2800 100 Performance
1 2520 90 Performance
2 2380 85 Performance
3 2100 75 Performance
4 1680 60 Performance
5 1400 50 Performance
6 1400 50 Throttle
7 1120 40 Throttle
8 840 30 Throttle
9 560 20 Throttle
(DBS allowed across the P-states; no DBS across the T-states)
Power Capping / Budgeting
Enforcing per-server power limits (static or dynamic)
Calculations based on “plate rating” are often over-configured
○ Stranded capacity
OS may not be able to respond fast enough to enforce hard limits
when power spikes
Typically lower-power P-states attempted, then T-states engaged
as necessary
○ OS might not get a good estimate of the resulting effective frequency
○ Monitoring applications and diagnostic tools may give incorrect data
○ Opposite strategy from OS, where P-states move towards higher
performance modes when load increases
Potentially huge (and potentially unexpected) hit in performance
right when it is most vital
○ Sudden hardware throttling should be last resort
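The escalation order described here, lower P-states first and T-states only as necessary, can be sketched as a simple control loop. State names, units, and the relax-in-reverse-order behavior are illustrative assumptions, not any vendor's algorithm:

```python
def next_capping_action(power_w, cap_w, p_state, max_p_state, t_state, max_t_state):
    """One step of an illustrative power-cap controller:
    escalate through P-states first, then T-states; relax in reverse order."""
    if power_w > cap_w:
        if p_state < max_p_state:      # deeper P-state = lower frequency/voltage
            return ("p_state", p_state + 1)
        if t_state < max_t_state:      # throttle only once P-states are exhausted
            return ("t_state", t_state + 1)
        return ("at_limit", None)
    # Under the cap: un-throttle before restoring frequency.
    if t_state > 0:
        return ("t_state", t_state - 1)
    if p_state > 0:
        return ("p_state", p_state - 1)
    return ("steady", None)

print(next_capping_action(600, 500, p_state=5, max_p_state=5, t_state=0, max_t_state=4))
# ('t_state', 1) — P-states exhausted, so throttling engages
```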
C-States
Although hardware may support more than
3 C-states, Windows only utilizes a maximum
of 3. But that doesn’t mean Windows only
uses the first three hardware C-states:
C1 = hardware C1
C2 = hardware C?
○ Lowest-power consuming c-state with _CST of type 2
C3 = hardware Cn
Wouldn’t expect P-state to affect C-state
power, but it does on some processors
WS08R2 handles this by providing the capability to
drop to Pn before transitioning to C-state
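The mapping described above, where logical C2 is the lowest-power hardware idle state of ACPI type 2, can be sketched from a _CST-style table. The tuples below are made-up example data, not a real processor's _CST:

```python
# Hypothetical _CST entries: (hardware C-state name, ACPI type, typical power in mW).
CST = [("C1", 1, 1000), ("C2", 2, 500), ("C3", 2, 350), ("C4", 3, 100)]

def logical_c2(cst):
    """Sketch of the mapping: logical C2 = lowest-power entry of ACPI type 2."""
    type2 = [entry for entry in cst if entry[1] == 2]
    return min(type2, key=lambda entry: entry[2])[0] if type2 else None

print(logical_c2(CST))  # C3: deeper than hardware C2, but still ACPI type 2
```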
Processor Power Management - 1
CPUs have increasing number and
ranges of P-states and C-states
Ballpark expectations per socket:
A few watts per P-state
Tens of watts for lowest C-state(s)
Varying impact to server throughput and
responsiveness
Mature, reliable technology
Significant deployments in mobile and
desktops
Processor Power Management - 2
No user intervention required
Managed by the operating system
Balances power savings with CPU
utilization
Kernel selects target P-state based on
processor utilization history, Windows power
policies, thread scheduler, system heuristics,
node/socket/HW thread hierarchy
Transition processor to “sleep” C-states when
idle (i.e., no thread to run on that processor)
Processor Power Management - 3
Windows’ power policy includes various
parameters that influence how the
kernel chooses target power states
Low voltage/power processors must be
evaluated and targeted for the right
scenarios
Reduces OS power management flexibility
Additional servers are required if the
workload is CPU-bottlenecked
Hardware Support
The correctness of all PPM tools and settings
relies on accurate hardware / firmware support
Broken BIOSes found in some previous-generation
servers
Reporting
○ Initialization of ACPI tables (e.g., power states, memory and I/O
controller locations)
○ P-state and C-state monitoring
Controlling
○ PPM algorithm depends on correct historical information
○ HW should comply/cooperate with OS power state requests
Processor Power Management: Working together with OEMs/IHVs - 1
Hardware must support PPM capabilities
ACPI namespace must describe capabilities and contain
processor objects
On a processor there may be multiple independently-
managed power planes, potentially shared between
components, such as:
Cores, Caches, Memory Controllers, and Bus/Serial interface(s)
to other processors or IO components
The performance impacts of turning off various pieces of silicon
must be carefully weighed and understood
○ Snooping caches must be flushed before being shut down
○ Memory or IO channels attached to a package must still be
accessible by other packages
○ Bus/Serial interfaces must be running for active caches, memory,
or IO
○ Different components have different power-up delays from the
various power states they support
Collaborative Power Budgeting
Ideal WS08R2 strategy
Platform guarantees operation within the
allocated budget (HW Fail-safe)
OS scales power/perf according to workload
and respects platform notifications
New R2 Beta option: OS specifies target
utilization and HW selects P-states accordingly
Otherwise, if the OS and HW are fighting for
power management control, both power and
performance will suffer
Hardware-directed power control settings are on by
default in some BIOSes
Servers Defaulting to Hardware-Controlled Power Mgmt
Hardware-directed power control settings are on by
default in some BIOSes
Platform alters P-states, C-states, T-states, and/or D-
states without OS information
○ One alternative is to have platform dynamically restrict the
available states and update the OS via ACPI (<= 2 Hz)
May take over processor performance counters!
○ Obviously this is a big concern when using performance
monitoring tools that utilize the on-CPU counters
Component Power Metering
• Only a small set of server models provide the
functionality of component power reporting
• Extra HW instrumentation (or fragile probing) is
needed to monitor the component power usages
for most platforms
• Simplest alternative is to populate and then
take away any removable components and track
the overall system power delta
Example Component Power Distribution #3
CPU (2): 46% | PCI Cards (3): 17% | SCSI HDD (4): 12% | Mobo + 8 GB RAM: 18% | Other: 7%
Processor power management represents the best opportunity today
Source: Intel Server Products Power Budget Analysis Tool
http://www.intel.com/support/motherboards/server/sb/cs-016976.htm
Selecting Memory Components
Lots of permutations for a given capacity
Family (e.g., DDR#)
○ FB DIMMs draw more power
DIMM count
○ Especially for FB, where bus may decrease frequency if
enough DIMMs
Bus frequencies
Ranks
Density
Data width
Channel count
Low power memory must be evaluated and targeted
for the right scenarios
Additional servers are required if the workload is memory-
bottlenecked
Memory Power Savings
Select the right type and number of DIMMs for
the workload
Reduce memory accesses
Overall
○ Smaller working set
○ Better cache hit ratios
○ Probably better performance, too
More memory power states
Compare server memory idle characteristics to
mobile memory
Deeper self-refresh states
○ Takes memory longer to come out of deeper states
“Green Memory”
Tech Marking Datarate Capacity Density DQ Ranks Power/DIMM
DDR2 PC2-5300 667Mhz 1GB 256Mb x4 DR 18.1W
DDR2 PC2-5300 667Mhz 1GB 256Mb x8 QR 18.6W
DDR2 PC2-5300 667Mhz 1GB 512Mb x4 SR 7.6W
DDR2 PC2-5300 667Mhz 1GB 512Mb x8 DR 7.8W
DDR2 ECC 667Mhz 1GB 1Gb x16 DR 6.1W
DDR2 No ECC 667Mhz 1GB 1Gb x16 DR 5.5W
No "by 16" part with 4Gb density
DDR2 PC2-5300 667Mhz 4GB 1Gb x4 DR 14.0W
DDR2 PC2-5300 667Mhz 4GB 1Gb x8 QR 14.4W
DDR2 PC2-5300 667Mhz 4GB 2Gb x4 SR 8.6W
DDR2 PC2-5300 667Mhz 4GB 2Gb x8 DR 8.8W
Networking Power
NIC idle power (examples)
100 Mb 1 W
1Gb 5 W
Quad 1Gb 5-9 W
10Gb 10-15 W
Quad 10Gb 17 W
Don’t forget network switch power
Windows Networking Optimizations
NDIS DPC timer period
Wake-on-LAN (see content in WinHEC 2008)
Low Power on Network Disconnect
Hard Disk Power
Decreasing radius
Cubic power relationship (Power ∝ Radius³)
3.5” 15K RPM drive = ~12/18 W (idle/active)
2.5” 15K RPM drive = ~6/9 W (idle/active)
Decreasing rotational speed
Quintic power relationship (Power ∝ RPM⁵)
15K RPM = 2 ms avg rotational delay (serial workload)
10K RPM = 3 ms avg (~3-4 W idle)
7.2K RPM = 4 ms avg (may have slower seek as well)
Frequently spinning down enterprise drives not
advisable (yet)
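The scaling relationships above can be sanity-checked numerically. The exponents are the slide's, applied here at face value; the cubic radius model captures platter drag only, so motor and electronics power keep the real figures somewhat higher:

```python
# Scaling sanity checks using the slide's exponents.
def scaled_power(base_watts, base_size, new_size, exponent=3):
    """Scale power by a geometric ratio raised to the given exponent."""
    return base_watts * (new_size / base_size) ** exponent

def avg_rotational_delay_ms(rpm):
    """Average rotational latency = time for half a revolution."""
    return 0.5 * 60_000 / rpm

# 3.5" 15K drive at ~18 W active -> predicted 2.5" active power:
print(round(scaled_power(18, 3.5, 2.5), 1))        # 6.6 (ballpark of the quoted ~9 W)
print(avg_rotational_delay_ms(15_000))             # 2.0 ms
print(round(avg_rotational_delay_ms(7_200), 1))    # 4.2 ms
```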
Storage Controller Power
HBA / storage connection interface
E.g., PCI-X and PCI-e cards 5-8W idle
Array Controller
E.g., small SAN ctlr (2U) = 200/300 W
(idle/active in direct attached mode)
Disk Interface
SCSI: 80 / 160 / 320 MB/s
FC: 1 / 2 / 4 / 8 Gb/s
SAS/SATA: 1.5 / 3.0 / 6.0 Gb/s
PCI-Express Power Management
Support for Active State Power
Management (ASPM)
a.k.a. Link State Power Management
In-box power policy for ASPM state
Requires OS control of PCI Express
features
Available white paper
Power Supply Efficiency
Power Factor: phase delta between input
voltage and current
Active Power Factor Correction (PFC)
Efficiency: ratio of output to input power (AC→DC conversion)
Entropy means 100% efficiency is unobtainable
Default supplies at 70%; new models up to 85%
Previous power supplies were often
optimized for high workload levels, but
most servers run at 5-20% of capacity (for
now)
Decreases power without decreasing perf
Power Supply Efficiency
“80 Plus”
Requirement for Energy Star (July ‘08)
80% minimum efficiency at 20%, 50%,
and 100% of rated output
Previous power supplies often optimized for
high loads, but most servers run at 5-20%
Minimum power factor of 0.9 or greater at
100% of rated output
Decrease power without decreasing perf
Power Supply Waste Power

Efficiency | Output Power | Required Input Power | Waste Power | Waste Power Cost per Annum
70% (default) | 500 W | 714 W | 214 W | $183.15
80% (near 80 Plus Bronze) | 500 W | 625 W | 125 W | $106.98
85% (80 Plus Silver) | 500 W | 588 W | 88 W | $75.31
90% (above 80 Plus Gold) | 500 W | 555 W | 55 W | $47.07
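The table's arithmetic can be reproduced directly. The implied electricity rate (~$0.098/kWh) is back-calculated from the table's dollar figures, not stated on the slide:

```python
def waste_power_cost(output_w, efficiency, rate_per_kwh=0.0977):
    """Input power, waste power, and annual waste cost for a power supply.
    The default rate is back-calculated from the slide's figures (an assumption)."""
    input_w = output_w / efficiency
    waste_w = input_w - output_w
    annual_cost = waste_w / 1000 * 8760 * rate_per_kwh  # 8760 hours per year
    return round(input_w), round(waste_w), round(annual_cost, 2)

print(waste_power_cost(500, 0.70))  # (714, 214, 183.4): matches the 70% row
```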
Fan Power
Fans in some 1U servers consume 15-
20% of overall system power
Fixed vs. variable-speed fans
Decrease power without decreasing perf
Windows Server 2003
ACPI 2.0 compliant
Windows processor driver required for
specific CPU make/model
Requires selecting appropriate power policy
Each system power policy includes a
processor throttling policy
Highest (default), lowest, or full range of P-states
OEMs or server administrators may create
additional power plans
Windows Server 2008 - 1
ACPI 2.0 and 3.0 compliant
Native OS support for PPM on
multiprocessor systems
Default power settings refined for each
release (including WS08R2)
Windows Server 2008 & SP2
Simplified configuration model
Group Policy over power settings
Power management enabled by default
(“Balanced Mode”)
Windows Server 2008 - 2
T-states used only when no P-states
available
Power management parameterization for
improved flexibility of P- and T-state
algorithms
Additional tunings available for OEMs to
customize to processor, chipset, platform, role,
etc.
Improved C3 support
Very hard to generalize, but 2-10%
improvement in power efficiency observed
at mid-to-low utilization levels (vs. 2003)
Processor Power Management: Windows Server Releases
Fully supported by WS03, WS08, and WS08R2
Feature parity with Windows client
operating systems
For example, WS08 has full support for:
○ ACPI 2.0, 3.0 processor objects, Notify()
events
○ Power policy for tuning Operating System
(OS) target state algorithms
○ Deep idle C-states
Measuring Power
Few existing Windows servers are equipped
with comprehensive power metering
capabilities
In the future, servers are likely to have onboard
power meters
○ AC power (into the power supply)
○ DC power (out of the power supply)
○ For individual components (CPU, RAM, IO, fans, disks,
…)
The Windows Server Performance team has
resorted to two strategies:
Metering at the wall (AC)
Directly probing specially manufactured server
motherboards (solder and data acquisition)
Measuring Power Efficiency
Which Watts/Amps to measure?
Total server (wall)
power
External power
Network switches/hubs
Storage (disks, array
controllers, SANs)
Power distribution and
conditioning
HVAC
Internal component power
Processor package
○ Threads, cores, caches, memory
controllers, cross-package
interconnect controllers, IO
controllers (e.g., PCI-E)
Memory (controllers, DIMMs, ranks,
banks)
Chipsets (north bridge, south bridge,
IO controllers)
Power supplies
○ AC in, multiple DC out
○ Redundant (active/active,
active/passive)
IO (network, storage, video, USB)
○ Embedded components and
expansion cards
Fans and other internal misc.
Measuring Power Efficiency
Traditional performance benchmarks optimize for
high throughput or low response time by using all
resources
The load line approach tracks power use as load
varies
Pick a power point and see how much load can be
handled
Pick a load point and see how much power is required
Workload breadth
Database, web server, file server, etc.
MS uses SPECpower (a la SPECjbb) and is adding
customer-accepted performance benchmarks
TPC-C/E/H, SpecWeb, NetBench, SAP, SPEC, …
Semi-internal: FSCT, LCW2, Web Fundamentals,
TermSrv, PerfGates, …
Measuring Power Efficiency
Which workloads to test?
Workload breadth
Database, web server, file server, etc.
○ Need to prioritize based on potential for power savings and for
broadest customer coverage
Each has unique “work accomplished” metrics (e.g., ops per
second)
Industry standard workloads, such as SPEC and TPC
Custom workloads designed to test power scenarios
Microsoft is currently using SPECpower and customer-
accepted performance benchmarks to convey power
efficiency
TPC-C/E/H, SpecWeb, NetBench, SAP, SPEC, …
Semi-internal: FSCT, LCW2, Web Fundamentals, TermSrv,
PerfGates, …
Industry Standard Workloads
SPEC
SPECpower is the only standardized benchmark at this point
○ Single workload defined to date
Order processing for a wholesale supplier running typical Java business applications
Basically SPECjbb with some changes
Minimal I/O and kernel time
○ Other SPEC benchmarks could have a “power” version, and each one may
or may not be modified from the “perf” version
TPC
Could add a power metric to each of their existing benchmarks, but
details are still being worked out
○ What is server power vs. storage power?
○ What needs to be installed in the audited server?
I suspect they will stick to the same approach used for pricing, in that the system has to
be available as a purchasable product
What about the “price” of power?
○ Etc.
Measuring Power Efficiency
Windows Server Performance Lab
A methodology for obtaining power load line data for TPC-C, TPC-E, FSCT, and Web Fundamentals has been demonstrated
Benchmark loads varied by throttling number of active
users
Multiple workloads tested in Hyper-V environment
SPECpower has been successfully tuned
Data has been gathered on 2-, 4-, and 8-socket
systems with various processors
Wall-socket power measurements
Component power measurement by brute force (device
extraction)
Varying Load Levels
Iteration | SPECpower (Reduce load) | TPC-E (Reduce users) | FSCT (Increase users)
1 100% load 100% of max users 0 users
2 90% load ~90% of max users 10% of max users
3 80% load ~80% of max users 20% of max users
4 70% load ~70% of max users 30% of max users
5 60% load ~60% of max users 40% of max users
6 50% load ~50% of max users 50% of max users
7 40% load ~40% of max users 60% of max users
8 30% load ~30% of max users 70% of max users
9 20% load ~20% of max users 80% of max users
10 10% load ~10% of max users 90% of max users
11 0% load 0 users 100% of max users
Similar strategy used for Web Fundamentals
HW and SW Test Configurations
Sample platforms
2-socket and 4-socket quad-core
8-socket dual-core
x64 (AMD and Intel); ia64
Hardware- and software-controlled power
management modes
WS03, WS08, WS08SP2 (prerelease), and WS08R2
(prerelease)
Windows power schemes
Balanced, Higher Performance, Power Saver, …
P-State settings and heuristics
C-State settings and heuristics
Parameterized power management optimizations
○ E.g., core parking, tick skipping
SPECpower: WS03 and WS08
[Chart: SPECpower power (% of max watts) vs. workload (% of max ssj_ops) for W2K3.SP1, W2K8.RTM, and W2K8.SP2 on a 2-socket, 8-core system]
SPECpower & FSCT: WS03 and WS08
[Chart: SPECpower throughput and power at different workload levels on a 4-socket quad-core system; power (% of maximum watts) vs. workload (% of maximum throughput), Windows Server 2003 vs. Windows Server 2008]
[Chart: FSCT throughput and power at different workload levels on a 2-socket dual-core system; same axes and OS comparison]
TPC-E: WS03 and WS08
[Chart: TPC-E power usage (watts, % of maximum) at varying workload levels (% of maximum tpsE), Windows Server 2003 vs. Windows Server 2008]
[Chart: TPC-E power efficiency (tpsE/Watt, % of maximum) at varying workload levels, Windows Server 2003 vs. Windows Server 2008]
OOB Windows Server 2008
[Chart: SPECpower throughput (ssj_ops) and power at varying workload levels; ssj_ops per Watt and power (% of maximum) vs. workload (% of max ssj_ops)]
[Chart: Processor utilization and frequency as the SPECpower workload decreases over time; average processor utilization and processor frequency (% of maximum) over roughly 70 minutes]
TPC-E: Windows Server 2008
[Chart: Cumulative distribution of P-states (P0–P4) and C1 as the TPC-E workload decreases over time, 30–140 minutes]
4 quad-core CPUs, 16 GB, RAID-5 array
Server Config | Active Clients | Avg Watts (measured) | kWh / yr | Cost | Kg of CO2 (last three projected)
WS03, IIS6 | 0 | 468 | 4100 | $375 | 3190
WS08, IIS7 | 0 | 457 | 4000 | $357 | 3110
WS03, IIS6 | 20 | 537 | 4700 | $430 | 3660
WS08, IIS7 | 20 | 500 | 4380 | $401 | 3410
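The projections in the table can be reconstructed from average watts. The electricity rate (~$0.0915/kWh) and emission factor (~0.778 kg CO2/kWh) are back-calculated from the table itself, not stated on the slide:

```python
def annual_projection(avg_watts, rate=0.0915, kg_co2_per_kwh=0.778):
    """Project yearly kWh, cost, and CO2 from average power draw.
    Rate and emission factor are back-calculated from the table (assumptions)."""
    kwh = avg_watts * 8760 / 1000          # hours per year
    return round(kwh, -2), round(kwh * rate), round(kwh * kg_co2_per_kwh, -1)

print(annual_projection(468))  # (4100.0, 375, 3190.0): the WS03/IIS6 idle row
```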
Outline
Motivation
Background
Windows Server 2003 → 2008
Windows Server 2008 R2
Server Energy Vision
Idle Power Optimizations
Core Parking
Hyper-V (v2)
Power Metering and Budgeting
SSD
Power Diagnostics and Control
Summary
Windows Server Core Energy Vision
Dynamic Data Center
Coordination across all data center
components to scale infrastructure and
computing according to business needs
Scalable Node: Server power efficiency
Low idle power consumption
Power consumption should scale with load
Dynamic Data Center
Holistic approach spanning all infrastructure, not just the computing nodes
Reducing waste and optimizing
performance
Scaling and migrating workloads
Coordination with power and cooling systems
Watch out for over-eager workload consolidation
or low-power component acquisition
Building platform and management
infrastructure
Dynamic Data Center – The Problem
Addressing energy consumption in the data center requires a holistic approach spanning all infrastructure, not just the computing nodes
Many factors affect how a data center consumes energy
Hardware, workload, time of day/week/year, locality, etc.
Data centers are generally statically configured for peak load
Tremendous opportunities for reducing waste and
optimizing performance exist
Scaling and migrating workloads across groups of machines
Coordination with power and cooling systems
There is also a risk of unexpected reductions in computing capacity through over-eager workload consolidation or low-power component acquisition without proper planning / testing
Dynamic Data Center – The Vision
Enable the management of aggregate
servers in conjunction with data center
infrastructure
Deliver this through building platform
and management infrastructure
Power metering and budgeting
Virtualization and workload migration
Standards-based management technologies
Coordination between in-band and out-of-
band management systems
Scalable Node
Today power consumption does not scale in
line with server utilization
Typical commodity servers consume 50-70% of the
maximum power when completely idle
Basic approaches:
○ Increase server utilization via virtualization
○ Reduce power when full performance not needed
○ Power down / put to sleep excess servers
Work with partners to provide the best power
and performance by managing the system
efficiently
Windows power management improvements
Scalable Node – The Problem
Today power consumption does not scale in
line with server utilization
Typical commodity servers consume 50-70% of
the maximum power when completely idle
○ Idle servers have low efficiency due to high idle power
○ Efficiency rises with utilization due to idle power
amortization
Tremendous opportunities exist for reducing
energy needs
○ Reduce power when full performance is not required
○ Leverage virtualization solutions to increase server
utilization
○ Power down servers when they are not needed
Scalable Node – The Vision
Work with partners to provide the best power
and performance by managing the system
efficiently
Deliver this through improvements to
Windows Power Management
Build on existing infrastructure and extend
Windows value
Enhancements to processor power management
Focus on idle and low-to-medium workload levels
Support for device performance states
Windows Server 2008 R2 - 1
Refined “Balanced Mode” defaults to
optimize power efficiency
Takes advantage of advances in server
platform hardware (e.g., powering down
individual cores or sockets)
Configurable power settings for new
features (e.g., core parking)
P-state and C-state selection algorithms
updated
Increased support for joint OS/HW power
management
Windows Server 2008 R2 - 2
Simplified configuration model
Group Policy control over all power settings
Rich command line interface and refined UI
elements
In-band WMI power metering and
budgeting support
Remote manageability of power policy via
WMI
Additional qualification logo to indicate
enhanced power management support
Windows Device Power Management
Extensible power policy infrastructure
Allows easy incorporation of power
management-enabled devices
○ Device power settings integrate with Windows
system power policy
○ Device power settings can appear in
Advanced power UI
○ Rich notification support
Allows for true OEM power management
innovation and value
Enhanced Power Management Logo
Additional Qualification logo for
“Enhanced Power Management” that
indicates support for the following:
Processor power management through
Windows
Power metering and budgeting
Power On/Off via WS-Management
(SMASH)
Windows Server 2008 P-State Parameters
Balanced Mode Settings | WS08 | R2 Pre-Beta | WS08R2
Time Check | 100 ms | 100 ms | 50 ms
Increase Time | 100 ms | 100 ms | 50 ms
Decrease Time | 300 ms | 100 ms | 50 ms
Increase Percentage | 30% | 70% | 80%
Decrease Percentage | 50% | 30% | 70%
Domain Accounting Policy | 0 (On) | Always Off | Always Off
Increase Policy | IDEAL (0) | IDEAL (0) | SINGLE (1)
Decrease Policy | SINGLE (1) | SINGLE (1) | IDEAL (0)
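These parameters drive a utilization-threshold governor, which can be sketched as follows. This is an illustrative reading of the increase/decrease percentages in the table, not the actual kernel algorithm:

```python
def target_direction(utilization_pct, increase_pct=80, decrease_pct=70):
    """Threshold-governor sketch using the WS08R2 Balanced defaults:
    raise performance above the increase threshold, lower it below the
    decrease threshold, otherwise hold (evaluated every Time Check interval)."""
    if utilization_pct > increase_pct:
        return "increase"
    if utilization_pct < decrease_pct:
        return "decrease"
    return "hold"

print(target_direction(90))   # increase
print(target_direction(75))   # hold
print(target_direction(30))   # decrease
```

Note how the thresholds tightened between WS08 (30%/50%) and WS08R2 (80%/70%): the later defaults stay at lower frequencies longer and react faster (50 ms checks).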
Optimized for Low-to-Medium Loads
Even though 100% utilization may have the
highest power efficiency, few servers run at
full capacity
Servers at maximum utilization provide less
opportunities for power optimizations
In the short term, targeting low utilization
servers will provide most benefit
In medium term, targeting medium
utilization servers will provide increased
benefit
E.g., consolidation and virtualization will increase average utilization levels
Outline
Motivation
Background
Windows Server 2003 → 2008
Windows Server 2008 R2
Server Energy Vision
Idle Power Optimizations
Core Parking
Hyper-V (v2)
Power Metering and Budgeting
SSD
Power Diagnostics and Control
Summary
92. CMG‘08 INTERNATIONAL
conference
Get Idle; Stay Idle
Shut down unnecessary services, applications, roles,
devices, drivers
Avoid polling and spinning in tight loops
Avoid high-res periodic timers (<10 ms)
Timer Coalescing
Intelligent Timer Tick Distribution (ITTD)
Use NUMA-based affinity for threads and interrupts
Thread (via APIs and tools): soft (IdealProc), hard (affinity
mask)
Interrupts (via IntPolicy.exe)
Idle improvements extend to Hyper-V
Significant reduction in platform interrupt activity
Enables power savings and greater scalability
93. CMG‘08 INTERNATIONAL
conference
Timer Coalescing
Platform energy efficiency can be improved by extending
idle periods
New timer coalescing API enables callers to specify a tolerance for
due time
Enables the kernel to expire multiple timers at the same time
Extensions should integrate with WS08R2 API/DDI
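One way such coalescing can work, sketched under the assumption that each timer may fire anywhere in [due, due + tolerance]; names are hypothetical and this is not the kernel's algorithm:

```python
def coalesce(timers):
    """Group timers whose [due, due + tolerance] windows overlap so that one
    wakeup services each group. 'timers' is a list of (due, tolerance) pairs."""
    groups = []
    for due, tol in sorted(timers):
        if groups and due <= groups[-1]["fire"]:
            # This timer's window overlaps the group's fire time: join it,
            # and fire as late as every member still allows.
            groups[-1]["members"].append((due, tol))
            groups[-1]["fire"] = min(groups[-1]["fire"], due + tol)
        else:
            groups.append({"fire": due + tol, "members": [(due, tol)]})
    return groups
```

Three timers due at 100, 120, and 200 (tolerances 50, 30, 10) collapse into two wakeups instead of three, lengthening the idle period between them.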
94. CMG‘08 INTERNATIONAL
conference
Intelligent Timer Tick Distribution
(Tick Skipping)
Extend processor sleep states by not waking
the CPU unnecessarily
CPU 0 handles the periodic system timer
tick; other processors are signaled as
necessary
Non-timer interrupts will still wake sleeping
processors
Not available on IA64
Only enabled on systems with more C-states
than just C1
95. CMG‘08 INTERNATIONAL
conference
Background Process
Management
Background activity on the macro scale (minutes, hours) is
also important for power
E.g., disk defragmentation, AV scans
Prevents low-power idle and sleep modes
Will collapsing multiple background activities result in a
significantly heavier load during that interval and thus potentially
impede concurrent foreground activity?
Unified Background Process Manager (UBPM)
New WS08R2 infrastructure
Drives scheduling of services and scheduled tasks
Transparent to users, IT pros, and existing APIs
Enables trigger-starting services
Delivers usage data and metrics to Microsoft via CEIP
96. CMG‘08 INTERNATIONAL
conference
UBPM: Trigger-Start
Services
Many services are configured to Autostart and then wait
for rare events
UBPM enables Trigger-Start services based on
environmental changes
Device arrival/removal, IP address change, domain join, etc.
Examples
○ Bluetooth service is started only if a Bluetooth radio is currently attached
○ BitLocker encryption service started only when new volumes detected
ISV Call to Action
Leverage trigger-start capability for value-add services
Validate performance impact with XPerf tools
○ Performance impact can be positive or negative
97. CMG‘08 INTERNATIONAL
conference
Coordinated Processor
Clocking Control
New processor performance state interface
described via ACPI
Feature enables OS and HW platform
coordination of processor power management
Platform is in direct control of T-states and P-states
OS dynamically specifies processor performance
requirements on per-processor basis as a
percentage of maximum frequency
Platform is responsible for delivering requested
performance
○ In some cases, like a power budget condition, the
platform may underdeliver, but must report this
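The contract described above can be sketched as follows; the function name, state list, and budget-cap behavior are illustrative assumptions:

```python
def deliver_performance(requested_pct, pstates, budget_cap_pct=None):
    """pstates: available frequencies as % of maximum, e.g. [100, 80, 60].
    The platform grants the slowest state that still meets the OS request;
    under a power-budget condition it may underdeliver, but reports it."""
    target = requested_pct if budget_cap_pct is None else min(requested_pct, budget_cap_pct)
    granted = next((p for p in sorted(pstates) if p >= target), max(pstates))
    return granted, granted < requested_pct   # (delivered %, underdelivered?)
```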
98. CMG‘08 INTERNATIONAL
conference
Outline
Motivation
Background
Windows Server 2003 → 2008
Windows Server 2008 R2
Server Energy Vision
Idle Power Optimizations
Core Parking
Hyper-V (v2)
Power Metering and Budgeting
SSD
Power Diagnostics and Control
Summary
99. CMG‘08 INTERNATIONAL
conference
Processor Core Parking
This is a Windows scheduler optimization, not HW!
Goals
Save power on multi-processor systems by dynamically
scaling number of active cores to match workload
Drop parked cores into deepest C-states
Approach
Use historical information to predict future workload
Calculate number of cores needed
Heuristically select the “unparked” cores
Monitoring
Perfmon and ETW
100. CMG‘08 INTERNATIONAL
conference
Processor (Logical) Core
Parking
Logical core = HW thread (e.g., Intel®
Hyperthreading)
Extension of Windows’ processor
performance state engine
Configurable via power policy settings
Parking may reduce performance,
depending on the parameter settings, by
reducing OS responsiveness to rising load
levels
Parking could improve performance by
concentrating work onto a smaller number
of cores
101. CMG‘08 INTERNATIONAL
conference
Selecting Cores to Park - 1
WS08R2 (Beta) approach:
Leave one logical core unparked per NUMA node
Other possible approaches, including
customizable minimum unparked entities
Park entire packages at once
Park logical cores individually, regardless of
packages
Leave one logical core unparked per socket
Leave one logical core unparked per physical core
Affinitized activity does tend to unpark logical
cores that must be used (selection heuristic)
Beta tracks affinitized threads, not DPCs / Interrupts
102. CMG‘08 INTERNATIONAL
conference
Selecting Cores to Park - 2
Parking algorithm takes many inputs. At a minimum:
Time since the last parking decision was made
Average frequencies of each core over the last time interval
Average CPU “utilization” over the last time interval
Possible additional inputs depending on parameter
setting and final WS08R2 refinements:
○ Power state domains (i.e., groups of associated cores)
○ Current processor P-States
○ P-State change rate policies (SINGLE, ROCKET, IDEAL)
○ Affinitized DPCs / Interrupts
○ Time spent in affinitized activity
○ More comprehensive or longer historical information
○ More system component topology information
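A toy version of the parking decision using only the minimum inputs listed above, with the utilization history folded into a single average; the headroom factor and the per-NUMA-node floor are illustrative assumptions, not the shipped heuristic:

```python
import math

def cores_to_unpark(avg_util_pct, total_cores, numa_nodes, headroom=1.25):
    """Estimate how many logical cores to leave unparked from the average
    utilization over the last interval, keeping at least one unparked core
    per NUMA node (the WS08R2 Beta floor)."""
    demand = avg_util_pct / 100.0 * total_cores   # cores' worth of work observed
    needed = math.ceil(demand * headroom)         # headroom absorbs rising load
    return max(numa_nodes, min(needed, total_cores))
```

For example, a 16-core, 2-node system averaging 25% utilization would keep 5 cores unparked and drop the rest into deep C-states.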
103. CMG‘08 INTERNATIONAL
conference
Outline
Motivation
Background
Windows Server 2003 → 2008
Windows Server 2008 R2
Server Energy Vision
Idle Power Optimizations
Core Parking
Hyper-V (v2)
Power Metering and Budgeting
SSD
Power Diagnostics and Control
Summary
104. CMG‘08 INTERNATIONAL
conference
Hyper-V Power
Management
Full P-state/C-state management already
integrated between Windows root partition
and Hyper-V v1 (WS08)
Enlightenments added in Hyper-V v2
(WS08R2)
Hypervisor delivers child clocks without requiring
root interaction, plus Intelligent Timer Tick
Distribution (to children)
Core parking enabled for all partitions
105. CMG‘08 INTERNATIONAL
conference
Web Fundamentals Dynamic: WS08
[Chart: throughput (Reqs/Sec) and power (Watts) versus system utilization percentage; adding load to each guest.]
For these experiments, the highest system utilization under WF
workload is ~80%. This issue has been subsequently resolved.
[Chart: throughput (Reqs/sec) and power (Watts) versus system utilization percentage; adding guests.]
107. CMG‘08 INTERNATIONAL
conference
SPECpower + WF (WS08)
SPECpower throughput and server power
usage versus total system utilization
Power usage for various
throughput levels
[Charts: SPECpower throughput (in thousands) and power (Watts) versus system utilization percentage; power (% of Watts) versus workload (% of maximum throughput).]
•4 Guests running WF (4940 requests/sec)
•~25% system utilization; ~35% guest virtual processor utilization
•4 Guests running SPECpower (similar efficiency as single workload)
108. CMG‘08 INTERNATIONAL
conference
Outline
Motivation
Background
Windows Server 2003 → 2008
Windows Server 2008 R2
Server Energy Vision
Idle Power Optimizations
Core Parking
Hyper-V (v2)
Power Metering and Budgeting
SSD
Power Diagnostics and Control
Summary
109. CMG‘08 INTERNATIONAL
conference
Power Metering and
Budgeting
In the future, servers are likely to have
onboard power meters
AC power (into the power supply)
DC power (out of the power supply)
For components (CPU, RAM, IO, fans, disks, …)
WS08R2 provides the capability to monitor
such meters, as well as communicate with
power management logic, through
standard Windows and ACPI interfaces
Power budget information is reported to OS
Optional support for configuring the budget from
within Windows
110. CMG‘08 INTERNATIONAL
conference
Power Metering and
Budgeting
[Diagram: WMI consumers (System Center, admin scripts, hardware management tools) query the WMI namespace root\cimv2\power (Power Supply class, Power Meter class, Power Meter events), served by power WMI providers in the user-mode Power Service. Below that sits a standard Windows IOCTL interface with an in-box ACPI-based implementation (vendors provide ACPI code in firmware; BMC hardware) plus other vendor-specific implementations. Implemented in WS08R2.]
111. CMG‘08 INTERNATIONAL
conference
Power Metering and Budgeting
WS08R2 introduces the ability to report power consumption
and budgeting information
Server platform reports this in-band to the OS via ACPI
No additional drivers or HW changes are required, only
platform support
Power information is exposed via WMI
Adheres to the DMTF Power Supply Profile v1.01
Power budget information is reported to the OS
Optional support for configuring the budget from within Windows
Extendable to enable per-device metering
WDM driver interface available
Design goals
Standard hardware and software interfaces
Native infrastructure, easily extendable
Leverages existing platform technology
112. CMG‘08 INTERNATIONAL
conference
Power Metering and Budgeting – Usage
Statistical/inventory/auditing
Data center can monitor power
consumption across nodes
Administrator can write scripts to control
power policies and receive power condition
events
Model can be extended to per-device
meters
Another set of metrics for virtualization and
consolidation
113. CMG‘08 INTERNATIONAL
conference
Power Metering and Budgeting – WDM
Standard Windows driver IOCTL interface
Event model based on pending IO requests
(IRPs)
Two separate device interfaces
Consumed by the WMI providers
An alternative to the ACPI implementation
Future direction – potentially consumed by
the kernel power manager
Documented on MSDN
114. CMG‘08 INTERNATIONAL
conference
Power Metering and Budgeting –
ACPI
Rationale
Works as the abstraction layer to the underlying
platform technology (IPMI, WSMAN, etc.)
Scales across different platforms
Does not require special drivers
Requires only firmware updates
Currently being proposed to the ACPI 4.0
specification
Delegate tasks to the BMC (e.g., rolling
average calculation, polling for events, etc.)
115. CMG‘08 INTERNATIONAL
conference
Power Metering and Budgeting –
ACPI
Power supply device
Extends the current power source device
Control method to publish capabilities
Power meter device
Similar to control method for batteries
A set of control methods to get capabilities
and set configuration parameters, trip points,
and configure hardware enforced limits
Event notification via Notify codes
116. CMG‘08 INTERNATIONAL
conference
Power Metering and Budgeting –
ACPI
WS08R2 will provide
In-box driver to support power meter device(s)
described in ACPI
In-box IPMI operation region handler as part of
the Microsoft IPMI driver – allowing ACPI control
methods to communicate with IPMI using the
KCS protocol
○ Format similar to the SMBUS OpRegion
○ 3rd-party IPMI drivers can register OpRegion handlers
for other IPMI protocol(s)
○ Also proposed to ACPI 4.0 specification
117. CMG‘08 INTERNATIONAL
conference
Outline
Motivation
Background
Windows Server 2003 → 2008
Windows Server 2008 R2
Server Energy Vision
Idle Power Optimizations
Core Parking
Hyper-V (v2)
Power Metering and Budgeting
SSD
Power Diagnostics and Control
Summary
118. CMG‘08 INTERNATIONAL
conference
WS08R2 Enables Improved
Endurance for SSD Technology
SSD can identify itself differently from HDD in
ATA as defined through ATA8-ACS Identify
Word 217: Nominal media rotation rate
Reporting non-rotating media will allow WS08R2
to turn Defrag off by default, improving device
endurance by reducing writes
119. CMG‘08 INTERNATIONAL
conference
WS08R2 Enables Optimization
for SSD Technology
Microsoft implementation of “Trim” feature
NTFS will send delete notifications down to devices
supporting “trim”
○ File system operations: Format, Delete, Truncate,
Compression
○ OS internal processes: e.g., Snapshot, Volume
Manager
Three optimization opportunities for the device
Enhancing device wear leveling by eliminating
merge operation for all deleted data blocks
Making early garbage collection possible for fast
write
Keeping device’s unused storage area as high as
possible; more room for device wear leveling.
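The first optimization can be illustrated with a toy garbage-collection step: pages whose LBAs have been trimmed need not be merged into the fresh block. This is a simplified model, not any vendor's FTL:

```python
def pages_to_copy(block_pages, trimmed_lbas):
    """During garbage collection only still-valid pages are merged into a
    fresh block; pages whose LBAs were trimmed are simply dropped."""
    return [lba for lba in block_pages if lba not in trimmed_lbas]

block = ["lba0", "lba1", "lba2", "lba3"]
# Without trim the device must treat all four pages as live; after the file
# system trims lba1 and lba3, only two pages survive the merge.
survivors = pages_to_copy(block, {"lba1", "lba3"})
```

Fewer copied pages means fewer program cycles (better wear leveling) and more free space left over for the device to work with.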
123. CMG‘08 INTERNATIONAL
conference
Monitoring Power Status - 1
System Event Log: ID 4
Perfmon/Logman
Processor
○ Provide average C-state information
% C1/2/3 Time and C1/2/3 Transitions/sec
Processor Information
○ Parking status
Processor Performance
○ Only present if P-states are exposed
○ Provide current P-state information (e.g., avg freq)
Resource Monitor
CPU % Max Frequency average and graph
127. CMG‘08 INTERNATIONAL
conference
Monitoring Power Status - 2
ETW tracing (Windows Perf Tool Kit)
Xperf –on power
Pwrtest.exe
Logs use of P-, T-, and C-states
Pwrtest /ppm
○ Sampling P-state and C-state performance
Pwrtest /ppm /live
○ Event-driven logging for all P-state and C-state transitions
128. CMG‘08 INTERNATIONAL
conference
Pwrtest.exe /info:ppm
C:\Program Files\Microsoft PwrTest> pwrtest /info:ppm
PROCESSOR_POWER_INFORMATION
CPU Number = 0
MaxMhz = xxxx
CurrentMhz = yyyy
MhzLimit = zzzz
MaxIdleState = M
CurrentIdleState = N
InstanceName: CPU Model X
(continued)
Processor Performance States
PerfStates:
Max Transition Latency: xx us
Number of States: yy
State Speed (Mhz) Type
0 aaaa (100%) Perf
1 bbbb ( ss%) Perf
2 cccc ( tt%) Perf
3 dddd ( uu%) Throttle
4 eeee ( vv%) Throttle
5 ffff ( ww%) Throttle
132. CMG‘08 INTERNATIONAL
conference
Power Controls:
Powercfg.exe
Configure power settings within a
specific power scheme (WS03+)
WS08R2: Detect common energy
efficiency problems (via /ENERGY flag)
USB device selective suspend
Processor Power Management (PPM)
Inefficient power policy settings
Platform timer resolution
Platform firmware problems
…and more
133. CMG‘08 INTERNATIONAL
conference
Configure power setting within a specific power
scheme
Set AC, DC values for individual settings
Every power setting belongs to a Subgroup
-setdcvalueindex used for battery scenario
C:\> powercfg.exe -setacvalueindex
<SCHEME> <SUBGROUP> <SETTING> <VALUE>
C:\> powercfg.exe -setacvalueindex
SCHEME_BALANCED SUB_SLEEP STANDBYIDLE 0
134. CMG‘08 INTERNATIONAL
conference
Power Efficiency
Diagnostics
“Powercfg /ENERGY” to start tracing
Close open applications and documents first
Inbox with WS08R2 only
Leverages new inbox ETW instrumentation
Advanced users can run utility and view HTML output
Automatically executed when the system is idle [Win7]
Reports data to Microsoft via Customer Experience Improvement
Program (CEIP)
Attend
for demo and details
137. CMG‘08 INTERNATIONAL
conference
Lab Issues: Processor
Utilization is Based on Non-Idle Wall Time
Idle == idle loop or HALT
It doesn’t take frequency into account, so 100% CPU
utilization could be at P0 or at Pn
There may actually be more performance on the table
Idle time will include the time taken to return from
C-states (HALT), which could be microseconds
CPU utilization will include cache warm-up effects if
the cache has been flushed to reach the deepest
C-states
CPU utilization will include latencies caused by
remote memory being in low-power states
In particular, AMD and future Intel processors where
memory is socket-attached
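The first issue above can be illustrated with a frequency-aware utilization metric; this is a sketch for lab analysis, not an existing Windows counter:

```python
def effective_utilization(busy_pct, current_mhz, max_mhz):
    """Scale non-idle wall time by the frequency it ran at: 100% busy at
    half frequency is only about half of the processor's real capacity,
    so there may still be performance on the table."""
    return busy_pct * current_mhz / max_mhz
```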
138. CMG‘08 INTERNATIONAL
conference
Lab Issues: OS vs. HW
C-States
Only three C-states selected by the OS:
C1: C1 in HW
C2: lowest power “type 2” C-state reported
by HW
C3: Cn in HW
Perfmon shows OS perspective of C-
states
140. CMG‘08 INTERNATIONAL
conference
Summary
Windows Server 2008 and 2008 R2 deliver
real energy savings for the data center
New WS08R2 features deliver enhanced
power efficiency and better manageability
Improvements to idle and low-to-medium
workload operating efficiency
Management of power policy via WMI
Power metering support provides energy
consumption information through Windows
141. CMG‘08 INTERNATIONAL
conference
Future Work Example:
NonVolatile Memory (NVM)
Solid State Disk (current server usage)
Potential additional layer(s) in memory hierarchy
Cache (a la ReadyBoost)
DRAM complement
Very low power when idle
But low-power DRAM may narrow the gap significantly
Poor performance of random writes
Could be improved by coalescing and remapping writes
Block orientation
Difficult to use as DRAM complement
Limited lifetime of Flash cells
Future NVM technologies may improve on this
142. CMG‘08 INTERNATIONAL
conference
Call to Action - 1
Make sure any reduction in server capabilities
is a planned-for and acceptable tradeoff
between power and performance
TANSTAAFL, Do More With Less
Reduce idle activity and power consumption
Validate new platform power management using
Power Efficiency Diagnostics
ISV/IHV Call to Action for Power: eliminate
activity during workload idle periods in
applications and drivers
Target average idle periods of at least 100 ms
Provide software with adjustable tradeoffs between
power and performance when appropriate
143. CMG‘08 INTERNATIONAL
conference
Call To Action - 2
Build power efficient platforms and
solutions
Expose complete processor (and memory and
device) information from BIOS
Ensure drivers and applications work with core
parking enabled
Speak with Microsoft about creating ACPI-based
power meter and supply devices
Get the Enhanced Power Management logo
Review microsoft.com power whitepapers
and presentations
145. CMG‘08 INTERNATIONAL
conference
Additional Resources
WDK available with pre-Beta
Web Resources:
White papers and presentations at www.microsoft.com (search on “power”)
○ http://www.microsoft.com/whdc (search on “power”)
Windows Hardware Developer Central – Power Management:
…/whdc/system/pnppwr/
Processor Power Management in Windows Vista and Windows Server 2008:
…/whdc/system/pnppwr/powermgmt/ProcPowerMgmt.mspx
ACPI / Power Management: …/whdc/system/pnppwr/powermgmt/default.mspx
Recommendations for Power Budgeting with Windows Server:
…/whdc/system/pnppwr/powermgmt/Svr_PowerBudget.mspx
Active State Power Management in Windows Vista: …/whdc/connect/pci/aspm.mspx
○ Windows Server 2008 Power Savings
http://download.microsoft.com/download/4/5/9/459033a1-6ee2-45b3-ae76-a2dd1da3e81b/Windows_Server_2008_Power_Savings.docx
○ Designing Efficient Background Processes for Windows (Trigger-Start Services):
http://go.microsoft.com/fwlink/?LinkId=128622
ACPI Specifications: http://www.acpi.info
80 Plus Program for power supplies: http://www.80plus.org
Energy Star Power Supply Specification Draft:
http://www.energystar.gov/ia/partners/prod_development/new_specs/downloads/Draft1_Server
_Spec.pdf
E-mail: Server Power Feedback alias srvpwrfb@microsoft.com
146. CMG‘08 INTERNATIONAL
conference
Sources
Estimating Total Power Consumption by Servers in the U.S. and the
World – Jonathan G. Koomey, Ph.D.
http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
Bureau of Labor Statistics
http://data.bls.gov/cgi-bin/cpicalc.pl
US Energy Information Administration
http://www.eia.doe.gov/fuelelectric.html
AFCOM Data Center Institute’s Five Bold Predictions, 2006
http://www.afcom.com/News_Releases/Afcom_In_The_News_05010601.asp
Intel Server Products Power Budget Analysis Tool
http://www.intel.com/support/motherboards/server/sb/cs-016976.htm
Data center TCO benefits of reduced air flow -- Malone, Vinson, and
Bash
Various Gartner press releases
Aperture Research Institute
EYP Mission Critical Facilities Inc.
Power In, Dollars Out: How to Stem the Flow in the Data Center
http://www.microsoft.com/whdc/system/pnppwr/powermgmt/Svr_Pwr_ITAdmin.mspx
150. CMG‘08 INTERNATIONAL
conference
2004 Energy Consumption = ~100 quads
2004 Energy Expenditures = ~$910 billion
Growing Energy Demand
[Chart: U.S. Energy Consumption 1949-2004, All Fuels (TBTU); Industrial = red, Transportation = purple, Residential = green, Commercial = blue.]
156. CMG‘08 INTERNATIONAL
conference
Power Metering
In the future, servers are likely to have
onboard power meters
AC power (into the power supply)
DC power (out of the power supply)
For individual components (CPU, RAM, IO, fans,
disks, …)
WS08R2 provides the capability to monitor
such meters, as well as communicate with
power management logic, through
standard Windows and ACPI interfaces
157. CMG‘08 INTERNATIONAL
conference
Power Metering and
Budgeting WS08R2 introduces the ability to report power consumption
and budgeting information
Server platform reports this in-band to the OS via ACPI
No additional drivers are required or HW changes, only platform
support
Power information is exposed via WMI
Adheres to the DMTF Power Supply Profile v1.01
Power budget information is reported to the OS
Optional support for configuring the budget from within Windows
Extendable to enable per-device metering
WDM driver interface available
Design goals
Standard hardware and software interfaces
Native infrastructure, easily extendable
Leverages existing platform technology
159. CMG‘08 INTERNATIONAL
conference
Based on the DMTF management profiles
New power namespace – root\cimv2\power
1) Power supply device
Inventory information
Capabilities/characteristics
Redundancy information
160. CMG‘08 INTERNATIONAL
conference
2) Power meter device
Inventory information
Capabilities/characteristics
Latest meter measurements
OS-Configurable trip-points
Configurable platform enforced limit
3) Power supply/meter events
Notification for changes in configuration and capabilities
Notification for trip-points crossed and platform limit enforced
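The trip-point behavior in items 2) and 3) can be modeled with a toy rolling-average meter; the class and method names are hypothetical:

```python
from collections import deque

class PowerMeterSketch:
    """Toy model of a metered power supply: keeps a rolling average of power
    samples and reports an event when the average crosses a trip point."""
    def __init__(self, window, trip_watts):
        self.samples = deque(maxlen=window)
        self.trip = trip_watts
        self.tripped = False

    def sample(self, watts):
        self.samples.append(watts)
        avg = sum(self.samples) / len(self.samples)
        if avg > self.trip and not self.tripped:
            self.tripped = True               # a WMI event would fire here
            return "trip-point crossed"
        if avg <= self.trip:
            self.tripped = False              # re-arm once back under the limit
        return None
```

On real hardware the rolling-average calculation and event polling can be delegated to the BMC, as noted later for the ACPI proposal.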
161. CMG‘08 INTERNATIONAL
conference
Statistical/inventory/auditing
Data center can monitor power consumption
across nodes
Administrator can write scripts to control power
policies and receive power condition events
Model can be extended to per-device meters
Another set of metrics for virtualization and
consolidation
162. CMG‘08 INTERNATIONAL
conference
Standard Windows driver IOCTL interface
Event model based on pending IO requests
(IRPs)
Two separate device interfaces
Consumed by the WMI providers
An alternative to the ACPI implementation
Future direction – potentially consumed by the
kernel power manager
Documented on MSDN
163. CMG‘08 INTERNATIONAL
conference
Rationale
Works as the abstraction layer to the underlying
platform technology (IPMI, WSMAN, etc.)
Scales across different platforms
Does not require special drivers
Requires only firmware updates
Currently being proposed to the ACPI 4.0
specification
Delegate tasks to the BMC (e.g., rolling average
calculation, polling for events, etc.)
164. CMG‘08 INTERNATIONAL
conference
Power supply device
Extends the current power source device
Control method to publish capabilities
Power meter device
Similar to control method for batteries
A set of control methods to get capabilities
and set configuration parameters, trip points,
and configure hardware enforced limits
Event notification via Notify codes
165. CMG‘08 INTERNATIONAL
conference
WS08R2 will provide
In-box driver to support power meter device(s)
described in ACPI
In-box IPMI operation region handler as part of the
Microsoft IPMI driver – allowing ACPI control
methods to communicate with IPMI using the KCS
protocol
Format similar to the SMBUS OpRegion
3rd-party IPMI drivers can register OpRegion handler
for other IPMI protocol(s)
Also proposed to ACPI 4.0 specification
168. CMG‘08 INTERNATIONAL
conference
Flash SSD versus HDD (Jun
‘08)
                                   HDD                 Flash SSD
Endurance (write cycles per bit)   10^12               10^5 (SLC*), 10^6 (MLC*)
Cost per byte                      1x                  2.5x – 25x
Performance: small random reads    1x                  10 – 100x
Active Power (Watts/byte)          10-20x              1x
Shock Resistance (non-operating)   100g, 200g (2010)   1500g
Shock Resistance (operating)       ~10g                100g
Thermal (°C)                       5-55                0-70
* SLC – Single Level Cell; MLC – Multi Level Cell
169. CMG‘08 INTERNATIONAL
conference
Flash Characteristics (Jun
’08)
Chip bandwidth   Read 50 MB/s; Write 25 MB/s (scales with number of chips)
Read Latency     25 μs to start; 100 μs for 2KB “page”
Write Latency    200 to 300 μs for 2KB “page”; 2,000 μs to erase
Active Power     1-2 Watts for 8 chips + controller
170. CMG‘08 INTERNATIONAL
conference
SSD High-IOps Workload
TCO
Decrease TCO for IOps-intensive systems
IOps bottleneck causes customers to buy spindles
instead of capacity, driving up TCO and
operational complexity (e.g., workload balancing)
SSDs provide less expensive systems for same
performance targets
Smaller form factors
171. CMG‘08 INTERNATIONAL
conference
SSD Performance Concerns
- 1
Random write perf
Could be alleviated with next generation of
products
New technological problems may arise with future
generations (no guarantee that it will stay at same
level)
Potential bottleneck on erasing/block cleaning
Mixing workloads creates unexpected
performance characteristics
Read:write ratio, request sizes, sequentiality
172. CMG‘08 INTERNATIONAL
conference
SSD Performance Concerns
- 2
First-pass performance might be better than
steady-state
When nearing EOL, perf may degrade as blocks
are removed from pool
Does mapping metadata have to be
re-read/initialized after power failure?
Need enough onboard parallelism to keep
internal serial interfaces from becoming
bottlenecks
Just like disk arrays, the wrong stripe unit size
can kill perf in an SSD array
173. CMG‘08 INTERNATIONAL
conference
WS08R2 Enables Improved
Endurance for SSD Technology
SSD can identify itself differently from HDD in ATA
as defined by ATA8-ACS Identify Word 217:
Nominal media rotation rate
Reporting non-rotating media will allow WS08R2
to turn Defrag off by default, improving device endurance
by reducing writes
174. CMG‘08 INTERNATIONAL
conference
WS08R2 Enables Optimization
for SSD Technology
Microsoft implementation of “Trim” feature
NTFS will send delete notifications down to devices
supporting “trim”
○ File system operations: Format, Delete, Truncate, Compression
○ OS internal processes: e.g., Snapshot, Volume Manager
Three optimization opportunities for the device
Enhancing device wear leveling by eliminating merge
operation for all deleted data blocks
Making early garbage collection possible for fast write
Keeping device’s unused storage area as high as
possible; more room for device wear leveling.
180. CMG‘08 INTERNATIONAL
conference
Hyper-V Power
Management
Full P-state/C-state management already
integrated between Windows root partition
and Hyper-V v1 (WS08)
Enlightenments, such as timer assist, added in
Hyper-V v2 (WS08R2)
Hypervisor delivers child clocks without requiring
root interaction, plus ITTD
Core parking enabled for all partitions
181. CMG‘08 INTERNATIONAL
conference
Overview
Scheduling virtual machines on a single server for
density as opposed to dispersion
This allows cores to be “parked” (put to sleep) by
putting them into deep C-states
Benefits
Significantly enhances Green IT by being able to
reduce power required for CPUs
Idle improvements extend to Hyper-V
Significant reduction in platform interrupt activity
Enables power savings and greater scalability
185. CMG‘08 INTERNATIONAL
conference
Hyper-V Power Efficiency
Windows Server Performance Lab
Testbed Configurations
Single Workloads
Web Fundamentals
SPECpower
Mixed Workloads
186. CMG‘08 INTERNATIONAL
conference
Hyper-V Power Efficiency
Workload configuration - 1
Methodology for obtaining power load line
data for TPC-C, TPC-E, FSCT, and Web
Fundamentals has been demonstrated
Benchmark loads varied by throttling number of
active users
Multiple workloads tested in Hyper-V environment
SPECpower has been successfully tuned
188. CMG‘08 INTERNATIONAL
conference
Hyper-V Power Efficiency
Workload configuration - 2
Single workloads
All the guests run the same workload
Two scenarios:
○ Fixing the number of active guests and scaling the
load in each guest
○ Fixing the load in each guest and activating more
guests
Mixed workloads
Half of guests run each workload
○ Fixed load in WF guests (~35% CPU utilization each)
○ Varying load in SPECpower guests
189. CMG‘08 INTERNATIONAL
conference
HW and SW Test Configurations
Hardware
2-socket quad-core processors
○ Minimal P-States
16GB memory: 4x4GB 667MHz DIMMs
External (wall) power monitor
Software
OS: Windows Server 2008
○ OS Power Management: Balanced mode
Hyper-V v2 (pre-release build)
○ Configured with 8 guests
Single virtual processor: 3.16GHz
1.75GB memory
190. CMG‘08 INTERNATIONAL
conference
Web Fundamentals
Dynamic
Adding Load to Each Guest
[Chart: power (Watts) versus throughput (requests/sec).]
Throughput and power usage
versus total system utilization
Power usage for various
throughput levels
[Chart: throughput (requests/sec) and power (Watts) versus system utilization percentage.]
For these experiments, the highest system utilization under WF
workload is ~80%. This issue has been subsequently resolved.
191. CMG‘08 INTERNATIONAL
conference
Web Fundamentals
Dynamic
Activating Guests - 1
[Chart: throughput (requests/sec) and power (Watts) versus system utilization percentage.]
Throughput and power usage
versus total system utilization
Power usage for various
throughput levels
•Data points from left to right: 0 guests, 1 guest, 2 guests, …, 8 guests active
•Each active guest tries to run at the maximum load
[Chart: power (Watts) versus throughput (requests/sec).]
192. CMG‘08 INTERNATIONAL
conference
Web Fundamentals
Dynamic
Activating Guests - 2
Virtual processor utilizations for
different numbers of active guests
•The maximum utilization of each guest decreases as more guests are
activated. Most of this decrease has been subsequently removed.
[Chart: virtual processor utilization (%) of Guests 1-8 versus number of active guests.]
193. CMG‘08 INTERNATIONAL
conference
SPECpower
Adding Load to Each Guest - 1
Throughput and power usage
versus total system utilization
Power usage for various
throughput levels
[Charts: SPECpower throughput (in thousands) and power (Watts) versus system utilization percentage; power (% of max Watts) versus workload (% of maximum throughput).]
194. CMG‘08 INTERNATIONAL
conference
SPECpower
Adding Load to Each Guest - 2
Average processor frequency for various workload levels
[Chart: frequency (MHz) versus workload (% of max throughput).]
196. CMG‘08 INTERNATIONAL
conference
SPECpower
Activating Guests
Throughput and power usage
versus total system utilization
Power usage for various
throughput levels
Similar scalability behavior of power and throughput as when adding load.
[Charts: SPECpower throughput (in thousands) and power (Watts) versus system utilization percentage; power (% of max Watts) versus workload (% of maximum throughput).]
197. CMG‘08 INTERNATIONAL
conference
SPECpower and WF
Mixed Workloads - 1
SPECpower throughput and server power
usage versus total system utilization
Power usage for various
throughput levels
[Charts: SPECpower throughput (in thousands) and power (Watts) versus system utilization percentage; power (% of Watts) versus workload (% of maximum throughput).]
•4 Guests running WF (4940 req/sec)
•~25% system utilization; ~35% guest virtual processor utilization
•4 Guests running SPECpower (similar efficiency as single workload)
198. CMG‘08 INTERNATIONAL
conference
SPECpower and WF
Mixed Workloads - 2
Average processor frequency for various
levels of SPECpower workload
[Chart: frequency (MHz) versus workload (% of max SPECpower throughput).]
199. CMG‘08 INTERNATIONAL
conference
Hyper-V Power Efficiency
Future Experiments
More workloads
Different workload mix scenarios
Different combinations of fixed and varying
workloads
More VM configurations
Multiple virtual processors per guest
Oversubscription
201. CMG‘08 INTERNATIONAL
conference
Power WMI Provider
Enables power policy configuration through
standard WMI interface
Change power setting values
Activate a given plan
Conforms to DMTF data model
To get started…
Change a power setting: Win32_PowerSetting
Activate a plan: Win32_PowerPlan.Activate() method
Attend the related session for additional details
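The provider is driven almost entirely by WQL "ASSOCIATORS OF" queries: you start from one instance (a setting, an index, or a plan) and ask WMI for the associated instances of another class. The query is just a string with a fixed shape, which a small helper makes explicit. A sketch for illustration only (`associators_of` is a hypothetical helper, not part of any API; the GUID is the display blank timeout setting used in the script later in this deck):

```python
# Sketch of the WQL "ASSOCIATORS OF" query shape used by the power WMI
# provider examples: given a source instance path, ask for associated
# instances of a given result class. Pure string construction.
def associators_of(source_class, instance_id, result_class):
    # Object path: ClassName.InstanceID="value"
    path = f'{source_class}.InstanceID="{instance_id}"'
    return f"ASSOCIATORS OF {{{path}}} WHERE ResultClass = {result_class}"

q = associators_of(
    "Win32_PowerSetting",
    "Microsoft:PowerSetting\\{3c0bc021-c8a8-4e07-a973-6b14cbcb2b7e}",
    "Win32_PowerSettingDataIndex",
)
print(q)
# Note: in a real WQL query the backslash inside InstanceID must be escaped
# (doubled), as the VBScript examples later in this deck do with Replace().
```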
202. CMG‘08 INTERNATIONAL
conference
Configuration and Administration
WMI interfaces to query and set configuration settings
Configuration of systems
Global administration
Management applications
WMI interfaces to query current state and hardware capabilities
3rd party applications
Diagnostics
203. CMG‘08 INTERNATIONAL
conference
TargetSetting = "Microsoft:PowerSetting\{3c0bc021-c8a8-4e07-a973-6b14cbcb2b7e}" ' display blank timeout
Set objWMIService = GetObject("WinMgmts:\\.\root\cimv2\power")
Set SettingIndices = objWMIService.ExecQuery( _
    "ASSOCIATORS OF {Win32_PowerSetting.InstanceID=" & chr(34) & _
    Replace(TargetSetting, "\", "\\") & chr(34) & _
    "} WHERE ResultClass = Win32_PowerSettingDataIndex")
For Each SettingIndex in SettingIndices
    Set Plans = objWMIService.ExecQuery( _
        "ASSOCIATORS OF {Win32_PowerSettingDataIndex.InstanceID=" & chr(34) & _
        Replace(SettingIndex.InstanceID, "\", "\\") & chr(34) & _
        "} WHERE ResultClass = Win32_PowerPlan")
    For Each Plan in Plans
        If Plan.IsActive Then
            SettingIndex.SettingIndexValue = 120 ' 2 minutes (value is in seconds)
            SettingIndex.Put_
            Plan.Activate()
        End If
    Next
Next
204. CMG‘08 INTERNATIONAL
conference
Remote Power
Manageability - 1
WS08R2 supports the configuration of power
policy via WMI
Local and remote management via WMI
Adheres to DMTF conventions for setting data
Scriptable
Includes support for reading and writing all
power plan and setting data
The active power plan can be changed remotely
Power actions can be carried out remotely
(e.g., sending a node to S3)
206. CMG‘08 INTERNATIONAL
conference
Get the Active Plan:
Set objWMIService = GetObject("WinMgmts:\\.\root\cimv2\power")
Set PowerPlans = objWMIService.InstancesOf("Win32_PowerPlan")
For Each PowerPlan in PowerPlans
    If PowerPlan.IsActive Then
        wscript.echo "Current Plan: " & PowerPlan.ElementName
    End If
Next

' Activate a given plan object:
PowerPlan.Activate()
207. CMG‘08 INTERNATIONAL
conference
Get all power settings in the Active Plan:
(Continued with PowerPlan)
EscapedInstanceID = Replace(PowerPlan.InstanceID, "\", "\\")
Set PowerSettingIndexes = objWMIService.ExecQuery( _
    "ASSOCIATORS OF {Win32_PowerPlan.InstanceID=" & chr(34) & _
    EscapedInstanceID & chr(34) & "}")
For Each PowerSettingIndex in PowerSettingIndexes
    EscapedInstanceID = Replace(PowerSettingIndex.InstanceID, "\", "\\")
    Set PowerSettings = objWMIService.ExecQuery( _
        "ASSOCIATORS OF {Win32_PowerSettingDataIndex.InstanceID=" & chr(34) & _
        EscapedInstanceID & chr(34) & "} WHERE ResultClass = Win32_PowerSetting")
    For Each PowerSetting in PowerSettings
        wscript.echo "Power Setting: " & PowerSetting.InstanceID
        wscript.echo "Description: " & PowerSetting.Description
        wscript.echo "Index Value: " & PowerSettingIndex.SettingIndexValue
    Next
Next
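The Replace() calls in the script above exist to escape backslashes: WMI InstanceID values such as "Microsoft:PowerPlan\{GUID}" contain a literal backslash, but WQL treats "\" as an escape character inside quoted strings, so each backslash must be doubled before the value is embedded in a query. A minimal sketch of just that escaping step (Python used for illustration; the plan GUID shown is only an example value):

```python
# Double every backslash so the InstanceID can be embedded safely in a
# quoted WQL string (mirrors Replace(id, "\", "\\") in the VBScript above).
def escape_for_wql(instance_id):
    return instance_id.replace("\\", "\\\\")

raw = "Microsoft:PowerPlan\\{381b4222-f694-41f0-9685-ff5bb260df2e}"
print(escape_for_wql(raw))
# The single backslash in the raw InstanceID comes out doubled.
```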