Power Reduction Techniques for Microprocessor Systems
VASANTH VENKATACHALAM AND MICHAEL FRANZ
University of California, Irvine
Power consumption is a major factor that limits the performance of computers. We
survey the “state of the art” in techniques that reduce the total power consumed by a
microprocessor system over time. These techniques are applied at various levels
ranging from circuits to architectures, architectures to system software, and system
software to applications. They also include holistic approaches that will become more
important over the next decade. We conclude that power management is a multifaceted
discipline that is continually expanding with new techniques being developed at every
level. These techniques may eventually allow computers to break through the “power
wall” and achieve unprecedented levels of performance, versatility, and reliability. Yet it
remains too early to tell which techniques will ultimately solve the power problem.
Categories and Subject Descriptors: C.5.3 [Computer System Implementation]:
Microcomputers—Microprocessors; D.2.10 [Software Engineering]: Design—
Methodologies; I.m [Computing Methodologies]: Miscellaneous
General Terms: Algorithms, Design, Experimentation, Management, Measurement,
Performance
Additional Key Words and Phrases: Energy dissipation, power reduction
1. INTRODUCTION

Computer scientists have always tried to improve the performance of computers. But although today's computers are much faster and far more versatile than their predecessors, they also consume a lot of power; so much power, in fact, that their power densities and concomitant heat generation are rapidly approaching levels comparable to nuclear reactors (Figure 1). These high power densities impair chip reliability and life expectancy, increase cooling costs, and, for large
Parts of this effort have been sponsored by the National Science Foundation under ITR grant CCR-0205712
and by the Office of Naval Research under grant N00014-01-1-0854.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
authors and should not be interpreted as necessarily representing the official views, policies or endorsements,
either expressed or implied, of the National Science Foundation (NSF), the Office of Naval Research (ONR),
or any other agency of the U.S. Government.
The authors also gratefully acknowledge gifts from Intel, Microsoft Research, and Sun Microsystems that
partially supported this work.
Authors’ addresses: Vasanth Venkatachalam, School of Information and Computer Science, University of
California at Irvine, Irvine, CA 92697-3425; email: vvenkata@uci.edu; Michael Franz, School of Information
and Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: franz@uci.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or direct commercial advantage and
that copies show this notice on the first page or initial screen of a display along with the full citation.
Copyrights for components of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any
component of this work in other works requires prior specific permission and/or a fee. Permissions may be
requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212)
869-0481, or permissions@acm.org.
© 2005 ACM 0360-0300/05/0900-0195 $5.00
ACM Computing Surveys, Vol. 37, No. 3, September 2005, pp. 195–237.
data centers, even raise environmental concerns.

At the other end of the performance spectrum, power issues also pose problems for smaller mobile devices with limited battery capacities. Although one could give these devices faster processors and larger memories, this would diminish their battery life even further.

Without cost-effective solutions to the power problem, improvements in microprocessor technology will eventually reach a standstill. Power management is a multidisciplinary field that involves many aspects (i.e., energy, temperature, reliability), each of which is complex enough to merit a survey of its own. The focus of our survey will be on techniques that reduce the total power consumed by typical microprocessor systems.

We will follow the high-level taxonomy illustrated in Figure 2. First, we will define power and energy and explain the complex parameters that dynamic and static power depend on (Section 2). Next, we will introduce techniques that reduce power and energy (Section 3), starting with circuit (Section 3.1) and architectural techniques (Section 3.2, Section 3.3, and Section 3.4), and then moving on to two techniques that are widely applied in hardware and software, dynamic voltage scaling (Section 3.5) and resource hibernation (Section 3.6). Third, we will examine what compilers can do to manage power (Section 3.7). We will then discuss recent work in application-level power management (Section 3.8), and recent efforts (Section 3.9) to develop a holistic solution to the power problem. Finally, we will discuss some commercial power management systems (Section 3.10) and provide a glimpse into some more radical technologies that are emerging (Section 3.11).

Fig. 1. Power densities rising. Figure adapted from Pollack 1999.

2. DEFINING POWER

Power and energy are commonly defined in terms of the work that a system performs. Energy is the total amount of work a system performs over a period of time, while power is the rate at which the system performs that work. In formal terms,

P = W/T, (1)

E = P ∗ T, (2)

where P is power, E is energy, T is a specific time interval, and W is the total work performed in that interval. Energy is measured in joules, while power is measured in watts.

These concepts of work, power, and energy are used differently in different contexts. In the context of computers, work involves activities associated with running programs (e.g., addition, subtraction, memory operations), power is the rate at which the computer consumes electrical energy (or dissipates it in the form of heat) while performing these activities, and energy is the total electrical energy the computer consumes (or dissipates as heat) over time.

This distinction between power and energy is important because techniques that reduce power do not necessarily reduce energy. For example, the power consumed by a computer can be reduced by halving the clock frequency, but if the computer then takes twice as long to run the same programs, the total energy consumed will be similar. Whether one should reduce power or energy depends on the context. In mobile applications, reducing energy is often more important because it increases the battery lifetime. However, for other systems (e.g., servers), temperature is a larger issue. To keep the temperature within acceptable limits, one would need to reduce instantaneous power regardless of the impact on total energy.
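The frequency-halving tradeoff can be checked with a few lines of arithmetic using Equations (1) and (2); the wattages and runtimes below are invented purely for illustration and are not measurements from any real system.

```python
# Equations (1) and (2): P = W/T and E = P * T.
# All numbers are hypothetical, chosen only to illustrate the point.

def energy_joules(power_watts, seconds):
    """E = P * T (Equation (2))."""
    return power_watts * seconds

# A hypothetical machine draws 100 W and finishes a job in 10 s.
e_full = energy_joules(100.0, 10.0)

# Halving the clock frequency roughly halves the power draw,
# but the same job now takes twice as long to finish.
e_half = energy_joules(50.0, 20.0)

print(e_full, e_half)  # both 1000.0 J: power fell, energy did not
```

This is why a server worried about temperature might still prefer the lower-power setting, while a battery-powered device gains nothing in total energy from it.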
Fig. 2. Organization of this survey.
2.1. Dynamic Power Consumption

There are two forms of power consumption, dynamic power consumption and static power consumption. Dynamic power consumption arises from circuit activity such as the changes of inputs in an adder or values in a register. It has two sources, switched capacitance and short-circuit current.

Switched capacitance is the primary source of dynamic power consumption and arises from the charging and discharging of capacitors at the outputs of circuits.

Short-circuit current is a secondary source of dynamic power consumption and accounts for only 10-15% of the total power consumption. It arises because circuits are composed of transistors having opposite polarity, negative or NMOS and positive or PMOS. When these two types of transistors switch current, there is an instant when they are simultaneously on, creating a short circuit. We will not deal further with the power dissipation caused by this short circuit because it is a smaller percentage of total power, and researchers have not found a way to reduce it without sacrificing performance.

As the following equation shows, the more dominant component of dynamic power, switched capacitance (Pdynamic), depends on four parameters, namely supply voltage (V), clock frequency (f), physical capacitance (C), and an activity factor (a) that relates to how many 0 → 1 or 1 → 0 transitions occur in a chip:

Pdynamic ∼ a · C · V^2 · f. (3)

Accordingly, there are four ways to reduce dynamic power consumption, though they each have different tradeoffs and not all of them reduce the total energy consumed. The first way is to reduce the physical capacitance or stored electrical charge of a circuit. The physical capacitance depends on low-level design parameters such as transistor sizes and wire lengths. One can reduce the capacitance by reducing transistor sizes, but this worsens performance.

The second way to lower dynamic power is to reduce the switching activity. As computer chips get packed with increasingly complex functionalities, their switching activity increases [De and Borkar 1999], making it more important to develop techniques that fall into this category. One popular technique, clock gating, gates the clock signal from reaching idle functional units. Because the clock network accounts for a large fraction of a chip's total energy consumption, this is a very effective way of reducing power and energy throughout a processor and is implemented in numerous commercial systems
including the Pentium 4, Pentium M, Intel XScale, and Tensilica Xtensa, to mention but a few.

The third way to reduce dynamic power consumption is to reduce the clock frequency. But as we have just mentioned, this worsens performance and does not always reduce the total energy consumed. One would use this technique only if the target system does not support voltage scaling and if the goal is to reduce the peak or average power dissipation and indirectly reduce the chip's temperature.

The fourth way to reduce dynamic power consumption is to reduce the supply voltage. Because reducing the supply voltage increases gate delays, it also requires reducing the clock frequency to allow the circuit to work properly.

The combination of scaling the supply voltage and clock frequency in tandem is called dynamic voltage scaling (DVS). This technique should ideally reduce dynamic power dissipation cubically because dynamic power is quadratic in voltage and linear in clock frequency. This is the most widely adopted technique. A growing number of processors, including the Pentium M, mobile Pentium 4, AMD's Athlon, and Transmeta's Crusoe and Efficeon processors, allow software to adjust clock frequencies and voltage settings in tandem. However, DVS has limitations and cannot always be applied, and even when it can be applied, it is nontrivial to apply, as we will see in Section 3.5.

2.2. Understanding Leakage Power Consumption

Fig. 3. ITRS trends for leakage power dissipation. Figure adapted from Meng et al., 2005.

In addition to consuming dynamic power, computer components consume static power, also known as idle power or leakage. According to the most recently published industrial roadmaps [ITRSRoadMap], leakage power is rapidly becoming the dominant source of power consumption in circuits (Figure 3) and persists whether a computer is active or idle. Because its causes are different from those of dynamic power, dynamic power reduction techniques do not necessarily reduce the leakage power.

As the equation that follows illustrates, leakage power consumption is the product of the supply voltage (V) and the leakage current (Ileak), or parasitic current, that flows through transistors even when the transistors are turned off:

Pleak = V · Ileak. (4)

To understand how leakage current arises, one must understand how transistors work. A transistor regulates the flow of current between two terminals called the source and the drain. Between these two terminals is an insulator, called the channel, that resists current. As the voltage at a third terminal, the gate, is increased, electrical charge accumulates in the channel, reducing the channel's resistance and creating a path along which electricity can flow. Once the gate voltage is high enough, the channel's polarity changes, allowing the normal flow of current between the source and the drain. The threshold at which the gate's voltage is high enough for the path to open is called the threshold voltage.

According to this model, a transistor is similar to a water dam. It is supposed to allow current to flow when the gate voltage exceeds the threshold voltage but should otherwise prevent current from flowing. However, transistors are imperfect. They leak current even when the gate voltage is below the threshold voltage. In fact, there are six different types of current that leak through a transistor. These
include reverse-biased-junction leakage, gate-induced-drain leakage, subthreshold leakage, gate-oxide leakage, gate-current leakage, and punch-through leakage. Of these six, subthreshold leakage and gate-oxide leakage dominate the total leakage current.

Gate-oxide leakage flows from the gate of a transistor into the substrate. This type of leakage current depends on the thickness of the oxide material that insulates the gate:

Iox = K2 · W · (V/Tox)^2 · e^(−α·Tox/V). (5)

According to this equation, the gate-oxide leakage Iox increases exponentially as the thickness Tox of the gate's oxide material decreases. This is a problem because future chip designs will require the thickness to be reduced along with other scaled parameters such as transistor length and supply voltage. One way of solving this problem would be to insulate the gate using a high-k dielectric material instead of the oxide materials that are currently used. This solution is likely to emerge over the next decade.

Subthreshold leakage current flows between the drain and source of a transistor. It is the dominant source of leakage and depends on a number of parameters that are related through the following equation:

Isub = K1 · W · e^(−Vth/(n·T)) · (1 − e^(−V/T)). (6)

In this equation, W is the gate width and K1 and n are constants. The important parameters are the supply voltage V, the threshold voltage Vth, and the temperature T. The subthreshold leakage current Isub increases exponentially as the threshold voltage Vth decreases. This again raises a problem for future chip designs because, as technology scales, threshold voltages will have to scale along with supply voltages.

The increase in subthreshold leakage current causes another problem. When the leakage current increases, the temperature increases. But as Equation (6) shows, this increases leakage further, causing yet higher temperatures. This vicious cycle is known as thermal runaway. It is the chip designer's worst nightmare.

How can these problems be solved? Equations (4) and (6) indicate four ways to reduce leakage power. The first way is to reduce the supply voltage. As we will see, supply voltage reduction is a very common technique that has been applied to components throughout a system (e.g., processor, buses, cache memories).

The second way to reduce leakage power is to reduce the size of a circuit because the total leakage is proportional to the leakage dissipated in all of a circuit's transistors. One way of doing this is to design a circuit with fewer transistors by omitting redundant hardware and using smaller caches, but this may limit performance and versatility. Another idea is to reduce the effective transistor count dynamically by cutting the power supplies to idle components. Here, too, there are challenges, such as how to predict when different components will be idle and how to minimize the overhead of shutting them on or off. This is also a common approach, of which we will see examples in the sections to follow.

The third way to reduce leakage power is to cool the computer. Several cooling techniques have been developed since the 1960s. Some blow cold air into the circuit, while others refrigerate the processor [Schmidt and Notohardjono 2002], sometimes even by costly means such as circulating cryogenic fluids like liquid nitrogen [Krane et al. 1988]. These techniques have three advantages. First, they significantly reduce subthreshold leakage. In fact, a recent study [Schmidt and Notohardjono 2002] showed that cooling a memory cell by 50 degrees Celsius reduces the leakage energy by five times. Second, these techniques allow a circuit to work faster because electricity encounters less resistance at lower temperatures. Third, cooling eliminates some negative effects of high temperatures, namely the degradation of a chip's reliability and life expectancy. Despite these advantages,
Fig. 4. The transistor stacking effect. The circuits in both (a) and (b) are leaking current between Vdd and Ground. However, Ileak1, the leakage in (a), is less than Ileak2, the leakage in (b). Figure adapted from Butts and Sohi 2000.

Fig. 5. Multiple threshold circuits with sleep transistors.

there are issues to consider such as the costs of the hardware used to cool the circuit. Moreover, cooling techniques are insufficient if they result in wide temperature variations in different parts of a circuit. Rather, one needs to prevent hotspots by distributing heat evenly throughout a chip.

2.2.1. Reducing Threshold Voltage. The fourth way of reducing leakage power is to increase the threshold voltage. As Equation (6) shows, this reduces the subthreshold leakage exponentially. However, it also reduces the circuit's performance, as is apparent in the following equation that relates frequency (f), supply voltage (V), and threshold voltage (Vth), and where α is a constant:

f ∝ (V − Vth)^α / V. (7)

One of the less intuitive ways of increasing the threshold voltage is to exploit what is called the stacking effect (refer to Figure 4). When two or more transistors that are switched off are stacked on top of each other (a), they dissipate less leakage than a single transistor that is turned off (b). This is because each transistor in the stack induces a slight reverse bias between the gate and source of the transistor right below it, and this increases the threshold voltage of the bottom transistor, making it more resistant to leakage. As a result, in Figure 4(a), in which all transistors are in the Off position, transistor B leaks less current than transistor A, and transistor C leaks less current than transistor B. Hence, the total leakage current is attenuated as it flows from Vdd to the ground through transistors A, B, and C. This is not the case in the circuit shown in Figure 4(b), which contains only a single off transistor.

Another way to increase the threshold voltage is to use Multiple Threshold Circuits With Sleep Transistors (MTCMOS) [Calhoun et al. 2003; Won et al. 2003]. This involves isolating a leaky circuit element by connecting it to a pair of virtual power supplies that are linked to its actual power supplies through sleep transistors (Figure 5). When the circuit is active, the sleep transistors are activated, connecting the circuit to its power supplies. But when the circuit is inactive, the sleep transistors are deactivated, thus disconnecting the circuit from its power supplies. In this inactive state, almost no leakage passes through the circuit because the sleep transistors have high threshold voltages. (Recall that subthreshold leakage drops exponentially with a rise in threshold voltage, according to Equation (6).)
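Equation (6) implies that subthreshold leakage falls exponentially as the threshold voltage rises, which is why high-threshold sleep transistors block leakage so effectively. A short numeric sketch makes this concrete; the constants (K1, n, W) and the thermal-voltage-like value used for T below are arbitrary illustrative choices, not values from any real process.

```python
import math

def i_sub(w, v, vth, t, k1=1.0, n=1.5):
    """Subthreshold leakage per Equation (6):
    Isub = K1 * W * e^(-Vth/(n*T)) * (1 - e^(-V/T)).
    All constants here are hypothetical."""
    return k1 * w * math.exp(-vth / (n * t)) * (1 - math.exp(-v / t))

# Treat T as a thermal-voltage-like scale (~26 mV near room temperature).
T = 0.026
leak_high_vth = i_sub(w=1.0, v=1.0, vth=0.35, t=T)
leak_low_vth = i_sub(w=1.0, v=1.0, vth=0.30, t=T)

# Lowering Vth by just 50 mV multiplies leakage by e^(0.05/(n*T)),
# roughly 3.6x here; raising Vth shrinks it by the same factor.
print(leak_low_vth / leak_high_vth)
```

The exponential dependence also hints at thermal runaway: a hotter chip (larger T in the denominator of the first exponential) leaks more, and the extra leakage heats the chip further.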
Fig. 6. Adaptive body biasing.
This technique effectively confines the leakage to one part of the circuit, but is tricky to implement for several reasons. The sleep transistors must be sized properly to minimize the overhead of activating them. They cannot be turned on and off too frequently. Moreover, this technique does not readily apply to memories because memories lose data when their power supplies are cut.

A third way to increase the threshold is to employ dual threshold circuits. Dual threshold circuits [Liu et al. 2004; Wei et al. 1998; Ho and Hwang 2004] reduce leakage by using high threshold (low leakage) transistors on noncritical paths and low threshold transistors on critical paths, the idea being that noncritical paths can execute instructions more slowly without impairing performance. This is a difficult technique to implement because it requires choosing the right combination of transistors for high threshold voltages. If too many transistors are assigned high threshold voltages, the noncritical paths in the circuit can slow down too much.

A fourth way to increase the threshold voltage is to apply a technique known as adaptive body biasing [Seta et al. 1995; Kobayashi and Sakurai 1994; Kim and Roy 2002]. Adaptive body biasing is a runtime technique that reduces leakage power by dynamically adjusting the threshold voltages of circuits depending on whether the circuits are active. When a circuit is not active, the technique increases its threshold voltage, thus saving leakage power exponentially, although at the expense of a delay in circuit operation. When the circuit is active, the technique decreases the threshold voltage to avoid slowing it down.

To adjust the threshold voltage, adaptive body biasing applies a voltage to the transistor's body known as a body bias voltage (Figure 6). This voltage changes the polarity of a transistor's channel, decreasing or increasing its resistance to current flow. When the body bias voltage is chosen to fill the transistor's channel with positive ions (b), the threshold voltage increases and reduces leakage currents. However, when the voltage is chosen to fill the channel with negative ions, the threshold voltage decreases, allowing higher performance, though at the cost of more leakage.

3. REDUCING POWER

3.1. Circuit And Logic Level Techniques

3.1.1. Transistor Sizing. Transistor sizing [Penzes et al. 2002; Ebergen et al. 2004] reduces the width of transistors to reduce their dynamic power consumption, using low-level models that relate the power consumption to the width. According to these models, reducing the width also increases the transistor's delay, and thus the transistors that lie away from the critical paths of a circuit are usually the best candidates for this technique. Algorithms for applying this technique usually associate with each transistor a tolerable delay, which varies depending on how close that transistor is to the critical path. These algorithms then try to scale each transistor to be as small as possible without violating its tolerable delay.

3.1.2. Transistor Reordering. The arrangement of transistors in a circuit affects energy consumption. Figure 7
Fig. 7. Transistor Reordering. Figure adapted from Hossain et al. 1996.

shows two possible implementations of the same circuit that differ only in their placement of the transistors marked A and B. Suppose that the input to transistor A is 1, the input to transistor B is 1, and the input to transistor C is 0. Then transistors A and B will be on, allowing current from Vdd to flow through them and charge the capacitors C1 and C2. Now suppose that the inputs change and that A's input becomes 0, and C's input becomes 1. Then A will be off while B and C will be on. Now the implementations in (a) and (b) will differ in the amounts of switching activity. In (a), current from ground will flow past transistors B and C, discharging both the capacitors C1 and C2. However, in (b), the current from ground will only flow past transistor C; it will not get past transistor A since A is turned off. Thus it will only discharge the capacitor C2, rather than both C1 and C2 as in part (a). Thus the implementation in (b) will consume less power than that in (a).

Transistor reordering [Kursun et al. 2004; Sultania et al. 2004] rearranges transistors to minimize their switching activity. One of its guiding principles is to place transistors closer to the circuit's outputs if they switch frequently, in order to prevent a domino effect where the switching activity from one transistor trickles into many other transistors, causing widespread power dissipation. This requires profiling techniques to determine how frequently different transistors are likely to switch.

3.1.3. Half Frequency and Half Swing Clocks. Half-frequency and half-swing clocks reduce frequency and voltage, respectively. Traditionally, hardware events such as register file writes occur on a rising clock edge. Half-frequency clocks synchronize events using both edges, and they tick at half the speed of regular clocks, thus cutting clock switching power in half. Reduced-swing clocks also often use a lower voltage signal and thus reduce power quadratically.

3.1.4. Logic Gate Restructuring. There are many ways to build a circuit out of logic gates. One decision that affects power consumption is how to arrange the gates and their input signals.

For example, consider two implementations of a four-input AND gate (Figure 8), a chain implementation (a), and a tree implementation (b). Knowing the signal probabilities (1 or 0) at each of the primary inputs (A, B, C, D), one can easily calculate the transition probabilities (0→1) for each output (W, X, F, Y, Z). If each input has an equal probability of being a 1 or a 0, then the calculation shows that the chain implementation (a) is likely to switch less than the tree implementation (b). This is because each gate in a chain has a lower probability of having a 0→1 transition than its predecessor; its transition probability depends on those of all its predecessors. In the tree implementation, on the other hand, some gates may share a parent (in the tree topology) instead of being directly connected together. These gates could have the same transition probabilities.

Nevertheless, chain implementations do not necessarily save more energy than tree implementations. There are other issues to consider when choosing a topology. One is the issue of glitches, or spurious transitions that occur when a gate does not receive all of its inputs at the same time. These glitches are more common in chain implementations, where signals can travel along different paths having widely varying delays. One solution to reduce glitches is to change the topology so that the different paths in the circuit have similar
Fig. 8. Gate restructuring. Figure adapted from the Pennsylvania State
University Microsystems Design Laboratory’s tutorial on low power design.
delays. This solution, known as path balancing, often transforms chain implementations into tree implementations. Another solution, called retiming, involves inserting flip-flops or registers to slow down and thereby synchronize the signals that pass along different paths but reconverge to the same gate. Because flip-flops and registers are in sync with the processor clock, they sample their inputs less frequently than logic gates and are thus more immune to glitches.

3.1.5. Technology Mapping. Because of the huge number of possibilities and tradeoffs at the gate level, designers rely on tools to determine the most energy-optimal way of arranging gates and signals. Technology mapping [Chen et al. 2004; Li et al. 2004; Rutenbar et al. 2001] is the automated process of constructing a gate-level representation of a circuit subject to constraints such as area, delay, and power. Technology mapping for power relies on gate-level power models and a library that describes the available gates and their design constraints. Before a circuit can be described in terms of gates, it is initially represented at the logic level. The problem is to design the circuit out of logic gates in a way that will minimize the total power consumption under delay and cost constraints. This is an NP-hard Directed Acyclic Graph (DAG) covering problem, and a common heuristic to solve it is to break the DAG representation of a circuit into a set of trees and find the optimal mapping for each subtree using standard tree-covering algorithms.

3.1.6. Low Power Flip-Flops. Flip-flops are the building blocks of small memories such as register files. A typical master-slave flip-flop consists of two latches, a master latch and a slave latch. The inputs of the master latch are the clock signal and data. The inputs of the slave latch are the inverse of the clock signal and the data output by the master latch. Thus when the clock signal is high, the master latch is turned on, and the slave latch is turned off. In this phase, the master samples whatever inputs it receives and outputs them. The slave, however, does not sample its inputs but merely outputs whatever it has most recently stored. On the falling edge of the clock, the master turns off and the slave turns on. Thus the master saves its most recent input and stops sampling any further inputs. The slave samples the new inputs it receives from the master and outputs them.

Besides this master-slave design are other common designs such as the pulse-triggered flip-flop and sense-amplifier flip-flop. All these designs share some common sources of power consumption, namely power dissipated from the clock signal, power dissipated in internal switching activity (caused by the clock signal and by changes in data), and power dissipated when the outputs change.

Researchers have proposed several alternative low power designs for flip-flops. Most of these approaches reduce the switching activity or the power dissipated by the clock signal. One alternative is the self-gating flip-flop. This design inhibits the clock signal to the flip-flop when the inputs will produce no change in the outputs. Strollo et al. [2000] have proposed two versions of this design. In the double-gated flip-flop, the master and slave latches each have their own clock-gating circuit.
The circuit consists of a comparator that checks whether the current and previous inputs are different and some logic that inhibits the clock signal based on the output of the comparator. In the sequential-gated flip-flop, the master and the slave share the same clock-gating circuit instead of having their own individual circuits as in the double-gated flip-flop.

Another low-power flip-flop is the conditional capture flip-flop [Kong et al. 2001; Nedovic et al. 2001]. This flip-flop detects when its inputs produce no change in the output and stops these redundant inputs from causing any spurious internal current switches. There are many variations on this idea, such as the conditional discharge and conditional precharge flip-flops [Zhao et al. 2004], which eliminate unnecessary precharging and discharging of internal elements.

A third type of low-power flip-flop is the dual edge triggered flip-flop [Llopis and Sachdev 1996]. These flip-flops are sensitive to both the rising and falling edges of the clock and can do the same work as a regular flip-flop at half the clock frequency. By cutting the clock frequency in half, these flip-flops save total power consumption but may not save total energy, although they could save energy by reducing the voltage along with the frequency. Another drawback is that these flip-flops require more area than regular flip-flops. This increase in area may cause them to consume more power on the whole.

3.1.7. Low-Power Control Logic Design. One can view the control logic of a processor as a Finite State Machine (FSM). It specifies the possible processor states and the conditions for switching between the states, and generates the signals that activate the appropriate circuitry for each state. For example, when an instruction is in the execution stage of the pipeline, the control logic may generate signals to activate the correct execution unit.

Although control logic optimizations have traditionally targeted performance, they are now also targeting power. One way of reducing power is to encode the FSM states in a way that minimizes the switching activity throughout the processor. Another common approach involves decomposing the FSM into sub-FSMs and activating only the circuitry needed for the currently executing sub-FSM. Some researchers [Gao and Hayes 2003] have developed tools that automate this process of decomposition, subject to some quality and correctness constraints.

3.1.8. Delay-Based Dynamic-Supply Voltage Adjustment. Commercial processors that can run at multiple clock speeds normally use a lookup table to decide what supply voltage to select for a given clock speed. This table is built ahead of time through a worst-case analysis. Because worst-case analyses may not reflect reality, ARM Inc. has been developing a more efficient runtime solution known as the Razor pipeline [Ernst et al. 2003].

Instead of using a lookup table, Razor adjusts the supply voltage based on the delay in the circuit. The idea is that whenever the voltage is reduced, the circuit slows down, causing timing errors if the clock frequency chosen for that voltage is too high. Because of these errors, some instructions produce inconsistent results or fail altogether. The Razor pipeline periodically monitors how many such errors occur. If the number of errors exceeds a threshold, it increases the supply voltage. If the number of errors is below a threshold, it scales the voltage further to save more energy.

This solution requires extra hardware in order to detect and correct circuit errors. To detect errors, it augments the flip-flops in delay-critical regions of the circuit with shadow latches. These latches receive the same inputs as the flip-flops but are clocked more slowly to adapt to the reduced supply voltage. If the output of a flip-flop differs from that of its shadow latch, this signifies an error. In the event of such an error, the circuit propagates the output of the shadow latch instead of the flip-flop, delaying the pipeline for a cycle if necessary.
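The error-driven feedback loop described above can be caricatured in a few lines. This is a minimal sketch with assumed thresholds, step sizes, and voltage bounds, not ARM's actual controller:

```python
# Sketch of a Razor-style control loop: raise the supply voltage when the
# observed timing-error rate exceeds a threshold, otherwise lower it to
# save energy. All numeric parameters here are illustrative assumptions.

def adjust_voltage(v, errors, interval_ops,
                   err_threshold=0.01, step=0.05, v_min=0.8, v_max=1.4):
    """Return the next supply voltage (volts), given the number of
    shadow-latch mismatches observed over interval_ops operations."""
    rate = errors / interval_ops
    if rate > err_threshold:
        v = min(v + step, v_max)   # too many errors: back off upward
    else:
        v = max(v - step, v_min)   # few errors: scale down further
    return round(v, 2)

v = 1.2
v = adjust_voltage(v, errors=50, interval_ops=1000)   # 5% errors -> 1.25
v = adjust_voltage(v, errors=2, interval_ops=1000)    # 0.2% errors -> 1.2
```

The interesting design point, mirrored in the prose above, is that a nonzero error rate is the normal operating condition: the controller deliberately undershoots the worst-case voltage and relies on the shadow latches to catch the occasional failure.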
3.2. Low-Power Interconnect

Interconnect heavily affects power consumption since it is the medium of most electrical activity. Efforts to improve chip performance are resulting in smaller chips with more transistors and more densely packed wires carrying larger currents. The wires in a chip often use materials with poor thermal conductivity [Banerjee and Mehrotra 2001].

3.2.1. Bus Encoding And Crosstalk. A popular way of reducing the power consumed in buses is to reduce the switching activity through intelligent bus encoding schemes, such as bus-inversion [Stan and Burleson 1995]. Buses consist of wires that transmit bits (logical 1 or 0). For every data transmission on a bus, the number of wires that switch depends on the current and previous values transmitted. If the Hamming distance between these values is more than half the number of wires, then most of the wires on the bus will switch current. To prevent this from happening, bus-inversion transmits the inverse of the intended value and asserts a control signal alerting recipients of the inversion. For example, if the current binary value to transmit is 110 and the previous was 000, the bus instead transmits 001, the inverse of 110.

Bus-inversion ensures that at most half of the bus wires switch during a bus transaction. However, because of the cost of the logic required to invert the bus lines, this technique is mainly used in external buses rather than the internal chip interconnect.

Moreover, some researchers have argued that this technique makes the simplistic assumption that the amount of power that an interconnect wire consumes depends only on the number of bit transitions on that wire. As chips become smaller, there arise additional sources of power consumption. One of these sources is crosstalk [Sylvester and Keutzer 1998]. Crosstalk is spurious activity on a wire that is caused by activity in neighboring wires. As well as increasing delays and impairing circuit integrity, crosstalk can increase power consumption.

One way of reducing crosstalk [Taylor et al. 2001] is to insert a shield wire (Figure 9) between adjacent bus wires. Since the shield remains deasserted, no adjacent wires switch in opposite directions. This solution doubles the number of wires. However, the principle behind it has sparked efforts to develop self-shielding codes [Victor and Keutzer 2001; Patel and Markov 2003] resistant to crosstalk. As in traditional bus encoding, a value is encoded and then transmitted. However, the code chosen avoids opposing transitions on adjacent bus wires.

Fig. 9. Crosstalk prevention through shielding.

3.2.2. Low Swing Buses. Bus-encoding schemes attempt to reduce transitions on the bus. Alternatively, a bus can transmit the same information but at a lower voltage. This is the principle behind low swing buses [Zhang and Rabaey 1998]. Traditionally, logical one is represented by 5 volts and logical zero is represented by −5 volts. In a low-swing system (Figure 10), logical one and zero are encoded using lower voltages, such as +300mV and −300mV. Typically, these systems are implemented with differential signaling. An input signal is split into two signals of opposite polarity bounded by a smaller voltage range. The receiver sees the difference between the two transmitted signals as the actual signal and amplifies it back to normal voltage.

Low Swing Differential Signaling has several advantages in addition to reduced power consumption. It is immune to
crosstalk and electromagnetic radiation effects. Since the two transmitted signals are close together, any spurious activity will affect both equally without affecting the difference between them. However, when implementing this technique, one needs to consider the costs of increased hardware at the encoder and decoder.

Fig. 10. Low voltage differential signaling.

3.2.3. Bus Segmentation. Another effective technique is bus segmentation. In a traditional shared bus architecture, the entire bus is charged and discharged upon every access. Segmentation (Figure 11) splits a bus into multiple segments connected by links that regulate the traffic between adjacent segments. Links connecting paths essential to a communication are activated independently, allowing most of the bus to remain powered down. Ideally, devices communicating frequently should be in the same or nearby segments to avoid powering many links. Jone et al. [2003] have examined how to partition a bus to ensure this property. Their algorithm begins with an undirected graph whose nodes are devices, edges connect communicating devices, and edge weights display communication frequency. Applying the Gomory and Hu algorithm [1961], they find a spanning tree in which the product of communication frequency and number of edges between any two devices is minimal.

Fig. 11. Bus segmentation.

3.2.4. Adiabatic Buses. The preceding techniques work by reducing activity or voltage. In contrast, adiabatic circuits [Lyuboslavsky et al. 2000] reduce total capacitance. These circuits reuse existing electrical charge to avoid creating new charge. In a traditional bus, when a wire becomes deasserted, its previous charge is wasted. A charge-recovery bus recycles the charge for wires about to be asserted.

Figure 12 illustrates a simple design for a two-bit adiabatic bus [Bishop and Irwin 1999]. A comparator in each bitline tests whether the current bit is equal to the previously sent bit. Inequality signifies a transition and connects the bitline to a shorting wire used for sharing charge across bitlines. In the sample scenario, Bitline1 has a falling transition while Bitline0 has a rising transition. As Bitline1 falls, its positive charge transfers to Bitline0, allowing Bitline0 to rise. The power saved depends on transition patterns. No energy is saved when all lines rise. The most energy is saved when an equal number of lines rise and fall simultaneously. The biggest drawback of adiabatic circuits is a delay for transferring shared charge.

3.2.5. Network-On-Chip. All of the above techniques assume an architecture in
which multiple functional units share one or more buses. This shared-bus paradigm has several drawbacks with respect to power and performance. The inherent bus bandwidth limits the speed and volume of data transfers and does not suit the varying requirements of different execution units. Only one unit can access the bus at a time, though several may be requesting bus access simultaneously. Bus arbitration mechanisms are energy intensive. Unlike simple data transfers, every bus transaction involves multiple clock cycles of handshaking protocols that increase power and delay.

As a consequence, researchers have been investigating the use of generic interconnection networks to replace buses. Compared to buses, networks offer higher bandwidth and support concurrent connections. This design makes it easier to troubleshoot energy-intensive areas of the network. Many networks have sophisticated mechanisms for adapting to varying traffic patterns and quality-of-service requirements. Several authors [Zhang et al. 2001; Dally and Towles 2001; Sgroi et al. 2001] propose the application of these techniques at the chip level. These claims have yet to be verified, and interconnect power reduction remains a promising area for future research.

Wang et al. [2003] have developed hardware-level optimizations for different sources of energy consumption in an on-chip network. One of the alternative network topologies they propose is the segmented crossbar topology, which aims to reduce the energy consumption of data transfers. When data is transferred over a regular network grid, all rows and columns corresponding to the intersection points end up switching current even though only parts of these rows and columns are actually traversed. To eliminate this widespread current switching, the segmented crossbar topology divides the rows and columns into segments demarcated by tristate buffers. The different segments are selectively activated as the data traverses through them, hence confining current switches to only those segments along which the data actually passes.

Fig. 12. Two bit charge recovery bus.

3.3. Low-Power Memories and Memory Hierarchies

3.3.1. Types of Memory. One can classify memory structures into two categories, Random Access Memories (RAM) and Read-Only Memories (ROM). There are two kinds of RAMs, static RAMs (SRAM) and dynamic RAMs (DRAM), which differ in how they store data. SRAMs store data using flip-flops and DRAMs store each bit of data as a charge on a capacitor; thus DRAMs need to refresh their data periodically. SRAMs allow faster accesses than DRAMs but require more area and are more expensive. As a result, normally only register files, caches, and high bandwidth parts of the system are made up of SRAM cells, while main memory is made up of DRAM cells. Although DRAM cells have slower access times than SRAMs, they contain fewer transistors and are less expensive than SRAM cells.

The techniques we will introduce are not confined to any specific type of RAM or ROM. Rather they are high-level architectural principles that apply across the spectrum of memories to the extent that the required technology is available. They attempt to reduce the energy dissipation of memory accesses in two ways, either by reducing the energy dissipated in a memory access, or by reducing the number of memory accesses.

3.3.2. Splitting Memories Into Smaller Subsystems. An effective way to reduce the energy that is dissipated in a memory
access is to activate only the needed memory circuits in each access. One way to expose these circuits is to partition memories into smaller, independently accessible components. This can be done at different granularities.

At the lowest granularity, one can design a main memory system that is split up into multiple banks, each of which can independently transition into different power modes. Compilers and operating systems can cluster data into a minimal set of banks, allowing the other unused banks to power down and thereby save energy [Luz et al. 2002; Lebeck et al. 2000].

At a higher granularity, one can partition the separate banks of a partitioned memory into subbanks and activate only the relevant subbank in every memory access. This is a technique that has been applied to caches [Ghose and Kamble 1999]. In a normal cache access, the set-selection step involves transferring all blocks in all banks onto tag-comparison latches. Since the requested word is at a known offset in these blocks, energy can be saved by transferring only words at that offset. Cache subbanking splits the block array of each cache line into multiple banks. During set-selection, only the relevant bank from each line is active. Since a single subbank is active at a time, all subbanks of a line share the same output latches and sense amplifiers to reduce hardware complexity. Subbanking saves energy without increasing memory access time. Moreover, it is independent of program locality patterns.

3.3.3. Augmenting the Memory Hierarchy With Specialized Cache Structures. The second way to reduce energy dissipation in the memory hierarchy is to reduce the number of memory hierarchy accesses. One group of researchers [Kin et al. 1997] developed a simple but effective technique to do this. Their idea was to integrate a specialized cache into the typical memory hierarchy of a modern processor, a hierarchy that may already contain one or more levels of caches. The new cache they were proposing would sit between the processor and first level cache. It would be smaller than the first level cache and hence dissipate less energy. Yet it would be large enough to store an application's working set and filter out many memory references; only when a data request misses this cache would it require searching higher levels of cache, and the penalty for a miss would be offset by redirecting more memory references to this smaller, more energy efficient cache. This was the principle behind the filter cache. Though deceptively simple, this idea lies at the heart of many of the low-power cache designs used today, ranging from simple extensions of the cache hierarchy (e.g., block buffering) to the scratch pad memories used in embedded systems, and the complex trace caches used in high-end generation processors.

One simple extension is the technique of block buffering. This technique augments caches with small buffers to store the most recently accessed cache set. If a data item that has been requested belongs to a cache set that is already in the buffer, then one does not have to search for it in the rest of the cache.

In embedded systems, scratch pad memories [Panda et al. 2000; Banakar et al. 2002; Kandemir et al. 2002] have been the extension of choice. These are software controlled memories that reside on-chip. Unlike caches, the data they should contain is determined ahead of time and remains fixed while programs execute. These memories are less volatile than conventional caches and can be accessed within a single cycle. They are ideal for specialized embedded systems whose applications are often hand tuned.

General purpose processors of the latest generation sometimes rely on more complex caching techniques such as the trace cache. Trace caches were originally developed for performance but have been studied recently for their power benefits. Instead of storing instructions in their compiled order, a trace cache stores traces of instructions in their executed order. If an instruction sequence is already in the trace cache, then it need not be fetched from the instruction cache
but can be decoded directly from the trace cache. This saves power by reducing the number of instruction cache accesses.

In the original trace cache design, both the trace cache and the instruction cache are accessed in parallel on every clock cycle. Thus instructions are fetched from the instruction cache while the trace cache is searched for the next trace that is predicted to execute. If the trace is in the trace cache, then the instructions fetched from the instruction cache are discarded. Otherwise the program execution proceeds out of the instruction cache, and simultaneously new traces are created from the instructions that are fetched.

Low-power designs for trace caches typically aim to reduce the number of instruction cache accesses. One alternative, the dynamic direction prediction-based trace cache [Hu et al. 2003], uses branch prediction to decide where to fetch instructions from. If the branch predictor predicts the next trace with high confidence and that trace is in the trace cache, then the instructions are fetched from the trace cache rather than the instruction cache.

The selective trace cache [Hu et al. 2002] extends this technique with compiler support. The idea is to identify frequently executed, or hot, traces and bring these traces into the trace cache. The compiler inserts hints after every branch instruction, indicating whether it belongs to a hot trace. At runtime, these hints cause instructions to be fetched from either the trace cache or the instruction cache.

3.4. Low-Power Processor Architecture Adaptations

So far we have described the energy saving features of hardware as though they were a fixed foundation upon which programs execute. However, programs exhibit wide variations in behavior. Researchers have been developing hardware structures whose parameters can be adjusted on demand so that one can save energy by activating just the minimum hardware resources needed for the code that is executing.

3.4.1. Adaptive Caches. There is a wealth of literature on adaptive caches, caches whose storage elements (lines, blocks, or sets) can be selectively activated based on the application workload. One example of such a cache is the Dynamically Resizable Instruction (DRI) cache [Powell et al. 2001]. This cache permits one to deactivate its individual sets on demand by gating their supply voltages. To decide what sets to activate at any given time, the cache uses a hardware profiler that monitors the application's cache-miss patterns. Whenever the cache misses exceed a threshold, the DRI cache activates previously deactivated sets. Likewise, whenever the miss rate falls below a threshold, the DRI deactivates some of these sets by inhibiting their supply voltages.

A problem with this approach is that dormant memory cells lose data and need more time to be reactivated for their next use. Thus an alternative to inhibiting their supply voltages is to reduce their voltages as low as possible without losing data. This is the aim of the drowsy cache, a cache whose lines can be placed in a drowsy mode where they dissipate minimal power but retain data and can be reactivated faster.

Fig. 13. Drowsy cache line.

Figure 13 illustrates a typical drowsy cache line [Kim et al. 2002]. The wakeup signal chooses between voltages for the active, drowsy, and off states. It also regulates the word decoder's output to prevent data in a drowsy line from being accessed. Before a cache access, the
line circuitry must be precharged. In a traditional cache, the precharger is continuously on. To save power, the drowsy circuitry regulates the precharge signal with the wakeup signal. The precharger becomes active only when the line is woken up.

Many heuristics have been developed for controlling these drowsy circuits. The simplest policy for controlling the drowsy cache is cache decay [Kaxiras et al. 2001], which simply turns off unused cache lines after a fixed interval. More sophisticated policies [Zhang et al. 2002; 2003; Hu et al. 2003] consider information about runtime program behavior. For example, Hu et al. [2003] have proposed hardware that detects when an execution moves into a hotspot and activates only those cache lines that lie within that hotspot. The main idea of their approach is to count how often different branches are taken. When a branch is taken sufficiently many times, its target address is a hotspot, and the hardware sets special bits to indicate that cache lines within that hotspot should not be shut down. Periodically, a runtime routine disables cache lines that do not have these bits set and decays the profiling data. Whenever it reactivates a cache line before its next use, it also reactivates the line that corresponds to the next subsequent memory access.

There are many other heuristics for controlling cache lines. Dead-block elimination [Kabadi et al. 2002] powers down cache lines containing basic blocks that have reached their final use. It is a compiler-directed technique that requires a static control flow analysis to identify these blocks. Consider the control flow graph in Figure 14. Since block B1 executes only once, its cache lines can power down once control reaches B2. Similarly, the lines containing B2 and B3 can power down once control reaches B4.

Fig. 14. Dead block elimination.

3.4.2. Adaptive Instruction Queues. Buyuktosunoglu et al. [2001] developed one of the first adaptive instruction issue queues, a 32-entry queue consisting of four equal size partitions that can be independently activated. Each partition consists of wakeup logic that decides when instructions are ready to execute and readout logic that dispatches ready instructions into the pipeline. At any time, only the partitions that contain the currently executing instructions are activated.

There have been many heuristics developed for configuring these queues. Some require measuring the rate at which instructions are executed per processor clock cycle, or IPC. For example, Buyuktosunoglu et al.'s heuristic [2001] periodically measures the IPC over fixed length intervals. If the IPC of the current interval is a factor smaller than that of the previous interval (indicating worse performance), then the heuristic increases the instruction queue size to increase the throughput. Otherwise it may decrease the queue size.

Bahar and Manne [2001] propose a similar heuristic targeting a simulated 8-wide issue processor that can be reconfigured as a 6-wide issue or 4-wide issue. They also measure IPC over fixed intervals but measure the floating point IPC separately from the integer IPC, since the former is more critical for floating point applications. Their heuristic, like Buyuktosunoglu et al.'s [2001], compares the measured IPC with specific thresholds to determine when to downsize the issue queue.

Folegnani and Gonzalez [2001], on the other hand, use a different heuristic. They find that when a program has little parallelism, the youngest part of the issue queue (which contains the most recent instructions) contributes very little to the overall IPC, in that instructions in this part are often committed late. Thus their heuristic monitors the contribution of the youngest part of this queue and deactivates this part when its contribution is minimal.

3.4.3. Algorithms for Reconfiguring Multiple Structures. In addition to this work,
researchers have also been tackling the problem of reconfiguring multiple hardware units simultaneously.

One group [Hughes et al. 2001; Sasanka et al. 2002; Hughes and Adve 2004] has been developing heuristics for combining hardware adaptations with dynamic voltage scaling during multimedia applications. Their strategy is to run two algorithms while individual video frames are being processed. A global algorithm chooses the initial DVS setting and baseline hardware configuration for each frame, while a local algorithm tunes various hardware parameters (e.g., instruction window sizes) while the frame is executing to use up any remaining slack periods.

Another group [Iyer and Marculescu 2001, 2002a] has developed heuristics that adjust the pipeline width and register update unit (RUU) size for different hotspots or frequently executed program regions. To detect these hotspots, their technique counts the number of times each branch instruction is taken. When a branch is taken enough times, it is marked as a candidate branch.

To detect how often candidate branches are taken, the heuristic uses a hotspot detection counter. Initially the hotspot counter contains a positive integer. Each time a candidate branch is taken, the counter is decremented, and each time a noncandidate branch is taken, the counter is incremented. If the counter becomes zero, this means that the program execution is inside a hotspot.

Once the program execution is in a hotspot, this heuristic tests the energy consumption of different processor configurations at fixed length intervals. It finally chooses the most energy saving configuration as the one to use the next time the same hotspot is entered. To measure energy at runtime, the heuristic relies on hardware that monitors usage statistics of the most power hungry units and calculates the total power based on energy per access values.

Huang et al. [2000; 2003] apply hardware adaptations at the granularity of subroutines. To decide what adaptations to apply, they use offline profiling to plot an energy-delay tradeoff curve. Each point in the curve represents the energy and delay tradeoff of applying an adaptation to a subroutine. There are three regions in the curve. The first includes adaptations that save energy without impairing performance; these are the adaptations that will always be applied. The third represents adaptations that worsen both performance and energy; these are the adaptations that will never be applied. Between these two extremes are adaptations that trade performance for energy savings; some of these may be applied depending on the tolerable performance loss. The main idea is to trace the curve from the origin and apply every adaptation that one encounters until the cumulative performance loss from applying all the encountered adaptations reaches the maximum tolerable performance loss.

Ponomarev et al. [2001] develop heuristics for controlling a configurable issue queue, a configurable reorder buffer, and a configurable load/store queue. Unlike the IPC based heuristics we have seen, this heuristic is occupancy-based, because the occupancy of hardware structures indicates how congested these structures are, and congestion is a leading cause of performance loss. Whenever a hardware structure (e.g., instruction queue) has low occupancy, the heuristic reduces its size. In contrast, if the structure is completely full for a long time, the heuristic increases its size to avoid congestion.

Another group of researchers [Albonesi et al. 2003; Dropsho et al. 2002] explore the complex problem of configuring a processor whose adaptable components include two levels of caches, integer and floating point issue queues, load/store queues, register files, as well as a reorder buffer. To dodge the sheer explosion of possible configurations, they tune each component based only on its local usage statistics. They propose two heuristics for doing this, one for tuning caches and another for tuning buffers and register files.

The caches they consider are selective way caches; each of their multiple ways can independently be activated or
disabled. Each way has a most recently used (MRU) counter that is incremented every time a cache search hits the way. The cache tuning heuristic samples this counter at fixed intervals to determine the number of hits in each way and the number of overall misses. Using these access statistics, the heuristic computes the energy and performance overheads for all possible cache configurations and chooses the best configuration dynamically.

The heuristic for controlling other structures such as issue queues is occupancy based. It measures how often different components of these structures fill up with data and uses this information to decide whether to upsize or downsize the structure.

3.5. Dynamic Voltage Scaling

Dynamic voltage scaling (DVS) addresses the problem of how to modulate a processor's clock frequency and supply voltage in lockstep as programs execute. The premise is that a processor's workloads vary and that when the processor has less work, it can be slowed down without affecting performance adversely. For example, if the system has only one task, and it has a workload that requires 10 cycles to execute and a deadline of 100 seconds, the processor can slow down to 1/10 cycles/sec, saving power and meeting the deadline right on time. This is presumably more efficient than running the task at full speed and idling for the remainder of the period. Though it appears straightforward, this naive picture of how DVS works is highly simplified and hides serious real-world complexities.

3.5.1. Unpredictable Nature Of Workloads. The first problem is how to predict workloads with reasonable accuracy. This requires knowing what tasks will execute at any given time and the work required for these tasks. Two issues complicate this problem. First, tasks can be preempted at arbitrary times due to user and I/O device requests. Second, it is not always possible to accurately predict the future runtime of an arbitrary algorithm (cf. the Halting Problem [Turing 1937]). There have been ongoing efforts [Nilsen and Rygg 1995; Lim et al. 1995; Diniz 2003; Ishihara and Yasuura 1998] in the real-time systems community to accurately estimate the worst case execution times of programs. These attempts use either measurement-based prediction, static analysis, or a mixture of both. Measurement involves a learning lag whereby the execution times of various tasks are sampled and future execution times are extrapolated. The problem is that, when workloads are irregular, future execution times may not resemble past execution times. Microarchitectural innovations such as pipelining, hyperthreading, and out-of-order execution make it difficult to predict execution times in real systems statically. Among other things, a compiler needs to know how instructions are interleaved in the pipeline, what the probabilities are that different branches are taken, and when cache misses and pipeline hazards are likely to occur. This requires modeling the pipeline and memory hierarchy. A number of researchers [Ferdinand 1997; Clements 1996] have attempted to integrate complete pipeline and cache models into compilers. It remains nontrivial to develop models that can be used efficiently in a compiler and that capture all the complexities inherent in current systems.

3.5.2. Indeterminism And Anomalies In Real Systems. Even if the DVS algorithm predicts correctly what the processor's workloads will be, determining how fast to run the processor is nontrivial. The nondeterminism in real systems removes any strict relationships between clock frequency, execution time, and power consumption. Thus most theoretical studies on voltage scaling are based on assumptions that may sound reasonable but are not guaranteed to hold in real systems.

One misconception is that total microprocessor system power is quadratic in supply voltage. Under the CMOS transistor model, the power dissipation of individual transistors is quadratic in their supply voltages, but there remains no
ACM Computing Surveys, Vol. 37, No. 3, September 2005.
precise way of estimating the power dissipation of an entire system. One can construct scenarios where total system power is not quadratic in supply voltage. Martin and Siewiorek [2001] found that the StrongARM SA-1100 has two supply voltages, a 1.5V supply that powers the CPU and a 3.3V supply that powers the I/O pins. The total power dissipation is dominated by the larger supply voltage; hence, even if the smaller supply voltage were allowed to vary, it would not reduce power quadratically. Fan et al. [2003] found that, in some architectures where active memory banks dominate total system power consumption, reducing the supply voltage dramatically reduces CPU power dissipation but has a minuscule effect on total power. Thus, for full benefits, dynamic voltage scaling may need to be combined with resource hibernation to power down other energy hungry parts of the chip.

Another misconception is that it is most power efficient to run a task at the slowest constant speed that allows it to exactly meet its deadline. Several theoretical studies [Ishihara and Yasuura 1998] attempt to prove this claim. These proofs rest on idealistic assumptions such as “power is a convex linearly increasing function of frequency”, assumptions that ignore how DVS affects the system as a whole. When the processor slows down, peripheral devices may remain activated longer, consuming more power. An important study of this issue was done by Fan et al. [2003] who show that, for specific DRAM architectures, the energy versus slowdown curve is “U” shaped. As the processor slows down, CPU energy decreases but the cumulative energy consumed in active memory banks increases. Thus the optimal speed is actually higher than the processor's lowest speed; any speed lower than this causes memory energy dissipation to overshadow the effects of DVS.

A third misconception is that execution time is inversely proportional to clock frequency. It is actually an open problem how slowing down the processor affects the execution time of an arbitrary application. DVS may result in nonlinearities [Buttazzo 2002]. Consider a system with two tasks (Figure 15). When the processor slows down (a), Task One takes longer to enter its critical section and is thus preempted by Task Two as soon as Task Two is released into the system. When the processor instead runs at maximum speed (b), Task One enters its critical section earlier and blocks Task Two until it finishes this section. This hypothetical example suggests that DVS can affect the order in which tasks are executed. But the order in which tasks execute affects the state of the cache which, in turn, affects performance. The exact relationship between the cache state and execution time is not clear. Thus, it is too simplistic to assume that execution time is inversely proportional to clock frequency.

Fig. 15. Scheduling effects. When DVS is applied (a), task one takes a longer time to enter its critical section, and task two is able to preempt it right before it does. When DVS is not applied (b), task one enters its critical section sooner and thus prevents task two from preempting it until it finishes its critical section. This figure has been adapted from Buttazzo [2002].

All of these issues show that theoretical studies are insufficient for understanding how DVS will affect system state. One needs to develop an experimental approach driven by heuristics and evaluate the tradeoffs of these heuristics empirically.
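The “U”-shaped energy curve discussed above can be made concrete with a small numerical sketch. All constants below (the effective capacitance, the constant memory-bank power, and the common DVS idealization that supply voltage scales linearly with clock frequency) are illustrative assumptions, not measurements from Fan et al. [2003].

```python
# Hedged sketch: total energy vs. clock frequency for a task needing a
# fixed number of CPU cycles, on a system whose active memory banks draw
# constant power for as long as the task runs. Constants are illustrative.

def total_energy(freq_ghz, work_cycles=1e9, c_eff=1e-9, p_mem_watts=0.5):
    """Return (cpu_energy, mem_energy, total) in joules at freq_ghz.

    CPU dynamic energy per cycle is modeled as C * V^2, with supply
    voltage assumed proportional to frequency (1 GHz -> 1 V), an
    idealization that is not guaranteed to hold in real systems.
    """
    volts = freq_ghz
    cpu_energy = work_cycles * c_eff * volts ** 2
    runtime_s = work_cycles / (freq_ghz * 1e9)
    mem_energy = p_mem_watts * runtime_s
    return cpu_energy, mem_energy, cpu_energy + mem_energy

# Sweep frequencies: CPU energy falls as the clock drops, but memory
# energy rises with the longer runtime, so the curve is "U" shaped.
freqs = [0.2, 0.4, 0.6, 0.8, 1.0]
totals = {f: total_energy(f)[2] for f in freqs}
best = min(totals, key=totals.get)
```

With these (assumed) constants the minimum of `totals` falls at an intermediate frequency: both the lowest and the highest settings cost more energy than the optimum, matching the qualitative shape of the energy-versus-slowdown curve described above.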
Most of the existing DVS approaches can be classified as interval-based approaches, intertask approaches, or intratask approaches.

3.5.3. Interval-Based Approaches. Interval-based DVS algorithms measure how busy the processor is over some interval or intervals, estimate how busy it will be in the next interval, and adjust the processor speed accordingly. These algorithms differ in how they estimate future processor utilization. Weiser et al. [1994] developed one of the earliest interval-based algorithms. This algorithm, Past, periodically measures how long the processor idles. If it idles longer than a threshold, the algorithm slows it down; if it remains busy longer than a threshold, it speeds it up. Since this approach makes decisions based only on the most recent interval, it is error prone. It is also prone to thrashing since it computes a CPU speed for every interval. Govil et al. [1995] examine a number of algorithms that extend Past by considering measurements collected over larger windows of time. The Aged Averages algorithm, for example, estimates the CPU usage for the upcoming interval as a weighted average of usages measured over all previous intervals. These algorithms save more energy than Past since they base their decisions on more information. However, they all assume that workloads are regular. As we have mentioned, it is very difficult, if not impossible, to predict irregular workloads using history information alone.

3.5.4. Intertask Approaches. Intertask DVS algorithms assign different speeds for different tasks. These speeds remain fixed for the duration of each task's execution. Weissel and Bellosa [2002] have developed an intertask algorithm that uses hardware events as the basis for choosing the clock frequencies for different processes. The motivation is that, as processes execute, they generate various hardware events (e.g., cache misses, IPC) at varying rates that relate to their performance and energy dissipation. Weissel and Bellosa's technique involves monitoring these event rates using hardware performance counters and attributing them to the processes that are executing. At each context switch, the heuristic adjusts the clock frequency for the process being activated based on its previously measured event rates. To do this, the heuristic refers to a previously constructed table of event rate combinations, divided into frequency domains. Weissel and Bellosa construct this table through a detailed offline simulation where, for different combinations of event rates, they determine the lowest clock frequency that does not violate a user-specified threshold for performance loss.

Intertask DVS has two drawbacks. First, task workloads are usually unknown until tasks finish running. Thus traditional algorithms either assume perfect knowledge of these workloads or estimate future workloads based on prior workloads. When workloads are irregular, estimating them is nontrivial. Flautner et al. [2001] address this problem by classifying all workloads as interactive episodes and producer-consumer episodes and using separate DVS algorithms for each. For interactive episodes, the clock frequency used depends not only on the predicted workload but also on a perception threshold that estimates how fast the episode would have to run to be acceptable to a user. To recover from prediction error, the algorithm also includes a panic threshold. If a session lasts longer than this threshold, the processor immediately switches to full speed.

The second drawback of intertask approaches is that they are unaware of program structure. A program's structure may provide insight into the work required for it, insight that, in turn, may provide opportunities for techniques such as voltage scaling. Within specific program regions, for example, the processor may spend significant time waiting for memory or disk accesses to complete. During the execution of these regions, the processor can be slowed down to meet pipeline stall delays.
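An event-rate-driven intertask policy of the kind Weissel and Bellosa describe can be sketched as a table lookup performed at each context switch. The table contents, thresholds, and the single “memory accesses per 1000 instructions” event below are illustrative assumptions; the actual technique combines several counter-based rates and derives its table from detailed offline simulation.

```python
# Hedged sketch of an intertask DVS policy in the spirit of Weissel and
# Bellosa [2002]: at each context switch, choose the lowest frequency
# deemed safe for the next process, based on its previously measured
# event rate. Table entries and thresholds are illustrative assumptions.

# Offline-built table of (event-rate threshold, normalized frequency).
# Higher memory-access rates tolerate lower frequencies within the
# user-specified performance loss threshold.
FREQ_TABLE = [
    (100, 0.6),   # rate >= 100: heavily memory bound -> clock down
    (50, 0.8),    # moderately memory bound
    (0, 1.0),     # compute bound -> full speed
]

def frequency_for(mem_accesses_per_kilo_insns):
    """Return the table frequency for a process with this event rate."""
    for threshold, freq in FREQ_TABLE:
        if mem_accesses_per_kilo_insns >= threshold:
            return freq
    return 1.0

def on_context_switch(process_event_rates, next_pid):
    """Pick a clock frequency for the process being activated.

    process_event_rates maps process ids to their most recently
    measured event rates; unseen processes default to compute bound.
    """
    rate = process_event_rates.get(next_pid, 0)
    return frequency_for(rate)
```

For example, activating a process previously measured at a high memory-access rate yields a reduced frequency, while an unknown process conservatively runs at full speed until its counters have been sampled.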
3.5.5. Intratask Approaches. Intratask approaches [Lee and Sakurai 2000; Lorch and Smith 2001; Gruian 2001; Yuan and Nahrstedt 2003] adjust the processor speed and voltage within tasks. Lee and Sakurai [2000] developed one of the earliest approaches. Their approach, run-time voltage hopping, splits each task into fixed-length timeslots. The algorithm assigns each timeslot the lowest speed that allows it to complete within its preferred execution time, which is measured as the worst case execution time minus the elapsed execution time up to the current timeslot. An issue not considered in this work is how to divide a task into timeslots. Moreover, the algorithm is pessimistic since tasks could finish before their worst case execution time. It is also insensitive to program structure.

There are many variations on this idea. Two operating system-level intratask algorithms are PACE [Lorch and Smith 2001] and Stochastic DVS [Gruian 2001]. These algorithms choose a speed for every cycle of a task's execution based on the probability distribution of the task workload measured over previous cycles. The difference between PACE and Stochastic DVS lies in their cost functions. Stochastic DVS assumes that energy is proportional to the square of the supply voltage, while PACE assumes it is proportional to the square of the clock frequency. These simplifying assumptions are not guaranteed to hold in real systems.

Dudani et al. [2002] have also developed intratask policies at the operating system level. Their work aims to combine intratask voltage scaling with the earliest-deadline-first scheduling algorithm. Dudani et al. split each program that the scheduler chooses to execute into two subprograms. The first subprogram runs at the maximum clock frequency, while the second runs slow enough to keep the combined execution time of both subprograms below the average execution time for the whole program.

In addition to these schemes, there have been a number of intratask policies implemented at the compiler level. Shin and Kim [2001] developed one of the first of these algorithms. They noticed that programs have multiple execution paths, some more time consuming than others, and that whenever control flows away from a critical path, there are opportunities for the processor to slow down and still finish the programs within their deadlines. Based on this observation, Shin and Kim developed a tool that profiles a program offline to determine worst case and average cycles for different execution paths, and then inserts instructions to change the processor frequency at the start of different paths based on this information.

Another approach, program checkpointing [Azevedo et al. 2002], annotates a program with checkpoints and assigns timing constraints for executing code between checkpoints. It then profiles the program offline to determine the average number of cycles between different checkpoints. Based on this profiling information and timing constraints, it adjusts the CPU speed at each checkpoint.

Perhaps the best known approach is by Hsu and Kremer [2003]. They profile a program offline with respect to all possible combinations of clock frequencies assigned to different regions. They then build a table describing how each combination affects execution time and power consumption. Using this table, they select the combination of regions and frequencies that saves the most power without increasing runtime beyond a threshold.

3.5.6. The Implications of Memory Bounded Code. Memory bounded applications have become an attractive target for DVS algorithms because the time for a memory access, on most commercial systems, is independent of how fast the processor is running. When frequent memory accesses dominate the program execution time, they limit how fast the program can finish executing. Because of this “memory wall”, the processor can actually run slower and save a lot of energy without losing as much performance as it would if it were slowing down a compute-intensive code.
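The effect of the memory wall on DVS can be quantified with a simple, admittedly idealized, timing model that splits execution time into a frequency-dependent CPU component and a frequency-independent memory-stall component. The workload numbers below are illustrative assumptions, not measurements.

```python
# Hedged sketch: why memory-bound code tolerates frequency scaling.
# Execution time is modeled as CPU work that scales with 1/f plus a
# fixed memory-stall time that does not change with the clock (the
# "memory wall"). All numbers are illustrative assumptions.

def runtime(cpu_cycles, mem_stall_s, freq_hz):
    """Idealized execution time: CPU cycles at freq_hz plus fixed stalls."""
    return cpu_cycles / freq_hz + mem_stall_s

def slowdown(cpu_cycles, mem_stall_s, full_hz, scaled_hz):
    """Relative runtime increase when scaling from full_hz to scaled_hz."""
    t_full = runtime(cpu_cycles, mem_stall_s, full_hz)
    t_slow = runtime(cpu_cycles, mem_stall_s, scaled_hz)
    return t_slow / t_full

# Halving the clock doubles the runtime of purely compute-bound code,
# but a task dominated by memory stalls slows down far less than 2x.
compute_bound = slowdown(1e9, 0.0, 1e9, 5e8)   # no stalls -> 2.0x
memory_bound = slowdown(1e8, 0.9, 1e9, 5e8)    # stalls dominate -> ~1.1x
```

Under this model, halving the frequency costs the memory-bound task only about 10% in runtime while the CPU spends half as much energy per cycle, which is precisely why memory-bound regions are attractive targets for DVS.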