Power Reduction Techniques For Microprocessor Systems
VASANTH VENKATACHALAM AND MICHAEL FRANZ
University of California, Irvine




Power consumption is a major factor that limits the performance of computers. We survey the "state of the art" in techniques that reduce the total power consumed by a microprocessor system over time. These techniques are applied at various levels ranging from circuits to architectures, architectures to system software, and system software to applications. They also include holistic approaches that will become more important over the next decade. We conclude that power management is a multifaceted discipline that is continually expanding with new techniques being developed at every level. These techniques may eventually allow computers to break through the "power wall" and achieve unprecedented levels of performance, versatility, and reliability. Yet it remains too early to tell which techniques will ultimately solve the power problem.

Categories and Subject Descriptors: C.5.3 [Computer System Implementation]: Microcomputers—Microprocessors; D.2.10 [Software Engineering]: Design—Methodologies; I.m [Computing Methodologies]: Miscellaneous

General Terms: Algorithms, Design, Experimentation, Management, Measurement, Performance

Additional Key Words and Phrases: Energy dissipation, power reduction



1. INTRODUCTION

Computer scientists have always tried to improve the performance of computers. But although today's computers are much faster and far more versatile than their predecessors, they also consume a lot of power; so much power, in fact, that their power densities and concomitant heat generation are rapidly approaching levels comparable to nuclear reactors (Figure 1). These high power densities impair chip reliability and life expectancy, increase cooling costs, and, for large


Parts of this effort have been sponsored by the National Science Foundation under ITR grant CCR-0205712
and by the Office of Naval Research under grant N00014-01-1-0854.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
authors and should not be interpreted as necessarily representing the official views, policies or endorsements,
either expressed or implied, of the National Science Foundation (NSF), the Office of Naval Research (ONR),
or any other agency of the U.S. Government.
The authors also gratefully acknowledge gifts from Intel, Microsoft Research, and Sun Microsystems that
partially supported this work.
Authors’ addresses: Vasanth Venkatachalam, School of Information and Computer Science, University of
California at Irvine, Irvine, CA 92697-3425; email: vvenkata@uci.edu; Michael Franz, School of Information
and Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: franz@uci.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or direct commercial advantage and
that copies show this notice on the first page or initial screen of a display along with the full citation.
Copyrights for components of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any
component of this work in other works requires prior specific permission and/or a fee. Permissions may be
requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212)
869-0481, or permissions@acm.org.
© 2005 ACM 0360-0300/05/0900-0195 $5.00


                                               ACM Computing Surveys, Vol. 37, No. 3, September 2005, pp. 195–237.

data centers, even raise environmental concerns.

Fig. 1. Power densities rising. Figure adapted from Pollack 1999.

   At the other end of the performance spectrum, power issues also pose problems for smaller mobile devices with limited battery capacities. Although one could give these devices faster processors and larger memories, this would diminish their battery life even further.

   Without cost effective solutions to the power problem, improvements in microprocessor technology will eventually reach a standstill. Power management is a multidisciplinary field that involves many aspects (i.e., energy, temperature, reliability), each of which is complex enough to merit a survey of its own. The focus of our survey will be on techniques that reduce the total power consumed by typical microprocessor systems.

   We will follow the high-level taxonomy illustrated in Figure 2. First, we will define power and energy and explain the complex parameters that dynamic and static power depend on (Section 2). Next, we will introduce techniques that reduce power and energy (Section 3), starting with circuit (Section 3.1) and architectural techniques (Section 3.2, Section 3.3, and Section 3.4), and then moving on to two techniques that are widely applied in hardware and software, dynamic voltage scaling (Section 3.5) and resource hibernation (Section 3.6). Third, we will examine what compilers can do to manage power (Section 3.7). We will then discuss recent work in application level power management (Section 3.8), and recent efforts (Section 3.9) to develop a holistic solution to the power problem. Finally, we will discuss some commercial power management systems (Section 3.10) and provide a glimpse into some more radical technologies that are emerging (Section 3.11).

2. DEFINING POWER

Power and energy are commonly defined in terms of the work that a system performs. Energy is the total amount of work a system performs over a period of time, while power is the rate at which the system performs that work. In formal terms,

   P = W/T,     (1)

   E = P · T,   (2)

where P is power, E is energy, T is a specific time interval, and W is the total work performed in that interval. Energy is measured in joules, while power is measured in watts.

   These concepts of work, power, and energy are used differently in different contexts. In the context of computers, work involves activities associated with running programs (e.g., addition, subtraction, memory operations), power is the rate at which the computer consumes electrical energy (or dissipates it in the form of heat) while performing these activities, and energy is the total electrical energy the computer consumes (or dissipates as heat) over time.

   This distinction between power and energy is important because techniques that reduce power do not necessarily reduce energy. For example, the power consumed by a computer can be reduced by halving the clock frequency, but if the computer then takes twice as long to run the same programs, the total energy consumed will be similar. Whether one should reduce power or energy depends on the context. In mobile applications, reducing energy is often more important because it increases the battery lifetime. However, for other systems (e.g., servers), temperature is a larger issue. To keep the temperature within acceptable limits, one would need to reduce instantaneous power regardless of the impact on total energy.

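The frequency-halving example above can be checked directly with Equations (1) and (2). This is a toy calculation with illustrative numbers (40 W, 10 s), not figures from the survey:

```python
def energy_joules(power_watts, seconds):
    # E = P * T (Equation 2)
    return power_watts * seconds

# Baseline: a program runs for 10 s on a processor drawing 40 W.
e_base = energy_joules(40.0, 10.0)   # 400 J

# Halving the clock frequency roughly halves power, but the same
# program then takes twice as long, so the energy is unchanged.
e_half = energy_joules(20.0, 20.0)   # 400 J

assert e_base == e_half  # power fell, energy did not
```

This is why battery-limited mobile devices care mainly about energy, while thermally limited systems care about instantaneous power.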




        Fig. 2. Organization of this survey.


2.1. Dynamic Power Consumption

There are two forms of power consumption, dynamic power consumption and static power consumption. Dynamic power consumption arises from circuit activity such as the changes of inputs in an adder or values in a register. It has two sources, switched capacitance and short-circuit current.

   Switched capacitance is the primary source of dynamic power consumption and arises from the charging and discharging of capacitors at the outputs of circuits.

   Short-circuit current is a secondary source of dynamic power consumption and accounts for only 10-15% of the total power consumption. It arises because circuits are composed of transistors having opposite polarity, negative or NMOS and positive or PMOS. When these two types of transistors switch current, there is an instant when they are simultaneously on, creating a short circuit. We will not deal further with the power dissipation caused by this short circuit because it is a smaller percentage of total power, and researchers have not found a way to reduce it without sacrificing performance.

   As the following equation shows, the more dominant component of dynamic power, switched capacitance (Pdynamic), depends on four parameters, namely, supply voltage (V), clock frequency (f), physical capacitance (C), and an activity factor (a) that relates to how many 0 → 1 or 1 → 0 transitions occur in a chip:

   Pdynamic ∼ a · C · V^2 · f.   (3)

   Accordingly, there are four ways to reduce dynamic power consumption, though they each have different tradeoffs and not all of them reduce the total energy consumed. The first way is to reduce the physical capacitance or stored electrical charge of a circuit. The physical capacitance depends on low-level design parameters such as transistor sizes and wire lengths. One can reduce the capacitance by reducing transistor sizes, but this worsens performance.

   The second way to lower dynamic power is to reduce the switching activity. As computer chips get packed with increasingly complex functionalities, their switching activity increases [De and Borkar 1999], making it more important to develop techniques that fall into this category. One popular technique, clock gating, gates the clock signal from reaching idle functional units. Because the clock network accounts for a large fraction of a chip's total energy consumption, this is a very effective way of reducing power and energy throughout a processor and is implemented in numerous commercial systems


including the Pentium 4, Pentium M, Intel XScale and Tensilica Xtensa, to mention but a few.

Fig. 3. ITRS trends for leakage power dissipation. Figure adapted from Meng et al., 2005.

   The third way to reduce dynamic power consumption is to reduce the clock frequency. But as we have just mentioned, this worsens performance and does not always reduce the total energy consumed. One would use this technique only if the target system does not support voltage scaling and if the goal is to reduce the peak or average power dissipation and indirectly reduce the chip's temperature.

   The fourth way to reduce dynamic power consumption is to reduce the supply voltage. Because reducing the supply voltage increases gate delays, it also requires reducing the clock frequency to allow the circuit to work properly.

   The combination of scaling the supply voltage and clock frequency in tandem is called dynamic voltage scaling (DVS). This technique should ideally reduce dynamic power dissipation cubically because dynamic power is quadratic in voltage and linear in clock frequency. This is the most widely adopted technique. A growing number of processors, including the Pentium M, mobile Pentium 4, AMD's Athlon, and Transmeta's Crusoe and Efficeon processors, allow software to adjust clock frequencies and voltage settings in tandem. However, DVS has limitations and cannot always be applied, and even when it can be applied, it is nontrivial to apply, as we will see in Section 3.5.

2.2. Understanding Leakage Power Consumption

In addition to consuming dynamic power, computer components consume static power, also known as idle power or leakage. According to the most recently published industrial roadmaps [ITRSRoadMap], leakage power is rapidly becoming the dominant source of power consumption in circuits (Figure 3) and persists whether a computer is active or idle. Because its causes are different from those of dynamic power, dynamic power reduction techniques do not necessarily reduce leakage power.

   As the equation that follows illustrates, leakage power consumption is the product of the supply voltage (V) and the leakage current (Ileak), or parasitic current, that flows through transistors even when the transistors are turned off:

   Pleak = V · Ileak.   (4)

   To understand how leakage current arises, one must understand how transistors work. A transistor regulates the flow of current between two terminals called the source and the drain. Between these two terminals is an insulator, called the channel, that resists current. As the voltage at a third terminal, the gate, is increased, electrical charge accumulates in the channel, reducing the channel's resistance and creating a path along which electricity can flow. Once the gate voltage is high enough, the channel's polarity changes, allowing the normal flow of current between the source and the drain. The threshold at which the gate's voltage is high enough for the path to open is called the threshold voltage.

   According to this model, a transistor is similar to a water dam. It is supposed to allow current to flow when the gate voltage exceeds the threshold voltage but should otherwise prevent current from flowing. However, transistors are imperfect. They leak current even when the gate voltage is below the threshold voltage. In fact, there are six different types of current that leak through a transistor. These


include reverse-biased-junction leakage, gate-induced-drain leakage, subthreshold leakage, gate-oxide leakage, gate-current leakage, and punch-through leakage. Of these six, subthreshold leakage and gate-oxide leakage dominate the total leakage current.

   Gate-oxide leakage flows from the gate of a transistor into the substrate. This type of leakage current depends on the thickness of the oxide material that insulates the gate:

   Iox = K2 · W · (V/Tox)^2 · e^(−α·Tox/V).   (5)

   According to this equation, the gate-oxide leakage Iox increases exponentially as the thickness Tox of the gate's oxide material decreases. This is a problem because future chip designs will require the thickness to be reduced along with other scaled parameters such as transistor length and supply voltage. One way of solving this problem would be to insulate the gate using a high-k dielectric material instead of the oxide materials that are currently used. This solution is likely to emerge over the next decade.

   Subthreshold leakage current flows between the drain and source of a transistor. It is the dominant source of leakage and depends on a number of parameters that are related through the following equation:

   Isub = K1 · W · e^(−Vth/(n·T)) · (1 − e^(−V/T)).   (6)

   In this equation, W is the gate width and K1 and n are constants. The important parameters are the supply voltage V, the threshold voltage Vth, and the temperature T. The subthreshold leakage current Isub increases exponentially as the threshold voltage Vth decreases. This again raises a problem for future chip designs because, as technology scales, threshold voltages will have to scale along with supply voltages.

   The increase in subthreshold leakage current causes another problem. When the leakage current increases, the temperature increases. But as Equation (6) shows, this increases leakage further, causing yet higher temperatures. This vicious cycle is known as thermal runaway. It is the chip designer's worst nightmare.

   How can these problems be solved? Equations (4) and (6) indicate four ways to reduce leakage power. The first way is to reduce the supply voltage. As we will see, supply voltage reduction is a very common technique that has been applied to components throughout a system (e.g., processor, buses, cache memories).

   The second way to reduce leakage power is to reduce the size of a circuit because the total leakage is proportional to the leakage dissipated in all of a circuit's transistors. One way of doing this is to design a circuit with fewer transistors by omitting redundant hardware and using smaller caches, but this may limit performance and versatility. Another idea is to reduce the effective transistor count dynamically by cutting the power supplies to idle components. Here, too, there are challenges, such as how to predict when different components will be idle and how to minimize the overhead of switching them on or off. This is also a common approach, of which we will see examples in the sections to follow.

   The third way to reduce leakage power is to cool the computer. Several cooling techniques have been developed since the 1960s. Some blow cold air into the circuit, while others refrigerate the processor [Schmidt and Notohardjono 2002], sometimes even by costly means such as circulating cryogenic fluids like liquid nitrogen [Krane et al. 1988]. These techniques have three advantages. First, they significantly reduce subthreshold leakage. In fact, a recent study [Schmidt and Notohardjono 2002] showed that cooling a memory cell by 50 degrees Celsius reduces the leakage energy by five times. Second, these techniques allow a circuit to work faster because electricity encounters less resistance at lower temperatures. Third, cooling eliminates some negative effects of high temperatures, namely the degradation of a chip's reliability and life expectancy. Despite these advantages,





there are issues to consider such as the costs of the hardware used to cool the circuit. Moreover, cooling techniques are insufficient if they result in wide temperature variations in different parts of a circuit. Rather, one needs to prevent hotspots by distributing heat evenly throughout a chip.

Fig. 4. The transistor stacking effect. The circuits in both (a) and (b) are leaking current between Vdd and Ground. However, Ileak1, the leakage in (a), is less than Ileak2, the leakage in (b). Figure adapted from Butts and Sohi 2000.

Fig. 5. Multiple threshold circuits with sleep transistors.

   2.2.1. Reducing Threshold Voltage. The fourth way of reducing leakage power is to increase the threshold voltage. As Equation (6) shows, this reduces the subthreshold leakage exponentially. However, it also reduces the circuit's performance, as is apparent in the following equation that relates frequency (f), supply voltage (V), and threshold voltage (Vth), and where α is a constant:

   f ∝ (V − Vth)^α / V.   (7)

   One of the less intuitive ways of increasing the threshold voltage is to exploit what is called the stacking effect (refer to Figure 4). When two or more transistors that are switched off are stacked on top of each other (a), then they dissipate less leakage than a single transistor that is turned off (b). This is because each transistor in the stack induces a slight reverse bias between the gate and source of the transistor right below it, and this increases the threshold voltage of the bottom transistor, making it more resistant to leakage. As a result, in Figure 4(a), in which all transistors are in the Off position, transistor B leaks less current than transistor A, and transistor C leaks less current than transistor B. Hence, the total leakage current is attenuated as it flows from Vdd to the ground through transistors A, B, and C. This is not the case in the circuit shown in Figure 4(b), which contains only a single off transistor.

   Another way to increase the threshold voltage is to use Multiple Threshold Circuits With Sleep Transistors (MTCMOS) [Calhoun et al. 2003; Won et al. 2003]. This involves isolating a leaky circuit element by connecting it to a pair of virtual power supplies that are linked to its actual power supplies through sleep transistors (Figure 5). When the circuit is active, the sleep transistors are activated, connecting the circuit to its power supplies. But when the circuit is inactive, the sleep transistors are deactivated, thus disconnecting the circuit from its power supplies. In this inactive state, almost no leakage passes through the circuit because the sleep transistors have high threshold voltages. (Recall that subthreshold leakage drops exponentially with a rise in threshold voltage, according to Equation (6).)

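The tension between Equations (6) and (7) can be sketched numerically. In the toy model below, K1 = W = 1, n = 1.5, and α = 1.3 are illustrative constants, and T is taken to be the room-temperature thermal voltage (about 0.026 V) so that the exponents are dimensionless; none of these values come from the survey:

```python
import math

def i_sub(vth, v=1.0, t=0.026, k1=1.0, w=1.0, n=1.5):
    # Isub = K1 * W * e^(-Vth/(n*T)) * (1 - e^(-V/T))   (Equation 6)
    return k1 * w * math.exp(-vth / (n * t)) * (1 - math.exp(-v / t))

def relative_max_freq(vth, v=1.0, alpha=1.3):
    # f ∝ (V - Vth)^alpha / V   (Equation 7)
    return (v - vth) ** alpha / v

# Raising the threshold voltage from 0.3 V to 0.4 V cuts the
# subthreshold leakage by roughly an order of magnitude...
leakage_saving = i_sub(0.3) / i_sub(0.4)
# ...but also lowers the achievable clock frequency.
speed_ratio = relative_max_freq(0.4) / relative_max_freq(0.3)
```

In this model the 0.1 V threshold increase buys about a 13x leakage reduction at the cost of roughly a fifth of the clock rate, which is exactly the performance/leakage tradeoff the techniques in this section try to navigate.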




                Fig. 6. Adaptive body biasing.


   This technique effectively confines the leakage to one part of the circuit, but is tricky to implement for several reasons. The sleep transistors must be sized properly to minimize the overhead of activating them. They cannot be turned on and off too frequently. Moreover, this technique does not readily apply to memories because memories lose data when their power supplies are cut.

   A third way to increase the threshold is to employ dual threshold circuits. Dual threshold circuits [Liu et al. 2004; Wei et al. 1998; Ho and Hwang 2004] reduce leakage by using high threshold (low leakage) transistors on noncritical paths and low threshold transistors on critical paths, the idea being that noncritical paths can execute instructions more slowly without impairing performance. This is a difficult technique to implement because it requires choosing the right combination of transistors for high-threshold voltages. If too many transistors are assigned high threshold voltages, the noncritical paths in the circuit can slow down too much.

   A fourth way to increase the threshold voltage is to apply a technique known as adaptive body biasing [Seta et al. 1995; Kobayashi and Sakurai 1994; Kim and Roy 2002]. Adaptive body biasing is a runtime technique that reduces leakage power by dynamically adjusting the threshold voltages of circuits depending on whether the circuits are active. When a circuit is not active, the technique increases its threshold voltage, thus saving leakage power exponentially, although at the expense of a delay in circuit operation. When the circuit is active, the technique decreases the threshold voltage to avoid slowing it down.

   To adjust the threshold voltage, adaptive body biasing applies a voltage to the transistor's body known as a body bias voltage (Figure 6). This voltage changes the polarity of a transistor's channel, decreasing or increasing its resistance to current flow. When the body bias voltage is chosen to fill the transistor's channel with positive ions (b), the threshold voltage increases and reduces leakage currents. However, when the voltage is chosen to fill the channel with negative ions, the threshold voltage decreases, allowing higher performance, though at the cost of more leakage.

3. REDUCING POWER

3.1. Circuit And Logic Level Techniques

   3.1.1. Transistor Sizing. Transistor sizing [Penzes et al. 2002; Ebergen et al. 2004] reduces the width of transistors to reduce their dynamic power consumption, using low-level models that relate the power consumption to the width. According to these models, reducing the width also increases the transistor's delay, and thus the transistors that lie away from the critical paths of a circuit are usually the best candidates for this technique. Algorithms for applying this technique usually associate with each transistor a tolerable delay, which varies depending on how close that transistor is to the critical path. These algorithms then try to scale each transistor to be as small as possible without violating its tolerable delay.

   3.1.2. Transistor Reordering. The arrangement of transistors in a circuit affects energy consumption. Figure 7


Figure 7 shows two possible implementations of the same circuit that differ only in their placement of the transistors marked A and B. Suppose that the input to transistor A is 1, the input to transistor B is 1, and the input to transistor C is 0. Then transistors A and B will be on, allowing current from Vdd to flow through them and charge the capacitors C1 and C2.

   Now suppose that the inputs change, so that A's input becomes 0 and C's input becomes 1. Then A will be off while B and C will be on. Now the implementations in (a) and (b) will differ in the amounts of switching activity. In (a), current from ground will flow past transistors B and C, discharging both the capacitors C1 and C2. However, in (b), the current from ground will only flow past transistor C; it will not get past transistor A since A is turned off. Thus it will only discharge the capacitor C2, rather than both C1 and C2 as in part (a). As a result, the implementation in (b) will consume less power than that in (a).

   Transistor reordering [Kursun et al. 2004; Sultania et al. 2004] rearranges transistors to minimize their switching activity. One of its guiding principles is to place transistors closer to the circuit's outputs if they switch frequently, in order to prevent a domino effect where the switching activity from one transistor trickles into many other transistors, causing widespread power dissipation. This requires profiling techniques to determine how frequently different transistors are likely to switch.

Fig. 7. Transistor Reordering. Figure adapted from Hossain et al. 1996.

   3.1.3. Half Frequency and Half Swing Clocks. Half-frequency and half-swing clocks reduce frequency and voltage, respectively. Traditionally, hardware events such as register file writes occur on a rising clock edge. Half-frequency clocks synchronize events using both edges, and they tick at half the speed of regular clocks, thus cutting clock switching power in half. Reduced-swing clocks, in turn, use a lower voltage signal and thus reduce power quadratically.

   3.1.4. Logic Gate Restructuring. There are many ways to build a circuit out of logic gates. One decision that affects power consumption is how to arrange the gates and their input signals.

   For example, consider two implementations of a four-input AND gate (Figure 8): a chain implementation (a) and a tree implementation (b). Knowing the signal probabilities (1 or 0) at each of the primary inputs (A, B, C, D), one can easily calculate the transition probabilities (0→1) for each output (W, X, F, Y, Z). If each input has an equal probability of being a 1 or a 0, then the calculation shows that the chain implementation (a) is likely to switch less than the tree implementation (b). This is because each gate in a chain has a lower probability of having a 0→1 transition than its predecessor; its transition probability depends on those of all its predecessors. In the tree implementation, on the other hand, some gates may share a parent (in the tree topology) instead of being directly connected together. These gates could have the same transition probabilities.

Fig. 8. Gate restructuring. Figure adapted from the Pennsylvania State University Microsystems Design Laboratory's tutorial on low power design.

   Nevertheless, chain implementations do not necessarily save more energy than tree implementations. There are other issues to consider when choosing a topology. One is the issue of glitches, spurious transitions that occur when a gate does not receive all of its inputs at the same time. These glitches are more common in chain implementations, where signals can travel along different paths having widely varying delays.
One solution to reduce glitches is to change the topology so that the different paths in the circuit have similar delays. This solution, known as path balancing, often transforms chain implementations into tree implementations. Another solution, called retiming, involves inserting flip-flops or registers to slow down and thereby synchronize the signals that pass along different paths but reconverge to the same gate. Because flip-flops and registers are in sync with the processor clock, they sample their inputs less frequently than logic gates and are thus more immune to glitches.

   3.1.5. Technology Mapping. Because of the huge number of possibilities and tradeoffs at the gate level, designers rely on tools to determine the most energy-optimal way of arranging gates and signals. Technology mapping [Chen et al. 2004; Li et al. 2004; Rutenbar et al. 2001] is the automated process of constructing a gate-level representation of a circuit subject to constraints such as area, delay, and power. Technology mapping for power relies on gate-level power models and a library that describes the available gates and their design constraints. Before a circuit can be described in terms of gates, it is initially represented at the logic level. The problem is to design the circuit out of logic gates in a way that will minimize the total power consumption under delay and cost constraints. This is an NP-hard Directed Acyclic Graph (DAG) covering problem, and a common heuristic to solve it is to break the DAG representation of a circuit into a set of trees and find the optimal mapping for each subtree using standard tree-covering algorithms.

   3.1.6. Low Power Flip-Flops. Flip-flops are the building blocks of small memories such as register files. A typical master-slave flip-flop consists of two latches, a master latch and a slave latch. The inputs of the master latch are the clock signal and data. The inputs of the slave latch are the inverse of the clock signal and the data output by the master latch. Thus when the clock signal is high, the master latch is turned on, and the slave latch is turned off. In this phase, the master samples whatever inputs it receives and outputs them. The slave, however, does not sample its inputs but merely outputs whatever it has most recently stored. On the falling edge of the clock, the master turns off and the slave turns on. Thus the master saves its most recent input and stops sampling any further inputs. The slave samples the new inputs it receives from the master and outputs them.

   Besides this master-slave design are other common designs such as the pulse-triggered flip-flop and the sense-amplifier flip-flop. All these designs share some common sources of power consumption, namely power dissipated from the clock signal, power dissipated in internal switching activity (caused by the clock signal and by changes in data), and power dissipated when the outputs change.

   Researchers have proposed several alternative low-power designs for flip-flops. Most of these approaches reduce the switching activity or the power dissipated by the clock signal. One alternative is the self-gating flip-flop. This design inhibits the clock signal to the flip-flop when the inputs will produce no change in the outputs. Strollo et al. [2000] have proposed two versions of this design.
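Behaviorally, self-gating reduces to comparing the pending input with the stored output and suppressing the clock pulse when they match. The following sketch is our own illustration of that decision; the pulse counters are instrumentation we added, not part of any published design.

```python
# Behavioral sketch of a self-gating flip-flop: the clock is inhibited
# whenever the pending input equals the stored output, so no internal
# nodes toggle for a redundant write. Counters track how many clock
# pulses actually reach the latch versus how many are gated away.

class SelfGatingFF:
    def __init__(self):
        self.q = 0
        self.clock_pulses = 0   # pulses that actually reach the latch
        self.gated_pulses = 0   # pulses suppressed by the gating logic

    def tick(self, d):
        """One rising clock edge with input d."""
        if d == self.q:
            self.gated_pulses += 1   # comparator sees no change: inhibit clock
        else:
            self.clock_pulses += 1   # clock passes through, latch updates
            self.q = d
        return self.q

ff = SelfGatingFF()
for bit in [0, 1, 1, 1, 0, 0, 1]:
    ff.tick(bit)
# For this input stream, only 3 of the 7 clock edges reach the latch.
```

The point of the sketch is only the comparison step; a real implementation does this with a comparator and clock-inhibiting logic, as described next.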

In the double-gated flip-flop, the master and slave latches each have their own clock-gating circuit. The circuit consists of a comparator that checks whether the current and previous inputs are different, and some logic that inhibits the clock signal based on the output of the comparator. In the sequential-gated flip-flop, the master and the slave share the same clock-gating circuit instead of having their own individual circuits as in the double-gated flip-flop.

   Another low-power flip-flop is the conditional capture flip-flop [Kong et al. 2001; Nedovic et al. 2001]. This flip-flop detects when its inputs produce no change in the output and stops these redundant inputs from causing any spurious internal current switches. There are many variations on this idea, such as the conditional discharge and conditional precharge flip-flops [Zhao et al. 2004], which eliminate unnecessary precharging and discharging of internal elements.

   A third type of low-power flip-flop is the dual edge triggered flip-flop [Llopis and Sachdev 1996]. These flip-flops are sensitive to both the rising and falling edges of the clock and can do the same work as a regular flip-flop at half the clock frequency. By cutting the clock frequency in half, these flip-flops save total power consumption but may not save total energy, although they could save energy by reducing the voltage along with the frequency. Another drawback is that these flip-flops require more area than regular flip-flops. This increase in area may cause them to consume more power on the whole.

   3.1.7. Low-Power Control Logic Design. One can view the control logic of a processor as a Finite State Machine (FSM). It specifies the possible processor states and the conditions for switching between the states, and it generates the signals that activate the appropriate circuitry for each state. For example, when an instruction is in the execution stage of the pipeline, the control logic may generate signals to activate the correct execution unit.

   Although control logic optimizations have traditionally targeted performance, they are now also targeting power. One way of reducing power is to encode the FSM states in a way that minimizes the switching activity throughout the processor. Another common approach involves decomposing the FSM into sub-FSMs and activating only the circuitry needed for the currently executing sub-FSM. Some researchers [Gao and Hayes 2003] have developed tools that automate this process of decomposition, subject to some quality and correctness constraints.

   3.1.8. Delay-Based Dynamic-Supply Voltage Adjustment. Commercial processors that can run at multiple clock speeds normally use a lookup table to decide what supply voltage to select for a given clock speed. This table is built ahead of time through a worst-case analysis. Because worst-case analyses may not reflect reality, ARM Inc. has been developing a more efficient runtime solution known as the Razor pipeline [Ernst et al. 2003].

   Instead of using a lookup table, Razor adjusts the supply voltage based on the delay in the circuit. The idea is that whenever the voltage is reduced, the circuit slows down, causing timing errors if the clock frequency chosen for that voltage is too high. Because of these errors, some instructions produce inconsistent results or fail altogether. The Razor pipeline periodically monitors how many such errors occur. If the number of errors exceeds a threshold, it increases the supply voltage. If the number of errors is below a threshold, it scales the voltage down further to save more energy.

   This solution requires extra hardware in order to detect and correct circuit errors. To detect errors, it augments the flip-flops in delay-critical regions of the circuit with shadow latches. These latches receive the same inputs as the flip-flops but are clocked more slowly to adapt to the reduced supply voltage. If the output of a flip-flop differs from that of its shadow latch, this signifies an error. In the event of such an error, the circuit propagates the output of the shadow latch instead of the flip-flop, delaying the pipeline for a cycle if necessary.
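The control loop behind this scheme can be sketched as a simple threshold rule. The constants below are invented for illustration and are not ARM's; the point is only the feedback structure: raise the voltage when timing errors are frequent, lower it when they are rare.

```python
# Sketch of the feedback loop behind delay-based supply-voltage adjustment:
# each monitoring interval, compare the observed error count against two
# thresholds and nudge the supply voltage up or down within safe bounds.
# All constants are illustrative placeholders, not values from Razor.

V_MIN, V_MAX, V_STEP = 0.8, 1.2, 0.05   # volts
ERR_HIGH, ERR_LOW = 10, 2               # errors per monitoring interval

def adjust_voltage(vdd, errors_in_interval):
    """Return the supply voltage to use for the next interval."""
    if errors_in_interval > ERR_HIGH:
        vdd = min(V_MAX, vdd + V_STEP)   # too many timing errors: back off
    elif errors_in_interval < ERR_LOW:
        vdd = max(V_MIN, vdd - V_STEP)   # errors comfortably rare: save energy
    return round(vdd, 3)                 # keep floating-point noise out

v = 1.0
v = adjust_voltage(v, 0)    # quiet interval: voltage drops to 0.95
v = adjust_voltage(v, 25)   # error burst: voltage climbs back to 1.0
```

Between the two thresholds the voltage is left alone, which keeps the loop from oscillating on every interval.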




3.2. Low-Power Interconnect

Interconnect heavily affects power consumption since it is the medium of most electrical activity. Efforts to improve chip performance are resulting in smaller chips with more transistors and more densely packed wires carrying larger currents. The wires in a chip often use materials with poor thermal conductivity [Banerjee and Mehrotra 2001].

   3.2.1. Bus Encoding and Crosstalk. A popular way of reducing the power consumed in buses is to reduce the switching activity through intelligent bus encoding schemes, such as bus-inversion [Stan and Burleson 1995]. Buses consist of wires that transmit bits (logical 1 or 0). For every data transmission on a bus, the number of wires that switch depends on the current and previous values transmitted. If the Hamming distance between these values is more than half the number of wires, then most of the wires on the bus will switch current. To prevent this from happening, bus-inversion transmits the inverse of the intended value and asserts a control signal alerting recipients of the inversion. For example, if the current binary value to transmit is 110 and the previous was 000, the bus instead transmits 001, the inverse of 110.

   Bus-inversion ensures that at most half of the bus wires switch during a bus transaction. However, because of the cost of the logic required to invert the bus lines, this technique is mainly used in external buses rather than the internal chip interconnect.

   Moreover, some researchers have argued that this technique makes the simplistic assumption that the amount of power an interconnect wire consumes depends only on the number of bit transitions on that wire. As chips become smaller, additional sources of power consumption arise. One of these sources is crosstalk [Sylvester and Keutzer 1998]. Crosstalk is spurious activity on a wire that is caused by activity in neighboring wires. As well as increasing delays and impairing circuit integrity, crosstalk can increase power consumption.

   One way of reducing crosstalk [Taylor et al. 2001] is to insert a shield wire (Figure 9) between adjacent bus wires. Since the shield remains deasserted, no adjacent wires switch in opposite directions. This solution doubles the number of wires. However, the principle behind it has sparked efforts to develop self-shielding codes [Victor and Keutzer 2001; Patel and Markov 2003] resistant to crosstalk. As in traditional bus encoding, a value is encoded and then transmitted. However, the code chosen avoids opposing transitions on adjacent bus wires.

Fig. 9. Crosstalk prevention through shielding.

   3.2.2. Low Swing Buses. Bus-encoding schemes attempt to reduce transitions on the bus. Alternatively, a bus can transmit the same information but at a lower voltage. This is the principle behind low swing buses [Zhang and Rabaey 1998]. Traditionally, logical one is represented by 5 volts and logical zero is represented by −5 volts. In a low-swing system (Figure 10), logical one and zero are encoded using lower voltages, such as +300mV and −300mV. Typically, these systems are implemented with differential signaling. An input signal is split into two signals of opposite polarity bounded by a smaller voltage range. The receiver sees the difference between the two transmitted signals as the actual signal and amplifies it back to normal voltage. Low Swing Differential Signaling has several advantages in addition to reduced power consumption. It is immune to
crosstalk and electromagnetic radiation effects. Since the two transmitted signals are close together, any spurious activity will affect both equally without affecting the difference between them. However, when implementing this technique, one needs to consider the costs of the increased hardware at the encoder and decoder.

Fig. 10. Low voltage differential signaling.

   3.2.3. Bus Segmentation. Another effective technique is bus segmentation. In a traditional shared bus architecture, the entire bus is charged and discharged upon every access. Segmentation (Figure 11) splits a bus into multiple segments connected by links that regulate the traffic between adjacent segments. Links connecting paths essential to a communication are activated independently, allowing most of the bus to remain powered down. Ideally, devices communicating frequently should be in the same or nearby segments to avoid powering many links. Jone et al. [2003] have examined how to partition a bus to ensure this property. Their algorithm begins with an undirected graph whose nodes are devices, whose edges connect communicating devices, and whose edge weights represent communication frequency. Applying the Gomory and Hu algorithm [1961], they find a spanning tree in which the product of communication frequency and number of edges between any two devices is minimal.

Fig. 11. Bus segmentation.

   3.2.4. Adiabatic Buses. The preceding techniques work by reducing activity or voltage. In contrast, adiabatic circuits [Lyuboslavsky et al. 2000] reduce total capacitance. These circuits reuse existing electrical charge to avoid creating new charge. In a traditional bus, when a wire becomes deasserted, its previous charge is wasted. A charge-recovery bus recycles the charge for wires about to be asserted.

   Figure 12 illustrates a simple design for a two-bit adiabatic bus [Bishop and Irwin 1999]. A comparator in each bitline tests whether the current bit is equal to the previously sent bit. Inequality signifies a transition and connects the bitline to a shorting wire used for sharing charge across bitlines. In the sample scenario, Bitline1 has a falling transition while Bitline0 has a rising transition. As Bitline1 falls, its positive charge transfers to Bitline0, allowing Bitline0 to rise. The power saved depends on transition patterns. No energy is saved when all lines rise. The most energy is saved when an equal number of lines rise and fall simultaneously.

   The biggest drawback of adiabatic circuits is the delay incurred in transferring shared charge.
Fig. 12. Two bit charge recovery bus.

   3.2.5. Network-On-Chip. All of the above techniques assume an architecture in which multiple functional units share one or more buses. This shared-bus paradigm has several drawbacks with respect to power and performance. The inherent bus bandwidth limits the speed and volume of data transfers and does not suit the varying requirements of different execution units. Only one unit can access the bus at a time, though several may be requesting bus access simultaneously. Bus arbitration mechanisms are energy intensive. Unlike simple data transfers, every bus transaction involves multiple clock cycles of handshaking protocols that increase power and delay.

   As a consequence, researchers have been investigating the use of generic interconnection networks to replace buses. Compared to buses, networks offer higher bandwidth and support concurrent connections. This design makes it easier to troubleshoot energy-intensive areas of the network. Many networks have sophisticated mechanisms for adapting to varying traffic patterns and quality-of-service requirements. Several authors [Zhang et al. 2001; Dally and Towles 2001; Sgroi et al. 2001] propose the application of these techniques at the chip level. These claims have yet to be verified, and interconnect power reduction remains a promising area for future research.

   Wang et al. [2003] have developed hardware-level optimizations for different sources of energy consumption in an on-chip network. One of the alternative network topologies they propose is the segmented crossbar topology, which aims to reduce the energy consumption of data transfers. When data is transferred over a regular network grid, all rows and columns corresponding to the intersection points end up switching current even though only parts of these rows and columns are actually traversed. To eliminate this widespread current switching, the segmented crossbar topology divides the rows and columns into segments demarcated by tristate buffers. The different segments are selectively activated as the data traverses them, hence confining current switches to only those segments along which the data actually passes.

3.3. Low-Power Memories and Memory Hierarchies

   3.3.1. Types of Memory. One can classify memory structures into two categories, Random Access Memories (RAM) and Read-Only Memories (ROM). There are two kinds of RAMs, static RAMs (SRAM) and dynamic RAMs (DRAM), which differ in how they store data. SRAMs store data using flip-flops, while DRAMs store each bit of data as a charge on a capacitor; thus DRAMs need to refresh their data periodically. SRAMs allow faster accesses than DRAMs but require more area and are more expensive. As a result, normally only register files, caches, and high bandwidth parts of the system are made up of SRAM cells, while main memory is made up of DRAM cells. Although DRAM cells have slower access times than SRAMs, they contain fewer transistors and are less expensive than SRAM cells.

   The techniques we will introduce are not confined to any specific type of RAM or ROM. Rather, they are high-level architectural principles that apply across the spectrum of memories to the extent that the required technology is available. They attempt to reduce the energy dissipation of memory accesses in two ways: either by reducing the energy dissipated in a memory access, or by reducing the number of memory accesses.
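These two levers combine in a short expected-value calculation. The sketch below, with per-access energies we invented purely for illustration, models placing a small, cheap memory in front of a larger, costlier one: the arrangement pays off only when the small memory's hit rate is high enough to offset the extra probe on misses.

```python
# Toy expected-energy model for a small memory fronting a larger one.
# A hit costs only the small-memory access; a miss costs the small-memory
# probe plus the large-memory access. Energy values are invented units.

E_SMALL, E_LARGE = 1.0, 4.0

def avg_energy(hit_rate):
    """Expected energy per reference for a given small-memory hit rate."""
    return E_SMALL + (1.0 - hit_rate) * E_LARGE

good_case = avg_energy(0.9)   # 1.0 + 0.1 * 4.0 = 1.4, well under E_LARGE
bad_case = avg_energy(0.0)    # 5.0: every reference pays for both accesses
```

The break-even hit rate in this model is E_SMALL / E_LARGE; below it, the extra structure costs more energy than it saves.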
   3.3.2. Splitting Memories Into Smaller Subsystems. An effective way to reduce the energy that is dissipated in a memory access is to activate only the needed memory circuits in each access. One way to expose these circuits is to partition memories into smaller, independently accessible components. This can be done at different granularities.

   At the lowest granularity, one can design a main memory system that is split up into multiple banks, each of which can independently transition into different power modes. Compilers and operating systems can cluster data into a minimal set of banks, allowing the other, unused banks to power down and thereby save energy [Luz et al. 2002; Lebeck et al. 2000].

   At a higher granularity, one can partition the separate banks of a partitioned memory into subbanks and activate only the relevant subbank in every memory access. This technique has been applied to caches [Ghose and Kamble 1999]. In a normal cache access, the set-selection step involves transferring all blocks in all banks onto tag-comparison latches. Since the requested word is at a known offset in these blocks, energy can be saved by transferring only the words at that offset. Cache subbanking splits the block array of each cache line into multiple banks. During set-selection, only the relevant bank from each line is active. Since a single subbank is active at a time, all subbanks of a line share the same output latches and sense amplifiers to reduce hardware complexity. Subbanking saves energy without increasing memory access time. Moreover, it is independent of program locality patterns.

   3.3.3. Augmenting the Memory Hierarchy With Specialized Cache Structures. The second way to reduce energy dissipation in the memory hierarchy is to reduce the number of memory hierarchy accesses. One group of researchers [Kin et al. 1997] developed a simple but effective technique to do this. Their idea was to integrate a specialized cache into the typical memory hierarchy of a modern processor, a hierarchy that may already contain one or more levels of caches. The new cache they were proposing would sit between the processor and the first level cache. It would be smaller than the first level cache and hence dissipate less energy. Yet it would be large enough to store an application's working set and filter out many memory references; only when a data request misses this cache would it require searching higher levels of cache, and the penalty for a miss would be offset by redirecting more memory references to this smaller, more energy efficient cache. This was the principle behind the filter cache. Though deceptively simple, this idea lies at the heart of many of the low-power cache designs used today, ranging from simple extensions of the cache hierarchy (e.g., block buffering) to the scratch pad memories used in embedded systems and the complex trace caches used in recent high-end processors.

   One simple extension is the technique of block buffering. This technique augments caches with small buffers to store the most recently accessed cache set. If a data item that has been requested belongs to a cache set that is already in the buffer, then one does not have to search for it in the rest of the cache.

   In embedded systems, scratch pad memories [Panda et al. 2000; Banakar et al. 2002; Kandemir et al. 2002] have been the extension of choice. These are software controlled memories that reside on-chip. Unlike caches, the data they should contain is determined ahead of time and remains fixed while programs execute. These memories are less volatile than conventional caches and can be accessed within a single cycle. They are ideal for specialized embedded systems whose applications are often hand tuned.

   General purpose processors of the latest generation sometimes rely on more complex caching techniques such as the trace cache. Trace caches were originally developed for performance but have been studied recently for their power benefits. Instead of storing instructions in their compiled order, a trace cache stores traces of instructions in their executed order. If an instruction sequence is already in the trace cache, then it need not be fetched from the instruction cache
Power Reduction Techniques for Microprocessor Systems                                            209

but can be decoded directly from the trace cache. This saves power by reducing the number of instruction cache accesses.

In the original trace cache design, both the trace cache and the instruction cache are accessed in parallel on every clock cycle. Thus instructions are fetched from the instruction cache while the trace cache is searched for the next trace that is predicted to execute. If the trace is in the trace cache, then the instructions fetched from the instruction cache are discarded. Otherwise the program execution proceeds out of the instruction cache and simultaneously new traces are created from the instructions that are fetched.

Low-power designs for trace caches typically aim to reduce the number of instruction cache accesses. One alternative, the dynamic direction prediction-based trace cache [Hu et al. 2003], uses branch prediction to decide where to fetch instructions from. If the branch predictor predicts the next trace with high confidence and that trace is in the trace cache, then the instructions are fetched from the trace cache rather than the instruction cache.

The selective trace cache [Hu et al. 2002] extends this technique with compiler support. The idea is to identify frequently executed, or hot, traces and bring these traces into the trace cache. The compiler inserts hints after every branch instruction, indicating whether it belongs to a hot trace. At runtime, these hints cause instructions to be fetched from either the trace cache or the instruction cache.

3.4. Low-Power Processor Architecture Adaptations

So far we have described the energy saving features of hardware as though they were a fixed foundation upon which programs execute. However, programs exhibit wide variations in behavior. Researchers have been developing hardware structures whose parameters can be adjusted on demand so that one can save energy by activating just the minimum hardware resources needed for the code that is executing.

3.4.1. Adaptive Caches. There is a wealth of literature on adaptive caches, caches whose storage elements (lines, blocks, or sets) can be selectively activated based on the application workload. One example of such a cache is the Dynamically Resizable Instruction (DRI) cache [Powell et al. 2001]. This cache permits one to deactivate its individual sets on demand by gating their supply voltages. To decide what sets to activate at any given time, the cache uses a hardware profiler that monitors the application’s cache-miss patterns. Whenever the cache misses exceed a threshold, the DRI cache activates previously deactivated sets. Likewise, whenever the miss rate falls below a threshold, the DRI cache deactivates some of these sets by inhibiting their supply voltages.

A problem with this approach is that dormant memory cells lose data and need more time to be reactivated for their next use. An alternative to inhibiting their supply voltages is thus to reduce their voltages as low as possible without losing data. This is the aim of the drowsy cache, a cache whose lines can be placed in a drowsy mode where they dissipate minimal power but retain data and can be reactivated faster.

Fig. 13. Drowsy cache line.

Figure 13 illustrates a typical drowsy cache line [Kim et al. 2002]. The wakeup signal chooses between voltages for the active, drowsy, and off states. It also regulates the word decoder’s output to prevent data in a drowsy line from being accessed. Before a cache access, the line circuitry must be precharged. In a


traditional cache, the precharger is continuously on. To save power, the drowsy circuitry regulates the wakeup signal with a precharge signal. The precharger becomes active only when the line is woken up.

Many heuristics have been developed for controlling these drowsy circuits. The simplest policy for controlling the drowsy cache is cache decay [Kaxiras et al. 2001], which simply turns off unused cache lines after a fixed interval. More sophisticated policies [Zhang et al. 2002; 2003; Hu et al. 2003] consider information about runtime program behavior. For example, Hu et al. [2003] have proposed hardware that detects when an execution moves into a hotspot and activates only those cache lines that lie within that hotspot. The main idea of their approach is to count how often different branches are taken. When a branch is taken sufficiently many times, its target address is a hotspot, and the hardware sets special bits to indicate that cache lines within that hotspot should not be shut down. Periodically, a runtime routine disables cache lines that do not have these bits set and decays the profiling data. Whenever it reactivates a cache line before its next use, it also reactivates the line that corresponds to the next memory access.

There are many other heuristics for controlling cache lines. Dead-block elimination [Kabadi et al. 2002] powers down cache lines containing basic blocks that have reached their final use. It is a compiler-directed technique that requires a static control flow analysis to identify these blocks. Consider the control flow graph in Figure 14. Since block B1 executes only once, its cache lines can power down once control reaches B2. Similarly, the lines containing B2 and B3 can power down once control reaches B4.

Fig. 14. Dead block elimination.

3.4.2. Adaptive Instruction Queues. Buyuktosunoglu et al. [2001] developed one of the first adaptive instruction issue queues, a 32-entry queue consisting of four equal-size partitions that can be independently activated. Each partition consists of wakeup logic that decides when instructions are ready to execute and readout logic that dispatches ready instructions into the pipeline. At any time, only the partitions that contain the currently executing instructions are activated.

There have been many heuristics developed for configuring these queues. Some require measuring the rate at which instructions are executed per processor clock cycle, or IPC. For example, Buyuktosunoglu et al.’s heuristic [2001] periodically measures the IPC over fixed length intervals. If the IPC of the current interval is a factor smaller than that of the previous interval (indicating worse performance), then the heuristic increases the instruction queue size to increase the throughput. Otherwise it may decrease the queue size.

Bahar and Manne [2001] propose a similar heuristic targeting a simulated 8-wide issue processor that can be reconfigured as a 6-wide or 4-wide issue processor. They also measure IPC over fixed intervals but measure the floating point IPC separately from the integer IPC since the former is more critical for floating point applications. Their heuristic, like Buyuktosunoglu et al.’s [2001], compares the measured IPC with specific thresholds to determine when to downsize the issue queue.

Folegnani and Gonzalez [2001], on the other hand, use a different heuristic. They find that when a program has little parallelism, the youngest part of the issue queue (which contains the most recent instructions) contributes very little to the overall IPC, in that instructions in this part are often committed late. Thus their heuristic monitors the contribution of the youngest part of this queue and deactivates this part when its contribution is minimal.

3.4.3. Algorithms for Reconfiguring Multiple Structures. In addition to this work,



researchers have also been tackling the problem of reconfiguring multiple hardware units simultaneously.

One group [Hughes et al. 2001; Sasanka et al. 2002; Hughes and Adve 2004] has been developing heuristics for combining hardware adaptations with dynamic voltage scaling during multimedia applications. Their strategy is to run two algorithms while individual video frames are being processed. A global algorithm chooses the initial DVS setting and baseline hardware configuration for each frame, while a local algorithm tunes various hardware parameters (e.g., instruction window sizes) while the frame is executing to use up any remaining slack periods.

Another group [Iyer and Marculescu 2001, 2002a] has developed heuristics that adjust the pipeline width and register update unit (RUU) size for different hotspots or frequently executed program regions. To detect these hotspots, their technique counts the number of times each branch instruction is taken. When a branch is taken enough times, it is marked as a candidate branch.

To detect how often candidate branches are taken, the heuristic uses a hotspot detection counter. Initially the hotspot counter contains a positive integer. Each time a candidate branch is taken, the counter is decremented, and each time a noncandidate branch is taken, the counter is incremented. If the counter becomes zero, this means that the program execution is inside a hotspot.

Once the program execution is in a hotspot, this heuristic tests the energy consumption of different processor configurations at fixed length intervals. It finally chooses the most energy-saving configuration as the one to use the next time the same hotspot is entered. To measure energy at runtime, the heuristic relies on hardware that monitors usage statistics of the most power-hungry units and calculates the total power based on energy per access values.

Huang et al. [2000; 2003] apply hardware adaptations at the granularity of subroutines. To decide what adaptations to apply, they use offline profiling to plot an energy-delay tradeoff curve. Each point in the curve represents the energy and delay tradeoff of applying an adaptation to a subroutine. There are three regions in the curve. The first includes adaptations that save energy without impairing performance; these are the adaptations that will always be applied. The third represents adaptations that worsen both performance and energy; these are the adaptations that will never be applied. Between these two extremes are adaptations that trade performance for energy savings; some of these may be applied depending on the tolerable performance loss. The main idea is to trace the curve from the origin and apply every adaptation that one encounters until the cumulative performance loss from applying all the encountered adaptations reaches the maximum tolerable performance loss.

Ponomarev et al. [2001] develop heuristics for controlling a configurable issue queue, a configurable reorder buffer, and a configurable load/store queue. Unlike the IPC-based heuristics we have seen, this heuristic is occupancy-based because the occupancy of hardware structures indicates how congested these structures are, and congestion is a leading cause of performance loss. Whenever a hardware structure (e.g., instruction queue) has low occupancy, the heuristic reduces its size. In contrast, if the structure is completely full for a long time, the heuristic increases its size to avoid congestion.

Another group of researchers [Albonesi et al. 2003; Dropsho et al. 2002] explore the complex problem of configuring a processor whose adaptable components include two levels of caches, integer and floating point issue queues, load/store queues, register files, as well as a reorder buffer. To dodge the sheer explosion of possible configurations, they tune each component based only on its local usage statistics. They propose two heuristics for doing this, one for tuning caches and another for tuning buffers and register files.

The caches they consider are selective way caches; each of their multiple ways can independently be activated or


disabled. Each way has a most recently used (MRU) counter that is incremented every time that a cache search hits the way. The cache tuning heuristic samples this counter at fixed intervals to determine the number of hits in each way and the number of overall misses. Using these access statistics, the heuristic computes the energy and performance overheads for all possible cache configurations and chooses the best configuration dynamically.

The heuristic for controlling other structures such as issue queues is occupancy-based. It measures how often different components of these structures fill up with data and uses this information to decide whether to upsize or downsize the structure.

3.5. Dynamic Voltage Scaling

Dynamic voltage scaling (DVS) addresses the problem of how to modulate a processor’s clock frequency and supply voltage in lockstep as programs execute. The premise is that a processor’s workloads vary and that when the processor has less work, it can be slowed down without affecting performance adversely. For example, if the system has only one task and it has a workload that requires 10 cycles to execute and a deadline of 100 seconds, the processor can slow down to 1/10 cycles/sec, saving power and meeting the deadline right on time. This is presumably more efficient than running the task at full speed and idling for the remainder of the period. Though it appears straightforward, this naive picture of how DVS works is highly simplified and hides serious real-world complexities.

3.5.1. Unpredictable Nature Of Workloads. The first problem is how to predict workloads with reasonable accuracy. This requires knowing what tasks will execute at any given time and the work required for these tasks. Two issues complicate this problem. First, tasks can be preempted at arbitrary times due to user and I/O device requests. Second, it is not always possible to accurately predict the future runtime of an arbitrary algorithm (cf. the Halting Problem [Turing 1937]). There have been ongoing efforts [Nilsen and Rygg 1995; Lim et al. 1995; Diniz 2003; Ishihara and Yasuura 1998] in the real-time systems community to accurately estimate the worst case execution times of programs. These attempts use either measurement-based prediction, static analysis, or a mixture of both. Measurement involves a learning lag whereby the execution times of various tasks are sampled and future execution times are extrapolated. The problem is that, when workloads are irregular, future execution times may not resemble past execution times. Microarchitectural innovations such as pipelining, hyperthreading, and out-of-order execution make it difficult to predict execution times in real systems statically. Among other things, a compiler needs to know how instructions are interleaved in the pipeline, what the probabilities are that different branches are taken, and when cache misses and pipeline hazards are likely to occur. This requires modeling the pipeline and memory hierarchy. A number of researchers [Ferdinand 1997; Clements 1996] have attempted to integrate complete pipeline and cache models into compilers. It remains nontrivial to develop models that can be used efficiently in a compiler and that capture all the complexities inherent in current systems.

3.5.2. Indeterminism And Anomalies In Real Systems. Even if the DVS algorithm predicts correctly what the processor’s workloads will be, determining how fast to run the processor is nontrivial. The nondeterminism in real systems removes any strict relationships between clock frequency, execution time, and power consumption. Thus most theoretical studies on voltage scaling are based on assumptions that may sound reasonable but are not guaranteed to hold in real systems.

One misconception is that total microprocessor system power is quadratic in supply voltage. Under the CMOS transistor model, the power dissipation of individual transistors is quadratic in their supply voltages, but there remains no


precise way of estimating the power dissipation of an entire system. One can construct scenarios where total system power is not quadratic in supply voltage. Martin and Siewiorek [2001] found that the StrongARM SA-1100 has two supply voltages, a 1.5V supply that powers the CPU and a 3.3V supply that powers the I/O pins. The total power dissipation is dominated by the larger supply voltage; hence, even if the smaller supply voltage were allowed to vary, it would not reduce power quadratically. Fan et al. [2003] found that, in some architectures where active memory banks dominate total system power consumption, reducing the supply voltage dramatically reduces CPU power dissipation but has a minuscule effect on total power. Thus, for full benefits, dynamic voltage scaling may need to be combined with resource hibernation to power down other energy-hungry parts of the chip.

Another misconception is that it is most power efficient to run a task at the slowest constant speed that allows it to exactly meet its deadline. Several theoretical studies [Ishihara and Yasuura 1998] attempt to prove this claim. These proofs rest on idealistic assumptions such as “power is a convex linearly increasing function of frequency”, assumptions that ignore how DVS affects the system as a whole. When the processor slows down, peripheral devices may remain activated longer, consuming more power. An important study of this issue was done by Fan et al. [2003], who show that, for specific DRAM architectures, the energy versus slowdown curve is “U” shaped. As the processor slows down, CPU energy decreases but the cumulative energy consumed in active memory banks increases. Thus the optimal speed is actually higher than the processor’s lowest speed; any speed lower than this causes memory energy dissipation to overshadow the effects of DVS.

A third misconception is that execution time is inversely proportional to clock frequency. It is actually an open problem how slowing down the processor affects the execution time of an arbitrary application. DVS may result in nonlinearities [Buttazzo 2002]. Consider a system with two tasks (Figure 15). When the processor slows down (a), Task One takes longer to enter its critical section and is thus preempted by Task Two as soon as Task Two is released into the system. When the processor instead runs at maximum speed (b), Task One enters its critical section earlier and blocks Task Two until it finishes this section. This hypothetical example suggests that DVS can affect the order in which tasks are executed. But the order in which tasks execute affects the state of the cache which, in turn, affects performance. The exact relationship between the cache state and execution time is not clear. Thus, it is too simplistic to assume that execution time is inversely proportional to clock frequency.

Fig. 15. Scheduling effects. When DVS is applied (a), task one takes a longer time to enter its critical section, and task two is able to preempt it right before it does. When DVS is not applied (b), task one enters its critical section sooner and thus prevents task two from preempting it until it finishes its critical section. This figure has been adapted from Buttazzo [2002].

All of these issues show that theoretical studies are insufficient for understanding how DVS will affect system state. One needs to develop an experimental approach driven by heuristics and evaluate the tradeoffs of these heuristics empirically.
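The “U”-shaped energy curve reported by Fan et al. [2003] can be made concrete with a toy model. This is only a sketch: the constants k_cpu, p_mem_watts, and the cycle count below are hypothetical, chosen solely to expose the shape of the curve. CPU energy per cycle falls roughly quadratically as frequency and voltage drop together, while the fixed power of active memory banks is paid over the entire, lengthening run time.

```python
# Toy model of the "U"-shaped energy-versus-slowdown curve (hypothetical
# constants). CPU dynamic energy per cycle scales ~f^2 when voltage tracks
# frequency; active memory banks draw constant power for the whole run time.

def total_energy(f_hz, cycles=1e9, k_cpu=1e-27, p_mem_watts=0.5):
    """Total system energy (joules) for a run of `cycles` at frequency f_hz."""
    run_time = cycles / f_hz              # slowing down stretches the run
    e_cpu = k_cpu * f_hz ** 2 * cycles    # dynamic CPU energy, ~V^2 ~ f^2 per cycle
    e_mem = p_mem_watts * run_time        # memory energy grows as the clock drops
    return e_cpu + e_mem

freqs = [n * 1e8 for n in range(2, 21)]   # candidate speeds: 200 MHz .. 2 GHz
best = min(freqs, key=total_energy)       # energy-optimal frequency

# The minimum lies strictly inside the range: running at the lowest speed
# lets memory energy overshadow the CPU savings, as the survey describes.
assert min(freqs) < best < max(freqs)
```

With these particular constants the optimum falls near 600 MHz; the number itself is meaningless, but the interior minimum illustrates the survey’s point that the energy-optimal speed can sit well above the processor’s lowest speed.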


Most of the existing DVS approaches can be classified as interval-based approaches, intertask approaches, or intratask approaches.

3.5.3. Interval-Based Approaches. Interval-based DVS algorithms measure how busy the processor is over some interval or intervals, estimate how busy it will be in the next interval, and adjust the processor speed accordingly. These algorithms differ in how they estimate future processor utilization. Weiser et al. [1994] developed one of the earliest interval-based algorithms. This algorithm, Past, periodically measures how long the processor idles. If it idles longer than a threshold, the algorithm slows it down; if it remains busy longer than a threshold, it speeds it up. Since this approach makes decisions based only on the most recent interval, it is error prone. It is also prone to thrashing since it computes a CPU speed for every interval. Govil et al. [1995] examine a number of algorithms that extend Past by considering measurements collected over larger windows of time. The Aged Averages algorithm, for example, estimates the CPU usage for the upcoming interval as a weighted average of usages measured over all previous intervals. These algorithms save more energy than Past since they base their decisions on more information. However, they all assume that workloads are regular. As we have mentioned, it is very difficult, if not impossible, to predict irregular workloads using history information alone.

3.5.4. Intertask Approaches. Intertask DVS algorithms assign different speeds for different tasks. These speeds remain fixed for the duration of each task’s execution. Weissel and Bellosa [2002] have developed an intertask algorithm that uses hardware events as the basis for choosing the clock frequencies for different processes. The motivation is that, as processes execute, they generate various hardware events (e.g., cache misses, IPC) at varying rates that relate to their performance and energy dissipation. Weissel and Bellosa’s technique involves monitoring these event rates using hardware performance counters and attributing them to the processes that are executing. At each context switch, the heuristic adjusts the clock frequency for the process being activated based on its previously measured event rates. To do this, the heuristic refers to a previously constructed table of event rate combinations, divided into frequency domains. Weissel and Bellosa construct this table through a detailed offline simulation where, for different combinations of event rates, they determine the lowest clock frequency that does not violate a user-specified threshold for performance loss.

Intertask DVS has two drawbacks. First, task workloads are usually unknown until tasks finish running. Thus traditional algorithms either assume perfect knowledge of these workloads or estimate future workloads based on prior workloads. When workloads are irregular, estimating them is nontrivial. Flautner et al. [2001] address this problem by classifying all workloads as interactive episodes and producer-consumer episodes and using separate DVS algorithms for each. For interactive episodes, the clock frequency used depends not only on the predicted workload but also on a perception threshold that estimates how fast the episode would have to run to be acceptable to a user. To recover from prediction error, the algorithm also includes a panic threshold. If a session lasts longer than this threshold, the processor immediately switches to full speed.

The second drawback of intertask approaches is that they are unaware of program structure. A program’s structure may provide insight into the work required for it, insight that, in turn, may provide opportunities for techniques such as voltage scaling. Within specific program regions, for example, the processor may spend significant time waiting for memory or disk accesses to complete. During the execution of these regions, the processor can be slowed down to meet pipeline stall delays.
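An event-rate-driven intertask policy in the spirit of Weissel and Bellosa [2002] can be sketched in a few lines. This is only a sketch: the thresholds and frequency values below are hypothetical, whereas the real table is produced by detailed offline simulation over many event-rate combinations.

```python
# Sketch of an intertask, event-rate-driven frequency chooser (in the spirit
# of Weissel and Bellosa [2002]). The table is hypothetical: it maps a memory
# accesses-per-instruction rate to the lowest clock frequency assumed not to
# exceed a tolerable performance loss.

FREQ_TABLE = [        # (minimum memory-accesses-per-instruction, frequency in MHz)
    (0.10, 200),      # memory bound: a slow clock costs little extra time
    (0.05, 400),
    (0.00, 600),      # compute bound: stay fast
]

def frequency_for(process_counters):
    """Pick a clock frequency at context-switch time from a process's counters."""
    mem_rate = process_counters["mem_accesses"] / process_counters["instructions"]
    for threshold, mhz in FREQ_TABLE:
        if mem_rate >= threshold:
            return mhz
    return FREQ_TABLE[-1][1]

# A memory-bound process is clocked down; a compute-bound one is not.
assert frequency_for({"mem_accesses": 200, "instructions": 1000}) == 200
assert frequency_for({"mem_accesses": 10, "instructions": 1000}) == 600
```

The design point worth noting is that the decision is made per process at context-switch time, using counters the hardware already maintains, so the policy itself adds almost no runtime overhead.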


3.5.5. Intratask Approaches. Intratask approaches [Lee and Sakurai 2000; Lorch and Smith 2001; Gruian 2001; Yuan and Nahrstedt 2003] adjust the processor speed and voltage within tasks. Lee and Sakurai [2000] developed one of the earliest approaches. Their approach, run-time voltage hopping, splits each task into fixed length timeslots. The algorithm assigns each timeslot the lowest speed that allows it to complete within its preferred execution time, which is measured as the worst case execution time minus the elapsed execution time up to the current timeslot. An issue not considered in this work is how to divide a task into timeslots. Moreover, the algorithm is pessimistic since tasks could finish before their worst case execution time. It is also insensitive to program structure.

There are many variations on this idea. Two operating system-level intratask algorithms are PACE [Lorch and Smith 2001] and Stochastic DVS [Gruian 2001]. These algorithms choose a speed for every cycle of a task’s execution based on the probability distribution of the task workload measured over previous cycles. The difference between PACE and Stochastic DVS lies in their cost functions. Stochastic DVS assumes that energy is proportional to the square of the supply voltage, while PACE assumes it is proportional to the square of the clock frequency. These simplifying assumptions are not guaranteed to hold in real systems.

Dudani et al. [2002] have also developed intratask policies at the operating system level. Their work aims to combine intratask voltage scaling with the earliest-deadline-first scheduling algorithm. Dudani et al. split each program that the scheduler chooses to execute into two subprograms. The first subprogram runs at the maximum clock frequency, while the second runs slow enough to keep the combined execution time of both subprograms below the average execution time for the whole program.

the first of these algorithms. They noticed that programs have multiple execution paths, some more time consuming than others, and that whenever control flows away from a critical path, there are opportunities for the processor to slow down and still finish the programs within their deadlines. Based on this observation, Shin and Kim developed a tool that profiles a program offline to determine worst case and average cycles for different execution paths, and then inserts instructions to change the processor frequency at the start of different paths based on this information.

Another approach, program checkpointing [Azevedo et al. 2002], annotates a program with checkpoints and assigns timing constraints for executing code between checkpoints. It then profiles the program offline to determine the average number of cycles between different checkpoints. Based on this profiling information and timing constraints, it adjusts the CPU speed at each checkpoint.

Perhaps the best known approach is by Hsu and Kremer [2003]. They profile a program offline with respect to all possible combinations of clock frequencies assigned to different regions. They then build a table describing how each combination affects execution time and power consumption. Using this table, they select the combination of regions and frequencies that saves the most power without increasing runtime beyond a threshold.

3.5.6. The Implications of Memory Bounded Code. Memory bounded applications have become an attractive target for DVS algorithms because the time for a memory access, on most commercial systems, is independent of how fast the processor is running. When frequent memory accesses dominate the program execution time, they limit how fast the program can finish executing. Because of this “memory wall”, the processor can actually run slower
  In addition to these schemes, there                    and save a lot of energy without losing
have been a number of intratask poli-                    as much performance as it would if it
cies implemented at the compiler level.                  were slowing down a compute-intensive
Shin and Kim [2001] developed one of                     code.
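The memory-wall tradeoff can be made concrete with a small sketch. This is not taken from any of the surveyed systems; the function name, the discrete frequency list, and the split of execution into CPU cycles versus memory-stall time are illustrative assumptions. It picks the lowest clock frequency whose slowdown stays within a tolerance, treating memory-stall time as independent of CPU frequency:

```python
def pick_frequency(cpu_cycles, mem_stall_s, freqs_hz, max_slowdown=0.2):
    """Return the lowest frequency that keeps runtime within
    (1 + max_slowdown) of the runtime at the fastest frequency.

    Memory-stall time is assumed frequency-independent, which is why
    memory-bounded code tolerates aggressive frequency scaling.
    """
    f_max = max(freqs_hz)
    t_best = cpu_cycles / f_max + mem_stall_s       # runtime at full speed
    for f in sorted(freqs_hz):                      # try slowest first
        t = cpu_cycles / f + mem_stall_s
        if t <= t_best * (1 + max_slowdown):
            return f
    return f_max

# Memory-bounded task: 10^8 CPU cycles but 0.9 s stalled on memory.
# Half speed (5e8 Hz) already meets the 20% slowdown budget.
print(pick_frequency(1e8, 0.9, [2e8, 5e8, 1e9]))

# The same cycle count with no memory stalls is compute-bound, so
# only the fastest frequency stays within the budget.
print(pick_frequency(1e8, 0.0, [2e8, 5e8, 1e9]))
```

The contrast between the two calls is the point of Section 3.5.6: the larger the frequency-independent stall component, the deeper the processor can be slowed for little performance cost.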

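The per-timeslot speed selection of run-time voltage hopping, described at the start of this section, can be sketched as follows. The function name, the discrete speed set, and the workload numbers are hypothetical; the policy itself is as described: each timeslot gets the lowest speed that finishes its work within the worst-case time remaining.

```python
def timeslot_speed(slot_cycles, wcet_s, elapsed_s, speeds_hz):
    """Lowest available speed that finishes `slot_cycles` within the
    slack (worst-case execution time minus time already elapsed)."""
    slack = wcet_s - elapsed_s
    for f in sorted(speeds_hz):          # try slowest speed first
        if slot_cycles / f <= slack:
            return f
    return max(speeds_hz)                # no slack left: run flat out

# 2e8 cycles remain in this slot, with a 1.0 s worst case and 0.5 s
# already elapsed, so 4e8 Hz is the lowest speed that still fits.
print(timeslot_speed(2e8, 1.0, 0.5, [2e8, 4e8, 8e8]))
```

Note how the sketch exhibits the pessimism the text criticizes: the slack is computed from the worst-case execution time, so a task that would finish early is still run faster than necessary.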
ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors
Power reductionofmicroprocessors

Contenu connexe

Tendances

Loss Estimation: A Load Factor Method
Loss Estimation: A Load Factor MethodLoss Estimation: A Load Factor Method
Loss Estimation: A Load Factor MethodIDES Editor
 
A new simplified approach for optimum allocation of a distributed generation
A new simplified approach for optimum allocation of a distributed generationA new simplified approach for optimum allocation of a distributed generation
A new simplified approach for optimum allocation of a distributed generationIAEME Publication
 
Modeling Analysis& Solution of Power Quality Problems Using DVR & DSTATCOM
Modeling Analysis& Solution of Power Quality Problems Using DVR & DSTATCOMModeling Analysis& Solution of Power Quality Problems Using DVR & DSTATCOM
Modeling Analysis& Solution of Power Quality Problems Using DVR & DSTATCOMijsrd.com
 
Achieving Energy Proportionality In Server Clusters
Achieving Energy Proportionality In Server ClustersAchieving Energy Proportionality In Server Clusters
Achieving Energy Proportionality In Server ClustersCSCJournals
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Report angel fern_ndez_sarabia
Report angel fern_ndez_sarabiaReport angel fern_ndez_sarabia
Report angel fern_ndez_sarabia.AHMED AMER
 
Voltage Profile Improvement of distribution system Using Particle Swarm Optim...
Voltage Profile Improvement of distribution system Using Particle Swarm Optim...Voltage Profile Improvement of distribution system Using Particle Swarm Optim...
Voltage Profile Improvement of distribution system Using Particle Swarm Optim...IJERA Editor
 
An Improved Particle Swarm Optimization for Proficient Solving of Unit Commit...
An Improved Particle Swarm Optimization for Proficient Solving of Unit Commit...An Improved Particle Swarm Optimization for Proficient Solving of Unit Commit...
An Improved Particle Swarm Optimization for Proficient Solving of Unit Commit...IDES Editor
 
Modeling Simulation and Design of Photovoltaic Array with MPPT Control Techni...
Modeling Simulation and Design of Photovoltaic Array with MPPT Control Techni...Modeling Simulation and Design of Photovoltaic Array with MPPT Control Techni...
Modeling Simulation and Design of Photovoltaic Array with MPPT Control Techni...IJAPEJOURNAL
 
New solutions for optimization of the electrical distribution system availabi...
New solutions for optimization of the electrical distribution system availabi...New solutions for optimization of the electrical distribution system availabi...
New solutions for optimization of the electrical distribution system availabi...Mohamed Ghaieth Abidi
 
Calculating total power
Calculating total powerCalculating total power
Calculating total powerjeet69
 
Battery Aware Dynamic Scheduling for Periodic Task Graphs
Battery Aware Dynamic Scheduling for Periodic Task GraphsBattery Aware Dynamic Scheduling for Periodic Task Graphs
Battery Aware Dynamic Scheduling for Periodic Task GraphsNicolas Navet
 
Facility Optimization
Facility OptimizationFacility Optimization
Facility Optimizationcwoodson
 
Evaluating the Opportunity for DC Power in the Data Center
Evaluating the Opportunity for DC Power in the Data CenterEvaluating the Opportunity for DC Power in the Data Center
Evaluating the Opportunity for DC Power in the Data Centermmurrill
 
Interplay of Communication and Computation Energy Consumption for Low Power S...
Interplay of Communication and Computation Energy Consumption for Low Power S...Interplay of Communication and Computation Energy Consumption for Low Power S...
Interplay of Communication and Computation Energy Consumption for Low Power S...ijasuc
 
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...csandit
 

Tendances (20)

Reliability Assesment of Debremarkos Distribution System Found In Ethiopia
Reliability Assesment of Debremarkos Distribution System Found In EthiopiaReliability Assesment of Debremarkos Distribution System Found In Ethiopia
Reliability Assesment of Debremarkos Distribution System Found In Ethiopia
 
Loss Estimation: A Load Factor Method
Loss Estimation: A Load Factor MethodLoss Estimation: A Load Factor Method
Loss Estimation: A Load Factor Method
 
A new simplified approach for optimum allocation of a distributed generation
A new simplified approach for optimum allocation of a distributed generationA new simplified approach for optimum allocation of a distributed generation
A new simplified approach for optimum allocation of a distributed generation
 
Modeling Analysis& Solution of Power Quality Problems Using DVR & DSTATCOM
Modeling Analysis& Solution of Power Quality Problems Using DVR & DSTATCOMModeling Analysis& Solution of Power Quality Problems Using DVR & DSTATCOM
Modeling Analysis& Solution of Power Quality Problems Using DVR & DSTATCOM
 
Achieving Energy Proportionality In Server Clusters
Achieving Energy Proportionality In Server ClustersAchieving Energy Proportionality In Server Clusters
Achieving Energy Proportionality In Server Clusters
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Report angel fern_ndez_sarabia
Report angel fern_ndez_sarabiaReport angel fern_ndez_sarabia
Report angel fern_ndez_sarabia
 
Voltage Profile Improvement of distribution system Using Particle Swarm Optim...
Voltage Profile Improvement of distribution system Using Particle Swarm Optim...Voltage Profile Improvement of distribution system Using Particle Swarm Optim...
Voltage Profile Improvement of distribution system Using Particle Swarm Optim...
 
Be4102404413
Be4102404413Be4102404413
Be4102404413
 
An Improved Particle Swarm Optimization for Proficient Solving of Unit Commit...
An Improved Particle Swarm Optimization for Proficient Solving of Unit Commit...An Improved Particle Swarm Optimization for Proficient Solving of Unit Commit...
An Improved Particle Swarm Optimization for Proficient Solving of Unit Commit...
 
M01052109115
M01052109115M01052109115
M01052109115
 
Modeling Simulation and Design of Photovoltaic Array with MPPT Control Techni...
Modeling Simulation and Design of Photovoltaic Array with MPPT Control Techni...Modeling Simulation and Design of Photovoltaic Array with MPPT Control Techni...
Modeling Simulation and Design of Photovoltaic Array with MPPT Control Techni...
 
New solutions for optimization of the electrical distribution system availabi...
New solutions for optimization of the electrical distribution system availabi...New solutions for optimization of the electrical distribution system availabi...
New solutions for optimization of the electrical distribution system availabi...
 
Calculating total power
Calculating total powerCalculating total power
Calculating total power
 
Battery Aware Dynamic Scheduling for Periodic Task Graphs
Battery Aware Dynamic Scheduling for Periodic Task GraphsBattery Aware Dynamic Scheduling for Periodic Task Graphs
Battery Aware Dynamic Scheduling for Periodic Task Graphs
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Facility Optimization
Facility OptimizationFacility Optimization
Facility Optimization
 
Evaluating the Opportunity for DC Power in the Data Center
Evaluating the Opportunity for DC Power in the Data CenterEvaluating the Opportunity for DC Power in the Data Center
Evaluating the Opportunity for DC Power in the Data Center
 
Interplay of Communication and Computation Energy Consumption for Low Power S...
Interplay of Communication and Computation Energy Consumption for Low Power S...Interplay of Communication and Computation Energy Consumption for Low Power S...
Interplay of Communication and Computation Energy Consumption for Low Power S...
 
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
 

En vedette

LeaderQuest SharePoint Business Intelligence Presentation
LeaderQuest SharePoint Business Intelligence PresentationLeaderQuest SharePoint Business Intelligence Presentation
LeaderQuest SharePoint Business Intelligence Presentationmbrinks
 
Chapter 11 the eu, multilateralism and competition with structural powers
Chapter 11   the eu, multilateralism and competition with structural powersChapter 11   the eu, multilateralism and competition with structural powers
Chapter 11 the eu, multilateralism and competition with structural powersluca_lapenna
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldabaux singapore
 

En vedette (6)

LeaderQuest SharePoint Business Intelligence Presentation
LeaderQuest SharePoint Business Intelligence PresentationLeaderQuest SharePoint Business Intelligence Presentation
LeaderQuest SharePoint Business Intelligence Presentation
 
Company/Product Overview
Company/Product OverviewCompany/Product Overview
Company/Product Overview
 
No-Sew Book Purse
No-Sew Book PurseNo-Sew Book Purse
No-Sew Book Purse
 
Chapter 11 the eu, multilateralism and competition with structural powers
Chapter 11   the eu, multilateralism and competition with structural powersChapter 11   the eu, multilateralism and competition with structural powers
Chapter 11 the eu, multilateralism and competition with structural powers
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
 
Succession “Losers”: What Happens to Executives Passed Over for the CEO Job?
Succession “Losers”: What Happens to Executives Passed Over for the CEO Job? Succession “Losers”: What Happens to Executives Passed Over for the CEO Job?
Succession “Losers”: What Happens to Executives Passed Over for the CEO Job?
 

Similaire à Power reductionofmicroprocessors

SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...VLSICS Design
 
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...VLSICS Design
 
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...VLSICS Design
 
Energy efficient-resource-allocation-in-distributed-computing-systems
Energy efficient-resource-allocation-in-distributed-computing-systemsEnergy efficient-resource-allocation-in-distributed-computing-systems
Energy efficient-resource-allocation-in-distributed-computing-systemsCemal Ardil
 
Artigo impacts of distributed
Artigo impacts of distributedArtigo impacts of distributed
Artigo impacts of distributedJuliane Soares
 
Feasibility study of pervasive computing
Feasibility study of pervasive computingFeasibility study of pervasive computing
Feasibility study of pervasive computingiaemedu
 
System Operational Challenges from the Energy Transition
System Operational Challenges from the Energy TransitionSystem Operational Challenges from the Energy Transition
System Operational Challenges from the Energy TransitionPower System Operation
 
Power System Problems in Teaching Control Theory on Simulink
Power System Problems in Teaching Control Theory on SimulinkPower System Problems in Teaching Control Theory on Simulink
Power System Problems in Teaching Control Theory on Simulinkijctcm
 
POWER SYSTEM PROBLEMS IN TEACHING CONTROL THEORY ON SIMULINK
POWER SYSTEM PROBLEMS IN TEACHING CONTROL THEORY ON SIMULINKPOWER SYSTEM PROBLEMS IN TEACHING CONTROL THEORY ON SIMULINK
POWER SYSTEM PROBLEMS IN TEACHING CONTROL THEORY ON SIMULINKijctcm
 
Power System Operational Challenges from the Energy Transition
Power System Operational Challenges from the Energy TransitionPower System Operational Challenges from the Energy Transition
Power System Operational Challenges from the Energy TransitionPower System Operation
 
Microcontroller based transformer protectio
Microcontroller based transformer protectioMicrocontroller based transformer protectio
Microcontroller based transformer protectioAminu Bugaje
 
11.optimization of loss minimization using facts in deregulated power systems
11.optimization of loss minimization using facts in deregulated power systems11.optimization of loss minimization using facts in deregulated power systems
11.optimization of loss minimization using facts in deregulated power systemsAlexander Decker
 
Optimization of loss minimization using facts in deregulated power systems
Optimization of loss minimization using facts in deregulated power systemsOptimization of loss minimization using facts in deregulated power systems
Optimization of loss minimization using facts in deregulated power systemsAlexander Decker
 
Energy Consumption Saving in Embedded Microprocessors Using Hardware Accelera...
Energy Consumption Saving in Embedded Microprocessors Using Hardware Accelera...Energy Consumption Saving in Embedded Microprocessors Using Hardware Accelera...
Energy Consumption Saving in Embedded Microprocessors Using Hardware Accelera...TELKOMNIKA JOURNAL
 
Optimal Unit Commitment Based on Economic Dispatch Using Improved Particle Sw...
Optimal Unit Commitment Based on Economic Dispatch Using Improved Particle Sw...Optimal Unit Commitment Based on Economic Dispatch Using Improved Particle Sw...
Optimal Unit Commitment Based on Economic Dispatch Using Improved Particle Sw...paperpublications3
 
H6 Handbook of Electrical Power System Dynamics Modeling, Stability and Contr...
H6 Handbook of Electrical Power System Dynamics Modeling, Stability and Contr...H6 Handbook of Electrical Power System Dynamics Modeling, Stability and Contr...
H6 Handbook of Electrical Power System Dynamics Modeling, Stability and Contr...EMERSON EDUARDO RODRIGUES
 
Power Systems Resilience Assessment: Hardening and Smart Operational Enhancem...
Power Systems Resilience Assessment: Hardening and Smart Operational Enhancem...Power Systems Resilience Assessment: Hardening and Smart Operational Enhancem...
Power Systems Resilience Assessment: Hardening and Smart Operational Enhancem...Power System Operation
 

Similaire à Power reductionofmicroprocessors (20)

SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
 
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
 
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
SURVEY ON POWER OPTIMIZATION TECHNIQUES FOR LOW POWER VLSI CIRCUIT IN DEEP SU...
 
Energy efficient-resource-allocation-in-distributed-computing-systems
Energy efficient-resource-allocation-in-distributed-computing-systemsEnergy efficient-resource-allocation-in-distributed-computing-systems
Energy efficient-resource-allocation-in-distributed-computing-systems
 
Artigo impacts of distributed
Artigo impacts of distributedArtigo impacts of distributed
Artigo impacts of distributed
 
Feasibility study of pervasive computing
Feasibility study of pervasive computingFeasibility study of pervasive computing
Feasibility study of pervasive computing
 
System Operational Challenges from the Energy Transition
System Operational Challenges from the Energy TransitionSystem Operational Challenges from the Energy Transition
System Operational Challenges from the Energy Transition
 
Power System Problems in Teaching Control Theory on Simulink
Power System Problems in Teaching Control Theory on SimulinkPower System Problems in Teaching Control Theory on Simulink
Power System Problems in Teaching Control Theory on Simulink
 
POWER SYSTEM PROBLEMS IN TEACHING CONTROL THEORY ON SIMULINK
POWER SYSTEM PROBLEMS IN TEACHING CONTROL THEORY ON SIMULINKPOWER SYSTEM PROBLEMS IN TEACHING CONTROL THEORY ON SIMULINK
POWER SYSTEM PROBLEMS IN TEACHING CONTROL THEORY ON SIMULINK
 
Power System Operational Challenges from the Energy Transition
Power System Operational Challenges from the Energy TransitionPower System Operational Challenges from the Energy Transition
Power System Operational Challenges from the Energy Transition
 
Computer Application in Power system chapter one - introduction
Computer Application in Power system chapter one - introductionComputer Application in Power system chapter one - introduction
Computer Application in Power system chapter one - introduction
 
Microcontroller based transformer protectio
Microcontroller based transformer protectioMicrocontroller based transformer protectio
Microcontroller based transformer protectio
 
Article
ArticleArticle
Article
 
Wind PV Hydrogen Ribeiro
Wind PV Hydrogen RibeiroWind PV Hydrogen Ribeiro
Wind PV Hydrogen Ribeiro
 
11.optimization of loss minimization using facts in deregulated power systems
11.optimization of loss minimization using facts in deregulated power systems11.optimization of loss minimization using facts in deregulated power systems
11.optimization of loss minimization using facts in deregulated power systems
 
Optimization of loss minimization using facts in deregulated power systems
Optimization of loss minimization using facts in deregulated power systemsOptimization of loss minimization using facts in deregulated power systems
Optimization of loss minimization using facts in deregulated power systems
 
Energy Consumption Saving in Embedded Microprocessors Using Hardware Accelera...
Energy Consumption Saving in Embedded Microprocessors Using Hardware Accelera...Energy Consumption Saving in Embedded Microprocessors Using Hardware Accelera...
Energy Consumption Saving in Embedded Microprocessors Using Hardware Accelera...
 
Optimal Unit Commitment Based on Economic Dispatch Using Improved Particle Sw...
Optimal Unit Commitment Based on Economic Dispatch Using Improved Particle Sw...Optimal Unit Commitment Based on Economic Dispatch Using Improved Particle Sw...
Optimal Unit Commitment Based on Economic Dispatch Using Improved Particle Sw...
 
H6 Handbook of Electrical Power System Dynamics Modeling, Stability and Contr...
H6 Handbook of Electrical Power System Dynamics Modeling, Stability and Contr...H6 Handbook of Electrical Power System Dynamics Modeling, Stability and Contr...
H6 Handbook of Electrical Power System Dynamics Modeling, Stability and Contr...
 
Power Systems Resilience Assessment: Hardening and Smart Operational Enhancem...
Power Systems Resilience Assessment: Hardening and Smart Operational Enhancem...Power Systems Resilience Assessment: Hardening and Smart Operational Enhancem...
Power Systems Resilience Assessment: Hardening and Smart Operational Enhancem...
 

Dernier

定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一lvtagr7
 
Call Girls SG Highway 7397865700 Ridhima Hire Me Full Night
Call Girls SG Highway 7397865700 Ridhima Hire Me Full NightCall Girls SG Highway 7397865700 Ridhima Hire Me Full Night
Call Girls SG Highway 7397865700 Ridhima Hire Me Full Nightssuser7cb4ff
 
The Fine Line Between Honest and Evil Comics by Salty Vixen
The Fine Line Between Honest and Evil Comics by Salty VixenThe Fine Line Between Honest and Evil Comics by Salty Vixen
The Fine Line Between Honest and Evil Comics by Salty VixenSalty Vixen Stories & More
 
(伦敦大学毕业证学位证成绩单-PDF版)
(伦敦大学毕业证学位证成绩单-PDF版)(伦敦大学毕业证学位证成绩单-PDF版)
(伦敦大学毕业证学位证成绩单-PDF版)twfkn8xj
 
Call Girls Ellis Bridge 7397865700 Independent Call Girls
Call Girls Ellis Bridge 7397865700 Independent Call GirlsCall Girls Ellis Bridge 7397865700 Independent Call Girls
Call Girls Ellis Bridge 7397865700 Independent Call Girlsssuser7cb4ff
 
Vip Udaipur Call Girls 9602870969 Dabok Airport Udaipur Escorts Service
Vip Udaipur Call Girls 9602870969 Dabok Airport Udaipur Escorts ServiceVip Udaipur Call Girls 9602870969 Dabok Airport Udaipur Escorts Service
Vip Udaipur Call Girls 9602870969 Dabok Airport Udaipur Escorts ServiceApsara Of India
 
5* Hotel Call Girls In Goa 7028418221 Call Girls In Calangute Beach Escort Se...
5* Hotel Call Girls In Goa 7028418221 Call Girls In Calangute Beach Escort Se...5* Hotel Call Girls In Goa 7028418221 Call Girls In Calangute Beach Escort Se...
5* Hotel Call Girls In Goa 7028418221 Call Girls In Calangute Beach Escort Se...Apsara Of India
 
Taken Pilot Episode Story pitch Document
Taken Pilot Episode Story pitch DocumentTaken Pilot Episode Story pitch Document
Taken Pilot Episode Story pitch Documentf4ssvxpz62
 
Real NO1 Amil baba in Faisalabad Kala jadu in faisalabad Aamil baba Faisalaba...
Real NO1 Amil baba in Faisalabad Kala jadu in faisalabad Aamil baba Faisalaba...Real NO1 Amil baba in Faisalabad Kala jadu in faisalabad Aamil baba Faisalaba...
Real NO1 Amil baba in Faisalabad Kala jadu in faisalabad Aamil baba Faisalaba...Amil Baba Company
 
Call Girl Contact Number Andheri WhatsApp:+91-9833363713
Call Girl Contact Number Andheri WhatsApp:+91-9833363713Call Girl Contact Number Andheri WhatsApp:+91-9833363713
Call Girl Contact Number Andheri WhatsApp:+91-9833363713Sonam Pathan
 
NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...
NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...
NO1 WorldWide Amil baba in pakistan Amil Baba in Karachi Black Magic Islamaba...Amil baba
 
Call Girls in Faridabad 9000000000 Faridabad Escorts Service
Call Girls in Faridabad 9000000000 Faridabad Escorts ServiceCall Girls in Faridabad 9000000000 Faridabad Escorts Service
Call Girls in Faridabad 9000000000 Faridabad Escorts ServiceTina Ji
 
Udaipur Call Girls 9602870969 Call Girl in Udaipur Rajasthan
Udaipur Call Girls 9602870969 Call Girl in Udaipur RajasthanUdaipur Call Girls 9602870969 Call Girl in Udaipur Rajasthan
Udaipur Call Girls 9602870969 Call Girl in Udaipur RajasthanApsara Of India
 
Vip Delhi Ncr Call Girls Best Services Available
Vip Delhi Ncr Call Girls Best Services AvailableVip Delhi Ncr Call Girls Best Services Available
Vip Delhi Ncr Call Girls Best Services AvailableKomal Khan
 
Gripping Adult Web Series You Can't Afford to Miss
Gripping Adult Web Series You Can't Afford to MissGripping Adult Web Series You Can't Afford to Miss
Gripping Adult Web Series You Can't Afford to Missget joys
 
QUIZ BOLLYWOOD ( weekly quiz ) - SJU quizzers
QUIZ BOLLYWOOD ( weekly quiz ) - SJU quizzersQUIZ BOLLYWOOD ( weekly quiz ) - SJU quizzers
QUIZ BOLLYWOOD ( weekly quiz ) - SJU quizzersSJU Quizzers
 
Call Girls Prahlad Nagar 9920738301 Ridhima Hire Me Full Night
Call Girls Prahlad Nagar 9920738301 Ridhima Hire Me Full NightCall Girls Prahlad Nagar 9920738301 Ridhima Hire Me Full Night
Call Girls Prahlad Nagar 9920738301 Ridhima Hire Me Full Nightssuser7cb4ff
 
1681275559_haunting-adeline and hunting.pdf
1681275559_haunting-adeline and hunting.pdf1681275559_haunting-adeline and hunting.pdf
1681275559_haunting-adeline and hunting.pdfTanjirokamado769606
 

Dernier (20)

定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
定制(UofT毕业证书)加拿大多伦多大学毕业证成绩单原版一比一
 
Call Girls SG Highway 7397865700 Ridhima Hire Me Full Night
Call Girls SG Highway 7397865700 Ridhima Hire Me Full NightCall Girls SG Highway 7397865700 Ridhima Hire Me Full Night
Call Girls SG Highway 7397865700 Ridhima Hire Me Full Night
 
The Fine Line Between Honest and Evil Comics by Salty Vixen
The Fine Line Between Honest and Evil Comics by Salty VixenThe Fine Line Between Honest and Evil Comics by Salty Vixen
The Fine Line Between Honest and Evil Comics by Salty Vixen
 
(伦敦大学毕业证学位证成绩单-PDF版)
(伦敦大学毕业证学位证成绩单-PDF版)(伦敦大学毕业证学位证成绩单-PDF版)
(伦敦大学毕业证学位证成绩单-PDF版)
 
Call Girls Ellis Bridge 7397865700 Independent Call Girls
Call Girls Ellis Bridge 7397865700 Independent Call GirlsCall Girls Ellis Bridge 7397865700 Independent Call Girls
Call Girls Ellis Bridge 7397865700 Independent Call Girls
 
Vip Udaipur Call Girls 9602870969 Dabok Airport Udaipur Escorts Service
Vip Udaipur Call Girls 9602870969 Dabok Airport Udaipur Escorts ServiceVip Udaipur Call Girls 9602870969 Dabok Airport Udaipur Escorts Service
Vip Udaipur Call Girls 9602870969 Dabok Airport Udaipur Escorts Service
 
5* Hotel Call Girls In Goa 7028418221 Call Girls In Calangute Beach Escort Se...
5* Hotel Call Girls In Goa 7028418221 Call Girls In Calangute Beach Escort Se...5* Hotel Call Girls In Goa 7028418221 Call Girls In Calangute Beach Escort Se...
5* Hotel Call Girls In Goa 7028418221 Call Girls In Calangute Beach Escort Se...
 
Taken Pilot Episode Story pitch Document
Taken Pilot Episode Story pitch DocumentTaken Pilot Episode Story pitch Document
Taken Pilot Episode Story pitch Document
 

Power Reduction of Microprocessors

Power Reduction Techniques For Microprocessor Systems

VASANTH VENKATACHALAM AND MICHAEL FRANZ
University of California, Irvine

Power consumption is a major factor that limits the performance of computers. We survey the “state of the art” in techniques that reduce the total power consumed by a microprocessor system over time. These techniques are applied at various levels ranging from circuits to architectures, architectures to system software, and system software to applications. They also include holistic approaches that will become more important over the next decade. We conclude that power management is a multifaceted discipline that is continually expanding with new techniques being developed at every level. These techniques may eventually allow computers to break through the “power wall” and achieve unprecedented levels of performance, versatility, and reliability. Yet it remains too early to tell which techniques will ultimately solve the power problem.

Categories and Subject Descriptors: C.5.3 [Computer System Implementation]: Microcomputers—Microprocessors; D.2.10 [Software Engineering]: Design—Methodologies; I.m [Computing Methodologies]: Miscellaneous

General Terms: Algorithms, Design, Experimentation, Management, Measurement, Performance

Additional Key Words and Phrases: Energy dissipation, power reduction

1. INTRODUCTION

Computer scientists have always tried to improve the performance of computers. But although today’s computers are much faster and far more versatile than their predecessors, they also consume a lot of power; so much power, in fact, that their power densities and concomitant heat generation are rapidly approaching levels comparable to nuclear reactors (Figure 1). These high power densities impair chip reliability and life expectancy, increase cooling costs, and, for large

Parts of this effort have been sponsored by the National Science Foundation under ITR grant CCR-0205712 and by the Office of Naval Research under grant N00014-01-1-0854. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and should not be interpreted as necessarily representing the official views, policies or endorsements, either expressed or implied, of the National Science Foundation (NSF), the Office of Naval Research (ONR), or any other agency of the U.S. Government. The authors also gratefully acknowledge gifts from Intel, Microsoft Research, and Sun Microsystems that partially supported this work.

Authors’ addresses: Vasanth Venkatachalam, School of Information and Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: vvenkata@uci.edu; Michael Franz, School of Information and Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: franz@uci.edu.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or permissions@acm.org.

© 2005 ACM 0360-0300/05/0900-0195 $5.00

ACM Computing Surveys, Vol. 37, No. 3, September 2005, pp. 195–237.
Fig. 1. Power densities rising. Figure adapted from Pollack 1999.

data centers, even raise environmental concerns. At the other end of the performance spectrum, power issues also pose problems for smaller mobile devices with limited battery capacities. Although one could give these devices faster processors and larger memories, this would diminish their battery life even further.

Without cost effective solutions to the power problem, improvements in microprocessor technology will eventually reach a standstill. Power management is a multidisciplinary field that involves many aspects (i.e., energy, temperature, reliability), each of which is complex enough to merit a survey of its own. The focus of our survey will be on techniques that reduce the total power consumed by typical microprocessor systems.

We will follow the high-level taxonomy illustrated in Figure 2. First, we will define power and energy and explain the complex parameters that dynamic and static power depend on (Section 2). Next, we will introduce techniques that reduce power and energy (Section 3), starting with circuit (Section 3.1) and architectural techniques (Section 3.2, Section 3.3, and Section 3.4), and then moving on to two techniques that are widely applied in hardware and software, dynamic voltage scaling (Section 3.5) and resource hibernation (Section 3.6). Third, we will examine what compilers can do to manage power (Section 3.7). We will then discuss recent work in application level power management (Section 3.8), and recent efforts (Section 3.9) to develop a holistic solution to the power problem. Finally, we will discuss some commercial power management systems (Section 3.10) and provide a glimpse into some more radical technologies that are emerging (Section 3.11).

2. DEFINING POWER

Power and energy are commonly defined in terms of the work that a system performs. Energy is the total amount of work a system performs over a period of time, while power is the rate at which the system performs that work. In formal terms,

P = W/T, (1)

E = P ∗ T, (2)

where P is power, E is energy, T is a specific time interval, and W is the total work performed in that interval. Energy is measured in joules, while power is measured in watts.

These concepts of work, power, and energy are used differently in different contexts. In the context of computers, work involves activities associated with running programs (e.g., addition, subtraction, memory operations), power is the rate at which the computer consumes electrical energy (or dissipates it in the form of heat) while performing these activities, and energy is the total electrical energy the computer consumes (or dissipates as heat) over time.

This distinction between power and energy is important because techniques that reduce power do not necessarily reduce energy. For example, the power consumed by a computer can be reduced by halving the clock frequency, but if the computer then takes twice as long to run the same programs, the total energy consumed will be similar. Whether one should reduce power or energy depends on the context. In mobile applications, reducing energy is often more important because it increases the battery lifetime. However, for other systems (e.g., servers), temperature is a larger issue. To keep the temperature within acceptable limits, one would need to reduce instantaneous power regardless of the impact on total energy.
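The distinction between power and energy in Equations (1) and (2) can be made concrete with a small numeric sketch. The wattages and runtimes below are invented for illustration; the point, taken from the frequency-halving example above, is that halving power while doubling runtime leaves the total energy unchanged:

```python
# Illustrative sketch (numbers are made up, not measurements).
# E = P * T (Equation (2)): halving the clock frequency roughly halves
# power but doubles runtime, so the energy drawn stays about the same.

def energy_joules(power_watts: float, time_seconds: float) -> float:
    """E = P * T (Equation (2))."""
    return power_watts * time_seconds

# Baseline: a hypothetical 10 W processor finishing a job in 4 s.
baseline = energy_joules(10.0, 4.0)   # 40 J

# Halved frequency: power drops to ~5 W, but the job takes ~8 s.
halved = energy_joules(5.0, 8.0)      # 40 J

assert baseline == halved  # power fell, total energy did not
```

This is why the survey stresses that reducing power and reducing energy are different goals: the battery of a mobile device cares about the 40 J, while a server's cooling system cares about the 10 W.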
Fig. 2. Organization of this survey.

2.1. Dynamic Power Consumption

There are two forms of power consumption, dynamic power consumption and static power consumption. Dynamic power consumption arises from circuit activity such as the changes of inputs in an adder or values in a register. It has two sources, switched capacitance and short-circuit current.

Switched capacitance is the primary source of dynamic power consumption and arises from the charging and discharging of capacitors at the outputs of circuits. Short-circuit current is a secondary source of dynamic power consumption and accounts for only 10-15% of the total power consumption. It arises because circuits are composed of transistors having opposite polarity, negative or NMOS and positive or PMOS. When these two types of transistors switch current, there is an instant when they are simultaneously on, creating a short circuit. We will not deal further with the power dissipation caused by this short circuit because it is a smaller percentage of total power, and researchers have not found a way to reduce it without sacrificing performance.

As the following equation shows, the more dominant component of dynamic power, switched capacitance (Pdynamic), depends on four parameters, namely, supply voltage (V), clock frequency (f), physical capacitance (C), and an activity factor (a) that relates to how many 0 → 1 or 1 → 0 transitions occur in a chip:

Pdynamic ∼ a C V^2 f. (3)

Accordingly, there are four ways to reduce dynamic power consumption, though they each have different tradeoffs and not all of them reduce the total energy consumed. The first way is to reduce the physical capacitance or stored electrical charge of a circuit. The physical capacitance depends on low level design parameters such as transistor sizes and wire lengths. One can reduce the capacitance by reducing transistor sizes, but this worsens performance.

The second way to lower dynamic power is to reduce the switching activity. As computer chips get packed with increasingly complex functionalities, their switching activity increases [De and Borkar 1999], making it more important to develop techniques that fall into this category. One popular technique, clock gating, gates the clock signal from reaching idle functional units. Because the clock network accounts for a large fraction of a chip’s total energy consumption, this is a very effective way of reducing power and energy throughout a processor and is implemented in numerous commercial systems
including the Pentium 4, Pentium M, Intel XScale and Tensilica Xtensa, to mention but a few.

The third way to reduce dynamic power consumption is to reduce the clock frequency. But as we have just mentioned, this worsens performance and does not always reduce the total energy consumed. One would use this technique only if the target system does not support voltage scaling and if the goal is to reduce the peak or average power dissipation and indirectly reduce the chip’s temperature.

The fourth way to reduce dynamic power consumption is to reduce the supply voltage. Because reducing the supply voltage increases gate delays, it also requires reducing the clock frequency to allow the circuit to work properly. The combination of scaling the supply voltage and clock frequency in tandem is called dynamic voltage scaling (DVS). This technique should ideally reduce dynamic power dissipation cubically because dynamic power is quadratic in voltage and linear in clock frequency. This is the most widely adopted technique. A growing number of processors, including the Pentium M, mobile Pentium 4, AMD’s Athlon, and Transmeta’s Crusoe and Efficeon processors, allow software to adjust clock frequencies and voltage settings in tandem. However, DVS has limitations and cannot always be applied, and even when it can be applied, it is nontrivial to apply, as we will see in Section 3.5.

2.2. Understanding Leakage Power Consumption

Fig. 3. ITRS trends for leakage power dissipation. Figure adapted from Meng et al., 2005.

In addition to consuming dynamic power, computer components consume static power, also known as idle power or leakage. According to the most recently published industrial roadmaps [ITRSRoadMap], leakage power is rapidly becoming the dominant source of power consumption in circuits (Figure 3) and persists whether a computer is active or idle. Because its causes are different from those of dynamic power, dynamic power reduction techniques do not necessarily reduce the leakage power. As the equation that follows illustrates, leakage power consumption is the product of the supply voltage (V) and the leakage current (Ileak), or parasitic current, that flows through transistors even when the transistors are turned off:

Pleak = V ∗ Ileak. (4)

To understand how leakage current arises, one must understand how transistors work. A transistor regulates the flow of current between two terminals called the source and the drain. Between these two terminals is an insulator, called the channel, that resists current. As the voltage at a third terminal, the gate, is increased, electrical charge accumulates in the channel, reducing the channel’s resistance and creating a path along which electricity can flow. Once the gate voltage is high enough, the channel’s polarity changes, allowing the normal flow of current between the source and the drain. The threshold at which the gate’s voltage is high enough for the path to open is called the threshold voltage.

According to this model, a transistor is similar to a water dam. It is supposed to allow current to flow when the gate voltage exceeds the threshold voltage but should otherwise prevent current from flowing. However, transistors are imperfect. They leak current even when the gate voltage is below the threshold voltage. In fact, there are six different types of current that leak through a transistor. These
include reverse-biased-junction leakage, gate-induced-drain leakage, subthreshold leakage, gate-oxide leakage, gate-current leakage, and punch-through leakage. Of these six, subthreshold leakage and gate-oxide leakage dominate the total leakage current.

Gate-oxide leakage flows from the gate of a transistor into the substrate. This type of leakage current depends on the thickness of the oxide material that insulates the gate:

Iox = K2 W (V/Tox)^2 e^(−α Tox/V). (5)

According to this equation, the gate-oxide leakage Iox increases exponentially as the thickness Tox of the gate’s oxide material decreases. This is a problem because future chip designs will require the thickness to be reduced along with other scaled parameters such as transistor length and supply voltage. One way of solving this problem would be to insulate the gate using a high-k dielectric material instead of the oxide materials that are currently used. This solution is likely to emerge over the next decade.

Subthreshold leakage current flows between the drain and source of a transistor. It is the dominant source of leakage and depends on a number of parameters that are related through the following equation:

Isub = K1 W e^(−Vth/(n T)) (1 − e^(−V/T)). (6)

In this equation, W is the gate width and K and n are constants. The important parameters are the supply voltage V, the threshold voltage Vth, and the temperature T. The subthreshold leakage current Isub increases exponentially as the threshold voltage Vth decreases. This again raises a problem for future chip designs because, as technology scales, threshold voltages will have to scale along with supply voltages.

The increase in subthreshold leakage current causes another problem. When the leakage current increases, the temperature increases. But as Equation (6) shows, this increases leakage further, causing yet higher temperatures. This vicious cycle is known as thermal runaway. It is the chip designer’s worst nightmare.

How can these problems be solved? Equations (4) and (6) indicate four ways to reduce leakage power. The first way is to reduce the supply voltage. As we will see, supply voltage reduction is a very common technique that has been applied to components throughout a system (e.g., processor, buses, cache memories).

The second way to reduce leakage power is to reduce the size of a circuit, because the total leakage is proportional to the leakage dissipated in all of a circuit’s transistors. One way of doing this is to design a circuit with fewer transistors by omitting redundant hardware and using smaller caches, but this may limit performance and versatility. Another idea is to reduce the effective transistor count dynamically by cutting the power supplies to idle components. Here, too, there are challenges, such as how to predict when different components will be idle and how to minimize the overhead of switching them on or off. This is also a common approach of which we will see examples in the sections to follow.

The third way to reduce leakage power is to cool the computer. Several cooling techniques have been developed since the 1960s. Some blow cold air into the circuit, while others refrigerate the processor [Schmidt and Notohardjono 2002], sometimes even by costly means such as circulating cryogenic fluids like liquid nitrogen [Krane et al. 1988]. These techniques have three advantages. First, they significantly reduce subthreshold leakage. In fact, a recent study [Schmidt and Notohardjono 2002] showed that cooling a memory cell by 50 degrees Celsius reduces the leakage energy by five times. Second, these techniques allow a circuit to work faster because electricity encounters less resistance at lower temperatures. Third, cooling eliminates some negative effects of high temperatures, namely the degradation of a chip’s reliability and life expectancy. Despite these advantages,
When the degradation of a chip’s reliability and life leakage current increases, the tempera- expectancy. Despite these advantages, ACM Computing Surveys, Vol. 37, No. 3, September 2005.
  • 6. 200 V. Venkatachalam and M. Franz Fig. 4. The transistor stacking effect. The cir- cuits in both (a) and (b) are leaking current be- tween Vdd and Ground. However, Ileak1, the leak- age in (a), is less than Ileak2, the leakage in (b). Figure adapted from Butts and Sohi 2000. Fig. 5. Multiple threshold circuits with sleep transistors. there are issues to consider such as the costs of the hardware used to cool the and this increases the threshold voltage circuit. Moreover, cooling techniques of the bottom transistor, making it more are insufficient if they result in wide resistant to leakage. As a result, in Fig- temperature variations in different parts ure 4(a), in which all transistors are in of a circuit. Rather, one needs to prevent the Off position, transistor B leaks less hotspots by distributing heat evenly current than transistor A, and transistor throughout a chip. C leaks less current than transistor B. Hence, the total leakage current is atten- 2.2.1. Reducing Threshold Voltage. The uated as it flows from Vdd to the ground fourth way of reducing leakage power is to through transistors A, B, and C. This is not increase the threshold voltage. As Equa- the case in the circuit shown in Figure 4 tion (6) shows, this reduces the subthresh- (b), which contains only a single off old leakage exponentially. However, it also transistor. reduces the circuit’s performance as is ap- Another way to increase the thresh- parent in the following equation that re- old voltage is to use Multiple Thresh- lates frequency (f), supply voltage (V), and old Circuits With Sleep Transistors (MTC- threshold voltage (Vth ) and where α is a MOS) [Calhoun et al. 2003; Won et al. constant: 2003] . This involves isolating a leaky cir- cuit element by connecting it to a pair (V − Vth )α of virtual power supplies that are linked f∞ . (7) V to its actual power supplies through sleep transistors (Figure 5). 
When the circuit is One of the less intuitive ways of active, the sleep transistors are activated, increasing the threshold voltage is to connecting the circuit to its power sup- exploit what is called the stacking effect plies. But when the circuit is inactive, the (refer to Figure 4). When two or more tran- sleep transistors are deactivated, thus dis- sistors that are switched off are stacked connecting the circuit from its power sup- on top of each other (a), then they dis- plies. In this inactive state, almost no leak- sipate less leakage than a single tran- age passes through the circuit because the sistor that is turned off (b). This is be- sleep transistors have high threshold volt- cause each transistor in the stack induces ages. (Recall that subthreshold leakage a slight reverse bias between the gate and drops exponentially with a rise in thresh- source of the transistor right below it, old voltage, according to Equation (6).) ACM Computing Surveys, Vol. 37, No. 3, September 2005.
  • 7. Power Reduction Techniques for Microprocessor Systems 201 Fig. 6. Adaptive body biasing. This technique effectively confines the To adjust the threshold voltage, adap- leakage to one part of the circuit, but is tive body biasing applies a voltage to the tricky to implement for several reasons. transistor’s body known as a body bias The sleep transistors must be sized prop- voltage (Figure 6). This voltage changes erly to minimize the overhead of acti- the polarity of a transistor’s channel, de- vating them. They cannot be turned on creasing or increasing its resistance to and off too frequently. Moreover, this tech- current flow. When the body bias voltage nique does not readily apply to memories is chosen to fill the transistor’s channel because memories lose data when their with positive ions (b), the threshold volt- power supplies are cut. age increases and reduces leakage cur- A third way to increase the threshold rents. However, when the voltage is cho- is to employ dual threshold circuits. Dual sen to fill the channel with negative ions, threshold circuits [Liu et al. 2004; Wei the threshold voltage decreases, allowing et al. 1998; Ho and Hwang 2004] reduce higher performance, though at the cost of leakage by using high threshold (low leak- more leakage. age) transistors on noncritical paths and low threshold transistors on critical paths, the idea being that noncritical paths can 3. REDUCING POWER execute instructions more slowly without 3.1. Circuit And Logic Level Techniques impairing performance. This is a diffi- cult technique to implement because it re- 3.1.1. Transistor Sizing. Transistor siz- quires choosing the right combination of ing [Penzes et al. 2002; Ebergen et al. transistors for high-threshold voltages. 
If 2004] reduces the width of transistors too many transistors are assigned high to reduce their dynamic power consump- threshold voltages, the noncritical paths tion, using low-level models that relate the in the circuit can slow down too much. power consumption to the width. Accord- A fourth way to increase the thresh- ing to these models, reducing the width old voltage is to apply a technique known also increases the transistor’s delay and as adaptive body biasing [Seta et al. thus the transistors that lie away from 1995; Kobayashi and Sakurai 1994; Kim the critical paths of a circuit are usually and Roy 2002]. Adaptive body biasing is the best candidates for this technique. Al- a runtime technique that reduces leak- gorithms for applying this technique usu- age power by dynamically adjusting the ally associate with each transistor a toler- threshold voltages of circuits depending able delay which varies depending on how on whether the circuits are active. When close that transistor is to the critical path. a circuit is not active, the technique in- These algorithms then try to scale each creases its threshold voltage, thus saving transistor to be as small as possible with- leakage power exponentially, although at out violating its tolerable delay. the expense of a delay in circuit operation. When the circuit is active, the technique 3.1.2. Transistor Reordering. The ar- decreases the threshold voltage to avoid rangement of transistors in a circuit slowing it down. affects energy consumption. Figure 7 ACM Computing Surveys, Vol. 37, No. 3, September 2005.
  • 8. 202 V. Venkatachalam and M. Franz 3.1.3. Half Frequency and Half Swing Clocks. Half-frequency and half-swing clocks re- duce frequency and voltage, respectively. Traditionally, hardware events such as register file writes occur on a rising clock edge. Half-frequency clocks synchro- nize events using both edges, and they tick at half the speed of regular clocks, thus cutting clock switching power in half. Reduced-swing clocks also often use a lower voltage signal and thus reduce power quadratically. Fig. 7. Transistor Reordering. Figure adapted from Hossain et al. 1996. 3.1.4. Logic Gate Restructuring. There are many ways to build a circuit out of logic gates. One decision that affects power con- shows two possible implementations of sumption is how to arrange the gates and the same circuit that differ only in their their input signals. placement of the transistors marked A For example, consider two implementa- and B. Suppose that the input to transis- tions of a four-input AND gate (Figure 8), tor A is 1, the input to transistor B is 1, a chain implementation (a), and a tree and the input to transistor C is 0. Then implementation (b). Knowing the signal transistors A and B will be on, allowing probabilities (1 or 0) at each of the pri- current from Vdd to flow through them mary inputs (A, B, C, D), one can easily cal- and charge the capacitors C1 and C2. culate the transition probabilities (0→1) Now suppose that the inputs change and for each output (W, X, F, Y, Z). If each in- that A’s input becomes 0, and C’s input be- put has an equal probability of being a comes 1. Then A will be off while B and C 1 or a 0, then the calculation shows that will be on. Now the implementations in (a) the chain implementation (a) is likely to and (b) will differ in the amounts of switch- switch less than the tree implementation ing activity. In (a), current from ground (b). 
This is because each gate in a chain will flow past transistors B and C, dis- has a lower probability of having a 0→1 charging both the capacitors C1 and C2. transition than its predecessor; its tran- However, in (b), the current from ground sition probability depends on those of all will only flow past transistor C; it will not its predecessors. In the tree implementa- get past transistor A since A is turned off. tion, on the other hand, some gates may Thus it will only discharge the capacitor share a parent (in the tree topology) in- C2, rather than both C1 and C2 as in part stead of being directly connected together. (a). Thus the implementation in (b) will These gates could have the same transi- consume less power than that in (a). tion probabilities. Transistor reordering [Kursun et al. Nevertheless, chain implementations 2004; Sultania et al. 2004] rearranges do not necessarily save more energy than transistors to minimize their switching tree implementations. There are other is- activity. One of its guiding principles is sues to consider when choosing a topology. to place transistors closer to the circuit’s One is the issue of glitches or spurious outputs if they switch frequently in or- transitions that occur when a gate does not der to prevent a domino effect where receive all of its inputs at the same time. the switching activity from one transistor These glitches are more common in chain trickles into many other transistors caus- implementations where signals can travel ing widespread power dissipation. This re- along different paths having widely vary- quires profiling techniques to determine ing delays. One solution to reduce glitches how frequently different transistors are is to change the topology so that the dif- likely to switch. ferent paths in the circuit have similar ACM Computing Surveys, Vol. 37, No. 3, September 2005.
  • 9. Power Reduction Techniques for Microprocessor Systems 203 Fig. 8. Gate restructuring. Figure adapted from the Pennsylvania State University Microsystems Design Laboratory’s tutorial on low power design. delays. This solution, known as path bal- such as register files. A typical master- ancing often transforms chain implemen- slave flip-flop consists of two latches, a tations into tree implementations. An- master latch, and a slave latch. The inputs other solution, called retiming, involves of the master latch are the clock signal inserting flip-flops or registers to slow and data. The inputs of the slave latch down and thereby synchronize the signals are the inverse of the clock signal and the that pass along different paths but recon- data output by the master latch. Thus verge to the same gate. Because flip-flops when the clock signal is high, the master and registers are in sync with the proces- latch is turned on, and the slave latch sor clock, they sample their inputs less fre- is turned off. In this phase, the master quently than logic gates and are thus more samples whatever inputs it receives and immune to glitches. outputs them. The slave, however, does not sample its inputs but merely outputs 3.1.5. Technology Mapping. Because of whatever it has most recently stored. On the huge number of possibilities and the falling edge of the clock, the master tradeoffs at the gate level, designers rely turns off and the slave turns on. Thus the on tools to determine the most energy- master saves its most recent input and optimal way of arranging gates and sig- stops sampling any further inputs. The nals. Technology mapping [Chen et al. slave samples the new inputs it receives 2004; Li et al. 2004; Rutenbar et al. 2001] from the master and outputs it. 
is the automated process of constructing a Besides this master-slave design are gate-level representation of a circuit sub- other common designs such as the pulse- ject to constraints such as area, delay, and triggered flip-flop and sense-amplifier flip- power. Technology mapping for power re- flop. All these designs share some common lies on gate-level power models and a li- sources of power consumption, namely brary that describes the available gates, power dissipated from the clock signal, and their design constraints. Before a cir- power dissipated in internal switching ac- cuit can be described in terms of gates, it tivity (caused by the clock signal and by is initially represented at the logic level. changes in data), and power dissipated The problem is to design the circuit out of when the outputs change. logic gates in a way that will mimimize the Researchers have proposed several al- total power consumption under delay and ternative low power designs for flip-flops. cost constraints. This is an NP-hard Di- Most of these approaches reduce the rected Acyclic Graph (DAG) covering prob- switching activity or the power dissipated lem, and a common heuristic to solve it is by the clock signal. One alternative is the to break the DAG representation of a cir- self-gating flip-flop. This design inhibits cuit into a set of trees and find the optimal the clock signal to the flip-flop when the in- mapping for each subtree using standard puts will produce no change in the outputs. tree-covering algorithms. Strollo et al. [2000] have proposed two ver- sions of this design. In the double-gated 3.1.6. Low Power Flip-Flops. Flip-flops flip-flop, the master and slave latches are the building blocks of small memories each have their own clock-gating circuit. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
204 V. Venkatachalam and M. Franz

The circuit consists of a comparator that checks whether the current and previous inputs are different and some logic that inhibits the clock signal based on the output of the comparator. In the sequential-gated flip-flop, the master and the slave share the same clock-gating circuit instead of having their own individual circuits as in the double-gated flip-flop.

Another low-power flip-flop is the conditional capture flip-flop [Kong et al. 2001; Nedovic et al. 2001]. This flip-flop detects when its inputs produce no change in the output and stops these redundant inputs from causing any spurious internal current switches. There are many variations on this idea, such as the conditional discharge flip-flop and the conditional precharge flip-flop [Zhao et al. 2004], which eliminate unnecessary precharging and discharging of internal elements.

A third type of low-power flip-flop is the dual edge triggered flip-flop [Llopis and Sachdev 1996]. These flip-flops are sensitive to both the rising and falling edges of the clock and can do the same work as a regular flip-flop at half the clock frequency. By cutting the clock frequency in half, these flip-flops save total power consumption but may not save total energy, although they could save energy by reducing the voltage along with the frequency. Another drawback is that these flip-flops require more area than regular flip-flops. This increase in area may cause them to consume more power on the whole.

3.1.7. Low-Power Control Logic Design. One can view the control logic of a processor as a Finite State Machine (FSM). It specifies the possible processor states and the conditions for switching between the states, and it generates the signals that activate the appropriate circuitry for each state. For example, when an instruction is in the execution stage of the pipeline, the control logic may generate signals to activate the correct execution unit.

Although control logic optimizations have traditionally targeted performance, they are now also targeting power. One way of reducing power is to encode the FSM states in a way that minimizes the switching activity throughout the processor. Another common approach involves decomposing the FSM into sub-FSMs and activating only the circuitry needed for the currently executing sub-FSM. Some researchers [Gao and Hayes 2003] have developed tools that automate this process of decomposition, subject to some quality and correctness constraints.

3.1.8. Delay-Based Dynamic-Supply Voltage Adjustment. Commercial processors that can run at multiple clock speeds normally use a lookup table to decide what supply voltage to select for a given clock speed. This table is built ahead of time through a worst-case analysis. Because worst-case analyses may not reflect reality, ARM Inc. has been developing a more efficient runtime solution known as the Razor pipeline [Ernst et al. 2003].

Instead of using a lookup table, Razor adjusts the supply voltage based on the delay in the circuit. The idea is that whenever the voltage is reduced, the circuit slows down, causing timing errors if the clock frequency chosen for that voltage is too high. Because of these errors, some instructions produce inconsistent results or fail altogether. The Razor pipeline periodically monitors how many such errors occur. If the number of errors exceeds a threshold, it increases the supply voltage. If the number of errors is below a threshold, it scales the voltage down further to save more energy.

This solution requires extra hardware in order to detect and correct circuit errors. To detect errors, it augments the flip-flops in delay-critical regions of the circuit with shadow-latches. These latches receive the same inputs as the flip-flops but are clocked more slowly to adapt to the reduced supply voltage. If the output of a flip-flop differs from that of its shadow-latch, then this signifies an error. In the event of such an error, the circuit propagates the output of the shadow-latch instead of the flip-flop, delaying the pipeline for a cycle if necessary.
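The error-driven voltage adjustment that Razor performs can be caricatured as a simple feedback loop. The threshold, step size, and voltage bounds below are illustrative assumptions of ours, not values from Ernst et al.:

```python
def adjust_voltage(voltage, error_count, threshold=10,
                   step=0.05, v_min=0.8, v_max=1.2):
    """One step of a Razor-style feedback loop: raise the supply
    voltage when timing errors are frequent, lower it when rare.
    All constants are illustrative, not from the Razor paper."""
    if error_count > threshold:
        voltage = min(voltage + step, v_max)   # too many errors: back off
    else:
        voltage = max(voltage - step, v_min)   # few errors: save energy
    return round(voltage, 3)

v = 1.0
v = adjust_voltage(v, error_count=25)   # errors above threshold -> 1.05
v = adjust_voltage(v, error_count=2)    # errors below threshold -> 1.0
```

The point of the design is that the operating voltage converges on the circuit's actual delay margin instead of a precomputed worst case.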
Fig. 9. Crosstalk prevention through shielding.

3.2. Low-Power Interconnect

Interconnect heavily affects power consumption since it is the medium of most electrical activity. Efforts to improve chip performance are resulting in smaller chips with more transistors and more densely packed wires carrying larger currents. The wires in a chip often use materials with poor thermal conductivity [Banerjee and Mehrotra 2001].

3.2.1. Bus Encoding and Crosstalk. A popular way of reducing the power consumed in buses is to reduce the switching activity through intelligent bus encoding schemes, such as bus-inversion [Stan and Burleson 1995]. Buses consist of wires that transmit bits (logical 1 or 0). For every data transmission on a bus, the number of wires that switch depends on the current and previous values transmitted. If the Hamming distance between these values is more than half the number of wires, then most of the wires on the bus will switch current. To prevent this from happening, bus-inversion transmits the inverse of the intended value and asserts a control signal alerting recipients of the inversion. For example, if the current binary value to transmit is 110 and the previous was 000, the bus instead transmits 001, the inverse of 110.

Bus-inversion ensures that at most half of the bus wires switch during a bus transaction. However, because of the cost of the logic required to invert the bus lines, this technique is mainly used in external buses rather than the internal chip interconnect.

Moreover, some researchers have argued that this technique makes the simplistic assumption that the amount of power that an interconnect wire consumes depends only on the number of bit transitions on that wire. As chips become smaller, there arise additional sources of power consumption. One of these sources is crosstalk [Sylvester and Keutzer 1998]. Crosstalk is spurious activity on a wire that is caused by activity in neighboring wires. As well as increasing delays and impairing circuit integrity, crosstalk can increase power consumption.

One way of reducing crosstalk [Taylor et al. 2001] is to insert a shield wire (Figure 9) between adjacent bus wires. Since the shield remains deasserted, no adjacent wires switch in opposite directions. This solution doubles the number of wires. However, the principle behind it has sparked efforts to develop self-shielding codes [Victor and Keutzer 2001; Patel and Markov 2003] resistant to crosstalk. As in traditional bus encoding, a value is encoded and then transmitted. However, the code chosen avoids opposing transitions on adjacent bus wires.

3.2.2. Low Swing Buses. Bus-encoding schemes attempt to reduce transitions on the bus. Alternatively, a bus can transmit the same information but at a lower voltage. This is the principle behind low swing buses [Zhang and Rabaey 1998]. Traditionally, logical one is represented by 5 volts and logical zero is represented by −5 volts. In a low-swing system (Figure 10), logical one and zero are encoded using lower voltages, such as +300mV and −300mV. Typically, these systems are implemented with differential signaling. An input signal is split into two signals of opposite polarity bounded by a smaller voltage range. The receiver sees the difference between the two transmitted signals as the actual signal and amplifies it back to normal voltage.

Low swing differential signaling has several advantages in addition to reduced power consumption. It is immune to crosstalk and electromagnetic radiation effects.
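The bus-inversion rule is easy to state in code. The sketch below is our own illustration, not Stan and Burleson's implementation: it counts the wires that would switch and inverts the transmitted word when more than half of them would.

```python
def bus_invert(value, prev_bus, width):
    """Bus-inversion encoding: return (word to drive on the bus,
    state of the invert control line)."""
    mask = (1 << width) - 1
    # Hamming distance between the new value and what is on the bus now.
    switching = bin((value ^ prev_bus) & mask).count("1")
    if switching > width / 2:          # more than half the wires would switch
        return (~value) & mask, True   # transmit the complement instead
    return value & mask, False

# The paper's example: sending 110 after 000 drives 001 with invert asserted.
word, inverted = bus_invert(0b110, 0b000, width=3)
assert (word, inverted) == (0b001, True)
```

The receiver re-inverts the word whenever the control line is asserted, so the extra wire buys a guarantee of at most width/2 transitions per transfer.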
Fig. 10. Low voltage differential signaling.

Fig. 11. Bus segmentation.

Since the two transmitted signals are close together, any spurious activity will affect both equally without affecting the difference between them. However, when implementing this technique, one needs to consider the costs of increased hardware at the encoder and decoder.

3.2.3. Bus Segmentation. Another effective technique is bus segmentation. In a traditional shared bus architecture, the entire bus is charged and discharged upon every access. Segmentation (Figure 11) splits a bus into multiple segments connected by links that regulate the traffic between adjacent segments. Links connecting paths essential to a communication are activated independently, allowing most of the bus to remain powered down. Ideally, devices communicating frequently should be in the same or nearby segments to avoid powering many links. Jone et al. [2003] have examined how to partition a bus to ensure this property. Their algorithm begins with an undirected graph whose nodes are devices, whose edges connect communicating devices, and whose edge weights indicate communication frequency. Applying the Gomory and Hu algorithm [1961], they find a spanning tree in which the product of communication frequency and number of edges between any two devices is minimal.

3.2.4. Adiabatic Buses. The preceding techniques work by reducing activity or voltage. In contrast, adiabatic circuits [Lyuboslavsky et al. 2000] reduce total capacitance. These circuits reuse existing electrical charge to avoid creating new charge. In a traditional bus, when a wire becomes deasserted, its previous charge is wasted. A charge-recovery bus recycles the charge for wires about to be asserted.

Figure 12 illustrates a simple design for a two-bit adiabatic bus [Bishop and Irwin 1999]. A comparator in each bitline tests whether the current bit is equal to the previously sent bit. Inequality signifies a transition and connects the bitline to a shorting wire used for sharing charge across bitlines. In the sample scenario, Bitline1 has a falling transition while Bitline0 has a rising transition. As Bitline1 falls, its positive charge transfers to Bitline0, allowing Bitline0 to rise. The power saved depends on transition patterns. No energy is saved when all lines rise. The most energy is saved when an equal number of lines rise and fall simultaneously. The biggest drawback of adiabatic circuits is the delay for transferring shared charge.

3.2.5. Network-On-Chip. All of the above techniques assume an architecture in which multiple functional units share one or more buses.
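To make the segmentation idea concrete, the toy model below (our own sketch, unrelated to Jone et al.'s partitioning algorithm) places devices on a linear chain of segments and counts how many inter-segment links a transfer must power:

```python
def links_powered(segment_of, src, dst):
    """Number of inter-segment links activated when device `src`
    talks to device `dst` on a linearly segmented bus.
    `segment_of` maps each device name to its segment index."""
    return abs(segment_of[src] - segment_of[dst])

# Hypothetical placement: the chattiest pair shares segment 0,
# so most of their transfers power zero links.
segment_of = {"cpu": 0, "cache": 0, "dma": 1, "uart": 2}
assert links_powered(segment_of, "cpu", "cache") == 0
assert links_powered(segment_of, "cpu", "uart") == 2
```

A good partition minimizes the frequency-weighted sum of these link counts, which is exactly what the spanning-tree formulation above optimizes.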
Fig. 12. Two bit charge recovery bus.

This shared-bus paradigm has several drawbacks with respect to power and performance. The inherent bus bandwidth limits the speed and volume of data transfers and does not suit the varying requirements of different execution units. Only one unit can access the bus at a time, though several may be requesting bus access simultaneously. Bus arbitration mechanisms are energy intensive. Unlike simple data transfers, every bus transaction involves multiple clock cycles of handshaking protocols that increase power and delay.

As a consequence, researchers have been investigating the use of generic interconnection networks to replace buses. Compared to buses, networks offer higher bandwidth and support concurrent connections. This design makes it easier to troubleshoot energy-intensive areas of the network. Many networks have sophisticated mechanisms for adapting to varying traffic patterns and quality-of-service requirements. Several authors [Zhang et al. 2001; Dally and Towles 2001; Sgroi et al. 2001] propose the application of these techniques at the chip level. These claims have yet to be verified, and interconnect power reduction remains a promising area for future research.

Wang et al. [2003] have developed hardware-level optimizations for different sources of energy consumption in an on-chip network. One of the alternative network topologies they propose is the segmented crossbar topology, which aims to reduce the energy consumption of data transfers. When data is transferred over a regular network grid, all rows and columns corresponding to the intersection points end up switching current even though only parts of these rows and columns are actually traversed. To eliminate this widespread current switching, the segmented crossbar topology divides the rows and columns into segments demarcated by tristate buffers. The different segments are selectively activated as the data traverses through them, hence confining current switches to only those segments along which the data actually passes.

3.3. Low-Power Memories and Memory Hierarchies

3.3.1. Types of Memory. One can classify memory structures into two categories, Random Access Memories (RAM) and Read-Only Memories (ROM). There are two kinds of RAM, static RAM (SRAM) and dynamic RAM (DRAM), which differ in how they store data. SRAMs store data using flip-flops, while DRAMs store each bit of data as a charge on a capacitor; thus DRAMs need to refresh their data periodically. SRAMs allow faster accesses than DRAMs but require more area and are more expensive. As a result, normally only register files, caches, and high-bandwidth parts of the system are made up of SRAM cells, while main memory is made up of DRAM cells. Although DRAM cells have slower access times than SRAMs, they contain fewer transistors and are less expensive than SRAM cells.

The techniques we will introduce are not confined to any specific type of RAM or ROM. Rather, they are high-level architectural principles that apply across the spectrum of memories to the extent that the required technology is available. They attempt to reduce the energy dissipation of memory accesses in two ways, either by reducing the energy dissipated in a memory access or by reducing the number of memory accesses.

3.3.2. Splitting Memories Into Smaller Subsystems. An effective way to reduce the energy that is dissipated in a memory access is to activate only the needed memory circuits in each access.
One way to expose these circuits is to partition memories into smaller, independently accessible components. This can be done at different granularities.

At the lowest granularity, one can design a main memory system that is split up into multiple banks, each of which can independently transition into different power modes. Compilers and operating systems can cluster data into a minimal set of banks, allowing the other, unused banks to power down and thereby save energy [Luz et al. 2002; Lebeck et al. 2000].

At a higher granularity, one can partition the separate banks of a partitioned memory into subbanks and activate only the relevant subbank in every memory access. This is a technique that has been applied to caches [Ghose and Kamble 1999]. In a normal cache access, the set-selection step involves transferring all blocks in all banks onto tag-comparison latches. Since the requested word is at a known offset in these blocks, energy can be saved by transferring only the words at that offset. Cache subbanking splits the block array of each cache line into multiple banks. During set-selection, only the relevant bank from each line is active. Since a single subbank is active at a time, all subbanks of a line share the same output latches and sense amplifiers to reduce hardware complexity. Subbanking saves energy without increasing memory access time. Moreover, it is independent of program locality patterns.

3.3.3. Augmenting the Memory Hierarchy With Specialized Cache Structures. The second way to reduce energy dissipation in the memory hierarchy is to reduce the number of memory hierarchy accesses. One group of researchers [Kin et al. 1997] developed a simple but effective technique to do this. Their idea was to integrate a specialized cache into the typical memory hierarchy of a modern processor, a hierarchy that may already contain one or more levels of caches. The new cache they were proposing would sit between the processor and the first level cache. It would be smaller than the first level cache and hence dissipate less energy. Yet it would be large enough to store an application's working set and filter out many memory references; only when a data request misses this cache would it require searching higher levels of cache, and the penalty for a miss would be offset by redirecting more memory references to this smaller, more energy efficient cache. This was the principle behind the filter cache. Though deceptively simple, this idea lies at the heart of many of the low-power cache designs used today, ranging from simple extensions of the cache hierarchy (e.g., block buffering) to the scratch pad memories used in embedded systems and the complex trace caches used in high-end processors of the latest generation.

One simple extension is the technique of block buffering. This technique augments caches with small buffers to store the most recently accessed cache set. If a data item that has been requested belongs to a cache set that is already in the buffer, then one does not have to search for it in the rest of the cache.

In embedded systems, scratch pad memories [Panda et al. 2000; Banakar et al. 2002; Kandemir et al. 2002] have been the extension of choice. These are software controlled memories that reside on-chip. Unlike caches, the data they should contain is determined ahead of time and remains fixed while programs execute. These memories are less volatile than conventional caches and can be accessed within a single cycle. They are ideal for specialized embedded systems whose applications are often hand tuned.

General purpose processors of the latest generation sometimes rely on more complex caching techniques such as the trace cache. Trace caches were originally developed for performance but have been studied recently for their power benefits. Instead of storing instructions in their compiled order, a trace cache stores traces of instructions in their executed order. If an instruction sequence is already in the trace cache, then it need not be fetched from the instruction cache but can be decoded directly from the trace cache.
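The filtering effect that Kin et al. exploit can be sketched with a toy two-level lookup. This is our own illustration; the per-access energies are made-up relative units, not measurements:

```python
def access_energy(trace, filter_size=4, e_filter=1, e_l1=10):
    """Toy model of a filter cache in front of L1: every access pays
    the small filter-cache energy, and only filter misses also pay
    the L1 energy. Energies are arbitrary illustrative units."""
    filter_cache = []   # fully associative, LRU order, holds block ids
    energy = 0
    for block in trace:
        energy += e_filter
        if block in filter_cache:
            filter_cache.remove(block)   # hit: refresh LRU position
        else:
            energy += e_l1               # filter miss: go to L1
            if len(filter_cache) == filter_size:
                filter_cache.pop(0)      # evict least recently used
        filter_cache.append(block)
    return energy

# A loop whose working set fits the filter cache touches L1 only on the
# four cold misses: 40 filter accesses plus 4 L1 fills.
loop = [0, 1, 2, 3] * 10
assert access_energy(loop) == 40 * 1 + 4 * 10
```

When the working set exceeds the filter cache, every access pays both energies, which is why the filter must be sized to the application's locality.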
Fig. 13. Drowsy cache line.

This saves power by reducing the number of instruction cache accesses.

In the original trace cache design, both the trace cache and the instruction cache are accessed in parallel on every clock cycle. Thus instructions are fetched from the instruction cache while the trace cache is searched for the next trace that is predicted to execute. If the trace is in the trace cache, then the instructions fetched from the instruction cache are discarded. Otherwise, program execution proceeds out of the instruction cache, and new traces are simultaneously created from the instructions that are fetched.

Low-power designs for trace caches typically aim to reduce the number of instruction cache accesses. One alternative, the dynamic direction prediction-based trace cache [Hu et al. 2003], uses branch prediction to decide where to fetch instructions from. If the branch predictor predicts the next trace with high confidence and that trace is in the trace cache, then the instructions are fetched from the trace cache rather than the instruction cache.

The selective trace cache [Hu et al. 2002] extends this technique with compiler support. The idea is to identify frequently executed, or hot, traces and bring these traces into the trace cache. The compiler inserts hints after every branch instruction, indicating whether it belongs to a hot trace. At runtime, these hints cause instructions to be fetched from either the trace cache or the instruction cache.

3.4. Low-Power Processor Architecture Adaptations

So far we have described the energy saving features of hardware as though they were a fixed foundation upon which programs execute. However, programs exhibit wide variations in behavior. Researchers have therefore been developing hardware structures whose parameters can be adjusted on demand, so that one can save energy by activating just the minimum hardware resources needed for the code that is executing.

3.4.1. Adaptive Caches. There is a wealth of literature on adaptive caches, caches whose storage elements (lines, blocks, or sets) can be selectively activated based on the application workload. One example of such a cache is the Dynamically Resizable Instruction (DRI) cache [Powell et al. 2001]. This cache permits one to deactivate its individual sets on demand by gating their supply voltages. To decide what sets to activate at any given time, the cache uses a hardware profiler that monitors the application's cache-miss patterns. Whenever the cache misses exceed a threshold, the DRI cache activates previously deactivated sets. Likewise, whenever the miss rate falls below a threshold, the DRI cache deactivates some of these sets by inhibiting their supply voltages.

A problem with this approach is that dormant memory cells lose data and need more time to be reactivated for their next use. Thus an alternative to inhibiting their supply voltages is to reduce their voltages as far as possible without losing data. This is the aim of the drowsy cache, a cache whose lines can be placed in a drowsy mode where they dissipate minimal power but retain data and can be reactivated faster.

Figure 13 illustrates a typical drowsy cache line [Kim et al. 2002]. The wakeup signal chooses between voltages for the active, drowsy, and off states. It also regulates the word decoder's output to prevent data in a drowsy line from being accessed. Before a cache access, the line circuitry must be precharged. In a traditional cache, the precharger is continuously on.
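The DRI cache's miss-driven resizing can be sketched as a small feedback rule. This is a simplification of Powell et al.'s mechanism; the thresholds, sizes, and doubling/halving policy are invented for illustration:

```python
def resize_cache(active_sets, misses, total_sets=64,
                 miss_hi=100, miss_lo=20):
    """One adaptation interval of a DRI-style cache: activate more
    sets when misses are high, gate some off when misses are low.
    All constants here are illustrative choices, not published values."""
    if misses > miss_hi and active_sets < total_sets:
        return min(active_sets * 2, total_sets)   # too many misses: grow
    if misses < miss_lo and active_sets > 1:
        return max(active_sets // 2, 1)           # few misses: shrink
    return active_sets

sets = 16
sets = resize_cache(sets, misses=150)   # heavy missing -> 32 sets active
sets = resize_cache(sets, misses=5)     # light missing -> back to 16
```

The two thresholds create a dead band that keeps the cache from oscillating between sizes on noisy miss counts.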
Fig. 14. Dead block elimination.

To save power, the drowsy circuitry regulates the wakeup signal with a precharge signal. The precharger becomes active only when the line is woken up.

Many heuristics have been developed for controlling these drowsy circuits. The simplest policy for controlling the drowsy cache is cache decay [Kaxiras et al. 2001], which simply turns off unused cache lines after a fixed interval. More sophisticated policies [Zhang et al. 2002; 2003; Hu et al. 2003] consider information about runtime program behavior. For example, Hu et al. [2003] have proposed hardware that detects when an execution moves into a hotspot and activates only those cache lines that lie within that hotspot. The main idea of their approach is to count how often different branches are taken. When a branch is taken sufficiently many times, its target address is a hotspot, and the hardware sets special bits to indicate that cache lines within that hotspot should not be shut down. Periodically, a runtime routine disables cache lines that do not have these bits set and decays the profiling data. Whenever it reactivates a cache line before its next use, it also reactivates the line that corresponds to the next subsequent memory access.

There are many other heuristics for controlling cache lines. Dead-block elimination [Kabadi et al. 2002] powers down cache lines containing basic blocks that have reached their final use. It is a compiler-directed technique that requires a static control flow analysis to identify these blocks. Consider the control flow graph in Figure 14. Since block B1 executes only once, its cache lines can power down once control reaches B2. Similarly, the lines containing B2 and B3 can power down once control reaches B4.

3.4.2. Adaptive Instruction Queues. Buyuktosunoglu et al. [2001] developed one of the first adaptive instruction issue queues, a 32-entry queue consisting of four equal size partitions that can be independently activated. Each partition consists of wakeup logic that decides when instructions are ready to execute and readout logic that dispatches ready instructions into the pipeline. At any time, only the partitions that contain the currently executing instructions are activated.

There have been many heuristics developed for configuring these queues. Some require measuring the rate at which instructions are executed per processor clock cycle, or IPC. For example, Buyuktosunoglu et al.'s heuristic [2001] periodically measures the IPC over fixed length intervals. If the IPC of the current interval is a factor smaller than that of the previous interval (indicating worse performance), then the heuristic increases the instruction queue size to increase the throughput. Otherwise it may decrease the queue size.

Bahar and Manne [2001] propose a similar heuristic targeting a simulated 8-wide issue processor that can be reconfigured as a 6-wide or 4-wide issue processor. They also measure IPC over fixed intervals but measure the floating point IPC separately from the integer IPC, since the former is more critical for floating point applications. Their heuristic, like Buyuktosunoglu et al.'s [2001], compares the measured IPC with specific thresholds to determine when to downsize the issue queue.

Folegnani and Gonzalez [2001], on the other hand, use a different heuristic. They find that when a program has little parallelism, the youngest part of the issue queue (which contains the most recent instructions) contributes very little to the overall IPC, in that instructions in this part are often committed late. Thus their heuristic monitors the contribution of the youngest part of this queue and deactivates this part when its contribution is minimal.
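An IPC-feedback resizing rule in the spirit of Buyuktosunoglu et al. can be sketched as follows. The comparison factor and partition counts are illustrative assumptions of ours, not the published tuning:

```python
def tune_queue(partitions, ipc_now, ipc_prev,
               factor=0.9, max_partitions=4):
    """One interval of an IPC-driven issue-queue heuristic: if IPC
    dropped noticeably versus the last interval, enable another
    partition; otherwise try disabling one. Constants are illustrative."""
    if ipc_now < factor * ipc_prev:
        return min(partitions + 1, max_partitions)  # performance fell: grow
    return max(partitions - 1, 1)                   # performance held: shrink

p = 2
p = tune_queue(p, ipc_now=1.0, ipc_prev=1.5)   # big IPC drop -> 3 partitions
p = tune_queue(p, ipc_now=1.5, ipc_prev=1.5)   # IPC steady   -> back to 2
```

The heuristic deliberately keeps probing downward, accepting an occasional bad interval as the price of discovering how small the queue can be.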
3.4.3. Algorithms for Reconfiguring Multiple Structures. In addition to this work, researchers have also been tackling the problem of reconfiguring multiple hardware units simultaneously. One group [Hughes et al. 2001; Sasanka et al. 2002; Hughes and Adve 2004] has been developing heuristics for combining hardware adaptations with dynamic voltage scaling during multimedia applications. Their strategy is to run two algorithms while individual video frames are being processed. A global algorithm chooses the initial DVS setting and baseline hardware configuration for each frame, while a local algorithm tunes various hardware parameters (e.g., instruction window sizes) while the frame is executing, to use up any remaining slack periods.

Another group [Iyer and Marculescu 2001, 2002a] has developed heuristics that adjust the pipeline width and register update unit (RUU) size for different hotspots, or frequently executed program regions. To detect these hotspots, their technique counts the number of times each branch instruction is taken. When a branch is taken enough times, it is marked as a candidate branch.

To detect how often candidate branches are taken, the heuristic uses a hotspot detection counter. Initially the hotspot counter contains a positive integer. Each time a candidate branch is taken, the counter is decremented, and each time a noncandidate branch is taken, the counter is incremented. If the counter becomes zero, this means that the program execution is inside a hotspot.

Once the program execution is in a hotspot, this heuristic tests the energy consumption of different processor configurations at fixed length intervals. It finally chooses the most energy saving configuration as the one to use the next time the same hotspot is entered. To measure energy at runtime, the heuristic relies on hardware that monitors usage statistics of the most power hungry units and calculates the total power based on energy-per-access values.

Huang et al. [2000; 2003] apply hardware adaptations at the granularity of subroutines. To decide what adaptations to apply, they use offline profiling to plot an energy-delay tradeoff curve. Each point in the curve represents the energy and delay tradeoff of applying an adaptation to a subroutine. There are three regions in the curve. The first includes adaptations that save energy without impairing performance; these are the adaptations that will always be applied. The third represents adaptations that worsen both performance and energy; these are the adaptations that will never be applied. Between these two extremes are adaptations that trade performance for energy savings; some of these may be applied, depending on the tolerable performance loss. The main idea is to trace the curve from the origin and apply every adaptation that one encounters until the cumulative performance loss from applying all the encountered adaptations reaches the maximum tolerable performance loss.

Ponomarev et al. [2001] develop heuristics for controlling a configurable issue queue, a configurable reorder buffer, and a configurable load/store queue. Unlike the IPC-based heuristics we have seen, these heuristics are occupancy-based, because the occupancy of hardware structures indicates how congested these structures are, and congestion is a leading cause of performance loss. Whenever a hardware structure (e.g., the instruction queue) has low occupancy, the heuristic reduces its size. In contrast, if the structure is completely full for a long time, the heuristic increases its size to avoid congestion.

Another group of researchers [Albonesi et al. 2003; Dropsho et al. 2002] explores the complex problem of configuring a processor whose adaptable components include two levels of caches, integer and floating point issue queues, load/store queues, and register files, as well as a reorder buffer. To dodge the sheer explosion of possible configurations, they tune each component based only on its local usage statistics. They propose two heuristics for doing this, one for tuning caches and another for tuning buffers and register files.

The caches they consider are selective way caches; each of their multiple ways can independently be activated or disabled.
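The hotspot detection counter described above is easy to model. The sketch follows the mechanism as described in the text; the initial counter value is an arbitrary choice of ours:

```python
def in_hotspot(branch_trace, candidates, start=8):
    """Walk a trace of taken branches, decrementing the hotspot
    counter on candidate branches and incrementing it otherwise.
    Returns True if the counter ever reaches zero, i.e., execution
    has entered a hotspot. `start` is an arbitrary initial value."""
    counter = start
    for branch in branch_trace:
        counter += -1 if branch in candidates else 1
        if counter == 0:
            return True
    return False

candidates = {"loop_head"}
hot_trace = ["loop_head"] * 8                      # a tight loop
cold_trace = ["loop_head", "exit", "loop_head"]    # mixed control flow
assert in_hotspot(hot_trace, candidates) is True
assert in_hotspot(cold_trace, candidates) is False
```

The increment on noncandidate branches makes the detector self-resetting: leaving the loop quickly pushes the counter back up, so only sustained runs of candidate branches trigger a reconfiguration.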
disabled. Each way has a most recently used (MRU) counter that is incremented every time that a cache search hits the way. The cache tuning heuristic samples this counter at fixed intervals to determine the number of hits in each way and the number of overall misses. Using these access statistics, the heuristic computes the energy and performance overheads for all possible cache configurations and chooses the best configuration dynamically.

The heuristic for controlling other structures such as issue queues is occupancy based. It measures how often different components of these structures fill up with data and uses this information to decide whether to upsize or downsize the structure.

3.5. Dynamic Voltage Scaling

Dynamic voltage scaling (DVS) addresses the problem of how to modulate a processor's clock frequency and supply voltage in lockstep as programs execute. The premise is that a processor's workloads vary and that when the processor has less work, it can be slowed down without affecting performance adversely. For example, if the system has only one task and it has a workload that requires 10 cycles to execute and a deadline of 100 seconds, the processor can slow down to 1/10 cycles/sec, saving power and meeting the deadline right on time. This is presumably more efficient than running the task at full speed and idling for the remainder of the period. Though it appears straightforward, this naive picture of how DVS works is highly simplified and hides serious real-world complexities.

3.5.1. Unpredictable Nature Of Workloads. The first problem is how to predict workloads with reasonable accuracy. This requires knowing what tasks will execute at any given time and the work required for these tasks. Two issues complicate this problem. First, tasks can be preempted at arbitrary times due to user and I/O device requests. Second, it is not always possible to accurately predict the future runtime of an arbitrary algorithm (cf. the Halting Problem [Turing 1937]). There have been ongoing efforts [Nilsen and Rygg 1995; Lim et al. 1995; Diniz 2003; Ishihara and Yasuura 1998] in the real-time systems community to accurately estimate the worst case execution times of programs. These attempts use either measurement-based prediction, static analysis, or a mixture of both. Measurement involves a learning lag whereby the execution times of various tasks are sampled and future execution times are extrapolated. The problem is that, when workloads are irregular, future execution times may not resemble past execution times. Microarchitectural innovations such as pipelining, hyperthreading, and out-of-order execution make it difficult to predict execution times in real systems statically. Among other things, a compiler needs to know how instructions are interleaved in the pipeline, what the probabilities are that different branches are taken, and when cache misses and pipeline hazards are likely to occur. This requires modeling the pipeline and memory hierarchy. A number of researchers [Ferdinand 1997; Clements 1996] have attempted to integrate complete pipeline and cache models into compilers. It remains nontrivial to develop models that can be used efficiently in a compiler and that capture all the complexities inherent in current systems.

3.5.2. Indeterminism And Anomalies In Real Systems. Even if the DVS algorithm predicts correctly what the processor's workloads will be, determining how fast to run the processor is nontrivial. The nondeterminism in real systems removes any strict relationships between clock frequency, execution time, and power consumption. Thus most theoretical studies on voltage scaling are based on assumptions that may sound reasonable but are not guaranteed to hold in real systems.

One misconception is that total microprocessor system power is quadratic in supply voltage. Under the CMOS transistor model, the power dissipation of individual transistors is quadratic in their supply voltages, but there remains no
precise way of estimating the power dissipation of an entire system. One can construct scenarios where total system power is not quadratic in supply voltage. Martin and Siewiorek [2001] found that the StrongARM SA-1100 has two supply voltages, a 1.5V supply that powers the CPU and a 3.3V supply that powers the I/O pins. The total power dissipation is dominated by the larger supply voltage; hence, even if the smaller supply voltage were allowed to vary, it would not reduce power quadratically. Fan et al. [2003] found that, in some architectures where active memory banks dominate total system power consumption, reducing the supply voltage dramatically reduces CPU power dissipation but has a minuscule effect on total power. Thus, for full benefits, dynamic voltage scaling may need to be combined with resource hibernation to power down other energy hungry parts of the chip.

Another misconception is that it is most power efficient to run a task at the slowest constant speed that allows it to exactly meet its deadline. Several theoretical studies [Ishihara and Yasuura 1998] attempt to prove this claim. These proofs rest on idealistic assumptions such as "power is a convex linearly increasing function of frequency", assumptions that ignore how DVS affects the system as a whole. When the processor slows down, peripheral devices may remain activated longer, consuming more power. An important study of this issue was done by Fan et al. [2003] who show that, for specific DRAM architectures, the energy versus slowdown curve is "U" shaped. As the processor slows down, CPU energy decreases but the cumulative energy consumed in active memory banks increases. Thus the optimal speed is actually higher than the processor's lowest speed; any speed lower than this causes memory energy dissipation to overshadow the effects of DVS.

A third misconception is that execution time is inversely proportional to clock frequency. It is actually an open problem how slowing down the processor affects the execution time of an arbitrary application. DVS may result in nonlinearities [Buttazzo 2002]. Consider a system with two tasks (Figure 15). When the processor slows down (a), Task One takes longer to enter its critical section and is thus preempted by Task Two as soon as Task Two is released into the system. When the processor instead runs at maximum speed (b), Task One enters its critical section earlier and blocks Task Two until it finishes this section. This hypothetical example suggests that DVS can affect the order in which tasks are executed. But the order in which tasks execute affects the state of the cache which, in turn, affects performance. The exact relationship between the cache state and execution time is not clear. Thus, it is too simplistic to assume that execution time is inversely proportional to clock frequency.

Fig. 15. Scheduling effects. When DVS is applied (a), task one takes a longer time to enter its critical section, and task two is able to preempt it right before it does. When DVS is not applied (b), task one enters its critical section sooner and thus prevents task two from preempting it until it finishes its critical section. This figure has been adopted from Buttazzo [2002].

All of these issues show that theoretical studies are insufficient for understanding how DVS will affect system state. One needs to develop an experimental approach driven by heuristics and evaluate the tradeoffs of these heuristics empirically.
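The interaction between CPU energy and memory energy described above can be made concrete with a small numerical sketch. The model below is illustrative only: the constants, the linear voltage-with-frequency scaling, and the fixed memory power are invented assumptions, not figures from Fan et al. [2003]. CPU energy per cycle falls quadratically as voltage and frequency drop, while active memory banks draw constant power for the whole (longer) run, producing a "U"-shaped total-energy curve whose minimum lies above the lowest speed.

```python
# Toy model of the energy-versus-slowdown "U" curve (invented numbers).
# CPU dynamic energy ~ C * V^2 per cycle, with V assumed to track f linearly.
# Active memory banks draw a fixed power for the entire execution time,
# so running slower makes them consume more energy.

def total_energy(f_ghz, cycles=1e9, c_eff=1.0, p_mem_watts=0.5):
    v = f_ghz                                  # crude assumption: V tracks f
    e_cpu = c_eff * v ** 2 * cycles * 1e-9     # dynamic CPU energy, arbitrary units
    t_sec = cycles / (f_ghz * 1e9)             # execution time stretches as f drops
    e_mem = p_mem_watts * t_sec                # memory draws power the whole time
    return e_cpu + e_mem

freqs = [0.2 + 0.1 * i for i in range(19)]     # candidate speeds, 0.2 .. 2.0 GHz
energies = [total_energy(f) for f in freqs]
best = freqs[energies.index(min(energies))]
print(f"energy-optimal speed in this toy model: {best:.1f} GHz")
```

The minimum falls at an interior frequency: both the lowest and the highest speeds waste energy, matching the qualitative shape of the curve reported by Fan et al.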
Most of the existing DVS approaches can be classified as interval-based approaches, intertask approaches, or intratask approaches.

3.5.3. Interval-Based Approaches. Interval-based DVS algorithms measure how busy the processor is over some interval or intervals, estimate how busy it will be in the next interval, and adjust the processor speed accordingly. These algorithms differ in how they estimate future processor utilization. Weiser et al. [1994] developed one of the earliest interval-based algorithms. This algorithm, Past, periodically measures how long the processor idles. If it idles longer than a threshold, the algorithm slows it down; if it remains busy longer than a threshold, it speeds it up. Since this approach makes decisions based only on the most recent interval, it is error prone. It is also prone to thrashing since it computes a CPU speed for every interval. Govil et al. [1995] examine a number of algorithms that extend Past by considering measurements collected over larger windows of time. The Aged Averages algorithm, for example, estimates the CPU usage for the upcoming interval as a weighted average of usages measured over all previous intervals. These algorithms save more energy than Past since they base their decisions on more information. However, they all assume that workloads are regular. As we have mentioned, it is very difficult, if not impossible, to predict irregular workloads using history information alone.

3.5.4. Intertask Approaches. Intertask DVS algorithms assign different speeds for different tasks. These speeds remain fixed for the duration of each task's execution. Weissel and Bellosa [2002] have developed an intertask algorithm that uses hardware events as the basis for choosing the clock frequencies for different processes. The motivation is that, as processes execute, they generate various hardware events (e.g., cache misses, IPC) at varying rates that relate to their performance and energy dissipation. Weissel and Bellosa's technique involves monitoring these event rates using hardware performance counters and attributing them to the processes that are executing. At each context switch, the heuristic adjusts the clock frequency for the process being activated based on its previously measured event rates. To do this, the heuristic refers to a previously constructed table of event rate combinations, divided into frequency domains. Weissel and Bellosa construct this table through a detailed offline simulation where, for different combinations of event rates, they determine the lowest clock frequency that does not violate a user-specified threshold for performance loss.

Intertask DVS has two drawbacks. First, task workloads are usually unknown until tasks finish running. Thus traditional algorithms either assume perfect knowledge of these workloads or estimate future workloads based on prior workloads. When workloads are irregular, estimating them is nontrivial. Flautner et al. [2001] address this problem by classifying all workloads as interactive episodes and producer-consumer episodes and using separate DVS algorithms for each. For interactive episodes, the clock frequency used depends not only on the predicted workload but also on a perception threshold that estimates how fast the episode would have to run to be acceptable to a user. To recover from prediction error, the algorithm also includes a panic threshold. If a session lasts longer than this threshold, the processor immediately switches to full speed.

The second drawback of intertask approaches is that they are unaware of program structure. A program's structure may provide insight into the work required for it, insight that, in turn, may provide opportunities for techniques such as voltage scaling. Within specific program regions, for example, the processor may spend significant time waiting for memory or disk accesses to complete. During the execution of these regions, the processor can be slowed down to meet pipeline stall delays.
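As a minimal illustration of the interval-based policies of Section 3.5.3, the sketch below pairs an Aged-Averages-style utilization estimator with a speed-setting step. All of the constants (the decay weight, the 90% busy target, and the frequency ladder) are invented for illustration and are not taken from Weiser et al. [1994] or Govil et al. [1995].

```python
# Sketch of interval-based DVS in the spirit of Past / Aged Averages.
# "utilization" is the fraction of the last interval the CPU was busy.
# All thresholds, weights, and frequency steps are made-up values.

FREQS = [0.6, 0.8, 1.0, 1.2, 1.4]   # available clock steps (GHz)

class AgedAverages:
    """Estimate next-interval CPU utilization as an exponentially aged average."""
    def __init__(self, decay=0.5):
        self.decay = decay
        self.estimate = 0.5          # neutral starting guess

    def update(self, utilization):
        # Newer intervals weigh more; older ones decay geometrically.
        self.estimate = self.decay * utilization + (1.0 - self.decay) * self.estimate
        return self.estimate

def pick_frequency(predicted_util, current_f):
    # Scale the current speed so predicted work roughly fills the interval,
    # then snap to the nearest available frequency step.
    target = predicted_util * current_f / 0.9    # aim to be ~90% busy
    return min(FREQS, key=lambda f: abs(f - target))

predictor = AgedAverages()
f = 1.4                                          # start at full speed
for util in [0.9, 0.8, 0.3, 0.2, 0.25]:          # measured busy fractions
    est = predictor.update(util)
    f = pick_frequency(est, f)
    print(f"predicted utilization {est:.2f} -> next interval at {f} GHz")
```

Because the estimate is a weighted history rather than a single interval, the policy reacts more smoothly than Past; it still fails, as the text notes, when past utilization says nothing about the future.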
3.5.5. Intratask Approaches. Intratask approaches [Lee and Sakurai 2000; Lorch and Smith 2001; Gruian 2001; Yuan and Nahrstedt 2003] adjust the processor speed and voltage within tasks. Lee and Sakurai [2000] developed one of the earliest approaches. Their approach, run-time voltage hopping, splits each task into fixed length timeslots. The algorithm assigns each timeslot the lowest speed that allows it to complete within its preferred execution time, which is measured as the worst case execution time minus the elapsed execution time up to the current timeslot. An issue not considered in this work is how to divide a task into timeslots. Moreover, the algorithm is pessimistic since tasks could finish before their worst case execution time. It is also insensitive to program structure.

There are many variations on this idea. Two operating system-level intratask algorithms are PACE [Lorch and Smith 2001] and Stochastic DVS [Gruian 2001]. These algorithms choose a speed for every cycle of a task's execution based on the probability distribution of the task workload measured over previous cycles. The difference between PACE and Stochastic DVS lies in their cost functions. Stochastic DVS assumes that energy is proportional to the square of the supply voltage, while PACE assumes it is proportional to the square of the clock frequency. These simplifying assumptions are not guaranteed to hold in real systems.

Dudani et al. [2002] have also developed intratask policies at the operating system level. Their work aims to combine intratask voltage scaling with the earliest-deadline-first scheduling algorithm. Dudani et al. split each program that the scheduler chooses to execute into two subprograms. The first subprogram runs at the maximum clock frequency, while the second runs slow enough to keep the combined execution time of both subprograms below the average execution time for the whole program.

In addition to these schemes, there have been a number of intratask policies implemented at the compiler level. Shin and Kim [2001] developed one of the first of these algorithms. They noticed that programs have multiple execution paths, some more time consuming than others, and that whenever control flows away from a critical path, there are opportunities for the processor to slow down and still finish the programs within their deadlines. Based on this observation, Shin and Kim developed a tool that profiles a program offline to determine worst case and average cycles for different execution paths, and then inserts instructions to change the processor frequency at the start of different paths based on this information.

Another approach, program checkpointing [Azevedo et al. 2002], annotates a program with checkpoints and assigns timing constraints for executing code between checkpoints. It then profiles the program offline to determine the average number of cycles between different checkpoints. Based on this profiling information and timing constraints, it adjusts the CPU speed at each checkpoint.

Perhaps the most well known approach is by Hsu and Kremer [2003]. They profile a program offline with respect to all possible combinations of clock frequencies assigned to different regions. They then build a table describing how each combination affects execution time and power consumption. Using this table, they select the combination of regions and frequencies that saves the most power without increasing runtime beyond a threshold.

3.5.6. The Implications of Memory Bounded Code. Memory bounded applications have become an attractive target for DVS algorithms because the time for a memory access, on most commercial systems, is independent of how fast the processor is running. When frequent memory accesses dominate the program execution time, they limit how fast the program can finish executing. Because of this "memory wall", the processor can actually run slower and save a lot of energy without losing as much performance as it would if it were slowing down a compute-intensive code.