Energy Conservation and Thermal
                  Management in High-Performance
                        Server Architectures
                                                           Adam Lewis
                            The Center for Advanced Computer Studies
                                The University of Louisiana at Lafayette




Tuesday, March 15, 2011
Agenda


                    • Background and Related Work
                    • System Modeling
                    • Effective Prediction
                    • Initial Evaluation and Results
                    • Thermally-Aware Scheduling
                    • Status, Plans, and Summary

What does this picture tell us?




                                                                  (c) The New York Times, June 14, 2006




                    • Source: McKinsey & Company 2008
                     • A 20% projected increase in data center emissions over the next 5 years
                    • Source: EPA 2008
                     • Only ~50% of power is consumed by IT equipment



Current Practice

                          Completely Fair Scheduler
                          Domain-based Load Balancing
                          Power-state aware

                          Run-queue scheduling
                          Domain-based Load Balancing
                          Power-state aware (Solaris 11)

                          Run-queue scheduling
                          Interface w/ power manager?



Thread Scheduling & Power Management

                    • DVFS: $P = C V^2 f$ (SpeedStep)
                     • Performance issues [LLBL 2007]
                      • Lack of slack
                      • High load = No gain
                     • Reliability issues [Bircher 2008]
                      • Under-clocking & MTBF
                     • Reactive rather than proactive
                    • Multi-core/Many-core
                     • Cache affinity
                     • Load balancing
                     • Opportunity to turn off the lights?
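
The DVFS relation on this slide can be made concrete with a short sketch. The capacitance, voltage, and frequency values below are hypothetical and chosen only to show how power scales; they are not measurements from the deck.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic CMOS power P = C * V^2 * f (farads, volts, hertz -> watts)."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical operating points, used only to illustrate the scaling.
nominal = dynamic_power(1e-9, 1.2, 2.0e9)   # nominal voltage and clock
scaled = dynamic_power(1e-9, 1.0, 1.5e9)    # lower voltage and clock
ratio = scaled / nominal                    # fraction of nominal power
print(f"nominal={nominal:.2f} W, scaled={scaled:.2f} W, ratio={ratio:.2f}")
```

Because voltage enters quadratically, lowering voltage together with frequency reduces power by more than the frequency reduction alone.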



Proactively Avoid Thermal Emergencies
        A Full-System Energy Model + Effective Prediction → Thermally-Aware Scheduling
                • Possible approaches
                 • Heat-and-Run and related approaches
                          [Gomaa2004] [Coskun2009] [Zhou2010]

                 • Memory-resource focused approaches
                          [Merkel2010]

                 • Control-theoretic techniques

System Modeling



Model: Inputs & Components
        $E_{system} = E_{proc} + E_{mem} + E_{hdd} + E_{board} + E_{em}$

        • Processor
        • Memory
        • Hard disk & storage devices
        • Motherboard & peripherals
        • Electrical & electromechanical components

        $E_{dc} = E_{system}$

        • Three DC voltage domains
         • 12Vdc, 5.5Vdc, 3.3Vdc
        • 5.5V and 3.3V domains limited to 20% of rated voltage



Model: Processor
        $E_{proc} = \int_{t_1}^{t_2} P_{proc}(t)\,dt$

        [Figure: block diagrams of the two test processors. AMD Opteron: two dual-core packages (per-core I-cache and D-cache, L2 caches, system request interface, crossbar switch, integrated memory controller with DDR2 DRAM, host bridge) linked by Coherent HyperTransport (cHT), with a HyperTransport bus to the SouthBridge and the board-level power consumers (USB, HDD, DVD, VGA, Ethernet, graphics). Intel Nehalem: cores, memory, QPLC, QPLIO, input/output handler, PCI Express.]

        • Processor as black box
         • Power = f(workload)
         • Manifests as heat
        • Bus transactions
         • Reflects amount of data processed
        • Die temperature
         • Computation per core
        • Processor system metrics
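
The processor energy integral can be approximated from sampled power readings. The sketch below uses the trapezoidal rule. The function name and the sample values are illustrative assumptions, not part of the original model.

```python
def energy_from_power_samples(times, powers):
    """Approximate the integral of P(t) dt with the trapezoidal rule.

    times are in seconds and powers in watts, so the result is in joules.
    """
    if len(times) != len(powers):
        raise ValueError("times and powers must have the same length")
    energy = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        energy += 0.5 * (powers[i] + powers[i - 1]) * dt
    return energy


# Hypothetical one-second samples of processor power.
times = [0.0, 1.0, 2.0, 3.0]
powers = [80.0, 90.0, 100.0, 90.0]
e_proc = energy_from_power_samples(times, powers)
print(e_proc)
```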




Model: Memory
        $E_{mem} = \int_{t_1}^{t_2} \left( \left( \sum_{i=1}^{N} CM_i(t) + DB(t) \right) \times P_{DR} + P_{ab} \right) dt$

        • DRAM read/write power + background power = known quantities
        • Performance counters exist for measuring the count of highest-level cache misses and bus transactions
        • Combine these to compute the energy consumed
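
A discretized reading of the memory equation is sketched below. It assumes the counters are sampled at a fixed interval, that $P_{DR}$ acts as a per-event cost factor, and that $P_{ab}$ is a constant background term. These unit conventions and all sample values are assumptions made for illustration.

```python
def memory_energy(cache_misses, bus_transactions, p_dr, p_ab, dt):
    """Rectangle-rule sum of ((sum_i CM_i + DB) * P_DR + P_ab) * dt.

    cache_misses: one list of per-core miss counts (CM_i) per sample
    bus_transactions: one bus-transaction count (DB) per sample
    """
    energy = 0.0
    for cm_per_core, db in zip(cache_misses, bus_transactions):
        integrand = (sum(cm_per_core) + db) * p_dr + p_ab
        energy += integrand * dt
    return energy


# Two hypothetical samples from a two-core machine.
cache_misses = [[10, 20], [30, 40]]
bus_transactions = [5, 15]
e_mem = memory_energy(cache_misses, bus_transactions, p_dr=0.01, p_ab=2.0, dt=1.0)
print(e_mem)
```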




Model: Storage
                                                               
        $E_{hdd} = P_{spin-up} \times T_{su} + P_{read} \sum N_r \times T_r + P_{write} \sum N_w \times T_w + \sum P_{idle} \times T_{id}$

        Parameter                     Value
        Interface                     Serial ATA
        Capacity                      250 GB
        Rotational speed              7200 rpm
        Power (spin up)               5.25 W (max)
        Power (random read, write)    9.4 W (typical)
        Power (silent read, write)    7 W (typical)
        Power (idle)                  5 W (typical)
        Power (low RPM idle)          2.3 W (typical for 4500 RPM)
        Power (standby)               0.8 W (typical)
        Power (sleep)                 0.6 W (typical)
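
As a rough sketch, the storage equation can be evaluated with the power figures from the table. The operation counts, per-operation times, and idle durations below are hypothetical, and the function signature is an assumption made for this example.

```python
def disk_energy(p_spin_up, t_spin_up, p_read, reads, p_write, writes,
                p_idle, idle_times):
    """Evaluate the E_hdd sum; reads/writes are (count, seconds_per_op) pairs."""
    read_time = sum(n * t for n, t in reads)
    write_time = sum(n * t for n, t in writes)
    return (p_spin_up * t_spin_up
            + p_read * read_time
            + p_write * write_time
            + p_idle * sum(idle_times))


# Power values come from the table; workload numbers are made up.
e_hdd = disk_energy(
    p_spin_up=5.25, t_spin_up=10.0,
    p_read=9.4, reads=[(1000, 0.001)],
    p_write=9.4, writes=[(500, 0.002)],
    p_idle=5.0, idle_times=[20.0, 30.0],
)
print(round(e_hdd, 1))
```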




Model: Board
                                                                         
        $E_{board} = \left( \sum V_{power-line} \times I_{power-line} \right) \times t_{interval}$


                    •     System components that
                          support the operation of
                          the machine
                          •  Typically in the 5.5Vdc
                             and 3.3Vdc power
                             domains
                          •  Measured by current
                             probe
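
The board term is a sum of per-line power multiplied by the measurement interval. A minimal sketch, assuming one (voltage, current) reading per power line and hypothetical probe values:

```python
def board_energy(line_readings, interval_s):
    """Sum V * I over the power lines, then multiply by the interval (joules)."""
    total_power = sum(volts * amps for volts, amps in line_readings)
    return total_power * interval_s


# Hypothetical current-probe readings on the 5.5 Vdc and 3.3 Vdc lines.
e_board = board_energy([(5.5, 1.0), (3.3, 2.0)], interval_s=10.0)
print(round(e_board, 1))
```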




Model: Electromechanical
        $E_{em} = \int_{0}^{T_p} \left( V(t) \cdot I(t) + \sum_{i=1}^{N} P_{fan}^{i}(t) \right) dt$




                    •     Need to account for
                          energy required to cool
                          • No performance
                            counters
                          • Can measure power
                            drawn by the fans
                          • Derived from log data
                            collected by OS




Effective Prediction



Linear AR Time Series - A good idea?

      • Linear Regression
       • Easy, simple
       • Odd mis-predictions
       • Corrective methods required

        Linear AR Model: AMD Opteron
        (from Table III, model errors for CAP, AR(1), and MARS on an AMD Opteron server)

        Benchmark    Avg Err %    Max Err %    RMSE
        astar        3.1%         8.9%         2.26
        games        2.2%         9.3%         2.06
        gobmk        1.7%         9.0%         2.30
        zeusmp       2.8%         8.1%         2.14

        Linear AR Model: Intel Nehalem
        (Table IV, model errors for AR on an Intel Nehalem server)

        Benchmark    Avg Err %    Max Err %    RMSE
        astar        5.9%         28.5%        4.94
        games        5.6%         44.3%        5.54
        gobmk        5.3%         27.8%        4.83
        zeusmp       7.7%         31.8%        7.24
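
For reference, a first-order autoregressive fit and the three error columns can be sketched as follows. The least-squares fit is the standard AR(1) form. The exact definitions of the average and maximum percentage error are an assumption about how the tables were computed, and the power series is synthetic.

```python
import math


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var
    return mean_next - phi * mean_prev, phi


def error_metrics(actual, predicted):
    """Return (average error %, maximum error %, RMSE)."""
    pct = [abs(p - a) / abs(a) * 100.0 for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))
    return sum(pct) / len(pct), max(pct), rmse


# Synthetic series that follows x[t] = 20 + 0.8 * x[t-1] exactly.
series = [150.0]
for _ in range(20):
    series.append(20.0 + 0.8 * series[-1])

c, phi = fit_ar1(series)
predicted = [c + phi * x for x in series[:-1]]
avg_err, max_err, rmse = error_metrics(series[1:], predicted)
print(c, phi, avg_err, max_err, rmse)
```

On data that really is linear the fit is essentially exact. The large maximum errors in the Intel table suggest the measured power series is not well described by this form.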
Prediction w/ Chaotic Time Series
        TABLE 4
        Indications of chaotic behavior in power time series (AMD, Intel)

        Benchmark    Hurst Parameter (H)    Average Lyapunov Exponent
        bzip2        (0.96, 0.93)           (0.28, 0.35)
        cactusadm    (0.95, 0.97)           (0.01, 0.04)
        gromacs      (0.94, 0.95)           (0.02, 0.03)
        leslie3d     (0.93, 0.94)           (0.05, 0.11)
        omnetpp      (0.96, 0.97)           (0.05, 0.06)
        perlbench    (0.98, 0.95)           (0.06, 0.04)

        Chaotic Time Series
        • Time-delay reconstructed state space
        • Uses Takens Embedding Theorem:
         • Time-delayed partition of observations to build a function that preserves the topological and dynamical properties of our original chaotic system
        • Find nearest neighbors on attractor to our observations
        • Perform least-squares curve fit to find a polynomial that approximates the attractor

        The Lyapunov exponent can be calculated using:

        $\lambda = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \ln |f'(X_n)|$

        We found a positive Lyapunov exponent when performing this calculation on our data set, ranging from 0.01 to 0.28 (or 0.03 to 0.35) on the AMD (or Intel) test server, as listed in Table 4, where each pair gives the (AMD, Intel) values.
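
The two ingredients on this slide, time-delay embedding and the Lyapunov sum, can be illustrated on a system whose map is known. The sketch below uses the logistic map as a stand-in, which is an assumption made purely for illustration. For measured power data no closed-form $f$ is available and the exponent has to be estimated from the observations.

```python
import math


def delay_embed(series, dim, tau):
    """Return delay vectors (x[t], x[t - tau], ..., x[t - (dim - 1) * tau])."""
    start = (dim - 1) * tau
    return [tuple(series[t - k * tau] for k in range(dim))
            for t in range(start, len(series))]


def lyapunov_exponent(f, df, x0, n, burn_in=100):
    """Average ln|f'(X_n)| along an orbit of the map f, after a burn-in."""
    x = x0
    for _ in range(burn_in):
        x = f(x)
    total = 0.0
    for _ in range(n):
        # Guard against log(0) if the orbit lands exactly on a critical point.
        total += math.log(max(abs(df(x)), 1e-300))
        x = f(x)
    return total / n


def logistic(x):
    return 4.0 * x * (1.0 - x)


def logistic_deriv(x):
    return 4.0 - 8.0 * x


lam = lyapunov_exponent(logistic, logistic_deriv, x0=0.1, n=100000)
vectors = delay_embed([1, 2, 3, 4, 5], dim=2, tau=1)
print(lam > 0, vectors)
```

A positive average indicates that nearby orbits diverge exponentially, which is the property Table 4 reports for the power traces. For this particular map the theoretical value is ln 2.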
Kernel weighting

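        1.
           $K(x) = (2\pi)^{-m/2} \exp\left(-\|x\|^2 / 2\right)$

           $K_\beta(x) = \frac{1}{\beta} K\left(\frac{x}{\beta}\right)$

        2.
           $\beta = \left( \frac{4}{3p} \right)^{1/5} \bar{\sigma}$

           $\bar{\sigma} = \mathrm{median}(|x_i - \bar{\mu}|) / 0.6745$

        3.
           $\hat{f}(x) = \frac{\sum_{t=p+1}^{n+p} O_p * K_\beta(X_{t-1} - x)}{\sum_{t=p+1}^{n+p} K_\beta(X_{t-1} - x)}$

           $O_p = (X_{t-1}, \ldots, X_{t-p})^T$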
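
The kernel, bandwidth, and weighted-average steps can be sketched for the scalar case (m = 1, p = 1). Two assumptions are made here and should be read as such: $\bar{\mu}$ is taken to be the sample median, and the estimator is written in the standard Nadaraya-Watson form with $X_t$ as the response, which may differ from how the slide's $O_p$ term is applied.

```python
import math
import statistics


def gaussian_kernel(x):
    """K(x) = (2*pi)^(-1/2) * exp(-x^2 / 2) for scalar x (m = 1)."""
    return (2.0 * math.pi) ** -0.5 * math.exp(-x * x / 2.0)


def scaled_kernel(x, beta):
    """K_beta(x) = K(x / beta) / beta."""
    return gaussian_kernel(x / beta) / beta


def bandwidth(samples, p=1):
    """beta = (4 / (3p))^(1/5) * sigma, sigma = median(|x_i - center|) / 0.6745."""
    center = statistics.median(samples)   # assumption: mu-bar is the median
    sigma = statistics.median([abs(x - center) for x in samples]) / 0.6745
    return (4.0 / (3.0 * p)) ** 0.2 * sigma


def kernel_predict(series, x):
    """Kernel-weighted average of X_t over all pairs (X_{t-1}, X_t)."""
    beta = bandwidth(series)
    num = den = 0.0
    for prev, nxt in zip(series[:-1], series[1:]):
        weight = scaled_kernel(prev - x, beta)
        num += nxt * weight
        den += weight
    return num / den


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]   # synthetic data
beta = bandwidth(series)
prediction = kernel_predict(series, 4.0)
print(beta, prediction)
```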



Forward prediction
                • Start with a Taylor series expansion

                  $\hat{f}(X) = \hat{f}(x) + \hat{f}'(x)^T (X - x)$

                • Find the coefficients of the polynomial by solving the linear least squares problem for a and b:

                  $\sum_{t=p+1}^{n+p} \left[ X_t - a - b^T (X_{t-1} - x) \right]^2 * K_\beta(X_{t-1} - x)$

                • Explicit solution for our linear least squares problem:

                  $\hat{f}(x) = \frac{1}{n} \sum_{t=p+1}^{n+p} \left( s_2 - s_1 * (x - X_{t-1}) \right)^2 * K_\beta\left( (x - X_{t-1}) / \beta \right)$

                  $s_i = \frac{1}{n} \sum_{t=p+1}^{n+p} (x - X_{t-1})^i * K_\beta\left( (x - X_{t-1}) / \beta \right)$
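
The weighted least-squares problem for a and b can also be solved directly from its normal equations. The sketch below does this for a single lag (p = 1) with a Gaussian kernel. It is a generic local linear fit under those assumptions, not a transcription of the explicit $s_1$, $s_2$ formula on the slide, and the data are synthetic.

```python
import math


def local_linear_fit(prev_values, next_values, x, beta):
    """Minimize sum w_t * (X_t - a - b * (X_{t-1} - x))^2 over a and b.

    w_t is a Gaussian kernel weight on (X_{t-1} - x) / beta.
    Returns (a, b); a is the local prediction at x.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for prev, nxt in zip(prev_values, next_values):
        u = prev - x
        w = math.exp(-0.5 * (u / beta) ** 2) / (beta * math.sqrt(2.0 * math.pi))
        s0 += w
        s1 += w * u
        s2 += w * u * u
        t0 += w * nxt
        t1 += w * u * nxt
    det = s0 * s2 - s1 * s1
    a = (s2 * t0 - s1 * t1) / det
    b = (s0 * t1 - s1 * t0) / det
    return a, b


# Synthetic pairs obeying X_t = 2 + 3 * X_{t-1}.
prev_values = [0.0, 1.0, 2.0, 3.0, 4.0]
next_values = [2.0, 5.0, 8.0, 11.0, 14.0]
a, b = local_linear_fit(prev_values, next_values, x=2.5, beta=1.0)
print(a, b)
```

On exactly linear data the fit recovers the line regardless of the kernel weights, which gives a convenient sanity check before applying it to measured power traces.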


Time Complexity




                   n future observations               p past observations



                                  Creating a CAP: $O(n^2)$

                            Predicting with a CAP: O(p)


Initial Evaluation and Results


Initial Evaluation and Results

        TABLE 5
        SPEC CPU2006 benchmarks used for model calibration (Training Benchmarks)

        Integer Benchmarks
        bzip2        C        Compression
        mcf          C        Combinatorial Optimization
        omnetpp      C++      Discrete Event Simulation

        FP Benchmarks
        gromacs      C/F90    Biochemistry/Molecular Dynamics
        cactusADM    C/F90    Physics/General Relativity
        leslie3d     F90      Fluid Dynamics
        lbm          C        Fluid Dynamics

        TABLE 6
        SPEC CPU2006 benchmarks used for evaluation (Evaluation Benchmarks)

        Integer Benchmarks
        astar        C++      Path Finding
        gobmk        C        Artificial Intelligence: Go

        FP Benchmarks
        calculix     C++/F90  Structural Mechanics
        zeusmp       F90      Computational Fluid Dynamics

        TABLE 7
        Test hardware configuration

                         Sun Fire 2200    Dell PowerEdge R610
        CPU              2 AMD Opteron    2 Intel Xeon (Nehalem) 5500
        CPU L2 cache     2x2MB            4MB
        Memory           8GB              9GB
        Internal disk    2060GB           500GB
        Network          2x1000Mbps       1x1000Mbps
        Video            On-board         NVIDIA Quadro FX4600
        Height           1 rack unit      1 rack unit

        ... using two criteria: sufficient coverage of the functional units in the processor and reasonable applicability to the problem space. Components of the processor affect the thermal envelope in different ways [40]. This issue is addressed by balancing the benchmark selection between integer and floating point benchmarks in the SPEC CPU2006 benchmark suite. Second, the ...

        ... CAT increases linearly, as can be obtained in Eq. (15). The actual computation time results for our CAP code implemented using MATLAB run on machines (detailed in Table 7) with respect to different n and p values are provided in the next section.

        5 EVALUATION

        A set of experiments was carried out to evaluate the performance of power models built using CAP techniques to approximate a solution for dynamic systems following Eq. (12). The purpose of the first experiment was to confirm the time complexity of CAP. The behavior of CAP was simulated using MATLAB on the hardware described in Table 7 with varying values of n future observations and p past observations. Fig. 6 illustrates the behavior of CAP as the value of n is varied and confirms the $O(n^2)$ behavior of the predictor in this case. The behavior of CAP as p is varied is shown in Fig. 7 and supports the claim of linear behavior.

Results: AMD Opteron f10h




 [Figure panels: (a) Astar/CAP, (b) Astar/AR(1), (c) Zeusmp/CAP, (d) Zeusmp/AR(1).]

 Fig. 8. Actual power results versus predicted results for AMD Opteron.



Results: Intel Nehalem
 [Figure panels: (a) Astar/CAP, (b) Astar/AR(1), (c) Zeusmp/CAP, (d) Zeusmp/AR(1).]

 Fig. 9. Actual power results versus predicted results for an Intel Nehalem server.
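
 A comparison of actual and predicted power such as the one in these figures can be summarized with the error measures reported in the model error tables (average error %, maximum error %, RMSE). The following is a minimal sketch, assuming the percentage error is taken relative to the measured power at each sample. The function name and the sample values are illustrative and do not come from the study.

```python
import math

def error_metrics(actual, predicted):
    """Return (avg_err_pct, max_err_pct, rmse) for paired power samples in watts.

    Assumption: per-sample error is |predicted - actual| / actual * 100.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    pct = [abs(p - a) / a * 100.0 for a, p in zip(actual, predicted)]
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(squared) / len(squared))
    return sum(pct) / len(pct), max(pct), rmse

# Illustrative values only (not measurements from the study).
actual = [200.0, 210.0, 220.0, 205.0]
predicted = [198.0, 214.2, 220.0, 207.05]
avg_err, max_err, rmse = error_metrics(actual, predicted)
print(f"avg {avg_err:.2f}%  max {max_err:.2f}%  RMSE {rmse:.2f} W")
```

 Reporting the maximum alongside the average matters here because a predictor can have a low average error while still producing occasional large mis-predictions.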




Results: Error - Other Benchmarks




Observations and Analysis


                    • Where does maximum error occur?
                    • Choice of performance counters
                      • Difference in behavior between processors?
                      • The right set of performance counters
                    • Benchmark selection

Thermally-Aware Scheduling


Problem nature
                    • Scheduling...
                      • in time: who runs next
                      • in space: who runs where
                    • Optimization problem
                      • Who runs next: least use of energy with best performance quality of service
                      • Who runs where: best utilization of resources with least increase in
                        processor and/or ambient temperature


Thermal Extensions to System Model

                    1. Applications have a length $L(A, D_A, t)$ and generate workload

                       $$U(A, D_A, t) = \lim_{n \to k_e} n \times W(p_i, d_i, t) \times L_n(A_n, D_{A_n}, t), \quad 1 \le i \le p$$

                    2. For which we define the Thermal Equivalent of Application

                       $$\Theta_A(A, D_A, T, t) = \frac{U(A, D_A, t)}{\lim_{T \to T_{th}} J_e \times (T - T_{nominal})}$$

                    3. Which is used to generate the Thermal Efficiency to Completion

                       $$\eta(A, D_A, T, t) = \frac{\Theta_A(A, D_A, T, t)}{\Theta_A(A_e, D_{A_e}, T_{me}, L_e)}$$

                    4. That is used to compute the Cost of Performance per Unit Power

                       $$C_\theta(A, D_A, T, t) = \frac{\Theta_A(A, D_A, T, t)}{E_{sys}(A, D_A, t)}$$
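
 To make the chain of definitions on this slide concrete, the sketch below evaluates the thermal equivalent, the thermal efficiency, and the cost metric for made-up inputs. It rests on several assumptions not stated in the source: the limit T → T_th is evaluated by direct substitution, the application A_e in the denominator of η is treated as a reference run, and all numeric values and units are hypothetical.

```python
def thermal_equivalent(workload, j_e, t_threshold, t_nominal):
    """Theta_A = U / (J_e * (T_th - T_nominal)).

    Assumption: the limit T -> T_th is taken by substituting T = T_th.
    """
    delta = t_threshold - t_nominal
    if delta <= 0 or j_e <= 0:
        raise ValueError("expects T_th > T_nominal and J_e > 0")
    return workload / (j_e * delta)

def thermal_efficiency(theta, theta_reference):
    """eta = Theta_A of the application / Theta_A of the reference run (A_e)."""
    return theta / theta_reference

def cost_metric(theta, e_sys):
    """C_theta = Theta_A / E_sys."""
    return theta / e_sys

# Hypothetical inputs: workloads in arbitrary units, temperatures in deg C.
theta_app = thermal_equivalent(workload=1200.0, j_e=2.0, t_threshold=80.0, t_nominal=40.0)
theta_ref = thermal_equivalent(workload=1000.0, j_e=2.0, t_threshold=80.0, t_nominal=40.0)
print(theta_app, thermal_efficiency(theta_app, theta_ref), cost_metric(theta_app, e_sys=300.0))
```

 Because η is a ratio of two thermal equivalents, any common scaling of J_e or of the temperature margin cancels out in it, whereas C_θ keeps the scale of Θ_A relative to the system energy.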




Extending CAP for Thermal Prediction


                    • Thermal Chaotic Attractor Predictor
                          (TCAP)
                          • Extends CAP to thermal domain
                          • Created and used in similar manner to
                            CAP
                          • Matching TCAP for each thermal metric


Reducing Processor Temperatures



                    • Premise: Processor die temperature can be
                          managed by controlling what threads
                          execute over time
                    • Predict the next thread to run on a logical
                          CPU using TCAP for processor die
                          temperature
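
 The premise above can be sketched as a selection step over the runnable threads. This is only an illustration: `predict_die_temp` is a hypothetical stand-in for a TCAP-style predictor, and the policy shown (choose the thread with the lowest predicted die temperature, and idle if every candidate exceeds a limit) is an assumption, not the scheduler described in the source.

```python
from typing import Callable, Iterable, Optional

def pick_next_thread(
    runnable: Iterable[str],
    predict_die_temp: Callable[[str], float],
    temp_limit_c: float,
) -> Optional[str]:
    """Pick the runnable thread with the lowest predicted die temperature.

    Returns None when every candidate is predicted to exceed temp_limit_c,
    which leaves the logical CPU idle so that it can cool.
    """
    best, best_temp = None, float("inf")
    for thread in runnable:
        temp = predict_die_temp(thread)
        if temp <= temp_limit_c and temp < best_temp:
            best, best_temp = thread, temp
    return best

# Hypothetical predictions in degrees Celsius, keyed by thread name.
predicted = {"astar": 71.5, "zeusmp": 78.0, "gobmk": 69.0}
choice = pick_next_thread(predicted, lambda t: predicted[t], temp_limit_c=75.0)
print(choice)
```

 Passing the predictor in as a callable keeps the selection logic independent of how the temperature forecast is produced.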




Reducing Ambient Temperature

                    • Premise: Control system ambient
                          temperature by managing load on logical
                          CPUs so that overheated resources have
                          time to recover
                    • Partition resources into categories based
                          on predicted change in temperature
                    • Move workload from “HOT” resources
                          towards “COLD” resources
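
 The partition-and-migrate idea above can be sketched as follows. The thresholds, the intermediate "WARM" category, and the rule of moving one thread per HOT CPU to the least-loaded COLD CPU are assumptions made for illustration. The predicted temperature changes would come from a TCAP-style predictor, represented here by a plain dictionary.

```python
def classify(predicted_delta_c, hot_threshold=2.0, cold_threshold=0.0):
    """Label each logical CPU by its predicted temperature change (deg C)."""
    labels = {}
    for cpu, delta in predicted_delta_c.items():
        if delta >= hot_threshold:
            labels[cpu] = "HOT"
        elif delta <= cold_threshold:
            labels[cpu] = "COLD"
        else:
            labels[cpu] = "WARM"  # hypothetical middle category
    return labels

def rebalance(run_queues, labels):
    """Move one thread from each HOT CPU to the COLD CPU with the shortest queue.

    Mutates run_queues in place and returns a list of (thread, source, target).
    """
    cold = [cpu for cpu, label in labels.items() if label == "COLD"]
    moves = []
    for cpu, label in labels.items():
        if label != "HOT" or not run_queues[cpu] or not cold:
            continue
        target = min(cold, key=lambda c: len(run_queues[c]))
        thread = run_queues[cpu].pop()
        run_queues[target].append(thread)
        moves.append((thread, cpu, target))
    return moves

# Hypothetical predicted temperature changes and run queues for four logical CPUs.
labels = classify({0: 3.1, 1: -0.5, 2: 0.8, 3: -1.2})
queues = {0: ["t1", "t2", "t3"], 1: ["t4"], 2: ["t5"], 3: []}
print(labels, rebalance(queues, labels))
```

 Moving only one thread per pass is a conservative choice that limits migration overhead and the loss of cache affinity.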



Status, Plans, and Summary


Current Status

     A Full-System Energy Model + Effective Prediction
        • Development Complete
        • Evaluation Complete
           • Intel + AMD processors
           • OpenSolaris (Solaris 11)
        • Peer-reviewed
           • Conference/Workshop: 3
           • Journal: 1

     Thermally-Aware Scheduling
        • Design complete
        • Prototype under development


Plan for Completion

                          ID Task

                          1   Respond to review comments for [Lewis 2011]

                          2   Implement scheduler prototype in FreeBSD

                          3   Evaluate scheduler performance using parallel benchmarks

                          4   Document results and submit to archival journal

                          5   Create dissertation from Prospectus + output from previous task

                          6   Defend dissertation

                          7   Respond to comments from committee and Graduate School editor

                          8   Submit final version of document




Future Directions


                    • Extend beyond a single blade
                     • Cluster, Grid, and Cloud Scheduling
                     • MPI, OpenMP, and other environments
                    • Impact of operating system virtualization
                    • Extension of the thermal model in terms of
                          the thermodynamics of computation




Questions?



Additional Material



This work was supported in part by the U.S.
                          Department of Energy and by the Louisiana
                                      Board of Regents




Publications List
                          Lewis, A., Ghosh, S., and Tzeng, N.-F. 2008. Run-
                          time energy consumption estimation based on
                          workload in server systems. Proceedings of the
                          2008 conference on Power aware computing and
                          systems.

                          Lewis, A., Simon, J., and Tzeng, N.-F. 2010.
                          Chaotic attractor prediction for server run-time
                          energy consumption. Proc. of the 2010 Workshop on
                          Power Aware Computing and Systems (Hotpower’10).

                          Lewis, A., Tzeng, N.-F., and Ghosh, S. 2011. Time
                          series approximation of run-time energy
                          consumption based on server workload. Under
                          review for publication in ACM Transactions on
                          Architecture and Code Optimization.




