Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers
Google-Wide Profiling (GWP), a continuous profiling infrastructure for data centers, provides performance insights for cloud applications. With negligible overhead, GWP provides stable, accurate profiles and a datacenter-scale tool for traditional performance analyses. Furthermore, GWP introduces novel applications of its profiles, such as application-platform affinity measurements and identification of platform-specific, microarchitectural peculiarities.


Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt, Google

As cloud-based computing grows in pervasiveness and scale, understanding datacenter applications' performance and utilization characteristics is critically important, because even minor performance improvements translate into huge cost savings. Traditional performance analysis, which typically needs to isolate benchmarks, can be too complicated or even impossible with modern datacenter applications. It's easier and more representative to monitor datacenter applications running on live traffic. However, application owners won't tolerate latency degradations of more than a few percent, so these tools must be nonintrusive and have minimal overhead. As with all profiling tools, observer distortion must be minimized to enable meaningful analysis. (For additional information on related techniques, see the "Profiling: From single systems to data centers" sidebar.)

Sampling-based tools can bring overhead and distortion to acceptable levels, so they're uniquely qualified for performance monitoring in the data center. Traditionally, sampling tools run on a single machine, monitoring specific processes or the system as a whole. Profiling can begin on demand to analyze a performance problem, or it can run continuously.[1]

Google-Wide Profiling can be theoretically viewed as an extension of the Digital Continuous Profiling Infrastructure (DCPI)[1] to data centers. GWP is a continuous profiling infrastructure; it samples across machines in multiple data centers and collects various events—such as stack traces, hardware events, lock contention profiles, heap profiles, and kernel events—allowing cross-correlation with job scheduling data, application-specific data, and other information from the data centers.

GWP collects daily profiles from several thousand applications running on thousands of servers, and the compressed profile
0272-1732/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.
DATACENTER COMPUTING

Sidebar: Profiling: From single systems to data centers
From gprof[1] to Intel VTune (http://software.intel.com/en-us/intel-vtune), profiling has long been standard in the development process. Most profiling tools focus on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important.

Continuous, always-on profiling became more important with Morph and DCPI for Digital UNIX. Morph uses low overhead system-wide sampling (approximately 0.3 percent) and binary rewriting to continuously evolve programs to adapt to their host architectures.[2] DCPI gathers much more robust profiles, such as precise stall times and causes, and focuses on reporting information to users.[3]

OProfile, a DCPI-inspired tool, collects and reports data much in the same way for a plethora of architectures, though it offers less sophisticated analysis. GWP uses OProfile (http://oprofile.sourceforge.net) as a profile source. High-performance computing (HPC) shares many profiling challenges with cloud computing, because profilers must gather representative samples with minimal overhead over thousands of nodes.

Cloud computing has additional challenges—vastly disparate applications, workloads, and machine configurations. As a result, different profiling strategies exist for HPC and cloud computing. HPCToolkit uses lightweight sampling to collect call stacks from optimized programs and compares profiles to identify scaling issues as parallelism increases.[4] Open|SpeedShop (www.openspeedshop.org) provides similar functionality. Similarly, Upshot[5] and Jumpshot[6] can analyze traces (such as MPI calls) from parallel programs but aren't suitable for continuous profiling.

References
1. S.L. Graham, P.B. Kessler, and M.K. McKusick, "Gprof: A Call Graph Execution Profiler," Proc. Symp. Compiler Construction (CC 82), ACM Press, 1982, pp. 120-126.
2. X. Zhang et al., "System Support for Automatic Profiling and Optimization," Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 15-26.
3. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
4. N.R. Tallent et al., "Diagnosing Performance Bottlenecks in Emerging Petascale Applications," Proc. Conf. High Performance Computing Networking, Storage and Analysis (SC 09), ACM Press, 2009, no. 51.
5. V. Herrarte and E. Lusk, Studying Parallel Program Behavior with Upshot, tech. report, Argonne National Laboratory, 1991.
6. O. Zaki et al., "Toward Scalable Performance Visualization with Jumpshot," Int'l J. High Performance Computing Applications, vol. 13, no. 3, 1999, pp. 277-288.




database grows by several Gbytes every day. Profiling at this scale presents significant challenges that don't exist for a single machine. Verifying that the profiles are correct is important and challenging because the workloads are dynamic. Managing profiling overhead becomes far more important as well, as any unnecessary profiling overhead can cost millions of dollars in additional resources. Finally, making the profile data universally accessible is an additional challenge. GWP is also a cloud application, with its own scalability and performance issues.

With this volume of data, we can answer typical performance questions about datacenter applications, including the following:

- What are the hottest processes, routines, or code regions?
- How does performance differ across software versions?
- Which locks are most contended?
- Which processes are memory hogs?
- Does a particular memory allocation scheme benefit a particular class of applications?
- What is the cycles per instruction (CPI) for applications across platforms?

Additionally, we can derive higher-level answers to more complex but interesting questions, such as which compilers were used for applications in the fleet, whether there are more 32-bit or 64-bit applications running, and how much utilization is being lost by suboptimal job scheduling.

Infrastructure
Figure 1 provides an overview of the entire GWP system.

Collector
GWP samples in two dimensions. At any moment, profiling occurs only on a small subset of all machines in the fleet, and event-based sampling is used at the machine level. Sampling in only one dimension would
be unsatisfactory; if event-based profiling were active on every machine all the time, at a normal event-sampling rate, we would be using too many resources across the fleet. Alternatively, if the event-sampling rate is too low, profiles become too sparse to drill down to the individual machine level. For each event type, we choose a sampling rate high enough to provide meaningful machine-level data while still minimizing the distortion caused by the profiling on critical applications. The system has been actively profiling nearly all machines at Google for several years, with only rare complaints of system interference.

[Figure 1. An overview of the Google-Wide Profiling (GWP) infrastructure. The whole system consists of collector, symbolizer, profile database, Web server, and other components.]

A central machine database manages all machines in the fleet and lists every machine's name and basic hardware characteristics. The GWP profile collector periodically gets a list of all machines from that database and selects a random sample of machines from that pool. The collector then remotely activates profiling on the selected machines and retrieves the results. It retrieves different types of sampled profiles sequentially or concurrently, depending on the machine and event type. For example, the collector might gather hardware performance counters for several seconds each, then move on to profiling for lock contention or memory allocation. It takes a few minutes to gather profiles for a specific machine.

For robustness, the GWP collector is a distributed service, which helps improve availability and reduce additional variation from the collector itself. To minimize distortion on the machines and the services running on them, the collector monitors error conditions and ceases profiling if the failure rate reaches a predefined threshold. Aside from the collector, we monitor all other GWP components to ensure an always-on service to users.

On top of the two-dimensional sampling approach, we apply several techniques to further reduce the overhead. First, we measure the event-based profiling overhead on a set of benchmark applications and then conservatively set the maximum rates to ensure the overhead is always less than a few percent. Second, we don't collect whole call stacks for the machine-wide profiles, to avoid the high overhead associated with unwinding (but we collect call stacks for most server profiles at lower sampling frequencies). Finally, we save the profile and metadata in their raw format and perform symbolization on a separate set of machines. As a result, the aggregated profiling overhead is negligible—less than 0.01 percent. At the same time, the derived profiles are still meaningful, as we show in the "Reliability analysis" section.

Profiles and profiling interfaces
GWP collects two categories of profiles: whole-machine and per-process. Whole-machine profiles capture all activities happening on the machine, including user applications, the kernel, kernel modules, daemons, and other background jobs. The whole-machine profiles include hardware performance monitoring (HPM) event profiles, kernel event traces, and power measurements.

Users without root access cannot directly invoke most of the whole-machine profiling systems, so we deploy lightweight daemons on every machine to let remote users (such as GWP collectors) access those profiles. The daemons act as gatekeepers to control access, enforce sampling rate limits, and collect system variables that must be synchronized with the profiles.
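The two-dimensional sampling loop described above can be sketched roughly as follows. This is a minimal illustration, not GWP's implementation; the event types, the sampled fraction, and the function names are hypothetical.

```python
import random

# Hypothetical event types and per-machine sampling durations (seconds);
# illustrative only, not GWP's actual configuration.
EVENT_TYPES = {
    "cpu_cycles": 5,
    "lock_contention": 5,
    "heap_profile": 5,
}

def select_machines(machine_database, fraction=0.01):
    """First dimension: at any moment, profile only a small random
    subset of all machines in the fleet."""
    pool = list(machine_database)
    k = max(1, int(len(pool) * fraction))
    return random.sample(pool, k)

def collect(machine_database, profile_machine, failure_threshold=0.1):
    """One collector round: remotely activate profiling on each selected
    machine, one event type at a time, and cease profiling if the
    failure rate reaches a predefined threshold."""
    failures, attempts = 0, 0
    profiles = []
    for machine in select_machines(machine_database):
        for event, duration in EVENT_TYPES.items():
            attempts += 1
            try:
                # Second dimension: event-based sampling on the machine
                # itself, for a few seconds per event type.
                profiles.append(profile_machine(machine, event, duration))
            except ConnectionError:
                failures += 1
                if failures / attempts >= failure_threshold:
                    return profiles  # back off for robustness
    return profiles
```

In this sketch, `profile_machine` stands in for the remote activation and retrieval step that, in GWP, goes through the per-machine daemons.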
We use OProfile (http://oprofile.sourceforge.net) to collect HPM event profiles. OProfile is a system-wide profiler that uses HPM to generate event-based samples for all running binaries at low overhead. To hide the heterogeneity of events between architectures, we define some generic HPM events on top of the platform-specific events, using an approach similar to PAPI.[2] The most commonly used generic events are CPU cycles, retired instructions, L1 and L2 cache misses, and branch mispredictions. We also provide access to some architecture-specific events. Although the aggregated profiles for those events are biased to specific architectures, they provide useful information for machine-specific scenarios.

In addition to whole-machine profiles, we collect various types of profiles from most applications running on a machine using the Google Performance Tools (http://code.google.com/p/google-perftools). Most applications include a common library that enables process-wide, stacktrace-attributed profiling mechanisms for heap allocation, lock contention, wall time and CPU time, and other performance metrics. The common library includes a simple HTTP server linked with handlers for each type of profiler. A handler accepts requests from remote users, activates profiling (if it's not already active), and then sends the profile data back.

The GWP collector learns from the cluster-wide job management system which applications are running on a machine and on which port each can be contacted for remote profiling invocation. Some programs lack remote profiling support, so the per-process profiling doesn't capture them. However, these are few; comparison with system-wide profiles shows that remote per-process profiling captures the vast majority of Google's programs. Most programs' users have found profiling useful or unobtrusive enough to leave it enabled.

Together with profiles, GWP collects other information about the target machine and applications. Some of the extra information is needed to postprocess the collected profiles, such as a unique identifier for each running binary that can be correlated across machines with unstripped versions for offline symbolization. The rest is mainly used to tag the profiles so that we can later correlate them with job, machine, or datacenter attributes.

Symbolization and binary storage
After collection, the Google File System (GFS) stores the profiles.[3] To provide meaningful information, the profiles must correlate to source code. However, to save network bandwidth and disk space, applications are usually deployed into data centers without any debug or symbolic information, which can make source correlation impossible. Furthermore, several applications, such as Java and QEMU, dynamically generate and execute code. The code is not available offline and can therefore no longer be symbolized. The symbolizer must also symbolize operating system kernels and kernel loadable modules.

Therefore, the symbolization process becomes surprisingly complicated, although it's usually trivial for single-machine profiling. Various strategies exist to obtain binaries with debug information. For example, we could try to recompile all sampled applications at specific source milestones. However, this is too resource-intensive and sometimes impossible for applications whose source isn't readily available. An alternative is to persistently store binaries that contain debug information before they're stripped.

Currently, GWP stores unstripped binaries in a global repository, which other services use to symbolize stack traces for automated failure reporting. Since the binaries are quite large and many unique binaries exist, symbolization for a single day of profiles would take weeks if run sequentially. To reduce the result latency, we distribute symbolization across a few hundred machines using MapReduce.[4]

Profile storage
Over the years, GWP has amassed several terabytes of historical performance data. GFS archives the entire performance logs and corresponding binaries. To make the data useful and accessible, we load the samples into a read-only dimensional database that is distributed across hundreds of machines. That service is accessible to all users for ad hoc queries and to systems for automated analyses.
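The MapReduce-style symbolization described above can be sketched in miniature: samples are keyed by each binary's unique identifier, so a binary's symbol table is loaded once per group rather than once per sample. The names and the sample schema here are illustrative assumptions, not GWP's actual interfaces.

```python
from collections import defaultdict

def map_phase(samples):
    """Map: emit (build_id, address) pairs, one per sampled stack frame."""
    for sample in samples:
        yield sample["build_id"], sample["address"]

def reduce_phase(pairs, load_symbol_table):
    """Reduce: group addresses by binary, then symbolize each group with
    that binary's symbol table (e.g., fetched from the unstripped-binary
    repository). Loading the table once per binary is what makes the
    distributed version tractable."""
    groups = defaultdict(list)
    for build_id, address in pairs:
        groups[build_id].append(address)
    symbolized = {}
    for build_id, addresses in groups.items():
        table = load_symbol_table(build_id)  # address -> function name
        symbolized[build_id] = [table.get(a, "<unknown>") for a in addresses]
    return symbolized
```

In a real MapReduce, the groups would be sharded across a few hundred workers; the single-process version above only shows the keying and grouping logic.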
The database supports a subset of SQL-like semantics. Although the dimensional database is well suited to perform queries that aggregate over the large data set, some individual queries can take tens of seconds to complete. Fortunately, most queries are seen frequently, so the profile server uses aggressive caching to hide the database latency.
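The query-plus-caching pattern described above might look roughly like this. The schema, table, and column names are invented for illustration; the article only states that the database supports a subset of SQL-like semantics.

```python
# A hypothetical ad hoc query for the hottest functions over a week.
QUERY = """
SELECT function, SUM(samples) AS cycles
FROM profiles
WHERE event = 'cpu_cycles'
  AND day BETWEEN '2010-07-01' AND '2010-07-07'
GROUP BY function
ORDER BY cycles DESC
LIMIT 20
"""

_QUERY_CACHE = {}

def hottest_functions(execute, query=QUERY):
    """Run the query through the profile server's cache: the same
    queries are seen frequently and can take tens of seconds, so a
    cached result hides the database latency."""
    if query not in _QUERY_CACHE:
        _QUERY_CACHE[query] = execute(query)
    return _QUERY_CACHE[query]
```

Here `execute` stands in for whatever client actually submits the query to the distributed dimensional database.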

User interfaces
For most users, GWP deploys a webserver to provide a user interface on top of the profile database. This makes it easy to access profile data and construct ad hoc queries for the traditional use of application profiles (with additional freedom to filter, group, and aggregate profiles differently).
Query view. Several visual interfaces retrieve information from the profile database, and all are navigable from a web browser. The primary interface (see Figure 2) displays the result entries, such as functions or executables, that match the desired query parameters. This page supplies links that let users refine the query to more specific data. For example, the user can restrict the query to only report samples for a specific executable collected within a desired time period. Additionally, the user can modify or refine any of the parameters to the current query to create a custom profile view. The GWP homepage has links to display the top results, Google-wide, for each performance metric.
[Figure 2. An example query view: an application-level profile (a) and a function-level profile (b).]

Call graph view. For most server profile samples, the profilers collect full call stacks with each sample. Call stacks are aggregated to produce complete dynamic call graphs for a given profile. Figure 3 shows an example call graph. Each node displays the function name and its percentage of samples, and the nodes are shaded based on this percentage. The call graph is also displayed through the web browser, via a Graphviz plug-in.

Source annotation. The query and call graph views are useful in directing users to specific functions of interest. From there, GWP provides a source annotation view that presents the original source file with a header describing overall profile information about the file and a histogram bar showing the relative hotness of each source file line. Because different versions of each file can exist in source repositories and branches, we retrieve a hash signature from the repository for each source file and aggregate samples on files with identical signatures.

Profile data API. In addition to the webserver, we offer a data-access API to read profiles directly from the database. It's more suitable for automated analyses that must process a large amount of profile
                                                                                                JULY/AUGUST 2010                          69
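The signature-based merging described under "Source annotation" can be sketched as follows. This is an illustrative assumption, not GWP's actual implementation: the flat `(source_text, line, count)` record layout and the use of SHA-1 as the signature stand in for the article's repository hash.

```python
import hashlib
from collections import defaultdict

def aggregate_by_signature(samples):
    """Merge line-level samples keyed on (file signature, line) so that
    byte-identical copies of a file, e.g. from different branches,
    aggregate together.

    `samples` is a list of (source_text, line_number, count) tuples;
    this record layout is a hypothetical stand-in for GWP's formats.
    """
    merged = defaultdict(int)
    for source_text, line, count in samples:
        # A content hash stands in for the repository's signature:
        # identical file versions share a signature and thus a bucket.
        signature = hashlib.sha1(source_text.encode()).hexdigest()
        merged[(signature, line)] += count
    return dict(merged)

main_v1 = "int main() { return 0; }\n"
main_v2 = "int main() { return 1; }\n"
merged = aggregate_by_signature(
    [(main_v1, 1, 5), (main_v1, 1, 7), (main_v2, 1, 3)])
print(sorted(merged.values()))  # [3, 12]: identical versions merged
```

Samples from the two identical copies of `main_v1` land in one bucket, while the edited `main_v2` keeps its own, which is exactly the behavior the annotation view needs across branches.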
DATACENTER COMPUTING

[Figure 3. An example dynamic call graph. Function names are intentionally blurred.]

We store both raw profiles and symbolized profiles in ProtocolBuffer format (http://code.google.com/apis/protocolbuffers). Advanced users can access and reprocess them using their preferred programming language.

Application-specific profiling
Although the default sampling rate is high enough to derive top-level profiles with high confidence, GWP might not collect enough samples for applications that consume relatively few cycles Google-wide. Increasing the overall sampling rate to cover those profiles would be too expensive, because they're usually sparse.
Therefore, we provide an extension to GWP for application-specific profiling on the cloud. The machine pool for application-specific profiling is usually much smaller than GWP's, so we can achieve a high sampling rate on those machines for the specific application. Several application teams at Google use application-specific profiling to continuously monitor their applications running on the fleet.
Application-specific profiling is generic and can target any specific set of machines. For example, we can use it to profile a set of machines deployed with the newest kernel version. We can also limit the profiling duration to a small time period, such as the application's running time. This is useful for batch jobs running on data centers, such as MapReduce, because it facilitates collecting, aggregating, and exploring profiles from hundreds or thousands of workers.

Reliability analysis
To conduct continuous profiling on datacenter machines serving real traffic, extremely low overhead is paramount, so we sample in both the time and machine dimensions. Sampling introduces variation, so we must measure and understand how sampling affects the profiles' quality. But the nature of datacenter workloads makes this difficult: their behavior is continually changing, and there is no direct way to measure how representative the datacenter applications' profiles are. Instead, we use two indirect methods to evaluate their soundness. First, we study the stability of the aggregated profiles themselves using several different metrics. Second, we correlate the profiles with performance data from other sources to cross-validate both.

Stability of profiles
We use a single metric, entropy, to measure a given profile's variation. In short, entropy is a measure of the uncertainty associated with a random variable, which in this case is the profile samples. The entropy H of a profile W is defined as

    H(W) = - sum_{i=1..n} p(x_i) log(p(x_i))

where n is the total number of entries in the profile and p(x_i) is the fraction of profile samples on the entry x_i.5
In general, a high entropy implies a flat profile with many samples. A low entropy usually results from a small number of samples or from most samples being concentrated in a few entries. We're not concerned with entropy itself. Because entropy is like a signature of a profile, measuring its inherent variation, it should be stable between representative profiles.
                          70                      IEEE MICRO
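The entropy metric above can be sketched in a few lines of Python. The base-2 logarithm and the dictionary representation of a profile are assumptions for illustration; the article does not specify either.

```python
import math

def profile_entropy(samples):
    """Entropy H = -sum(p_i * log2(p_i)) of a profile, where `samples`
    maps an entry (e.g., an application or function) to its sample count."""
    total = sum(samples.values())
    if total == 0:
        return 0.0
    h = 0.0
    for count in samples.values():
        if count > 0:
            p = count / total  # fraction of samples on this entry
            h -= p * math.log2(p)
    return h

# A flat profile has high entropy; a concentrated one has low entropy.
flat = {"app_a": 100, "app_b": 100, "app_c": 100, "app_d": 100}
skewed = {"app_a": 397, "app_b": 1, "app_c": 1, "app_d": 1}
print(profile_entropy(flat))    # 2.0
print(profile_entropy(skewed))  # ~0.08
```

The flat four-entry profile reaches the maximum entropy of 2 bits, while the profile with nearly all samples on one entry falls close to zero, matching the intuition described in the text.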
[Figure 4. The number of samples and the entropy of daily application-level profiles. The primary y-axis (bars) is the total number of profile samples. The secondary y-axis (line) is the entropy of the daily application-level profile.]

Entropy doesn't account for differences between entry names. For example, a function profile with x percent of samples on foo and y percent on bar has the same entropy as a profile with y percent on foo and x percent on bar. So, when we need to identify changes in the same entries between profiles, we calculate the Manhattan distance of two profiles by adding the absolute percentage differences of the top k entries, defined as

    M(X, Y) = sum_{i=1..k} |p_X(x_i) - p_Y(x_i)|

where X and Y are two profiles, k is the number of top entries to count, and p_Y(x_i) is 0 when x_i is not in Y. Essentially, the Manhattan distance is a simplified version of the relative entropy between two profiles.

Profiles' entropy. First, we compare the entropies of application-level profiles, in which samples are broken down by individual application. Figure 2a shows an example of such an application-level profile.
Figure 4 shows the daily application-level profiles' entropies for a series of dates, together with the total number of profile samples collected for each date. Unless otherwise specified, we used CPU cycles in this study, and our conclusions also apply to the other profile types. As the graph shows, the entropy of the daily application-level profiles is stable between dates, and it usually falls into a small interval. The correlation between the number of samples and the profile's entropy is loose: once the number of samples reaches some threshold, additional samples don't necessarily lower the entropy, partly because GWP sometimes samples more machines than necessary for daily application-level profiles. This is because users frequently must drill down to specific profiles with additional filters on certain tags, such as application names, which cover only a small fraction of all profiles collected.
We can conduct a similar analysis on an application's function-level profile (for example, the application in Figure 2b). The result, shown in Figure 5a, is from an application whose workload is fairly stable when aggregated over many clients; its entropy is accordingly even more stable. It's also interesting to analyze how entropy varies among machines for an application's function-level profiles. Unlike the profiles aggregated across machines, an application's per-machine profiles can vary greatly in the number of samples.
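The Manhattan distance defined above can be sketched directly. One detail is an assumption here: the article doesn't specify which entries count as the "top k" when the two profiles disagree, so this sketch takes the top k entries of the first profile, with absent entries contributing zero as in the definition.

```python
def manhattan_distance(x, y, k=20):
    """Manhattan distance M(X, Y) = sum over the top-k entries of
    |p_X(x_i) - p_Y(x_i)|, with p_Y(x_i) = 0 when x_i is absent from Y.

    `x` and `y` map entry names to the percentage of samples on that
    entry. Ranking by `x`'s percentages is an illustrative assumption.
    """
    top_k = sorted(x, key=x.get, reverse=True)[:k]
    return sum(abs(x[entry] - y.get(entry, 0.0)) for entry in top_k)

x = {"foo": 60.0, "bar": 30.0, "baz": 10.0}
y = {"foo": 50.0, "bar": 40.0, "qux": 10.0}
print(manhattan_distance(x, y, k=2))  # |60-50| + |30-40| = 20.0
```

Identical profiles yield a distance of 0, and, unlike entropy, swapping the percentages of foo and bar between two profiles produces a nonzero distance, which is exactly the name-sensitivity the metric adds.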
[Figure 5. Function-level profiles. The number of samples and the entropy for a single application (a). The correlation between the number of samples and the entropy for all per-machine profiles (b).]
Figure 5b plots the relationship between the number of samples per machine and the entropy of the function-level profiles. As expected, when the total number of samples is small, the profile's entropy is also small (limited by the maximum possible uncertainty). But once the number of samples reaches a threshold, the entropy becomes stable. We can observe two clusters in the graph: some entropies are concentrated between 5.5 and 6, and the others fall between 4.5 and 5. The application's two behavioral states explain the two clusters. We've seen various clustering patterns for different applications.

The Manhattan distance between profiles. We use the Manhattan distance to study the variation between profiles while accounting for changes in entry names, where a smaller distance implies less variation. Figure 6a illustrates the Manhattan distance between the daily application-level profiles for a series of dates. The results from the Manhattan distances for both application-level and function-level profiles are similar to the results with entropy.
In Figure 6b, we plot the Manhattan distance for several profile types, leading to two observations:

- In general, memory and thread profiles have smaller distances, and their variations appear less correlated with the other profiles.
- Server CPU time profiles correlate with HPM profiles of cycles and instructions in terms of variations, which could imply that those variations arise naturally from external causes, such as workload changes.

To further understand the correlation between the Manhattan distance and the number of samples, we randomly pick a subset of machines from a specific machine set and then compute the Manhattan distance of the selected subset's profile against the whole set's profile. A power-function trend line captures the change in the Manhattan distance over the number of samples; it roughly approximates a square-root relationship between the distance and the number of samples,

    M(X) = C / sqrt(N(X))

where N(X) is the total number of samples in a profile and C is a constant that depends on the profile type.

Derived metrics. We can also indirectly evaluate the profiles' stability by computing derived metrics from multiple profiles. For example, we can derive cycles per instruction (CPI) from HPM profiles containing cycles and retired instructions. Figure 7 shows that the derived CPI is stable across dates. Not surprisingly, the daily aggregated profiles' CPI falls into a small interval, between 1.7 and 1.8, for those days.

Comparing with other sources
Beyond measuring the profiles' stability across dates, we also cross-validate the profiles against performance and utilization data from other sources at Google. One example is the utilization data that the data center's monitoring system collects. Unlike GWP, the monitoring system collects data from all machines in the data center, but at a coarser granularity, such as overall CPU utilization. Its CPU utilization data, in terms of core-seconds, matches the measurement from GWP's CPU cycles profile via the following formula:

    CoreSeconds = Cycles * SamplingRate_machine * SamplingPeriod / CPUFrequency_average

Profile uses
In each profile, GWP records samples of interesting events and a vector of associated information. GWP collects roughly a dozen events, such as CPU cycles, retired instructions, L1 and L2 cache misses, branch mispredictions, heap memory allocations, and lock contention time. The sample definition varies with the event type: it can be CPU cycles or cache misses, bytes allocated, or the sampled thread's locking time. Note that the sample must be numeric and capable of aggregation. The associated vector contains information such as application name, function name,
[3B2-14]       mmi2010040065.3d              30/7/010                                19:43        Page 74




                          ...............................................................................................................................................................................................
                          DATACENTER COMPUTING



[Figure 6. The Manhattan distance between daily application-level profiles for various profile types (a). The correlation between the number of samples and the Manhattan distance of profiles (b). Axes: number of samples (millions) versus Manhattan distance; profile types: Cycles, Instr, L1_Miss, L2_Miss, CPU, Heap, Threads; measured on random dates.]
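Figure 6's stability metric can be made concrete with a short sketch (hypothetical function names and sample data, not GWP's actual code): normalize each day's profile so its samples sum to 1, then take the Manhattan (L1) distance between the two days' vectors; a small distance means the profile is stable from day to day.

```python
def normalize(samples):
    """Convert raw per-function sample counts into fractions of the total."""
    total = sum(samples.values())
    return {name: count / total for name, count in samples.items()}

def manhattan_distance(profile_a, profile_b):
    """L1 distance between two normalized profiles keyed by function name."""
    names = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(n, 0.0) - profile_b.get(n, 0.0)) for n in names)

# Hypothetical per-function cycle samples from two different days.
day1 = normalize({"zlib_deflate": 400, "memcpy": 350, "parse_query": 250})
day2 = normalize({"zlib_deflate": 420, "memcpy": 330, "parse_query": 250})

print(manhattan_distance(day1, day2))  # about 0.04: the profile barely moved
```

Normalizing first is what makes profiles from days with different sample volumes comparable, which is why the distances in Figure 6 shrink as the number of samples grows.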



[Figure 7. The correlation between the number of samples and derived cycles per instruction (CPI). Axes: number of samples (millions) versus CPI; measured on random dates.]
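Figure 7's derived CPI, together with the record-and-aggregation model described in the paragraphs that follow, can be sketched roughly as follows (hypothetical field names and sample data, not GWP's schema): each record carries an event type, a sample count, and a vector of attributes; grouping cycle and instruction samples by a chosen key and dividing the two totals yields a per-key CPI.

```python
from collections import defaultdict

# Each record: (event, sample_count, attributes). The attribute dict stands in
# for GWP's m-dimension vector (platform, data center, build revision, ...).
records = [
    ("cycles", 1200, {"app": "websearch", "platform": "P1"}),
    ("instructions", 1000, {"app": "websearch", "platform": "P1"}),
    ("cycles", 600, {"app": "gmail", "platform": "P2"}),
    ("instructions", 800, {"app": "gmail", "platform": "P2"}),
]

def aggregate(records, keys, restrict=None):
    """Filter on non-key dimensions, then project samples onto the key dimensions."""
    restrict = restrict or {}
    totals = defaultdict(lambda: defaultdict(int))
    for event, count, attrs in records:
        if all(attrs.get(dim) == val for dim, val in restrict.items()):
            group = tuple(attrs[k] for k in keys)
            totals[group][event] += count
    return totals

# Derived CPI per application: cycle samples divided by instruction samples.
by_app = aggregate(records, keys=["app"])
cpi = {g: t["cycles"] / t["instructions"] for g, t in by_app.items()}
print(cpi)  # {('websearch',): 1.2, ('gmail',): 0.75}
```

One caveat: a real derived CPI must also account for each event's sampling period; the sketch assumes both events are sampled at the same rate.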




platform, compiler version, image name, data center, kernel information, build revision, and builder's name. Assuming that the vector contains m elements, we can represent a record GWP collected as a tuple ⟨event, sample counter, m-dimension vector⟩.

    When aggregating, GWP lets users choose k keys from the m dimensions and groups the samples by those keys. Basically, it filters the samples by imposing one or more restrictions on the rest of the dimensions (m − k) and then projects the samples onto the k key dimensions. GWP finally displays the sorted results to users, delivering answers to various performance queries with high confidence. Although not every query makes sense in practice, even a small subset of them is demonstrably informative in identifying performance issues and providing insights into computing resources in the cloud.

Cloud applications' performance
    GWP profiles provide performance insights for cloud applications. Users can see how cloud applications are actually consuming machine resources and how the picture evolves over time. For example, Figure 2 shows cycles distributed over the top executables and functions, which is useful for many aspects of designing, building, maintaining, and operating data centers. Infrastructure teams can see the big picture of how their software stacks are being used, aggregated across every application. This helps identify performance-critical components, create representative benchmark suites, and prioritize performance efforts.

    At the same time, an application team can use GWP as the first stop for the application's profiles. As an always-on profiling service, GWP collects a representative sample of the application's running instances over time. Application developers are often surprised by their application's profiles when browsing GWP results. For example, Google's speech-recognition team quickly optimized a previously unknown hot function that they couldn't have easily located without the aggregated GWP results. Application teams also use GWP profiles to design, evaluate, and calibrate their load tests.

Finding the hottest shared code. Shared code is remarkably abundant. Profiling each program independently might not identify hot shared code if it's not hot in any single application, but GWP can identify routines that don't account for a significant portion
36575

Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt, Google
0272-1732/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

As cloud-based computing grows in pervasiveness and scale, understanding datacenter applications' performance and utilization characteristics is critically important, because even minor performance improvements translate into huge cost savings. Traditional performance analysis, which typically needs to isolate benchmarks, can be too complicated or even impossible with modern datacenter applications. It's easier and more representative to monitor datacenter applications running on live traffic. However, application owners won't tolerate latency degradations of more than a few percent, so these tools must be nonintrusive and have minimal overhead. As with all profiling tools, observer distortion must be minimized to enable meaningful analysis. (For additional information on related techniques, see the "Profiling: From single systems to data centers" sidebar.)

Sampling-based tools can bring overhead and distortion to acceptable levels, so they're uniquely qualified for performance monitoring in the data center. Traditionally, sampling tools run on a single machine, monitoring specific processes or the system as a whole. Profiling can begin on demand to analyze a performance problem, or it can run continuously.1

Google-Wide Profiling can be theoretically viewed as an extension of the Digital Continuous Profiling Infrastructure (DCPI)1 to data centers. GWP is a continuous profiling infrastructure; it samples across machines in multiple data centers and collects various events (such as stack traces, hardware events, lock contention profiles, heap profiles, and kernel events), allowing cross-correlation with job scheduling data, application-specific data, and other information from the data centers.

GWP collects daily profiles from several thousand applications running on thousands of servers, and the compressed profile database grows by several Gbytes every day.

DATACENTER COMPUTING

Sidebar: Profiling: From single systems to data centers

From gprof1 to Intel VTune (http://software.intel.com/en-us/intel-vtune), profiling has long been standard in the development process. Most profiling tools focus on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important.

Continuous, always-on profiling became more important with Morph and DCPI for Digital UNIX. Morph uses low-overhead, system-wide sampling (approximately 0.3 percent) and binary rewriting to continuously evolve programs to adapt to their host architectures.2 DCPI gathers much more robust profiles, such as precise stall times and causes, and focuses on reporting information to users.3

OProfile, a DCPI-inspired tool, collects and reports data much in the same way for a plethora of architectures, though it offers less sophisticated analysis. GWP uses OProfile (http://oprofile.sourceforge.net) as a profile source. High-performance computing (HPC) shares many profiling challenges with cloud computing, because profilers must gather representative samples with minimal overhead over thousands of nodes. Cloud computing has additional challenges: vastly disparate applications, workloads, and machine configurations. As a result, different profiling strategies exist for HPC and cloud computing. HPCToolkit uses lightweight sampling to collect call stacks from optimized programs and compares profiles to identify scaling issues as parallelism increases.4 Open|SpeedShop (www.openspeedshop.org) provides similar functionality. Similarly, Upshot5 and Jumpshot6 can analyze traces (such as MPI calls) from parallel programs but aren't suitable for continuous profiling.

References
1. S.L. Graham, P.B. Kessler, and M.K. McKusick, "Gprof: A Call Graph Execution Profiler," Proc. Symp. Compiler Construction (CC 82), ACM Press, 1982, pp. 120-126.
2. X. Zhang et al., "System Support for Automatic Profiling and Optimization," Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 15-26.
3. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
4. N.R. Tallent et al., "Diagnosing Performance Bottlenecks in Emerging Petascale Applications," Proc. Conf. High Performance Computing Networking, Storage and Analysis (SC 09), ACM Press, 2009, no. 51.
5. V. Herrarte and E. Lusk, Studying Parallel Program Behavior with Upshot, tech. report, Argonne National Laboratory, 1991.
6. O. Zaki et al., "Toward Scalable Performance Visualization with Jumpshot," Int'l J. High Performance Computing Applications, vol. 13, no. 3, 1999, pp. 277-288.

Profiling at this scale presents significant challenges that don't exist for a single machine. Verifying that the profiles are correct is important and challenging because the workloads are dynamic. Managing profiling overhead becomes far more important as well, as any unnecessary profiling overhead can cost millions of dollars in additional resources. Finally, making the profile data universally accessible is an additional challenge. GWP is also a cloud application, with its own scalability and performance issues.

With this volume of data, we can answer typical performance questions about datacenter applications, including the following:

- What are the hottest processes, routines, or code regions?
- How does performance differ across software versions?
- Which locks are most contended?
- Which processes are memory hogs?
- Does a particular memory allocation scheme benefit a particular class of applications?
- What is the cycles per instruction (CPI) for applications across platforms?

Additionally, we can derive higher-level data to answer more complex but interesting questions, such as which compilers were used for applications in the fleet, whether there are more 32-bit or 64-bit applications running, and how much utilization is being lost by suboptimal job scheduling.

Infrastructure
Figure 1 provides an overview of the entire GWP system.

Collector
GWP samples in two dimensions. At any moment, profiling occurs only on a small subset of all machines in the fleet, and event-based sampling is used at the machine level.
Sampling in only one dimension would be unsatisfactory; if event-based profiling were active on every machine all the time, at a normal event-sampling rate, we would be using too many resources across the fleet. Alternatively, if the event-sampling rate is too low, profiles become too sparse to drill down to the individual machine level. For each event type, we choose a sampling rate high enough to provide meaningful machine-level data while still minimizing the distortion caused by the profiling on critical applications. The system has been actively profiling nearly all machines at Google for several years with only rare complaints of system interference.

Figure 1. An overview of the Google-Wide Profiling (GWP) infrastructure. The whole system consists of collector, symbolizer, profile database, Web server, and other components.

A central machine database manages all machines in the fleet and lists every machine's name and basic hardware characteristics. The GWP profile collector periodically gets a list of all machines from that database and selects a random sample of machines from that pool. The collector then remotely activates profiling on the selected machines and retrieves the results. It retrieves different types of sampled profiles sequentially or concurrently, depending on the machine and event type. For example, the collector might gather hardware performance counters for several seconds each, then move on to profiling for lock contention or memory allocation. It takes a few minutes to gather profiles for a specific machine.

For robustness, the GWP collector is a distributed service, which helps improve availability and reduce additional variation from the collector itself. To minimize distortion on the machines and the services running on them, the collector monitors error conditions and ceases profiling if the failure rate reaches a predefined threshold. Aside from the collector, we monitor all other GWP components to ensure an always-on service to users.

On top of the two-dimensional sampling approach, we apply several techniques to further reduce the overhead. First, we measure the event-based profiling overhead on a set of benchmark applications and then conservatively set the maximum rates to ensure the overhead is always less than a few percent. Second, we don't collect whole call stacks for the machine-wide profiles to avoid the high overhead associated with unwinding (but we collect call stacks for most server profiles at lower sampling frequencies). Finally, we save the profile and metadata in their raw format and perform symbolization on a separate set of machines. As a result, the aggregated profiling overhead is negligible (less than 0.01 percent). At the same time, the derived profiles are still meaningful, as we show in the "Reliability analysis" section.

Profiles and profiling interfaces
GWP collects two categories of profiles: whole-machine and per-process. Whole-machine profiles capture all activities happening on the machine, including user applications, the kernel, kernel modules, daemons, and other background jobs. The whole-machine profiles include hardware performance monitoring (HPM) event profiles, kernel event traces, and power measurements.

Users without root access cannot directly invoke most of the whole-machine profiling systems, so we deploy lightweight daemons on every machine to let remote users (such as GWP collectors) access those profiles. The daemons act as gatekeepers to control access, enforce sampling rate limits, and collect system variables that must be synchronized with the profiles.
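The two-dimensional sampling scheme can be sketched in a few lines. This is an illustrative sketch, not GWP's implementation; the event list, fleet fraction, and scheduling policy are invented for the example:

```python
import random

# Hypothetical profile types; the real GWP event set and rates are
# described in the text, not encoded here.
EVENTS = ["cycles", "instructions", "l2_misses", "lock_contention"]

def plan_collection(machines, fraction=0.01, seed=None):
    """Pick a small random subset of the fleet (first sampling dimension)
    and schedule each profile type on it in turn (second dimension)."""
    rng = random.Random(seed)
    subset_size = max(1, int(len(machines) * fraction))
    subset = rng.sample(machines, subset_size)
    # Rotate through the event types sequentially for each chosen machine.
    return [(m, e) for m in subset for e in EVENTS]
```

With a 1,000-machine pool and fraction=0.01, the plan covers 10 machines and 40 (machine, event) collection slots; event-level rate limiting would then be enforced by the per-machine daemons.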
We use OProfile (http://oprofile.sourceforge.net) to collect HPM event profiles. OProfile is a system-wide profiler that uses HPM to generate event-based samples for all running binaries at low overhead. To hide the heterogeneity of events between architectures, we define some generic HPM events on top of the platform-specific events, using an approach similar to PAPI.2 The most commonly used generic events are CPU cycles, retired instructions, L1 and L2 cache misses, and branch mispredictions. We also provide access to some architecture-specific events. Although the aggregated profiles for those events are biased to specific architectures, they provide useful information for machine-specific scenarios.

In addition to whole-machine profiles, we collect various types of profiles from most applications running on a machine using the Google Performance Tools (http://code.google.com/p/google-perftools). Most applications include a common library that enables process-wide, stacktrace-attributed profiling mechanisms for heap allocation, lock contention, wall time and CPU time, and other performance metrics. The common library includes a simple HTTP server linked with handlers for each type of profiler. A handler accepts requests from remote users, activates profiling (if it's not already active), and then sends the profile data back. The GWP collector learns from the cluster-wide job management system which applications are running on a machine and on which port each can be contacted for remote profiling invocation. Some programs lack remote profiling support, and the per-process profiling doesn't capture them. However, these are few; comparison with system-wide profiles shows that remote per-process profiling captures the vast majority of Google's programs. Most programs' users have found profiling useful or unobtrusive enough to leave it enabled.

Together with profiles, GWP collects other information about the target machine and applications. Some of the extra information is needed to postprocess the collected profiles, such as a unique identifier for each running binary that can be correlated across machines with unstripped versions for offline symbolization. The rest is mainly used to tag the profiles so that we can later correlate the profiles with job, machine, or datacenter attributes.

Symbolization and binary storage
After collection, the Google File System (GFS) stores the profiles.3 To provide meaningful information, the profiles must correlate to source code. However, to save network bandwidth and disk space, applications are usually deployed into data centers without any debug or symbolic information, which can make source correlation impossible. Furthermore, several applications, such as Java and QEMU, dynamically generate and execute code. The code is not available offline and can therefore no longer be symbolized. The symbolizer must also symbolize operating system kernels and kernel loadable modules.

Therefore, the symbolization process becomes surprisingly complicated, although it's usually trivial for single-machine profiling. Various strategies exist to obtain binaries with debug information. For example, we could try to recompile all sampled applications at specific source milestones. However, that is too resource-intensive and sometimes impossible for applications whose source isn't readily available. An alternative is to persistently store binaries that contain debug information before they're stripped. Currently, GWP stores unstripped binaries in a global repository, which other services use to symbolize stack traces for automated failure reporting. Since the binaries are quite large and many unique binaries exist, symbolization for a single day of profiles would take weeks if run sequentially. To reduce the result latency, we distribute symbolization across a few hundred machines using MapReduce.4

Profile storage
Over the past years, GWP has amassed several terabytes of historical performance data. GFS archives the entire performance logs and corresponding binaries. To make the data useful and accessible, we load the samples into a read-only dimensional database that is distributed across hundreds of machines. That service is accessible to all users for ad hoc queries and to systems for automated analyses.
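Conceptually, the symbolization step above maps each sampled program counter to a symbol from the matching unstripped binary. A minimal sketch, assuming a symbol table of (start_address, name) pairs sorted by address (the format is hypothetical, not GWP's):

```python
import bisect

def symbolize(samples, symbol_table):
    """Attribute raw PC samples (dict: address -> count) to function names.
    symbol_table is a list of (start_address, name) sorted by start address,
    as might be extracted from an unstripped binary."""
    starts = [addr for addr, _ in symbol_table]
    out = {}
    for pc, count in samples.items():
        # Find the last symbol whose start address is <= pc.
        i = bisect.bisect_right(starts, pc) - 1
        name = symbol_table[i][1] if i >= 0 else "<unknown>"
        out[name] = out.get(name, 0) + count
    return out
```

In a MapReduce setting, each map task would run this over one binary's samples, which is what makes the day-long workload embarrassingly parallel.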
The database supports a subset of SQL-like semantics. Although the dimensional database is well suited to perform queries that aggregate over the large data set, some individual queries can take tens of seconds to complete. Fortunately, most queries are seen frequently, so the profile server uses aggressive caching to hide the database latency.

User interfaces
For most users, GWP deploys a webserver to provide a user interface on top of the profile database. This makes it easy to access profile data and construct ad hoc queries for the traditional use of application profiles (with additional freedom to filter, group, and aggregate profiles differently). Several visual interfaces retrieve information from the profile database, and all are navigable from a web browser.

Query view. The primary interface (see Figure 2) displays the result entries, such as functions or executables, that match the desired query parameters. This page supplies links that let users refine the query to more specific data. For example, the user can restrict the query to only report samples for a specific executable collected within a desired time period. Additionally, the user can modify or refine any of the parameters to the current query to create a custom profile view. The GWP homepage has links to display the top results, Google-wide, for each performance metric.

Figure 2. An example query view: an application-level profile (a) and a function-level profile (b).

Call graph view. For most server profile samples, the profilers collect full call stacks with each sample. Call stacks are aggregated to produce complete dynamic call graphs for a given profile. Figure 3 shows an example call graph. Each node displays the function name and its percentage of samples, and the nodes are shaded based on this percentage. The call graph is also displayed through the web browser, via a Graphviz plug-in.

Source annotation. The query and call graph views are useful in directing users to specific functions of interest. From there, GWP provides a source annotation view that presents the original source file with a header describing overall profile information about the file and a histogram bar showing the relative hotness of each source file line. Because different versions of each file can exist in source repositories and branches, we retrieve a hash signature from the repository for each source file and aggregate samples on files with identical signatures.

Profile data API. In addition to the webserver, we offer a data-access API to read profiles directly from the database. It's more suitable for automated analyses that must process a large amount of profile data (such as reliability studies) offline.
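The filter/group/aggregate pattern these interfaces expose can be sketched as follows; the record layout and dimension names are invented for illustration:

```python
def query(records, group_by, **filters):
    """Aggregate sample counts grouped by the chosen keys, after
    restricting on any other dimensions. Each record is a dict of
    dimension values plus a numeric 'samples' count."""
    result = {}
    for rec in records:
        if all(rec.get(k) == v for k, v in filters.items()):
            key = tuple(rec[k] for k in group_by)
            result[key] = result.get(key, 0) + rec["samples"]
    # Hottest-first ordering, as the result page displays entries.
    return sorted(result.items(), key=lambda kv: -kv[1])
```

For example, grouping by function while filtering on a single executable mirrors the drill-down links in the query view.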
We store both raw profiles and symbolized profiles in ProtocolBuffer formats (http://code.google.com/apis/protocolbuffers). Advanced users can access and reprocess them using their preferred programming language.

Application-specific profiling
Although the default sampling rate is high enough to derive top-level profiles with high confidence, GWP might not collect enough samples for applications that consume relatively few cycles Google-wide. Increasing the overall sampling rate to cover those profiles is too expensive because they're usually sparse. Therefore, we provide an extension to GWP for application-specific profiling on the cloud. The machine pool for application-specific profiling is usually much smaller than GWP's, so we can achieve a high sampling rate on those machines for the specific application. Several application teams at Google use application-specific profiling to continuously monitor their applications running on the fleet.

Application-specific profiling is generic and can target any specific set of machines. For example, we can use it to profile a set of machines deployed with the newest kernel version. We can also limit the profiling duration to a small time period, such as the application's running time. It's useful for batch jobs running on data centers, such as MapReduce, because it facilitates collecting, aggregating, and exploring profiles collected from hundreds or thousands of their workers.

Reliability analysis
To conduct continuous profiling on datacenter machines serving real traffic, extremely low overhead is paramount, so we sample in both time and machine dimensions. Sampling introduces variation, so we must measure and understand how sampling affects the profiles' quality. But the nature of datacenter workloads makes this difficult; their behavior is continually changing. There's no direct way to measure the datacenter applications' profiles' representativeness. Instead, we use two indirect methods to evaluate their soundness. First, we study the stability of the aggregated profiles themselves using several different metrics. Second, we correlate the profiles with performance data from other sources to cross-validate both.

Figure 3. An example dynamic call graph. Function names are intentionally blurred.

Stability of profiles
We use a single metric, entropy, to measure a given profile's variation. In short, entropy is a measure of the uncertainty associated with a random variable, which in this case is profile samples. The entropy H of a profile is defined as

    H(W) = -sum_{i=1..n} p(x_i) * log(p(x_i))

where n is the total number of entries in the profile and p(x_i) is the fraction of profile samples on the entry x_i.5

In general, a high entropy implies a flat profile with many samples. A low entropy usually results from a small number of samples or from most samples being concentrated in a few entries. We're not concerned with the entropy value itself; because entropy is like a signature of a profile, measuring its inherent variation, it should be stable between representative profiles.

Entropy doesn't account for differences between entry names. For example, a function profile with x percent on foo and y percent on bar has the same entropy as a profile with y percent on foo and x percent on bar. So, when we need to identify changes on the same entries between profiles, we calculate the Manhattan distance of two profiles by adding the absolute percentage differences between the top k entries, defined as

    M(X, Y) = sum_{i=1..k} |p_x(x_i) - p_y(x_i)|

where X and Y are two profiles, k is the number of top entries to count, and p_y(x_i) is 0 when x_i is not in Y. Essentially, the Manhattan distance is a simplified version of relative entropy between two profiles.

Profiles' entropy. First, we compare the entropies of application-level profiles, where samples are broken down on individual applications. Figure 2a shows an example of such an application-level profile. Figure 4 shows daily application-level profiles' entropies for a series of dates, together with the total number of profile samples collected for each date. Unless specified, we used CPU cycles in the study, and our conclusions also apply to the other types. As the graph shows, the entropy of daily application-level profiles is stable between dates, and it usually falls into a small interval. The correlation between the number of samples and the profile's entropy is loose. Once the number of samples reaches some threshold, additional samples don't necessarily lead to a lower entropy, partly because GWP sometimes samples more machines than necessary for daily application-level profiles. This is because users frequently must drill down to specific profiles with additional filters on certain tags, such as application names, which are a small fraction of all profiles collected.

Figure 4. The number of samples and the entropy of daily application-level profiles. The primary y-axis (bars) is the total number of profile samples. The secondary y-axis (line) is the entropy of the daily application-level profile.

We can conduct a similar analysis on an application's function-level profile (for example, the application in Figure 2b). The result, shown in Figure 5a, is from an application whose workload is fairly stable when aggregating from many clients. Its entropy is actually more stable. It's interesting to analyze how entropy changes among machines for an application's function-level profiles. Unlike the aggregated profiles across machines, an application's per-machine profiles can vary greatly in terms of number of samples.
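Both stability metrics follow directly from their definitions. A minimal sketch, assuming profiles are maps from entry name to sample count and using the natural logarithm (the log base is not specified in the text):

```python
import math

def entropy(profile):
    """H(W) = -sum(p_i * log(p_i)) over the sample fractions of a profile,
    given as a dict mapping entry name -> sample count."""
    total = sum(profile.values())
    ps = [c / total for c in profile.values() if c > 0]
    return -sum(p * math.log(p) for p in ps)

def manhattan(x, y, k=10):
    """Sum of absolute differences in sample fraction over the top-k
    entries of profile x; entries missing from y contribute 0 on y's side."""
    tx, ty = sum(x.values()), sum(y.values())
    top = sorted(x, key=x.get, reverse=True)[:k]
    return sum(abs(x[e] / tx - y.get(e, 0) / ty) for e in top)
```

A perfectly flat profile with n entries yields the maximum entropy log(n), and two identical profiles have a Manhattan distance of 0.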
Figure 5. Function-level profiles. The number of samples and the entropy for a single application (a). The correlation between the number of samples and the entropy for all per-machine profiles (b).
Figure 5b plots the relationship between the number of samples per machine and the entropy of function-level profiles. As expected, when the total number of samples is small, the profile's entropy is also small (limited by the maximum possible uncertainty). But once the threshold is reached at the maximum number of samples, the entropy becomes stable. We can observe two clusters from the graph; some entropies are concentrated between 5.5 and 6, and the others fall between 4.5 and 5. The application's two behavioral states can explain the two clusters. We've seen various clustering patterns on different applications.

The Manhattan distance between profiles. We use the Manhattan distance to study the variation between profiles considering the changes on entry names, where a smaller distance implies less variation. Figure 6a illustrates the Manhattan distance between the daily application-level profiles for a series of dates. The results from the Manhattan distances for both application-level and function-level profiles are similar to the results with entropy.

In Figure 6b, we plot the Manhattan distance for several profile types, leading to two observations:

- In general, memory and thread profiles have smaller distances, and their variations appear less correlated with the other profiles.
- Server CPU time profiles correlate with HPM profiles of cycles and instructions in terms of variations, which could imply that those variations resulted naturally from external reasons, such as workload changes.

To further understand the correlation between the Manhattan distance and the number of samples, we randomly pick a subset of machines from a specific machine set and then compute the Manhattan distance of the selected subset's profile against the whole set's profile. We could use a power function's trend line to capture the change in the Manhattan distance over the number of samples. The trend line roughly approximates a square-root relationship between the distance and the number of samples,

    M(X) = C / sqrt(N(X))

where N(X) is the total number of samples in a profile and C is a constant that depends on the profile type.

Derived metrics. We can also indirectly evaluate the profiles' stability by computing some derived metrics from multiple profiles. For example, we can derive CPI from HPM profiles containing cycles and retired instructions. Figure 7 shows that the derived CPI is stable across dates. Not surprisingly, the daily aggregated profiles' CPI falls into a small interval, between 1.7 and 1.8, for those days.

Comparing with other sources
Beyond measuring the profiles' stability across dates, we also cross-validate the profiles with performance and utilization data from other Google sources. One example is the utilization data that the data center's monitoring system collects. Unlike GWP, the monitoring system collects data from all machines in the data center but at a coarser granularity, such as overall CPU utilization. Its CPU utilization data, in terms of core-seconds, matches the measurement from GWP's CPU cycles profile with the following formula:

    CoreSeconds = Cycles * SamplingRate_machine * SamplingPeriod / CPUFrequency_average

Profile uses
In each profile, GWP records the samples of interesting events and a vector of associated information. GWP collects roughly a dozen events, such as CPU cycles, retired instructions, L1 and L2 cache misses, branch mispredictions, heap memory allocations, and lock contention time. The sample definition varies depending on the event type: it can be CPU cycles or cache misses, bytes allocated, or the sampled thread's locking time. Note that the sample must be numeric and capable of aggregation. The associated vector contains information such as application name, function name, platform, compiler version, image name, data center, kernel information, build revision, and builder's name.
Figure 6. The Manhattan distance between daily application-level profiles for various profile types (a). The correlation between the number of samples and the Manhattan distance of profiles (b).
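The square-root trend line used for subset-versus-whole comparisons can be fitted with a one-parameter least-squares estimate; the closed form below follows from minimizing the squared error of M = C / sqrt(N), and the data in the test is synthetic:

```python
import math

def fit_sqrt_trend(points):
    """Least-squares fit of C in M = C / sqrt(N), given (N, M) points.
    Minimizing sum((M_i - C/sqrt(N_i))^2) over C gives
    C = sum(M_i / sqrt(N_i)) / sum(1 / N_i)."""
    num = sum(m / math.sqrt(n) for n, m in points)
    den = sum(1.0 / n for n, _ in points)
    return num / den
```

Applied to measured (samples, distance) pairs, the fitted C quantifies how quickly a profile converges to the whole set's profile as more machines are sampled.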
  • 15. [3B2-14] mmi2010040065.3d 30/7/010 19:43 Page 75 24 2.0 22 1.8 20 1.6 Number of samples (millions) Cycles per instruction (CPI) 18 1.4 16 14 1.2 12 1.0 10 0.8 8 0.6 6 0.4 4 2 0.2 0 0 Random dates Figure 7. The correlation between the number of samples and derived cycles per instruction (CPI). platform, compiler version, image name, and functions, which is useful for many as- data center, kernel information, build revi- pects of designing, building, maintaining, sion, and builder’s name. Assuming that and operating data centers. Infrastructure the vector contains m elements, we can rep- teams can see the big picture of how their resent a record GWP collected as a tuple software stacks are being used, aggregated event, sample counter, m-dimension across every application. This helps identify vector. performance-critical components, create rep- When aggregating, GWP lets users resentative benchmark suites, and prioritize choose k keys from the m dimensions and performance efforts. groups the samples by the keys. Basically, it At the same time, an application team can filters the samples by imposing one or use GWP as the first stop for the applica- more restrictions on the rest of the dimen- tion’s profiles. As an always-on profiling ser- sions (m—k) and then projects the samples vice, GWP collects a representative sample of into k key dimensions. GWP finally displays the application’s running instances over time. the sorted results to users, delivering answers Application developers often are surprised to various performance queries with high by application’s profiles when browsing confidence. Although not every query GWP results. 
For example, Google’s speech- makes sense in practice, even a small subset recognition team quickly optimized a previ- of them are demonstrably informative in ously unknown hot function that they identifying performance issues and providing couldn’t have easily located without the insights into computing resources in the aggregated GWP results. Application teams cloud. also use GWP profiles to design, evaluate, and calibrate their load tests. Cloud applications’ performance GWP profiles provide performance Finding the hottest shared code. Shared code insights for cloud applications. Users can is remarkably abundant. Profiling each pro- see how cloud applications are actually con- gram independently might not identify hot suming machine resources and how the pic- shared code if it’s not hot in any single ap- ture evolves over time. For example, Figure 2 plication, but GWP can identify routines shows cycles distributed over top executables that don’t account for a significant portion .................................................................... JULY/AUGUST 2010 75
DATACENTER COMPUTING

Table 1. Platform affinity example.

Random assignment of instructions (CPI in brackets)

              Platform 1   Platform 2
NumCrunch     100 (1)      100 (1)
MemBench      100 (1)      100 (2)
Total cycles  200          300

Optimal assignment of instructions

              Platform 1   Platform 2
NumCrunch     0 (1)        200 (1)
MemBench      200 (1)      0 (2)
Total cycles  200          200

of any single application but consume the most cycles overall. For example, the GWP profiles revealed that the zlib library (www.zlib.net) accounted for nearly 5 percent of all CPU cycles consumed. That motivated an effort to optimize zlib routines and evaluate compression alternatives. In fact, some users have used GWP numbers to calculate the estimated savings of performance tuning efforts on shared functions. Not surprisingly, given the Google fleet's scale, even a single-percent improvement on a core routine could potentially save significant money per year. As a result, a new informal metric, "dollar amount per performance change," has become popular among Google engineers. We're considering providing a related metric, "dollar per source line," in annotated source views.

At the same time, some users have used GWP profiles as a source of coverage information to assess the feasibility of deprecating library functions and to uncover the remaining users of functions slated for deprecation. Because of the profiles' dynamic nature, users might miss less common clients, but the biggest, most important callers are easy to find.

Evaluating hardware features. The low-level information GWP provides about how CPU cycles (and other machine resources) are spent is also used for early evaluation of new hardware features that datacenter operators might want to introduce. One interesting example has been to evaluate whether it would be beneficial to use a special coprocessor to accelerate floating-point computation, by looking at its percentage of all Google's computations. As another example, GWP profiles can identify the applications running on old hardware configurations and evaluate whether those configurations should be retired for efficiency.

Optimizing for application affinities

Some applications run better on a particular hardware platform because of sensitivity to architectural details, such as processor microarchitecture or cache size. It's generally very hard or impossible to predict which application will fare best on which platform. Instead, we measure an efficiency metric, CPI, for each application and platform combination. We can then improve job scheduling so that applications are scheduled on the platforms where they do best, subject to availability. The example in Table 1 shows how the total number of cycles needed to run a fixed number of instructions on a fixed machine capacity drops from 500 to 400 with preferential scheduling. Specifically, although the application NumCrunch runs just as well on Platform1 as on Platform2, the application MemBench does poorly on Platform2 because of that platform's smaller L2 cache. Thus, the scheduler should preferentially assign MemBench to Platform1.

The overall optimization process has several steps. First, we derive cycle and instruction samples from GWP profiles. Then, we compute an improved assignment table by moving instructions away from the application-platform combinations with the worst relative efficiency. We use cycle and instruction samples over a fixed period of time, aggregated per job and platform. We then compute CPI and normalize it by clock rate. We can formulate finding the optimal assignment as a linear programming problem. The one unknown is Load_ij, the number of instruction samples of application j on platform i. The constants are:

- CPI_ij, the measured CPI of application j on platform i,
- TotalLoad_j, the total measured number of instruction samples of application j, and
- Capacity_i, the total capacity for platform i, measured as the total number of cycle samples for platform i.

IEEE MICRO
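The arithmetic behind Table 1 can be checked with a tiny sketch. The CPI values and instruction counts are taken from the table; the helper function and variable names are our own:

```python
# CPI measured for each (application, platform) pair, as in Table 1.
CPI = {('NumCrunch', 'Platform1'): 1, ('NumCrunch', 'Platform2'): 1,
       ('MemBench',  'Platform1'): 1, ('MemBench',  'Platform2'): 2}

def total_cycles(assignment):
    """Cycles = instructions x CPI, summed over all placements."""
    return sum(n * CPI[pair] for pair, n in assignment.items())

# Random assignment: each application split evenly across platforms.
random_assignment = {('NumCrunch', 'Platform1'): 100, ('NumCrunch', 'Platform2'): 100,
                     ('MemBench',  'Platform1'): 100, ('MemBench',  'Platform2'): 100}

# Optimal assignment: MemBench avoids Platform2's smaller L2 cache.
optimal_assignment = {('NumCrunch', 'Platform1'): 0,   ('NumCrunch', 'Platform2'): 200,
                      ('MemBench',  'Platform1'): 200, ('MemBench',  'Platform2'): 0}
```

Here `total_cycles(random_assignment)` yields 500 and `total_cycles(optimal_assignment)` yields 400, matching the drop the text attributes to preferential scheduling.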
The equation is:

  minimize   Σ_{i,j} CPI_ij × Load_ij

  where      Σ_i Load_ij = TotalLoad_j   for each application j,

  and        Σ_j CPI_ij × Load_ij ≤ Capacity_i   for each platform i.

We use a simulated annealing solver that approximates the optimal solution in seconds for workloads of around 100 jobs running on thousands of machines of four different platforms over one month. Although application developers had already mapped major applications to their best platforms through manual assignment, we've measured 10 to 15 percent potential improvement in most cases where many jobs run on multiple platforms. Similarly, users can use GWP data to identify how to colocate multiple applications on a single machine to achieve the best throughput.6

Datacenter performance monitoring

GWP users can also use GWP queries with computing-resource-related keys, such as data center, platform, compiler, or builder, for auditing purposes. For example, when rolling out a new compiler, users can inspect which compiler versions the running applications were actually built with. Users can easily measure such transitions from time-based profiles. Similarly, a user can measure how soon a new hardware platform becomes active or how quickly old ones are retired. This also applies to new versions of applications being rolled out. Grouping by data center, GWP displays how the applications, CPU cycles, and machine types are distributed across different locations.

Beyond providing profile snapshots, GWP can monitor the changes between chosen profiles from two queries. The two profiles should be similar in that they must have identical keys and events, but they differ in one or more other dimensions. For applications, GWP usually focuses on monitoring function profiles. This is useful for two reasons:

- performance optimization normally starts at the function level, and
- function samples can reconstruct the entire call graph, showing users how the CPU cycles (and other events) are distributed in the program.

A dramatic change to hot functions in the daily profiles can trigger a finer-grained comparison, which eventually points blame to a source code revision number, compiler, or data center.

Feedback-directed optimization

Sampled profiles can also be used for feedback-directed compiler optimization (FDO), as outlined in work by Chen et al.7 GWP collects such sampled profiles and offers a mechanism to extract the profile for a particular binary in a format that the compiler understands. This profile will be of higher quality than any profile derived from test inputs because it was derived from runs on live data. Furthermore, as long as developers make no changes to critical program sections, we can use an aged profile to improve a freshly released binary's performance. Many Web companies have release cycles of two weeks or less, so this approach works well in practice.

Similar to load-test calibration, users also use GWP for quality assurance of profiles generated from static benchmarks. Benchmarks can represent some codes well but not others, and users use GWP to identify which codes are ill-suited for FDO compilation.

Finally, users use GWP to estimate the quality of HPM events and microarchitectural features, such as the cache latencies of various incarnations of processors with the same instruction set architecture (ISA). For example, if the same application runs on two platforms, we can compare the HPM counters and identify the relevant differences.

Besides adding more types of performance events to collect, we're now exploring further directions for using GWP profiles. These include not only user-interface enhancements but also advanced data-mining techniques to detect interesting patterns in profiles. It's also interesting to mash up GWP profiles with performance data from other sources to address complex performance problems in datacenter applications.
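The kind of day-over-day function-profile comparison used as a monitoring trigger can be sketched as follows; the flagging rule and the 50 percent threshold are our own illustrative choices, not GWP's:

```python
def changed_functions(before, after, threshold=0.5):
    """Flag functions whose share of total samples moved by more than
    `threshold` (relative to the larger share) between two profiles.

    `before` and `after` map function names to sample counts; the two
    profiles are assumed to use identical keys and the same event.
    """
    total_b, total_a = sum(before.values()), sum(after.values())
    flagged = []
    for fn in set(before) | set(after):
        share_b = before.get(fn, 0) / total_b
        share_a = after.get(fn, 0) / total_a
        larger = max(share_b, share_a)
        if larger and abs(share_a - share_b) / larger > threshold:
            flagged.append((fn, share_b, share_a))
    # Biggest movers first, as candidates for finer-grained comparison.
    return sorted(flagged, key=lambda t: abs(t[2] - t[1]), reverse=True)
```

A flagged function would then prompt the finer-grained comparison the text describes, narrowing blame to a source revision, compiler, or data center.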
MICRO