Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers
Google-Wide Profiling (GWP), a continuous profiling infrastructure for data centers, provides performance insights for cloud applications. With negligible overhead, GWP provides stable, accurate profiles and a datacenter-scale tool for traditional performance analyses. Furthermore, GWP introduces novel applications of its profiles, such as application-platform affinity measurements and identification of platform-specific, microarchitectural peculiarities.


Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt, Google

As cloud-based computing grows in pervasiveness and scale, understanding datacenter applications' performance and utilization characteristics is critically important, because even minor performance improvements translate into huge cost savings. Traditional performance analysis, which typically needs to isolate benchmarks, can be too complicated or even impossible with modern datacenter applications. It's easier and more representative to monitor datacenter applications running on live traffic. However, application owners won't tolerate latency degradations of more than a few percent, so these tools must be nonintrusive and have minimal overhead. As with all profiling tools, observer distortion must be minimized to enable meaningful analysis. (For additional information on related techniques, see the "Profiling: From single systems to data centers" sidebar.)

Sampling-based tools can bring overhead and distortion to acceptable levels, so they're uniquely qualified for performance monitoring in the data center. Traditionally, sampling tools run on a single machine, monitoring specific processes or the system as a whole. Profiling can begin on demand to analyze a performance problem, or it can run continuously.[1]

Google-Wide Profiling can be theoretically viewed as an extension of the Digital Continuous Profiling Infrastructure (DCPI)[1] to data centers. GWP is a continuous profiling infrastructure; it samples across machines in multiple data centers and collects various events—such as stack traces, hardware events, lock contention profiles, heap profiles, and kernel events—allowing cross-correlation with job scheduling data, application-specific data, and other information from the data centers.

GWP collects daily profiles from several thousand applications running on thousands of servers, and the compressed profile
0272-1732/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.
DATACENTER COMPUTING

Sidebar: Profiling: From single systems to data centers
From gprof[1] to Intel VTune (http://software.intel.com/en-us/intel-vtune), profiling has long been standard in the development process. Most profiling tools focus on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important.

Continuous, always-on profiling became more important with Morph and DCPI for Digital UNIX. Morph uses low overhead system-wide sampling (approximately 0.3 percent) and binary rewriting to continuously evolve programs to adapt to their host architectures.[2] DCPI gathers much more robust profiles, such as precise stall times and causes, and focuses on reporting information to users.[3]

OProfile, a DCPI-inspired tool, collects and reports data much in the same way for a plethora of architectures, though it offers less sophisticated analysis. GWP uses OProfile (http://oprofile.sourceforge.net) as a profile source. High-performance computing (HPC) shares many profiling challenges with cloud computing, because profilers must gather representative samples with minimal overhead over thousands of nodes.

Cloud computing has additional challenges—vastly disparate applications, workloads, and machine configurations. As a result, different profiling strategies exist for HPC and cloud computing. HPCToolkit uses lightweight sampling to collect call stacks from optimized programs and compares profiles to identify scaling issues as parallelism increases.[4] Open|SpeedShop (www.openspeedshop.org) provides similar functionality. Similarly, Upshot[5] and Jumpshot[6] can analyze traces (such as MPI calls) from parallel programs but aren't suitable for continuous profiling.

References
1. S.L. Graham, P.B. Kessler, and M.K. McKusick, "Gprof: A Call Graph Execution Profiler," Proc. Symp. Compiler Construction (CC 82), ACM Press, 1982, pp. 120-126.
2. X. Zhang et al., "System Support for Automatic Profiling and Optimization," Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 15-26.
3. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
4. N.R. Tallent et al., "Diagnosing Performance Bottlenecks in Emerging Petascale Applications," Proc. Conf. High Performance Computing Networking, Storage and Analysis (SC 09), ACM Press, 2009, no. 51.
5. V. Herrarte and E. Lusk, Studying Parallel Program Behavior with Upshot, tech. report, Argonne National Laboratory, 1991.
6. O. Zaki et al., "Toward Scalable Performance Visualization with Jumpshot," Int'l J. High Performance Computing Applications, vol. 13, no. 3, 1999, pp. 277-288.




database grows by several Gbytes every day. Profiling at this scale presents significant challenges that don't exist for a single machine. Verifying that the profiles are correct is important and challenging because the workloads are dynamic. Managing profiling overhead becomes far more important as well, as any unnecessary profiling overhead can cost millions of dollars in additional resources. Finally, making the profile data universally accessible is an additional challenge. GWP is also a cloud application, with its own scalability and performance issues.

With this volume of data, we can answer typical performance questions about datacenter applications, including the following:

- What are the hottest processes, routines, or code regions?
- How does performance differ across software versions?
- Which locks are most contended?
- Which processes are memory hogs?
- Does a particular memory allocation scheme benefit a particular class of applications?
- What is the cycles per instruction (CPI) for applications across platforms?

Additionally, we can derive higher-level answers to more complex but interesting questions, such as which compilers were used for applications in the fleet, whether there are more 32-bit or 64-bit applications running, and how much utilization is being lost by suboptimal job scheduling.

Infrastructure
Figure 1 provides an overview of the entire GWP system.

Collector
GWP samples in two dimensions. At any moment, profiling occurs only on a small subset of all machines in the fleet, and event-based sampling is used at the machine level. Sampling in only one dimension would
be unsatisfactory; if event-based profiling were active on every machine all the time, at a normal event-sampling rate, we would be using too many resources across the fleet. Alternatively, if the event-sampling rate is too low, profiles become too sparse to drill down to the individual machine level. For each event type, we choose a sampling rate high enough to provide meaningful machine-level data while still minimizing the distortion caused by the profiling on critical applications. The system has been actively profiling nearly all machines at Google for several years, with only rare complaints of system interference.

[Figure 1. An overview of the Google-Wide Profiling (GWP) infrastructure. The whole system consists of collector, symbolizer, profile database, Web server, and other components.]

A central machine database manages all machines in the fleet and lists every machine's name and basic hardware characteristics. The GWP profile collector periodically gets a list of all machines from that database and selects a random sample of machines from that pool. The collector then remotely activates profiling on the selected machines and retrieves the results. It retrieves different types of sampled profiles sequentially or concurrently, depending on the machine and event type. For example, the collector might gather hardware performance counters for several seconds each, then move on to profiling for lock contention or memory allocation. It takes a few minutes to gather profiles for a specific machine.

For robustness, the GWP collector is a distributed service, which helps improve availability and reduce additional variation from the collector itself. To minimize distortion on the machines and the services running on them, the collector monitors error conditions and ceases profiling if the failure rate reaches a predefined threshold. Aside from the collector, we monitor all other GWP components to ensure an always-on service to users.

On top of the two-dimensional sampling approach, we apply several techniques to further reduce the overhead. First, we measure the event-based profiling overhead on a set of benchmark applications and then conservatively set the maximum rates to ensure the overhead is always less than a few percent. Second, we don't collect whole call stacks for the machine-wide profiles, to avoid the high overhead associated with unwinding (but we collect call stacks for most server profiles at lower sampling frequencies). Finally, we save the profile and metadata in their raw format and perform symbolization on a separate set of machines. As a result, the aggregated profiling overhead is negligible—less than 0.01 percent. At the same time, the derived profiles are still meaningful, as we show in the "Reliability analysis" section.

Profiles and profiling interfaces
GWP collects two categories of profiles: whole-machine and per-process. Whole-machine profiles capture all activities happening on the machine, including user applications, the kernel, kernel modules, daemons, and other background jobs. The whole-machine profiles include hardware performance monitoring (HPM) event profiles, kernel event traces, and power measurements.

Users without root access cannot directly invoke most of the whole-machine profiling systems, so we deploy lightweight daemons on every machine to let remote users (such as GWP collectors) access those profiles. The daemons act as gatekeepers to control access, enforce sampling rate limits, and collect system variables that must be synchronized with the profiles.
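The two-dimensional sampling loop described above can be sketched roughly as follows. This is a minimal illustration, not GWP's implementation; the event types, the sampled fraction, and the function names are hypothetical.

```python
import random

# Hypothetical event types and per-machine sampling durations (seconds);
# illustrative only, not GWP's actual configuration.
EVENT_TYPES = {
    "cpu_cycles": 5,
    "lock_contention": 5,
    "heap_profile": 5,
}

def select_machines(machine_database, fraction=0.01):
    """First dimension: at any moment, profile only a small random
    subset of all machines in the fleet."""
    pool = list(machine_database)
    k = max(1, int(len(pool) * fraction))
    return random.sample(pool, k)

def collect(machine_database, profile_machine, failure_threshold=0.1):
    """One collector round: remotely activate profiling on each selected
    machine, one event type at a time, and cease profiling if the
    failure rate reaches a predefined threshold."""
    failures, attempts = 0, 0
    profiles = []
    for machine in select_machines(machine_database):
        for event, duration in EVENT_TYPES.items():
            attempts += 1
            try:
                # Second dimension: event-based sampling on the machine
                # itself, for a few seconds per event type.
                profiles.append(profile_machine(machine, event, duration))
            except ConnectionError:
                failures += 1
                if failures / attempts >= failure_threshold:
                    return profiles  # back off for robustness
    return profiles
```

In this sketch, `profile_machine` stands in for the remote activation and retrieval step that, in GWP, goes through the per-machine daemons.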
We use OProfile (http://oprofile.sourceforge.net) to collect HPM event profiles. OProfile is a system-wide profiler that uses HPM to generate event-based samples for all running binaries at low overhead. To hide the heterogeneity of events between architectures, we define some generic HPM events on top of the platform-specific events, using an approach similar to PAPI.[2] The most commonly used generic events are CPU cycles, retired instructions, L1 and L2 cache misses, and branch mispredictions. We also provide access to some architecture-specific events. Although the aggregated profiles for those events are biased to specific architectures, they provide useful information for machine-specific scenarios.

In addition to whole-machine profiles, we collect various types of profiles from most applications running on a machine using the Google Performance Tools (http://code.google.com/p/google-perftools). Most applications include a common library that enables process-wide, stacktrace-attributed profiling mechanisms for heap allocation, lock contention, wall time and CPU time, and other performance metrics. The common library includes a simple HTTP server linked with handlers for each type of profiler. A handler accepts requests from remote users, activates profiling (if it's not already active), and then sends the profile data back.

The GWP collector learns from the cluster-wide job management system which applications are running on a machine and on which port each can be contacted for remote profiling invocation. Some programs lack remote profiling support, so the per-process profiling doesn't capture them. However, these are few; comparison with system-wide profiles shows that remote per-process profiling captures the vast majority of Google's programs. Most programs' users have found profiling useful or unobtrusive enough to leave it enabled.

Together with profiles, GWP collects other information about the target machine and applications. Some of the extra information is needed to postprocess the collected profiles, such as a unique identifier for each running binary that can be correlated across machines with unstripped versions for offline symbolization. The rest is mainly used to tag the profiles so that we can later correlate them with job, machine, or datacenter attributes.

Symbolization and binary storage
After collection, the Google File System (GFS) stores the profiles.[3] To provide meaningful information, the profiles must correlate to source code. However, to save network bandwidth and disk space, applications are usually deployed into data centers without any debug or symbolic information, which can make source correlation impossible. Furthermore, several applications, such as Java and QEMU, dynamically generate and execute code. The code is not available offline and can therefore no longer be symbolized. The symbolizer must also symbolize operating system kernels and kernel loadable modules.

Therefore, the symbolization process becomes surprisingly complicated, although it's usually trivial for single-machine profiling. Various strategies exist to obtain binaries with debug information. For example, we could try to recompile all sampled applications at specific source milestones. However, this is too resource-intensive and sometimes impossible for applications whose source isn't readily available. An alternative is to persistently store binaries that contain debug information before they're stripped.

Currently, GWP stores unstripped binaries in a global repository, which other services use to symbolize stack traces for automated failure reporting. Since the binaries are quite large and many unique binaries exist, symbolization for a single day of profiles would take weeks if run sequentially. To reduce the result latency, we distribute symbolization across a few hundred machines using MapReduce.[4]

Profile storage
Over the years, GWP has amassed several terabytes of historical performance data. GFS archives the entire performance logs and corresponding binaries. To make the data useful and accessible, we load the samples into a read-only dimensional database that is distributed across hundreds of machines. That service is accessible to all users for ad hoc queries and to systems for automated analyses.
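The MapReduce-style symbolization described above can be sketched in miniature: samples are keyed by each binary's unique identifier, so a binary's symbol table is loaded once per group rather than once per sample. The names and the sample schema here are illustrative assumptions, not GWP's actual interfaces.

```python
from collections import defaultdict

def map_phase(samples):
    """Map: emit (build_id, address) pairs, one per sampled stack frame."""
    for sample in samples:
        yield sample["build_id"], sample["address"]

def reduce_phase(pairs, load_symbol_table):
    """Reduce: group addresses by binary, then symbolize each group with
    that binary's symbol table (e.g., fetched from the unstripped-binary
    repository). Loading the table once per binary is what makes the
    distributed version tractable."""
    groups = defaultdict(list)
    for build_id, address in pairs:
        groups[build_id].append(address)
    symbolized = {}
    for build_id, addresses in groups.items():
        table = load_symbol_table(build_id)  # address -> function name
        symbolized[build_id] = [table.get(a, "<unknown>") for a in addresses]
    return symbolized
```

In a real MapReduce, the groups would be sharded across a few hundred workers; the single-process version above only shows the keying and grouping logic.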
The database supports a subset of SQL-like semantics. Although the dimensional database is well suited to perform queries that aggregate over the large data set, some individual queries can take tens of seconds to complete. Fortunately, most queries are seen frequently, so the profile server uses aggressive caching to hide the database latency.
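The query-plus-caching pattern described above might look roughly like this. The schema, table, and column names are invented for illustration; the article only states that the database supports a subset of SQL-like semantics.

```python
# A hypothetical ad hoc query for the hottest functions over a week.
QUERY = """
SELECT function, SUM(samples) AS cycles
FROM profiles
WHERE event = 'cpu_cycles'
  AND day BETWEEN '2010-07-01' AND '2010-07-07'
GROUP BY function
ORDER BY cycles DESC
LIMIT 20
"""

_QUERY_CACHE = {}

def hottest_functions(execute, query=QUERY):
    """Run the query through the profile server's cache: the same
    queries are seen frequently and can take tens of seconds, so a
    cached result hides the database latency."""
    if query not in _QUERY_CACHE:
        _QUERY_CACHE[query] = execute(query)
    return _QUERY_CACHE[query]
```

Here `execute` stands in for whatever client actually submits the query to the distributed dimensional database.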

User interfaces
For most users, GWP deploys a webserver to provide a user interface on top of the profile database. This makes it easy to access profile data and construct ad hoc queries for the traditional use of application profiles (with additional freedom to filter, group, and aggregate profiles differently).
Query view. Several visual interfaces retrieve information from the profile database, and all are navigable from a web browser. The primary interface (see Figure 2) displays the result entries, such as functions or executables, that match the desired query parameters. This page supplies links that let users refine the query to more specific data. For example, the user can restrict the query to only report samples for a specific executable collected within a desired time period. Additionally, the user can modify or refine any of the parameters to the current query to create a custom profile view. The GWP homepage has links to display the top results, Google-wide, for each performance metric.
[Figure 2. An example query view: an application-level profile (a) and a function-level profile (b).]

Call graph view. For most server profile samples, the profilers collect full call stacks with each sample. Call stacks are aggregated to produce complete dynamic call graphs for a given profile. Figure 3 shows an example call graph. Each node displays the function name and its percentage of samples, and the nodes are shaded based on this percentage. The call graph is also displayed through the web browser, via a Graphviz plug-in.

Source annotation. The query and call graph views are useful in directing users to specific functions of interest. From there, GWP provides a source annotation view that presents the original source file with a header describing overall profile information about the file and a histogram bar showing the relative hotness of each source file line. Because different versions of each file can exist in source repositories and branches, we retrieve a hash signature from the repository for each source file and aggregate samples on files with identical signatures.

Profile data API. In addition to the webserver, we offer a data-access API to read profiles directly from the database. It's more suitable for automated analyses that must process a large amount of profile
                                                                                                JULY/AUGUST 2010                          69
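The signature-based merging described under "Source annotation" can be sketched as follows. This is an illustrative assumption, not GWP's actual implementation: the flat `(source_text, line, count)` record layout and the use of SHA-1 as the signature stand in for the article's repository hash.

```python
import hashlib
from collections import defaultdict

def aggregate_by_signature(samples):
    """Merge line-level samples keyed on (file signature, line) so that
    byte-identical copies of a file, e.g. from different branches,
    aggregate together.

    `samples` is a list of (source_text, line_number, count) tuples;
    this record layout is a hypothetical stand-in for GWP's formats.
    """
    merged = defaultdict(int)
    for source_text, line, count in samples:
        # A content hash stands in for the repository's signature:
        # identical file versions share a signature and thus a bucket.
        signature = hashlib.sha1(source_text.encode()).hexdigest()
        merged[(signature, line)] += count
    return dict(merged)

main_v1 = "int main() { return 0; }\n"
main_v2 = "int main() { return 1; }\n"
merged = aggregate_by_signature(
    [(main_v1, 1, 5), (main_v1, 1, 7), (main_v2, 1, 3)])
print(sorted(merged.values()))  # [3, 12]: identical versions merged
```

Samples from the two identical copies of `main_v1` land in one bucket, while the edited `main_v2` keeps its own, which is exactly the behavior the annotation view needs across branches.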
DATACENTER COMPUTING

[Figure 3. An example dynamic call graph. Function names are intentionally blurred.]

We store both raw profiles and symbolized profiles in ProtocolBuffer format (http://code.google.com/apis/protocolbuffers). Advanced users can access and reprocess them using their preferred programming language.

Application-specific profiling
Although the default sampling rate is high enough to derive top-level profiles with high confidence, GWP might not collect enough samples for applications that consume relatively few cycles Google-wide. Increasing the overall sampling rate to cover those profiles would be too expensive, because they're usually sparse.
Therefore, we provide an extension to GWP for application-specific profiling on the cloud. The machine pool for application-specific profiling is usually much smaller than GWP's, so we can achieve a high sampling rate on those machines for the specific application. Several application teams at Google use application-specific profiling to continuously monitor their applications running on the fleet.
Application-specific profiling is generic and can target any specific set of machines. For example, we can use it to profile a set of machines deployed with the newest kernel version. We can also limit the profiling duration to a small time period, such as the application's running time. This is useful for batch jobs running on data centers, such as MapReduce, because it facilitates collecting, aggregating, and exploring profiles from hundreds or thousands of workers.

Reliability analysis
To conduct continuous profiling on datacenter machines serving real traffic, extremely low overhead is paramount, so we sample in both the time and machine dimensions. Sampling introduces variation, so we must measure and understand how sampling affects the profiles' quality. But the nature of datacenter workloads makes this difficult: their behavior is continually changing, and there is no direct way to measure how representative the datacenter applications' profiles are. Instead, we use two indirect methods to evaluate their soundness. First, we study the stability of the aggregated profiles themselves using several different metrics. Second, we correlate the profiles with performance data from other sources to cross-validate both.

Stability of profiles
We use a single metric, entropy, to measure a given profile's variation. In short, entropy is a measure of the uncertainty associated with a random variable, which in this case is the profile samples. The entropy H of a profile W is defined as

    H(W) = - sum_{i=1..n} p(x_i) log(p(x_i))

where n is the total number of entries in the profile and p(x_i) is the fraction of profile samples on the entry x_i.5
In general, a high entropy implies a flat profile with many samples. A low entropy usually results from a small number of samples or from most samples being concentrated in a few entries. We're not concerned with entropy itself. Because entropy is like a signature of a profile, measuring its inherent variation, it should be stable between representative profiles.
                          70                      IEEE MICRO
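The entropy metric above can be sketched in a few lines of Python. The base-2 logarithm and the dictionary representation of a profile are assumptions for illustration; the article does not specify either.

```python
import math

def profile_entropy(samples):
    """Entropy H = -sum(p_i * log2(p_i)) of a profile, where `samples`
    maps an entry (e.g., an application or function) to its sample count."""
    total = sum(samples.values())
    if total == 0:
        return 0.0
    h = 0.0
    for count in samples.values():
        if count > 0:
            p = count / total  # fraction of samples on this entry
            h -= p * math.log2(p)
    return h

# A flat profile has high entropy; a concentrated one has low entropy.
flat = {"app_a": 100, "app_b": 100, "app_c": 100, "app_d": 100}
skewed = {"app_a": 397, "app_b": 1, "app_c": 1, "app_d": 1}
print(profile_entropy(flat))    # 2.0
print(profile_entropy(skewed))  # ~0.08
```

The flat four-entry profile reaches the maximum entropy of 2 bits, while the profile with nearly all samples on one entry falls close to zero, matching the intuition described in the text.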
[Figure 4. The number of samples and the entropy of daily application-level profiles. The primary y-axis (bars) is the total number of profile samples. The secondary y-axis (line) is the entropy of the daily application-level profile.]

Entropy doesn't account for differences between entry names. For example, a function profile with x percent of samples on foo and y percent on bar has the same entropy as a profile with y percent on foo and x percent on bar. So, when we need to identify changes in the same entries between profiles, we calculate the Manhattan distance of two profiles by adding the absolute percentage differences of the top k entries, defined as

    M(X, Y) = sum_{i=1..k} |p_X(x_i) - p_Y(x_i)|

where X and Y are two profiles, k is the number of top entries to count, and p_Y(x_i) is 0 when x_i is not in Y. Essentially, the Manhattan distance is a simplified version of the relative entropy between two profiles.

Profiles' entropy. First, we compare the entropies of application-level profiles, in which samples are broken down by individual application. Figure 2a shows an example of such an application-level profile.
Figure 4 shows the daily application-level profiles' entropies for a series of dates, together with the total number of profile samples collected for each date. Unless otherwise specified, we used CPU cycles in this study, and our conclusions also apply to the other profile types. As the graph shows, the entropy of the daily application-level profiles is stable between dates, and it usually falls into a small interval. The correlation between the number of samples and the profile's entropy is loose: once the number of samples reaches some threshold, additional samples don't necessarily lower the entropy, partly because GWP sometimes samples more machines than necessary for daily application-level profiles. This is because users frequently must drill down to specific profiles with additional filters on certain tags, such as application names, which cover only a small fraction of all profiles collected.
We can conduct a similar analysis on an application's function-level profile (for example, the application in Figure 2b). The result, shown in Figure 5a, is from an application whose workload is fairly stable when aggregated over many clients; its entropy is accordingly even more stable. It's also interesting to analyze how entropy varies among machines for an application's function-level profiles. Unlike the profiles aggregated across machines, an application's per-machine profiles can vary greatly in the number of samples.
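The Manhattan distance defined above can be sketched directly. One detail is an assumption here: the article doesn't specify which entries count as the "top k" when the two profiles disagree, so this sketch takes the top k entries of the first profile, with absent entries contributing zero as in the definition.

```python
def manhattan_distance(x, y, k=20):
    """Manhattan distance M(X, Y) = sum over the top-k entries of
    |p_X(x_i) - p_Y(x_i)|, with p_Y(x_i) = 0 when x_i is absent from Y.

    `x` and `y` map entry names to the percentage of samples on that
    entry. Ranking by `x`'s percentages is an illustrative assumption.
    """
    top_k = sorted(x, key=x.get, reverse=True)[:k]
    return sum(abs(x[entry] - y.get(entry, 0.0)) for entry in top_k)

x = {"foo": 60.0, "bar": 30.0, "baz": 10.0}
y = {"foo": 50.0, "bar": 40.0, "qux": 10.0}
print(manhattan_distance(x, y, k=2))  # |60-50| + |30-40| = 20.0
```

Identical profiles yield a distance of 0, and, unlike entropy, swapping the percentages of foo and bar between two profiles produces a nonzero distance, which is exactly the name-sensitivity the metric adds.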
[Figure 5. Function-level profiles. The number of samples and the entropy for a single application (a). The correlation between the number of samples and the entropy for all per-machine profiles (b).]
Figure 5b plots the relationship between the number of samples per machine and the entropy of the function-level profiles. As expected, when the total number of samples is small, the profile's entropy is also small (limited by the maximum possible uncertainty). But once the number of samples reaches a threshold, the entropy becomes stable. We can observe two clusters in the graph: some entropies are concentrated between 5.5 and 6, and the others fall between 4.5 and 5. The application's two behavioral states explain the two clusters. We've seen various clustering patterns for different applications.

The Manhattan distance between profiles. We use the Manhattan distance to study the variation between profiles while accounting for changes in entry names, where a smaller distance implies less variation. Figure 6a illustrates the Manhattan distance between the daily application-level profiles for a series of dates. The results from the Manhattan distances for both application-level and function-level profiles are similar to the results with entropy.
In Figure 6b, we plot the Manhattan distance for several profile types, leading to two observations:

- In general, memory and thread profiles have smaller distances, and their variations appear less correlated with the other profiles.
- Server CPU time profiles correlate with HPM profiles of cycles and instructions in terms of variations, which could imply that those variations arise naturally from external causes, such as workload changes.

To further understand the correlation between the Manhattan distance and the number of samples, we randomly pick a subset of machines from a specific machine set and then compute the Manhattan distance of the selected subset's profile against the whole set's profile. A power-function trend line captures the change in the Manhattan distance over the number of samples; it roughly approximates a square-root relationship between the distance and the number of samples,

    M(X) = C / sqrt(N(X))

where N(X) is the total number of samples in a profile and C is a constant that depends on the profile type.

Derived metrics. We can also indirectly evaluate the profiles' stability by computing derived metrics from multiple profiles. For example, we can derive cycles per instruction (CPI) from HPM profiles containing cycles and retired instructions. Figure 7 shows that the derived CPI is stable across dates. Not surprisingly, the daily aggregated profiles' CPI falls into a small interval, between 1.7 and 1.8, for those days.

Comparing with other sources
Beyond measuring the profiles' stability across dates, we also cross-validate the profiles against performance and utilization data from other sources at Google. One example is the utilization data that the data center's monitoring system collects. Unlike GWP, the monitoring system collects data from all machines in the data center, but at a coarser granularity, such as overall CPU utilization. Its CPU utilization data, in terms of core-seconds, matches the measurement from GWP's CPU cycles profile via the following formula:

    CoreSeconds = Cycles * SamplingRate_machine * SamplingPeriod / CPUFrequency_average

Profile uses
In each profile, GWP records samples of interesting events and a vector of associated information. GWP collects roughly a dozen events, such as CPU cycles, retired instructions, L1 and L2 cache misses, branch mispredictions, heap memory allocations, and lock contention time. The sample definition varies with the event type: it can be CPU cycles or cache misses, bytes allocated, or the sampled thread's locking time. Note that the sample must be numeric and capable of aggregation. The associated vector contains information such as application name, function name,
[3B2-14]       mmi2010040065.3d              30/7/010                                19:43        Page 74




                          ...............................................................................................................................................................................................
                          DATACENTER COMPUTING



[Figure 6. The Manhattan distance between daily application-level profiles for various profile types (a). The correlation between the number of samples and the Manhattan distance of profiles (b). Axes: number of samples (millions) versus Manhattan distance; profile types: Cycles, Instr, L1_Miss, L2_Miss, CPU, Heap, Threads; measured on random dates.]
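Figure 6's stability metric can be made concrete with a short sketch (hypothetical function names and sample data, not GWP's actual code): normalize each day's profile so its samples sum to 1, then take the Manhattan (L1) distance between the two days' vectors; a small distance means the profile is stable from day to day.

```python
def normalize(samples):
    """Convert raw per-function sample counts into fractions of the total."""
    total = sum(samples.values())
    return {name: count / total for name, count in samples.items()}

def manhattan_distance(profile_a, profile_b):
    """L1 distance between two normalized profiles keyed by function name."""
    names = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(n, 0.0) - profile_b.get(n, 0.0)) for n in names)

# Hypothetical per-function cycle samples from two different days.
day1 = normalize({"zlib_deflate": 400, "memcpy": 350, "parse_query": 250})
day2 = normalize({"zlib_deflate": 420, "memcpy": 330, "parse_query": 250})

print(manhattan_distance(day1, day2))  # about 0.04: the profile barely moved
```

Normalizing first is what makes profiles from days with different sample volumes comparable, which is why the distances in Figure 6 shrink as the number of samples grows.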



[Figure 7. The correlation between the number of samples and derived cycles per instruction (CPI). Axes: number of samples (millions) versus CPI; measured on random dates.]
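Figure 7's derived CPI, together with the record-and-aggregation model described in the paragraphs that follow, can be sketched roughly as follows (hypothetical field names and sample data, not GWP's schema): each record carries an event type, a sample count, and a vector of attributes; grouping cycle and instruction samples by a chosen key and dividing the two totals yields a per-key CPI.

```python
from collections import defaultdict

# Each record: (event, sample_count, attributes). The attribute dict stands in
# for GWP's m-dimension vector (platform, data center, build revision, ...).
records = [
    ("cycles", 1200, {"app": "websearch", "platform": "P1"}),
    ("instructions", 1000, {"app": "websearch", "platform": "P1"}),
    ("cycles", 600, {"app": "gmail", "platform": "P2"}),
    ("instructions", 800, {"app": "gmail", "platform": "P2"}),
]

def aggregate(records, keys, restrict=None):
    """Filter on non-key dimensions, then project samples onto the key dimensions."""
    restrict = restrict or {}
    totals = defaultdict(lambda: defaultdict(int))
    for event, count, attrs in records:
        if all(attrs.get(dim) == val for dim, val in restrict.items()):
            group = tuple(attrs[k] for k in keys)
            totals[group][event] += count
    return totals

# Derived CPI per application: cycle samples divided by instruction samples.
by_app = aggregate(records, keys=["app"])
cpi = {g: t["cycles"] / t["instructions"] for g, t in by_app.items()}
print(cpi)  # {('websearch',): 1.2, ('gmail',): 0.75}
```

One caveat: a real derived CPI must also account for each event's sampling period; the sketch assumes both events are sampled at the same rate.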




platform, compiler version, image name, data center, kernel information, build revision, and builder's name. Assuming that the vector contains m elements, we can represent a record GWP collected as a tuple ⟨event, sample counter, m-dimension vector⟩.

    When aggregating, GWP lets users choose k keys from the m dimensions and groups the samples by those keys. Basically, it filters the samples by imposing one or more restrictions on the rest of the dimensions (m − k) and then projects the samples onto the k key dimensions. GWP finally displays the sorted results to users, delivering answers to various performance queries with high confidence. Although not every query makes sense in practice, even a small subset of them is demonstrably informative in identifying performance issues and providing insights into computing resources in the cloud.

Cloud applications' performance
    GWP profiles provide performance insights for cloud applications. Users can see how cloud applications are actually consuming machine resources and how the picture evolves over time. For example, Figure 2 shows cycles distributed over the top executables and functions, which is useful for many aspects of designing, building, maintaining, and operating data centers. Infrastructure teams can see the big picture of how their software stacks are being used, aggregated across every application. This helps identify performance-critical components, create representative benchmark suites, and prioritize performance efforts.

    At the same time, an application team can use GWP as the first stop for the application's profiles. As an always-on profiling service, GWP collects a representative sample of the application's running instances over time. Application developers are often surprised by their application's profiles when browsing GWP results. For example, Google's speech-recognition team quickly optimized a previously unknown hot function that they couldn't have easily located without the aggregated GWP results. Application teams also use GWP profiles to design, evaluate, and calibrate their load tests.

Finding the hottest shared code. Shared code is remarkably abundant. Profiling each program independently might not identify hot shared code if it's not hot in any single application, but GWP can identify routines that don't account for a significant portion
36575

Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt, Google
0272-1732/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

As cloud-based computing grows in pervasiveness and scale, understanding datacenter applications' performance and utilization characteristics is critically important, because even minor performance improvements translate into huge cost savings. Traditional performance analysis, which typically needs to isolate benchmarks, can be too complicated or even impossible with modern datacenter applications. It's easier and more representative to monitor datacenter applications running on live traffic. However, application owners won't tolerate latency degradations of more than a few percent, so these tools must be nonintrusive and have minimal overhead. As with all profiling tools, observer distortion must be minimized to enable meaningful analysis. (For additional information on related techniques, see the "Profiling: From single systems to data centers" sidebar.)

Sampling-based tools can bring overhead and distortion to acceptable levels, so they're uniquely qualified for performance monitoring in the data center. Traditionally, sampling tools run on a single machine, monitoring specific processes or the system as a whole. Profiling can begin on demand to analyze a performance problem, or it can run continuously.1

Google-Wide Profiling can be theoretically viewed as an extension of the Digital Continuous Profiling Infrastructure (DCPI)1 to data centers. GWP is a continuous profiling infrastructure; it samples across machines in multiple data centers and collects various events (such as stack traces, hardware events, lock contention profiles, heap profiles, and kernel events), allowing cross-correlation with job scheduling data, application-specific data, and other information from the data centers.

GWP collects daily profiles from several thousand applications running on thousands of servers, and the compressed profile database grows by several Gbytes every day.

DATACENTER COMPUTING

Sidebar: Profiling: From single systems to data centers

From gprof1 to Intel VTune (http://software.intel.com/en-us/intel-vtune), profiling has long been standard in the development process. Most profiling tools focus on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important.

Continuous, always-on profiling became more important with Morph and DCPI for Digital UNIX. Morph uses low-overhead, system-wide sampling (approximately 0.3 percent) and binary rewriting to continuously evolve programs to adapt to their host architectures.2 DCPI gathers much more robust profiles, such as precise stall times and causes, and focuses on reporting information to users.3

OProfile, a DCPI-inspired tool, collects and reports data much in the same way for a plethora of architectures, though it offers less sophisticated analysis. GWP uses OProfile (http://oprofile.sourceforge.net) as a profile source. High-performance computing (HPC) shares many profiling challenges with cloud computing, because profilers must gather representative samples with minimal overhead over thousands of nodes. Cloud computing has additional challenges: vastly disparate applications, workloads, and machine configurations. As a result, different profiling strategies exist for HPC and cloud computing. HPCToolkit uses lightweight sampling to collect call stacks from optimized programs and compares profiles to identify scaling issues as parallelism increases.4 Open|SpeedShop (www.openspeedshop.org) provides similar functionality. Similarly, Upshot5 and Jumpshot6 can analyze traces (such as MPI calls) from parallel programs but aren't suitable for continuous profiling.

References
1. S.L. Graham, P.B. Kessler, and M.K. McKusick, "Gprof: A Call Graph Execution Profiler," Proc. Symp. Compiler Construction (CC 82), ACM Press, 1982, pp. 120-126.
2. X. Zhang et al., "System Support for Automatic Profiling and Optimization," Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 15-26.
3. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
4. N.R. Tallent et al., "Diagnosing Performance Bottlenecks in Emerging Petascale Applications," Proc. Conf. High Performance Computing Networking, Storage and Analysis (SC 09), ACM Press, 2009, no. 51.
5. V. Herrarte and E. Lusk, Studying Parallel Program Behavior with Upshot, tech. report, Argonne National Laboratory, 1991.
6. O. Zaki et al., "Toward Scalable Performance Visualization with Jumpshot," Int'l J. High Performance Computing Applications, vol. 13, no. 3, 1999, pp. 277-288.

Profiling at this scale presents significant challenges that don't exist for a single machine. Verifying that the profiles are correct is important and challenging because the workloads are dynamic. Managing profiling overhead becomes far more important as well, as any unnecessary profiling overhead can cost millions of dollars in additional resources. Finally, making the profile data universally accessible is an additional challenge. GWP is also a cloud application, with its own scalability and performance issues.

With this volume of data, we can answer typical performance questions about datacenter applications, including the following:

- What are the hottest processes, routines, or code regions?
- How does performance differ across software versions?
- Which locks are most contended?
- Which processes are memory hogs?
- Does a particular memory allocation scheme benefit a particular class of applications?
- What is the cycles per instruction (CPI) for applications across platforms?

Additionally, we can derive higher-level data to answer more complex but interesting questions, such as which compilers were used for applications in the fleet, whether there are more 32-bit or 64-bit applications running, and how much utilization is being lost by suboptimal job scheduling.

Infrastructure
Figure 1 provides an overview of the entire GWP system.

Collector
GWP samples in two dimensions. At any moment, profiling occurs only on a small subset of all machines in the fleet, and event-based sampling is used at the machine level.
Sampling in only one dimension would be unsatisfactory; if event-based profiling were active on every machine all the time, at a normal event-sampling rate, we would be using too many resources across the fleet. Alternatively, if the event-sampling rate is too low, profiles become too sparse to drill down to the individual machine level. For each event type, we choose a sampling rate high enough to provide meaningful machine-level data while still minimizing the distortion caused by the profiling on critical applications. The system has been actively profiling nearly all machines at Google for several years with only rare complaints of system interference.

Figure 1. An overview of the Google-Wide Profiling (GWP) infrastructure. The whole system consists of collector, symbolizer, profile database, Web server, and other components.

A central machine database manages all machines in the fleet and lists every machine's name and basic hardware characteristics. The GWP profile collector periodically gets a list of all machines from that database and selects a random sample of machines from that pool. The collector then remotely activates profiling on the selected machines and retrieves the results. It retrieves different types of sampled profiles sequentially or concurrently, depending on the machine and event type. For example, the collector might gather hardware performance counters for several seconds each, then move on to profiling for lock contention or memory allocation. It takes a few minutes to gather profiles for a specific machine.

For robustness, the GWP collector is a distributed service, which helps improve availability and reduce additional variation from the collector itself. To minimize distortion on the machines and the services running on them, the collector monitors error conditions and ceases profiling if the failure rate reaches a predefined threshold. Aside from the collector, we monitor all other GWP components to ensure an always-on service to users.

On top of the two-dimensional sampling approach, we apply several techniques to further reduce the overhead. First, we measure the event-based profiling overhead on a set of benchmark applications and then conservatively set the maximum rates to ensure the overhead is always less than a few percent. Second, we don't collect whole call stacks for the machine-wide profiles to avoid the high overhead associated with unwinding (but we collect call stacks for most server profiles at lower sampling frequencies). Finally, we save the profile and metadata in their raw format and perform symbolization on a separate set of machines. As a result, the aggregated profiling overhead is negligible (less than 0.01 percent). At the same time, the derived profiles are still meaningful, as we show in the "Reliability analysis" section.

Profiles and profiling interfaces
GWP collects two categories of profiles: whole-machine and per-process. Whole-machine profiles capture all activities happening on the machine, including user applications, the kernel, kernel modules, daemons, and other background jobs. The whole-machine profiles include hardware performance monitoring (HPM) event profiles, kernel event traces, and power measurements.

Users without root access cannot directly invoke most of the whole-machine profiling systems, so we deploy lightweight daemons on every machine to let remote users (such as GWP collectors) access those profiles. The daemons act as gatekeepers to control access, enforce sampling rate limits, and collect system variables that must be synchronized with the profiles.
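The two-dimensional sampling scheme can be sketched in a few lines. This is an illustrative sketch, not GWP's implementation; the event list, fleet fraction, and scheduling policy are invented for the example:

```python
import random

# Hypothetical profile types; the real GWP event set and rates are
# described in the text, not encoded here.
EVENTS = ["cycles", "instructions", "l2_misses", "lock_contention"]

def plan_collection(machines, fraction=0.01, seed=None):
    """Pick a small random subset of the fleet (first sampling dimension)
    and schedule each profile type on it in turn (second dimension)."""
    rng = random.Random(seed)
    subset_size = max(1, int(len(machines) * fraction))
    subset = rng.sample(machines, subset_size)
    # Rotate through the event types sequentially for each chosen machine.
    return [(m, e) for m in subset for e in EVENTS]
```

With a 1,000-machine pool and fraction=0.01, the plan covers 10 machines and 40 (machine, event) collection slots; event-level rate limiting would then be enforced by the per-machine daemons.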
We use OProfile (http://oprofile.sourceforge.net) to collect HPM event profiles. OProfile is a system-wide profiler that uses HPM to generate event-based samples for all running binaries at low overhead. To hide the heterogeneity of events between architectures, we define some generic HPM events on top of the platform-specific events, using an approach similar to PAPI.2 The most commonly used generic events are CPU cycles, retired instructions, L1 and L2 cache misses, and branch mispredictions. We also provide access to some architecture-specific events. Although the aggregated profiles for those events are biased to specific architectures, they provide useful information for machine-specific scenarios.

In addition to whole-machine profiles, we collect various types of profiles from most applications running on a machine using the Google Performance Tools (http://code.google.com/p/google-perftools). Most applications include a common library that enables process-wide, stacktrace-attributed profiling mechanisms for heap allocation, lock contention, wall time and CPU time, and other performance metrics. The common library includes a simple HTTP server linked with handlers for each type of profiler. A handler accepts requests from remote users, activates profiling (if it's not already active), and then sends the profile data back. The GWP collector learns from the cluster-wide job management system which applications are running on a machine and on which port each can be contacted for remote profiling invocation. Some programs lack remote profiling support, and the per-process profiling doesn't capture them. However, these are few; comparison with system-wide profiles shows that remote per-process profiling captures the vast majority of Google's programs. Most programs' users have found profiling useful or unobtrusive enough to leave it enabled.

Together with profiles, GWP collects other information about the target machine and applications. Some of the extra information is needed to postprocess the collected profiles, such as a unique identifier for each running binary that can be correlated across machines with unstripped versions for offline symbolization. The rest is mainly used to tag the profiles so that we can later correlate the profiles with job, machine, or datacenter attributes.

Symbolization and binary storage
After collection, the Google File System (GFS) stores the profiles.3 To provide meaningful information, the profiles must correlate to source code. However, to save network bandwidth and disk space, applications are usually deployed into data centers without any debug or symbolic information, which can make source correlation impossible. Furthermore, several applications, such as Java and QEMU, dynamically generate and execute code. The code is not available offline and can therefore no longer be symbolized. The symbolizer must also symbolize operating system kernels and kernel loadable modules.

Therefore, the symbolization process becomes surprisingly complicated, although it's usually trivial for single-machine profiling. Various strategies exist to obtain binaries with debug information. For example, we could try to recompile all sampled applications at specific source milestones. However, that is too resource-intensive and sometimes impossible for applications whose source isn't readily available. An alternative is to persistently store binaries that contain debug information before they're stripped. Currently, GWP stores unstripped binaries in a global repository, which other services use to symbolize stack traces for automated failure reporting. Since the binaries are quite large and many unique binaries exist, symbolization for a single day of profiles would take weeks if run sequentially. To reduce the result latency, we distribute symbolization across a few hundred machines using MapReduce.4

Profile storage
Over the past years, GWP has amassed several terabytes of historical performance data. GFS archives the entire performance logs and corresponding binaries. To make the data useful and accessible, we load the samples into a read-only dimensional database that is distributed across hundreds of machines. That service is accessible to all users for ad hoc queries and to systems for automated analyses.
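Conceptually, the symbolization step above maps each sampled program counter to a symbol from the matching unstripped binary. A minimal sketch, assuming a symbol table of (start_address, name) pairs sorted by address (the format is hypothetical, not GWP's):

```python
import bisect

def symbolize(samples, symbol_table):
    """Attribute raw PC samples (dict: address -> count) to function names.
    symbol_table is a list of (start_address, name) sorted by start address,
    as might be extracted from an unstripped binary."""
    starts = [addr for addr, _ in symbol_table]
    out = {}
    for pc, count in samples.items():
        # Find the last symbol whose start address is <= pc.
        i = bisect.bisect_right(starts, pc) - 1
        name = symbol_table[i][1] if i >= 0 else "<unknown>"
        out[name] = out.get(name, 0) + count
    return out
```

In a MapReduce setting, each map task would run this over one binary's samples, which is what makes the day-long workload embarrassingly parallel.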
The database supports a subset of SQL-like semantics. Although the dimensional database is well suited to perform queries that aggregate over the large data set, some individual queries can take tens of seconds to complete. Fortunately, most queries are seen frequently, so the profile server uses aggressive caching to hide the database latency.

User interfaces
For most users, GWP deploys a webserver to provide a user interface on top of the profile database. This makes it easy to access profile data and construct ad hoc queries for the traditional use of application profiles (with additional freedom to filter, group, and aggregate profiles differently). Several visual interfaces retrieve information from the profile database, and all are navigable from a web browser.

Query view. The primary interface (see Figure 2) displays the result entries, such as functions or executables, that match the desired query parameters. This page supplies links that let users refine the query to more specific data. For example, the user can restrict the query to only report samples for a specific executable collected within a desired time period. Additionally, the user can modify or refine any of the parameters to the current query to create a custom profile view. The GWP homepage has links to display the top results, Google-wide, for each performance metric.

Figure 2. An example query view: an application-level profile (a) and a function-level profile (b).

Call graph view. For most server profile samples, the profilers collect full call stacks with each sample. Call stacks are aggregated to produce complete dynamic call graphs for a given profile. Figure 3 shows an example call graph. Each node displays the function name and its percentage of samples, and the nodes are shaded based on this percentage. The call graph is also displayed through the web browser, via a Graphviz plug-in.

Source annotation. The query and call graph views are useful in directing users to specific functions of interest. From there, GWP provides a source annotation view that presents the original source file with a header describing overall profile information about the file and a histogram bar showing the relative hotness of each source file line. Because different versions of each file can exist in source repositories and branches, we retrieve a hash signature from the repository for each source file and aggregate samples on files with identical signatures.

Profile data API. In addition to the webserver, we offer a data-access API to read profiles directly from the database. It's more suitable for automated analyses that must process a large amount of profile data (such as reliability studies) offline.
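The filter/group/aggregate pattern these interfaces expose can be sketched as follows; the record layout and dimension names are invented for illustration:

```python
def query(records, group_by, **filters):
    """Aggregate sample counts grouped by the chosen keys, after
    restricting on any other dimensions. Each record is a dict of
    dimension values plus a numeric 'samples' count."""
    result = {}
    for rec in records:
        if all(rec.get(k) == v for k, v in filters.items()):
            key = tuple(rec[k] for k in group_by)
            result[key] = result.get(key, 0) + rec["samples"]
    # Hottest-first ordering, as the result page displays entries.
    return sorted(result.items(), key=lambda kv: -kv[1])
```

For example, grouping by function while filtering on a single executable mirrors the drill-down links in the query view.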
We store both raw profiles and symbolized profiles in ProtocolBuffer formats (http://code.google.com/apis/protocolbuffers). Advanced users can access and reprocess them using their preferred programming language.

Application-specific profiling
Although the default sampling rate is high enough to derive top-level profiles with high confidence, GWP might not collect enough samples for applications that consume relatively few cycles Google-wide. Increasing the overall sampling rate to cover those profiles is too expensive because they're usually sparse. Therefore, we provide an extension to GWP for application-specific profiling on the cloud. The machine pool for application-specific profiling is usually much smaller than GWP's, so we can achieve a high sampling rate on those machines for the specific application. Several application teams at Google use application-specific profiling to continuously monitor their applications running on the fleet.

Application-specific profiling is generic and can target any specific set of machines. For example, we can use it to profile a set of machines deployed with the newest kernel version. We can also limit the profiling duration to a small time period, such as the application's running time. It's useful for batch jobs running on data centers, such as MapReduce, because it facilitates collecting, aggregating, and exploring profiles collected from hundreds or thousands of their workers.

Reliability analysis
To conduct continuous profiling on datacenter machines serving real traffic, extremely low overhead is paramount, so we sample in both time and machine dimensions. Sampling introduces variation, so we must measure and understand how sampling affects the profiles' quality. But the nature of datacenter workloads makes this difficult; their behavior is continually changing. There's no direct way to measure the datacenter applications' profiles' representativeness. Instead, we use two indirect methods to evaluate their soundness. First, we study the stability of the aggregated profiles themselves using several different metrics. Second, we correlate the profiles with performance data from other sources to cross-validate both.

Figure 3. An example dynamic call graph. Function names are intentionally blurred.

Stability of profiles
We use a single metric, entropy, to measure a given profile's variation. In short, entropy is a measure of the uncertainty associated with a random variable, which in this case is profile samples. The entropy H of a profile is defined as

    H(W) = -sum_{i=1..n} p(x_i) * log(p(x_i))

where n is the total number of entries in the profile and p(x_i) is the fraction of profile samples on the entry x_i.5

In general, a high entropy implies a flat profile with many samples. A low entropy usually results from a small number of samples or from most samples being concentrated in a few entries. We're not concerned with the entropy value itself; because entropy is like a signature of a profile, measuring its inherent variation, it should be stable between representative profiles.

Entropy doesn't account for differences between entry names. For example, a function profile with x percent on foo and y percent on bar has the same entropy as a profile with y percent on foo and x percent on bar. So, when we need to identify changes on the same entries between profiles, we calculate the Manhattan distance of two profiles by adding the absolute percentage differences between the top k entries, defined as

    M(X, Y) = sum_{i=1..k} |p_x(x_i) - p_y(x_i)|

where X and Y are two profiles, k is the number of top entries to count, and p_y(x_i) is 0 when x_i is not in Y. Essentially, the Manhattan distance is a simplified version of relative entropy between two profiles.

Profiles' entropy. First, we compare the entropies of application-level profiles, where samples are broken down on individual applications. Figure 2a shows an example of such an application-level profile. Figure 4 shows daily application-level profiles' entropies for a series of dates, together with the total number of profile samples collected for each date. Unless specified, we used CPU cycles in the study, and our conclusions also apply to the other types. As the graph shows, the entropy of daily application-level profiles is stable between dates, and it usually falls into a small interval. The correlation between the number of samples and the profile's entropy is loose. Once the number of samples reaches some threshold, additional samples don't necessarily lead to a lower entropy, partly because GWP sometimes samples more machines than necessary for daily application-level profiles. This is because users frequently must drill down to specific profiles with additional filters on certain tags, such as application names, which are a small fraction of all profiles collected.

Figure 4. The number of samples and the entropy of daily application-level profiles. The primary y-axis (bars) is the total number of profile samples. The secondary y-axis (line) is the entropy of the daily application-level profile.

We can conduct a similar analysis on an application's function-level profile (for example, the application in Figure 2b). The result, shown in Figure 5a, is from an application whose workload is fairly stable when aggregating from many clients. Its entropy is actually more stable. It's interesting to analyze how entropy changes among machines for an application's function-level profiles. Unlike the aggregated profiles across machines, an application's per-machine profiles can vary greatly in terms of number of samples.
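Both stability metrics follow directly from their definitions. A minimal sketch, assuming profiles are maps from entry name to sample count and using the natural logarithm (the log base is not specified in the text):

```python
import math

def entropy(profile):
    """H(W) = -sum(p_i * log(p_i)) over the sample fractions of a profile,
    given as a dict mapping entry name -> sample count."""
    total = sum(profile.values())
    ps = [c / total for c in profile.values() if c > 0]
    return -sum(p * math.log(p) for p in ps)

def manhattan(x, y, k=10):
    """Sum of absolute differences in sample fraction over the top-k
    entries of profile x; entries missing from y contribute 0 on y's side."""
    tx, ty = sum(x.values()), sum(y.values())
    top = sorted(x, key=x.get, reverse=True)[:k]
    return sum(abs(x[e] / tx - y.get(e, 0) / ty) for e in top)
```

A perfectly flat profile with n entries yields the maximum entropy log(n), and two identical profiles have a Manhattan distance of 0.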
Figure 5. Function-level profiles. The number of samples and the entropy for a single application (a). The correlation between the number of samples and the entropy for all per-machine profiles (b).
Figure 5b plots the relationship between the number of samples per machine and the entropy of function-level profiles. As expected, when the total number of samples is small, the profile's entropy is also small (limited by the maximum possible uncertainty). But once the threshold is reached at the maximum number of samples, the entropy becomes stable. We can observe two clusters from the graph; some entropies are concentrated between 5.5 and 6, and the others fall between 4.5 and 5. The application's two behavioral states can explain the two clusters. We've seen various clustering patterns on different applications.

The Manhattan distance between profiles. We use the Manhattan distance to study the variation between profiles considering the changes on entry names, where a smaller distance implies less variation. Figure 6a illustrates the Manhattan distance between the daily application-level profiles for a series of dates. The results from the Manhattan distances for both application-level and function-level profiles are similar to the results with entropy.

In Figure 6b, we plot the Manhattan distance for several profile types, leading to two observations:

- In general, memory and thread profiles have smaller distances, and their variations appear less correlated with the other profiles.
- Server CPU time profiles correlate with HPM profiles of cycles and instructions in terms of variations, which could imply that those variations resulted naturally from external reasons, such as workload changes.

To further understand the correlation between the Manhattan distance and the number of samples, we randomly pick a subset of machines from a specific machine set and then compute the Manhattan distance of the selected subset's profile against the whole set's profile. We could use a power function's trend line to capture the change in the Manhattan distance over the number of samples. The trend line roughly approximates a square-root relationship between the distance and the number of samples,

    M(X) = C / sqrt(N(X))

where N(X) is the total number of samples in a profile and C is a constant that depends on the profile type.

Derived metrics. We can also indirectly evaluate the profiles' stability by computing some derived metrics from multiple profiles. For example, we can derive CPI from HPM profiles containing cycles and retired instructions. Figure 7 shows that the derived CPI is stable across dates. Not surprisingly, the daily aggregated profiles' CPI falls into a small interval, between 1.7 and 1.8, for those days.

Comparing with other sources
Beyond measuring the profiles' stability across dates, we also cross-validate the profiles with performance and utilization data from other Google sources. One example is the utilization data that the data center's monitoring system collects. Unlike GWP, the monitoring system collects data from all machines in the data center but at a coarser granularity, such as overall CPU utilization. Its CPU utilization data, in terms of core-seconds, matches the measurement from GWP's CPU cycles profile with the following formula:

    CoreSeconds = Cycles * SamplingRate_machine * SamplingPeriod / CPUFrequency_average

Profile uses
In each profile, GWP records the samples of interesting events and a vector of associated information. GWP collects roughly a dozen events, such as CPU cycles, retired instructions, L1 and L2 cache misses, branch mispredictions, heap memory allocations, and lock contention time. The sample definition varies depending on the event type: it can be CPU cycles or cache misses, bytes allocated, or the sampled thread's locking time. Note that the sample must be numeric and capable of aggregation. The associated vector contains information such as application name, function name, platform, compiler version, image name, data center, kernel information, build revision, and builder's name.
Figure 6. The Manhattan distance between daily application-level profiles for various profile types (a). The correlation between the number of samples and the Manhattan distance of profiles (b).
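The square-root trend line used for subset-versus-whole comparisons can be fitted with a one-parameter least-squares estimate; the closed form below follows from minimizing the squared error of M = C / sqrt(N), and the data in the test is synthetic:

```python
import math

def fit_sqrt_trend(points):
    """Least-squares fit of C in M = C / sqrt(N), given (N, M) points.
    Minimizing sum((M_i - C/sqrt(N_i))^2) over C gives
    C = sum(M_i / sqrt(N_i)) / sum(1 / N_i)."""
    num = sum(m / math.sqrt(n) for n, m in points)
    den = sum(1.0 / n for n, _ in points)
    return num / den
```

Applied to measured (samples, distance) pairs, the fitted C quantifies how quickly a profile converges to the whole set's profile as more machines are sampled.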
  • 15. [3B2-14] mmi2010040065.3d 30/7/010 19:43 Page 75 24 2.0 22 1.8 20 1.6 Number of samples (millions) Cycles per instruction (CPI) 18 1.4 16 14 1.2 12 1.0 10 0.8 8 0.6 6 0.4 4 2 0.2 0 0 Random dates Figure 7. The correlation between the number of samples and derived cycles per instruction (CPI). platform, compiler version, image name, and functions, which is useful for many as- data center, kernel information, build revi- pects of designing, building, maintaining, sion, and builder’s name. Assuming that and operating data centers. Infrastructure the vector contains m elements, we can rep- teams can see the big picture of how their resent a record GWP collected as a tuple software stacks are being used, aggregated event, sample counter, m-dimension across every application. This helps identify vector. performance-critical components, create rep- When aggregating, GWP lets users resentative benchmark suites, and prioritize choose k keys from the m dimensions and performance efforts. groups the samples by the keys. Basically, it At the same time, an application team can filters the samples by imposing one or use GWP as the first stop for the applica- more restrictions on the rest of the dimen- tion’s profiles. As an always-on profiling ser- sions (m—k) and then projects the samples vice, GWP collects a representative sample of into k key dimensions. GWP finally displays the application’s running instances over time. the sorted results to users, delivering answers Application developers often are surprised to various performance queries with high by application’s profiles when browsing confidence. Although not every query GWP results. 
For example, Google’s speech- makes sense in practice, even a small subset recognition team quickly optimized a previ- of them are demonstrably informative in ously unknown hot function that they identifying performance issues and providing couldn’t have easily located without the insights into computing resources in the aggregated GWP results. Application teams cloud. also use GWP profiles to design, evaluate, and calibrate their load tests. Cloud applications’ performance GWP profiles provide performance Finding the hottest shared code. Shared code insights for cloud applications. Users can is remarkably abundant. Profiling each pro- see how cloud applications are actually con- gram independently might not identify hot suming machine resources and how the pic- shared code if it’s not hot in any single ap- ture evolves over time. For example, Figure 2 plication, but GWP can identify routines shows cycles distributed over top executables that don’t account for a significant portion .................................................................... JULY/AUGUST 2010 75
DATACENTER COMPUTING

Table 1. Platform affinity example.

Random assignment of instructions (CPI in brackets)

              Platform 1   Platform 2
NumCrunch     100 (1)      100 (1)
MemBench      100 (1)      100 (2)
Total cycles  200          300

Optimal assignment of instructions

              Platform 1   Platform 2
NumCrunch     0 (1)        200 (1)
MemBench      200 (1)      0 (2)
Total cycles  200          200

of any single application but consume the most cycles overall. For example, the GWP profiles revealed that the zlib library (www.zlib.net) accounted for nearly 5 percent of all CPU cycles consumed. That motivated an effort to optimize zlib routines and evaluate compression alternatives. In fact, some users have used GWP numbers to calculate the estimated savings of performance tuning efforts on shared functions. Not surprisingly, given the Google fleet's scale, even a single-percent improvement on a core routine could potentially save significant money per year. As a result, a new informal metric, "dollar amount per performance change," has become popular among Google engineers. We're considering providing a related metric, "dollar per source line," in annotated source views.

At the same time, some users have used GWP profiles as a source of coverage information to assess the feasibility of deprecating library functions and to uncover the remaining users of functions slated for deprecation. Because of the profiles' dynamic nature, users might miss less common clients, but the biggest, most important callers are easy to find.

Evaluating hardware features. The low-level information GWP provides about how CPU cycles (and other machine resources) are spent is also used for early evaluation of new hardware features that datacenter operators might want to introduce. One interesting example has been to evaluate whether it would be beneficial to use a special coprocessor to accelerate floating-point computation, by looking at its percentage of all Google's computations. As another example, GWP profiles can identify the applications running on old hardware configurations and evaluate whether those configurations should be retired for efficiency.

Optimizing for application affinities

Some applications run better on a particular hardware platform because of sensitivity to architectural details, such as processor microarchitecture or cache size. It's generally very hard or impossible to predict which application will fare best on which platform. Instead, we measure an efficiency metric, CPI, for each application and platform combination. We can then improve job scheduling so that applications are scheduled on the platforms where they do best, subject to availability. The example in Table 1 shows how the total number of cycles needed to run a fixed number of instructions on a fixed machine capacity drops from 500 to 400 with preferential scheduling. Specifically, although the application NumCrunch runs just as well on Platform1 as on Platform2, the application MemBench does poorly on Platform2 because of that platform's smaller L2 cache. Thus, the scheduler should preferentially assign MemBench to Platform1.

The overall optimization process has several steps. First, we derive cycle and instruction samples from GWP profiles. Then, we compute an improved assignment table by moving instructions away from the application-platform combinations with the worst relative efficiency. We use cycle and instruction samples over a fixed period of time, aggregated per job and platform. We then compute CPI and normalize it by clock rate. We can formulate finding the optimal assignment as a linear programming problem. The one unknown is Load_ij, the number of instruction samples of application j on platform i. The constants are:

- CPI_ij, the measured CPI of application j on platform i,
- TotalLoad_j, the total measured number of instruction samples of application j, and
- Capacity_i, the total capacity for platform i, measured as the total number of cycle samples for platform i.

IEEE MICRO
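The arithmetic behind Table 1 can be checked with a tiny sketch. The CPI values and instruction counts are taken from the table; the helper function and variable names are our own:

```python
# CPI measured for each (application, platform) pair, as in Table 1.
CPI = {('NumCrunch', 'Platform1'): 1, ('NumCrunch', 'Platform2'): 1,
       ('MemBench',  'Platform1'): 1, ('MemBench',  'Platform2'): 2}

def total_cycles(assignment):
    """Cycles = instructions x CPI, summed over all placements."""
    return sum(n * CPI[pair] for pair, n in assignment.items())

# Random assignment: each application split evenly across platforms.
random_assignment = {('NumCrunch', 'Platform1'): 100, ('NumCrunch', 'Platform2'): 100,
                     ('MemBench',  'Platform1'): 100, ('MemBench',  'Platform2'): 100}

# Optimal assignment: MemBench avoids Platform2's smaller L2 cache.
optimal_assignment = {('NumCrunch', 'Platform1'): 0,   ('NumCrunch', 'Platform2'): 200,
                      ('MemBench',  'Platform1'): 200, ('MemBench',  'Platform2'): 0}
```

Here `total_cycles(random_assignment)` yields 500 and `total_cycles(optimal_assignment)` yields 400, matching the drop the text attributes to preferential scheduling.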
The equation is:

  minimize   Σ_{i,j} CPI_ij × Load_ij

  where      Σ_i Load_ij = TotalLoad_j   for each application j,

  and        Σ_j CPI_ij × Load_ij ≤ Capacity_i   for each platform i.

We use a simulated annealing solver that approximates the optimal solution in seconds for workloads of around 100 jobs running on thousands of machines of four different platforms over one month. Although application developers had already mapped major applications to their best platforms through manual assignment, we've measured 10 to 15 percent potential improvement in most cases where many jobs run on multiple platforms. Similarly, users can use GWP data to identify how to colocate multiple applications on a single machine to achieve the best throughput.6

Datacenter performance monitoring

GWP users can also use GWP queries with computing-resource-related keys, such as data center, platform, compiler, or builder, for auditing purposes. For example, when rolling out a new compiler, users can inspect which compiler versions the running applications were actually built with. Users can easily measure such transitions from time-based profiles. Similarly, a user can measure how soon a new hardware platform becomes active or how quickly old ones are retired. This also applies to new versions of applications being rolled out. Grouping by data center, GWP displays how the applications, CPU cycles, and machine types are distributed across different locations.

Beyond providing profile snapshots, GWP can monitor the changes between chosen profiles from two queries. The two profiles should be similar in that they must have identical keys and events, but they differ in one or more other dimensions. For applications, GWP usually focuses on monitoring function profiles. This is useful for two reasons:

- performance optimization normally starts at the function level, and
- function samples can reconstruct the entire call graph, showing users how the CPU cycles (and other events) are distributed in the program.

A dramatic change to hot functions in the daily profiles can trigger a finer-grained comparison, which eventually points blame to a source code revision number, compiler, or data center.

Feedback-directed optimization

Sampled profiles can also be used for feedback-directed compiler optimization (FDO), as outlined in work by Chen et al.7 GWP collects such sampled profiles and offers a mechanism to extract the profile for a particular binary in a format that the compiler understands. This profile will be of higher quality than any profile derived from test inputs because it was derived from runs on live data. Furthermore, as long as developers make no changes to critical program sections, we can use an aged profile to improve a freshly released binary's performance. Many Web companies have release cycles of two weeks or less, so this approach works well in practice.

Similar to load-test calibration, users also use GWP for quality assurance of profiles generated from static benchmarks. Benchmarks can represent some codes well but not others, and users use GWP to identify which codes are ill-suited for FDO compilation.

Finally, users use GWP to estimate the quality of HPM events and microarchitectural features, such as the cache latencies of various incarnations of processors with the same instruction set architecture (ISA). For example, if the same application runs on two platforms, we can compare the HPM counters and identify the relevant differences.

Besides adding more types of performance events to collect, we're now exploring further directions for using GWP profiles. These include not only user-interface enhancements but also advanced data-mining techniques to detect interesting patterns in profiles. It's also interesting to mash up GWP profiles with performance data from other sources to address complex performance problems in datacenter applications.
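The kind of day-over-day function-profile comparison used as a monitoring trigger can be sketched as follows; the flagging rule and the 50 percent threshold are our own illustrative choices, not GWP's:

```python
def changed_functions(before, after, threshold=0.5):
    """Flag functions whose share of total samples moved by more than
    `threshold` (relative to the larger share) between two profiles.

    `before` and `after` map function names to sample counts; the two
    profiles are assumed to use identical keys and the same event.
    """
    total_b, total_a = sum(before.values()), sum(after.values())
    flagged = []
    for fn in set(before) | set(after):
        share_b = before.get(fn, 0) / total_b
        share_a = after.get(fn, 0) / total_a
        larger = max(share_b, share_a)
        if larger and abs(share_a - share_b) / larger > threshold:
            flagged.append((fn, share_b, share_a))
    # Biggest movers first, as candidates for finer-grained comparison.
    return sorted(flagged, key=lambda t: abs(t[2] - t[1]), reverse=True)
```

A flagged function would then prompt the finer-grained comparison the text describes, narrowing blame to a source revision, compiler, or data center.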
MICRO