benchmarks, and did not model the diverse mix of jobs running on Yahoo!'s production cluster well. GridMix3 [5] is the latest enhancement over previous versions and accepts a timestamped stream (trace) of job descriptions. For each job in the trace, the GridMix3 client submits a corresponding synthetic job to the target cluster at the same rate as in the original trace, and thereby tries to model the diverse mix of jobs running in Hadoop's environment.

We propose a new way to benchmark the performance of a Hadoop cluster: learning the characteristic jobs being run on it. We use unsupervised learning to cluster the real production workload and to determine the centroids and densities of the resulting job clusters. The centroid jobs reflect the representative jobs in the real workload. We use the K-Means clustering algorithm for this purpose. Running these representative jobs and computing a weighted sum of their performance, where the weights correspond to the sizes of the job clusters, gives a measure of Hadoop cluster performance that is within a small margin of error of the measure computed by GridMix3.
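Written out explicitly (the symbols n_k and t_k are our own shorthand, not the paper's): if the workload forms K clusters of sizes n_1, ..., n_K and the centroid job of cluster k achieves performance t_k (for example, its runtime) on the target cluster, the weighted measure is

    P = \sum_{k=1}^{K} \frac{n_k}{\sum_{j=1}^{K} n_j} \, t_k

so that each representative job contributes in proportion to how common its cluster is in the real workload.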
3. Hadoop Job Features

Our input data set comprises metrics generated by the Hadoop MapReduce framework and collected by the JobTracker while a job is executing. After a job finishes, these metrics are stored on the Hadoop cluster in job history files in the form of per-job and per-task counters. Job counters keep track of application progress in both the map and reduce stages of processing. By default, the Hadoop MapReduce framework emits a number of standard counters, such as Map input records and Map output records, which we use as features in our dataset. Please see Table 1 and Table 2 for more information on the features.

Table 1. Task History Statistics

    Metric                  Description
    HDFS Bytes              Bytes read/written to HDFS
    File Bytes              Bytes read/written to local disk
    Combine Records Ratio   Ratio of Combiner output records to input records
    Shuffle Bytes           Number of bytes shuffled after the map phase
                            (reduce tasks only)

Table 2. Job History Statistics

    Metric                      Description
    Number of Maps              Number of tasks in the Map phase
    Number of Reduces           Number of tasks in the Reduce phase
    Input format                Format of the input file that is parsed to
                                generate the key-value pairs (categorical)
    Output format               Format of the output file (categorical)
    Type of output compression  Compression used for the output of the
                                application (categorical)
    Map Phase slots             Number of map slots occupied by each map
                                task in the Hadoop cluster
    Reduce Phase slots          Number of reduce slots occupied by each
                                reduce task in the Hadoop cluster
The parameters used to measure the performance of a job are divided into two levels: job level and task level. A MapReduce job usually splits the input data set into independent chunks that are processed by the map and reduce tasks in parallel. For the task-level parameters, we use statistical descriptors such as the mean, standard deviation, and range of the counters over all tasks in the map and reduce phases respectively. We also include job-specific configuration features, such as the type of data compression used and the formats of the input and output files, as job features from our input data set.
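For illustration only (the map_tasks data frame and its column names are our assumptions, not taken from the paper), collapsing per-task counters into per-job descriptors in R might look like:

    ## Per-job mean, standard deviation, and range of one task-level counter;
    ## map_tasks is assumed to have one row per map task with columns
    ## job_id and hdfs_bytes_read.
    job_mean <- aggregate(hdfs_bytes_read ~ job_id, data = map_tasks, FUN = mean)
    job_sd   <- aggregate(hdfs_bytes_read ~ job_id, data = map_tasks, FUN = sd)
    job_rng  <- aggregate(hdfs_bytes_read ~ job_id, data = map_tasks,
                          FUN = function(x) diff(range(x)))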
We use a non-correlated feature set from the counters, since we did not want to give increased weight to any of the features. We also did not use features that depend on the Hadoop cluster hardware configuration, such as the time taken to execute the job. Such cluster-specific features would differ when the same MapReduce job is executed on different Hadoop clusters; considering absolute CPU or wall time as a job feature would therefore not allow us to correlate jobs executed on different clusters.

4. Clustering Methodology

We used the statistical package R [7] for clustering. R is an open source language and environment for statistical computing and graphics.

We implemented the traditional K-means algorithm for our clustering. We estimated the K in K-means using the within-groups sum of squares. To find the initial seeds, we randomly picked sqrt(n) jobs from the entire collection and ran Hierarchical Agglomerative Clustering on them. We then used these results as the initial seeds for the K-Means algorithm.
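As a rough sketch of this pipeline in R (not the authors' code; the jobs matrix of standardized job features, the range of K values, and the use of R's default linkage and nstart settings are our assumptions):

    ## jobs is assumed to be a numeric matrix of standardized job features,
    ## one row per job.
    set.seed(1)
    n <- nrow(jobs)

    ## Estimate K: within-groups sum of squares over a range of K.
    wss <- sapply(1:15, function(k) kmeans(jobs, centers = k, nstart = 5)$tot.withinss)
    plot(1:15, wss, type = "b", xlab = "Number of clusters",
         ylab = "Within-groups sum of squares")

    ## Initial seeds: hierarchical agglomerative clustering of sqrt(n) random jobs.
    k <- 8                                      # chosen from the bend in the plot
    samp <- jobs[sample(n, ceiling(sqrt(n))), ]
    hc <- hclust(dist(samp))                    # Euclidean distance by default
    seeds <- do.call(rbind,
                     lapply(split(as.data.frame(samp), cutree(hc, k)), colMeans))

    ## K-means with the hierarchical seeds.
    fit <- kmeans(jobs, centers = seeds)
    fit$centers          # centroid (characteristic) jobs
    table(fit$cluster)   # cluster sizes (densities)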
4.1. Data Collection
The job metrics we collected spanned 24 hours on one of Yahoo!'s production Hadoop clusters and comprised 11686 jobs. We did not take into account jobs that failed on the Hadoop cluster. By the nature of production Hadoop jobs at Yahoo!, these jobs are executed repeatedly with a specific periodicity, on different data partitions, as they become available. We parsed the JobTracker logs to obtain the feature vector set mentioned in Tables 1 and 2, using a modified version of Hadoop Vaidya [1]. Vaidya performs a post-execution analysis of MapReduce jobs by parsing and collecting execution statistics from the job history and job configuration files. We generated our initial input using Vaidya, before normalizing it for clustering.
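For concreteness (the file name and format below are hypothetical; the actual output of the modified Vaidya run is not specified here), the per-job feature vectors could be loaded into R as:

    ## Hypothetical example: load the per-job feature table before normalization.
    raw <- read.csv("vaidya_job_features.csv", stringsAsFactors = TRUE)
    str(raw)   # inspect the numeric counters and categorical configuration features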
4.2. Pre-processing

Prior to clustering, we rescaled the variables for comparability: we standardized the data to have a mean of 0 and a standard deviation of 1. Since we use Euclidean distance to compute per-feature similarity between different jobs, the clusters would otherwise be influenced strongly by the magnitudes of the variables, and especially by outliers. Normalizing all features to the same mean and standard deviation removes this bias. Numeric variables were standardized and nominal attributes were converted into binary indicators. We made scatter plots and calculated covariances to check dependencies between the features, in order to remove heavily correlated variables that tend to artificially bias the clusters toward natural groupings of those variables. For example, we observed that the format of a job's input files was strongly related to the format of its output files.
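A sketch of this pre-processing in R (assuming the raw data frame loaded above, with numeric counters and factor-valued configuration features):

    ## Expand nominal attributes into binary indicator columns, then
    ## standardize every column to mean 0 and standard deviation 1.
    X <- model.matrix(~ . - 1, data = raw)   # factors -> 0/1 columns
    X <- scale(X)                            # mean 0, standard deviation 1

    ## Check dependencies between features and drop one of each heavily
    ## correlated pair (e.g. input format vs. output format).
    round(cor(X), 2)
    pairs(as.data.frame(X)[, 1:5])           # scatter plots of a few features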
4.3. Estimating number of clusters

The heuristic we used to estimate the number of clusters in our dataset is to take the number of clusters at which we see the largest drop in the within-groups sum of squares. We iterate over increasing numbers of clusters and observe the within-groups sum of squares; a plot of the within-groups sum of squares against the number of clusters extracted helped us determine the appropriate number of clusters. The plot is shown in Figure 1. We look for a bend in the plot. There is very little variation in the within-groups sum of squares after 8 clusters, which suggests there are at most 8 clusters in the data. For two arbitrarily chosen coordinates (i.e. features), the centroids of these clusters are shown in Figure 2.

Figure 1. Estimating the number of clusters: a plot of the within-groups sum of squares against the number of clusters.

4.4. K-Means Algorithm

We then estimated the initial seeds by Hierarchical Agglomerative Clustering and performed the K-means algorithm with the chosen seeds. We used Euclidean distance as the distance metric: the total distance between two jobs is the square root of the sum of the squared per-feature distances.
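In symbols (our notation, not the paper's): for two jobs with standardized feature vectors x and y of length p,

    d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}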
4.5. Results

We obtained 8 clusters from the K-Means clustering algorithm. Table 4 and Table 5 describe the centroids of these clusters. The task-level features listed are obtained by taking the mean of each feature metric over all the Map or Reduce tasks of these jobs. These centroids are the characteristic jobs running on the Hadoop cluster. Table 3 and Figure 3 show the densities of these clusters. Figure 2 shows the distances between the centers after they have been scaled to two dimensions; we used multidimensional scaling (MDS) to map our high-dimensional centers into two-dimensional vectors while preserving the relevant distances. These 8 clusters differ significantly in the number of map and reduce tasks and in the bytes being read, written, and processed on HDFS. Most of the jobs on the Hadoop cluster (90%) can be modeled as having close to 79 Map Tasks and 28 Reduce Tasks. A few jobs (approx. 0.3%) have as many as 2487 Map Tasks. Most jobs tend to have significantly fewer Reduce Tasks than Map Tasks. These centroid jobs, each run as many times as the size of its cluster, represent the jobs being run on the Hadoop cluster within a small margin of error.
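The two-dimensional projection used for Figure 2 can be sketched in R with classical MDS (assuming the fit object from the earlier clustering sketch):

    ## Classical multidimensional scaling of the pairwise distances between
    ## the eight cluster centers, plotted with cluster numbers as labels.
    xy <- cmdscale(dist(fit$centers), k = 2)
    plot(xy, xlab = "Coordinate 1", ylab = "Coordinate 2")
    text(xy, labels = seq_len(nrow(xy)), pos = 3)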
Table 4. Centroids of Job Clusters (Means of Features in Map Phase over all Map Tasks)

    Cluster  Number of  Map HDFS    Map HDFS       Map File    Map File       Map Input  Map Output
    Number   Maps       Bytes Read  Bytes Written  Bytes Read  Bytes Written  Records    Records
    1         456        63 MB      0.22 MB         80.8 MB     166 MB          214,116    312,037
    2         863       478.69 MB   84 B           721.5 MB    1403 MB          387,661    387,661
    3         572       100.5 MB    0.2 MB          71 MB       65.19 MB        936,015  1,600,945
    4         191        90 MB      85 B            25.78 MB    44.26 MB      1,040,112  1,946,530
    5        1080        86.6 MB    85 B            81.3 MB     81.24 MB        595,183    512,653
    6          79        44.82 MB   42 MB           22 MB       39.414 MB       334,144    813,425
    7        2487       122 MB      84 B           226 MB       319 MB          958,604  1,155,927
    8         316       169.6 MB    86 B           210 MB       434.25 MB       513,999    513,913
Table 5. Centroids of Job Clusters (Means of Features in Reduce Phase over all Reduce Tasks)

    Cluster  Number of  Reduce HDFS  Reduce HDFS    Reduce File  Reduce File    Reduce Input  Reduce Output
    Number   Reduces    Bytes Read   Bytes Written  Bytes Read   Bytes Written  Records       Records
    1         20        335 KB        9.64 GB       1.2886 GB    1.2 GB          16,330,242     9,815,480
    2         29          6 B         2.09 GB       5.421 GB     5.42 GB          2,395,949     2,395,955
    3         73        419 B       650.2 MB        390 MB       384.5 MB         5,815,554     3,226,467
    4         71        494 KB       65 MB           70.5 MB      70.5 MB         5,087,889     1,630,275
    5         62        330 B       586.6 MB        759 MB       759 MB          10,919,233     5,754,906
    6         28         57 KB       54.85 MB        17.7 MB      17.6 MB           505,995       235,897
    7         67.5       31.6 MB      2.04 GB        3.2 GB       3.2 GB         154,633,619    19,670,261
    8         14.5        6 B         1.9 GB         5.57 GB      5.57 GB         14,113,424    14,113,430
Table 3. Size of Job Clusters

    Cluster Number  Number of jobs in cluster
    1                 205
    2                  21
    3                 195
    4                1143
    5                  82
    6                9991
    7                  35
    8                  14

5. Comparison of Production Cluster Jobs with GridMix3 Jobs

GridMix3 attempts to emulate the workload of real jobs in the Hadoop cluster by generating a synthetic workload. Our objective in this study was to validate whether this synthetic workload generated by GridMix3 accurately represents the real workload it tries to emulate. For this purpose, we used a three-hour trace from the daily run of the production cluster; the trace consisted of 1203 jobs. We processed the same three-hour trace using GridMix3, which generated a synthetic workload that we executed in a controlled environment. We used only quantitative features in our analysis, since GridMix3 does not emulate categorical features such as the compression type. We parsed the job counter logs using Rumen [2], which processes job history logs to produce job and detailed task information in the form of a JSON file.

We obtained 5 clusters, with similar distributions and centers, in both the actual production jobs and the GridMix3 jobs. This indicates that GridMix3 models the actual production jobs effectively, and that our clustering study has been effective in understanding the grouping of these jobs.
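A sketch of how such a comparison could be set up in R (our reconstruction, not the authors' code; prod and gridmix are assumed to be standardized feature matrices built from the Rumen output, and the choice of nstart is ours):

    ## Cluster both traces with the same settings, then compare cluster sizes
    ## and the distance between the resulting centers.
    fit_prod <- kmeans(prod,    centers = 5, nstart = 25)
    fit_grid <- kmeans(gridmix, centers = 5, nstart = 25)

    table(fit_prod$cluster)   # cluster sizes in the production trace
    table(fit_grid$cluster)   # cluster sizes in the synthetic trace

    ## Distance from each production center to the nearest GridMix3 center.
    d <- as.matrix(dist(rbind(fit_prod$centers, fit_grid$centers)))[1:5, 6:10]
    apply(d, 1, min)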
Figure 2. The centers of each cluster scaled to two dimensions. Coordinates 1 and 2 are the two-dimensional coordinates of the cluster centroids after MDS; the number above each point is the cluster number.

Figure 3. The logarithm of the number of jobs in each cluster.
6. Summary and Future Work

We obtained the characteristics of the jobs running on Yahoo!'s production Hadoop clusters. We identified 8 groups of jobs, found their centroids to obtain our characteristic jobs, and used their densities to determine the weight that should be given to each representative job. In this way, we present a new methodology for developing a performance benchmark for Hadoop: instead of emulating all the jobs in the workload of a real Hadoop cluster, we emulate only the representative jobs, denoted by the centroids of the job clusters. We also performed a comparative analysis of actual production jobs and the equivalent synthetic GridMix3 jobs. We observe a remarkable similarity in the clusters, with the centroids of both coinciding and a similar distribution of clusters. This suggests that GridMix3 is effective in emulating the mix of jobs being run on the Hadoop clusters.

We see several other uses of the clustering methodology we have employed. We intend to extend the work to learn how the jobs are changing over time, by studying the distribution of these job clusters across workloads from various time periods. In addition, we would like to compare the jobs being run on the production cluster to the ad-hoc jobs being run on the clusters used for data mining and modelling. This analysis would help us identify whether there is any underlying pattern in the jobs being run on the experimental clusters. In future work, we also plan to expand the set of features being considered, for example the language used to develop the actual map and reduce tasks, the use of metadata server(s), and the number of input files.

7. Acknowledgments

We would like to extend our thanks to Ryota Egashira, Rajesh Balamohan, and Srigurunath Chakravarthi for their help with data collection. We would also like to thank Lihong Li for his input on the clustering methodology.

References

[1] Apache Software Foundation. Hadoop Vaidya. http://hadoop.apache.org/common/docs/current/vaidya.html.
[2] Apache Software Foundation. Rumen - A Tool to Extract Job Characterization Data from Job Tracker Logs. http://issues.apache.org/jira/browse/MAPREDUCE-751.
[3] Apache Software Foundation. Welcome to Apache Hadoop! http://hadoop.apache.org.
[4] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[5] C. Douglas and H. Tang. Gridmix3 Emulating Production Workload for Apache Hadoop. http://developer.yahoo.com/blogs/hadoop/posts/2010/04/gridmix3_emulating_production/.
[6] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, USA, 2001.
[7] The R Foundation for Statistical Computing. The R Project for Statistical Computing. http://www.r-project.org.