2nd IEEE International Conference on Cloud Computing Technology and Science



                Characterization of Hadoop Jobs using Unsupervised Learning


                Sonali Aggarwal                    Shashank Phadke                           Milind Bhandarkar
               Stanford University                    Yahoo! Inc.                                Yahoo! Inc.
             sonali@cs.stanford.edu             sphadke@yahoo-inc.com                      milindb@yahoo-inc.com


                            Abstract

   The MapReduce programming paradigm [4] and its open source implementation, Apache Hadoop [3], are increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both the cloud service providers and their users. This work characterizes Hadoop jobs running on production clusters at Yahoo! using unsupervised learning [6]. Unsupervised clustering techniques have been applied to many important problems, ranging from social network analysis to biomedical research. We use these techniques to cluster Hadoop MapReduce jobs that are similar in characteristics. The Hadoop framework generates metrics for every MapReduce job, such as the number of map and reduce tasks and the number of bytes read/written to the local file system and HDFS. We use these metrics, together with job configuration features such as the format of the input/output files and the type of compression used, to find similarity among Hadoop jobs. We study the centroids and densities of these job clusters. We also perform a comparative analysis of the real production workload and the workload emulated by our benchmark tool, GridMix, by comparing the job clusters of both workloads.

   Keywords: performance benchmark, workload characterization

1 Introduction

   Apache Hadoop [3] is an open-source framework for reliable, scalable, distributed computing. It primarily consists of HDFS, a distributed file system that provides high-throughput access to application data, and the MapReduce programming framework for processing large data sets. The MapReduce programming paradigm [4] and its open source implementation, Apache Hadoop [3], are increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both the cloud service providers and their users. Hadoop clusters are used for a variety of research and development projects, and for a growing number of production processes at Yahoo!. Yahoo! operates the world's largest Hadoop production clusters, with some clusters as large as 4000 machines. With the increasing size of Hadoop clusters and of the jobs being run on them, keeping track of the performance of Hadoop clusters is critical. In this paper, we propose a methodology to characterize the jobs running on Hadoop clusters, which can be used to measure the performance of the Hadoop environment.

   The paper is organized as follows. We discuss the background of Hadoop's existing performance benchmark, GridMix, and its enhancements in Section 2. In Section 3, we describe our data set and its features. In Section 4, we present our clustering methodology and introduce the characteristic jobs running on the Hadoop cluster at Yahoo!. In Section 5, we present a comparative analysis of a 3-hour trace of production jobs and benchmark jobs. In Section 6, we conclude with a summary and some directions for future research.

2 Background and Objectives

   Hadoop clusters at Yahoo! run several thousand jobs every day, and each of these jobs executes hundreds of map and reduce tasks and processes several terabytes of data. With such large-scale usage, it is essential to track metrics on throughput, utilization, and, most importantly, perceived job latencies on these clusters. Performance evaluation and benchmarking of Hadoop software is therefore critical, not only to optimize the full range of job execution on these clusters, but also to reproduce load-related bottlenecks.

   The previous work on Hadoop performance benchmarking, named GridMix, has undergone several enhancements over the years. GridMix aims to represent real application workloads on Hadoop clusters, and has been used to verify and quantify optimizations across different Hadoop releases. The first two versions of GridMix defined representative Hadoop workloads through a fixed set of microbenchmarks, and did not model the diverse mix of jobs running on Yahoo!'s production cluster well. GridMix3 [5], the latest enhancement over the previous versions, accepts a timestamped stream (trace) of job descriptions. For each job in the trace, the GridMix3 client submits a corresponding synthetic job to the target cluster at the same rate as in the original trace, and thereby models the diverse mix of jobs running in Hadoop's environment.

   We propose a new way to benchmark the performance of a Hadoop cluster: by learning the characteristic jobs being run on it. We use unsupervised learning to cluster the real production workload, and determine the centroids and densities of the resulting job clusters. The centroid jobs reflect the representative jobs of the real workload. We use the K-means clustering algorithm for this purpose. Running these representative jobs, and computing a weighted sum of their performance, where the weights correspond to the sizes of the job clusters, gives us a measure of Hadoop cluster performance within a small margin of error of the measure computed by GridMix3.
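   As an illustration, such a weighted score can be computed in a few lines of R [7], the package used for the clustering in Section 4. The sketch below is not part of the study's tooling: the per-centroid runtimes are hypothetical, and only the cluster sizes (from Table 3) come from this paper.

```r
# Illustrative sketch: weight the measured runtime of each centroid
# (representative) job by the relative size of its cluster.
cluster_size <- c(205, 21, 195, 1143, 82, 9991, 35, 14)    # jobs per cluster (Table 3)
runtime_sec  <- c(612, 1830, 540, 95, 760, 48, 2400, 1100) # hypothetical centroid-job runtimes

weights         <- cluster_size / sum(cluster_size)  # weight = relative cluster density
benchmark_score <- sum(weights * runtime_sec)        # weighted sum of centroid performance
benchmark_score
```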
3. Hadoop Job Features

   Our input data set comprises metrics generated by the Hadoop MapReduce framework and collected by the JobTracker while each job is executing. After a job finishes, these metrics are stored on the Hadoop cluster in job history files in the form of per-job and per-task counters. Job counters keep track of application progress in both the map and reduce stages of processing. By default, the Hadoop MapReduce framework emits a number of standard counters, such as Map input records and Map output records, which we use as features in our dataset. Please see Table 1 and Table 2 for more information on the features.

               Table 1. Task History Statistics

   Metric                  Description
   HDFS Bytes              Bytes read/written to HDFS
   File Bytes              Bytes read/written to local disk
   Combine Records Ratio   Ratio of Combiner output records to input records
   Shuffle Bytes           Number of bytes shuffled after the map phase
                           (reduce tasks only)

               Table 2. Job History Statistics

   Metric                  Description
   Number of Maps          Number of tasks in the map phase
   Number of Reduces       Number of tasks in the reduce phase
   Input format            Format of the input file, which is parsed to
                           generate the key-value pairs (categorical feature)
   Output format           Format of the output file (categorical feature)
   Type of output          Compression used for the output of the
   compression             application (categorical feature)
   Map phase slots         Number of map slots occupied by each map task
                           in the Hadoop cluster
   Reduce phase slots      Number of reduce slots occupied by each reduce
                           task in the Hadoop cluster

   The various parameters used to measure the performance of a job are divided into two levels: the job level and the task level. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map and reduce tasks in parallel. For the task-level parameters, we use statistical descriptors such as the mean, standard deviation, and range of the counters over all tasks in the map and reduce phases, respectively. We also include job-specific configuration features, such as the type of data compression used and the formats of the input and output files, from our input data set.
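   As an illustration of these task-level descriptors, the sketch below computes the mean, standard deviation, and range of one counter over the map tasks of a single job in R; the per-task table and its column names are hypothetical stand-ins, not the actual counter schema.

```r
# Hypothetical per-task counter table for a single job: one row per map task.
map_tasks <- data.frame(
  task_id         = 1:4,
  hdfs_bytes_read = c(6.1e7, 6.4e7, 5.9e7, 6.6e7)
)

# Task-level descriptors used as job features: mean, standard deviation, and range.
job_features <- c(
  hdfs_read_mean  = mean(map_tasks$hdfs_bytes_read),
  hdfs_read_sd    = sd(map_tasks$hdfs_bytes_read),
  hdfs_read_range = diff(range(map_tasks$hdfs_bytes_read))
)
job_features
```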
   We use a non-correlated feature set derived from the counters, since we did not want to give increased weight to any of the features. We also did not use features that depend on the Hadoop cluster hardware configuration, such as the time taken to execute the job. Such cluster-specific features would differ when the same MapReduce job is executed on different Hadoop clusters; considering absolute CPU or wall-clock time as a job feature would therefore not allow us to correlate jobs executed on different clusters.

4. Clustering Methodology

   We used the statistical package R [7] for clustering. R is an open source language and environment for statistical computing and graphics.

   We implemented the traditional K-means algorithm for our clustering. We estimated the K in K-means using the within-groups sum of squares. To find the initial seeds, we randomly picked sqrt(n) jobs from the entire collection and ran Hierarchical Agglomerative Clustering on them; we then used these results as the initial seeds for the K-means algorithm.
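   A minimal sketch of this seeding procedure is shown below, assuming the standardized job features are held in a numeric matrix `jobs` with one row per job. The paper does not publish its code, so details such as the agglomeration method and the random stand-in data are illustrative.

```r
# Illustrative sketch of the seeding procedure (not the authors' code).
set.seed(42)
jobs <- matrix(rnorm(11686 * 10), ncol = 10)  # stand-in for the standardized feature matrix

k <- 8                                        # number of clusters (estimated in Section 4.3)
n <- nrow(jobs)

# Hierarchical Agglomerative Clustering on a random sample of sqrt(n) jobs ...
sample_idx <- sample(n, floor(sqrt(n)))
hc         <- hclust(dist(jobs[sample_idx, ]), method = "average")
membership <- cutree(hc, k = k)

# ... whose per-group means serve as the initial seeds for K-means over all jobs.
seeds <- apply(jobs[sample_idx, ], 2, function(col) tapply(col, membership, mean))
km    <- kmeans(jobs, centers = seeds, iter.max = 100)

table(km$cluster)  # cluster densities
```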
4.1. Data Collection

   The job metrics we collected span 24 hours from one of Yahoo!'s production Hadoop clusters and comprise 11686 jobs. We did not take into account the jobs that failed on the Hadoop cluster. By the nature of production Hadoop jobs at Yahoo!, these jobs are executed repeatedly with a specific periodicity, on different data partitions, as they become available. We parsed the JobTracker logs to obtain the feature vector set mentioned in Tables 1 and 2, using a modified version of Hadoop Vaidya [1]. Vaidya performs a post-execution analysis of MapReduce jobs by parsing and collecting execution statistics from the job history and job configuration files. We generated our initial input using Vaidya, before normalizing it for clustering.

4.2. Pre-processing

   Prior to clustering, we rescaled the variables for comparability. We standardized the data to have a mean of 0 and a standard deviation of 1. Since we use Euclidean distance to compute per-feature similarity between different jobs, the clusters would otherwise be influenced strongly by the magnitudes of the variables, and especially by outliers; normalizing all features to the same mean and standard deviation removes this bias. Numeric variables were standardized, and nominal attributes were converted into binary indicators. We made scatter plots and calculated covariance to check dependencies between the features, in order to remove heavily correlated variables that tend to artificially bias the clusters toward natural groupings of those variables. For example, we observed that the format of the input files of a job was strongly related to the format of its output files.
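   This pre-processing step can be sketched in R as follows; the data frame `raw` and its columns are hypothetical stand-ins for the Vaidya-derived features rather than the actual schema.

```r
# Hypothetical raw features: two numeric counters and one nominal attribute.
raw <- data.frame(
  num_maps      = c(456, 863, 79, 316),
  hdfs_read     = c(6.3e7, 4.8e8, 4.5e7, 1.7e8),
  output_format = factor(c("Text", "SequenceFile", "Text", "SequenceFile"))
)

# Standardize numeric variables to mean 0 and standard deviation 1.
numeric_std <- scale(raw[, c("num_maps", "hdfs_read")])

# Convert the nominal attribute into binary indicator columns.
nominal_bin <- model.matrix(~ output_format - 1, data = raw)

features <- cbind(numeric_std, nominal_bin)

# Covariance between features, used to spot heavily correlated variables.
round(cov(features), 2)
```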

4.3. Estimating the Number of Clusters

   The heuristic we used to estimate the number of clusters in our dataset is to take the number of clusters at which we see the largest drop in the within-groups sum of squares. We iterate over an increasing number of clusters and observe the within-groups sum of squares; a plot of this quantity against the number of clusters extracted helped us determine the appropriate number. The plot is shown in Figure 1, and we look for a bend in it. There is very little variation in the within-groups sum of squares beyond 8 clusters, which suggests there are at most 8 clusters in the data. For two arbitrarily chosen coordinates (i.e., features), the centroids of these clusters are shown in Figure 2.

   [Figure 1. Estimating the number of clusters: a plot of the within-groups sum of squares against the number of clusters.]
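   A common way to produce such a plot in R is sketched below; the feature matrix here is a random stand-in for the standardized job features, since the original data and script are not available.

```r
# Within-groups sum of squares for k = 1..15 clusters (elbow heuristic).
set.seed(42)
jobs <- matrix(rnorm(11686 * 10), ncol = 10)  # stand-in for the standardized feature matrix

max_k <- 15
wss   <- sapply(1:max_k, function(k)
  kmeans(jobs, centers = k, nstart = 5, iter.max = 100)$tot.withinss)

# A bend in this curve indicates the appropriate number of clusters
# (k = 8 in the paper's data).
plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-groups sum of squares")
```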
4.4. K-Means Algorithm

   We then estimated the initial seeds by Hierarchical Agglomerative Clustering and ran the K-means algorithm with the chosen seeds. We used Euclidean distance as the distance metric: the total distance between two jobs is the square root of the sum of the squared individual feature distances.

4.5. Results

   We obtained 8 clusters from the K-means clustering algorithm. Table 4 and Table 5 describe the centroids of these clusters. The task-level features listed are obtained by taking the mean of each feature metric over all the map or reduce tasks of these jobs. These centroids are the characteristic jobs running on the Hadoop cluster. Table 3 and Figure 3 show the densities of these clusters. Figure 2 shows the distances between the centers after they have been scaled to two dimensions; we used multidimensional scaling (MDS) to map our high-dimensional centers into two-dimensional vectors while preserving the relevant distances. These 8 clusters differ significantly in the number of map and reduce tasks and in the bytes being read/written and processed on HDFS. Most of the jobs on the Hadoop cluster (90%) can be modeled as having close to 79 map tasks and 28 reduce tasks. A few jobs (approx. 0.003%) have as many as 2487 map tasks. Most jobs tend to have significantly fewer reduce tasks than map tasks. These centroid jobs, each run as many times as the size of its cluster, represent the jobs being run on the Hadoop clusters within a small margin of error.
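   The two-dimensional view of the centers (Figure 2) and the density plot (Figure 3) can be reproduced with classical MDS and a log-scale bar plot in R, as in the sketch below, which assumes the fitted `km` object from the earlier seeding sketch.

```r
# Classical multidimensional scaling of the 8 cluster centers into 2 dimensions.
centers_2d <- cmdscale(dist(km$centers), k = 2)

plot(centers_2d, xlab = "Coordinate 1", ylab = "Coordinate 2",
     main = "Cluster centers after MDS")
text(centers_2d, labels = rownames(km$centers), pos = 3)  # label points with cluster numbers

# Cluster densities on a log scale, as in Figure 3.
barplot(log10(table(km$cluster)),
        xlab = "Cluster", ylab = "log10(number of jobs)")
```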

       Table 4. Centroids of Job Clusters (Means of Features in Map Phase over all Map Tasks)

 Cluster  Number of  Map HDFS    Map HDFS       Map File    Map File       Map Input  Map Output
 Number   Maps       Bytes Read  Bytes Written  Bytes Read  Bytes Written  Records    Records
 1        456        63 MB       0.22 MB        80.8 MB     166 MB         214,116    312,037
 2        863        478.69 MB   84 B           721.5 MB    1403 MB        387,661    387,661
 3        572        100.5 MB    0.2 MB         71 MB       65.19 MB       936,015    1,600,945
 4        191        90 MB       85 B           25.78 MB    44.26 MB       1,040,112  1,946,530
 5        1080       86.6 MB     85 B           81.3 MB     81.24 MB       595,183    512,653
 6        79         44.82 MB    42 MB          22 MB       39.414 MB      334,144    813,425
 7        2487       122 MB      84 B           226 MB      319 MB         958,604    1,155,927
 8        316        169.6 MB    86 B           210 MB      434.25 MB      513,999    513,913



     Table 5. Centroids of Job Clusters (Means of Features in Reduce Phase over all Reduce Tasks)

 Cluster  Number of  Reduce HDFS  Reduce HDFS    Reduce File  Reduce File    Reduce Input  Reduce Output
 Number   Reduces    Bytes Read   Bytes Written  Bytes Read   Bytes Written  Records       Records
 1        20         335 KB       9.64 GB        1.2886 GB    1.2 GB         16,330,242    9,815,480
 2        29         6 B          2.09 GB        5.421 GB     5.42 GB        2,395,949     2,395,955
 3        73         419 B        650.2 MB       390 MB       384.5 MB       58,155,54     3,226,467
 4        71         494 KB       65 MB          70.5 MB      70.5 MB        5,087,889     1,630,275
 5        62         330 B        586.6 MB       759 MB       759 MB         10,919,233    5,754,906
 6        28         57 KB        54.85 MB       17.7 MB      17.6 MB        505,995       235,897
 7        67.5       31.6 MB      2.04 GB        3.2 GB       3.2 GB         154,633,619   19,670,261
 8        14.5       6 B          1.9 GB         5.57 GB      5.57 GB        14,113,424    14,113,430



              Table 3. Size of Job Clusters

   Cluster Number   Number of jobs in cluster
   1                205
   2                21
   3                195
   4                1143
   5                82
   6                9991
   7                35
   8                14

5. Comparison of Production Cluster Jobs with GridMix3 Jobs

   GridMix3 attempts to emulate the workload of real jobs on the Hadoop cluster by generating a synthetic workload. Our objective in this study was to validate whether the synthetic workload generated by GridMix3 accurately represents the real workload it tries to emulate. We used a three-hour trace from the daily run of the production cluster for this purpose; the trace consisted of 1203 jobs. We processed the same three-hour trace using GridMix3, which generated a synthetic workload, and executed it in a controlled environment. We used only quantitative features in this analysis, since GridMix3 does not emulate categorical features such as the compression type. We parsed the job counter logs using Rumen [2], which processes job history logs to produce job and detailed task information in the form of a JSON file.

   We obtained 5 clusters in both the actual production jobs and the GridMix jobs, with similar distributions and centers. This indicates that GridMix3 models the actual production jobs effectively, and that our clustering study is effective in understanding the grouping of these jobs.
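   The paper does not spell out how the two sets of clusters were matched; one simple way to make such a comparison concrete is to pair each GridMix cluster center with its nearest production center and inspect the distances and cluster sizes, as in the purely hypothetical sketch below (`km_prod` and `km_gridmix` stand in for K-means fits on the two traces).

```r
# Hypothetical comparison of two clusterings: production trace vs. GridMix3 replay.
set.seed(7)
km_prod    <- kmeans(matrix(rnorm(1203 * 6), ncol = 6), centers = 5, nstart = 10)
km_gridmix <- kmeans(matrix(rnorm(1203 * 6), ncol = 6), centers = 5, nstart = 10)

# Distance between every production center (rows) and every GridMix center (columns).
pairwise <- as.matrix(dist(rbind(km_prod$centers, km_gridmix$centers)))[1:5, 6:10]

# Pair each GridMix center with its nearest production center.
nearest <- apply(pairwise, 2, which.min)

data.frame(
  gridmix_cluster   = 1:5,
  matched_prod      = nearest,
  center_distance   = pairwise[cbind(nearest, 1:5)],
  gridmix_size      = km_gridmix$size,
  matched_prod_size = km_prod$size[nearest]
)
```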

6. Summary and Future Work


   We obtained the characteristics of the jobs running on Yahoo!'s Hadoop production clusters. We identified 8 groups of jobs, found their centroids to obtain our characteristic jobs, and used their densities to determine the weight that should be given to each representative job. In this way, we present a new methodology for developing a performance benchmark for Hadoop: instead of emulating all the jobs in the workload of a real Hadoop cluster, we emulate only the representative jobs, denoted by the centroids of the job clusters. We also performed a comparative analysis of the actual production jobs and the equivalent synthetic GridMix jobs. We observe a remarkable similarity between the clusters, with the centroids of both coinciding and the clusters similarly distributed. This suggests that GridMix is effective in emulating the mix of jobs being run on the Hadoop clusters.

   We see several other uses for the clustering methodology we have applied. We intend to extend the work to learn how the jobs change over time, by studying the distribution of the job clusters obtained from workloads collected across various time periods. In addition, we would like to compare the jobs being run on the production cluster to the ad-hoc jobs being run on the clusters used for data mining and modelling; this analysis would help us identify whether there is any underlying pattern in the jobs being run on those experimental clusters. In future work, we also plan to expand the set of features being considered, such as the language used to develop the actual map and reduce tasks, the use of metadata server(s), and the number of input files.

   [Figure 2. The cluster centers scaled to two dimensions. Coordinates 1 and 2 are the coordinates of the centroids after MDS; the number above each point is the cluster number.]

   [Figure 3. The logarithm of the number of jobs in each cluster.]

7. Acknowledgments

   We would like to extend our thanks to Ryota Egashira, Rajesh Balamohan, and Srigurunath Chakravarthi for their help with data collection. We would also like to thank Lihong Li for his input on the clustering methodology.

References

[1] Apache Software Foundation. Hadoop Vaidya. http://hadoop.apache.org/common/docs/current/vaidya.html.
[2] Apache Software Foundation. Rumen - A Tool to Extract Job Characterization Data from Job Tracker Logs. http://issues.apache.org/jira/browse/MAPREDUCE-751.

[3] Apache Software Foundation. Welcome to Apache Hadoop! http://hadoop.apache.org.
[4] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
[5] C. Douglas and H. Tang. GridMix3 - Emulating Production Workload for Apache Hadoop. http://developer.yahoo.com/blogs/hadoop/posts/2010/04/gridmix3_emulating_production/.
[6] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, USA, 2001.
[7] The R Foundation for Statistical Computing. The R Project for Statistical Computing. http://www.r-project.org.




