Web Server Workload Characterization and Generation
                                        Archana Ganapathi
                                    archanag@cs.berkeley.edu

Abstract

Workload characterization and generation are essential tools to assist in building and maintaining
web services. We discuss a framework that allows us to take advantage of trace data, process it
using Machine Learning algorithms, and generate workload that produces specific effects on the
target system. We performed a clustering analysis to characterize a specific web server workload
and discuss the analysis results and the action items needed to strengthen our workload generation framework.


1. Introduction

The plethora of web services in today’s cyberspace has introduced a new set of challenges for
systems developers. The scale and variety of workload handled by these web services can
drastically increase and/or change over a short period of time. Currently, there are no tools to help
operators predict scaling issues and system behavior after hardware and/or software configuration
changes. The absence of such a tool results in unanticipated downtime, which ultimately translates
into a significant loss of revenue for the company.

A representative, scalable and adaptive workload generator is the missing component that will
enable web service operators to predict system behavior during the development and maintenance
process of online services. Workload generation helps identify bottlenecks in the system and
allows the developer to test particular parts of the system. It is also useful in predicting scaling
issues and understanding system behavior under various configurations and saturation states.
Currently, there is no uniform methodology for workload generation and existing workload
generators are very system-specific.

Furthermore, there is no effective way to predict the behavior of a system that is exposed to
unexpected request patterns. The first obvious step in understanding how the target system works
is to interpret the workload it handles. It is crucial to understand the characteristics of workload
input and observe its effects on the target system. Since we are looking at web services, we use
traces (typically http traces) that were collected at various internet services and try to characterize
the consequent workload. Once we characterize the workload, we can tie it in with low-level
system metrics (such as power consumption and CPU utilization), and correlate workload
patterns with variations in system metrics. Our goal is to provide a framework for workload
characterization and generation.

Our proposed framework must be generic enough that it can accommodate existing commonly
used workload generators. It must also be reconfigurable to allow black-box workload generation
(where it learns properties of the system based on responses) as well as task-specific workload
generation where we can target which features of the system we’d like to evaluate.

One of the major obstacles of building such a framework based on trace data is privacy.
Companies often disallow sharing their data without explicit Non-Disclosure Agreements and
significant restrictions on how we use the data. We attempt to address such privacy concerns
related to sharing data and hope to take advantage of Machine Learning to condense trace
information into a concise, anonymized digest that has no pointers to the internal workings of the
system.

In this paper, we specifically elaborate upon the workload characterization process using various
clustering algorithms to partition trace data into unique working sets. We have made limited
progress in clustering the data; however, the process revealed the limitations of our dataset and
methodology and has helped us identify a set of action items to pursue to achieve our long-term
goals.


2. Related Work:

We evaluate related work in the workload generation frameworks, workload generators and
workload characterization domains.

Special-purpose Workload Generation Frameworks:

•   Filebench (Sun Microsystems): This framework models/generates application interactions
    with file systems. The desired behavior can be specified in a domain-specific language and is
    then translated using pre-defined macros to determine the quantity and type of I/O requests.
•   MACE (University of Wisconsin): This framework is used to specify and generate malicious
    network traffic. This tool is commonly used in the network security domain.
•   StreamGen (Georgia Institute of Technology): This framework allows users to generate
    distributed events and data streams. However, it does little to characterize the workload in a
    sophisticated way.

Workload Generators:

•   SURGE (Scalable URL Reference Generator): Captures file size and request
    size distributions and the relative popularity of requested files. It also models think time and
    allows users to configure specific models of interest.
•   SPECweb: Caters to web servers. This workload generator generates http requests only and
    only captures the 200 (OK) response. It is very difficult to configure and has been known
    to produce request distributions that are quite different from actual trace data (especially at
    high request volumes).
•   TPC-W: This workload generator models an online bookstore with web serving,
    browsing, and shopping cart facilities. However, the setup overhead for this workload
    generator is extremely high.
•   SLAMD: Java-based; tests network-based applications (specifically LDAP directory servers).
    This workload generator can also be used for web servers and web-based applications,
    relational databases, and mail servers. SLAMD is becoming increasingly popular due to its
    diverse nature of workload generation.
•   Harpoon: A flow-level traffic generator that mimics Internet traffic and generates representative
    background traffic for application and protocol testing.
•   Optixia: One of the few hardware-based workload generators. It is commonly used as an
    IP performance test platform and can create and transmit any type of Layer 2-3 traffic
    pattern at up to line rate over a network.
•   Hammer: Another rare instance of a hardware-based traffic generator. However, this
    workload generator is specific to VoIP and PSTN telephone call generation.
While all the above workload generation tools are widely used in their specific application
domains, they do not perform well in the presence of a black-box system. They require a certain
degree of prior knowledge of the system, for which the workload generator is custom-configured.

Workload Characterization:

Arlitt et al. [AJ00] analyze the same data set we used for our study, from the 1998 World Cup Web
site. However, they manually sort the data to identify prominent patterns in the traces and make
little use of machine learning algorithms. Moore et al. [MC+05], on the other hand, consider
various trends in data center workload and propose a framework to instrument, analyze, and
effectively emulate workload in a data center. Eeckhout et al. [EV+02] consider workload
characterization for energy and power consumption in the Java Virtual Machine; they instrument
the virtual machine to observe these characteristics and analyze them. There has also been work on
using neural networks to characterize 3-tier web service workload that follows non-linear behavior
[YL+06]. Again, we see the predominance of a model-based approach, where a preconceived notion of
the system has a dominant effect on the analysis results. We hope to create a less biased
characterization methodology for our future experiments.


3. Workload Characterization and Generation Methodology:

[Figure 1: Workload Generation Framework. The diagram outlines each component we will use to
implement our methodology: behind the firewall, traces and low-level metrics are clustered into
Wstatic; stratified sampling produces Wdynamic at successive time points (t = n-2, n-1, n), which
feeds projection of future behavior from historical trends, statistical scale-up, and a workload
model (open/closed/ajax) that drives synthetic workload generation.]


The figure above shows our proposed framework for capturing traces, condensing the essential
features of the data, and using that information to replay workload. Below we elaborate on each
step of the process outlined above, and discuss how we validate our generated workload.

Methodology:

1. Derive metrics of interest: We must identify which features of the trace data are most useful in
    capturing the relationship between web server traces and system behavior. We can use a
    combination of feature-selection algorithms and low-level metric data gathered as suggested
    by knowledgeable domain experts.
    Formally speaking:
    Metrics = set of feature vectors

   2. Clustering: Once we have identified useful features, we must cluster the workload trace data
based on system behavioral features of interest. Each cluster will consist of a specific set of
    requests that produce particular system effects. In this step, we must identify the most
    appropriate clustering algorithm for our data set and define distance metrics between data
    points. Once we have clusters, we can summarize the features of the cluster and use the
summary as an anonymized version of the cluster, which we refer to as an equivalence class.
    We elaborate on this particular step of the process in subsequent sections of the paper.
    Formally speaking:
    Cluster = set of related metrics given pair-wise distance and clustering algorithm
    Let N = number of equivalence classes

3. Understanding the static workload: Workload tends to be of two types: static and dynamic.
    In this phase, we study the static workload. Typically, there are several relationships between
    the workload and system metrics that are static, such as those produced by database-intensive
    and CPU-intensive requests. The clustering analysis helps us identify static effects of the
    workload on the system (Wstatic in Figure 1). These properties can be formally described as
    follows:
    Wstatic = {(cluster centroid_i, cluster radius_i)}, where 1 ≤ i ≤ N (the number of equivalence classes)

4. Stratified sampling: Once we determine the various clusters (and/or equivalence classes), we
    must sample data from the original trace proportional to the specific workload mix we would
    like from each cluster/equivalence class. With this information, we can generate the dynamic
    component of the workload.

5. Dynamic workload: There are several aspects of the workload that may not be present in the
    traces but that we would like to simulate. For example, load spikes may not be captured in a
    given trace, but we may choose to simulate them by scaling up trace data and adding
    representative synthetic filler data. We can generate a variety of Wdynamic mixes based on the
    task we would like to perform. For example, each Wdynamic can represent a different usage
    pattern (request sequences, inter-arrival rates, etc.). Furthermore, because Wdynamic is a simple
    transition matrix between equivalence classes, it contains no proprietary data and can be
    easily shared outside a corporate domain without privacy concerns. More formally (a minimal
    sketch of estimating this matrix from a labeled trace appears after this list):
    Wdynamic = N x N transition probability matrix

6. Projecting future behavior: Given various Wdynamic matrices for workload at different time
    points, we can use past trends to extrapolate future workload characteristics. This step would
    be useful for understanding the evolving nature of the system and its workload, but it is not
    mandatory for simple workload generation.

7. Statistically scaling up: We can use Wdynamic to seed our synthetic workload and scale it up to
    the desired quantity to stress-test our system. This step would help us determine the behavior
    of the target system under high load, find its saturation point, and observe how the system
    behaves in its saturated state.

8. Synthetic workload generation: Lastly, we would like to perform the actual load generation
    based on the preceding seven steps and statistical learning processes. We would like the user of
    the workload generator to specify certain characteristics of the synthetic workload and custom-
    configure the generator based on their testing goals.
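
To make step 5 more concrete, below is a minimal sketch (in Python) of how a Wdynamic transition
matrix could be estimated from a trace whose requests have already been assigned to equivalence
classes in step 2. The function name, the (timestamp, class_id) record representation, and the
assumption that transitions are counted over a single time-ordered stream are our own illustration,
not a fixed part of the framework.

    def dynamic_workload_matrix(labeled_requests, num_classes):
        """Estimate the N x N transition probability matrix W_dynamic from a
        time-ordered trace whose requests have been mapped to equivalence
        classes. `labeled_requests` is a list of (timestamp, class_id) pairs."""
        counts = [[0] * num_classes for _ in range(num_classes)]
        ordered = sorted(labeled_requests)              # order by timestamp
        for (_, prev), (_, curr) in zip(ordered, ordered[1:]):
            counts[prev][curr] += 1                     # count class-to-class transitions
        matrix = []
        for row in counts:                              # normalize each row to probabilities
            total = sum(row)
            matrix.append([c / total if total else 0.0 for c in row])
        return matrix

    # Hypothetical usage: three equivalence classes, a short labeled trace.
    trace = [(894146590, 0), (894146592, 1), (894146593, 1), (894146595, 2)]
    print(dynamic_workload_matrix(trace, 3))

Because the matrix stores only class-to-class probabilities, it carries none of the raw request
data, which is what allows it to be shared outside the corporate domain.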

Closely following the above eight steps should make our workload generator scalable and
adaptable. However, as the saying goes, "the proof is in the pudding" when it comes to generating
realistic workload.

Validating the synthetic workload:

Success of our workload generator is not defined solely by the ability to generate synthetic
workload; it also requires that the generated workload be realistic. To evaluate this aspect of our
workload generator, we must compare the behavior of the system under trace-based workload to
its behavior under our generated synthetic workload. We plan to take advantage of virtual machine
technology for this purpose. Because virtual machines provide observability, we can run a variety of
applications on these virtual machines, replay the traces, observe system behavior, and repeat the
process by replaying our synthetically generated workload. Comparing the two behavioral
patterns will close the feedback loop and inform us about how accurately our synthetic workload
reproduces system behavior.


4. Clustering for Workload Characterization:

As a proof of concept for applying machine learning for workload characterization, we
implemented a few clustering algorithms and evaluated their performance against a set of http
traces. The algorithms, data set, and experimental results are described below.

Clustering Algorithms:

Clustering algorithms approximate a loosely-defined optimal assignment problem. In many
cases, including ours, the number of desired clusters is unknown, making it difficult to select a
concrete number of clusters and formulate the clustering task as an integer program. We
implemented two approximate solutions to clustering:

                K-Means/K-Median Clustering             Agglomerative Clustering

Initialization  Choose k data points uniformly at       Each data point is the center of a
                random from the full data set           singleton cluster

Iteration       1) Assign each data point to the        1) Merge the two clusters whose centers
                nearest cluster center                   are closest to each other
                2) Recompute cluster centers as the     2) Recompute cluster centers as the
                "average" of all member data points     "average" of all member data points

Runtime         O(n^2)                                  O(n) (one new cluster per iteration)

Figure 2: Comparison of Clustering Algorithms




Iterative k-means/medians clustering: This algorithm chooses k data points uniformly at random
from the data set as the initial cluster centroids. At each iteration, each data point joins the cluster
whose centroid is closest to it relative to all other cluster centroids, and the centroid of each
cluster is recomputed as the average/median of all member data points. This algorithm iterates
until cluster membership stabilizes. In addition, we forbid the existence of singleton clusters: a
cluster containing only one data point ceases to exist, and its constituent joins the nearest cluster.
k-means/medians clustering implements the expectation-maximization (EM) algorithm for a
uniform prior distribution over cluster membership. While easy to compute, it is sensitive to the
choice of initial cluster centroids (which can be alleviated by clever seeding techniques as
discussed later). The random seeding approach also causes non-deterministic cluster formation.
Furthermore, this algorithm does not provide much information about structure within clusters,
and hence is not as informative as the next algorithm we discuss.
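
The following is a minimal sketch of the iterative loop described above, assuming numeric feature
vectors and a caller-supplied distance and averaging function; the singleton-cluster rule and the
implicit-centroid handling needed for our non-numeric features (see Distance Metrics below) are
omitted for brevity.

    import random

    def k_means(points, k, distance, mean, max_iter=100, seed=0):
        """Minimal iterative k-means/medians clustering. `distance(a, b)` and
        `mean(members)` are supplied by the caller, so the same loop serves
        k-means (arithmetic mean) or k-medians (per-feature median)."""
        rng = random.Random(seed)
        centers = rng.sample(points, k)                 # random initial centroids
        assignment = None
        for _ in range(max_iter):
            # 1) Assign each data point to its nearest cluster center.
            new_assignment = [min(range(k), key=lambda c: distance(p, centers[c]))
                              for p in points]
            if new_assignment == assignment:            # membership has stabilized
                break
            assignment = new_assignment
            # 2) Recompute each center as the "average" of its member points.
            for c in range(k):
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centers[c] = mean(members)
        return centers, assignment

    # Hypothetical usage on two-dimensional numeric features.
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    avg = lambda ms: tuple(sum(v) / len(ms) for v in zip(*ms))
    centers, labels = k_means(pts, 2, euclid, avg)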

Hierarchical agglomerative clustering: This algorithm starts with n singleton clusters, one for
each data point. At each iteration, it identifies the pair of clusters that are closest to each other
and merges them together. The centroid of the newly merged cluster is the average of all data
points in both original clusters. Agglomerative clustering implements a deterministic bottom-up
approximation to the optimal assignment; it provides more consistent behavior than k-means
clustering, and returns a binary tree that provides some insight into the internal structure of the
clusters it finds. However, the drawback of this algorithm is that greedy clustering based on
minimizing inter-cluster distance can lead to odd groupings in the outcome.
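
Below is a minimal sketch of the bottom-up merge loop, assuming the implicit-centroid
(average-linkage) cluster distance we describe under Distance Metrics; it is an illustration rather
than the exact implementation used in our experiments.

    def agglomerative(points, distance, target_clusters=1):
        """Minimal hierarchical agglomerative clustering. Clusters are lists of
        member indices; because centroids are implicit, the distance between two
        clusters is the average pairwise distance over all cross-cluster pairs."""
        clusters = [[i] for i in range(len(points))]    # n singleton clusters
        merges = []                                     # records the binary merge tree

        def cluster_dist(a, b):
            pairs = [(i, j) for i in a for j in b]
            return sum(distance(points[i], points[j]) for i, j in pairs) / len(pairs)

        while len(clusters) > target_clusters:
            # Find the pair of clusters that are closest to each other and merge them.
            a, b = min(((x, y) for x in range(len(clusters))
                        for y in range(x + 1, len(clusters))),
                       key=lambda xy: cluster_dist(clusters[xy[0]], clusters[xy[1]]))
            merges.append((clusters[a], clusters[b]))
            merged = clusters[a] + clusters[b]
            clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
            clusters.append(merged)
        return clusters, merges

Calling this with target_clusters=1 corresponds to running the algorithm "to completion," as we do
in the experiments below.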

Both algorithms required us to define data-specific clustering features and distance metrics. We
discuss the data and relevant features and metrics in subsequent sections.

Raw Data:

We used the 1998 World Cup Web traces to apply our clustering techniques for characterizing
workload. These traces were collected at www.france98.com and parsed from compressed
binary logs into a human-readable format (as shown below). Each log file contained one day of
data (around 1.5 million requests). Due to the large scale of data in each log file, we were
only able to use one day of data at a time to prevent memory exhaustion.

ClientID    TimeStamp      Method    URL/Object Requested                              Type     Protocol     Resp      Size   Server
    595      894146590     GET       /images/10538.jpg                                 jpg      HTTP/1.0      200     15770       32
   1002      894146590     GET       /english/images/archives.gif                      gif      HTTP/1.0      200       869       32
   2546      894146590     GET       /english/frntpage.htm                             htm      HTTP/1.0      200     12518       32
  26637      894146590     GET       /english/history/images/reading.GIF               GIF      HTTP/1.0      200      1171       32
  26637      894146590     GET       /english/history/images/history_hm_header.gif     gif      HTTP/1.0      200       688       32
  33534      894146590     GET       /images/saintetienne.gif                          gif      HTTP/1.1      304         0       32
  33808      894146590     GET       /french/venues/images/venue_bu_city_off.gif       gif      HTTP/1.0      200       781       32
  33808      894146590     GET       /french/venues/images/venue_bu_statium_on.gif     gif      HTTP/1.0      200      1017       32
              Figure 3: Raw Data Format
Below is an explanation of each of the fields in the raw data (a minimal parsing sketch follows this list):

•   clientID - a unique integer identifier for the client that issued the request (includes proxies).
    Mappings are one to one and consistent across all traces in this data. However, they are
    anonymized for privacy concerns.

•   timestamp - the time of the request (converted to GMT), stored as the number of seconds
    since the Epoch. Local time can be determined by adding 2 hours to the existing timestamp.

•   method - the method contained in the client's request (e.g., GET). Mappings for this are
    contained in src/*/definitions.h.

•   objectID – URL for specific object requested. These mappings are also 1-to-1 and are
    preserved across the entire data set.

•   objectType - the type of file requested (e.g., HTML, IMAGE, etc), generally based on the file
    extension (.html), or the presence of a parameter list (e.g., '?' indicates a DYNAMIC request).
    If the url ends with '/', it is considered a DIRECTORY. Mappings from the integer ID to the
    generic file type are contained in definitions.h. If more specific mappings are required this
    information can be obtained from analyzing the object mappings file
    (state/object_mappings.sort).

•   Protocol/version – HTTP version indicated in the client's request (e.g., HTTP/1.0)

•   Response - response status code (e.g., 200 OK).

•   size - the number of bytes in the response sent from server back to client.

•   serverID - indicates which server handled the request. We can further parse this field to
    identify exact server and location. The mappings are one-to-one and consistent throughout the
    traces.
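
As a minimal sketch, a record in the format shown in Figure 3 could be parsed as follows, assuming
the whitespace-delimited text form above; the class and field names are our own and are not part of
the trace documentation.

    from dataclasses import dataclass

    @dataclass
    class TraceRecord:
        """One request from the human-readable form of the World Cup logs."""
        client_id: int
        timestamp: int      # seconds since the Epoch (GMT)
        method: str
        url: str
        obj_type: str
        protocol: str
        response: int       # HTTP status code
        size: int           # response bytes
        server_id: int

    def parse_line(line: str) -> TraceRecord:
        cid, ts, method, url, otype, proto, resp, size, server = line.split()
        return TraceRecord(int(cid), int(ts), method, url, otype, proto,
                           int(resp), int(size), int(server))

    # First row of Figure 3.
    rec = parse_line("595 894146590 GET /images/10538.jpg jpg HTTP/1.0 200 15770 32")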

Derived Features:

    LastReqTime    LastReqObjType    LastReqURL
              0    gif               /images/102338.gif
             11    gif               /english/images/news_btn_letter_off.gif
              2    gif               /images/hm_brdr.gif
          FALSE    FALSE             FALSE
              0    GIF               /english/history/images/reading.GIF
              2    gif               /images/lyon.gif
              1    html              /french/venues/venues/bordeaux.html
              0    gif               /french/venues/images/venue_bu_city_off.gif

While the raw data was in itself very descriptive, we did not have any information about the
effects of the traces on the system. To capture some structure in the trace sequences, we
extracted certain temporal patterns by including historical data for each client request.
We derived the following features and added them to the request data (a minimal sketch of this
derivation appears after the list):

    •   Time since last request: For each request, we obtained the timestamp of the previous
        request made by the same client and computed the difference between the current and
        previous timestamps. This information helps us identify (to a small degree) user-behavior
        and think-time models.

    •   Last object requested: For each request, we also recorded the previous object requested
        by the same client so we can determine a request transition graph and identify what
        common request streams look like.

    •   Last requested object type: In addition to the path to the specific object, we extracted the
        type of the previous object requested by the same client. This data helps us identify
        patterns pertaining to sequences of object types that are requested together. For example,
        if a client requests a gif, we can extrapolate what its most likely next request will be
        (perhaps another gif or jpg).
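
A minimal sketch of this derivation, building on the hypothetical TraceRecord from the parsing
sketch above: we track the most recent request per client and attach the three history features to
each record (placeholder values stand in where a client has no prior request, which the raw dump
marks as FALSE).

    def add_history_features(records):
        """Derive per-request history features by remembering the previous
        request seen for each client (records are TraceRecord-like objects)."""
        last_seen = {}                                  # client_id -> previous record
        derived = []
        for rec in sorted(records, key=lambda r: r.timestamp):
            prev = last_seen.get(rec.client_id)
            derived.append({
                "time_since_last_request": rec.timestamp - prev.timestamp if prev else None,
                "last_requested_url": prev.url if prev else None,
                "last_requested_obj_type": prev.obj_type if prev else None,
            })
            last_seen[rec.client_id] = rec
        return derived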

Clustering Features:

We eventually plan to use feature selection once we have traces of requests' effects on the system.
For now, we manually choose the following features to cluster on:
   • Timestamp
   • URL of requested object
   • Type of object requested
   • Response size
   • Time since last request
   • Previous request objectURL and objectType
   • Server handling response

As we obtain more low-level metric data for the system, we plan to add those metrics as
features to our data by temporally correlating the metrics with the requests input to the system.

Distance Metrics:

Clustering with numerical data permits a range of optimizations. For one thing, one can
explicitly represent the centroid of a cluster as the (probability-weighted or unweighted) average
of all data points in that cluster. This method works for numeric and binary feature vectors. For
example, with binary word-occurrence vectors, the cluster centroid would be a vector of probabilities
representing the likelihood that a document contains a particular word given its membership in the
cluster. However, many of our
features were discrete-valued and lacked a natural ordering consistent with the difference
operators we used. String identifiers are the best example of this problem. Specifically, in our
data, in addition to string, we also had structural information embedded in strings. The object
URL field contained a string with path information for the requested object. This forced us to
represent cluster centroids implicitly: the distance between a data point and a cluster centroid is
the average pairwise distance to all members of that cluster. This affects both the
performance of our clustering algorithms (agglomerative clustering must compute the distance
between a new cluster and an existing cluster using all pairs of data points from the two clusters,
instead of reducing the new cluster to a single data point) and the interpretability of the results
(we cannot synthesize a single data point that stands for an entire cluster). Below are some
examples of non-numeric data present in our dataset and how we calculated the appropriate
distance measures.

    •   Type of object requested: 0 or 1, depending on whether the two types match. Later
        extensions could categorize objects by media, text, AJAX code, etc.

    •   Server handling the response: 0 (same server, same location), 1 (different server, same
        location), 2 (different server, different location).

    •   Time since last request: currently Euclidean distance (absolute difference). Later we can
        break this down by time of day, day of week, etc. once we start using more data.

    •   Path to requested object: currently using string distance. However, this should logically
        use a "path distance" whereby we count the number of hops between two requested URLs
        (a minimal sketch of this metric appears after this list).
        e.g., the distance between /france/images/blah.gif and /usa/images/blee.gif can be defined
        as 2 (the common root is / and the distance from each root to / is 2).
        e.g., the distance between /english/help/image/faq_on.GIF and /english/images/fpnewstop.gif
        is defined as 2 (the common root is /english and the maximum distance to /english from
        either leaf is 2).
        e.g., the distance between /english/image/blah1.gif and /english/image/blah2.gif is 1.
        e.g., the distance is 0 when both object URLs are the same.
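
A minimal sketch of the path-distance metric suggested above, under our reading that the distance
is the larger number of directory hops from the deepest common directory to either object; the
function name and the choice of returning at least 1 for distinct URLs are our own illustration.

    def path_distance(url_a: str, url_b: str) -> int:
        """Hop distance between two requested object URLs."""
        if url_a == url_b:
            return 0
        dirs_a = url_a.strip("/").split("/")[:-1]       # directory components only
        dirs_b = url_b.strip("/").split("/")[:-1]
        common = 0                                      # length of shared directory prefix
        for a, b in zip(dirs_a, dirs_b):
            if a != b:
                break
            common += 1
        return max(len(dirs_a) - common, len(dirs_b) - common, 1)

    # /france/images/blah.gif vs /usa/images/blee.gif                  -> 2
    # /english/help/image/faq_on.GIF vs /english/images/fpnewstop.gif  -> 2
    # /english/image/blah1.gif vs /english/image/blah2.gif             -> 1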


Experimental Results

This section discusses the output of the automated clustering algorithms. We did not have a clear
sense of how many clusters we should expect to see, so we ran agglomerative clustering to
completion, and performed multiple runs of the k-means clustering algorithm with different initial
centroids and cluster counts. Both of these algorithms output cluster members, but as described
above, the raw data format precluded construction of an “average” cluster member.

Below, we describe the various experiments and observations for the produced clusters:

•   Equal weight for all features: We started by equally weighting all features in our data
    set. Running the clustering algorithms consistently produced near-randomly distributed
    clusters: there seemed to be no consistent pattern among requests within each cluster. The
    results of this experiment were not meaningful to us, so we decided to increase the weight
    of individual features to see whether that would provide interpretable results.

•   Double weight for requested object type: This experiment produced clusters that
    predominantly contained requests for the same object type (e.g., all gif, all html). However,
    there was no clear correlation between requests within clusters other than the matching object
    types. While these clusters were better than those of the previous experiment, the information
    they provided could easily have been obtained by sorting the data, so the clustering process
    added little value relative to its overhead.

•   Double weight for time since last request: In this experiment, random requests were clustered
    together based on similar values for time since last request from the same client. This
    information is meaningful to understand usage behavior. We observed what types of requests
    come after a significant amount of think-time and which ones stem from very short think-
times. However, the experiment did not shed much light on characterizing the raw workload
    itself.

•   Increased weight for size + object type: Increasing the priority of object type and response
    size together, while leaving all other features at lower weights, produced some meaningful
    insights. Requests for similar objects (for example, images) were further subdivided by similar
    response sizes, so we noticed clusters with similar request patterns. However, we did not get
    any insights beyond those mentioned above.

In the absence of cluster exemplars, we are left with the task of manually inspecting the clusters
found by our application. Working our way from the leaves to the root of the cluster tree
generated by the agglomerative algorithm, the clusters become harder to interpret. Regardless, for
the most part, our data had very scattered requests in each cluster and we did not find very
meaningful characterization results. Our concern is that the choice of features might bias the
results in favor of a particular hypothesis. One particular axis of the data could end up
significantly influencing the clustering results and the absence of sufficient feature vectors could
exaggerate this problem.


5. Discussion and Future Work:

The biggest drawback of our data set was the lack of measurable and quantifiable features that
reveal information about the target system for which the requests were generated. We plan to
alleviate this problem by replaying the traces on a virtual machine and recording various low-
level system metrics such as power consumption and CPU utilization. Such a task would enrich
the data with more observed and meaningful features and help the workload characterization
process significantly.

It was also difficult to define a meaningful "distance" between non-numeric data values, and we had
to invent workarounds for this problem. The most important of these non-numeric distance metrics is
the path distance between two given URLs for requested objects. Once we implement this distance
metric, the data set and features will be richer and more informative. Another major improvement
with regard to the way the algorithms handle the data is to perform seeded clustering analysis. For
example, the k-means algorithm currently selects initial centroids at random. If, instead, we
intelligently select initial cluster centroids that are spread across the data space, our clustering
results might prove to be more meaningful.
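
As a minimal sketch of such seeding (our own illustration, not yet part of the implementation), one
could pick the first centroid at random and then repeatedly add the point farthest from all
centroids chosen so far; a randomized variant of this idea is known as k-means++ seeding.

    import random

    def spread_out_seeds(points, k, distance, seed=0):
        """Farthest-point seeding: pick initial centroids that are spread
        across the data space instead of sampling them uniformly at random."""
        rng = random.Random(seed)
        seeds = [rng.choice(points)]
        while len(seeds) < k:
            # Add the point whose nearest chosen seed is farthest away.
            farthest = max(points, key=lambda p: min(distance(p, s) for s in seeds))
            seeds.append(farthest)
        return seeds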

The last and probably most prohibitive aspect of our data was the sheer volume of requests for
each day. There were over 1.5 million requests in a single day of trace data, and our tools were
not built to handle this volume. Thus, our analysis was limited to a single day of
data and did not span multiple days as we would have liked. As a result, the analysis is skewed by
behavioral patterns observed in a single day and there is no way to normalize the clustering
results based on other days of data. As a temporary workaround, we can experiment with stream
processing the log files to cluster across multiple days. However, there are practical limitations to
this approach due to available memory on the machines we use. The long-term solution to this
problem might be to stratify the data per day and sample data from each day to represent requests
spanning multiple days.

Perhaps we should reconsider the entire clustering approach and find an alternative algorithm that
does not involve tweaking the data so much. One conceivable (and appropriate) alternative is to
use Kernelized Canonical Correlation Analysis (KCCA). This hybrid approach projects the data onto
a lower-dimensional space while clustering it, yielding roughly spherical clusters. It
can use tuples of request data and measured metrics and preserve their relationship across the
clustering process. This algorithm is the next action item we plan to pursue for our project.


6. Conclusions:

There are various benefits to our proposed workload characterization and generation
methodology. First and foremost, the methodology takes advantage of Machine Learning to
automatically distill important parameters of the workload. This helps ensure that we do not
neglect important features because of biases toward system-specific behavior. We can also
generate models specific to our metric of interest, for example, performance, power, throughput,
user-response latency, or any combination of the above. Next, we can compare different
generations of models (e.g. different systems, different time points etc.) to identify data trends
over time and extrapolate to discover future behavior of the system. This ability to “fast-forward”
will prove extremely useful to web service developers and operators.

By distinguishing between model creation (inside companies, operating on sensitive data) and
model publishing (sanitized data for use by researchers), we address privacy concerns and
allow organizations to readily contribute trace data for us to process and use. This methodology
and mechanism are by no means restricted to web services and can easily be generalized to other
applications.

References:
[AJ00]      M. Arlitt, T. Jin, A workload characterization study of the 1998 World Cup Web site.
IEEE Network, 2000

[BC98] Paul Barford, Mark Crovella, Generating representative Web workloads for network and
server performance evaluation. Proceedings of the 1998 ACM SIGMETRICS Joint International
Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98), June 1998

[CM06]           Gilberto Contreras, Margaret Martonosi, Techniques for Real-System
Characterization of Java Virtual Machine Energy and Power Behavior. IISWC 2006

[EV+02]     L. Eeckhout, H. Vandierendonck, K. De Bosschere, How Input Data Sets Change Program
Behaviour. Workshop on Computer Architecture Evaluation using …, 2002

[Filebench] www.solarisinternals.com/si/tools/filebench/index.php

[Hammer] www.empirix.com/

[Harpoon]       J. Sommers and P. Barford. Self-Configuring Network Traffic Generation. In
Proceedings of ACM SIGCOMM Internet Measurement Conference, 2004.

[MACE]           Joel Sommers, Vinod Yegneswaran, Paul Barford, A framework for malicious
workload generation. Proceedings of the 4th ACM SIGCOMM conference on Internet measurement,
Taormina, Sicily, Italy

[MC+05]     J. Moore, J. Chase, K. Farkas, P. Ranganathan, Data Center Workload Monitoring,
Analysis, and Emulation. Eighth Workshop on Computer Architecture Evaluation using …, 2005

[Optixia]   www.ixiacom.com/products/optixia/

[SLAMD]     www.slamd.com/

[SPECweb] http://www.spec.org/web2005/

[StreamGen]       Mohamed Mansour, Matthew Wolf, Karsten Schwan, A Workload Generation Tool
for Distributed Information Flow Applications. Proceedings of the 2004 International Conference on
Parallel Processing (ICPP'04), August 2004

[TPC-W]     www.tpc.org/tpcw/default.asp

[YL+06]        Richard Yoo, Han Lee, Kingsum Chow, Hsien-Hsin Lee, Constructing a Non-Linear
Model with Neural Networks for Workload Characterization. IISWC, 2006

The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

report

In this paper, we specifically elaborate upon the workload characterization process, using various clustering algorithms to partition trace data into unique working sets. We have made limited progress in clustering the data; however, the process revealed the limitations of our dataset and methodology and helped us identify a set of action items to pursue toward our long-term goals.

2. Related Work

We review related work in three areas: workload generation frameworks, workload generators, and workload characterization.

Special-purpose Workload Generation Frameworks:

• Filebench (Sun Microsystems): This framework models and generates application interactions with file systems. The desired behavior is specified in a domain-specific language and translated using pre-defined macros to determine the quantity and type of I/O requests.
• MACE (University of Wisconsin): This framework specifies and generates malicious network traffic. It is commonly used in the network security domain.
• StreamGen (Georgia Institute of Technology): This framework allows users to generate distributed events and data streams. However, it does little to characterize the workload in a sophisticated way.

Workload Generators:

• SURGE (Scalable URL Reference Generator): Captures file size and request size distributions and the relative popularity of requested files. It also models think time and allows users to configure specific models of interest.
• SPECweb: Caters to web servers. This workload generator issues only HTTP requests and captures only 200 (OK) responses. It is very difficult to configure and is known to produce request distributions that differ substantially from actual trace data, especially at high request rates.
• TPC-W: Models an online bookstore with web serving, browsing, and shopping-cart facilities. However, the setup overhead for this workload generator is extremely high.
• SLAMD: A Java-based tool that tests network-based applications (specifically LDAP directory servers). It can also be used for web servers and web-based applications, relational databases, and mail servers. SLAMD is growing in popularity because of the diversity of workload it can generate.
• Harpoon: A flow-level traffic generator that mimics Internet traffic and generates representative background traffic for application and protocol testing.
• Optixia: One of the few hardware-based workload generators. It is commonly used as an IP performance test platform and can create and transmit any type of Layer 2-3 traffic pattern at up to line rate over a network.
• Hammer: Another of the rare hardware-based traffic generators. However, it is specific to VoIP and PSTN telephone call generation.
While all of the above workload generation tools are widely used in their specific application domains, they do not perform well against a black-box system. They require a certain degree of prior knowledge of the system, for which the workload generator is custom-configured.

Workload Characterization:

Arlitt et al. [AJ00] analyze the same data set we use in our study, from the 1998 World Cup web site. However, they manually sort the data to identify prominent patterns in the traces and do not apply machine learning algorithms. Moore et al. [MC+05], on the other hand, consider various trends in data center workload and propose a framework to instrument, analyze, and effectively emulate workload in a data center. Eeckhout et al. [EV+02] consider workload characterization for energy and power consumption in the Java Virtual Machine; they instrument the virtual machine to observe and analyze these characteristics. There has also been work on using neural networks to characterize 3-tier web service workload that exhibits non-linear behavior [YL+06]. Again, we see the predominance of a model-based approach in which a preconceived notion of the system dominates the analysis results. We hope to create a less biased characterization methodology for our future experiments.

3. Workload Characterization and Generation Methodology

[Figure 1: Workload Generation Framework. The diagram outlines each component we will use to implement our methodology: traces, derived metrics, clustering into Wstatic, stratified sampling into Wdynamic, projection of future behavior from historical trends (t=n-2, t=n-1, t=n), statistical scale-up, and synthetic workload generation (open/closed/AJAX models), with model creation taking place behind the firewall.]

The figure above shows our proposed framework for capturing traces, condensing essential features of the data, and using that information to replay workload. Below we elaborate on each step of the process and discuss how we validate the generated workload.

Methodology:

1. Derive metrics of interest: We must identify which features of the trace data are most useful in capturing the relationship between web server traces and system behavior. We can use a combination of feature-selection algorithms and low-level metric data gathered as suggested by knowledgeable domain experts. Formally speaking: Metrics = set of feature vectors.
2. Clustering: Once we have identified useful features, we must cluster the workload trace data based on the system behavioral features of interest. Each cluster will consist of a specific set of requests that produce particular system effects. In this step, we must identify the most appropriate clustering algorithm for our data set and define distance metrics between data points. Once we have clusters, we can summarize the features of each cluster and use the summary as an anonymized version of the cluster, which we refer to as an equivalence class. We elaborate on this step in subsequent sections of the paper. Formally speaking: Cluster = set of related metrics, given a pair-wise distance and a clustering algorithm. Let N = number of equivalence classes.

3. Understanding the static workload: Workload tends to be of two types, static and dynamic. In this phase, we study the static workload. Typically, several relationships between the workload and system metrics are static, such as database-intensive requests and CPU-intensive requests. The clustering analysis helps us identify static effects of the workload on the system (Wstatic, as enumerated in the framework). These properties can be formally described as follows: Wstatic = {cluster centroid_i, cluster radius_i}, where 1 ≤ i ≤ number of equivalence classes.

4. Stratified sampling: Once we determine the various clusters (and/or equivalence classes), we sample data from the original trace in proportion to the specific workload mix we would like from each cluster or equivalence class. With this information, we can generate the dynamic component of the workload.

5. Dynamic workload: Several aspects of the workload may not be present in the traces but are aspects we would like to simulate. For example, load spikes may not be captured in a given trace, but we may choose to simulate them by scaling up trace data and adding representative synthetic filler data. We can generate a variety of Wdynamic mixes based on the task we would like to perform; for example, each Wdynamic can represent a different usage pattern (request sequences, inter-arrival rates, etc.). Furthermore, because Wdynamic is a simple transition matrix between equivalence classes, it contains no proprietary data and can be shared outside a corporate domain without privacy concerns. More formally: Wdynamic = N x N transition probability matrix (a small sketch of constructing such a matrix appears after this list).

6. Projecting future behavior: Given Wdynamic matrices for workload at different points in time, we can use past trends to extrapolate future workload characteristics. This step helps us understand the evolving nature of the system and its workload, but it is not mandatory for simple workload generation.

7. Statistically scaling up: We can use Wdynamic to seed our synthetic workload and scale it up to the desired volume to stress-test the system. This step helps us determine the behavior of the target system under high load, find the saturation point, and observe how the system behaves in its saturated state.

8. Synthetic workload generation: Lastly, we perform the actual load generation based on the seven steps above and the associated statistical learning processes. We would like the user of the workload generator to specify certain characteristics of the synthetic workload and custom-configure the generator based on their testing goals.

Closely following the above eight steps would guarantee that our workload generator is scalable and adaptable. However, as the saying goes, "the proof is in the pudding" as far as being able to generate realistic workload.
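As referenced in step 5 above, the transition structure of the workload can be captured without any proprietary request content. The following is a minimal illustrative sketch (in Python) of how such a matrix might be estimated from a trace whose requests have already been mapped to equivalence classes, and how a synthetic class sequence could then be scaled up from it (step 7). The label sequence, function names, and the uniform fallback for unseen classes are assumptions made only for the example, not part of our current implementation.

    import random

    def build_wdynamic(class_sequence, n_classes):
        """Estimate the N x N transition probability matrix (Wdynamic) from an
        ordered sequence of equivalence-class labels (integers 0..N-1)."""
        counts = [[0] * n_classes for _ in range(n_classes)]
        for prev, curr in zip(class_sequence, class_sequence[1:]):
            counts[prev][curr] += 1
        matrix = []
        for row in counts:
            total = sum(row)
            # Classes never observed as a predecessor fall back to a uniform row.
            matrix.append([c / total if total else 1.0 / n_classes for c in row])
        return matrix

    def generate_class_sequence(wdynamic, start_class, length, seed=0):
        """Walk the transition matrix to emit a synthetic, scaled-up sequence of
        equivalence classes; concrete requests would then be drawn from each
        class by stratified sampling (step 4)."""
        rng = random.Random(seed)
        seq, current = [start_class], start_class
        for _ in range(length - 1):
            current = rng.choices(range(len(wdynamic)), weights=wdynamic[current])[0]
            seq.append(current)
        return seq

    # Example with three equivalence classes observed in a short labeled trace.
    labels = [0, 0, 1, 2, 1, 0, 0, 1, 2, 2, 0]
    w = build_wdynamic(labels, n_classes=3)
    synthetic = generate_class_sequence(w, start_class=0, length=20)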
Validating the synthetic workload: The success of our workload generator is not defined solely by the ability to generate synthetic workload; it also requires that the generated workload be realistic. To evaluate this aspect of our workload generator, we must compare the behavior of the system under trace-based workload to its behavior under our generated synthetic workload. We plan to take advantage of virtual machine technology for this purpose. Because virtual machines provide observability, we can run a variety of applications on them, replay the traces, observe system behavior, and then repeat the process by replaying our synthetically generated workload. Comparing the two behavioral patterns closes the feedback loop and tells us how accurately our synthetic workload reproduces system behavior.

4. Clustering for Workload Characterization

As a proof of concept for applying machine learning to workload characterization, we implemented a few clustering algorithms and evaluated their performance against a set of HTTP traces. The algorithms, data set, and experimental results are described below.

Clustering Algorithms: Clustering algorithms approximate a loosely defined optimal assignment problem. In many cases, including ours, the number of desired clusters is unknown, making it difficult to select a concrete number of clusters and formulate the clustering task as an integer program. We implemented two approximate solutions to clustering:

                 K-Means/K-Median Clustering               Agglomerative Clustering
Initialization   Choose k data points uniformly at         Each data point is the center of a
                 random from the full data set             singleton cluster
Iteration        1) Assign each data point to the          1) Merge the two clusters whose
                 nearest cluster center                    centers are closest to each other
                 2) Recompute cluster centers as the       2) Recompute cluster centers as the
                 "average" of all member data points       "average" of all member data points
Runtime          O(n^2)                                    O(n) (one new cluster per iteration)

Figure 2: Comparison of Clustering Algorithms
Iterative k-means/medians clustering: This algorithm chooses k data points uniformly at random from the data set as the initial cluster centroids. At each iteration, each data point joins the cluster whose centroid is closest to it relative to all other cluster centroids, and the centroid of each cluster is recomputed as the average/median of all member data points. The algorithm iterates until cluster membership stabilizes. In addition, we forbid singleton clusters: a cluster containing only one data point ceases to exist, and its constituent joins the nearest remaining cluster. k-means/medians clustering implements the expectation-maximization (EM) algorithm for a uniform prior distribution over cluster membership. While easy to compute, it is sensitive to the choice of initial cluster centroids (which can be alleviated by clever seeding techniques, as discussed later). The random seeding approach also causes non-deterministic cluster formation. Furthermore, this algorithm does not provide much information about structure within clusters, and hence is not as informative as the next algorithm we discuss.

Hierarchical agglomerative clustering: This algorithm starts with n singleton clusters, one per data point. At each iteration, it identifies the pair of clusters that are closest to each other and merges them; the centroid of the newly merged cluster is the average of all data points in both original clusters. Agglomerative clustering implements a deterministic bottom-up approximation to the optimal assignment; it provides more consistent behavior than k-means clustering and returns a binary tree that gives some insight into the internal structure of the clusters it finds. The drawback is that greedy clustering based on minimizing inter-cluster distance can lead to odd groupings in the outcome.

Both algorithms required us to define data-specific clustering features and distance metrics; we discuss the data and the relevant features and metrics in the following sections (a small illustrative sketch of such a clustering loop appears below).
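Because both algorithms only ever consult a pairwise distance (and, as discussed under Distance Metrics below, our centroids must be represented implicitly), the assignment loop can be written generically over any distance function. Below is a minimal illustrative Python sketch of such a loop; it is not our implementation, and the toy distance function and sample tuples are assumptions made only for the example.

    import random

    def cluster_with_distance(points, k, distance, max_iter=20, seed=0):
        """Illustrative k-means-style loop with a pluggable pairwise distance.
        Because several features are non-numeric, clusters keep no explicit
        centroid: a point's distance to a cluster is its average pairwise
        distance to the cluster's current members."""
        rng = random.Random(seed)
        clusters = [[p] for p in rng.sample(points, k)]   # random seeding
        for _ in range(max_iter):
            assignment = [[] for _ in clusters]
            for p in points:
                # Assign p to the cluster with the smallest average distance.
                best = min(
                    range(len(clusters)),
                    key=lambda i: sum(distance(p, m) for m in clusters[i]) / len(clusters[i]),
                )
                assignment[best].append(p)
            # Drop clusters that lost all members (the full variant would also
            # dissolve singleton clusters into their nearest neighbor).
            assignment = [c for c in assignment if c]
            if assignment == clusters:
                break
            clusters = assignment
        return clusters

    # Toy usage with (object_type, response_size) tuples; this distance is only
    # a stand-in for the per-feature metrics defined later in the paper.
    def toy_distance(a, b):
        return (0 if a[0] == b[0] else 1) + abs(a[1] - b[1]) / 1000.0

    data = [("gif", 800), ("gif", 900), ("html", 12000), ("jpg", 15000), ("html", 11000)]
    found = cluster_with_distance(data, k=2, distance=toy_distance)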
Raw Data: We used the 1998 World Cup web traces to apply our clustering techniques for characterizing workload. These traces were obtained from www.france98.com and parsed from compressed binary logs into a human-readable format (shown below). Each log file contained one day of data (roughly 1.5 million requests). Because of the large volume of data in each log file, we were only able to use one day of data at a time to prevent memory exhaustion.

ClientID  TimeStamp  Method  URL/Object Requested                            Type  Protocol  Resp  Size   Server
595       894146590  GET     /images/10538.jpg                               jpg   HTTP/1.0  200   15770  32
1002      894146590  GET     /english/images/archives.gif                    gif   HTTP/1.0  200   869    32
2546      894146590  GET     /english/frntpage.htm                           htm   HTTP/1.0  200   12518  32
26637     894146590  GET     /english/history/images/reading.GIF             GIF   HTTP/1.0  200   1171   32
26637     894146590  GET     /english/history/images/history_hm_header.gif   gif   HTTP/1.0  200   688    32
33534     894146590  GET     /images/saintetienne.gif                        gif   HTTP/1.1  304   0      32
33808     894146590  GET     /french/venues/images/venue_bu_city_off.gif     gif   HTTP/1.0  200   781    32
33808     894146590  GET     /french/venues/images/venue_bu_statium_on.gif   gif   HTTP/1.0  200   1017   32

Figure 3: Raw Data Format

Below is an explanation of each of the fields in the raw data:

• clientID - a unique integer identifier for the client that issued the request (including proxies). Mappings are one-to-one and consistent across all traces in this data set; they are anonymized for privacy reasons.
• timestamp - the time of the request (converted to GMT), stored as the number of seconds since the Epoch. Local time can be determined by adding 2 hours to the timestamp.
• method - the method contained in the client's request (e.g., GET). Mappings are contained in src/*/definitions.h.
• objectID - the URL of the specific object requested. These mappings are also one-to-one and preserved across the entire data set.
• objectType - the type of file requested (e.g., HTML, IMAGE), generally based on the file extension (.html) or the presence of a parameter list (e.g., '?' indicates a DYNAMIC request). If the URL ends with '/', it is considered a DIRECTORY. Mappings from the integer ID to the generic file type are contained in definitions.h; more specific mappings can be obtained by analyzing the object mappings file (state/object_mappings.sort).
• protocol/version - the HTTP version indicated in the client's request (e.g., HTTP/1.0).
• response - the response status code (e.g., 200 OK).
• size - the number of bytes in the response sent from the server back to the client.
• serverID - indicates which server handled the request. We can further parse this field to identify the exact server and location. The mappings are one-to-one and consistent throughout the traces.
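For illustration, assuming a whitespace-separated text rendering of the parsed logs in the nine-column layout of Figure 3 (the published logs are compressed binaries that must first be expanded with the accompanying tools), one line can be turned into a typed record as follows; the field and type names are our own and purely illustrative.

    from collections import namedtuple

    Request = namedtuple(
        "Request",
        "client_id timestamp method url obj_type protocol response size server_id",
    )

    def parse_line(line):
        """Parse one whitespace-separated line of the human-readable trace
        format shown in Figure 3 into a typed record (nine-column layout)."""
        f = line.split()
        return Request(
            client_id=int(f[0]),
            timestamp=int(f[1]),
            method=f[2],
            url=f[3],
            obj_type=f[4],
            protocol=f[5],
            response=int(f[6]),
            size=int(f[7]),
            server_id=int(f[8]),
        )

    sample = "595 894146590 GET /images/10538.jpg jpg HTTP/1.0 200 15770 32"
    req = parse_line(sample)   # Request(client_id=595, ..., server_id=32)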
Derived Features:

LastReqTime  LastReqObjType  LastReqURL
0            gif             /images/102338.gif
11           gif             /english/images/news_btn_letter_off.gif
2            gif             /images/hm_brdr.gif
FALSE        FALSE           FALSE
0            GIF             /english/history/images/reading.GIF
2            gif             /images/lyon.gif
1            html            /french/venues/venues/bordeaux.html
0            gif             /french/venues/images/venue_bu_city_off.gif

While the raw data was in itself very descriptive, it gave us no information about the implications of the traces on the system. To add some structure to trace sequences, we captured temporal patterns by including historical data with each client request.

We derived the following features and added them to the request data:

• Time since last request: For each request, we obtained the timestamp of the previous request made by the same client and computed the difference between the current and previous timestamps. This information helps us identify (to a limited degree) user behavior and think-time models.
• Last object requested: For each request, we also recorded the previous object requested by the same client, so that we can build a request transition graph and identify what common request streams look like.
• Last requested object type: In addition to the path of the previous object, we extracted its type. This data helps us identify patterns in sequences of object types that are requested together; for example, if a client is requesting a gif, we can extrapolate what its most likely next request will be (perhaps another gif or a jpg).
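A single ordered pass over the parsed requests is enough to attach these per-client history fields. The sketch below assumes the illustrative Request records from the earlier parsing example and timestamp-sorted input, and uses None where the figure above shows FALSE (a client's first observed request); it is an illustration, not our implementation.

    def add_history_features(requests):
        """Annotate each parsed Request with the three derived fields: time
        since the client's last request, the last URL it requested, and that
        object's type. Requests are assumed to be in timestamp order."""
        last_seen = {}   # client_id -> (timestamp, url, obj_type)
        enriched = []
        for r in requests:
            prev = last_seen.get(r.client_id)
            if prev is None:
                history = {"last_req_time": None, "last_req_url": None, "last_req_type": None}
            else:
                history = {
                    "last_req_time": r.timestamp - prev[0],
                    "last_req_url": prev[1],
                    "last_req_type": prev[2],
                }
            enriched.append((r, history))
            last_seen[r.client_id] = (r.timestamp, r.url, r.obj_type)
        return enriched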
Clustering Features: We eventually plan to use feature selection once we have traces of the requests' effects on the system. For now, we manually chose the following features to cluster on:

• Timestamp
• URL of the requested object
• Type of the requested object
• Response size
• Time since last request
• Previous request's object URL and object type
• Server handling the response

As and when we have more low-level metric data for the system, we plan to add those metrics as features by temporally correlating them with the requests input to the system.

Distance Metrics: Clustering with numerical data permits a range of optimizations. For one, the centroid of a cluster can be represented explicitly as the (probability-weighted or unweighted) average of all data points in that cluster. This works for numeric and binary vectors of data; for example, a cluster centroid could be a vector of probabilities representing the likelihood that an article contains a particular word given its membership in the cluster. However, many of our features are discrete-valued and lack a natural ordering consistent with the difference operators we used; string identifiers are the best example of this problem. In our data, in addition to plain strings, we also have structural information embedded in strings: the object URL field contains a string with path information for the requested object. This forced us to represent cluster centroids implicitly: the distance between a data point and a cluster is the average pairwise distance over all members of that cluster. This affects both the performance of our clustering algorithms (agglomerative clustering must compute the distance between a new cluster and an existing cluster using all pairs of data points from the two clusters, instead of reducing the new cluster to a single data point) and the interpretability of the results (we cannot synthesize a single data point that stands for an entire cluster). Here are some examples of the non-numeric data present in our dataset and how we calculate the appropriate distance measure:

• Type of object requested: 0 or 1 based on whether the two types are the same. Later extensions could categorize by media, text, AJAX code, etc.
• Server handling the response: 0 (same server, same location), 1 (different server, same location), 2 (different server, different location).
• Time since last request: currently Euclidean distance (subtraction). Later we can break this down by time of day, day of week, etc., once we use more data.
• Path to requested object: currently a string distance. Logically, however, this should be a "path distance" in which we count the number of hops between two requested URLs. For example, the distance between /france/images/blah.gif and /usa/images/blee.gif can be defined as 2 (the common root is / and the distance from each leaf to it is 2); the distance between /english/help/image/faq_on.GIF and /english/images/fpnewstop.gif is 2 (the common root is /english and the maximum distance from a leaf to /english is 2); the distance between /english/image/blah1.gif and /english/image/blah2.gif is 1; and the distance is 0 when both object URLs are the same.
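These per-feature distances combine naturally into one weighted distance, which is also how the feature-weighting experiments below can be realized. The sketch that follows is illustrative rather than our implementation: the path-distance function is one possible reading that is consistent with the worked examples above, the server distance is a simplified stand-in (our data would also distinguish same-location from different-location servers), and the weight dictionary is hypothetical.

    def path_distance(url_a, url_b):
        """Hop-based distance between two object URLs: the larger number of
        directory components below the deepest common directory, or 1 if only
        the file names differ (0 for identical URLs)."""
        if url_a == url_b:
            return 0
        dirs_a, dirs_b = url_a.split("/")[1:-1], url_b.split("/")[1:-1]
        common = 0
        while common < min(len(dirs_a), len(dirs_b)) and dirs_a[common] == dirs_b[common]:
            common += 1
        below = max(len(dirs_a) - common, len(dirs_b) - common)
        return below if below > 0 else 1

    def server_distance(a, b):
        # Simplified stand-in; the full metric also compares server locations.
        return 0 if a == b else 1

    def request_distance(r1, h1, r2, h2, weights):
        """Weighted sum of per-feature distances between two enriched requests
        ((Request, history) pairs from the earlier sketches). Raising one
        weight mirrors the 'double weight' experiments described below."""
        d = {
            "obj_type": 0 if r1.obj_type.lower() == r2.obj_type.lower() else 1,
            "server": server_distance(r1.server_id, r2.server_id),
            "size": abs(r1.size - r2.size),
            "url": path_distance(r1.url, r2.url),
            "think_time": abs((h1["last_req_time"] or 0) - (h2["last_req_time"] or 0)),
        }
        return sum(weights[k] * d[k] for k in d)

    # Sanity checks against the worked examples in the text above.
    assert path_distance("/france/images/blah.gif", "/usa/images/blee.gif") == 2
    assert path_distance("/english/help/image/faq_on.GIF", "/english/images/fpnewstop.gif") == 2
    assert path_distance("/english/image/blah1.gif", "/english/image/blah2.gif") == 1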
Experimental Results

This section discusses the output of the automated clustering algorithms. We did not have a clear sense of how many clusters to expect, so we ran agglomerative clustering to completion and completed multiple runs of the k-means clustering algorithm with different initial centroids and cluster counts. Both algorithms output cluster members, but as described above, the raw data format precluded construction of an "average" cluster member. Below, we describe the experiments and our observations of the resulting clusters:

• Equal weight for all features: We started by weighting all features in our data set equally. Running the clustering algorithms consistently produced essentially random clusters; there was no consistent pattern among the requests within each cluster, and the results were not meaningful. We therefore decided to increase the weight of individual features to see whether that yields interpretable results.
• Double weight for the type of requested object: This experiment produced clusters that predominantly contained requests for the same object type (e.g., all gif, all html). However, there was no clear correlation between requests within a cluster other than the matching object types. While these clusters were better than those of the previous experiment, the same information could easily be obtained by sorting the data, so the clustering added little value relative to its overhead.
• Double weight for time since last request: In this experiment, otherwise unrelated requests were clustered together based on similar values of time since the last request from the same client. This information is useful for understanding usage behavior: we observed which types of requests come after a significant amount of think time and which ones follow very short think times. However, the experiment did not shed much light on characterizing the raw workload itself.
• Increased weight for response size and object type: Increasing the priority of object type and response size together, while leaving all other features at lower weights, produced some meaningful insights. Requests for similar objects (for example, images) were further subdivided by similar response sizes, so we noticed clusters with similar request patterns. However, we did not gain any further insight beyond this.

In the absence of cluster exemplars, we are left with the task of manually inspecting the clusters found by our application. Working our way from the leaves to the root of the cluster tree generated by the agglomerative algorithm, the clusters become harder to interpret. For the most part, our data produced very scattered requests in each cluster, and we did not find very meaningful characterization results. Our concern is that the choice of features might bias the results in favor of a particular hypothesis: one particular axis of the data could end up dominating the clustering results, and the absence of sufficient feature vectors could exaggerate this problem.

5. Discussion and Future Work

The biggest drawback of our data set was the lack of measurable, quantifiable features that reveal information about the target system for which the requests were generated. We plan to alleviate this problem by replaying the traces on a virtual machine and recording various low-level system metrics such as power consumption and CPU utilization. This would enrich the data with more observed and meaningful features and significantly help the workload characterization process.

It was also difficult to define "distance" between non-numeric data, and we had to invent workarounds for this problem. The most important of these non-numeric distance metrics is the path distance between two requested object URLs. Once we implement this distance metric, the data set and features will be richer and more informative.

Another major improvement in the way the algorithms handle the data is to perform seeded clustering analysis. Currently, the k-means algorithm randomly selects the initial centroids used for clustering; if we instead intelligently select initial cluster centroids that are spread across the data space, our clustering results may prove more meaningful.

The last, and probably most prohibitive, aspect of our data was the sheer volume of requests per day. There were over 1.5 million requests in a single day of trace data, and our tools were not built to handle this volume. Thus, our analysis was limited to a single day of data and did not span multiple days as we would have liked. As a result, the analysis is skewed by behavioral patterns observed in a single day, and there is no way to normalize the clustering results against other days of data. As a temporary workaround, we can experiment with stream-processing the log files to cluster across multiple days, although there are practical limits to this approach given the available memory on our machines. The long-term solution might be to stratify the data per day and sample data from each day so that the working set represents requests spanning multiple days.
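As a rough sketch of that per-day stratify-and-sample workaround, one could stream each day's log and keep a bounded uniform sample via reservoir sampling, so that a multi-day working set stays within memory. The file names, the per-day budget, and the reuse of the earlier illustrative parse_line helper are all assumptions made for the example.

    import random

    def sample_per_day(day_files, per_day=50000, seed=0):
        """Stream one parsed log file per day and keep a fixed-size uniform
        sample from each (reservoir sampling), so requests spanning multiple
        days fit in memory together."""
        rng = random.Random(seed)
        combined = []
        for path in day_files:
            reservoir = []
            with open(path) as f:
                for i, line in enumerate(f):
                    r = parse_line(line)          # from the parsing sketch above
                    if len(reservoir) < per_day:
                        reservoir.append(r)
                    else:
                        j = rng.randint(0, i)
                        if j < per_day:
                            reservoir[j] = r
            combined.extend(reservoir)
        return combined

    # e.g. working_set = sample_per_day(["day66.txt", "day67.txt"])  # illustrative paths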
Perhaps we should reconsider the clustering approach altogether and find an alternative algorithm that does not require tweaking the data so much. One conceivable (and appropriate) alternative is Kernelized Canonical Correlation Analysis (KCCA). This hybrid approach projects the data onto lower dimensions while clustering it, resulting in roughly spherical clusters. It can operate on tuples of request data and measured metrics and preserve their relationship through the clustering process. This algorithm is the next action item we plan to pursue for our project.
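As a stepping stone toward that plan, the idea can be prototyped with plain (linear) CCA from scikit-learn before moving to the kernelized variant; the sketch below uses synthetic arrays, and the array shapes and variable names are assumptions made only for illustration.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Toy stand-ins: one row per request window, with request features on one
    # side and measured system metrics (e.g., CPU utilization, power) on the other.
    rng = np.random.default_rng(0)
    request_features = rng.normal(size=(200, 6))
    system_metrics = rng.normal(size=(200, 3))

    # Linear CCA finds paired projections that maximize the correlation between
    # the two views; KCCA replaces the inner products with kernels.
    cca = CCA(n_components=2)
    cca.fit(request_features, system_metrics)
    req_proj, met_proj = cca.transform(request_features, system_metrics)
    # req_proj and met_proj could then be clustered jointly in the shared
    # low-dimensional space while preserving the request/metric relationship.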
6. Conclusions

There are various benefits to our proposed workload characterization and generation methodology. First and foremost, the methodology takes advantage of Machine Learning to automatically distill the important parameters of the workload, which helps ensure that we do not neglect important features because of biases toward system-specific behavior. We can also generate models specific to a metric of interest, for example performance, power, throughput, user-response latency, or any combination of these. Next, we can compare different generations of models (e.g., different systems or different points in time) to identify trends in the data over time and extrapolate to discover the future behavior of the system. This ability to "fast-forward" will prove extremely useful to web service developers and operators. Because we distinguish between model creation (inside companies, operating on sensitive data) and model publishing (sanitized data for use by researchers), we address the privacy concern and allow organizations to readily contribute trace data for us to process and use. This methodology and mechanism are by no means restricted to web services and can easily be generalized to other applications.

References:

[AJ00] M. Arlitt and T. Jin. A workload characterization study of the 1998 World Cup web site. IEEE Network, 2000.
[BC98] P. Barford and M. Crovella. Generating representative web workloads for network and server performance evaluation. In Proceedings of ACM SIGMETRICS '98, June 1998.
[CM06] G. Contreras and M. Martonosi. Techniques for real-system characterization of Java Virtual Machine energy and power behavior. IISWC, 2006.
[EV+02] L. Eeckhout, H. Vandierendonck, and K. De Bosschere. How input data sets change program behaviour. Workshop on Computer Architecture Evaluation using …, 2002.
[Filebench] www.solarisinternals.com/si/tools/filebench/index.php
[Hammer] www.empirix.com/
[Harpoon] J. Sommers and P. Barford. Self-configuring network traffic generation. In Proceedings of the ACM SIGCOMM Internet Measurement Conference, 2004.
[MACE] J. Sommers, V. Yegneswaran, and P. Barford. A framework for malicious workload generation. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, Taormina, Sicily, Italy.
[MC+05] J. Moore, J. Chase, K. Farkas, and P. Ranganathan. Data center workload monitoring, analysis, and emulation. Eighth Workshop on Computer Architecture Evaluation using …, 2005.
[Optixia] www.ixiacom.com/products/optixia/
[SLAMD] www.slamd.com/
[SPECweb] http://www.spec.org/web2005/
[StreamGen] M. Mansour, M. Wolf, and K. Schwan. A workload generation tool for distributed information flow applications. In Proceedings of the 2004 International Conference on Parallel Processing (ICPP'04), August 2004.
[TPC-W] www.tpc.org/tpcw/default.asp
[YL+06] R. Yoo, H. Lee, K. Chow, and H.-H. Lee. Constructing a non-linear model with neural networks for workload characterization. IISWC, 2006.