Recent workload trends indicate rapid growth in the deployment of machine learning, genomics, and scientific workloads using Apache Spark. However, efficiently running these applications on cloud computing infrastructure like Amazon EC2 is challenging, and we find that choosing the right hardware configuration can significantly improve performance and reduce cost. The key to addressing this challenge is the ability to predict application performance under various resource configurations, so that we can automatically choose the optimal configuration. We present Ernest, a performance prediction framework for large-scale analytics. Ernest builds performance models based on the behavior of the job on small samples of data and then predicts its performance on larger datasets and cluster sizes. Our evaluation on Amazon EC2 using several workloads shows that our prediction error is low, while the training overhead is less than 5% for long-running jobs.
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman
1. Ernest
Efficient Performance Prediction for
Advanced Analytics on Apache Spark
Shivaram Venkataraman, Zongheng Yang
Michael Franklin, Benjamin Recht, Ion Stoica
10. User Concerns
"What is the cheapest configuration to run my job in 2 hours?"
"Given a budget, how fast can I run my job?"
"What kind of instances should I use on EC2?"
18. BASIC Model
time = x1 + x2 * (input / machines) + x3 * log(machines) + x4 * machines
  x1: serial execution
  x2 * (input / machines): computation (linear in per-machine input)
  x3 * log(machines): tree DAG
  x4 * machines: all-to-one DAG
Collect training data, then fit a linear regression.
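As a concrete illustration of "collect training data, fit a linear regression", here is a minimal sketch in Python using non-negative least squares (NNLS), which keeps every coefficient interpretable as a cost. The measurements, the log base, and the `featurize` helper are illustrative assumptions, not Ernest's actual code.

```python
import numpy as np
from scipy.optimize import nnls

def featurize(scale, machines):
    """Ernest's basic model terms: serial, per-machine compute,
    tree DAG (log), all-to-one DAG (linear in machines)."""
    return np.array([1.0, scale / machines, np.log2(machines), machines])

# Hypothetical training runs: (input fraction, machines, measured time in s)
runs = [(0.01, 1, 30.9), (0.02, 1, 50.2), (0.02, 2, 36.4),
        (0.04, 2, 55.7), (0.04, 4, 42.5), (0.08, 8, 48.8)]

A = np.array([featurize(s, m) for s, m, _ in runs])
b = np.array([t for *_, t in runs])
x, _ = nnls(A, b)  # non-negative coefficients keep each term interpretable

# Predict the full dataset (scale = 1.0) on a 64-machine cluster
print(featurize(1.0, 64) @ x)
```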
19. Does THE MODEL match real applications?

                    |     Compute      |       Communication      |
Algorithm           | O(n/p) | O(n2/p) | O(1) | O(p) | O(log(p)) | Other
--------------------+--------+---------+------+------+-----------+------------
GLM Regression      |   X    |         |      |  X   |     X     |
KMeans              |   X    |         |      |  X   |     X     | O(centers)
Naive Bayes         |   X    |         |      |  X   |           |
Pearson Correlation |   X    |         |      |  X   |           |
PCA                 |   X    |         |      |  X   |     X     |
QR (TSQR)           |   X    |         |      |      |     X     |

Number of data items: n, number of processors: p; constant number of features assumed.
20. COLLECTING TRAINING DATA
Grid of (input fraction, machines): input in {1%, 2%, 4%, 8%}, machines in {1, 2, 4, 8}.
Associate a cost with each experiment.
Baseline: run the cheapest configurations first, as sketched below.
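A minimal sketch of this baseline, assuming cost is proxied by estimated running time (scale / machines), since the paper uses the time taken to collect training data as the cost:

```python
# Enumerate the grid and run the cheapest experiments first.
scales   = [0.01, 0.02, 0.04, 0.08]
machines = [1, 2, 4, 8]

experiments = [(s, m) for s in scales for m in machines]
# Cost proxy: estimated running time ~ scale / machines (assumption)
experiments.sort(key=lambda sm: sm[0] / sm[1])

for s, m in experiments:
    print(f"run {s:.0%} of the input on {m} machine(s)")
```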
21. OPTIMAL Design of EXPERIMENTS
Experiment design refers to the study of how to collect measurements so that the model trained on them is as accurate as possible. Given a linear model, consider the problem of estimating a vector x in R^n from m measurements

    y_i = a_i^T x + w_i,   i = 1, ..., m,

where w_i is measurement noise. We assume the w_i are independent Gaussian random variables with zero mean and unit variance, and that the measurement vectors a_1, ..., a_m span R^n. The maximum likelihood estimate of x, which is also the minimum variance estimate, is given by the least-squares solution

    x_hat = (sum_{i=1}^{m} a_i a_i^T)^{-1} sum_{i=1}^{m} y_i a_i.

The associated estimation error e = x_hat - x has zero mean and covariance matrix (sum_{i=1}^{m} a_i a_i^T)^{-1}. Lower variance means a better model.

Let lambda_i be the fraction of times experiment i is run. We can then minimize the trace of the covariance matrix:

    minimize    tr((sum_{i=1}^{m} lambda_i a_i a_i^T)^{-1})
    subject to  0 <= lambda_i <= 1

Using Experiment Design: The predictive model described in the previous section can be formulated as an experiment design problem. The feature set used for model design consists of just the scale of the data and the number of machines; we solve the experiment design problem above and only run those experiments whose lambda_i values are non-zero.

Accounting for Cost: One additional factor we need to consider is that each experiment costs a different amount. This cost could be in terms of time (it is more expensive to train with a larger fraction of the input) or in terms of machines (there is a fixed cost to, say, launching a machine). To bound total cost, we augment the optimization problem with a constraint that the total cost be less than some budget: given a cost function that yields a cost c_i for an experiment with scale s_i and m_i machines, we add the constraint sum_{i=1}^{m} c_i lambda_i <= B, where B is the total budget. For the rest of this paper we use the time taken to collect training data as the cost and ignore machine setup costs.
22. OPTIMAL Design of EXPERIMENTS
Same grid of (input fraction, machines): input in {1%, 2%, 4%, 8%}, machines in {1, 2, 4, 8}.
Use an off-the-shelf solver (CVX).
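The talk names CVX as the off-the-shelf solver; below is a minimal sketch of the same formulation in CVXPY, a Python analogue. The grid values follow slide 20, while the cost proxy (s/m) and the budget B are made-up assumptions for illustration.

```python
import numpy as np
import cvxpy as cp

# Candidate experiments: the grid from slide 20
scales   = [0.01, 0.02, 0.04, 0.08]
machines = [1, 2, 4, 8]
grid = [(s, m) for s in scales for m in machines]

# Feature vector of each experiment (same terms as the basic model)
A = np.array([[1.0, s / m, np.log2(m), m] for s, m in grid])
# Cost of each experiment: time to run it, proxied here by s/m (assumption)
c = np.array([s / m for s, m in grid])
B = 0.05  # total training-time budget (made up)

d = A.shape[1]
lam = cp.Variable(len(grid), nonneg=True)  # fraction of times each expt runs
M = sum(lam[i] * np.outer(A[i], A[i]) for i in range(len(grid)))

# tr(M^{-1}) written as a sum of matrix fractions e_j^T M^{-1} e_j
objective = sum(cp.matrix_frac(np.eye(d)[:, j], M) for j in range(d))
prob = cp.Problem(cp.Minimize(objective), [lam <= 1, c @ lam <= B])
prob.solve()

# Only run the experiments the design assigns non-zero weight
print([grid[i] for i in range(len(grid)) if lam.value[i] > 1e-3])
```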
25. Workloads
Keystone-ML
Spark MLlib
ADAM
GenBase
Sparse GLMs
Random Projections
OBJECTIVES
Optimal number of machines
Prediction accuracy
Model training overhead
Importance of experiment design
Choosing EC2 instance types
Model extensions
27. NUMBER OF INSTANCES: Keystone-ML
TIMIT Pipeline on r3.xlarge instances, 100 iterations
[Chart: running time (s, 0-24000) vs. number of machines (0-140)]
For a 2-hour deadline, up to 5x lower cost.
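To make that takeaway concrete, here is a small hypothetical helper that scans cluster sizes with a fitted basic model and returns the cheapest one meeting a deadline; the coefficients and the per-machine-hour price are invented for illustration.

```python
import numpy as np

# Made-up fitted coefficients: [serial, compute, tree DAG, all-to-one DAG]
x = np.array([50.0, 4.2e4, 15.0, 2.0])

def predict_time(scale, m):
    """Predicted running time (s) at data scale `scale` on m machines."""
    return x @ np.array([1.0, scale / m, np.log2(m), m])

def cheapest_config(scale, deadline_s, price_per_machine_hour=0.35):
    best = None
    for m in range(1, 129):
        t = predict_time(scale, m)
        if t <= deadline_s:
            cost = price_per_machine_hour * m * t / 3600.0
            if best is None or cost < best[1]:
                best = (m, cost)
    return best  # (machines, dollars) or None if no size meets the deadline

print(cheapest_config(scale=1.0, deadline_s=2 * 3600))
```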
30. TRAINING TIME: Keystone-ML
TIMIT Pipeline on r3.xlarge instances, 100 iterations
EXPERIMENT DESIGN: 7 data points, up to 16 machines, up to 10% of data
[Chart: training time vs. running time (s, 0-6000) at 42 machines]
31. Accuracy, Overhead: GLM REGRESSION
MLlib Regression: < 15% error
[Bar chart: actual vs. predicted vs. training time (s, 0-5000) on 64 and 45 machines; bar labels: 3.9%, 2.9%, 113.4%, 112.7%]
33. MORE DETAILS
Detecting when the model is wrong
Model extensions
Amazon EC2 variations over time
Straggler mitigation strategies
http://shivaram.org/publications/ernest-nsdi.pdf
35. IN Conclusion
Workload Trends: Advanced Analytics in the Cloud
Computation, Communication patterns affect scalability
Ernest: Performance predictions with low overhead
End-to-end linear model
Optimal experimental design
Workloads / Traces?
Shivaram Venkataraman shivaram@cs.berkeley.edu