Ernest
Efficient Performance Prediction for
Advanced Analytics on Apache Spark
Shivaram Venkataraman, Zongheng Yang 
Michael Franklin, Benjamin Recht, Ion Stoica
Workload TRENDS: ADVANCED ANALYTICS
Keystone-ML TIMIT Pipeline
Raw Data → Cosine Transform → Normalization → Linear Solver (~100 iterations)
Properties
Iterative (each iteration many jobs)
Long Running → Expensive
Numerically Intensive
Cloud Computing Choices
Amazon EC2
t2.nano, t2.micro, t2.small, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m3.medium, c4.large, c4.xlarge, c4.2xlarge, c3.large, c3.xlarge, c3.4xlarge, r3.large, r3.xlarge, r3.4xlarge, i2.2xlarge, i2.4xlarge, d2.xlarge, d2.2xlarge, d2.4xlarge, …
Microsoft Azure
Basic tier: A0, A1, A2, A3, A4
Optimized Compute: D1, D2, D3, D4, D11, D12, D13, D1v2, D2v2, D3v2, D11v2, …
Latest CPUs: G1, G2, G3, …
Network Optimized: A8, A9
Compute Intensive: A10, A11, …
Google Compute Engine
n1-standard-1, n1-standard-2, n1-standard-4, n1-standard-8, n1-standard-16, n1-highmem-2, n1-highmem-4, n1-highmem-8, n1-highcpu-2, n1-highcpu-4, n1-highcpu-8, n1-highcpu-16, n1-highcpu-32, f1-micro, g1-small, …
Instance Types and Number of Instances
Tyranny of Choice
User Concerns
“What is the cheapest configuration to run my job in 2 hours?”
“Given a budget, how fast can I run my job?”
“What kind of instances should I use on EC2?”
Do Choices Matter? Matrix Multiply
[Figure: bar chart of Time (s) for five cluster configurations: 1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large. Matrix size: 400K by 1K.]
Cores = 16, Memory = 244 GB, Cost = $2.66/hr
Do Choices Matter?
[Figure: two bar charts of Time (s) over the same five configurations (1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large). Panels: Matrix Multiply, 400K by 1K; QR Factorization, 1M by 1K. Annotations: Network Bound, Mem Bandwidth Bound.]
Do Choices Matter?
[Figure: Time (s) versus Cores, Actual and Ideal curves. r3.4xlarge instances, QR Factorization: 1M by 1K.]
Computation + Communication → Non-linear Scaling
APPROACH
Job
Data
+
Performance Model
Challenges
Black Box Jobs



Model Building Overhead
Regular Structure + Few Iterations
Modeling Jobs
Computation Patterns
Time ∝ Input
Time ∝ 1 / Machines
Communication Patterns
One-to-One: Constant
Tree DAG: Log
All-to-One: Linear
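Basic Model
$$\text{time} = x_1 + x_2 \cdot \frac{\text{input}}{\text{machines}} + x_3 \cdot \log(\text{machines}) + x_4 \cdot \text{machines}$$
$x_1$: Serial Execution
$x_2$ term: Computation (linear)
$x_3$ term: Tree DAG
$x_4$ term: All-to-One DAG
Collect Training Data
Fit Linear Regression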
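The two steps above (collect training data, fit a linear regression) can be sketched in a few lines. This is a minimal illustration, not the code from the Ernest repository: it assumes plain least squares solved through the normal equations, the helper names (`features`, `solve`, `fit`, `predict`) are hypothetical, and the training timings are synthetic values generated from made-up coefficients.

```python
import math

def features(input_frac, machines):
    """Feature vector of the basic model: [1, input/machines, log(machines), machines]."""
    return [1.0, input_frac / machines, math.log(machines), float(machines)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit(samples):
    """Least-squares fit; each sample is (input_frac, machines, seconds)."""
    rows = [features(s, m) for s, m, _ in samples]
    y = [t for _, _, t in samples]
    k = len(rows[0])
    # Normal equations: (A^T A) x = A^T y
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Aty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(AtA, Aty)

def predict(x, input_frac, machines):
    """Predicted running time for a given input fraction and machine count."""
    return sum(c * f for c, f in zip(x, features(input_frac, machines)))

# Synthetic timings from made-up coefficients (assumption, for illustration only).
true_x = [5.0, 200.0, 2.0, 0.5]
train = [(s, m, predict(true_x, s, m))
         for s in (0.01, 0.02, 0.04, 0.08) for m in (1, 2, 4, 8)]
x = fit(train)
full_run_estimate = predict(x, 1.0, 32)  # extrapolate to the full input on 32 machines
```

Because running times cannot be negative contributions in this model, a non-negative least-squares solver may be a better fit in practice; the plain solver here is only the simplest choice.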
Does the Model Match Real Applications?
Columns: Algorithm | Compute: O(n/p), O(n²/p) | Communication: O(1), O(p), O(log(p)), Other
GLM Regression: ✔ ✔ ✔
KMeans: ✔ ✔ ✔, Other: O(center)
Naïve Bayes: ✔ ✔
Pearson Correlation: ✔ ✔
PCA: ✔ ✔ ✔
QR (TSQR): ✔ ✔
Number of data items: n, Number of processors: p, Constant number of features
Collecting Training Data
Grid of (input, machines): Input in {1%, 2%, 4%, 8%}, Machines in {1, 2, 4, 8}
Associate cost with each experiment
Baseline: Cheapest configurations first
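The baseline on this slide can be sketched as follows: enumerate the grid, attach a cost to each candidate experiment, and take the cheapest ones first. The cost function here is an assumption (input fraction divided by machine count, as a rough proxy for running time); the slide does not specify one, and `experiment_cost` and `n_points` are hypothetical names.

```python
from itertools import product

input_fracs = [0.01, 0.02, 0.04, 0.08]   # 1%, 2%, 4%, 8% of the input
machine_counts = [1, 2, 4, 8]

def experiment_cost(input_frac, machines):
    """Hypothetical cost proxy: expected time if compute scaled perfectly."""
    return input_frac / machines

# Every (input, machines) pair in the grid is a candidate experiment.
candidates = sorted(product(input_fracs, machine_counts),
                    key=lambda c: experiment_cost(*c))

# Baseline: run the n cheapest experiments first.
n_points = 7
baseline = candidates[:n_points]
```

A purely cost-ordered baseline tends to cluster in one corner of the grid, which is the motivation for choosing experiments by their information content instead.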
Optimal Design of Experiments
Given a Linear Model

Consider the problem of estimating a vector $x \in \mathbf{R}^n$ from measurements or experiments

$$y_i = a_i^T x + w_i, \quad i = 1, \ldots, m,$$

where $w_i$ is measurement noise. We assume that $w_i$ are independent Gaussian random variables with zero mean and unit variance, and that the measurement vectors $a_1, \ldots, a_m$ span $\mathbf{R}^n$. The maximum likelihood estimate of $x$, which is the same as the minimum variance estimate, is given by the least-squares solution

$$\hat{x} = \left( \sum_{i=1}^{m} a_i a_i^T \right)^{-1} \sum_{i=1}^{m} y_i a_i.$$

The associated estimation error $e = \hat{x} - x$ has zero mean and covariance matrix

$$\left( \sum_{i=1}^{m} a_i a_i^T \right)^{-1}.$$

Lower variance → Better model
$\lambda_i$: Fraction of times each experiment is run

We can set $\lambda_i$ as the fraction of time an experiment is chosen and minimize the trace of the covariance matrix:

$$\text{Minimize } \operatorname{tr}\left( \left( \sum_{i=1}^{m} \lambda_i a_i a_i^T \right)^{-1} \right) \quad \text{subject to } \lambda_i \ge 0,\ \lambda_i \le 1$$

Using Experiment Design: The predictive model we described in the previous section can be formulated as an experiment design problem. The feature set used for model design consisted of just the scale of the data used and the num…

… design setup described above and only choose to run those experiments whose values are non-zero.

Bound total cost

Accounting for Cost: One additional factor we need to consider in using experiment design is that each experiment we run costs a different amount. This cost could be in terms of time (i.e., it is more expensive to train with a larger fraction of the input) or in terms of machines (i.e., there is a fixed cost to, say, launching a machine). To account for the cost of an experiment we can augment the optimization problem we set up above with an additional constraint that the total cost should be less than some budget. That is, if we have a cost function which gives us a cost $c_i$ for an experiment with scale $s_i$ and $m_i$ machines, we add a constraint to our solver that $\sum_{i=1}^{m} c_i \lambda_i \le B$, where $B$ is the total budget. For the rest of this paper we use the time taken to collect training data as the cost and ignore any machine setup costs as we can…
Optimal Design of Experiments
Grid of (input, machines): Input in {1%, 2%, 4%, 8%}, Machines in {1, 2, 4, 8}
Use off-the-shelf solver (CVX)
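To make the quantity handed to the solver concrete, the sketch below evaluates the experiment-design objective, the trace of the inverse of the weighted sum of $a_i a_i^T$, together with the budget term (the sum of $c_i \lambda_i$), for two candidate designs. It does not solve the convex program; that is what a solver such as CVX is for. Assumptions: the feature vector is the four-term basic model, the cost $c_i$ is input fraction over machines, and all function names are hypothetical.

```python
import math

def features(input_frac, machines):
    """a_i for one experiment: [1, input/machines, log(machines), machines]."""
    return [1.0, input_frac / machines, math.log(machines), float(machines)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("information matrix is singular for this design")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def a_optimality(design):
    """tr((sum_i lambda_i a_i a_i^T)^-1); design maps (input_frac, machines) -> lambda."""
    k = 4
    info = [[0.0] * k for _ in range(k)]
    for (s, m), lam in design.items():
        a = features(s, m)
        for i in range(k):
            for j in range(k):
                info[i][j] += lam * a[i] * a[j]
    # The i-th diagonal entry of the inverse is the i-th entry of the solution of info x = e_i.
    total = 0.0
    for i in range(k):
        e = [1.0 if j == i else 0.0 for j in range(k)]
        total += solve(info, e)[i]
    return total

def total_cost(design):
    """sum_i c_i * lambda_i with the assumed cost c_i = input_frac / machines."""
    return sum(lam * s / m for (s, m), lam in design.items())

grid = [(s, m) for s in (0.01, 0.02, 0.04, 0.08) for m in (1, 2, 4, 8)]
full = {point: 1.0 for point in grid}                    # run every experiment
subset = {point: 1.0 for point in [(0.01, 1), (0.08, 1), (0.01, 2), (0.08, 2),
                                   (0.01, 8), (0.08, 8), (0.04, 4)]}

full_objective, full_cost = a_optimality(full), total_cost(full)
subset_objective, subset_cost = a_optimality(subset), total_cost(subset)
```

Running every experiment gives the lowest variance but the highest cost; the optimization chooses the weights that get as close as possible to that variance while staying within the budget $B$.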
Using Ernest
Training Jobs
Job Binary
Machines, Input Size
Linear Model
Experiment Design
Use few iterations for training
[Figure: Time versus Machines, from 1 to 30 machines]
ERNEST
EVALUATION
Workloads
Keystone-ML 
Spark MLlib
ADAM
GenBase
Sparse GLMs
Random Projections
OBJECTIVES

Optimal number of machines
Prediction accuracy
Model training overhead
Importance of experiment design
Choosing EC2 instance types
Model extensions
Number of Instances: Keystone-ML
TIMIT Pipeline on r3.xlarge instances, 100 iterations
[Figure: Time (s) versus Machines]
For 2 hour deadline, up to 5x lower cost
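Once a model is fitted, a question like "what is the cheapest configuration to run my job in 2 hours?" reduces to a small search over machine counts. The sketch below assumes the four-term basic model; the coefficients and the hourly price are made-up placeholders, not values from the talk, and `cheapest_within_deadline` is a hypothetical helper.

```python
import math

def predict_time(x, input_frac, machines):
    """Basic model: x1 + x2*input/machines + x3*log(machines) + x4*machines."""
    return (x[0] + x[1] * input_frac / machines
            + x[2] * math.log(machines) + x[3] * machines)

def cheapest_within_deadline(x, deadline_s, price_per_hour, max_machines=128):
    """Return (machines, seconds, dollars) for the cheapest feasible cluster, or None."""
    best = None
    for m in range(1, max_machines + 1):
        t = predict_time(x, 1.0, m)          # full input
        if t > deadline_s:
            continue                          # misses the deadline
        dollars = m * price_per_hour * t / 3600.0
        if best is None or dollars < best[2]:
            best = (m, t, dollars)
    return best

# Placeholder coefficients and price (assumptions for illustration only).
coeffs = [100.0, 80000.0, 50.0, 5.0]
choice = cheapest_within_deadline(coeffs, deadline_s=2 * 3600, price_per_hour=0.35)
impossible = cheapest_within_deadline(coeffs, deadline_s=10, price_per_hour=0.35)
```

With these placeholder numbers the cost grows with cluster size, so the cheapest feasible choice is the smallest cluster that still meets the deadline; with other coefficients the search may land elsewhere.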
Accuracy: Keystone-ML
TIMIT Pipeline on r3.xlarge instances, Time Per Iteration
[Figure: Time Per Iteration (s) versus Machines]
10%-15% Prediction error
Accuracy: Spark MLlib
[Figure: bar chart of Predicted / Actual for Regression, KMeans, NaiveBayes, PCA, Pearson, at 64 machines and 45 machines]
Training Time: Keystone-ML
TIMIT Pipeline on r3.xlarge instances, 100 iterations
7 data points
Up to 16 machines
Up to 10% data
Experiment Design
[Figure: Time (s) at 42 machines, Training Time versus Running Time]
Accuracy, Overhead: GLM Regression
MLlib Regression: < 15% error
[Figure: Time (seconds) at 64 and 45 machines, with Actual, Predicted, and Training bars. Bar labels: 3.9%, 2.9%, 113.4%, 112.7%.]
Is Experiment Design Useful?
[Figure: Prediction Error (%) for Regression, Classification, KMeans, PCA, TIMIT, comparing Experiment Design with Cost-based selection]
MORE DETAILS

Detecting when the model is wrong

Model extensions

Amazon EC2 variations over time

Straggler mitigation strategies
http://shivaram.org/publications/ernest-nsdi.pdf
OPEN SOURCE

Python-based implementation

Experiment Design, Predictor Modules

Example using SparkML + RCV1
https://github.com/amplab/ernest
IN Conclusion
Workload Trends: Advanced Analytics in the Cloud
Computation, Communication patterns affect scalability

Ernest: Performance predictions with low overhead

End-to-end linear model

Optimal experimental design

Workloads / Traces ? 

Shivaram Venkataraman shivaram@cs.berkeley.edu


Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark: Spark Summit East talk by Shivaram Venkataraman

  • 1. Ernest Efficient Performance Prediction for Advanced Analytics on Apache Spark Shivaram Venkataraman, Zongheng Yang Michael Franklin, Benjamin Recht, Ion Stoica
  • 6. Keystone-ML TIMIT PIPELINE Cosine Transform Normalization Linear Solver Raw Data
  • 7. Cosine Transform Normalization Linear Solver ~100 iterations Iterative (each iteration many jobs) Long Running à Expensive Numerically Intensive 7 Keystone-ML TIMIT PIPELINE Raw Data Properties
  • 8. Cloud Computing CHOICEs Amazon EC2 t2.nano, t2.micro, t2.small m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m3.medium, c4.large, c4.xlarge, c4.2xlarge, c3.large, c3.xlarge, c3.4xlarge, r3.large, r3.xlarge, r3.4xlarge, i2.2xlarge, i2.4xlarge, d2.xlarge d2.2xlarge, d2.4xlarge,… MICROSOFT AZURE Basic tier: A0, A1, A2, A3, A4 Optimized Compute : D1, D2, D3, D4, D11, D12, D13 D1v2, D2v2, D3v2, D11v2,… Latest CPUs: G1, G2, G3, … Network Optimized: A8, A9 Compute Intensive: A10, A11,… n1-standard-1, ns1-standard-2, ns1-standard-4, ns1-standard-8, ns1-standard-16, ns1highmem-2, ns1-highmem-4, ns1-highmem-8, n1-highcpu-2, n1-highcpu-4, n1- highcpu-8, n1-highcpu-16, n1- highcpu-32, f1-micro, g1-small… Google Cloud Engine Instance Types and Number of Instances
  • 10. User Concerns “What is the cheapest configuration to run my job in 2 hours?” Given a budget, how fast can I run my job ? “What kind of instances should I use on EC2 ?” 10
  • 11. Do Choices Matter? Matrix Multiply, matrix size 400K by 1K. [Bar chart: Time (s) for 1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large.] Cores = 16, Memory = 244 GB, Cost = $2.66/hr
  • 12. Do Choices Matter? [Two bar charts of Time (s) for 1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large: Matrix Multiply, 400K by 1K; QR Factorization, 1M by 1K.] Annotations: Network Bound; Mem Bandwidth Bound
  • 13. Do Choices Matter? QR Factorization, 1M by 1K, on r3.4xlarge instances. [Plot: Time (s) vs. Cores, Actual vs. Ideal.] Computation + Communication → Non-linear Scaling
  • 14. Approach: Job Data + Performance Model. Challenges: Black Box Jobs; Model Building Overhead. Regular Structure + Few Iterations
  • 17. Communication Patterns: One-to-One (constant); Tree DAG (log); All-to-One (linear)
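  • 18. Basic Model: $\text{time} = x_1 + x_2 \cdot \frac{\text{input}}{\text{machines}} + x_3 \cdot \log(\text{machines}) + x_4 \cdot \text{machines}$. Terms: $x_1$, serial execution; $x_2$, computation (linear); $x_3$, tree DAG; $x_4$, all-to-one DAG. Collect training data, then fit a linear regression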
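The basic model on slide 18 (time = x1 + x2 · input/machines + x3 · log(machines) + x4 · machines) is linear in its coefficients, so it can be fitted with ordinary least squares. The following is a minimal sketch, not the Ernest implementation: the function names, the coefficient values in `TRUE`, and the synthetic timings are all assumptions made up for illustration, and NumPy is assumed to be available.

```python
import math
import numpy as np

def features(input_fraction, machines):
    """Feature vector [1, input/machines, log(machines), machines] from slide 18."""
    return [1.0, input_fraction / machines, math.log(machines), float(machines)]

def fit(observations):
    """Least-squares fit of x1..x4 from (input_fraction, machines, seconds) tuples."""
    A = np.array([features(s, m) for s, m, _ in observations])
    y = np.array([t for _, _, t in observations])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def predict(coeffs, input_fraction, machines):
    """Predicted running time in seconds for a given input fraction and cluster size."""
    return float(np.dot(coeffs, features(input_fraction, machines)))

# ASSUMPTION: hypothetical coefficients, used only to generate synthetic timings.
TRUE = np.array([5.0, 400.0, 2.0, 0.5])
observations = [
    (s, m, predict(TRUE, s, m))
    for s in (0.01, 0.02, 0.04, 0.08)
    for m in (1, 2, 4, 8)
]

coeffs = fit(observations)
# Extrapolate from small training runs to the full input on a larger cluster.
full_run_seconds = predict(coeffs, 1.0, 64)
```

Plain least squares can return negative coefficients on noisy measurements; a solver with a non-negativity constraint is one way to keep every term physically meaningful.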
  • 19. Does the Model Match Real Applications? Columns: Algorithm; Compute: O(n/p), O(n²/p); Communication: O(1), O(p), O(log(p)), Other. Rows: GLM Regression ✔ ✔ ✔; KMeans ✔ ✔ ✔ O(center); Naïve Bayes ✔ ✔; Pearson Correlation ✔ ✔; PCA ✔ ✔ ✔; QR (TSQR) ✔ ✔. Number of data items: n; number of processors: p; constant number of features
  • 20. Collecting Training Data: grid of input (1%, 2%, 4%, 8%) and machines (1, 2, 4, 8). Associate a cost with each experiment. Baseline: cheapest configurations first
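The baseline on slide 20 (enumerate the input/machines grid, attach a cost to each experiment, run the cheapest first) can be sketched as below. The cost function here, input fraction divided by machine count, is an assumption standing in for the time each training run would take; the slides do not give the formula, and `cheapest_first` is a hypothetical helper name.

```python
from itertools import product

INPUT_FRACTIONS = [0.01, 0.02, 0.04, 0.08]  # 1%, 2%, 4%, 8% of the input
MACHINES = [1, 2, 4, 8]

def experiment_cost(fraction, machines):
    """ASSUMPTION: cost proxy proportional to the work done per machine."""
    return fraction / machines

def cheapest_first(num_experiments):
    """Return the num_experiments cheapest (fraction, machines) grid points."""
    grid = list(product(INPUT_FRACTIONS, MACHINES))
    # Sort by cost; break ties by the configuration itself for a stable order.
    ranked = sorted(grid, key=lambda e: (experiment_cost(*e), e))
    return ranked[:num_experiments]

baseline = cheapest_first(7)
for fraction, machines in baseline:
    print(f"{fraction:.0%} of input on {machines} machine(s)")
```

Under this cost proxy the cheapest-first baseline clusters its runs at small inputs, so it covers the grid unevenly. That is the gap experiment design is meant to close.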
  • 21. Optimal Design of Experiments: Given a Linear Model. Lower variance → Better model. $\lambda_i$: fraction of times each experiment is run. Bound total cost.
      Experiment design: We consider the problem of estimating a vector $x \in \mathbf{R}^n$ from measurements or experiments
      $$y_i = a_i^T x + w_i, \quad i = 1, \ldots, m,$$
      where $w_i$ is measurement noise. We assume that the $w_i$ are independent Gaussian random variables with zero mean and unit variance, and that the measurement vectors $a_1, \ldots, a_m$ span $\mathbf{R}^n$. The maximum likelihood estimate of $x$, which is the same as the minimum variance estimate, is given by the least-squares solution
      $$\hat{x} = \left( \sum_{i=1}^{m} a_i a_i^T \right)^{-1} \sum_{i=1}^{m} y_i a_i.$$
      The associated estimation error $e = \hat{x} - x$ has zero mean and covariance matrix $\left( \sum_{i=1}^{m} a_i a_i^T \right)^{-1}$.
      … can set $\lambda_i$ as the fraction of time an experiment is chosen and minimize the trace of the covariance matrix:
      $$\text{minimize } \operatorname{tr}\left( \left( \sum_{i=1}^{m} \lambda_i a_i a_i^T \right)^{-1} \right) \quad \text{subject to } \lambda_i \ge 0,\ \lambda_i \le 1$$
      Using Experiment Design: The predictive model we described in the previous section can be formulated as an experiment design problem. The feature set used for model design consisted of just the scale of the data used and the num… design setup described above and only choose to run those experiments whose $\lambda$ values are non-zero.
      Accounting for Cost: One additional factor we need to consider in using experiment design is that each experiment we run costs a different amount. This cost could be in terms of time (i.e. it is more expensive to train with a larger fraction of the input) or in terms of machines (i.e. there is a fixed cost to, say, launching a machine). To account for the cost of an experiment we can augment the optimization problem we set up above with an additional constraint that the total cost should be less than some budget. That is, if we have a cost function which gives us a cost $c_i$ for an experiment with scale $s_i$ and $m_i$ machines, we add a constraint to our solver that $\sum_{i=1}^{m} c_i \lambda_i \le B$, where $B$ is the total budget. For the rest of this paper we use the time taken to collect training data as the cost and ignore any machine setup costs as we can …
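To make the experiment-design objective from slide 21 concrete, the sketch below evaluates tr((Σ λ_i a_i a_i^T)^-1) and the budget term Σ c_i λ_i for one candidate weighting. It only scores a given λ; searching for the best λ is the job of a convex solver, which this sketch does not attempt. The feature vector reuses the basic model's terms, while the cost function, the budget value and the helper names are assumptions for illustration. NumPy is assumed to be available.

```python
import math
from itertools import product
import numpy as np

def feature_vector(fraction, machines):
    """a_i for one experiment: [1, input/machines, log(machines), machines]."""
    return np.array([1.0, fraction / machines, math.log(machines), float(machines)])

def design_objective(experiments, lambdas):
    """Trace of the covariance matrix, tr((sum_i lambda_i a_i a_i^T)^-1)."""
    info = np.zeros((4, 4))
    for (fraction, machines), lam in zip(experiments, lambdas):
        a = feature_vector(fraction, machines)
        info += lam * np.outer(a, a)
    return float(np.trace(np.linalg.inv(info)))

def total_cost(experiments, lambdas, cost_fn):
    """Left-hand side of the budget constraint, sum_i c_i lambda_i."""
    return sum(lam * cost_fn(f, m) for (f, m), lam in zip(experiments, lambdas))

# Candidate experiments: the 1%-8% by 1-8 machines grid from the slides.
GRID = list(product([0.01, 0.02, 0.04, 0.08], [1, 2, 4, 8]))
uniform = [1.0 / len(GRID)] * len(GRID)

# ASSUMPTION: cost proxy (fraction / machines) and budget are hypothetical.
BUDGET = 0.02
objective = design_objective(GRID, uniform)
cost = total_cost(GRID, uniform, lambda f, m: f / m)
within_budget = cost <= BUDGET
```

A lower objective means lower variance in the fitted coefficients, so two candidate weightings with the same cost can be compared directly on this number.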
  • 22. Optimal Design of Experiments: [Grid: input (1%, 2%, 4%, 8%) vs. machines (1, 2, 4, 8).] Use off-the-shelf solver (CVX)
  • 23. Using Ernest: Job Binary; Machines, Input Size; Experiment Design; Training Jobs; Linear Model. Use few iterations for training. [Plot: Time vs. Machines, labeled ERNEST.]
  • 25. Workloads: Keystone-ML, Spark MLlib, ADAM, GenBase, Sparse GLMs, Random Projections. Objectives: optimal number of machines; prediction accuracy; model training overhead; importance of experiment design; choosing EC2 instance types; model extensions
  • 27. Number of Instances: Keystone-ML TIMIT Pipeline on r3.xlarge instances, 100 iterations. [Plot: Time (s) vs. Machines.] For a 2 hour deadline, up to 5x lower cost
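Once a model of the form on slide 18 has been fitted, the question "what is the cheapest configuration to run my job in 2 hours?" reduces to a scan over machine counts. The following is a minimal sketch under stated assumptions: the coefficients in `COEFFS` and the hourly price are hypothetical placeholders, not values from the talk, and `cheapest_within_deadline` is an illustrative helper rather than part of Ernest.

```python
import math

def predicted_seconds(coeffs, machines, input_fraction=1.0):
    """Evaluate time = x1 + x2*input/machines + x3*log(machines) + x4*machines."""
    x1, x2, x3, x4 = coeffs
    return (x1 + x2 * input_fraction / machines
            + x3 * math.log(machines) + x4 * machines)

def cheapest_within_deadline(coeffs, deadline_s, price_per_machine_hour,
                             max_machines=128):
    """Return (machines, seconds, dollars) for the cheapest feasible cluster size,
    or None if no size up to max_machines meets the deadline."""
    best = None
    for machines in range(1, max_machines + 1):
        seconds = predicted_seconds(coeffs, machines)
        if seconds > deadline_s:
            continue  # misses the deadline
        dollars = machines * price_per_machine_hour * seconds / 3600.0
        if best is None or dollars < best[2]:
            best = (machines, seconds, dollars)
    return best

# ASSUMPTION: hypothetical fitted coefficients and a hypothetical hourly price.
COEFFS = (600.0, 200000.0, 50.0, 5.0)
choice = cheapest_within_deadline(COEFFS, deadline_s=2 * 3600,
                                  price_per_machine_hour=0.35)
```

With these placeholder numbers, adding machines past the point where the deadline is first met only raises the bill, because the log and linear communication terms grow while the compute term shrinks more slowly.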
  • 28. Accuracy: Keystone-ML TIMIT Pipeline on r3.xlarge instances, time per iteration. [Plot: Time Per Iteration (s) vs. Machines.] 10%-15% prediction error
  • 29. Accuracy: Spark MLlib. [Bar chart: Predicted / Actual for Regression, KMeans, NaiveBayes, PCA, Pearson; series: 64 machines, 45 machines.]
  • 30. Training Time: Keystone-ML TIMIT Pipeline on r3.xlarge instances, 100 iterations. Experiment design: 7 data points, up to 16 machines, up to 10% data. [Bar chart: Training Time vs. Running Time, Time (s), at 42 machines.]
  • 31. Accuracy, Overhead: GLM Regression. MLlib Regression: < 15% error. [Bar chart: Time (seconds) at 64 and 45 machines; series: Actual, Predicted, Training; bar labels: 3.9%, 2.9%, 113.4%, 112.7%.]
  • 32. Is Experiment Design Useful? [Bar chart: Prediction Error (%) for Regression, Classification, KMeans, PCA, TIMIT; series: Experiment Design, Cost-based.]
  • 33. More Details: detecting when the model is wrong; model extensions; Amazon EC2 variations over time; straggler mitigation strategies. http://shivaram.org/publications/ernest-nsdi.pdf
  • 34. Open Source: Python-based implementation; Experiment Design and Predictor modules; example using SparkML + RCV1. https://github.com/amplab/ernest
  • 35. In Conclusion: Workload trends: advanced analytics in the cloud. Computation and communication patterns affect scalability. Ernest: performance predictions with low overhead (end-to-end linear model, optimal experimental design). Workloads / Traces? Shivaram Venkataraman, shivaram@cs.berkeley.edu