SlideShare une entreprise Scribd logo
1  sur  29
Télécharger pour lire hors ligne
OPTIMIZINGTERASCALE MACHINE
LEARNING PIPELINES WITH
Evan R. Sparks, UC Berkeley AMPLab
with ShivaramVenkataraman,Tomer Kaftan, Michael Franklin, Benjamin Recht
MLKeystone
Apache
WHAT’S A MACHINE
LEARNING PIPELINE?
A STANDARD MACHINE LEARNING PIPELINE
Right?
Data
Train
Classifier
Model
A STANDARD MACHINE LEARNING PIPELINE
That’s more like it!
Data
Train
Linear
Classifier
Model
Feature
Extraction
Test
Data
Predictions
A REAL PIPELINE FOR
IMAGE CLASSIFICATION
Inspired by Coates & Ng, 2012
Data
Image
Parser
Normalizer Convolver
sqrt,mean
Zipper
Linear
Solver
Symmetric
Rectifier
ident,abs
ident,mean
Global Pooling
Pooler
Patch
Extractor
Patch
Whitener
KMeans
Clusterer
Feature Extractor
Label
Extractor
ModelLinear
Mapper
Test
Data
Label
Extractor
Feature
Extractor
Test
Error
Error
Computer
Data
Image
Parser
Normalizer Convolver
sqrt,mean
Zipper
Linear
Solver
Symmetric
Rectifier
ident,abs
ident,mean
Global Pooling
Pooler
Patch
Extractor
Patch
Whitener
KMeans
Clusterer
Feature Extractor
Label
Extractor
Linear
Mapper
Model
Test
Data
Label
Extractor
Feature
Extractor
Test
Error
Error
Computer
Embarrassingly Parallel
Requires Coordination
Tricky to Scale
ABOUT KEYSTONEML
• Software framework for building scalable end-to-end machine
learning pipelines on Apache Spark.
• Helps us understand what it means to build systems for robust,
scalable, end-to-end advanced analytics workloads and the patterns
that emerge.
• Example pipelines that achieve state-of-the-art results on large scale
datasets in computer vision, NLP, and speech - fast.
• Open source software, available at: http://keystone-ml.org/
SIMPLE EXAMPLE:
TEXT CLASSIFICATION
20
Newsgroups
.fit( )
Trim
Tokenize
Bigrams
Top Features
Naive Bayes
Max Classifier
Trim
Tokenize
Bigrams
Max Classifier
Top Features
Transformer
Naive Bayes
Model
Once estimated - apply
these steps to your
production data in an
online or batch fashion.
NOT SO SIMPLE EXAMPLE:
IMAGE CLASSIFICATION
Images
(VOC2007)
.fit( )
Resize
Grayscale
SIFT
PCA
FisherVector
MaxClassifier
Linear Regression
Resize
Grayscale
SIFT
MaxClassifier
PCA Map
Fisher Encoder
Linear Model
Achieves performance
of Chatfield et. al., 2011
Pleasantly parallel
featurization and evaluation.
7 minutes on a modest cluster.
5,000 examples, 40,000
features, 20 classes
EVEN LESS SIMPLE: IMAGENET
Color Edges
Resize
Grayscale
SIFT
PCA
FisherVector
Top 5 Classifier
LCS
PCA
FisherVector
Block Linear
Solver
<100 SLOC
Upgrading the solver
for higher precision
means changing 1 LOC.
Weighted Block
Linear Solver
Adding 100,000 more
texture features is easy.
Texture
Gabor
Wavelets
PCA
FisherVector
1000 class classification.
1,200,000 examples
64,000 features.
90 minutes on 100 nodes.
OPTIMIZING KEYSTONEML PIPELINES
High-level API enables rich space of optimizations
Automated ML operator selection. Linear
Solver
L-BFGS
Iterative
SGD
Direct
Solver
Training
Data
Grayscaler
SIFT
Extractor
Reduce
Dimensions
Fisher
Vector
Normalize
Column
Sampler
Linear
Map
Distributed
PCA
Column
Sampler
Local
GMM
Least Sq.
L-BFGS
Predictions
Training
Labels
Auto-caching for iterative workloads.
KEYSTONEML OPTIMIZER
• Sampling-based cost model
projects resource usage
• CPU, Memory, Network
• Utilization tracked through
pipeline.
• Decisions made to minimize
total cost of execution.
• Catalyst-based optimizer does
the heavy lifting.
Stage n d size (GB)
Input 5000 1m pixel
JPEG
0.4
Resize 5000 260k pixels 3.6
Grayscale 5000 260k pixels 1.2
SIFT 5000 65000x128 309
PCA 5000 65000x80 154
FV 5000 256x64x2 1.2
Linear
Regression
5000 20 0.0007
Max
Classifier
5000 1 0.00009
CHOOSING A SOLVER
• Datasets have a number of
interesting degrees of freedom.
• Problem size (n, d, k)
• sparsity (nnz)
• condition number
• Platform has degrees of freedom:
• Memory, CPU, Network, Nodes
• Solvers are predictable!
13
Where:
A 2 Rn⇥d
X 2 Rd⇥k
B 2 Rn⇥k
Objective:
min
X
|AX B|
2
2 + |X|2
2
CHOOSING A SOLVER
• Three Solvers
• Exact, Block, LBFGS
• Two datasets
• Amazon - >99% sparse, n=65m
• TIMIT - dense, n=2m
• Exact solve works well for small # features.
• Use LBFGS for sparse problems.
• Block solver scales well to big dense
problems.
• Hundreds of thousands of features.
●
●
●
●
●
●
Amazon TIMIT
100
1000
10000
10
100
1000
1024 2048 4096 8192 16384 1024 2048 4096 8192 16384
Number of Features
Time(s)
Solver ● Exact Block Solver LBFGS
14
SOLVER PERFORMANCE
• Compared KeystoneML with:
• VowpalWabbit - specialized system for
large, sparse problems.
• SystemML - general purpose, optimizing
ML system.
• Two problems:
• Amazon - Sparse text features.
• BinaryTIMIT - Dense phoneme data.
• High Order Bit:
• KeystoneML pipelines featurization and
adapts to workload changes.
Amazon
0
200
400
600
800
1024 2048 4096 8192 16384
Features
Time(s)
System KeystoneML SystemML
Binary TIMIT
0
100
200
300
400
1024 2048 4096 8192 16384
Features
Time(s)
System KeystoneML SystemML
Amazon
0
50
100
150
1024 2048 4096 8192 16384
Features
Time(s)
System KeystoneML Vowpal Wabbit
Binary TIMIT
0
500
1000
1500
1024 2048 4096 8192 16384
Features
Time(s)
System KeystoneML Vowpal Wabbit
DECIDING WHATTO SAVE
• Pipelines Generate Lots of
intermediate state.
• E.g. SIFT features blow up a
0.42GBVOC dataset to 300GB.
• Iterative algorithms —> state
needed many times.
• How do we determine what to save
for later and what to reuse, given
fixed resource budget?
• Can we adapt to workload changes?
16
Resize
Grayscale
SIFT
PCA
FisherVector
MaxClassifier
Linear Regression
CACHING PROBLEM
• Output is computed via depth-
first execution of DAG.
• Caching “truncates” a path
after first visit.
• Want to minimize execution
time.
• Subject to memory
constraints.
• Picking optimal set is hard!
17
A B
C
D
E
60s
50g
40s
200g
20s
40g
40g
15s
5s
10g
Output
Cache set Time Memory
ABCDE 140s 340g
B 140s 200g
A 180s 50g
{} 240s 0g
END-TO-END PERFORMANCE
Dataset
Training
Examples
Features Raw Size (GB)
Feature Size
(GB)
Amazon 65 million 100k (sparse) 14 89
TIMIT 2.25 million 528k 7.5 8800
ImageNet 1.28 million 262k 74 2500
VOC 5000 40k 0.43 1.5
END-TO-END PERFORMANCE
Dataset
KeystoneML
Accuracy
Reported
Accuracy
KeystoneML
Time (m)
Reported
Time (m)
Speedup
over
Reported
Amazon 91.6% N/A 3.3 N/A N/A
TIMIT 66.1% 66.3% 138 120 0.87x
ImageNet 67.4% 66.6% 270 5760 21x
VOC 57.2% 59.2% 7 87 12x
END-TO-END PERFORMANCE
Amazon TIMIT ImageNet
0
5
10
15
0
20
40
60
0
100
200
300
400
500
8 16 32 64 128 8 16 32 64 128 8 16 32 64 128
Cluster Size (# of nodes)
Time(minutes)
Stage
Loading Train Data Featurization Model Solve
Loading Test Data Model Eval
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Amazon TIMIT ImageNet
1
2
4
8
16
8 16 32 64 128 8 16 32 64 128 8 16 32 64 128
Cluster Size (# of nodes)
Speedupover8nodes(x)
END-TO-END PERFORMANCE
• Tested three levels of
optimization
• None
• Auto-caching only
• Auto-caching and
operator-selection.
• 7x to 15x speedup
0
5
10
15
Amazon TIMIT VOC
Workload
Speedup
Optimization Level None Whole−Pipeline All
QUESTIONS?
http://keystone-ml.org/
Project Page
Code
http://github.com/amplab/keystone
Training
http://goo.gl/axbkkc
BACKUP SLIDES
SOFTWARE FEATURES
• Data Loaders
• CSV, CIFAR, ImageNet,VOC,TIMIT, 20 Newsgroups
• Transformers
• NLP -Tokenization, n-grams, term frequency, NER*,
parsing*
• Images - Convolution, Grayscaling, LCS, SIFT*,
FisherVector*, Pooling,Windowing, HOG, Daisy
• Speech - MFCCs*
• Stats - Random Features, Normalization, Scaling*,
Signed Hellinger Mapping, FFT
• Utility/misc - Caching,Top-K classifier, indicator label
mapping, sparse/dense encoding transformers.
• Estimators
• Learning - Block linear models, Linear Discriminant
Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
• Example Pipelines
• NLP - Amazon Product Review
Classification, 20 Newsgroups,Wikipedia
Language model
• Images - MNIST, CIFAR,VOC, ImageNet
• Speech -TIMIT
• Evaluation Metrics
• Binary Classification
• Multiclass Classification
• Multilabel Classification
* - Links to external library
Just 11k Lines of Code,
5k of which areTests or JavaDoc.
KEY API CONCEPTS
TRANSFORMERS
TransformerInput Output
abstract classTransformer[In, Out] {
def apply(in: In): Out
def apply(in: RDD[In]): RDD[Out] = in.map(apply)
…
}
TYPE SAFETY HELPS ENSURE ROBUSTNESS
ESTIMATORS
EstimatorRDD[Input]
abstract class Estimator[In, Out] {
def fit(in: RDD[In]):Transformer[In,Out]
…
}
Transformer
.fit()
CHAINING
NGrams(2)String Vectorizer VectorBigrams
val featurizer:Transformer[String,Vector] = NGrams(2) thenVectorizer
featurizerString Vector
=
COMPLEX PIPELINES
.fit(data, labels)
pipelineString Prediction
=
val pipeline = (featurizer thenLabelEstimator LinearModel).fit(data, labels)
featurizerString Vector
Linear
Model
Prediction
featurizerString Vector
Linear
Map
Prediction

Contenu connexe

Tendances

Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Spark Summit
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...Databricks
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
 
Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applicationsKexin Xie
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Spark Summit
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Databricks
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Spark Summit
 
FlinkML - Big data application meetup
FlinkML - Big data application meetupFlinkML - Big data application meetup
FlinkML - Big data application meetupTheodoros Vasiloudis
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiDatabricks
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...Spark Summit
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 

Tendances (20)

Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
FlinkML - Big data application meetup
FlinkML - Big data application meetupFlinkML - Big data application meetup
FlinkML - Big data application meetup
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 

En vedette

Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Spark Summit
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningJen Aman
 
Open Air 2016 Mini Talk
Open Air 2016 Mini TalkOpen Air 2016 Mini Talk
Open Air 2016 Mini TalkSky Yin
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajSri Ambati
 
H2O.ai - Road Ahead - keynote presentation by Sri Ambati
H2O.ai - Road Ahead - keynote presentation by Sri AmbatiH2O.ai - Road Ahead - keynote presentation by Sri Ambati
H2O.ai - Road Ahead - keynote presentation by Sri AmbatiSri Ambati
 
H2O World - Machine Learning at Comcast - Andrew Leamon & Chushi Ren
H2O World - Machine Learning at Comcast - Andrew Leamon & Chushi RenH2O World - Machine Learning at Comcast - Andrew Leamon & Chushi Ren
H2O World - Machine Learning at Comcast - Andrew Leamon & Chushi RenSri Ambati
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Distributed Data Processing using Spark by Panos Labropoulos_and Sarod Yataw...
Distributed Data Processing using Spark by  Panos Labropoulos_and Sarod Yataw...Distributed Data Processing using Spark by  Panos Labropoulos_and Sarod Yataw...
Distributed Data Processing using Spark by Panos Labropoulos_and Sarod Yataw...Spark Summit
 
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark Summit
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkSpark Summit
 
Natural Sparksmanship – The Art of Making an Analytics Enterprise Cross the C...
Natural Sparksmanship – The Art of Making an Analytics Enterprise Cross the C...Natural Sparksmanship – The Art of Making an Analytics Enterprise Cross the C...
Natural Sparksmanship – The Art of Making an Analytics Enterprise Cross the C...Spark Summit
 
A Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander UlanovA Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander UlanovSpark Summit
 
Netflix branding stumbles
Netflix branding stumblesNetflix branding stumbles
Netflix branding stumblesMayur Verma
 
Breaking Down Analytical and Computational Barriers Across the Energy Industr...
Breaking Down Analytical and Computational Barriers Across the Energy Industr...Breaking Down Analytical and Computational Barriers Across the Energy Industr...
Breaking Down Analytical and Computational Barriers Across the Energy Industr...Spark Summit
 
Sparkling Random Ferns by P Dendek and M Fedoryszak
Sparkling Random Ferns by  P Dendek and M FedoryszakSparkling Random Ferns by  P Dendek and M Fedoryszak
Sparkling Random Ferns by P Dendek and M FedoryszakSpark Summit
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...DataStax
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkSpark Summit
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSpark Summit
 

En vedette (20)

Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine Learning
 
Open Air 2016 Mini Talk
Open Air 2016 Mini TalkOpen Air 2016 Mini Talk
Open Air 2016 Mini Talk
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
 
H2O.ai - Road Ahead - keynote presentation by Sri Ambati
H2O.ai - Road Ahead - keynote presentation by Sri AmbatiH2O.ai - Road Ahead - keynote presentation by Sri Ambati
H2O.ai - Road Ahead - keynote presentation by Sri Ambati
 
H2O World - Machine Learning at Comcast - Andrew Leamon & Chushi Ren
H2O World - Machine Learning at Comcast - Andrew Leamon & Chushi RenH2O World - Machine Learning at Comcast - Andrew Leamon & Chushi Ren
H2O World - Machine Learning at Comcast - Andrew Leamon & Chushi Ren
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Distributed Data Processing using Spark by Panos Labropoulos_and Sarod Yataw...
Distributed Data Processing using Spark by  Panos Labropoulos_and Sarod Yataw...Distributed Data Processing using Spark by  Panos Labropoulos_and Sarod Yataw...
Distributed Data Processing using Spark by Panos Labropoulos_and Sarod Yataw...
 
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Natural Sparksmanship – The Art of Making an Analytics Enterprise Cross the C...
Natural Sparksmanship – The Art of Making an Analytics Enterprise Cross the C...Natural Sparksmanship – The Art of Making an Analytics Enterprise Cross the C...
Natural Sparksmanship – The Art of Making an Analytics Enterprise Cross the C...
 
A Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander UlanovA Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
 
Netflix branding stumbles
Netflix branding stumblesNetflix branding stumbles
Netflix branding stumbles
 
Breaking Down Analytical and Computational Barriers Across the Energy Industr...
Breaking Down Analytical and Computational Barriers Across the Energy Industr...Breaking Down Analytical and Computational Barriers Across the Energy Industr...
Breaking Down Analytical and Computational Barriers Across the Energy Industr...
 
Sparkling Random Ferns by P Dendek and M Fedoryszak
Sparkling Random Ferns by  P Dendek and M FedoryszakSparkling Random Ferns by  P Dendek and M Fedoryszak
Sparkling Random Ferns by P Dendek and M Fedoryszak
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Netflix in France
Netflix in FranceNetflix in France
Netflix in France
 
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 

Similaire à Optimizing terascale machine learning pipelines with KeystoneML

Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...MLconf
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用CHENHuiMei
 
Pragmatic model checking: from theory to implementations
Pragmatic model checking: from theory to implementationsPragmatic model checking: from theory to implementations
Pragmatic model checking: from theory to implementationsUniversität Rostock
 
High Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and SolutionsHigh Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and SolutionsYinghai Lu
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...PyData
 
[Q-tangled 22] Deconstructing Quantum Machine Learning Algorithms - Sasha Laz...
[Q-tangled 22] Deconstructing Quantum Machine Learning Algorithms - Sasha Laz...[Q-tangled 22] Deconstructing Quantum Machine Learning Algorithms - Sasha Laz...
[Q-tangled 22] Deconstructing Quantum Machine Learning Algorithms - Sasha Laz...DataScienceConferenc1
 
Application of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data AnalysisApplication of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data AnalysisChrisCoughlin9
 
CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen...
CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen...CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen...
CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen...Thom Lane
 
Keynote: Machine Learning for Design Automation at DAC 2018
Keynote:  Machine Learning for Design Automation at DAC 2018Keynote:  Machine Learning for Design Automation at DAC 2018
Keynote: Machine Learning for Design Automation at DAC 2018Manish Pandey
 
The von Neumann Memory Barrier and Computer Architectures for the 21st Century
The von Neumann Memory Barrier and Computer Architectures for the 21st CenturyThe von Neumann Memory Barrier and Computer Architectures for the 21st Century
The von Neumann Memory Barrier and Computer Architectures for the 21st CenturyPerry Lea
 
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...James McCombs
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
Parallel and Distributed Computing on Low Latency Clusters
Parallel and Distributed Computing on Low Latency ClustersParallel and Distributed Computing on Low Latency Clusters
Parallel and Distributed Computing on Low Latency ClustersVittorio Giovara
 
2018AOI論壇_深度學習於表面瑕疪檢測_元智大學蔡篤銘
2018AOI論壇_深度學習於表面瑕疪檢測_元智大學蔡篤銘2018AOI論壇_深度學習於表面瑕疪檢測_元智大學蔡篤銘
2018AOI論壇_深度學習於表面瑕疪檢測_元智大學蔡篤銘CHENHuiMei
 
Computer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureComputer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureSanghamitra Deb
 
Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0Javier Cantón Ferrero
 
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0Plain Concepts
 

Similaire à Optimizing terascale machine learning pipelines with KeystoneML (20)

Svm on cloud (presntation)
Svm on cloud  (presntation)Svm on cloud  (presntation)
Svm on cloud (presntation)
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
 
Smallsat 2021
Smallsat 2021Smallsat 2021
Smallsat 2021
 
Pragmatic model checking: from theory to implementations
Pragmatic model checking: from theory to implementationsPragmatic model checking: from theory to implementations
Pragmatic model checking: from theory to implementations
 
High Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and SolutionsHigh Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and Solutions
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
 
[Q-tangled 22] Deconstructing Quantum Machine Learning Algorithms - Sasha Laz...
[Q-tangled 22] Deconstructing Quantum Machine Learning Algorithms - Sasha Laz...[Q-tangled 22] Deconstructing Quantum Machine Learning Algorithms - Sasha Laz...
[Q-tangled 22] Deconstructing Quantum Machine Learning Algorithms - Sasha Laz...
 
Application of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data AnalysisApplication of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data Analysis
 
CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen...
CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen...CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen...
CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen...
 
Keynote: Machine Learning for Design Automation at DAC 2018
Keynote:  Machine Learning for Design Automation at DAC 2018Keynote:  Machine Learning for Design Automation at DAC 2018
Keynote: Machine Learning for Design Automation at DAC 2018
 
The von Neumann Memory Barrier and Computer Architectures for the 21st Century
The von Neumann Memory Barrier and Computer Architectures for the 21st CenturyThe von Neumann Memory Barrier and Computer Architectures for the 21st Century
The von Neumann Memory Barrier and Computer Architectures for the 21st Century
 
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Parallel and Distributed Computing on Low Latency Clusters
Parallel and Distributed Computing on Low Latency ClustersParallel and Distributed Computing on Low Latency Clusters
Parallel and Distributed Computing on Low Latency Clusters
 
2018AOI論壇_深度學習於表面瑕疪檢測_元智大學蔡篤銘
2018AOI論壇_深度學習於表面瑕疪檢測_元智大學蔡篤銘2018AOI論壇_深度學習於表面瑕疪檢測_元智大學蔡篤銘
2018AOI論壇_深度學習於表面瑕疪檢測_元智大學蔡篤銘
 
Computer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureComputer Vision Landscape : Present and Future
Computer Vision Landscape : Present and Future
 
Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0
 
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
 
Javantura v4 - Java and lambdas and streams - are they better than for loops ...
Javantura v4 - Java and lambdas and streams - are they better than for loops ...Javantura v4 - Java and lambdas and streams - are they better than for loops ...
Javantura v4 - Java and lambdas and streams - are they better than for loops ...
 

Plus de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Plus de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Dernier

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 

Dernier (20)

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 

Optimizing terascale machine learning pipelines with KeystoneML

  • 1. OPTIMIZINGTERASCALE MACHINE LEARNING PIPELINES WITH Evan R. Sparks, UC Berkeley AMPLab with ShivaramVenkataraman,Tomer Kaftan, Michael Franklin, Benjamin Recht MLKeystone Apache
  • 3. A STANDARD MACHINE LEARNING PIPELINE Right? Data Train Classifier Model
  • 4. A STANDARD MACHINE LEARNING PIPELINE That’s more like it! Data Train Linear Classifier Model Feature Extraction Test Data Predictions
  • 5. A REAL PIPELINE FOR IMAGE CLASSIFICATION Inspired by Coates & Ng, 2012 Data Image Parser Normalizer Convolver sqrt,mean Zipper Linear Solver Symmetric Rectifier ident,abs ident,mean Global Pooling Pooler Patch Extractor Patch Whitener KMeans Clusterer Feature Extractor Label Extractor ModelLinear Mapper Test Data Label Extractor Feature Extractor Test Error Error Computer
  • 6. Data Image Parser Normalizer Convolver sqrt,mean Zipper Linear Solver Symmetric Rectifier ident,abs ident,mean Global Pooling Pooler Patch Extractor Patch Whitener KMeans Clusterer Feature Extractor Label Extractor Linear Mapper Model Test Data Label Extractor Feature Extractor Test Error Error Computer Embarrassingly Parallel Requires Coordination Tricky to Scale
  • 7. ABOUT KEYSTONEML • Software framework for building scalable end-to-end machine learning pipelines on Apache Spark. • Helps us understand what it means to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge. • Example pipelines that achieve state-of-the-art results on large scale datasets in computer vision, NLP, and speech - fast. • Open source software, available at: http://keystone-ml.org/
  • 8. SIMPLE EXAMPLE: TEXT CLASSIFICATION 20 Newsgroups .fit( ) Trim Tokenize Bigrams Top Features Naive Bayes Max Classifier Trim Tokenize Bigrams Max Classifier Top Features Transformer Naive Bayes Model Once estimated - apply these steps to your production data in an online or batch fashion.
  • 9. NOT SO SIMPLE EXAMPLE: IMAGE CLASSIFICATION Images (VOC2007) .fit( ) Resize Grayscale SIFT PCA FisherVector MaxClassifier Linear Regression Resize Grayscale SIFT MaxClassifier PCA Map Fisher Encoder Linear Model Achieves performance of Chatfield et. al., 2011 Pleasantly parallel featurization and evaluation. 7 minutes on a modest cluster. 5,000 examples, 40,000 features, 20 classes
  • 10. EVEN LESS SIMPLE: IMAGENET Color Edges Resize Grayscale SIFT PCA FisherVector Top 5 Classifier LCS PCA FisherVector Block Linear Solver <100 SLOC Upgrading the solver for higher precision means changing 1 LOC. Weighted Block Linear Solver Adding 100,000 more texture features is easy. Texture Gabor Wavelets PCA FisherVector 1000 class classification. 1,200,000 examples 64,000 features. 90 minutes on 100 nodes.
  • 11. OPTIMIZING KEYSTONEML PIPELINES High-level API enables rich space of optimizations Automated ML operator selection. Linear Solver L-BFGS Iterative SGD Direct Solver Training Data Grayscaler SIFT Extractor Reduce Dimensions Fisher Vector Normalize Column Sampler Linear Map Distributed PCA Column Sampler Local GMM Least Sq. L-BFGS Predictions Training Labels Auto-caching for iterative workloads.
  • 12. KEYSTONEML OPTIMIZER • Sampling-based cost model projects resource usage • CPU, Memory, Network • Utilization tracked through pipeline. • Decisions made to minimize total cost of execution. • Catalyst-based optimizer does the heavy lifting. Stage n d size (GB) Input 5000 1m pixel JPEG 0.4 Resize 5000 260k pixels 3.6 Grayscale 5000 260k pixels 1.2 SIFT 5000 65000x128 309 PCA 5000 65000x80 154 FV 5000 256x64x2 1.2 Linear Regression 5000 20 0.0007 Max Classifier 5000 1 0.00009
  • 13. CHOOSING A SOLVER • Datasets have a number of interesting degrees of freedom. • Problem size (n, d, k) • sparsity (nnz) • condition number • Platform has degrees of freedom: • Memory, CPU, Network, Nodes • Solvers are predictable! 13 Where: A 2 Rn⇥d X 2 Rd⇥k B 2 Rn⇥k Objective: min X |AX B| 2 2 + |X|2 2
  • 14. CHOOSING A SOLVER • Three Solvers • Exact, Block, LBFGS • Two datasets • Amazon - >99% sparse, n=65m • TIMIT - dense, n=2m • Exact solve works well for small # features. • Use LBFGS for sparse problems. • Block solver scales well to big dense problems. • Hundreds of thousands of features. ● ● ● ● ● ● Amazon TIMIT 100 1000 10000 10 100 1000 1024 2048 4096 8192 16384 1024 2048 4096 8192 16384 Number of Features Time(s) Solver ● Exact Block Solver LBFGS 14
  • 15. SOLVER PERFORMANCE • Compared KeystoneML with: • VowpalWabbit - specialized system for large, sparse problems. • SystemML - general purpose, optimizing ML system. • Two problems: • Amazon - Sparse text features. • BinaryTIMIT - Dense phoneme data. • High Order Bit: • KeystoneML pipelines featurization and adapts to workload changes. Amazon 0 200 400 600 800 1024 2048 4096 8192 16384 Features Time(s) System KeystoneML SystemML Binary TIMIT 0 100 200 300 400 1024 2048 4096 8192 16384 Features Time(s) System KeystoneML SystemML Amazon 0 50 100 150 1024 2048 4096 8192 16384 Features Time(s) System KeystoneML Vowpal Wabbit Binary TIMIT 0 500 1000 1500 1024 2048 4096 8192 16384 Features Time(s) System KeystoneML Vowpal Wabbit
  • 16. DECIDING WHATTO SAVE • Pipelines Generate Lots of intermediate state. • E.g. SIFT features blow up a 0.42GBVOC dataset to 300GB. • Iterative algorithms —> state needed many times. • How do we determine what to save for later and what to reuse, given fixed resource budget? • Can we adapt to workload changes? 16 Resize Grayscale SIFT PCA FisherVector MaxClassifier Linear Regression
  • 17. CACHING PROBLEM • Output is computed via depth- first execution of DAG. • Caching “truncates” a path after first visit. • Want to minimize execution time. • Subject to memory constraints. • Picking optimal set is hard! 17 A B C D E 60s 50g 40s 200g 20s 40g 40g 15s 5s 10g Output Cache set Time Memory ABCDE 140s 340g B 140s 200g A 180s 50g {} 240s 0g
  • 18. END-TO-END PERFORMANCE Dataset Training Examples Features Raw Size (GB) Feature Size (GB) Amazon 65 million 100k (sparse) 14 89 TIMIT 2.25 million 528k 7.5 8800 ImageNet 1.28 million 262k 74 2500 VOC 5000 40k 0.43 1.5
  • 19. END-TO-END PERFORMANCE Dataset KeystoneML Accuracy Reported Accuracy KeystoneML Time (m) Reported Time (m) Speedup over Reported Amazon 91.6% N/A 3.3 N/A N/A TIMIT 66.1% 66.3% 138 120 0.87x ImageNet 67.4% 66.6% 270 5760 21x VOC 57.2% 59.2% 7 87 12x
  • 20. END-TO-END PERFORMANCE Amazon TIMIT ImageNet 0 5 10 15 0 20 40 60 0 100 200 300 400 500 8 16 32 64 128 8 16 32 64 128 8 16 32 64 128 Cluster Size (# of nodes) Time(minutes) Stage Loading Train Data Featurization Model Solve Loading Test Data Model Eval ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Amazon TIMIT ImageNet 1 2 4 8 16 8 16 32 64 128 8 16 32 64 128 8 16 32 64 128 Cluster Size (# of nodes) Speedupover8nodes(x)
  • 21. END-TO-END PERFORMANCE • Tested three levels of optimization • None • Auto-caching only • Auto-caching and operator-selection. • 7x to 15x speedup 0 5 10 15 Amazon TIMIT VOC Workload Speedup Optimization Level None Whole−Pipeline All
  • 24. SOFTWARE FEATURES • Data Loaders • CSV, CIFAR, ImageNet,VOC,TIMIT, 20 Newsgroups • Transformers • NLP -Tokenization, n-grams, term frequency, NER*, parsing* • Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling,Windowing, HOG, Daisy • Speech - MFCCs* • Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT • Utility/misc - Caching,Top-K classifier, indicator label mapping, sparse/dense encoding transformers. • Estimators • Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM* • Example Pipelines • NLP - Amazon Product Review Classification, 20 Newsgroups,Wikipedia Language model • Images - MNIST, CIFAR,VOC, ImageNet • Speech -TIMIT • Evaluation Metrics • Binary Classification • Multiclass Classification • Multilabel Classification * - Links to external library Just 11k Lines of Code, 5k of which areTests or JavaDoc.
  • 26. TRANSFORMERS TransformerInput Output abstract classTransformer[In, Out] { def apply(in: In): Out def apply(in: RDD[In]): RDD[Out] = in.map(apply) … } TYPE SAFETY HELPS ENSURE ROBUSTNESS
  • 27. ESTIMATORS EstimatorRDD[Input] abstract class Estimator[In, Out] { def fit(in: RDD[In]):Transformer[In,Out] … } Transformer .fit()
  • 28. CHAINING NGrams(2)String Vectorizer VectorBigrams val featurizer:Transformer[String,Vector] = NGrams(2) thenVectorizer featurizerString Vector =
  • 29. COMPLEX PIPELINES .fit(data, labels) pipelineString Prediction = val pipeline = (featurizer thenLabelEstimator LinearModel).fit(data, labels) featurizerString Vector Linear Model Prediction featurizerString Vector Linear Map Prediction