Apache Spark MLlib's
past trajectory and new directions
Joseph K. Bradley
Spark Summit 2017
2
About me
Software engineer at Databricks
Apache Spark committer & PMC member
Ph.D. Carnegie Mellon in Machine Learning
3
About Databricks
TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
4
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
•  Collaborative cloud environment
•  Free version (community edition)
DATABRICKS RUNTIME 3.0
•  Apache Spark - optimized for the cloud
•  Caching and optimization layer - DBIO
•  Enterprise security - DBES
Try for free today.
databricks.com
5
5 years ago…
Zaharia et al., NSDI 2012
Scalable ML
using Spark!
6
3 ½ years ago until today
People added some algorithms…
Classification
•  Logistic regression & linear SVMs
•  L1, L2, or elastic net regularization
•  Decision trees
•  Random forests
•  Gradient-boosted trees
•  Naive Bayes
•  Multilayer perceptron
•  One-vs-rest
•  Streaming logistic regression
Regression
•  Least squares
•  L1, L2, or elastic net regularization
•  Decision trees
•  Random forests
•  Gradient-boosted trees
•  Isotonic regression
•  Streaming least squares
Feature extraction, transformation & selection
•  Binarizer
•  Bucketizer
•  Chi-Squared selection
•  CountVectorizer
•  Discrete cosine transform
•  ElementwiseProduct
•  Hashing term frequency
•  Inverse document frequency
•  MinMaxScaler
•  Ngram
•  Normalizer
•  VectorAssembler
•  VectorIndexer
•  VectorSlicer
•  Word2Vec
Clustering
•  Gaussian mixture models
•  K-Means
•  Streaming K-Means
•  Latent Dirichlet Allocation
•  Power Iteration Clustering
Recommendation
•  Alternating Least Squares (ALS)
Frequent itemsets
•  FP-growth
•  Prefix span
Statistics
•  Pearson correlation
•  Spearman correlation
•  Online summarization
•  Chi-squared test
•  Kernel density estimation
Linear Algebra
•  Local & distributed dense & sparse matrices
•  Matrix decompositions (PCA, SVD, QR)
7
Today
Apache Spark is widely used for ML.
•  Many production use cases
•  1000s of commits
•  100s of contributors
•  10,000s of users
(on Databricks alone)
8
This talk
What?
How has MLlib developed, and what has it become?
So what?
What are the implications for the dev & user communities?
Now what?
Where could or should it go?
9
What?
How has MLlib developed, and what has it become?
10
Let’s do some data analytics
50+ algorithms and featurizers
Rapid development
Shift from adding algorithms to improving algorithms & infrastructure
Growing dev community
14
Major projects
Releases (timeline axis): 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0, 2.1, master
DataFrame-based API
ML persistence
Trees & ensembles
GLMs
SparkR
Pipelines
Featurizers
Note: These are unofficial “project” labels based on JIRA/PR activity.
15
So what?
What are the implications for the dev & user communities?
16
Integrating ML with the Big Data world
Seamless integration with SQL, DataFrames, Graphs, Streaming,
both within MLlib…
•  Data sources: CSV, JSON, Parquet, …
•  DataFrames: simple, scalable pre/post-processing
•  Graphs: graph-based ML implementations
•  Streaming: ML models deployed with Streaming
17
Integrating ML with the Big Data world
Seamless integration with SQL, DataFrames, Graphs, Streaming,
both within MLlib…and outside MLlib
E.g., on spark-packages.org
•  scikit-learn integration package (“spark-sklearn”)
•  Stanford CoreNLP integration (“spark-corenlp”)
Deep Learning libraries
•  Deep Learning Pipelines (“spark-deep-learning”)
•  BigDL
•  TensorFlowOnSpark
•  Distributed training with Keras (“dist-keras”)
Framework for integrating non-“big data” or specialized ML libraries
18
Scaling
Workflow scalability
Same code on laptop with small data → cluster with big data
Big data scalability
Scale-out implementations
•  4x faster than xgboost: Abuzaid et al. (2016)
•  8x slower than xgboost: Chen & Guestrin (2016)
19
Building workflows
Simple concepts: Transformers, Estimators, Models, Evaluators
•  Unified, standardized APIs, including for algorithm parameters
Pipelines
•  Repeatable workflows across Spark deployments and languages
•  DataFrame APIs and optimizations
•  Automated model tuning
Workflow stages: ETL, Featurization, Training, Evaluation, Tuning
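The Transformer / Estimator / Model / Pipeline concepts on this slide can be illustrated with a small, library-free sketch. Everything below (`Scale`, `MeanCenterer`, `ToyPipeline`, rows as plain dicts) is a hypothetical stand-in and not MLlib's actual API; it only mirrors the idea that an Estimator's `fit` returns a Model, and that a Pipeline chains stages.

```python
from typing import Dict, List

Row = Dict[str, float]


class Scale:
    """Transformer (hypothetical): stateless, maps rows to rows."""

    def __init__(self, input_col: str, output_col: str, factor: float):
        self.input_col = input_col
        self.output_col = output_col
        self.factor = factor

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.output_col: r[self.input_col] * self.factor} for r in rows]


class MeanCentererModel:
    """Model (hypothetical): a Transformer holding state learned by an Estimator."""

    def __init__(self, input_col: str, output_col: str, mean: float):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


class MeanCenterer:
    """Estimator (hypothetical): learns the column mean and returns a Model."""

    def __init__(self, input_col: str, output_col: str):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows: List[Row]) -> MeanCentererModel:
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return MeanCentererModel(self.input_col, self.output_col, mean)


class ToyPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Pipeline (hypothetical): fits Estimators in order, passing data through."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows: List[Row]) -> ToyPipelineModel:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):  # Estimator: replace it by its fitted Model
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [{"x": 1.0}, {"x": 3.0}]
pipeline = ToyPipeline([Scale("x", "x2", 2.0), MeanCenterer("x2", "centered")])
model = pipeline.fit(train)
print(model.transform([{"x": 2.0}]))
```

The fitted pipeline can be reused on new rows without refitting, which is the property that makes workflows repeatable.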
20
Now what?
Where could or should it go?
21
Top items from the dev@ list…
• Continued speed & scalability improvements
• Enhancements to core algorithms
• ML building blocks: linear algebra, optimization
• Extensible and customizable APIs
Goals
Scale-out ML
Standard library
Extensible API
These are unofficial goals based
on my experience + discussions
on the dev@ mailing list.
22
Scale-out ML
General improvements
• DataFrame-based implementations
Targeted improvements
• Gradient-boosted trees: feature subsampling
• Linear models: vector-free L-BFGS for billions of features
• …
E.g., 10x scaling for Connected Components in GraphFrames
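As a rough illustration of what feature subsampling for tree ensembles means, the sketch below draws a random subset of feature indices for each tree. The function name `subsample_features` and its parameters are assumptions made for this example; it is not how MLlib implements the feature.

```python
import math
import random
from typing import List


def subsample_features(num_features: int, fraction: float, rng: random.Random) -> List[int]:
    """Pick a random subset of feature indices for one tree (hypothetical helper)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Always keep at least one feature.
    k = max(1, math.ceil(fraction * num_features))
    return sorted(rng.sample(range(num_features), k))


rng = random.Random(42)
# Each tree in the ensemble would see a different subset of the 10 features.
subsets_per_tree = [subsample_features(10, 0.5, rng) for _ in range(3)]
for tree_index, subset in enumerate(subsets_per_tree):
    print(tree_index, subset)
```

Training each tree on fewer features reduces the work per tree and can decorrelate the trees.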
23
Standard library
General improvements
•  NaN/null handling
•  Instance weights
•  Warm starts / initial models
•  Multi-column feature transforms
Algorithm-specific improvements
•  Trees: access tree structure from Python
•  …
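Two of the general improvements listed here, instance weights and warm starts, can be sketched on a toy problem. The code below fits a one-parameter least-squares model `y ≈ w * x` by gradient descent; `fit_weighted`, its parameters, and the data are assumptions for illustration only, not an MLlib API.

```python
from typing import List, Optional, Tuple


def fit_weighted(
    xs: List[float],
    ys: List[float],
    weights: Optional[List[float]] = None,
    init_w: float = 0.0,
    lr: float = 0.1,
    tol: float = 1e-10,
    max_steps: int = 10_000,
) -> Tuple[float, int]:
    """Return (w, steps used) for weighted least squares on y ≈ w * x."""
    if weights is None:
        weights = [1.0] * len(xs)  # unweighted case
    total = sum(weights)
    w = init_w  # warm start: begin from a previous solution instead of 0.0
    for step in range(1, max_steps + 1):
        # Each instance contributes to the gradient in proportion to its weight.
        grad = sum(wt * (w * x - y) * x for wt, x, y in zip(weights, xs, ys)) / total
        w -= lr * grad
        if abs(grad) < tol:
            return w, step
    return w, max_steps


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.5]

cold_w, cold_steps = fit_weighted(xs, ys)
warm_w, warm_steps = fit_weighted(xs, ys, init_w=cold_w)
print(cold_steps, warm_steps)

# Giving the last instance weight 2 behaves like duplicating it.
weighted_w, _ = fit_weighted(xs, ys, weights=[1.0, 1.0, 2.0])
duplicated_w, _ = fit_weighted(xs + [3.0], ys + [6.5])
print(weighted_w, duplicated_w)
```

Starting from an already-converged solution should need far fewer iterations than a cold start, which is the motivation for supporting initial models.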
24
Extensible and customizable APIs
MLlib has ~50 algorithms.
CRAN lists 10,750 packages.
→ Users must be able to modify algorithms or write their own.
[Figure: usage plotted against algorithms]
25
Spark Packages
340+ packages written for Spark
80+ packages for ML and Graphs
E.g.:
• GraphFrames (“graphframes”): DataFrame-based graphs
• Bisecting K-Means (“bisecting-kmeans”): now part of MLlib
• Stanford CoreNLP wrapper (“spark-corenlp”): UDFs for NLP
spark-packages.org
26
How to write a custom algorithm
You need to:
•  Extend an API like Transformer, Estimator, Model, or Evaluator
MLlib provides some help:
•  UnaryTransformer
•  DefaultParamsWritable/Readable
•  defaultCopy
You get:
•  Natural integration with DataFrames, Pipelines, model tuning
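The idea behind a helper like UnaryTransformer is that the base class handles the input and output columns, and the subclass supplies only a per-value function. The sketch below imitates that division of labor in plain Python; `UnaryTransformerSketch`, `Lowercaser`, and the dict-based rows are hypothetical and do not reproduce MLlib's class signatures.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class UnaryTransformerSketch:
    """Base class (hypothetical): one input column in, one output column out."""

    def __init__(self, input_col: str, output_col: str):
        self.input_col = input_col
        self.output_col = output_col

    def create_transform_func(self) -> Callable[[Any], Any]:
        """Subclasses return the function applied to each input value."""
        raise NotImplementedError

    def transform(self, rows: List[Row]) -> List[Row]:
        func = self.create_transform_func()
        # Keep all existing columns and append the transformed one.
        return [{**row, self.output_col: func(row[self.input_col])} for row in rows]


class Lowercaser(UnaryTransformerSketch):
    """Custom algorithm: only the per-value logic needs to be written."""

    def create_transform_func(self) -> Callable[[str], str]:
        return str.lower


rows = [{"text": "MLlib"}, {"text": "Spark"}]
print(Lowercaser("text", "lower").transform(rows))
```

Because the custom class exposes the same `transform` contract as built-in stages, it can be dropped into a pipeline alongside them.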
27
Challenges in customization
Modify existing algorithms
with custom:
•  Loss functions
•  Optimizers
•  Stopping criteria
•  Callbacks
Implement new algorithms
without worrying about:
•  Boilerplate
•  Python API
•  ML building blocks
•  Feature attributes/metadata
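To make the customization points concrete, here is a minimal sketch of an optimizer loop in which the loss gradient, the stopping criterion, and the callbacks are all supplied by the caller. The `minimize` function and its signature are assumptions for this example and do not correspond to an existing MLlib API.

```python
from typing import Callable, List, Optional, Sequence, Tuple

StepHook = Callable[[int, float, float], object]


def minimize(
    grad_fn: Callable[[float], float],
    w0: float,
    lr: float = 0.1,
    should_stop: Optional[Callable[[int, float, float], bool]] = None,
    callbacks: Sequence[StepHook] = (),
    max_steps: int = 1000,
) -> float:
    """Plain gradient descent with pluggable loss gradient, stopping rule, callbacks."""
    w = w0
    for step in range(max_steps):
        g = grad_fn(w)
        for callback in callbacks:  # e.g. logging or metric collection
            callback(step, w, g)
        if should_stop is not None and should_stop(step, w, g):
            break
        w -= lr * g
    return w


history: List[Tuple[int, float]] = []

w = minimize(
    grad_fn=lambda w: 2.0 * (w - 3.0),               # custom loss: (w - 3)^2
    w0=0.0,
    should_stop=lambda step, w, g: abs(g) < 1e-8,    # custom stopping criterion
    callbacks=[lambda step, w, g: history.append((step, w))],
)
print(w, len(history))
```

Swapping in a different loss only requires passing a different `grad_fn`; the loop itself stays unchanged.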
28
Example of MLlib APIs
Deep Learning Pipelines for Apache Spark
spark-packages.org/package/databricks/spark-deep-learning
APIs are great for users:
•  Plug & play algorithms
•  Standard API
But challenges remain:
•  Development requires expertise
•  Community under development
29
My challenge to you
Can we fill out the long tail of ML use cases?
[Figure: usage plotted against algorithms]
30
Resources
Spark Packages: spark-packages.org
•  Plugin for building packages:
https://github.com/databricks/sbt-spark-package
•  Tool for generating package templates:
https://github.com/databricks/spark-package-cmd-tool
Example packages:
•  scikit-learn integration (“spark-sklearn”) – Python-only
•  GraphFrames – Scala/Java/Python
Thank you!
Office hours today @ 3:50pm at Databricks booth
Twitter: @jkbatcmu

Contenu connexe

Tendances

Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Databricks
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Databricks
 

Tendances (20)

Spark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with SparkSpark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with Spark
 
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving System
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 

Similaire à Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley

2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
Chester Chen
 

Similaire à Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley (20)

Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache Spark
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
AI and Spark - IBM Community AI Day
AI and Spark - IBM Community AI DayAI and Spark - IBM Community AI Day
AI and Spark - IBM Community AI Day
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease
Build, Scale, and Deploy Deep Learning Pipelines with EaseBuild, Scale, and Deploy Deep Learning Pipelines with Ease
Build, Scale, and Deploy Deep Learning Pipelines with Ease
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
 
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 

Plus de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
ptikerjasaptiker
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 

Dernier (20)

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx

Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley

  • 1. Apache Spark MLlib's past trajectory and new directions Joseph K. Bradley Spark Summit 2017
  • 2. About me: Software engineer at Databricks; Apache Spark committer & PMC member; Ph.D. in Machine Learning, Carnegie Mellon
  • 3. About Databricks. TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009. PRODUCT: Unified Analytics Platform. MISSION: Making Big Data Simple
  • 4. Try Apache Spark in Databricks! UNIFIED ANALYTICS PLATFORM: Collaborative cloud environment; Free version (community edition). DATABRICKS RUNTIME 3.0: Apache Spark, optimized for the cloud; Caching and optimization layer (DBIO); Enterprise security (DBES). Try for free today: databricks.com
  • 6. 3 ½ years ago until today: People added some algorithms…
      Classification: Logistic regression & linear SVMs; L1, L2, or elastic net regularization; Decision trees; Random forests; Gradient-boosted trees; Naive Bayes; Multilayer perceptron; One-vs-rest; Streaming logistic regression
      Regression: Least squares; L1, L2, or elastic net regularization; Decision trees; Random forests; Gradient-boosted trees; Isotonic regression; Streaming least squares
      Feature extraction, transformation & selection: Binarizer; Bucketizer; Chi-Squared selection; CountVectorizer; Discrete cosine transform; ElementwiseProduct; Hashing term frequency; Inverse document frequency; MinMaxScaler; Ngram; Normalizer; VectorAssembler; VectorIndexer; VectorSlicer; Word2Vec
      Clustering: Gaussian mixture models; K-Means; Streaming K-Means; Latent Dirichlet Allocation; Power Iteration Clustering
      Recommendation: Alternating Least Squares (ALS)
      Frequent itemsets: FP-growth; Prefix span
      Statistics: Pearson correlation; Spearman correlation; Online summarization; Chi-squared test; Kernel density estimation
      Linear Algebra: Local & distributed dense & sparse matrices; Matrix decompositions (PCA, SVD, QR)
  • 7. Today: Apache Spark is widely used for ML. Many production use cases; 1000s of commits; 100s of contributors; 10,000s of users (on Databricks alone)
  • 8. This talk. What? How has MLlib developed, and what has it become? So what? What are the implications for the dev & user communities? Now what? Where could or should it go?
  • 9. What? How has MLlib developed, and what has it become?
  • 10. Let’s do some data analytics: 50+ algorithms and featurizers; Rapid development; Shift from adding algorithms to improving algorithms & infrastructure; Growing dev community
  • 14. Major projects (timeline across releases 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0, 2.1, master): DataFrame-based API; ML persistence; Trees & ensembles; GLMs; SparkR; Pipelines; Featurizers. Note: These are unofficial “project” labels based on JIRA/PR activity.
  • 15. So what? What are the implications for the dev & user communities?
  • 16. Integrating ML with the Big Data world. Seamless integration with SQL, DataFrames, Graphs, Streaming, both within MLlib… Data sources: CSV, JSON, Parquet, … DataFrames: Simple, scalable pre/post-processing. Graphs: Graph-based ML implementations. Streaming: ML models deployed with Streaming
  • 17. Integrating ML with the Big Data world. Seamless integration with SQL, DataFrames, Graphs, Streaming, both within MLlib…and outside MLlib. E.g., on spark-packages.org: scikit-learn integration package (“spark-sklearn”); Stanford CoreNLP integration (“spark-corenlp”). Deep Learning libraries: Deep Learning Pipelines (“spark-deep-learning”); BigDL; TensorFlowOnSpark; Distributed training with Keras (“dist-keras”). Framework for integrating non-“big data” or specialized ML libraries
  • 18. Scaling. Workflow scalability: Same code on laptop with small data → cluster with big data. Big data scalability: Scale-out implementations. 4x faster than xgboost; Abuzaid et al. (2016); Chen & Guestrin (2016); 8x slower than xgboost
  • 19. Building workflows. Simple concepts: Transformers, Estimators, Models, Evaluators (unified, standardized APIs, including for algorithm parameters). Pipelines: Repeatable workflows across Spark deployments and languages; DataFrame APIs and optimizations; Automated model tuning. Workflow stages: ETL, Featurization, Training, Evaluation, Tuning
  • 20. Now what? Where could or should it go?
  • 21. Top items from the dev@ list… Continued speed & scalability improvements; Enhancements to core algorithms; ML building blocks: linear algebra, optimization; Extensible and customizable APIs. Goals: Scale-out ML; Standard library; Extensible API. These are unofficial goals based on my experience + discussions on the dev@ mailing list.
  • 22. Scale-out ML. General improvements: DataFrame-based implementations. Targeted improvements: Gradient-boosted trees (feature subsampling); Linear models (vector-free L-BFGS for billions of features); … E.g., 10x scaling for Connected Components in GraphFrames
  • 23. Standard library. General improvements: NaN/null handling; Instance weights; Warm starts / initial models; Multi-column feature transforms. Algorithm-specific improvements: Trees (access tree structure from Python); …
  • 24. Extensible and customizable APIs. MLlib has ~50 algorithms. CRAN lists 10,750 packages. → Users must be able to modify algorithms or write their own. (Chart axes: algorithms, usage)
  • 25. Spark Packages (spark-packages.org): 340+ packages written for Spark; 80+ packages for ML and Graphs. E.g.: GraphFrames (“graphframes”), DataFrame-based graphs; Bisecting K-Means (“bisecting-kmeans”), now part of MLlib; Stanford CoreNLP wrapper (“spark-corenlp”), UDFs for NLP
  • 26. How to write a custom algorithm. You need to: Extend an API like Transformer, Estimator, Model, or Evaluator. MLlib provides some help: UnaryTransformer; DefaultParamsWritable/Readable; defaultCopy. You get: Natural integration with DataFrames, Pipelines, model tuning
  • 27. Challenges in customization. Modify existing algorithms with custom: Loss functions; Optimizers; Stopping criteria; Callbacks. Implement new algorithms without worrying about: Boilerplate; Python API; ML building blocks; Feature attributes/metadata
  • 28. Example of MLlib APIs: Deep Learning Pipelines for Apache Spark (spark-packages.org/package/databricks/spark-deep-learning). APIs are great for users: Plug & play algorithms; Standard API. But challenges remain: Development requires expertise; Community under development
  • 29. My challenge to you: Can we fill out the long tail of ML use cases? (Chart axes: algorithms, usage)
  • 30. Resources. Spark Packages: spark-packages.org; Plugin for building packages: https://github.com/databricks/sbt-spark-package ; Tool for generating package templates: https://github.com/databricks/spark-package-cmd-tool . Example packages: scikit-learn integration (“spark-sklearn”), Python-only; GraphFrames, Scala/Java/Python
  • 31. Thank you! Office hours today @ 3:50pm at Databricks booth Twitter: @jkbatcmu
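
The "Building workflows" and "How to write a custom algorithm" slides describe a small set of concepts: a Transformer maps data to data, an Estimator is fit on data and yields a Model (itself a Transformer), and a Pipeline chains stages. The following is a minimal conceptual sketch of that contract in plain Python. It is not MLlib's API: every class here (`Transformer`, `Estimator`, `ToyMinMaxScaler`, `ToyBinarizer`, `Pipeline`, and so on) is a hypothetical stand-in operating on lists of dicts rather than DataFrames, and parameters, persistence, and evaluators are left out.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Union

Row = Dict[str, float]  # toy stand-in for a DataFrame row (assumption)


class Transformer(ABC):
    """Maps a dataset to a new dataset, typically by adding a column."""

    @abstractmethod
    def transform(self, rows: List[Row]) -> List[Row]: ...


class Estimator(ABC):
    """Learns from a dataset and returns a fitted Transformer (a model)."""

    @abstractmethod
    def fit(self, rows: List[Row]) -> Transformer: ...


class ToyMinMaxScalerModel(Transformer):
    def __init__(self, col: str, out_col: str, lo: float, hi: float):
        self.col, self.out_col, self.lo, self.hi = col, out_col, lo, hi

    def transform(self, rows: List[Row]) -> List[Row]:
        span = self.hi - self.lo
        # Rescale with the statistics learned at fit time; guard a zero span.
        return [
            {**r, self.out_col: (r[self.col] - self.lo) / span if span else 0.0}
            for r in rows
        ]


class ToyMinMaxScaler(Estimator):
    def __init__(self, col: str, out_col: str):
        self.col, self.out_col = col, out_col

    def fit(self, rows: List[Row]) -> ToyMinMaxScalerModel:
        values = [r[self.col] for r in rows]
        return ToyMinMaxScalerModel(self.col, self.out_col, min(values), max(values))


class ToyBinarizer(Transformer):
    """Stateless stage: nothing to learn, so it is a Transformer only."""

    def __init__(self, col: str, out_col: str, threshold: float):
        self.col, self.out_col, self.threshold = col, out_col, threshold

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**r, self.out_col: 1.0 if r[self.col] > self.threshold else 0.0}
            for r in rows
        ]


class PipelineModel(Transformer):
    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    def __init__(self, stages: List[Union[Estimator, Transformer]]):
        self.stages = stages

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted: List[Transformer] = []
        for stage in self.stages:
            # Fit estimators on the data produced by the earlier stages.
            model = stage.fit(rows) if isinstance(stage, Estimator) else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


train = [{"x": 0.0}, {"x": 5.0}, {"x": 10.0}]
pipeline = Pipeline([
    ToyMinMaxScaler("x", "scaled"),
    ToyBinarizer("scaled", "label", threshold=0.5),
])
model = pipeline.fit(train)
predictions = model.transform([{"x": 2.0}, {"x": 8.0}])
print([r["label"] for r in predictions])  # [0.0, 1.0]
```

Because the fitted pipeline is itself a Transformer, a custom stage written against the same two-method contract slots in next to the built-in ones, which is the integration benefit the "You get" bullet on slide 26 points to.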