SlideShare une entreprise Scribd logo
1  sur  35
Télécharger pour lire hors ligne
Practical Machine Learning
Pipelines with MLlib
Joseph K. Bradley
March 18, 2015
Spark Summit East 2015
About Spark MLlib
Started in UC Berkeley AMPLab
•  Shipped with Spark 0.8
Currently (Spark 1.3)
•  Contributions from 50+ orgs, 100+ individuals
•  Good coverage of algorithms
classifica'on	
  
regression	
  
clustering	
  
recommenda'on	
  
feature	
  extrac'on,	
  selec'on	
  
frequent	
  itemsets	
  
sta's'cs	
  
linear	
  algebra	
  
MLlib’s Mission
How	
  can	
  we	
  move	
  beyond	
  this	
  list	
  of	
  algorithms	
  
and	
  help	
  users	
  developer	
  real	
  ML	
  workflows?	
  
MLlib’s mission is to make practical
machine learning easy and scalable.
•  Capable of learning from large-scale datasets
•  Easy to build machine learning applications
Outline
ML workflows
Pipelines
Roadmap
Outline
ML workflows
Pipelines
Roadmap
Example: Text Classification
Set Footer from Insert Dropdown Menu 6
Goal: Given a text document, predict its topic.
Subject: Re: Lexan Polish?!
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.!
McQuires will do something...!
1:	
  about	
  science	
  
0:	
  not	
  about	
  science	
  
Label	
  Features	
  
text,	
  image,	
  vector,	
  ...	
  
CTR,	
  inches	
  of	
  rainfall,	
  ...	
  
Dataset:	
  “20	
  Newsgroups”	
  
From	
  UCI	
  KDD	
  Archive	
  
Training & Testing
Set Footer from Insert Dropdown Menu 7
Training	
   Tes*ng/Produc*on	
  
Given	
  labeled	
  data:	
  
	
  	
  	
  	
  	
  RDD	
  of	
  (features,	
  label)	
  
Subject: Re: Lexan Polish?!
Suggest McQuires #1 plastic
polish. It will help...!
Subject: RIPEM FAQ!
RIPEM is a program which
performs Privacy Enhanced...!
...	
  
Label 0!
Label 1!
Learn	
  a	
  model.	
  
Given	
  new	
  unlabeled	
  data:	
  
	
  	
  	
  	
  	
  RDD	
  of	
  features	
  
Subject: Apollo Training!
The Apollo astronauts also
trained at (in) Meteor...!
Subject: A demo of Nonsense!
How can you lie about
something that no one...!
Use	
  model	
  to	
  make	
  predic'ons.	
  
Label 1!
Label 0!
Example ML Workflow
Training
Train	
  model	
  
labels	
  +	
  predicEons	
  
Evaluate	
  
Load	
  data	
  
labels	
  +	
  plain	
  text	
  
labels	
  +	
  feature	
  vectors	
  
Extract	
  features	
  
Explicitly	
  unzip	
  &	
  zip	
  RDDs	
  
labels.zip(predictions).map {
if (_._1 == _._2) ...
}
val features: RDD[Vector]
val predictions: RDD[Double]
Create	
  many	
  RDDs	
  
val labels: RDD[Double] =
data.map(_.label)
Pain	
  point	
  
Example ML Workflow
Write	
  as	
  a	
  script	
  
Pain	
  point	
  
•  Not	
  modular	
  
•  Difficult	
  to	
  re-­‐use	
  workflow	
  
Training
labels	
  +	
  feature	
  vectors	
  
Train	
  model	
  
labels	
  +	
  predicEons	
  
Evaluate	
  
Load	
  data	
  
labels	
  +	
  plain	
  text	
  
Extract	
  features	
  
Example ML Workflow
Training
labels	
  +	
  feature	
  vectors	
  
Train	
  model	
  
labels	
  +	
  predicEons	
  
Evaluate	
  
Load	
  data	
  
labels	
  +	
  plain	
  text	
  
Extract	
  features	
  
Testing/Production
feature	
  vectors	
  
Predict	
  using	
  model	
  
predicEons	
  
Act	
  on	
  predic'ons	
  
Load	
  new	
  data	
  
plain	
  text	
  
Extract	
  features	
  
Almost	
  
iden-cal	
  
workflow	
  
Example ML Workflow
Training
labels	
  +	
  feature	
  vectors	
  
Train	
  model	
  
labels	
  +	
  predicEons	
  
Evaluate	
  
Load	
  data	
  
labels	
  +	
  plain	
  text	
  
Extract	
  features	
  
Pain	
  point	
  
Parameter	
  tuning	
  
•  Key	
  part	
  of	
  ML	
  
•  Involves	
  training	
  many	
  models	
  
•  For	
  different	
  splits	
  of	
  the	
  data	
  
•  For	
  different	
  sets	
  of	
  parameters	
  
Pain Points
Create	
  &	
  handle	
  many	
  RDDs	
  and	
  data	
  types	
  
Write	
  as	
  a	
  script	
  
Tune	
  parameters	
  
Enter...
Pipelines!	
   in	
  Spark	
  1.2	
  &	
  1.3	
  
Outline
ML workflows
Pipelines
Roadmap
Key Concepts
DataFrame: The ML Dataset
Abstractions: Transformers, Estimators, & Evaluators
Parameters: API & tuning
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named	
  columns	
  with	
  types	
  
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double
label	
   text	
   words	
   features	
  
0	
   This	
  is	
  ...	
   [“This”,	
  “is”,	
  …]	
   [0.5,	
  1.2,	
  …]	
  
0	
   When	
  we	
  ...	
   [“When”,	
  ...]	
   [1.9,	
  -­‐0.8,	
  …]	
  
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named	
  columns	
  with	
  types	
   Domain-­‐Specific	
  Language	
  
# Select science articles
sciDocs =
data.filter(“label” == 1)
# Scale labels
data(“label”) * 0.5
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
• Shipped	
  with	
  Spark	
  1.3	
  
• APIs	
  for	
  Python,	
  Java	
  &	
  Scala	
  (+R	
  in	
  dev)	
  
• Integra'on	
  with	
  Spark	
  SQL	
  
• Data	
  import/export	
  
• Internal	
  op'miza'ons	
  
Named	
  columns	
  with	
  types Domain-­‐Specific	
  Language	
  
Pain	
  point:	
  Create	
  &	
  handle	
  
many	
  RDDs	
  and	
  data	
  types	
  
BIG	
  data	
  
Abstractions
Set Footer from Insert Dropdown Menu 18
Training
Train	
  model	
  
Evaluate	
  
Load	
  data	
  
Extract	
  features	
  
Abstraction: Transformer
Set Footer from Insert Dropdown Menu 19
Training
Train	
  model	
  
Evaluate	
  
Extract	
  features	
  
def transform(DataFrame): DataFrame
label: Double
text: String
label: Double
text: String
features: Vector
Abstraction: Estimator
Set Footer from Insert Dropdown Menu 20
Training
Train	
  model	
  
Evaluate	
  
Extract	
  features	
  
label: Double
text: String
features: Vector
LogisticRegression
Model
def fit(DataFrame): Model
Train	
  model	
  
Abstraction: Evaluator
Set Footer from Insert Dropdown Menu 21
Training
Evaluate	
  
Extract	
  features	
  
label: Double
text: String
features: Vector
prediction: Double
Metric:	
  
accuracy
AUC
MSE
...
def evaluate(DataFrame): Double
Act	
  on	
  predic'ons	
  
Abstraction: Model
Set Footer from Insert Dropdown Menu 22
Model	
  is	
  a	
  type	
  of	
  Transformer	
  
def transform(DataFrame): DataFrame
text: String
features: Vector
Testing/Production
Predict	
  using	
  model	
  
Extract	
  features	
   text: String
features: Vector
prediction: Double
(Recall) Abstraction: Estimator
Set Footer from Insert Dropdown Menu 23
Training
Train	
  model	
  
Evaluate	
  
Load	
  data	
  
Extract	
  features	
  
label: Double
text: String
features: Vector
LogisticRegression
Model
def fit(DataFrame): Model
Abstraction: Pipeline
Set Footer from Insert Dropdown Menu 24
Training
Train	
  model	
  
Evaluate	
  
Load	
  data	
  
Extract	
  features	
  
label: Double
text: String
PipelineModel
Pipeline	
  is	
  a	
  type	
  of	
  Es*mator	
  
def fit(DataFrame): Model
Abstraction: PipelineModel
Set Footer from Insert Dropdown Menu 25
text: String
PipelineModel	
  is	
  a	
  type	
  of	
  Transformer	
  
def transform(DataFrame): DataFrame
Testing/Production
Predict	
  using	
  model	
  
Load	
  data	
  
Extract	
  features	
   text: String
features: Vector
prediction: Double
Act	
  on	
  predic'ons	
  
Abstractions: Summary
Set Footer from Insert Dropdown Menu 26
Training
Train	
  model	
  
Evaluate	
  
Load	
  data	
  
Extract	
  features	
  Transformer
DataFrame
Estimator
Evaluator
Testing
Predict	
  using	
  model	
  
Evaluate	
  
Load	
  data	
  
Extract	
  features	
  
Demo
Set Footer from Insert Dropdown Menu 27
Transformer
DataFrame
Estimator
Evaluator
label: Double
text: String
features: Vector
Current	
  data	
  schema	
  
prediction: Double
Training
Logis'cRegression	
  
BinaryClassifica'on	
  
Evaluator	
  
Load	
  data	
  
Tokenizer	
  
Transformer HashingTF	
  
words: Seq[String]
Demo
Set Footer from Insert Dropdown Menu 28
Transformer
DataFrame
Estimator
Evaluator
Training
Logis'cRegression	
  
BinaryClassifica'on	
  
Evaluator	
  
Load	
  data	
  
Tokenizer	
  
Transformer HashingTF	
  
Pain	
  point:	
  Write	
  as	
  a	
  script	
  
Parameters
Set Footer from Insert Dropdown Menu 29
> hashingTF.numFeaturesStandard	
  API	
  
•  Typed	
  
•  Defaults	
  
•  Built-­‐in	
  doc	
  
•  Autocomplete	
  
org.apache.spark.ml.param.IntParam =
numFeatures: number of features
(default: 262144)
> hashingTF.setNumFeatures(1000)
> hashingTF.getNumFeatures
Parameter Tuning
Given:
•  Estimator
•  Parameter grid
•  Evaluator
Find best parameters
lr.regParam
{0.01, 0.1, 0.5}
hashingTF.numFeatures
{100, 1000, 10000}
Logis'cRegression	
  
Tokenizer	
  
HashingTF	
  
BinaryClassifica'on	
  
Evaluator	
  
CrossValidator
Parameter Tuning
Given:
•  Estimator
•  Parameter grid
•  Evaluator
Find best parameters
Logis'cRegression	
  
Tokenizer	
  
HashingTF	
  
BinaryClassifica'on	
  
Evaluator	
  
CrossValidator
Pain	
  point:	
  Tune	
  parameters	
  
Pipelines: Recap
Inspira'ons	
  
	
  
scikit-­‐learn	
  
	
  	
  +	
  Spark	
  DataFrame,	
  Param	
  API	
  
	
  
MLBase	
  (Berkeley	
  AMPLab)	
  
	
  	
  Ongoing	
  collaboraEons	
  
Create	
  &	
  handle	
  many	
  RDDs	
  and	
  data	
  types	
  
Write	
  as	
  a	
  script	
  
Tune	
  parameters	
  
DataFrame	
  
Abstrac'ons	
  
Parameter	
  API	
  
*	
  Groundwork	
  done;	
  full	
  support	
  WIP.	
  
Also	
  
•  Python,	
  Scala,	
  Java	
  APIs	
  
•  Schema	
  valida'on	
  
•  User-­‐Defined	
  Types*	
  
•  Feature	
  metadata*	
  
•  Mul'-­‐model	
  training	
  op'miza'ons*	
  
Outline
ML workflows
Pipelines
Roadmap
Roadmap
spark.mllib:	
  Primary	
  ML	
  package	
  
	
  
spark.ml:	
  High-­‐level	
  Pipelines	
  API	
  for	
  algorithms	
  in	
  spark.mllib
(experimental	
  in	
  Spark	
  1.2-­‐1.3)	
  
Near	
  future	
  
•  Feature	
  aoributes	
  
•  Feature	
  transformers	
  
•  More	
  algorithms	
  under	
  Pipeline	
  API	
  
	
  
Farther	
  ahead	
  
•  Ideas	
  from	
  AMPLab	
  MLBase	
  (auto-­‐tuning	
  models)	
  
•  SparkR	
  integra'on	
  
Thank you!
Outline	
  
•  ML	
  workflows	
  
•  Pipelines	
  
•  DataFrame	
  
•  Abstrac*ons	
  
•  Parameter	
  tuning	
  
•  Roadmap	
  
Spark	
  documenta'on	
  
	
  	
  	
  	
  hop://spark.apache.org/	
  
	
  
Pipelines	
  blog	
  post	
  
	
  	
  	
  	
  hops://databricks.com/blog/2015/01/07	
  

Contenu connexe

Tendances

Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks
 
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay ListingsScalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay ListingsSpark Summit
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDatabricks
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman FarahatSpark Summit
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondDataWorks Summit
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Modern Data Stack France
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDatabricks
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16BigMine
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0 Databricks
 

Tendances (20)

Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay ListingsScalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
 

Similaire à Practical Machine Learning Pipelines with MLlib

Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksSpark Summit
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create Turi, Inc.
 
Spring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffSpring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffJAX London
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Karthik Murugesan
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningmy6305874
 
Utilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningUtilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningParis Data Engineers !
 
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
"Technical Challenges behind Visual IDE for React Components" Tetiana MandziukFwdays
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Data Mining for Developers
Data Mining for DevelopersData Mining for Developers
Data Mining for Developersllangit
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
9781285852744 ppt ch13
9781285852744 ppt ch139781285852744 ppt ch13
9781285852744 ppt ch13Terry Yoast
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 

Similaire à Practical Machine Learning Pipelines with MLlib (20)

Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - Databricks
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create
 
Scala and Spring
Scala and SpringScala and Spring
Scala and Spring
 
Spring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffSpring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard Wolff
 
Linq to sql
Linq to sqlLinq to sql
Linq to sql
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
Utilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningUtilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learning
 
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Data Mining for Developers
Data Mining for DevelopersData Mining for Developers
Data Mining for Developers
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
9781285852744 ppt ch13
9781285852744 ppt ch139781285852744 ppt ch13
9781285852744 ppt ch13
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 

Dernier (20)

PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 

Practical Machine Learning Pipelines with MLlib

  • 1. Practical Machine Learning Pipelines with MLlib Joseph K. Bradley March 18, 2015 Spark Summit East 2015
  • 2. About Spark MLlib Started in UC Berkeley AMPLab •  Shipped with Spark 0.8 Currently (Spark 1.3) •  Contributions from 50+ orgs, 100+ individuals •  Good coverage of algorithms classifica'on   regression   clustering   recommenda'on   feature  extrac'on,  selec'on   frequent  itemsets   sta's'cs   linear  algebra  
  • 3. MLlib’s Mission How  can  we  move  beyond  this  list  of  algorithms   and  help  users  developer  real  ML  workflows?   MLlib’s mission is to make practical machine learning easy and scalable. •  Capable of learning from large-scale datasets •  Easy to build machine learning applications
  • 6. Example: Text Classification Set Footer from Insert Dropdown Menu 6 Goal: Given a text document, predict its topic. Subject: Re: Lexan Polish?! Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is.! McQuires will do something...! 1:  about  science   0:  not  about  science   Label  Features   text,  image,  vector,  ...   CTR,  inches  of  rainfall,  ...   Dataset:  “20  Newsgroups”   From  UCI  KDD  Archive  
  • 7. Training & Testing Set Footer from Insert Dropdown Menu 7 Training   Tes*ng/Produc*on   Given  labeled  data:            RDD  of  (features,  label)   Subject: Re: Lexan Polish?! Suggest McQuires #1 plastic polish. It will help...! Subject: RIPEM FAQ! RIPEM is a program which performs Privacy Enhanced...! ...   Label 0! Label 1! Learn  a  model.   Given  new  unlabeled  data:            RDD  of  features   Subject: Apollo Training! The Apollo astronauts also trained at (in) Meteor...! Subject: A demo of Nonsense! How can you lie about something that no one...! Use  model  to  make  predic'ons.   Label 1! Label 0!
  • 8. Example ML Workflow Training Train  model   labels  +  predicEons   Evaluate   Load  data   labels  +  plain  text   labels  +  feature  vectors   Extract  features   Explicitly  unzip  &  zip  RDDs   labels.zip(predictions).map { if (_._1 == _._2) ... } val features: RDD[Vector] val predictions: RDD[Double] Create  many  RDDs   val labels: RDD[Double] = data.map(_.label) Pain  point  
  • 9. Example ML Workflow Write  as  a  script   Pain  point   •  Not  modular   •  Difficult  to  re-­‐use  workflow   Training labels  +  feature  vectors   Train  model   labels  +  predicEons   Evaluate   Load  data   labels  +  plain  text   Extract  features  
  • 10. Example ML Workflow Training labels  +  feature  vectors   Train  model   labels  +  predicEons   Evaluate   Load  data   labels  +  plain  text   Extract  features   Testing/Production feature  vectors   Predict  using  model   predicEons   Act  on  predic'ons   Load  new  data   plain  text   Extract  features   Almost   iden-cal   workflow  
  • 11. Example ML Workflow Training labels  +  feature  vectors   Train  model   labels  +  predicEons   Evaluate   Load  data   labels  +  plain  text   Extract  features   Pain  point   Parameter  tuning   •  Key  part  of  ML   •  Involves  training  many  models   •  For  different  splits  of  the  data   •  For  different  sets  of  parameters  
  • 12. Pain Points Create  &  handle  many  RDDs  and  data  types   Write  as  a  script   Tune  parameters   Enter... Pipelines!   in  Spark  1.2  &  1.3  
  • 14. Key Concepts DataFrame: The ML Dataset Abstractions: Transformers, Estimators, & Evaluators Parameters: API & tuning
  • 15. DataFrame: The ML Dataset DataFrame: RDD + schema + DSL Named  columns  with  types   label: Double text: String words: Seq[String] features: Vector prediction: Double label   text   words   features   0   This  is  ...   [“This”,  “is”,  …]   [0.5,  1.2,  …]   0   When  we  ...   [“When”,  ...]   [1.9,  -­‐0.8,  …]  
  • 16. DataFrame: The ML Dataset DataFrame: RDD + schema + DSL Named  columns  with  types   Domain-­‐Specific  Language   # Select science articles sciDocs = data.filter(“label” == 1) # Scale labels data(“label”) * 0.5
  • 17. DataFrame: The ML Dataset DataFrame: RDD + schema + DSL • Shipped  with  Spark  1.3   • APIs  for  Python,  Java  &  Scala  (+R  in  dev)   • Integra'on  with  Spark  SQL   • Data  import/export   • Internal  op'miza'ons   Named  columns  with  types Domain-­‐Specific  Language   Pain  point:  Create  &  handle   many  RDDs  and  data  types   BIG  data  
  • 18. Abstractions Set Footer from Insert Dropdown Menu 18 Training Train  model   Evaluate   Load  data   Extract  features  
  • 19. Abstraction: Transformer Set Footer from Insert Dropdown Menu 19 Training Train  model   Evaluate   Extract  features   def transform(DataFrame): DataFrame label: Double text: String label: Double text: String features: Vector
  • 20. Abstraction: Estimator Set Footer from Insert Dropdown Menu 20 Training Train  model   Evaluate   Extract  features   label: Double text: String features: Vector LogisticRegression Model def fit(DataFrame): Model
  • 21. Train  model   Abstraction: Evaluator Set Footer from Insert Dropdown Menu 21 Training Evaluate   Extract  features   label: Double text: String features: Vector prediction: Double Metric:   accuracy AUC MSE ... def evaluate(DataFrame): Double
  • 22. Act  on  predic'ons   Abstraction: Model Set Footer from Insert Dropdown Menu 22 Model  is  a  type  of  Transformer   def transform(DataFrame): DataFrame text: String features: Vector Testing/Production Predict  using  model   Extract  features   text: String features: Vector prediction: Double
  • 23. (Recall) Abstraction: Estimator Set Footer from Insert Dropdown Menu 23 Training Train  model   Evaluate   Load  data   Extract  features   label: Double text: String features: Vector LogisticRegression Model def fit(DataFrame): Model
  • 24. Abstraction: Pipeline Set Footer from Insert Dropdown Menu 24 Training Train  model   Evaluate   Load  data   Extract  features   label: Double text: String PipelineModel Pipeline  is  a  type  of  Es*mator   def fit(DataFrame): Model
  • 25. Abstraction: PipelineModel Set Footer from Insert Dropdown Menu 25 text: String PipelineModel  is  a  type  of  Transformer   def transform(DataFrame): DataFrame Testing/Production Predict  using  model   Load  data   Extract  features   text: String features: Vector prediction: Double Act  on  predic'ons  
  • 26. Abstractions: Summary Set Footer from Insert Dropdown Menu 26 Training Train  model   Evaluate   Load  data   Extract  features  Transformer DataFrame Estimator Evaluator Testing Predict  using  model   Evaluate   Load  data   Extract  features  
  • 27. Demo Set Footer from Insert Dropdown Menu 27 Transformer DataFrame Estimator Evaluator label: Double text: String features: Vector Current  data  schema   prediction: Double Training Logis'cRegression   BinaryClassifica'on   Evaluator   Load  data   Tokenizer   Transformer HashingTF   words: Seq[String]
  • 28. Demo Set Footer from Insert Dropdown Menu 28 Transformer DataFrame Estimator Evaluator Training Logis'cRegression   BinaryClassifica'on   Evaluator   Load  data   Tokenizer   Transformer HashingTF   Pain  point:  Write  as  a  script  
  • 29. Parameters Set Footer from Insert Dropdown Menu 29 > hashingTF.numFeaturesStandard  API   •  Typed   •  Defaults   •  Built-­‐in  doc   •  Autocomplete   org.apache.spark.ml.param.IntParam = numFeatures: number of features (default: 262144) > hashingTF.setNumFeatures(1000) > hashingTF.getNumFeatures
  • 30. Parameter Tuning Given: •  Estimator •  Parameter grid •  Evaluator Find best parameters lr.regParam {0.01, 0.1, 0.5} hashingTF.numFeatures {100, 1000, 10000} Logis'cRegression   Tokenizer   HashingTF   BinaryClassifica'on   Evaluator   CrossValidator
  • 31. Parameter Tuning Given: •  Estimator •  Parameter grid •  Evaluator Find best parameters Logis'cRegression   Tokenizer   HashingTF   BinaryClassifica'on   Evaluator   CrossValidator Pain  point:  Tune  parameters  
  • 32. Pipelines: Recap Inspira'ons     scikit-­‐learn      +  Spark  DataFrame,  Param  API     MLBase  (Berkeley  AMPLab)      Ongoing  collaboraEons   Create  &  handle  many  RDDs  and  data  types   Write  as  a  script   Tune  parameters   DataFrame   Abstrac'ons   Parameter  API   *  Groundwork  done;  full  support  WIP.   Also   •  Python,  Scala,  Java  APIs   •  Schema  valida'on   •  User-­‐Defined  Types*   •  Feature  metadata*   •  Mul'-­‐model  training  op'miza'ons*  
  • 34. Roadmap spark.mllib:  Primary  ML  package     spark.ml:  High-­‐level  Pipelines  API  for  algorithms  in  spark.mllib (experimental  in  Spark  1.2-­‐1.3)   Near  future   •  Feature  aoributes   •  Feature  transformers   •  More  algorithms  under  Pipeline  API     Farther  ahead   •  Ideas  from  AMPLab  MLBase  (auto-­‐tuning  models)   •  SparkR  integra'on  
  • 35. Thank you! Outline   •  ML  workflows   •  Pipelines   •  DataFrame   •  Abstrac*ons   •  Parameter  tuning   •  Roadmap   Spark  documenta'on          hop://spark.apache.org/     Pipelines  blog  post          hops://databricks.com/blog/2015/01/07