Spark ML by Xebia (Spark Meetup, 11 June 2015)


Matthieu Blanc will present spark.ml. Spark 1.2 introduced this new package, which provides a high-level API for building machine learning pipelines. We will go through the basic concepts of this API using an example.

http://hugfrance.fr/spark-meetup-a-la-sg-avec-cloudera-xebia-et-influans-le-jeudi-11-juin/


  1. SPARK ML: A new High-Level API for MLlib (Spark 1.4.0 preview)
  2. Matthieu Blanc, Instructor, Spark Developer Training, @matthieublanc
  3. MLLIB
     • Makes Machine Learning Easy and Scalable
     • Selection of Machine Learning Algorithms
     • Several design flaws:
       • machine learning workflows/pipelines
       • make MLlib itself a scalable project
       • lack of homogeneity
     org.apache.spark.ml to the rescue!
  4. Machine Learning (diagram). Boxes: Train Dataset, Feature Engineering, ML Algorithm, Model, Test Dataset, Feature Engineering, Predictions. Column labels: label, features, prediction.
  5. Machine Learning Pipeline
     • Simple construction of ML workflow
     • Inspect and debug it
     • Tune parameters
     • Re-run it on new data
  6. DataFrames: org.apache.spark.ml
  7. Key concepts
     • DataFrame as ML Datasets
     • Abstractions:
       • Transformers
       • Estimators
       • Evaluators
     • Parameters API -> CrossValidator
  8. Transformers: DataFrame → DataFrame
     def transform(dataset: DataFrame): DataFrame
     Input columns: colA, colB, …, colX
     Output columns: colA, colB, …, colX, newCol
  9. Transformer Usage: dataFrame (colA, colB, …, category: double) → newDataFrame (colA, colB, …, category: double, categoryVec: vector)
 // Add a categoryVec column to the DataFrame
 // by applying the OneHotEncoder transformation to the category column
 import org.apache.spark.ml.feature.OneHotEncoder

 val classEncoder = new OneHotEncoder()
   .setInputCol("category")
   .setOutputCol("categoryVec")

 val newDataFrame = classEncoder.transform(dataFrame)
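To make the step from `category: double` to `categoryVec: vector` concrete, here is a minimal sketch of one-hot encoding in plain Scala. It is not Spark's OneHotEncoder, which produces Spark's own vector type and has further options. `OneHotSketch` and `oneHot` are hypothetical names used only for illustration.

```scala
object OneHotSketch {
  // Hypothetical helper: turn a category index into a 0/1 indicator vector.
  // Assumes categories are already numbered 0 until numCategories.
  def oneHot(index: Int, numCategories: Int): Vector[Double] = {
    require(index >= 0 && index < numCategories, "index out of range")
    Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
  }
}
```

For example, with three categories, index 1 maps to a vector whose only non-zero entry is at position 1.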
  10. Transformers Examples: Normalizer, VectorAssembler, PolynomialExpansion, Model, Tokenizer, OneHotEncoder, HashingTF, Binarizer
  11. Estimators: DataFrame → Model
     def fit(dataset: DataFrame): Model
     Input columns: label: double, features: vector, …
     Model extends Transformer
  12. Model is a Transformer: DataFrame → DataFrame
     def transform(dataset: DataFrame): DataFrame
     Input columns: features: vector, …
     Output columns: features: vector, prediction: double, …
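The relationship between the two signatures above (fitting an Estimator yields a Model, and a Model is itself a Transformer) can be sketched in plain Scala. This is a toy model of the idea, not Spark's classes: `Rows` stands in for a DataFrame, and `MeanEstimator` is a hypothetical estimator that simply learns the mean of the `label` column.

```scala
object EstimatorSketch {
  // Toy stand-in for a DataFrame: each row maps a column name to a value.
  type Rows = Seq[Map[String, Double]]

  trait Transformer { def transform(dataset: Rows): Rows }
  trait Model extends Transformer
  trait Estimator { def fit(dataset: Rows): Model }

  // Hypothetical estimator: learns the mean of "label" from the training rows.
  object MeanEstimator extends Estimator {
    def fit(dataset: Rows): Model = {
      val mean = dataset.map(_("label")).sum / dataset.size
      // The fitted model appends a "prediction" column holding that mean.
      new Model {
        def transform(rows: Rows): Rows = rows.map(_ + ("prediction" -> mean))
      }
    }
  }
}
```

The training rows need a `label` column, while the rows passed to `transform` do not: the model only adds `prediction`.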
  13. Estimator + Model Usage
 // Apply logisticRegression on a training dataset to create a model
 // used to compute predictions on a test dataset
 import org.apache.spark.ml.classification.LogisticRegression

 val logisticRegression = new LogisticRegression()
   .setMaxIter(50)
   .setRegParam(0.01)
 // train
 val lrModel = logisticRegression.fit(trainDF)
 // predict
 val newDataFrameWithPredictions = lrModel.transform(testDF)
  14. Estimators Examples: StringIndexer, StandardScaler, CrossValidator, Pipeline, LinearRegression, LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier, ALS
  15. Evaluators: DataFrame → Metric (Double)
     def evaluate(dataset: DataFrame): Double
     Input columns: label: Double, prediction: Double, …
     Example metrics: area under ROC curve, area under PR curve, root mean square error
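As an illustration of an evaluator reducing a dataset with `label` and `prediction` columns to a single Double, here is a minimal plain-Scala sketch of root mean square error. It is not Spark's RegressionEvaluator; `Scored` and `rmse` are hypothetical names, and a `Seq` of case-class rows stands in for the DataFrame.

```scala
object EvaluatorSketch {
  // Toy row holding the two columns an evaluator reads.
  final case class Scored(label: Double, prediction: Double)

  // Root mean square error: square root of the mean squared difference
  // between prediction and label.
  def rmse(dataset: Seq[Scored]): Double = {
    require(dataset.nonEmpty, "dataset must not be empty")
    val squaredErrors = dataset.map(r => math.pow(r.prediction - r.label, 2))
    math.sqrt(squaredErrors.sum / dataset.size)
  }
}
```

For two rows with errors 0 and 2, the mean squared error is (0 + 4) / 2 = 2, so the metric is the square root of 2.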
  16. Evaluator Usage
 // Area under the ROC curve for the validation set
 import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

 val evaluator = new BinaryClassificationEvaluator()
 println(evaluator.evaluate(dataFrameWithLabelAndPrediction))
  17. Evaluators Examples: RegressionEvaluator, BinaryClassificationEvaluator
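  18. Pipeline (diagram). The machine learning workflow boxes (Train Dataset, Feature Engineering, ML Algorithm, Model, Test Dataset, Feature Engineering, Predictions) are regrouped as: a Pipeline of Transformer and Estimator stages that takes a DataFrame and yields a PipelineModel, and a PipelineModel that maps a DataFrame to a DataFrame. Pipeline is an Estimator.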
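Why a Pipeline can itself be an Estimator is easier to see in code. The following plain-Scala sketch is a toy model under simplifying assumptions, not Spark's implementation: `Rows` stands in for a DataFrame, and all trait and class names are redefined locally for illustration. Fitting walks the stages in order, fits each estimator on the data produced so far, and collects the resulting transformers into a fitted pipeline.

```scala
object PipelineSketch {
  // Toy stand-in for a DataFrame.
  type Rows = Seq[Map[String, Double]]

  sealed trait Stage
  trait Transformer extends Stage { def transform(data: Rows): Rows }
  trait Estimator extends Stage { def fit(data: Rows): Transformer }

  // The fitted pipeline holds only transformers, applied in order.
  final class PipelineModel(val stages: Seq[Transformer]) extends Transformer {
    def transform(data: Rows): Rows =
      stages.foldLeft(data)((current, t) => t.transform(current))
  }

  final class Pipeline(stages: Seq[Stage]) extends Estimator {
    def fit(data: Rows): PipelineModel = {
      // Thread the training data through the stages, fitting estimators as we go.
      val (_, fitted) = stages.foldLeft((data, Vector.empty[Transformer])) {
        case ((current, done), stage) =>
          val transformer = stage match {
            case t: Transformer => t
            case e: Estimator   => e.fit(current)
          }
          // Later stages see the output of this one.
          (transformer.transform(current), done :+ transformer)
      }
      new PipelineModel(fitted)
    }
  }
}
```

Because `Pipeline` has a `fit` method and `PipelineModel` has a `transform` method, a whole workflow can be passed anywhere a single estimator is expected, which is what makes cross-validating a pipeline possible.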
  19. Pipeline Usage
 import org.apache.spark.ml.Pipeline
 import org.apache.spark.ml.classification.LogisticRegression
 import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

 // The stages of our pipeline
 val classEncoder = new OneHotEncoder()
   .setInputCol("class")
   .setOutputCol("classVec")
 val vectorAssembler = new VectorAssembler()
   .setInputCols(Array("age", "fare", "classVec"))
   .setOutputCol("features")
 val logisticRegression = new LogisticRegression()
   .setMaxIter(50)
   .setRegParam(0.01)

 // the pipeline
 val pipeline = new Pipeline()
   .setStages(Array(classEncoder, vectorAssembler, logisticRegression))

 // train
 val pipelineModel = pipeline.fit(trainSet)

 // predict
 val validationPredictions = pipelineModel.transform(testSet)
  20. CrossValidator
     Given:
     • Estimator
     • Parameter Grid
     • Evaluator
     Find the Model with the best Parameters.
     CrossValidator is also an Estimator.
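The search described above (an estimator, a grid of candidate parameters, and an evaluator) can be sketched as a k-fold loop in plain Scala. This is a simplified illustration, not Spark's CrossValidator: `bestParam` is a hypothetical function, rows are assigned to folds by index rather than shuffled, and a larger metric is assumed to be better.

```scala
object CrossValidationSketch {
  // Returns the candidate with the highest mean metric over numFolds folds.
  // `fit` trains a model for one candidate; `evaluate` scores it on held-out rows.
  def bestParam[Row, P, M](
      data: Vector[Row],
      grid: Seq[P],
      numFolds: Int,
      fit: (P, Vector[Row]) => M,
      evaluate: (M, Vector[Row]) => Double): P = {
    require(numFolds >= 2 && grid.nonEmpty, "need at least 2 folds and 1 candidate")
    // Assign row i to fold i % numFolds (a real implementation would shuffle first).
    val indexed = data.zipWithIndex
    def meanMetric(param: P): Double = {
      val scores = (0 until numFolds).map { fold =>
        val (validation, training) = indexed.partition(_._2 % numFolds == fold)
        evaluate(fit(param, training.map(_._1)), validation.map(_._1))
      }
      scores.sum / numFolds
    }
    // Keep the candidate whose mean metric is largest.
    grid.map(p => (p, meanMetric(p))).reduce((a, b) => if (b._2 > a._2) b else a)._1
  }
}
```

Each candidate is trained once per fold on the rows outside that fold and scored on the rows inside it, so no candidate is judged on data it was trained on.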
  21. CrossValidator Usage
 import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
 import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

 // We will cross validate our pipeline
 val crossValidator = new CrossValidator()
   .setEstimator(pipeline)
   .setEvaluator(new BinaryClassificationEvaluator)

 // The params we want to test
 // (hashingTF must be a HashingTF stage of the pipeline; the pipeline built on slide 19 has none)
 val paramGrid = new ParamGridBuilder()
   .addGrid(hashingTF.numFeatures, Array(2, 5, 1000))
   .addGrid(logisticRegression.regParam, Array(1.0, 0.1, 0.01))
   .addGrid(logisticRegression.maxIter, Array(10, 50, 100))
   .build()
 crossValidator.setEstimatorParamMaps(paramGrid)

 // We will use a 3-fold cross validation
 crossValidator.setNumFolds(3)

 // train
 val cvModel = crossValidator.fit(trainSet)
 // predict with the best model
 val testSetWithPrediction = cvModel.transform(testSet)
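It helps to count how much training the grid on this slide implies. The sketch below enumerates the combinations in plain Scala using the values from the slide. It assumes one fit per combination per fold, plus one final fit of the best combination on the full training set. `GridSizeSketch` is a hypothetical name.

```scala
object GridSizeSketch {
  // Values taken from the parameter grid on the slide.
  val numFeatures = Seq(2, 5, 1000)
  val regParam    = Seq(1.0, 0.1, 0.01)
  val maxIter     = Seq(10, 50, 100)
  val numFolds    = 3

  // Every combination of one value per parameter (the Cartesian product).
  val combinations: Seq[(Int, Double, Int)] =
    for (n <- numFeatures; r <- regParam; m <- maxIter) yield (n, r, m)

  // Assumption: one fit per combination per fold, plus one final fit
  // of the best combination on the whole training set.
  val totalFits: Int = combinations.size * numFolds + 1
}
```

Three values for each of three parameters gives 3 × 3 × 3 = 27 combinations, so under these assumptions the pipeline is fitted 27 × 3 + 1 = 82 times. Grid size grows multiplicatively with each added parameter.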
  22. DEMO: https://github.com/mblanc/spark-ml
  23. Conclusion (diagram). Boxes: DataFrame, o.a.spark.ml, RDD, o.a.spark.mllib. Columns: Today, Tomorrow. The boxes are linked by four "uses" arrows.
  24. Summary
     • Integration with DataFrames
     • Familiar API based on scikit-learn
     • Simple parameters tuning
     • Schema validation
     • User-defined Transformers and Estimators
     • Composable and DAG Pipelines
     1.4.1? 1.5.0?
  25. THANK YOU
