Spark ML by Xebia (Spark Meetup, 11/06/2015)

SPARK ML
A new High-Level API for MLlib
Spark 1.4.0 preview

Matthieu Blanc
Instructor
Spark Developer Training
@matthieublanc

MLLIB
Makes Machine Learning Easy and Scalable
Selection of Machine Learning Algorithms
Several design flaws:
• no support for machine learning workflows/pipelines
• hard to make MLlib itself a scalable project
• lack of homogeneity
org.apache.spark.ml to the rescue!

Machine Learning
[Diagram: Train Dataset (label, features) -> Feature Engineering -> ML Algorithm -> Model]
[Diagram: Test Dataset (features) -> Feature Engineering -> Model -> Predictions (features, prediction)]

Machine Learning Pipeline
• Simple construction of ML workflow
• Inspect and debug it
• Tune parameters
• Re-run it on new data

DataFrames
org.apache.spark.ml

Key concepts
• DataFrame as ML Datasets
• Abstractions :
• Transformers
• Estimators
• Evaluators
• Parameters API -> CrossValidator

Transformers
def transform(dataset: DataFrame): DataFrame
[Diagram: DataFrame (colA, colB, …, colX) -> Transformer -> DataFrame (colA, colB, …, colX, newCol)]

Transformer Usage
import org.apache.spark.ml.feature.OneHotEncoder

// Add a categoryVec column to the DataFrame
// by applying the OneHotEncoder transformation to the category column
val classEncoder = new OneHotEncoder()
  .setInputCol("category")
  .setOutputCol("categoryVec")

val newDataFrame = classEncoder.transform(dataFrame)
[Diagram: dataFrame (colA, colB, …, category: double) -> newDataFrame (colA, colB, …, category: double, categoryVec: vector)]
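
To make the idea of one-hot encoding concrete, here is a minimal sketch in plain Scala with no Spark dependency. The name `OneHotSketch` and the dense output are assumptions made for illustration; Spark's OneHotEncoder works on DataFrame columns and its output vector may be laid out differently.

```scala
// Illustrative sketch only, not Spark's implementation.
object OneHotSketch {
  // Encode a category index as a dense vector holding a single 1.0.
  def oneHot(categoryIndex: Int, numCategories: Int): Vector[Double] = {
    require(categoryIndex >= 0 && categoryIndex < numCategories, "index out of range")
    Vector.tabulate(numCategories)(i => if (i == categoryIndex) 1.0 else 0.0)
  }
}
```

With three categories, `OneHotSketch.oneHot(1, 3)` puts the 1.0 in the second position.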

Transformers Examples
Normalizer
VectorAssembler
PolynomialExpansion
Model
Tokenizer
OneHotEncoder
HashingTF
Binarizer

Estimators
def fit(dataset: DataFrame): Model
[Diagram: DataFrame (label: double, features: vector, …) -> Estimator -> Model; Model extends Transformer]

Model is a Transformer
def transform(dataset: DataFrame): DataFrame
[Diagram: DataFrame (features: vector, …) -> Model -> DataFrame (features: vector, prediction: double, …)]
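
The relationship between the two abstractions (fit returns a Model, and a Model is a Transformer) can be sketched with a toy model in plain Scala. Everything here is hypothetical and simplified: a dataset is just a sequence of rows, and `MeanEstimator` is an invented estimator that predicts the mean label. It is not how org.apache.spark.ml is implemented.

```scala
// Toy model of the abstractions, in plain Scala (no Spark dependency).
object AbstractionsSketch {
  // A row maps column names to values; a dataset is a sequence of rows.
  type Row = Map[String, Double]
  type Dataset = Seq[Row]

  trait Transformer {
    def transform(dataset: Dataset): Dataset
  }

  trait Estimator {
    // fit learns from the data and returns a model, which is itself a Transformer.
    def fit(dataset: Dataset): Transformer
  }

  // Hypothetical estimator: learns the mean of the "label" column
  // and predicts that constant for every row.
  object MeanEstimator extends Estimator {
    def fit(dataset: Dataset): Transformer = {
      val mean = dataset.map(_("label")).sum / dataset.size
      new Transformer {
        def transform(ds: Dataset): Dataset =
          ds.map(row => row + ("prediction" -> mean))
      }
    }
  }
}
```

The fitted model only appends a prediction column; the existing columns pass through unchanged, as in the diagram above.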

Estimator + Model Usage
// Apply logisticRegression on a training dataset to create a model 
// used to compute predictions on a test dataset 
val logisticRegression = new LogisticRegression() 
.setMaxIter(50) 
.setRegParam(0.01) 
// train
val lrModel = logisticRegression.fit(trainDF) 
// predict 
val newDataFrameWithPredictions = lrModel.transform(testDF)

Estimators Examples
StringIndexer
StandardScaler
CrossValidator
Pipeline
LinearRegression
LogisticRegression
DecisionTreeClassifier
RandomForestClassifier
GBTClassifier
ALS

Evaluators
def evaluate(dataset: DataFrame): Double
[Diagram: DataFrame (label: Double, prediction: Double, …) -> Evaluator -> Metric (Double), e.g. area under ROC curve, area under PR curve, root mean square error]

Evaluator Usage
// Area under the ROC curve for the validation set 
val evaluator = new BinaryClassificationEvaluator() 
println(evaluator.evaluate(dataFrameWithLabelAndPrediction))
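
An evaluator reduces a labelled, predicted dataset to a single Double. As an illustration, the root mean square error metric listed on the Evaluators slide can be sketched in plain Scala. The name `RmseSketch` and the pair-based input are assumptions made for this example; a Spark evaluator reads the label and prediction columns of a DataFrame instead.

```scala
// Illustrative sketch only, not Spark's implementation.
object RmseSketch {
  // Each pair is (label, prediction).
  def rmse(labelsAndPredictions: Seq[(Double, Double)]): Double = {
    require(labelsAndPredictions.nonEmpty, "need at least one row")
    val squaredErrors = labelsAndPredictions.map { case (label, prediction) =>
      val error = label - prediction
      error * error
    }
    // Square root of the mean squared error.
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }
}
```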

Evaluators Examples
RegressionEvaluator
BinaryClassificationEvaluator

Pipeline
[Diagram: Train Dataset -> Feature Engineering -> ML Algorithm -> Model; Test Dataset -> Feature Engineering -> Model -> Predictions]
[Diagram: a Pipeline chains Transformers and an Estimator; fitting it on a DataFrame yields a PipelineModel, a Transformer that maps DataFrame -> DataFrame]
Pipeline is an Estimator

Pipeline Usage
// The stages of our pipeline 
val classEncoder = new OneHotEncoder() 
.setInputCol("class") 
.setOutputCol("classVec") 
val vectorAssembler = new VectorAssembler() 
.setInputCols(Array("age", "fare", "classVec")) 
.setOutputCol("features") 
val logisticRegression = new LogisticRegression() 
.setMaxIter(50) 
.setRegParam(0.01) 
 
// the pipeline 
val pipeline = new Pipeline() 
.setStages(Array(classEncoder, vectorAssembler, logisticRegression)) 
 
// train
val pipelineModel = pipeline.fit(trainSet) 
 
// predict
val validationPredictions = pipelineModel.transform(testSet)
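
What fitting a pipeline involves can be sketched with a toy model in plain Scala: each estimator stage is fitted on the data as transformed by the stages before it, and the fitted stages together form the pipeline model. All names below are hypothetical and the dataset is a simple sequence of rows; this illustrates the idea and is not Spark's code.

```scala
// Toy pipeline, in plain Scala (no Spark dependency).
object PipelineSketch {
  type Row = Map[String, Double]
  type Dataset = Seq[Row]

  sealed trait Stage
  trait Transformer extends Stage { def transform(ds: Dataset): Dataset }
  trait Estimator extends Stage { def fit(ds: Dataset): Transformer }

  // The fitted pipeline applies every fitted stage in order.
  class PipelineModel(val stages: Seq[Transformer]) extends Transformer {
    def transform(ds: Dataset): Dataset =
      stages.foldLeft(ds)((current, stage) => stage.transform(current))
  }

  class Pipeline(stages: Seq[Stage]) extends Estimator {
    def fit(ds: Dataset): PipelineModel = {
      // Walk the stages, carrying the transformed data and the fitted stages.
      val (_, fitted) = stages.foldLeft((ds, Vector.empty[Transformer])) {
        case ((current, done), stage) =>
          val transformer = stage match {
            case e: Estimator   => e.fit(current) // learn from the data seen so far
            case t: Transformer => t              // nothing to learn
          }
          (transformer.transform(current), done :+ transformer)
      }
      new PipelineModel(fitted)
    }
  }
}
```

Because the result is a single Transformer, the same feature engineering is replayed on the test set when `transform` is called.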

CrossValidator
Given
• Estimator
• Parameter Grid
• Evaluator
Find the Model with the best Parameters
CrossValidator is also an Estimator

CrossValidator Usage
// We will cross validate our pipeline 
val crossValidator = new CrossValidator() 
.setEstimator(pipeline) 
.setEvaluator(new BinaryClassificationEvaluator) 
 
// The params we want to test
// hashingTF: assumes a HashingTF stage is part of the pipeline being tuned
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(2, 5, 1000))
  .addGrid(logisticRegression.regParam, Array(1.0, 0.1, 0.01))
  .addGrid(logisticRegression.maxIter, Array(10, 50, 100))
  .build()
crossValidator.setEstimatorParamMaps(paramGrid) 
 
// We will use a 3-fold cross validation 
crossValidator.setNumFolds(3) 
 
// train
val cvModel = crossValidator.fit(trainSet) 
// predict with the best model 
val testSetWithPrediction = cvModel.transform(testSet)
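
The grid above has 3 x 3 x 3 = 27 parameter combinations, so a 3-fold cross validation trains 81 models before the best setting is chosen. The sketch below, in plain Scala, shows how such a grid and such folds could be enumerated. The names `buildGrid` and `kFold` are hypothetical, and the folds are assigned deterministically by row index for simplicity, which is an assumption of this example rather than Spark's behaviour.

```scala
// Illustrative sketch only, not Spark's implementation.
object GridSketch {
  // One parameter setting: parameter name -> value.
  type ParamMap = Map[String, Double]

  // Cartesian product of all value lists, one ParamMap per combination.
  def buildGrid(grid: Seq[(String, Seq[Double])]): Seq[ParamMap] =
    grid.foldLeft(Seq(Map.empty[String, Double])) {
      case (combos, (name, values)) =>
        for (combo <- combos; value <- values) yield combo + (name -> value)
    }

  // Split row indices 0 until n into k (training, validation) pairs.
  def kFold(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] =
    (0 until k).map { fold =>
      val (validation, training) = (0 until n).partition(_ % k == fold)
      (training, validation)
    }
}
```

Each combination is then scored on every fold with the evaluator, and the combination with the best average metric wins.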

DEMO
https://github.com/mblanc/spark-ml

Conclusion
[Diagram: "Today" vs "Tomorrow", showing "uses" relationships between o.a.spark.ml, o.a.spark.mllib, DataFrame and RDD]

Summary
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning
• Schema validation
• User-defined Transformers and Estimators
• Composable and DAG Pipelines
1.4.1? 1.5.0?
