Matthieu Blanc will present spark.ml. Spark 1.2 introduced this new package, which provides a high-level API for building machine learning pipelines. We will walk through the basic concepts of this API using an example.
http://hugfrance.fr/spark-meetup-a-la-sg-avec-cloudera-xebia-et-influans-le-jeudi-11-juin/
3. MLlib
Makes Machine Learning Easy and Scalable
Selection of Machine Learning Algorithms
Several design limitations:
• no easy way to build machine learning workflows/pipelines
• hard to keep MLlib itself a scalable project
• lack of homogeneity in the API
org.apache.spark.ml to the rescue!
4. Machine Learning
[Diagram: a train dataset with label and features columns passes through feature engineering and into an ML algorithm, which produces a model; a test dataset passes through the same feature engineering, and the model adds a prediction column.]
5. Machine Learning Pipeline
• Simple construction of ML workflow
• Inspect and debug it
• Tune parameters
• Re-run it on new data
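To make the pipeline idea concrete before diving into the API, here is a minimal plain-Scala sketch (no Spark; the trait and class names merely mirror spark.ml's vocabulary and are made up for illustration): an Estimator's fit produces a Transformer (the model), and the fitted model can then be re-run on new data.

```scala
// Minimal sketch of the spark.ml abstractions over a plain Seq of rows.
// (Illustrative only: the real API works on DataFrames with schemas.)
type Row = Map[String, Double]

trait Transformer { def transform(data: Seq[Row]): Seq[Row] }
trait Estimator { def fit(data: Seq[Row]): Transformer }

// A "model" that predicts label = weight * feature
class ScaleModel(weight: Double) extends Transformer {
  def transform(data: Seq[Row]): Seq[Row] =
    data.map(r => r + ("prediction" -> weight * r("feature")))
}

// An "estimator" that learns the weight as the mean of label / feature
object ScaleEstimator extends Estimator {
  def fit(data: Seq[Row]): Transformer = {
    val ratios = data.map(r => r("label") / r("feature"))
    new ScaleModel(ratios.sum / ratios.size)
  }
}

val train = Seq(Map("feature" -> 1.0, "label" -> 2.0),
                Map("feature" -> 3.0, "label" -> 6.0))
val model = ScaleEstimator.fit(train)   // train
val preds = model.transform(Seq(Map("feature" -> 5.0)))  // predict
println(preds.head("prediction"))       // 10.0
```

The same fit-then-transform shape is what the spark.ml snippets below follow, with DataFrames in place of the toy Seq.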
9. Transformer Usage
// Add a categoryVec column to the DataFrame
// by applying OneHotEncoder transformation to the column category
val classEncoder = new OneHotEncoder()
.setInputCol("category")
.setOutputCol("categoryVec")
val newDataFrame = classEncoder.transform(dataFrame)
[Diagram: dataFrame with columns colA, colB, …, category: double becomes newDataFrame with the same columns plus categoryVec: vector.]
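What the encoder does per row can be sketched in plain Scala (a simplified illustration: spark.ml's OneHotEncoder expects the category column to already be a numeric index, e.g. produced by StringIndexer, and by default drops the last category):

```scala
// Simplified one-hot encoding: category index i among n categories
// becomes a vector with a single 1.0 at position i.
def oneHot(index: Int, numCategories: Int): Array[Double] = {
  val vec = Array.fill(numCategories)(0.0)
  vec(index) = 1.0
  vec
}

println(oneHot(2, 4).mkString(","))  // 0.0,0.0,1.0,0.0
```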
12. Model is a Transformer
def transform(dataset: DataFrame): DataFrame
[Diagram: a DataFrame with features: vector, … is transformed into the same DataFrame with an added prediction: double column.]
13. Estimator + Model Usage
// Apply logisticRegression on a training dataset to create a model
// used to compute predictions on a test dataset
val logisticRegression = new LogisticRegression()
.setMaxIter(50)
.setRegParam(0.01)
// train
val lrModel = logisticRegression.fit(trainDF)
// predict
val newDataFrameWithPredictions = lrModel.transform(testDF)
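Per row, what the fitted logistic regression model computes in transform is, conceptually, the sigmoid of the weighted features thresholded at 0.5. A plain-Scala sketch (the weights and feature values here are made up for illustration):

```scala
// Conceptually: probability = sigmoid(w . x + intercept),
// prediction = 1.0 if probability > 0.5, else 0.0
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def predict(weights: Array[Double], intercept: Double,
            features: Array[Double]): Double = {
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  if (sigmoid(margin) > 0.5) 1.0 else 0.0
}

val w = Array(1.5, -0.5)                    // illustrative weights
println(predict(w, 0.0, Array(2.0, 1.0)))   // margin = 2.5  -> 1.0
println(predict(w, 0.0, Array(0.0, 3.0)))   // margin = -1.5 -> 0.0
```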
15. Evaluators
def evaluate(dataset: DataFrame): Double
[Diagram: a DataFrame with label: double and prediction: double columns is summarized into a single metric (Double), e.g. area under the ROC curve, area under the PR curve, or root mean square error.]
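To make the first metric concrete: the area under the ROC curve equals the probability that a randomly chosen positive is scored above a randomly chosen negative, which the rank formula below computes. A plain-Scala sketch over (label, score) pairs (simplified: it ignores tied scores):

```scala
// Rank-based AUC: probability a random positive outscores a random negative.
// scored = Seq of (label, score) pairs, labels in {0.0, 1.0}.
def auc(scored: Seq[(Double, Double)]): Double = {
  val ranked = scored.sortBy(_._2).zipWithIndex   // ascending score, 0-based rank
  val nPos = scored.count(_._1 == 1.0)
  val nNeg = scored.size - nPos
  val sumPosRanks = ranked.collect { case ((1.0, _), i) => (i + 1).toDouble }.sum
  (sumPosRanks - nPos * (nPos + 1) / 2.0) / (nPos * nNeg)
}

// Perfect ranking: every positive outscores every negative
val perfect = Seq((1.0, 0.9), (1.0, 0.8), (0.0, 0.4), (0.0, 0.1))
println(auc(perfect))  // 1.0

// One negative ranked above one positive
val mixed = Seq((1.0, 0.9), (0.0, 0.8), (1.0, 0.7), (0.0, 0.1))
println(auc(mixed))    // 0.75
```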
16. Evaluator Usage
// Area under the ROC curve for the validation set
val evaluator = new BinaryClassificationEvaluator()
println(evaluator.evaluate(dataFrameWithLabelAndPrediction))
19. Pipeline Usage
// The stages of our pipeline
val classEncoder = new OneHotEncoder()
.setInputCol("class")
.setOutputCol("classVec")
val vectorAssembler = new VectorAssembler()
.setInputCols(Array("age", "fare", "classVec"))
.setOutputCol("features")
val logisticRegression = new LogisticRegression()
.setMaxIter(50)
.setRegParam(0.01)
// the pipeline
val pipeline = new Pipeline()
.setStages(Array(classEncoder, vectorAssembler, logisticRegression))
// train
val pipelineModel = pipeline.fit(trainSet)
// predict
val validationPredictions = pipelineModel.transform(testSet)
21. CrossValidator Usage
// We will cross validate our pipeline
val crossValidator = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(new BinaryClassificationEvaluator)
// The params we want to test (grids over the pipeline stages' params)
val paramGrid = new ParamGridBuilder()
.addGrid(logisticRegression.regParam, Array(1.0, 0.1, 0.01))
.addGrid(logisticRegression.maxIter, Array(10, 50, 100))
.build()
crossValidator.setEstimatorParamMaps(paramGrid)
// We will use a 3-fold cross validation
crossValidator.setNumFolds(3)
// train
val cvModel = crossValidator.fit(trainSet)
// predict with the best model
val testSetWithPrediction = cvModel.transform(testSet)
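The 3-fold mechanics behind CrossValidator can be sketched without Spark: split the data into k folds, train on k-1 folds and evaluate on the held-out one, then average the k metrics for each parameter combination (illustrative only; the real implementation also refits the best configuration on the full training set):

```scala
// k-fold split: element i goes to fold i % k; for each fold,
// train on the rest, evaluate on the held-out part, then average.
def crossValidate[A](data: Seq[A], k: Int)
                    (trainEval: (Seq[A], Seq[A]) => Double): Double = {
  val folds = data.zipWithIndex.groupBy(_._2 % k).values.map(_.map(_._1)).toSeq
  val metrics = folds.map { heldOut =>
    val train = data.diff(heldOut)
    trainEval(train, heldOut)
  }
  metrics.sum / k
}

// Toy usage: the "metric" is just the size of the held-out fold
val avg = crossValidate((1 to 9).toSeq, 3)((_, test) => test.size.toDouble)
println(avg)  // 3.0
```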
24. Summary
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning
• Schema validation
• User-defined Transformers and Estimators
• Composable and DAG Pipelines
1.4.1? 1.5.0?