Spark ML by Xebia (Spark Meetup, 11 June 2015)

Matthieu Blanc will present spark.ml. Spark 1.2 introduced this new package, which provides a high-level API for building machine learning pipelines. We will walk through the core concepts of this API with an example.

http://hugfrance.fr/spark-meetup-a-la-sg-avec-cloudera-xebia-et-influans-le-jeudi-11-juin/

  1. SPARK ML: A new High-Level API for MLlib (Spark 1.4.0 preview)
  2. Matthieu Blanc, Instructor, Spark Developer Training, @matthieublanc
  3. MLlib: makes machine learning easy and scalable; a selection of machine learning algorithms. Several design flaws:
     • machine learning workflows/pipelines
     • make MLlib itself a scalable project
     • lack of homogeneity
     org.apache.spark.ml to the rescue!
  4. Machine Learning [diagram: Train Dataset, Feature Engineering, ML Algorithm, Model, Test Dataset, Predictions; data labels: label, features, prediction]
  5. Machine Learning Pipeline
     • Simple construction of ML workflow
     • Inspect and debug it
     • Tune parameters
     • Re-run it on new data
  6. DataFrames, org.apache.spark.ml
  7. Key concepts
     • DataFrame as ML Datasets
     • Abstractions:
     • Transformers
     • Estimators
     • Evaluators
     • Parameters API -> CrossValidator
  8. Transformers: def transform(dataset: DataFrame): DataFrame [diagram: a DataFrame with columns colA, colB, …, colX becomes a DataFrame with columns colA, colB, …, colX, newCol]
  9. Transformer Usage
     import org.apache.spark.ml.feature.OneHotEncoder

     // Add a categoryVec column to the DataFrame
     // by applying the OneHotEncoder transformation to the category column
     val classEncoder = new OneHotEncoder()
       .setInputCol("category")
       .setOutputCol("categoryVec")

     val newDataFrame = classEncoder.transform(dataFrame)
     [diagram: dataFrame (colA, colB, …, category: double) becomes newDataFrame (colA, colB, …, category: double, categoryVec: vector)]
  10. Transformers Examples: Normalizer, VectorAssembler, PolynomialExpansion, Model, Tokenizer, OneHotEncoder, HashingTF, Binarizer
  11. Estimators: def fit(dataset: DataFrame): Model [diagram: a DataFrame (label: double, features: vector, …) is fitted into a Model; Model extends Transformer]
  12. Model is a Transformer: def transform(dataset: DataFrame): DataFrame [diagram: a DataFrame (features: vector, …) becomes a DataFrame (features: vector, prediction: double, …)]
  13. Estimator + Model Usage
      import org.apache.spark.ml.classification.LogisticRegression

      // Apply logistic regression on a training dataset to create a model
      // used to compute predictions on a test dataset
      val logisticRegression = new LogisticRegression()
        .setMaxIter(50)
        .setRegParam(0.01)
      // train
      val lrModel = logisticRegression.fit(trainDF)
      // predict
      val newDataFrameWithPredictions = lrModel.transform(testDF)
  14. Estimators Examples: StringIndexer, StandardScaler, CrossValidator, Pipeline, LinearRegression, LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier, ALS
  15. Evaluators: def evaluate(dataset: DataFrame): Double [diagram: a DataFrame (label: Double, prediction: Double, …) is evaluated into a metric (Double), e.g. area under ROC curve, area under PR curve, root mean square error]
  16. Evaluator Usage
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

      // Area under the ROC curve for the validation set
      val evaluator = new BinaryClassificationEvaluator()
      println(evaluator.evaluate(dataFrameWithLabelAndPrediction))
  17. Evaluators Examples: RegressionEvaluator, BinaryClassificationEvaluator
  18. Pipeline: Pipeline is an Estimator [diagram: the workflow Train Dataset, Feature Engineering, ML Algorithm, Model, Test Dataset, Predictions, annotated with Pipeline, PipelineModel, Transformer, Estimator and DataFrame]
  19. Pipeline Usage
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

      // The stages of our pipeline
      val classEncoder = new OneHotEncoder()
        .setInputCol("class")
        .setOutputCol("classVec")
      val vectorAssembler = new VectorAssembler()
        .setInputCols(Array("age", "fare", "classVec"))
        .setOutputCol("features")
      val logisticRegression = new LogisticRegression()
        .setMaxIter(50)
        .setRegParam(0.01)

      // the pipeline
      val pipeline = new Pipeline()
        .setStages(Array(classEncoder, vectorAssembler, logisticRegression))

      // train
      val pipelineModel = pipeline.fit(trainSet)

      // predict
      val validationPredictions = pipelineModel.transform(testSet)
  20. CrossValidator
      Given:
      • Estimator
      • Parameter Grid
      • Evaluator
      Find the Model with the best Parameters. CrossValidator is also an Estimator.
  21. CrossValidator Usage
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

      // We will cross validate our pipeline
      val crossValidator = new CrossValidator()
        .setEstimator(pipeline)
        .setEvaluator(new BinaryClassificationEvaluator)

      // The params we want to test
      // (hashingTF is assumed to be a HashingTF stage of the pipeline; it is not defined on these slides)
      val paramGrid = new ParamGridBuilder()
        .addGrid(hashingTF.numFeatures, Array(2, 5, 1000))
        .addGrid(logisticRegression.regParam, Array(1.0, 0.1, 0.01))
        .addGrid(logisticRegression.maxIter, Array(10, 50, 100))
        .build()
      crossValidator.setEstimatorParamMaps(paramGrid)

      // We will use a 3-fold cross validation
      crossValidator.setNumFolds(3)

      // train
      val cvModel = crossValidator.fit(trainSet)
      // predict with the best model
      val testSetWithPrediction = cvModel.transform(testSet)
  22. DEMO: https://github.com/mblanc/spark-ml
  23. Conclusion [diagram: "Today" and "Tomorrow" views of DataFrame, o.a.spark.ml, RDD and o.a.spark.mllib, linked by "uses" arrows]
  24. Summary
      • Integration with DataFrames
      • Familiar API based on scikit-learn
      • Simple parameters tuning
      • Schema validation
      • User-defined Transformers and Estimators
      • Composable and DAG Pipelines
      1.4.1? 1.5.0?
  25. Thank you
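
The Transformer, Estimator and Pipeline contracts described on slides 8 to 19 can be illustrated with a toy model in plain Scala. This is only a sketch with no Spark dependency: `Table` stands in for a DataFrame, and `Squarer`, `MeanCenterer` and `CenteringModel` are hypothetical stages invented for illustration, not spark.ml classes. It is meant to show why a Pipeline is itself an Estimator: fitting it fits each Estimator stage in order on the data transformed so far, and the result is a PipelineModel, which is simply a chain of Transformers.

```scala
// Toy model of the spark.ml abstractions in plain Scala (no Spark dependency).
// Every name here is an illustrative stand-in, not the real org.apache.spark.ml API.
object PipelineSketch {
  // Stand-in for a DataFrame: a list of rows, each mapping column name -> value.
  type Row = Map[String, Double]
  type Table = List[Row]

  sealed trait Stage
  // A Transformer maps a table to a table, usually by adding a column.
  trait Transformer extends Stage { def transform(data: Table): Table }
  // An Estimator learns something from a table and returns a fitted Transformer (the model).
  trait Estimator extends Stage { def fit(data: Table): Transformer }

  // Hypothetical Transformer: adds the square of inputCol as outputCol.
  final case class Squarer(inputCol: String, outputCol: String) extends Transformer {
    def transform(data: Table): Table =
      data.map(row => row + (outputCol -> row(inputCol) * row(inputCol)))
  }

  // Hypothetical Estimator: learns the mean of inputCol on the training data.
  final case class MeanCenterer(inputCol: String, outputCol: String) extends Estimator {
    def fit(data: Table): Transformer = {
      val mean = data.map(_(inputCol)).sum / data.size
      CenteringModel(inputCol, outputCol, mean)
    }
  }

  // The fitted model is a Transformer: it subtracts the learned mean.
  final case class CenteringModel(inputCol: String, outputCol: String, mean: Double)
      extends Transformer {
    def transform(data: Table): Table =
      data.map(row => row + (outputCol -> (row(inputCol) - mean)))
  }

  // A fitted pipeline applies its Transformers in order.
  final case class PipelineModel(stages: List[Transformer]) extends Transformer {
    def transform(data: Table): Table =
      stages.foldLeft(data)((current, stage) => stage.transform(current))
  }

  // A Pipeline is an Estimator: fit() turns every stage into a Transformer.
  final case class Pipeline(stages: List[Stage]) extends Estimator {
    def fit(data: Table): PipelineModel = {
      val (_, fitted) = stages.foldLeft((data, List.empty[Transformer])) {
        case ((current, acc), stage) =>
          val transformer = stage match {
            case t: Transformer => t              // already usable as is
            case e: Estimator   => e.fit(current) // fit on the data seen so far
          }
          // Simplification: the data is transformed after every stage, even the last one.
          (transformer.transform(current), transformer :: acc)
      }
      PipelineModel(fitted.reverse)
    }
  }

  def main(args: Array[String]): Unit = {
    val train: Table = List(Map("x" -> 1.0), Map("x" -> 2.0), Map("x" -> 3.0))
    val test: Table = List(Map("x" -> 4.0))
    val model = Pipeline(List(Squarer("x", "x2"), MeanCenterer("x2", "x2c"))).fit(train)
    println(model.transform(test))
  }
}
```

Because `Pipeline` extends `Estimator` and `PipelineModel` extends `Transformer`, a pipeline in this sketch could be nested as a stage of another pipeline or handed to a tuning loop, which is the property the CrossValidator slides rely on.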
