AI&BigData Lab. Peter Rudenko. Automation and optimisation of machine learning pipelines on top of Apache Spark

23.05.15, Odessa. Impact Hub Odessa. AI&BigData Lab conference.

Peter Rudenko (Software Engineer, DataRobot): Automation and optimisation of machine learning pipelines on top of Apache Spark

At DataRobot we build accurate predictive models automatically. Besides the model training itself, data pre-processing (feature selection/normalization/transformation) plays an important role in the overall process. In this talk I share our experience with the Apache Spark platform and, in particular, with the new ml API, which provides building blocks for pipelines (Pipeline) and for finding optimal model hyperparameter values (cross-validation).

More details:
http://geekslab.co/
https://www.facebook.com/GeeksLab.co
https://www.youtube.com/user/GeeksLabVideo


1. Automation and optimisation of machine learning pipelines on top of Apache Spark
   Peter Rudenko
   @peter_rud
   peter.rudenko@datarobot.com
2. DataRobot data pipeline
   Data upload → Exploratory data analysis → Training models, selecting best models & hyperparameters → Models leaderboard → Prediction API
3. Our journey to Apache Spark
   PySpark vs Scala API?
   Sending instructions from the Python process to the Spark worker JVM over py4j is FAST:
     df.agg({"age": "max"})
   Sending data between the JVM and the Python process (ipc/serde) is SLOW:
     data.map(lambda x: …)
     data.filter(lambda x: …)
4. Our journey to Apache Spark
   RDD vs DataFrame
   RDD[Row[(Double, String, Vector)]]
   vs. DataFrame columns:
   ● (DoubleType, nullable=true) + Attributes (in spark-1.4)
   ● (StringType, nullable=true) + Attributes (in spark-1.4)
   ● (VectorType, nullable=true) + Attributes (in spark-1.4)
   Attributes: NumericAttribute, NominalAttribute (Ordinal), BinaryAttribute
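A minimal sketch of attaching attribute metadata to DataFrame columns with the spark.ml attribute API; the column names ("age", "color") are only illustrative:

   import org.apache.spark.ml.attribute.{NominalAttribute, NumericAttribute}

   // Numeric column metadata: just a name here (stats like min/max are optional).
   val ageAttr = NumericAttribute.defaultAttr.withName("age")

   // Nominal (categorical) column metadata with explicit levels.
   val colorAttr = NominalAttribute.defaultAttr
     .withName("color")
     .withValues("red", "green", "blue")

   // Attributes travel as column metadata inside a StructField.
   val ageField = ageAttr.toStructField()
   val colorField = colorAttr.toStructField()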
5. Our journey to Apache Spark
   MLlib vs ML
   MLlib:
   ● Low-level implementation of machine learning algorithms
   ● Based on RDDs
   ML:
   ● High-level pipeline abstractions
   ● Based on DataFrames
   ● Uses MLlib under the hood.
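A minimal sketch of the same algorithm through both APIs; `labeledPoints` (an RDD[LabeledPoint]) and `trainingDF` (a DataFrame with "features" and "label" columns) are assumed inputs:

   import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
   import org.apache.spark.ml.classification.LogisticRegression

   // mllib: RDD-based, imperative style.
   val mllibModel = new LogisticRegressionWithLBFGS().run(labeledPoints)

   // ml: DataFrame-based Estimator that plugs into pipelines.
   val mlModel = new LogisticRegression().setMaxIter(10).fit(trainingDF)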
6. Columnar format
   ● Compression
   ● Scan optimization
   ● Null-imputor improvement:
   - val na2mean = { value: Double =>
   -   if (value.isNaN) meanValue else value
   - }
   - dataset.withColumn(map(outputCol), callUDF(na2mean, DoubleType, dataset(map(inputCol))))
   + dataset.na.fill(map(inputCols).zip(meanValues).toMap)
7. Typical machine learning pipeline
   ● Features extraction
   ● Missing values imputation
   ● Variables encoding
   ● Dimensionality reduction
   ● Training model (finding the optimal model parameters)
   ● Selecting hyperparameters
   Model evaluation on some metric (AUC, R2, RMSE, etc.)
   Train data (features + label), test data (features), model state (parameters + hyperparameters), prediction
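A minimal spark.ml sketch of such a pipeline, assuming roughly Spark 1.4/1.5-era stages and illustrative column names; missing-value imputation would be a custom stage such as the MeanImputor that appears in the config below:

   import org.apache.spark.ml.Pipeline
   import org.apache.spark.ml.classification.LogisticRegression
   import org.apache.spark.ml.feature.{OneHotEncoder, PCA, StringIndexer, VectorAssembler}

   // Variables encoding: categorical string -> index -> one-hot vector.
   val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
   val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")

   // Features extraction: assemble raw columns into one feature vector.
   val assembler = new VectorAssembler()
     .setInputCols(Array("categoryVec", "num1", "num2"))
     .setOutputCol("rawFeatures")

   // Dimensionality reduction.
   val pca = new PCA().setInputCol("rawFeatures").setOutputCol("features").setK(10)

   // Training the model.
   val lr = new LogisticRegression().setMaxIter(10)

   val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, pca, lr))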
8. Introducing Blueprint
9. Pipeline config
   pipeline: {
     "1": {
       input: ["NUM"],
       class: "org.apache.spark.ml.feature.MeanImputor"
     },
     "2": {
       input: ["CAT"],
       class: "org.apache.spark.ml.feature.OneHotEncoder"
     },
     "3": {
       input: ["1", "2"],
       class: "org.apache.spark.ml.feature.VectorAssembler"
     },
     "4": {
       input: "3",
       class: "org.apache.spark.ml.classification.LogisticRegression",
       params: {
         optimizer: "LBFGS",
         regParam: [0.5, 0.1, 0.01, 0.001]
       }
     }
   }
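A minimal sketch of how a config like this could be materialized into pipeline stages via reflection; the StageConfig case class and buildStages helper are hypothetical illustrations, not the actual Blueprint code:

   import org.apache.spark.ml.{Pipeline, PipelineStage}

   // Hypothetical in-memory form of one config entry.
   case class StageConfig(className: String, params: Map[String, Any])

   // Instantiate each stage by its class name; wiring input/output columns and
   // mapping `params` onto each stage's Params is omitted here.
   def buildStages(configs: Seq[StageConfig]): Array[PipelineStage] =
     configs.map(c => Class.forName(c.className).newInstance().asInstanceOf[PipelineStage]).toArray

   // val pipeline = new Pipeline().setStages(buildStages(parsedConfig))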
10. Introducing Blueprint
    Architecture: Blueprint → Spark jobserver → YARN cluster
11. Transformer (pure function)

    abstract class Transformer extends PipelineStage with Params {
      /**
       * Transforms the dataset with provided parameter map as additional parameters.
       * @param dataset input dataset
       * @param paramMap additional parameters, overwrite embedded params
       * @return transformed dataset
       */
      def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame
    }

    Example:
    (new HashingTF)
      .setInputCol("categorical_column")
      .setOutputCol("Hashing_tf_1")
      .setNumFeatures(1 << 20)
      .transform(data)
12. Estimator

    abstract class Estimator[M <: Model[M]] extends PipelineStage with Params {
      /**
       * Fits a single model to the input data with optional parameters.
       *
       * @param dataset input dataset
       * @param paramPairs Optional list of param pairs.
       *                   These values override any specified in this Estimator's embedded ParamMap.
       * @return fitted model
       */
      @varargs
      def fit(dataset: DataFrame, paramPairs: ParamPair[_]*): M = {
        val map = ParamMap(paramPairs: _*)
        fit(dataset, map)
      }
    }

    Example:
    val oneHotEncoderModel = (new OneHotEncoder)
      .setInputCol("vector_col")
      .fit(trainingData)
    oneHotEncoderModel.transform(trainingData)
    oneHotEncoderModel.transform(testData)

    Estimator => Transformer
13. Predictor
    An Estimator that predicts a value.
    Hierarchy: Predictor → Classifier → ProbabilisticClassifier; Predictor → Regressor
14. Evaluator

    abstract class Evaluator extends Identifiable {
      /**
       * Evaluates the output.
       *
       * @param dataset a dataset that contains labels/observations and predictions.
       * @param paramMap parameter map that specifies the input columns and output metrics
       * @return metric
       */
      def evaluate(dataset: DataFrame, paramMap: ParamMap): Double
    }

    Example:
    val areaUnderROC = (new BinaryClassificationEvaluator)
      .setScoreCol("prediction")
      .evaluate(data)
15. Pipeline
    An Estimator that encapsulates other transformers / estimators.

    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    Diagram: Input data → Tokenizer → HashingTF → Logistic Regression → (fit) → PipelineModel
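A minimal usage sketch for the pipeline above; `training` and `test` are assumed DataFrames with a "text" column (and a "label" column for training):

    // Fitting runs each stage in order: transformers are applied, estimators are fitted.
    val model = pipeline.fit(training)

    // The resulting PipelineModel replays all fitted stages on new data.
    val predictions = model.transform(test)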
16. CrossValidator

    val crossval = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator)
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()
    crossval.setEstimatorParamMaps(paramGrid)
    crossval.setNumFolds(3)
    val cvModel = crossval.fit(training.toDF)

    Diagram: Input data → folds → Tokenizer → HashingTF → Logistic Regression → (fit) → CrossValidatorModel
    Grid: numFeatures ∈ {10, 100, 1000}, regParam ∈ {0.1, 0.01}
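A minimal usage sketch, assuming a `test` set analogous to the `training` data above:

    // The fitted CrossValidatorModel wraps the pipeline trained with the best
    // parameter combination found on the grid.
    val best = cvModel.bestModel

    // It can be applied directly to held-out data.
    val scored = cvModel.transform(test.toDF)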
17. Pluggable backend
    ● H2O
    ● Flink
    ● DeepLearning4J
    ● http://keystone-ml.org/
    ● etc.
18. Optimization
    ● Disable k-fold cross validation
    ● Minimize redundant pre-processing
    ● Parallel grid search
    ● Parallel DAG pipeline
    ● Pluggable optimizer
    ● Non-gridsearch hyperparameter optimization (bayesian & hypergrad):
      http://arxiv.org/pdf/1502.03492v2.pdf
      http://arxiv.org/pdf/1206.2944.pdf
      http://arxiv.org/pdf/1502.05700v1.pdf
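A rough sketch of the parallel grid search idea (not the actual DataRobot implementation), reusing `pipeline`, `paramGrid` and `training` from the previous slides:

    // Each ParamMap from the grid is fitted independently, so the fits can be
    // submitted concurrently as parallel Spark jobs from one SparkContext.
    val models = paramGrid.par.map(params => pipeline.fit(training.toDF, params))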
19. Minimize redundant pre-processing
    regParam: 0.1 and regParam: 0.01 — both grid points share the same pre-processing, yet Spark does not deduplicate it:

    val rdd1 = rdd.map(function)
    val rdd2 = rdd.map(function)
    rdd1 != rdd2
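A minimal sketch of the idea: materialize the shared pre-processing once and reuse it for every grid point; `train` is a hypothetical helper standing in for the fitting step:

    // Two identical map calls create two distinct RDDs, so `function` runs twice.
    // Cache the shared pre-processing once instead.
    val preprocessed = rdd.map(function).cache()

    // e.g. train one model per regParam candidate on the same cached data.
    val models = Seq(0.1, 0.01).map(regParam => train(preprocessed, regParam))

    // Clean up after yourself (see Summary).
    preprocessed.unpersist()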
20. Summary
    ● Good model != good result. Feature engineering is the key.
    ● Spark provides a good abstraction, but some parts need tuning to achieve good performance.
    ● The ml pipeline API gives pluggable and reusable building blocks.
    ● Don't forget to clean up after yourself (unpersist cache).
21. Thanks, Demo & QA
