R, Scikit-Learn and Apache Spark ML -
What difference does it make?
Villu Ruusmann
Openscoring OÜ
Overview
● Identifying long-standing, high-value opportunities in the
applied predictive analytics domain
● Thinking about problems in API terms
● Providing solutions in API terms
● Developing and applying custom tools
+ A couple of tips if you're looking to buy or sell a VW Golf
The trade-off
"More data beats better algorithms"
The state of the art
Scaling out horizontally
Elements of reproducibility
Standardized, human- and machine-readable descriptions:
● Dataset
● Data pre- and post-processing steps:
○ From real-life input table (SQL, CSV) to model
○ From model to real-life output table
● Model
● Statistics
Calling R from within Apache Spark
1. Create and initialize R runtime
2. Format and upload input RDD; upload and execute R
model; download output and parse into result RDD
3. Destroy R runtime
Calling Scikit-Learn from within Apache Spark
1. Format input RDD (e.g. using Java NIO) as numpy.array
2. Invoke Scikit-Learn via Python/C API
3. Parse output numpy.array into result RDD
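The three steps above can be sketched in plain Python. This is only an illustration of the marshalling round trip, not the actual integration: the standard-library `array` type stands in for numpy.array, a list of tuples stands in for the RDD, and `predict` is a hypothetical placeholder for the Scikit-Learn call.

```python
from array import array

def format_rows(rows):
    """Step 1: pack row tuples into one contiguous, row-major buffer of doubles."""
    n_cols = len(rows[0])
    buf = array("d")
    for row in rows:
        if len(row) != n_cols:
            raise ValueError("all rows must have the same number of columns")
        buf.extend(float(v) for v in row)
    return buf, n_cols

def predict(buf, n_cols):
    """Step 2: hypothetical stand-in for the model call; it just sums each row."""
    n_rows = len(buf) // n_cols
    return array("d", (sum(buf[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)))

def parse_output(rows, out):
    """Step 3: append each prediction to the row it was computed from."""
    return [row + (y,) for row, y in zip(rows, out)]

rows = [(1.0, 2.0), (3.0, 4.0)]
buf, n_cols = format_rows(rows)
result = parse_output(rows, predict(buf, n_cols))
print(result)
```

Every scoring call pays for steps 1 and 3 in addition to the model call itself, which is the overhead this slide draws attention to.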
API prioritization
Training << Maintenance ~ Deployment
One-time activity << Repeated activities
Short-term << Long-term
JPMML - Java PMML API
● Conversion API
● Maintenance API
● Execution API
○ Interpreted mode
○ Translated + compiled ("Transpiled") mode
● Serving API
○ Integrations with popular Big Data frameworks
○ REST web service
Calling JPMML-Spark from within Apache Spark
org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
org.apache.spark.sql.Dataset<Row> input = ..;
org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);
The case study
Predicting the price of VW Golf cars using GBT algorithms:
● 71 columns:
○ A continuous label: log(price)
○ Two string and four numeric categorical features
○ 64 binary-like (0/1) and numeric continuous features
● 270'458 rows:
○ 153'978 complete cases
○ 116'480 incomplete (i.e. with missing values) cases
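As a quick consistency check of the figures above, the column and row counts can be added up. The variable names are illustrative only; the numbers are taken from the slide.

```python
# Column breakdown as listed on the slide
n_label = 1
n_string_categorical = 2
n_numeric_categorical = 4
n_binary_and_continuous = 64
n_columns = n_label + n_string_categorical + n_numeric_categorical + n_binary_and_continuous

# Row breakdown as listed on the slide
n_complete = 153_978
n_incomplete = 116_480
n_rows = n_complete + n_incomplete

# Share of rows that contain at least one missing value
incomplete_share = n_incomplete / n_rows

print(n_columns, n_rows, round(incomplete_share, 3))
```

Roughly 43% of the rows are incomplete, so an algorithm that cannot handle missing values would have to either drop or impute a large part of the dataset.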
Gradient-Boosted Trees (GBTs)
R training and conversion API
#library("caret")
library("gbm")
library("r2pmml")
cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")
factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
for(factor_col in factor_cols){
    cars[, factor_col] = as.factor(cars[, factor_col])
}
# Doesn't work with factors with missing values
#cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)
r2pmml(cars.gbm, "gbm.pmml")
Scikit-Learn training and conversion API
import pandas
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import sklearn2pmml, PMMLPipeline
cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])
mapper = DataFrameMapper(..)
regressor = ..
tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
tuner.fit(mapper.fit_transform(cars), cars["price"])
pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("regressor", tuner.best_estimator_)
])
sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
Dataset
                   | R                 | LightGBM             | XGBoost                  | Scikit-Learn             | Apache Spark ML
Abstraction        | data.frame        | lgb.Dataset          | xgb.DMatrix              | numpy.array              | RDD<Vector>
Memory layout      | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse
Data type          | Any               | double               | float                    | float or double          | double
Categorical values | As-is (factor)    | Encoded              | Binarized                | Binarized                | Binarized
Missing values     | Yes               | Pseudo (NaN)         | Pseudo (NaN)             | No                       | No
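The difference between "Encoded" and "Binarized" categorical values can be illustrated with a small standard-library sketch. The helper functions are hypothetical and are not the PMMLLabelEncoder or PMMLLabelBinarizer implementations; in particular, mapping a missing value to NaN or to an all-zero row is an assumption made for illustration.

```python
import math

def label_encode(values):
    """Map each distinct non-missing category to an integer code; missing becomes NaN."""
    categories = sorted({v for v in values if v is not None})
    codes = {c: i for i, c in enumerate(categories)}
    encoded = [float(codes[v]) if v is not None else math.nan for v in values]
    return categories, encoded

def binarize(values):
    """Emit one 0/1 column per category; a missing value yields an all-zero row."""
    categories = sorted({v for v in values if v is not None})
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colours = ["Rot", "Blau", None, "Rot"]
categories, encoded = label_encode(colours)
_, binarized = binarize(colours)
print(categories, encoded, binarized)
```

Encoding keeps one column per feature, whereas binarizing produces one column per category level, which is why the width of the training matrix differs between frameworks for the same dataset.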
LightGBM via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelEncoder
from lightgbm import LGBMRegressor
mapper = DataFrameMapper(
    [(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
    [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
regressor.fit(transformed_cars, cars["price"],
    categorical_feature = list(range(0, len(factor_columns))))
XGBoost via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelBinarizer
from xgboost.sklearn import XGBRegressor
mapper = DataFrameMapper(
    [(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
    [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
regressor.fit(transformed_cars, cars["price"])
GBT algorithm (training)
                   | R              | LightGBM      | XGBoost           | Scikit-Learn              | Apache Spark ML
Abstraction        | gbm            | LGBMRegressor | XGBRegressor      | GradientBoostingRegressor | GBTRegressor
Parameterizability | Medium         | High          | High              | Medium                    | Medium
Split type         | Multi-way      | Binary        | Binary            | Binary                    | Binary
Categorical values | "set contains" | "equals"      | Pseudo ("equals") | Pseudo ("equals")         | "equals"
Missing values     | First-class    | Pseudo        | Pseudo            | No                        | No
gbm-style splits
<Node id="9">
  <SimplePredicate field="interior_type" operator="isMissing"/>
  <Node id="12" score="3.0702062395803734E-4">
    <SimplePredicate field="colour" operator="isMissing"/>
  </Node>
  <Node id="10" score="-0.018950416258408962">
    <SimpleSetPredicate field="colour" booleanOperator="isIn">
      <Array type="string">Grün Rot Violett Weiß</Array>
    </SimpleSetPredicate>
  </Node>
  <Node id="11" score="-0.0017446280908351925">
    <SimpleSetPredicate field="colour" booleanOperator="isIn">
      <Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>
    </SimpleSetPredicate>
  </Node>
</Node>
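Read as code, the three children of the node above form a multi-way split on colour: missing, in the first set, or in the second set. The following Python sketch mirrors that logic with the scores copied from the fragment. The function name is hypothetical, and returning None when no child matches is an assumption about how an unseen category would be treated.

```python
# Category sets copied from the two SimpleSetPredicate elements
FIRST_SET = {"Grün", "Rot", "Violett", "Weiß"}
SECOND_SET = {"Beige", "Blau", "Braun", "Gelb", "Gold", "Grau", "Orange", "Schwarz", "Silber"}

def score_colour(colour):
    """Pick the first child whose predicate holds, in document order."""
    if colour is None:            # Node 12: colour isMissing
        return 3.0702062395803734e-4
    if colour in FIRST_SET:       # Node 10: colour isIn first set
        return -0.018950416258408962
    if colour in SECOND_SET:      # Node 11: colour isIn second set
        return -0.0017446280908351925
    return None                   # no child matches (assumed: no prediction)

for colour in (None, "Rot", "Silber", "Pink"):
    print(colour, score_colour(colour))
```

The missing-value case is a branch of its own here, which is what "first-class" missing value support means in the comparison table.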
LightGBM- and XGBoost-style splits (1/3)
<Node id="39" defaultChild="76">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <Node id="76" score="0.0030283758">
    <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
  </Node>
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
</Node>
LightGBM- and XGBoost-style splits (2/3)
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->
  <Node id="76" score="0.0030283758">
    <CompoundPredicate booleanOperator="or">
      <SimplePredicate field="colour" operator="isMissing"/>
      <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
    </CompoundPredicate>
  </Node>
  <!-- else if("Orange".equals(colour)) return 0.02483887 -->
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
  <!-- else return null -->
</Node>
LightGBM- and XGBoost-style splits (3/3)
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->
  <Node id="77" score="0.02483887">
    <CompoundPredicate booleanOperator="and">
      <SimplePredicate field="colour" operator="isNotMissing"/>
      <SimplePredicate field="colour" operator="equal" value="Orange"/>
    </CompoundPredicate>
  </Node>
  <!-- else return 0.0030283758 -->
  <Node id="76" score="0.0030283758">
    <True/>
  </Node>
</Node>
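The two rewritten forms of node 39 shown above (one with an explicit "or" compound predicate, one with a positive test followed by a catch-all True child) should score every input identically. A minimal Python sketch, following the pseudocode in the XML comments, makes that check explicit. The function names are hypothetical.

```python
def score_with_explicit_missing(colour):
    """Form with the missing-value case spelled out in an "or" compound predicate."""
    if colour is None or colour != "Orange":
        return 0.0030283758
    if colour == "Orange":
        return 0.02483887
    return None  # unreachable for these two predicates, kept to mirror the XML comment

def score_with_catch_all(colour):
    """Form with the positive test first and a True child as the fallback."""
    if colour is not None and colour == "Orange":
        return 0.02483887
    return 0.0030283758

# Missing, matching and non-matching inputs all agree between the two forms
for colour in (None, "Orange", "Rot"):
    assert score_with_explicit_missing(colour) == score_with_catch_all(colour)
    print(colour, score_with_catch_all(colour))
```

The catch-all form needs one predicate evaluation fewer for non-matching inputs, and it can never fall through without a prediction.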
Model measurement using JPMML
org.dmg.pmml.tree.TreeModel treeModel = ..;
treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){
    private int count = 0; // Number of Node elements
    private int maxDepth = 0; // Max "nesting depth" of Node elements
    @Override
    public VisitorAction visit(org.dmg.pmml.tree.Node node){
        this.count++;
        int depth = 0;
        for(org.dmg.pmml.PMMLObject parent : getParents()){
            if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
            depth++;
        }
        this.maxDepth = Math.max(this.maxDepth, depth);
        return super.visit(node);
    }
});
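The same two measurements (number of Node elements and maximum nesting depth, with the root Node at depth 0) can be sketched outside the JVM using Python's standard XML parser. This is an illustration under simplifying assumptions, not a replacement for the visitor above: the sample document is made up and omits the PMML namespace, which a real PMML file declares and which would require namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Made-up, namespace-free sample with a root Node and two leaf Nodes
TREE = """
<TreeModel>
  <Node id="39">
    <Node id="76" score="0.0030283758"/>
    <Node id="77" score="0.02483887"/>
  </Node>
</TreeModel>
"""

def measure(node, depth=0):
    """Return (number of Node elements in this subtree, deepest Node depth)."""
    count, max_depth = 1, depth
    for child in node.findall("Node"):  # direct Node children only
        child_count, child_depth = measure(child, depth + 1)
        count += child_count
        max_depth = max(max_depth, child_depth)
    return count, max_depth

root_node = ET.fromstring(TREE).find("Node")
count, max_depth = measure(root_node)
print(count, max_depth)
```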
GBT algorithm (interpretation)
                    | R            | LightGBM           | XGBoost                    | Scikit-Learn    | Apache Spark ML
Feature importances | Direct       | Direct             | Transformed                | Transformed     | Transformed
Decision path       | No           | No(?)              | No(?)                      | Transformed     | Transformed
Model persistence   | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text)
Model reusability   | Good         | Fair(?)            | Good                       | Fair            | Fair
Java API            | No           | No                 | Pseudo                     | No              | Yes
LightGBM feature importances
Age 936
Mileage 887
Performance 738
[Category] 205
New? 179
[Type of fuel] 170
[Type of interior] 167
Airbags? 130
[Colour] 129
[Type of gearbox] 105
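To compare the figures above more easily, they can be converted to relative shares. The sketch below assumes the numbers are raw importance scores for the ten listed features only, so each share is relative to this list and not to the whole model; the slide does not say what unit the scores are in.

```python
# Scores copied from the slide; bracketed names are categorical features
importances = {
    "Age": 936,
    "Mileage": 887,
    "Performance": 738,
    "[Category]": 205,
    "New?": 179,
    "[Type of fuel]": 170,
    "[Type of interior]": 167,
    "Airbags?": 130,
    "[Colour]": 129,
    "[Type of gearbox]": 105,
}

total = sum(importances.values())
shares = {name: score / total for name, score in importances.items()}

# Combined share of the three highest-scoring features
top3_share = sum(sorted(importances.values(), reverse=True)[:3]) / total

for name, share in shares.items():
    print(f"{name:20s} {share:.1%}")
print(f"Top 3 combined: {top3_share:.1%}")
```

Under that assumption, Age, Mileage and Performance together account for about 70% of the listed total.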
Model execution using JPMML
org.dmg.pmml.PMML pmml;
try(InputStream is = ..){
    pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
}
org.jpmml.evaluator.Evaluator evaluator =
    new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);
org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);
for(int value = min; value <= max; value += increment){
    Map<FieldName, FieldValue> arguments =
        Collections.singletonMap(inputField.getName(), inputField.prepare(value));
    Map<FieldName, ?> result = evaluator.evaluate(arguments);
    System.out.println(result.get(targetField.getName()));
}
Lessons (to be) learned
● Limits and limitations of individual APIs
● Vertical integration vs. horizontal integration:
○ All capabilities on a single platform
○ Specialized capabilities on specialized platforms
● Ease-of-use and robustness beat raw performance in
most application scenarios
● "Conventions over configuration"
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml

R, Scikit-Learn and Apache Spark ML - What difference does it make?

  • 1. R, Scikit-Learn and Apache Spark ML - What difference does it make? Villu Ruusmann Openscoring OÜ
  • 2. Overview
      ● Identifying long-standing, high-value opportunities in the applied predictive analytics domain
      ● Thinking about problems in API terms
      ● Providing solutions in API terms
      ● Developing and applying custom tools
      + A couple of tips if you're looking to buy or sell a VW Golf
  • 4. "More data beats better algorithms"
  • 5. The state of the art
  • 7. Elements of reproducibility
      Standardized, human- and machine-readable descriptions:
      ● Dataset
      ● Data pre- and post-processing steps:
        ○ From real-life input table (SQL, CSV) to model
        ○ From model to real-life output table
      ● Model
      ● Statistics
  • 8. Calling R from within Apache Spark
      1. Create and initialize R runtime
      2. Format and upload input RDD; upload and execute R model; download output and parse into result RDD
      3. Destroy R runtime
  • 9. Calling Scikit-Learn from within Apache Spark
      1. Format input RDD (e.g. using Java NIO) as numpy.array
      2. Invoke Scikit-Learn via Python/C API
      3. Parse output numpy.array into result RDD
  • 10. API prioritization
      Training << Maintenance ~ Deployment
      One-time activity << Repeated activities
      Short-term << Long-term
  • 11. JPMML - Java PMML API
      ● Conversion API
      ● Maintenance API
      ● Execution API
        ○ Interpreted mode
        ○ Translated + compiled ("Transpiled") mode
      ● Serving API
        ○ Integrations with popular Big Data frameworks
        ○ REST web service
  • 12. Calling JPMML-Spark from within Apache Spark
      org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
      org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
      org.apache.spark.sql.Dataset<Row> input = ..;
      org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);
  • 13. The case study
      Predicting the price of VW Golf cars using GBT algorithms:
      ● 71 columns:
        ○ A continuous label: log(price)
        ○ Two string and four numeric categorical features
        ○ 64 binary-like (0/1) and numeric continuous features
      ● 270'458 rows:
        ○ 153'978 complete cases
        ○ 116'480 incomplete (i.e. with missing values) cases
  • 15. R training and conversion API
      #library("caret")
      library("gbm")
      library("r2pmml")
      cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")
      factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
      for(factor_col in factor_cols){
        cars[, factor_col] = as.factor(cars[, factor_col])
      }
      # Doesn't work with factors with missing values
      #cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
      cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)
      r2pmml(cars.gbm, "gbm.pmml")
  • 16. Scikit-Learn training and conversion API
      import pandas
      from sklearn_pandas import DataFrameMapper
      from sklearn.model_selection import GridSearchCV
      from sklearn2pmml import sklearn2pmml, PMMLPipeline
      cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])
      mapper = DataFrameMapper(..)
      regressor = ..
      tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
      tuner.fit(mapper.fit_transform(cars), cars["price"])
      pipeline = PMMLPipeline([
        ("mapper", mapper),
        ("regressor", tuner.best_estimator_)
      ])
      sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
  • 17. Dataset
                         | R                 | LightGBM             | XGBoost                  | Scikit-Learn             | Apache Spark ML
      Abstraction        | data.frame        | lgb.Dataset          | xgb.DMatrix              | numpy.array              | RDD<Vector>
      Memory layout      | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse
      Data type          | Any               | double               | float                    | float or double          | double
      Categorical values | As-is (factor)    | Encoded              | Binarized                | Binarized                | Binarized
      Missing values     | Yes               | Pseudo (NaN)         | Pseudo (NaN)             | No                       | No
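The "Encoded" and "Binarized" entries in the categorical values row of slide 17 can be illustrated with a minimal pure-Python sketch. The helper names `label_encode` and `binarize` are hypothetical and are not part of any of the libraries compared; the real libraries use their own encoders (for example `PMMLLabelEncoder` and `PMMLLabelBinarizer` in the later slides).

```python
def label_encode(values):
    """Map each distinct category to an integer index ("Encoded")."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return categories, [index[value] for value in values]


def binarize(values):
    """Expand each category into its own 0/1 column ("Binarized")."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


colours = ["Rot", "Blau", "Rot", "Gelb"]
print(label_encode(colours))  # one integer column
print(binarize(colours))      # one 0/1 column per category
```

Encoding keeps a categorical feature as a single column, whereas binarization multiplies the column count by the number of categories, which is one reason the two representations lead to different split types later in the deck.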
  • 18. LightGBM via Scikit-Learn
      from sklearn_pandas import DataFrameMapper
      from sklearn2pmml.preprocessing import PMMLLabelEncoder
      from lightgbm import LGBMRegressor
      mapper = DataFrameMapper(
        [(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
        [(continuous_columns, None)]
      )
      transformed_cars = mapper.fit_transform(cars)
      regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
      regressor.fit(transformed_cars, cars["price"], categorical_feature = list(range(0, len(factor_columns))))
  • 19. XGBoost via Scikit-Learn
      from sklearn_pandas import DataFrameMapper
      from sklearn2pmml.preprocessing import PMMLLabelBinarizer
      from xgboost.sklearn import XGBRegressor
      mapper = DataFrameMapper(
        [(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
        [(continuous_columns, None)]
      )
      transformed_cars = mapper.fit_transform(cars)
      regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
      regressor.fit(transformed_cars, cars["price"])
  • 20. GBT algorithm (training)
                         | R              | LightGBM      | XGBoost           | Scikit-Learn              | Apache Spark ML
      Abstraction        | gbm            | LGBMRegressor | XGBRegressor      | GradientBoostingRegressor | GBTRegressor
      Parameterizability | Medium         | High          | High              | Medium                    | Medium
      Split type         | Multi-way      | Binary        | Binary            | Binary                    | Binary
      Categorical values | "set contains" | "equals"      | Pseudo ("equals") | Pseudo ("equals")         | "equals"
      Missing values     | First-class    | Pseudo        | Pseudo            | No                        | No
  • 21. gbm-style splits
      <Node id="9">
        <SimplePredicate field="interior_type" operator="isMissing"/>
        <Node id="12" score="3.0702062395803734E-4">
          <SimplePredicate field="colour" operator="isMissing"/>
        </Node>
        <Node id="10" score="-0.018950416258408962">
          <SimpleSetPredicate field="colour" booleanOperator="isIn">
            <Array type="string">Grün Rot Violett Weiß</Array>
          </SimpleSetPredicate>
        </Node>
        <Node id="11" score="-0.0017446280908351925">
          <SimpleSetPredicate field="colour" booleanOperator="isIn">
            <Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>
          </SimpleSetPredicate>
        </Node>
      </Node>
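The three-way split on `colour` in slide 21 can be read as a "first matching child wins" rule. The following is a minimal Python sketch of that reading, using the scores and category sets from the slide; the function name `score_colour` is hypothetical, and returning `None` for an unseen category is an assumption (the actual behaviour depends on how the PMML document configures the tree).

```python
# Category sets and scores copied from the child nodes 12, 10 and 11 of slide 21.
LEFT_COLOURS = {"Grün", "Rot", "Violett", "Weiß"}
RIGHT_COLOURS = {"Beige", "Blau", "Braun", "Gelb", "Gold", "Grau",
                 "Orange", "Schwarz", "Silber"}


def score_colour(colour):
    """Evaluate the children of node 9 in document order."""
    if colour is None:            # node 12: isMissing is a branch of its own
        return 3.0702062395803734E-4
    if colour in LEFT_COLOURS:    # node 10: "set contains"
        return -0.018950416258408962
    if colour in RIGHT_COLOURS:   # node 11: "set contains"
        return -0.0017446280908351925
    return None                   # assumption: no child matches an unseen category


for colour in (None, "Rot", "Blau"):
    print(colour, score_colour(colour))
```

The missing value gets a dedicated branch here, which is what the training table means by "first-class" missing value support in R's gbm.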
  • 22. LightGBM- and XGBoost-style splits (1/3)
      <Node id="39" defaultChild="76">
        <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
        <Node id="76" score="0.0030283758">
          <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
        </Node>
        <Node id="77" score="0.02483887">
          <SimplePredicate field="colour" operator="equal" value="Orange"/>
        </Node>
      </Node>
  • 23. LightGBM- and XGBoost-style splits (2/3)
      <Node id="39">
        <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
        <!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->
        <Node id="76" score="0.0030283758">
          <CompoundPredicate booleanOperator="or">
            <SimplePredicate field="colour" operator="isMissing"/>
            <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
          </CompoundPredicate>
        </Node>
        <!-- else if("Orange".equals(colour)) return 0.02483887 -->
        <Node id="77" score="0.02483887">
          <SimplePredicate field="colour" operator="equal" value="Orange"/>
        </Node>
        <!-- else return null -->
      </Node>
  • 24. LightGBM- and XGBoost-style splits (3/3)
      <Node id="39">
        <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
        <!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->
        <Node id="77" score="0.02483887">
          <CompoundPredicate booleanOperator="and">
            <SimplePredicate field="colour" operator="isNotMissing"/>
            <SimplePredicate field="colour" operator="equal" value="Orange"/>
          </CompoundPredicate>
        </Node>
        <!-- else return 0.0030283758 -->
        <Node id="76" score="0.0030283758">
          <True/>
        </Node>
      </Node>
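Slides 23 and 24 rewrite the same binary split in two ways so that a missing `colour` is routed to the default child (node 76). A minimal Python sketch, transcribing the pseudocode comments embedded in the two XML fragments, shows that both forms give the same score for every input; the function names are hypothetical.

```python
ORANGE_SCORE = 0.02483887     # node 77
DEFAULT_SCORE = 0.0030283758  # node 76


def score_or_form(colour):
    """Slide 23: the default child is tested first with an explicit 'or'."""
    if colour is None or colour != "Orange":
        return DEFAULT_SCORE
    elif colour == "Orange":
        return ORANGE_SCORE
    return None  # unreachable, mirrors the "else return null" comment


def score_and_form(colour):
    """Slide 24: the non-default child is tested first, the default is a catch-all."""
    if colour is not None and colour == "Orange":
        return ORANGE_SCORE
    return DEFAULT_SCORE


for colour in (None, "Orange", "Blau"):
    assert score_or_form(colour) == score_and_form(colour)
    print(colour, score_and_form(colour))
```

The second form needs only one explicit predicate plus a `True` catch-all, so it never falls through to a null result.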
  • 25. Model measurement using JPMML
      org.dmg.pmml.tree.TreeModel treeModel = ..;
      treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){
        private int count = 0; // Number of Node elements
        private int maxDepth = 0; // Max "nesting depth" of Node elements
        @Override
        public VisitorAction visit(org.dmg.pmml.tree.Node node){
          this.count++;
          int depth = 0;
          for(org.dmg.pmml.PMMLObject parent : getParents()){
            if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
            depth++;
          }
          this.maxDepth = Math.max(this.maxDepth, depth);
          return super.visit(node);
        }
      });
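The same two measurements (number of `Node` elements and their maximum nesting depth) can also be taken directly from the PMML XML. This is a rough Python analogue of the visitor in slide 25 using only the standard library; it is not the JPMML API, the `measure` helper is hypothetical, and the tiny `TreeModel` fragment is made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def measure(element, depth=0):
    """Return (node_count, max_depth) over Node elements at or below `element`.

    The depth of a Node is the number of Node elements enclosing it,
    so the root Node has depth 0.
    """
    is_node = local_name(element.tag) == "Node"
    count, max_depth = (1, depth) if is_node else (0, 0)
    for child in element:
        child_count, child_depth = measure(child, depth + 1 if is_node else depth)
        count += child_count
        max_depth = max(max_depth, child_depth)
    return count, max_depth


# Made-up fragment: a root Node, two children, one grandchild.
FRAGMENT = """
<TreeModel>
  <Node id="1"><True/>
    <Node id="2" score="1.0"><SimplePredicate field="x" operator="lessThan" value="0"/></Node>
    <Node id="3"><True/>
      <Node id="4" score="2.0"><True/></Node>
    </Node>
  </Node>
</TreeModel>
"""

print(measure(ET.fromstring(FRAGMENT)))
```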
  • 29. GBT algorithm (interpretation)
                          | R            | LightGBM           | XGBoost                    | Scikit-Learn    | Apache Spark ML
      Feature importances | Direct       | Direct             | Transformed                | Transformed     | Transformed
      Decision path       | No           | No(?)              | No(?)                      | Transformed     | Transformed
      Model persistence   | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text)
      Model reusability   | Good         | Fair(?)            | Good                       | Fair            | Fair
      Java API            | No           | No                 | Pseudo                     | No              | Yes
  • 30. LightGBM feature importances
      Age                | 936
      Mileage            | 887
      Performance        | 738
      [Category]         | 205
      New?               | 179
      [Type of fuel]     | 170
      [Type of interior] | 167
      Airbags?           | 130
      [Colour]           | 129
      [Type of gearbox]  | 105
  • 31. Model execution using JPMML
      org.dmg.pmml.PMML pmml;
      try(InputStream is = ..){
        pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
      }
      org.jpmml.evaluator.Evaluator evaluator = new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);
      org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
      org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);
      for(int value = min; value <= max; value += increment){
        Map<FieldName, FieldValue> arguments = Collections.singletonMap(inputField.getName(), inputField.prepare(value));
        Map<FieldName, ?> result = evaluator.evaluate(arguments);
        System.out.println(result.get(targetField.getName()));
      }
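The loop in slide 31 sweeps a single input field across a range while leaving every other field missing, and prints the prediction at each step. The shape of that loop can be sketched in Python as follows; `sweep` is a hypothetical helper, and `toy_evaluate` is a made-up stand-in for a real model evaluator, not a JPMML call.

```python
def sweep(evaluate, field, minimum, maximum, increment):
    """Evaluate a model at each value of `field`, passing no other fields."""
    results = []
    value = minimum
    while value <= maximum:
        # Single-entry argument map, like Collections.singletonMap(..) in the slide.
        results.append((value, evaluate({field: value})))
        value += increment
    return results


def toy_evaluate(arguments):
    """Stand-in model (assumption): the prediction falls linearly with age."""
    age = arguments.get("age")
    return None if age is None else 10.0 - age / 10


for value, prediction in sweep(toy_evaluate, "age", 0, 20, 10):
    print(value, prediction)
```

Such a sweep only makes sense if the model tolerates missing values in the remaining fields, which is where the missing-value handling of the tree splits matters.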
  • 34. Lessons (to be) learned
      ● Limits and limitations of individual APIs
      ● Vertical integration vs. horizontal integration:
        ○ All capabilities on a single platform
        ○ Specialized capabilities on specialized platforms
      ● Ease-of-use and robustness beat raw performance in most application scenarios
      ● "Conventions over configuration"