Taking your machine learning workflow
to the next level using Scikit-Learn
Pipelines
Philip Goddard
github.com/philipmgoddard/pipelines
• Many ML problems can be solved by
assembling modular components:
‒ Raw data in
‒ Predictions out.
• Pipelines encapsulate data preparation
as well as prediction.
Pipelines?
• Maintaining a clean workflow can be challenging when building ML models:
‒ Different models require different considerations, leading to multiple versions of
training data.
‒ When experimenting, the state of a Jupyter Notebook isn’t always obvious.
‒ Tuning model hyperparameters is (relatively) easy, tuning data transformations can
be a little more tedious.
Why should I care?
• Modular and natural way to build supervised ML models.
• Flexibility to experiment with data transformations as well as model tuning.
• Clean, DRY, reusable code.
How can Pipelines help?
• Scikit-Learn is built around the concept of transformers and estimators.
• Typically, data is run through some transformations before reaching an estimator.
• Pipelines are implemented using the Pipeline class, chaining the components together.
• Different transformation stages can be built separately and combined using FeatureUnion.
• When the final step of the Pipeline is an estimator, the GridSearchCV class allows
hyperparameter tuning.
‒ Tune (or fix) the parameters for the transformers as well as the estimator (a minimal sketch follows below).
Pipelines in Scikit-Learn
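A minimal sketch of these pieces working together, using only standard scikit-learn components (the step names, parameter values and data names here are illustrative, not the case-study code that follows):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# two transformer branches joined by FeatureUnion, then an estimator
pipe = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('scaled', Pipeline([('scale', StandardScaler()),
                             ('poly', PolynomialFeatures())])),
        ('pca', PCA(n_components=2)),
    ])),
    ('clf', LogisticRegression()),
])

# nested parameters are addressed as <step>__<substep>__<param>,
# so transformer settings can be tuned alongside the estimator
param_grid = {
    'union__scaled__poly__degree': [1, 2],
    'clf__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=5, refit=True)
# search.fit(X_train, y_train) then tunes transformers and estimator together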
Case Study
• Dataset for predicting whether subscribers to a mobile telephone plan will churn.
‒ Provided with a train (3333 observations) and test (1667 observations) set.
• Features include:
‒ Continuous measurements (e.g. total charges).
‒ Count measurements (e.g. number of customer service calls).
‒ Categorical (e.g. customer area code).
• Outcome is binary: ~85% did not churn and ~15% did churn.
Case Study: Customer Churn
github.com/philipmgoddard/pipelines
Understanding data informs Pipeline design
• The data set has a tractable number of features (19).
• We can easily visualise to get a feel for any considerations required for our Pipeline:
‒ Correlated features
‒ Low variance features
‒ Non-linearities
‒ Poorly behaved distributions
‒ Any other nuances
Numerical features (continuous)
Watch out for correlations!
• Pairwise plots are a good way to get a
feeling for highly correlated features.
• Some classes of model encounter
numerical instabilities when fitting if this
isn’t resolved.
• The pipeline should provide a way to identify and remove such features (a quick diagnostic sketch follows below).
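A quick diagnostic sketch (assuming the training features sit in a pandas DataFrame named features_train, as in the later code; the helper name is invented for illustration):

import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    # pairs of columns whose absolute correlation exceeds the threshold
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, upper.loc[a, b])
            for a in upper.index for b in upper.columns
            if pd.notnull(upper.loc[a, b]) and upper.loc[a, b] > threshold]

# e.g. correlated_pairs(features_train.select_dtypes('number'), threshold=0.9)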
Numerical features (counts)
Categorical features
Building the Pipeline
Pipeline Schematic
[Diagram: numerical feature Pipeline and categorical feature Pipeline combined by a FeatureUnion, followed by a high-correlation filter and the estimator]
Pipeline components
• Numerical features: select features, filter low-variance features, filter high correlations, center and scale, calculate interactions.
• Categorical features: select features, encode features, drop baseline category, filter low-variance features, filter high correlations.
Translate to code using Scikit-Learn API
num_pipeline = Pipeline([
('selector', DataFrameSelector(float_col_names + int_col_names)),
('zero_var', ZeroVariance()),
('correlation', FindCorrelation()),
('opt_scaler', OptionalStandardScaler()),
('poly_features', PolynomialFeatures())
])
[Diagram: numerical feature pipeline steps, as in the schematic above]
Translate to code using Scikit-Learn API
cat_pipeline = Pipeline([
('selector', DataFrameSelector(fac_col_names)),
('onehot_encoder', OneHotEncoder(sparse=False)),
('manual_dropper', ManualDropper(drop_ix=drop_col_ix)),
('zero_var', ZeroVariance()),
('correlation', FindCorrelation())
])
[Diagram: encoded categorical feature pipeline steps, as in the schematic above]
# bring it all together to produce 'base' pipeline
base_pipe = Pipeline([
('union', FeatureUnion(
# parallel parts of pipeline
transformer_list=[
('num_pipeline', num_pipeline),
('cat_pipeline', cat_pipeline)
])
),
# final correlation check
('correlation', FindCorrelation()),
])
The ‘base’ Pipeline in code
[Diagram: numerical and categorical feature Pipelines joined by a FeatureUnion, then a final correlation filter]
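The step names chosen above are exactly what GridSearchCV uses to address parameters, joined by double underscores; a quick way to see (and set) them, assuming base_pipe as defined above:

# nested parameter names follow the pattern <step>__<substep>__<param>
sorted(base_pipe.get_params().keys())
# e.g. 'union__num_pipeline__opt_scaler__scale', 'union__num_pipeline__poly_features__degree'
base_pipe.set_params(union__num_pipeline__poly_features__degree=2)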
• Sometimes custom transformers are needed to suit the specific problem.
• And sometimes more flexibility is required:
‒ OptionalStandardScaler is a wrapper class around the StandardScaler class in
the preprocessing module.
• Custom transformer classes must define a fit and a transform method.
• See the GitHub repository for full examples; a rough sketch of the pattern follows below.
Custom transformers
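The custom classes used above (DataFrameSelector, ZeroVariance, FindCorrelation, OptionalStandardScaler, ManualDropper) live in the repository; the sketch below is not that code, just an illustration of the pattern. Inheriting from BaseEstimator and TransformerMixin supplies get_params/set_params and fit_transform for free:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Sketch: select named columns from a pandas DataFrame, return a numpy array."""
    def __init__(self, col_names):
        self.col_names = col_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.col_names].values

class OptionalStandardScaler(BaseEstimator, TransformerMixin):
    """Sketch: wrap StandardScaler so scaling can be switched on/off as a hyperparameter."""
    def __init__(self, scale=False):
        self.scale = scale

    def fit(self, X, y=None):
        self.scaler_ = StandardScaler().fit(X) if self.scale else None
        return self

    def transform(self, X):
        return self.scaler_.transform(X) if self.scale else X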
Estimators
• Built a base Pipeline object of transformers.
• Use this as a template, and:
‒ Make a copy,
‒ Append an estimator to the end,
‒ Train using GridSearchCV.
• Possible to throw a handful of estimators into a single Pipeline
‒ However, makes it harder to dissect models.
Estimators
• Logistic regression.
• Trial L1 and L2 penalties, and a range of C (inverse regularization strength).
• As the model is linear, trial non-linear interaction terms.
• Worried about multicollinearity, so explicitly drop baseline categories.
• Center and scale the data.
Our first estimator
# copy of the base pipeline, append estimator
lr_est = copy.deepcopy(base_pipe)
lr_est.steps.append(('logistic_regression',
LogisticRegression(random_state=1234)))
# parameters for grid search
lr_param_grid = dict(
union__num_pipeline__opt_scaler__scale=[True],
union__num_pipeline__poly_features__degree=[1,2],
union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
logistic_regression__penalty=['l1', 'l2'],
logistic_regression__C=[0.001, 0.01, 0.1, 1.0, 10.0])
# initialize GridSearchCV object with cross validation
grid_search_lr = GridSearchCV(estimator=lr_est,
param_grid=lr_param_grid,
scoring='roc_auc',
cv=5,
refit=True)
# fit to training data
grid_search_lr.fit(features_train, outcome_train)
CV results: Logistic Regression
• Selected:
‒ L1 penalty,
‒ C = 0.1,
‒ Quadratic interactions.
• Final underlying model is accessible as
an attribute of the GridSearchCV object.
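For example (assuming grid_search_lr fitted as above), the winning settings and the refitted Pipeline can be read straight off the search object:

grid_search_lr.best_params_     # e.g. penalty='l1', C=0.1, poly_features degree=2
grid_search_lr.best_score_      # mean cross-validated ROC AUC of the best candidate
best_lr_pipe = grid_search_lr.best_estimator_  # full Pipeline refitted on all training data (refit=True)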
• Easy to reuse the Pipeline: train a Random Forest.
• Model is nonlinear, so no interaction terms needed.
• Don’t need to center or scale.
• Don’t drop baseline categories (unless binary) for this class of model.
• Hyperparameters to consider:
‒ number of estimators (trees) in our ensemble,
‒ number of features to consider for each split,
‒ maximum depth of trees.
A second estimator
# copy of the base pipeline, append estimator
rf_est = copy.deepcopy(base_pipe)
rf_est.steps.append(('random_forest',
RandomForestClassifier(random_state=1234)))
# parameter grid
rf_param_grid = dict(union__num_pipeline__opt_scaler__scale=[False],
union__num_pipeline__poly_features__degree=[1],
union__cat_pipeline__manual_dropper__optional_drop_ix=[None],
random_forest__n_estimators=[50, 100, 200],
random_forest__max_depth=[6, 9, 12],
random_forest__max_features=[4, 5, 6])
# ... create the GridSearchCV object as before, and fit to training data
CV results: Random Forest
More advantages of a
flexible pipeline
framework
• Consider upsampling or downsampling to resolve class imbalance.
‒ We want to sample rows.
‒ This is a different problem: transformers act on columns.
• Ratio of majority to minority class is another parameter to investigate.
‒ A pipeline would be perfect to trial different ratios.
• Add extra behavior to scikit-learn classifiers with a mixin class (a sketch of the idea follows below).
Example: Imbalanced Classes
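sample_clf_factory in the code below comes from the accompanying repository; the sketch here is not that implementation, just an illustration of the same idea as a wrapper estimator (with a wrapper, the nested names would read e.g. lr_sample__estimator__C rather than lr_sample__C):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils import resample

class SamplingClassifier(BaseEstimator, ClassifierMixin):
    """Sketch: resample training rows towards a target minority/majority ratio, then fit."""
    def __init__(self, estimator=None, upsample=True, target_ratio=0.5, random_state=None):
        self.estimator = estimator
        self.upsample = upsample
        self.target_ratio = target_ratio
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
        X_min, y_min = X[y == minority], y[y == minority]
        X_maj, y_maj = X[y == majority], y[y == majority]
        if self.upsample:
            # grow the minority class up to target_ratio * majority size
            n = int(self.target_ratio * len(y_maj))
            X_min, y_min = resample(X_min, y_min, replace=True,
                                    n_samples=n, random_state=self.random_state)
        else:
            # shrink the majority class down towards minority / target_ratio
            n = min(len(y_maj), int(len(y_min) / self.target_ratio))
            X_maj, y_maj = resample(X_maj, y_maj, replace=False,
                                    n_samples=n, random_state=self.random_state)
        X_bal, y_bal = np.vstack([X_maj, X_min]), np.concatenate([y_maj, y_min])
        self.estimator_ = clone(self.estimator).fit(X_bal, y_bal)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)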
# use the factory to add extra behavior to the sklearn class
LogisticRegressionWithSampling = sample_clf_factory(LogisticRegression)
# copy base pipeline, append estimator
lr_us_est = copy.deepcopy(base_pipe)
lr_us_est.steps.append(('lr_sample',
LogisticRegressionWithSampling(random_state=1234)))
# we can specify the sampling proportion on the target class
# as a hyperparameter now!
lr_us_param_grid = dict(union__num_pipeline__opt_scaler__scale=[True],
union__num_pipeline__poly_features__degree=[1,2],
union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
lr_sample__penalty=['l1'],
lr_sample__C=[0.01, 0.1, 1.0],
lr_sample__upsample=[True, False],
lr_sample__target_ratio=[0.15, 0.25, 0.5, 0.75, 1.0])
# ... and fit
CV results: Logistic Regression with upsampling
Making predictions
• We can evaluate our models by making
predictions on the test set.
• As the fitted Pipelines are estimators, we can
make predictions like any other fitted model.
• For this example, order our predictions of churn from most to least confident (sketched below).
‒ Visualise with a lift chart.
Evaluating our model on the test set
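A sketch of that evaluation, assuming the held-out data live in features_test and outcome_test (mirroring the training names used earlier) and that churn is the positive class:

import numpy as np
from sklearn.metrics import roc_auc_score

# predicted churn probabilities from the refitted best Pipeline
churn_prob = grid_search_lr.predict_proba(features_test)[:, 1]

# order observations from most to least confident prediction of churn
order = np.argsort(churn_prob)[::-1]
ranked_outcomes = np.asarray(outcome_test)[order]   # input for a lift chart

print(roc_auc_score(outcome_test, churn_prob))      # overall test-set ROC AUC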
• Discussed advantages of having a framework for ML Pipelines.
• Demonstrated how the Pipeline implementation in Scikit-Learn provides a
framework for flexible, readable and reusable ML.
• Walked through a case study to demonstrate how to apply these ideas in practice.
• Hopefully convinced you this is a great way to work with Scikit-Learn!
Conclusion
Thank You
github.com/philipmgoddard/pipelines