Taking your machine learning workflow to the next level using Scikit-Learn Pipelines
1. Taking your machine learning workflow to the next level using Scikit-Learn Pipelines
Philip Goddard
github.com/philipmgoddard/pipelines
2. Pipelines?
• Many ML problems can be solved by assembling modular components: raw data in, predictions out.
• Pipelines encapsulate data preparation as well as prediction.
3. Why should I care?
• Maintaining a clean workflow can be challenging when building ML models:
‒ Different models require different considerations, leading to multiple versions of training data.
‒ When experimenting, the state of a Jupyter Notebook isn’t always obvious.
‒ Tuning model hyperparameters is (relatively) easy; tuning data transformations can be a little more tedious.
4. How can Pipelines help?
• Modular and natural way to build supervised ML models.
• Flexibility to experiment with data transformations as well as model tuning.
• Clean, DRY, reusable code.
5. Pipelines in Scikit-Learn
• Scikit-Learn is built around the concept of transformers and estimators.
• Typically, data is run through some transformations before reaching an estimator.
• Pipelines are implemented using the Pipeline class, chaining the components together.
• Different transformation stages can be built, and combined using FeatureUnion.
• When the final step of the Pipeline is an estimator, the GridSearchCV class allows hyperparameter tuning (see the sketch below):
‒ Tune (or fix) the parameters for the transformers as well as the estimator.
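As a minimal sketch of how these pieces fit together (the step names and toy grid here are illustrative, not from the case study):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# transformers run in sequence; the final step is an estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures()),
    ('clf', LogisticRegression())
])

# parameters are addressed as <step_name>__<parameter_name>, so transformer
# and estimator parameters can be tuned side by side
param_grid = {
    'poly__degree': [1, 2],
    'clf__C': [0.1, 1.0],
}
grid = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=5)
# grid.fit(X_train, y_train) then fits every combination with 5-fold CV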
7. Case Study: Customer Churn (github.com/philipmgoddard/pipelines)
• Dataset for predicting whether subscribers to a mobile telephone plan will churn.
‒ Provided with a train (3333 observations) and test (1667 observations) set.
• Features include:
‒ Continuous measurements (e.g. total charges).
‒ Count measurements (e.g. number of customer service calls).
‒ Categorical (e.g. customer area code).
• Outcome is binary: ~85% did not churn and ~15% did churn.
8. Understanding data informs Pipeline design
• The data set has a tractable number of features (19).
• We can easily visualise the data to get a feel for any considerations required for our Pipeline:
‒ Correlated features
‒ Low variance features
‒ Non-linearities
‒ Poorly behaved distributions
‒ Any other nuances
10. Watch out for correlations!
• Pairwise plots are a good way to get a feel for highly correlated features.
• Some classes of model encounter numerical instabilities when fitting if this isn’t resolved.
• The Pipeline should provide a way to identify and remove such features (a rough sketch of the check follows).
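In the case study this check is handled by the FindCorrelation transformer; a rough standalone sketch of the underlying idea, assuming a pandas DataFrame of numeric features and an illustrative 0.9 cutoff:

import numpy as np

def high_correlation_pairs(df, threshold=0.9):
    # absolute pairwise Pearson correlations
    corr = df.corr().abs()
    # keep the upper triangle only, so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()  # (feature_a, feature_b) -> |correlation|
    return pairs[pairs > threshold].sort_values(ascending=False)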
16. Translate to code using Scikit-Learn API
num_pipeline = Pipeline([
('selector', DataFrameSelector(float_col_names + int_col_names)),
('zero_var', ZeroVariance()),
('correlation', FindCorrelation()),
('opt_scaler', OptionalStandardScaler()),
('poly_features', PolynomialFeatures())
])
[Diagram of the numerical feature pipeline: select features → filter low-variance features → filter high correlations → center and scale → calculate interactions]
17. Translate to code using Scikit-Learn API
cat_pipeline = Pipeline([
('selector', DataFrameSelector(fac_col_names)),
('onehot_encoder', OneHotEncoder(sparse=False)),
('manual_dropper', ManualDropper(drop_ix=drop_col_ix)),
('zero_var', ZeroVariance()),
('correlation', FindCorrelation())
])
[Diagram of the categorical feature pipeline: select features → encode features → drop baseline category → filter low-variance features → filter high correlations]
18. The ‘base’ Pipeline in code
# bring it all together to produce the 'base' pipeline
base_pipe = Pipeline([
('union', FeatureUnion(
# parallel parts of pipeline
transformer_list=[
('num_pipeline', num_pipeline),
('cat_pipeline', cat_pipeline)
])
),
# final correlation check
('correlation', FindCorrelation()),
])
[Diagram: the numerical and categorical feature Pipelines run in parallel inside the FeatureUnion, followed by a final correlation filter]
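As a quick sanity check, the base Pipeline can be fitted on its own; a minimal sketch, assuming the feature variable names used later in the deck:

X_base = base_pipe.fit_transform(features_train)
print(X_base.shape)  # rows unchanged; columns reflect the features surviving the filters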
19. Custom transformers
• Sometimes custom transformers are needed to suit the specific problem.
• And sometimes more flexibility is required:
‒ OptionalStandardScaler is a wrapper class around the StandardScaler class in the preprocessing module.
• Custom transformer classes must define a fit and a transform method (see the sketch below).
• See the github repository for examples.
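A minimal sketch of the pattern (this ColumnDropper is illustrative, not one of the repo's transformers):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    # BaseEstimator supplies get_params/set_params, so GridSearchCV can tune
    # drop_ix; TransformerMixin derives fit_transform from fit and transform
    def __init__(self, drop_ix=None):
        self.drop_ix = drop_ix

    def fit(self, X, y=None):
        # nothing to learn here; a stateful transformer would record its
        # fitted state in this method
        return self

    def transform(self, X):
        if self.drop_ix is None:
            return X
        return np.delete(np.asarray(X), self.drop_ix, axis=1)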
21. Estimators
• Built a base Pipeline object of transformers.
• Use this as a template, and:
‒ Make a copy,
‒ Append an estimator to the end,
‒ Train using GridSearchCV.
• Possible to throw a handful of estimators into a single Pipeline:
‒ However, this makes it harder to dissect models.
22. Our first estimator
• Logistic regression.
• Trial L1 and L2 penalties, and a range of C (inverse regularisation strength).
• As the model is linear, trial non-linear interaction terms.
• Worried about multicollinearity, so explicitly drop baseline categories.
• Center and scale the data.
23. # copy of the base pipeline, append estimator
lr_est = copy.deepcopy(base_pipe)
lr_est.steps.append(('logistic_regression',
LogisticRegression(random_state=1234)))
# parameters for grid search
lr_param_grid = dict(
union__num_pipeline__opt_scaler__scale=[True],
union__num_pipeline__poly_features__degree=[1,2],
union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
logistic_regression__penalty=['l1', 'l2'],
logistic_regression__C=[0.001, 0.01, 0.1, 1.0, 10.0])
# initialize GridSearchCV object with cross validation
grid_search_lr = GridSearchCV(estimator=lr_est,
param_grid=lr_param_grid,
scoring='roc_auc',
cv=5,
refit=True)
# fit to training data
grid_search_lr.fit(features_train, outcome_train)
27. CV results: Logistic Regression
• Selected:
‒ L1 penalty,
‒ C = 0.1,
‒ Quadratic interactions.
• The final underlying model is accessible as an attribute of the GridSearchCV object:
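Continuing from the grid search above:

# best cross-validated pipeline, refit on the full training set (refit=True)
best_pipe = grid_search_lr.best_estimator_
print(grid_search_lr.best_params_)  # winning parameter combination
print(grid_search_lr.best_score_)   # its mean cross-validated ROC AUC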
28. A second estimator
• Easy to reuse the Pipeline: train a Random Forest.
• Model is nonlinear, so no interaction terms needed.
• Don’t need to center or scale.
• Don’t drop baseline categories (unless binary) for this class of model.
• Hyperparameters to consider:
‒ number of estimators (trees) in our ensemble,
‒ number of features to consider for each split,
‒ maximum depth of trees.
29. # copy of the base pipeline, append estimator
rf_est = copy.deepcopy(base_pipe)
rf_est.steps.append(('random_forest',
RandomForestClassifier(random_state=1234)))
# parameter grid
rf_param_grid = dict(union__num_pipeline__opt_scaler__scale=[False],
union__num_pipeline__poly_features__degree=[1],
union__cat_pipeline__manual_dropper__optional_drop_ix=[None],
random_forest__n_estimators=[50, 100, 200],
random_forest__max_depth=[6, 9, 12],
random_forest__max_features=[4, 5, 6])
# ... create the GridSearchCV object as before, and fit to training data
33. Example: Imbalanced Classes
• Consider upsampling or downsampling to resolve class imbalance.
‒ We want to sample rows.
‒ This is a different problem: transformers act on columns.
• The ratio of majority to minority class is another parameter to investigate.
‒ A Pipeline would be perfect to trial different ratios.
• Add extra behavior to Sklearn classifiers with a mixin class (sketched below).
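The deck's sample_clf_factory builds such classes; the mixin below is a simplified sketch of the kind of class that factory might return, with a fixed 1:1 upsample (the actual factory also exposes upsample and target_ratio as tunable hyperparameters, as the grid on the next slide shows):

import numpy as np
from sklearn.linear_model import LogisticRegression

class UpsampleMixin:
    # simplified sketch: upsample the minority class to a 1:1 ratio by
    # resampling rows with replacement before delegating to the parent fit
    def fit(self, X, y, **fit_params):
        X, y = np.asarray(X), np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        minority_ix = np.where(y == classes[np.argmin(counts)])[0]
        extra = np.random.choice(minority_ix, counts.max() - counts.min(),
                                 replace=True)
        X_bal = np.vstack([X, X[extra]])
        y_bal = np.concatenate([y, y[extra]])
        return super().fit(X_bal, y_bal, **fit_params)

class LogisticRegressionWithSampling(UpsampleMixin, LogisticRegression):
    # MRO: UpsampleMixin.fit resamples, then LogisticRegression.fit runs
    pass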
34. # use the factory to add extra behavior to the sklearn class
LogisticRegressionWithSampling = sample_clf_factory(LogisticRegression)
# copy base pipeline, append estimator
lr_us_est = copy.deepcopy(base_pipe)
lr_us_est.steps.append(('lr_sample',
LogisticRegressionWithSampling(random_state=1234)))
# we can specify the sampling proportion on the target class
# as a hyperparameter now!
lr_us_param_grid = dict(union__num_pipeline__opt_scaler__scale=[True],
union__num_pipeline__poly_features__degree=[1,2],
union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
lr_sample__penalty=['l1'],
lr_sample__C=[0.01, 0.1, 1.0],
lr_sample__upsample=[True, False],
lr_sample__target_ratio=[0.15, 0.25, 0.5, 0.75, 1.0])
# ... and fit
38. Evaluating our model on the test set
• We can evaluate our models by making predictions on the test set.
• As the fitted Pipelines are estimators, we can make predictions like any other fitted model.
• For this example, order our predictions of churn from most to least confident.
‒ Visualise with a lift chart (see the sketch below).
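A minimal sketch of that evaluation, assuming test-set variable names mirroring the training ones and a binary 0/1 outcome:

import numpy as np

# predicted probability of churn on the test set; the fitted GridSearchCV
# delegates predict_proba to the refit best pipeline
probs = grid_search_lr.predict_proba(features_test)[:, 1]

# order observations from most to least confident prediction of churn
hits = np.asarray(outcome_test)[np.argsort(probs)[::-1]]

# cumulative fraction of all churners captured in the top-k predictions:
# the curve a lift chart is drawn from
gain = np.cumsum(hits) / hits.sum()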
39. Conclusion
• Discussed the advantages of having a framework for ML Pipelines.
• Demonstrated how the Pipeline implementation in Scikit-Learn provides a framework for flexible, readable and reusable ML.
• Walked through a case study to demonstrate how to apply it in practice.
• Hopefully convinced you this is a great way to work with Scikit-Learn!