Democratizing machine learning: perspective from scikit-learn

Gaël Varoquaux
scikit-learn: machine learning in Python

scikit-learn
From nerds to an industry standard
[Figure: number of monthly users, 2010 to 2018; y-axis ticks from 200,000 to 800,000]

scikit-learn
We were not aiming for the enterprise
but rather
ourselves
scientists
students

scikit-learn
Data-science for the many, not only the mighty
The news:

scikit-learn
Data scientists: largest data processed
Poll by KDnuggets
[Figure: histogram of answers by data size, with bins from "less than 1MB" to "over 100PB"; y-axis from 0% to 20%]
huge = 10 to 100GB

scikit-learn
Data scientists:
[Figure: histogram of answers by data size, with bins from "less than 1MB" to "over 100PB"; one series per year: 2013, 2014, 2015, 2016, 2018; y-axis from 0% to 20%]
no increase with time

Challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
1. Dirty data
2. Talent
3. Money
Databricks survey: nearly 90% of organizations invest in AI, few succeed
Challenges reported:
98%: preparation and aggregation of large datasets
96%: data exploration and iterative model training
https://databricks.com/company/newsroom/press-releases/databricks-survey-gets-to-the-heart-of-the-ai-dilemma-nearly-90-of-organizations-investing-in-ai-very-few-succeeding

Democratizing machine learning
1 Building a toolkit for all
2 Tackling scalability
3 Bridging to data engineering

1 Building a toolkit for all
The scikit-learn story

1 Embracing the Python stack
Python
Interactive, easy
General-purpose
Crucially, Python was made for embedding
⇒ simple virtual machine
(ref-counting garbage collection, no transactional memory)

Python
numpy
[Figure: grid of digits representing an array in memory]
Numerical operations
Contiguous-memory model (float*)
Enables bridging across languages (eg for lapack), Cython
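A minimal sketch of what the memory model means in practice, assuming only numpy. The array names are illustrative. It shows that a strided view is not one contiguous block, that np.ascontiguousarray packs it into one, and that the resulting address is what compiled code (C, Fortran, Cython) would receive as a pointer.

```python
import numpy as np

# A C-ordered array is stored as one contiguous block of doubles
X = np.arange(12, dtype=np.float64).reshape(3, 4)

# Taking every other column gives a strided view on the same buffer
view = X[:, ::2]

# Copy the view into a fresh contiguous block before handing it to compiled code
packed = np.ascontiguousarray(view)

# Integer address of the first element: what a C function would get as a pointer
address = packed.ctypes.data

print(X.flags["C_CONTIGUOUS"], view.flags["C_CONTIGUOUS"], packed.flags["C_CONTIGUOUS"])
```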

1 Focus on usability
API design
Grey box: all models interchangeable,
but still inspectable
Documentation & examples
Good documentation required to add a feature
Easy-to-understand examples guide API design
Teach statistical learning, rather than code
Models, solvers, hyperparameters
Choices that do not require tinkering
Lots of use-case-driven empirical testing
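The "grey box" idea can be illustrated with a short sketch, assuming a toy dataset from make_classification and two arbitrarily chosen estimators. Both are driven through the same fit / score calls, yet each still exposes its own fitted attributes for inspection.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Interchangeable: the same loop drives very different models
models = [LogisticRegression(), DecisionTreeClassifier(max_depth=3, random_state=0)]
train_accuracy = {}
for model in models:
    model.fit(X, y)
    train_accuracy[type(model).__name__] = model.score(X, y)

# Inspectable: each fitted model exposes what it learned
linear_coefficients = models[0].coef_
tree_importances = models[1].feature_importances_
print(train_accuracy)
```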

1 Community-driven development
Our DNA: distributed development & decision making
Gave manpower
[Figure: number of monthly contributors, 2010 to 2018; y-axis from 0 to 50]
and the right focus
People fix & improve what's important to them
Open source has won
But it needs sustainability and investment
mid-2018: A foundation for scikit-learn
+ the community

2 Tackling scalability
A new challenge

2 Algorithmic improvements
PCA
Cost: O(n p min(n, p))
Randomized PCA (simplified intuitions)
1. loop: take a random fraction of the data
2. small PCA on that fraction
3. aggregate results via PCA across results
svd_solver='auto': up to ×10 speedup
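A small sketch comparing the exact and randomized solvers of sklearn.decomposition.PCA, assuming synthetic low-rank data built here for illustration. It does not measure the speedup quoted on the slide. It only checks that, on such data, the randomized solver recovers nearly the same leading components. With svd_solver='auto', scikit-learn chooses a solver from the data shape and n_components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic data: a rank-10 signal plus a little noise (illustrative only)
X = rng.randn(2000, 10) @ rng.randn(10, 300) + 0.01 * rng.randn(2000, 300)

# Exact solver: full SVD of the centered data
pca_full = PCA(n_components=5, svd_solver="full").fit(X)

# Randomized solver: approximates only the leading components
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

print(pca_full.explained_variance_ratio_)
print(pca_rand.explained_variance_ratio_)
```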

PCA: aggregate across sub-samples ⇒ ×10
Logistic regression
Gradient descent on error measure: w_{i+1} = w_i − α ∇_w f
Large n = costly gradient computation
Full gradient: costly
Sub-sampling in gradient: finicky
Sub-sampling + noise reduction, solver='saga': fast & easy
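A hedged usage sketch of the 'saga' solver in LogisticRegression, on toy data generated for the occasion. The scaling step and the max_iter value are choices made for this example, not recommendations from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, for illustration only
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stochastic solvers converge faster on standardized features
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```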

Logistic regression: sub-sampling + noise reduction
Gradient-boosted trees: fit on sufficient summary
Succession of decision trees that enrich each other
[Figure: fits at iteration 1, iteration 2, iteration 3]
Speedup: bin data and compute histograms
HistGradientBoostingRegressor (v0.21)
catch up with XGBoost & LightGBM

Logistic regression: sub-sampling + noise reduction
Gradient-boosted trees: fit on sufficient summary
Fit on several subsamples / chunks
+ aggregation or variance reduction
Fit on summary statistics

2 Scaling out: parallel computing
Simple parallel computing schemes limiting data transfer
Data parallel: mostly in inner loops
Model parallel: for model selection, e.g. GridSearchCV
Real-life machine learning
[Figures: grids of digits representing the data arrays handled in each case]
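The model-parallel case can be sketched with GridSearchCV, assuming toy data and an arbitrary parameter grid. Each (parameter, fold) fit is an independent task, so n_jobs=2 dispatches them to two workers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data and grid, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 4 candidates x 3 folds = 12 independent fits, run on 2 workers
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3, n_jobs=2)
search.fit(X, y)
best_C = search.best_params_["C"]
print(best_C, search.best_score_)
```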

Implementations (used by scikit-learn)
Inner loops (fast)
OS threads
OpenMP (from GCC, ICC, clang)
all in same process
Large-scale parallelism
Across Python VMs,
Across computers
Transfer
Synchronization
Real life = a merry mess: oversubscription, inefficient transfer
A scheduling problem. But: we need a simple API to focus on algorithmics
scikit-learn is a library: it doesn't own the "main"

2 Our abstraction: joblib’s parallel for
joblib.Parallel()(joblib.delayed(f)(i) for i in ...)
lazy evaluation
Multiprocessing / loky backend
manages a pool of Python VMs, segfault resilient
lazy loop consumption to limit memory usage
auto-batching dispatch to lower overhead
limits # threads in sub-processes (threadpoolctl)
Extendable backend API (e.g. dask)
delegates scheduling (e.g. to a framework)
still a dispatch / receive queue
overflows the memory of greedy schedulers
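A runnable instance of the parallel-for pattern quoted on this slide, assuming joblib's default process-based backend. The function square is a hypothetical stand-in for real work. The generator is consumed lazily and results come back in input order.

```python
from joblib import Parallel, delayed


def square(i):
    # Stand-in for an expensive, independent computation
    return i * i


# delayed(square)(i) only records the call; Parallel dispatches it to a worker
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))
print(results)
```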

2 Better serialization for better scaling
Serializing arbitrary Python objects: cloudpickle
e.g. dispatch estimators across the network
Python 3.8 improvements
Subclassable C pickler ⇒ much faster
Out-of-band serialization ⇒ no memory copies when serializing numpy & arrow (PEP 574)
Language-agnostic predictor representation: ONNX
sklearn-onnx can convert trained models to other runtimes
Working on guaranteeing compliance
Useful for deployment to production
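Out-of-band serialization can be sketched with the standard pickle module, assuming Python 3.8 or later and a numpy version that supports pickle protocol 5. The large array buffer is handed to buffer_callback instead of being copied into the pickle stream, so the stream itself stays tiny.

```python
import pickle

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# Protocol 5: large buffers are passed to the callback, not embedded in the stream
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The receiver needs both the small payload and the out-of-band buffers
restored = pickle.loads(payload, buffers=buffers)

print(len(payload), arr.nbytes, len(buffers))
```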

Scaling
Algorithmic improvement is top priority
External infrastructure helps scaling out
Tension with our mission of generic, reusable library
⇒ work on an impedance-matching layer

3 Bridging to data engineering

3 Data assembly for statistics
“Dirty data” is a central problem
Merging data sources
Input errors

3 Machine learning versus data in the wild
numbers (in arrays)
arrays (of numbers)
arrays
strings
databases
schemas
A gap between
statistics
&
data engineering

Machine learning
Let X ∈ R^{n×p}
or a numpy array
Real life: often a pandas dataframe
Gender  Date Hired  Employee Position Title
M       NA          Master Police Officer
F       09/12/1988  Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA          Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA          Library Assistant I

Machine learning
Let X ∈ R^{n×p}
or a numpy array
Real life, e.g. the row: M  NA  Bus Operator
Non-numerical, heterogeneous data
Missing values
Non-normalized entries

3 Ingesting heterogeneous data: the column transformer
Applies different transformers to columns
These can be complex pipelines
column_trans = compose.make_column_transformer(
    (one_hot_enc, ['Gender', 'Employee Position Title']),
    (date_trans, 'Date First Hired'),
)
X = column_trans.fit_transform(df)
Dataframe in, array out
with heterogeneous preprocessing & feature engineering
Separating fitting from transforming
Can be applied to new data
Avoids data leakage
Model selection on dataframes
model = make_pipeline(column_trans,
HistGradientBoostingClassifier())
scores = cross_val_score(model, df, y)
Choose data-engineering operations to maximize prediction

3 Machine learning with missing data
Imputation: replace NA by plausible values
Constant imputation
sklearn.impute.SimpleImputer
Replace by mean of feature
Conditional imputation (v0.21)
sklearn.impute.IterativeImputer
Features as functions of others
For prediction
If y depends on missingness, perfect imputation breaks prediction
⇒ add a missing indicator: IterativeImputer(add_indicator=True)
With constant imputation a powerful learner can model missing values
On the consistency of supervised learning with missing values, J Josse et al, arXiv 2019
NA in HistGradientBoosting v0.22
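A small sketch of both imputers with a missing indicator, on a hand-made array chosen for illustration. IterativeImputer is assumed to still require the experimental enabling import, as it does in the scikit-learn versions known at the time of writing.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Tiny array with one missing value in each column (illustrative only)
X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0], [4.0, np.nan]])

# Constant imputation: NA replaced by the column mean, plus indicator columns
mean_imputed = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

# Conditional imputation: each feature modeled as a function of the others
iter_imputed = IterativeImputer(add_indicator=True, random_state=0).fit_transform(X)

print(mean_imputed)
print(iter_imputed)
```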

3 Encoding dirty categories
Digression: not in scikit-learn
One-hot encoding
                    ...  Police Officer I  Police Oficer II  Police Officer II
Police Officer II   ...  0                 0                 1
Police Oficer II    ...  0                 1                 0
Police Officer I    ...  1                 0                 0
X ∈ R^{n×p}: p grows fast
new categories?
link categories?
Employee Position Title
master police officer
social worker III
Police Officer III
Social Worker II
Police Oficer II
Bus Operator
Bus Opperator
Electrician
Library Assistant I
Social Work IV
Library Manager
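The growth of p can be seen on a handful of entries taken from this slide. A minimal sketch using pandas.get_dummies as the one-hot encoder, which is a choice made for this example: every spelling variant gets its own column, and nothing links "Bus Operator" to "Bus Opperator".

```python
import pandas as pd

# Entries from the slide, typos included
titles = pd.Series([
    "Police Officer II",
    "Police Oficer II",
    "Police Officer I",
    "Bus Operator",
    "Bus Opperator",
])

# One column per distinct string: typos become unrelated categories
one_hot = pd.get_dummies(titles)
n_columns = one_hot.shape[1]
print(one_hot.columns.tolist())
```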

3 Encoding dirty categories
social worker III
Police Officer III
Social Worker II
Police Oficer II
...
Traditional view: data cleaning, feature engineering
Position        Rank
Police Officer  Master
Social Worker   III
Police Officer  II
Social Worker   II
Police Officer  III
...

3 Encoding dirty categories https://project.inria.fr/dirtydata
Similarity encoding
                    ...  Police Officer I  Police Oficer II  Police Officer II
Police Officer II   ...  0.9               0.8               1
Police Oficer II    ...  0.8               1                 0.9
Police Officer I    ...  1                 0.9               0.8
string_distance(Police Officer II, Police Oficer II)
https://dirty-cat.github.io
Categories: social worker III, Police Officer III, Social Worker II, Police Oficer II, ...
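The idea of similarity encoding can be sketched in a few lines. This is not the dirty-cat implementation: the character 3-gram Jaccard similarity and the helper names below are assumptions chosen for illustration. Each entry is encoded by its similarity to every reference category rather than by a 0/1 indicator.

```python
import numpy as np


def ngrams(text, n=3):
    # Set of character n-grams of the lower-cased, space-padded string
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    # Jaccard similarity between the n-gram sets, in [0, 1]
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(union)


def similarity_encode(values, prototypes, n=3):
    # One row per value, one column per reference category
    return np.array([[ngram_similarity(v, p, n) for p in prototypes] for v in values])


categories = ["Police Officer I", "Police Oficer II", "Police Officer II", "Bus Operator"]
encoded = similarity_encode(categories, categories)
print(np.round(encoded, 2))
```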

3 Encoding dirty categories https://project.inria.fr/dirtydata
Modeling substrings (digression: not in scikit-learn)
[Figure: categories (rows) encoded on feature names built from substrings (columns); column labels are truncated in the source: "ssistant, library", "uipment, operator", "ation, specialist", "worker, warehouse", "program, manager", "chanic, community", ", rescuer, rescue", "rrection, officer"]
Rows shown:
Legislative Analyst II
Legislative Attorney
Equipment Operator I
Transit Coordinator
Bus Operator
Senior Architect
Senior Engineer Technician
Financial Programs Manager
Capital Projects Manager
Mechanic Technician II
Master Police Officer
Police Sergeant
Categories: social worker III, Police Officer III, Social Worker II, Police Oficer II, ...

@GaelVaroquaux
Democratizing machine learning
Machine learning for everyone
– from beginner to expert
Agile development, good numerics, collaboration & user focus
Scalability via light coupling to infrastructure and ecosystem
Ongoing research on machine learning with dirty data
Sustainability: community + sponsors
