Democratizing machine learning:
perspective from scikit-learn
Gaël Varoquaux,
scikit
machine learning in Python
scikit-learn
From nerds to an industry standard
[Figure: number of monthly users, 2010 to 2018; y-axis from 200,000 to 800,000]
scikit-learn
We were not aiming for the enterprise, but rather for ourselves, scientists, and students
scikit-learn
Data-science for the many, not only the mighty
The news:
Data scientists: largest data processed (poll by KDnuggets)
[Figure: bar chart of the share of respondents (0% to 20%) per dataset-size bin, from less than 1 MB to over 100 PB]
huge = 10 to 100GB
Data scientists: largest data processed, per poll year
[Figure: the same bar chart (0% to 20% per dataset-size bin), one series per year: 2013, 2014, 2015, 2016, 2018]
no increase with time
Challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
1. Dirty data
2. Talent
3. Money
Databricks survey: 90% of organizations invest in AI, few succeed
Challenges reported:
98%: preparation and aggregation of large datasets
96%: data exploration and iterative model training
https://databricks.com/company/newsroom/press-releases/databricks-survey-gets-to-the-heart-of-the-ai-dilemma-nearly-90-of-organizations-investing-in-ai-very-few-succeeding
Democratizing machine learning
1 Building a toolkit for all
2 Tackling scalability
3 Bridging to data engineering
1 Building a toolkit for all
The scikit-learn story
1 Embracing the Python stack
Python
Interactive, easy General-purpose
Crucially, Python was made for embedding
⇒ simple virtual machine
(ref-counting garbage collection, no transactional memory)
numpy
[Figure: grid of digits depicting an array of numbers]
Numerical operations
Contiguous-memory model (float*)
Enables bridging across languages (eg for lapack), Cython
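The contiguous-memory point can be illustrated with a minimal sketch. The array values below are made up for illustration, and the snippet only uses standard numpy attributes. It shows that a numpy array is a flat buffer plus strides, which is what makes it possible to hand the data to compiled code such as LAPACK without copying.

```python
import numpy as np

# A 3x4 array of float64 lives in one flat block of 12 * 8 bytes.
a = np.arange(12, dtype=np.float64).reshape(3, 4)
is_contiguous = a.flags["C_CONTIGUOUS"]
strides = a.strides  # bytes to step to the next row, next column

# Slicing creates a view on the same buffer that is no longer contiguous.
view = a[:, ::2]
view_is_contiguous = view.flags["C_CONTIGUOUS"]

# Compiled routines that expect a plain float* need a contiguous copy.
packed = np.ascontiguousarray(view)

# np.linalg.solve delegates the actual work to LAPACK routines.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 4.0])
x = np.linalg.solve(A, b)
```
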
1 Focus on usability
API design
Grey box: all models interchangeable,
but still inspectable
Documentation & examples
Good documentation required to add a feature
Easy-to-understand examples guide API design
Teach statistical learning, rather than code
Models, solvers, hyperparameters
Choices that do not require tinkering
Lots of use-case-driven empirical testing
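A minimal sketch of the "grey box" idea, assuming the iris toy dataset and two arbitrarily chosen estimators (neither choice comes from the slides): every model is driven through the same fit / score calls, yet each one exposes its fitted parameters for inspection.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Interchangeable: the same two calls work for any estimator.
models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]
scores = {type(m).__name__: m.fit(X, y).score(X, y) for m in models}

# Inspectable: fitted attributes end with an underscore.
coef_shape = models[0].coef_.shape            # one row of weights per class
importances = models[1].feature_importances_  # one value per feature
```
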
1 Community-driven development
Our DNA: distributed development & decision making
Gave manpower
[Figure: number of monthly contributors, 2010 to 2018; y-axis from 0 to 50]
and the right focus
People fix & improve what’s
important to them
Open source has won
But it needs sustainability and investment
mid-2018: A foundation for scikit-learn
+ the community
2 Tackling scalability
A new challenge
2 Algorithmic improvements
PCA
Cost: O(n p min(n, p))
Randomized PCA (simplified intuitions)
1 loop: take a random fraction of the data
2 small PCA on that fraction
3 aggregate results via PCA across results
svd_solver=’auto’ Up to ×10 speedup
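As a rough sketch of how this looks in code, the snippet below compares the exact and the randomized solver of sklearn.decomposition.PCA on synthetic data. The data shape and the number of components are arbitrary assumptions, and the actual solver relies on random projections rather than literally on the simplified sub-sampling loop described above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data of rank 50 embedded in 200 dimensions (illustrative only).
X = rng.normal(size=(2000, 50)) @ rng.normal(size=(50, 200))

pca_full = PCA(n_components=5, svd_solver="full").fit(X)
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

# Both solvers should recover nearly the same leading variances.
ratio_full = pca_full.explained_variance_ratio_
ratio_rand = pca_rand.explained_variance_ratio_
```

With svd_solver='auto', scikit-learn picks between such solvers based on the data shape and the number of components requested.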
2 Algorithmic improvements
PCA aggregate across sub-samples ⇒ ×10
Logistic regression
Gradient descent on error measure: $w_{i+1} = w_i - \alpha \nabla_w f$
Large n = costly gradient computation
Full gradient: costly
Sub-sampling in gradient: finicky
Sub-sampling + noise reduction, solver='saga': fast & easy
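A minimal sketch of selecting this solver, assuming a synthetic classification problem (the dataset parameters are illustrative, not from the talk). Features are standardized first because stochastic solvers such as SAGA tend to converge faster on scaled data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated two-class problem (illustrative assumption).
X, y = make_classification(
    n_samples=2000, n_features=20, n_clusters_per_class=1,
    class_sep=2.0, random_state=0,
)

clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=1000, random_state=0),
)
clf.fit(X, y)
accuracy = clf.score(X, y)
```
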
2 Algorithmic improvements
PCA aggregate across sub-samples ⇒ ×10
Logistic regression sub-sampling + noise reduction
Gradient-boosted trees fit on sufficient summary
Succession of decision trees that enrich each other
Iteration 1 Iteration 2 Iteration 3
Speedup: bin data and compute histograms
HistGradientBoostingRegressor v0.21
catches up with XGBoost & LightGBM
2 Algorithmic improvements
PCA aggregate across sub-samples ⇒ ×10
Logistic regression sub-sampling + noise reduction
Gradient-boosted trees fit on sufficient summary
Fit on several subsamples / chunks
+ aggregation or variance reduction
Fit on summary statistics
2 Scaling out: parallel computing
Simple parallel computing schemes limiting data transfer
Data parallel: mostly in inner loops
Model parallel: for model selection, e.g. GridSearchCV
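Model-parallel selection can be sketched as follows, assuming the iris toy dataset and an arbitrary parameter grid (both are illustrative choices). Each combination of parameters and cross-validation fold is an independent fit, so the n_jobs argument can spread them over several workers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}
# 6 parameter combinations x 5 folds = 30 independent fits, run on 2 workers.
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=2)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```
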
Real-life machine-learning
2 Scaling out: parallel computing
Implementations (used by scikit-learn)
Inner loops (fast)
OS threads
OpenMP (from GCC, ICC, clang)
all in same process
Large-scale parallelism
Across Python VMs,
Across computers
Transfer
Synchronization
Real life = a merry mess: oversubscription, inefficient transfer
A scheduling problem. But: need a simple API to focus on algorithmics
scikit-learn is a library: doesn’t own the “main”
2 Our abstraction: joblib’s parallel for
joblib.Parallel()(joblib.delayed(f)(i) for i in ...)
lazy evaluation
Multiprocessing / loky backend
manages a pool of Python VMs, segfault resilient
lazy loop consumption to limit memory usage
auto-bunching dispatch to lower overhead
limits # threads in sub-process (threadpoolctl)
Extendable backend API (eg dask)
delegates scheduling (eg to a framework)
still a dispatch / receive queue
overflows the memory of greedy schedulers
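The parallel-for pattern above can be written out as a small runnable sketch. The function and the input range are placeholders chosen for illustration; the structure follows the joblib.Parallel / joblib.delayed idiom shown on the slide.

```python
from math import sqrt
from joblib import Parallel, delayed

# delayed(sqrt)(x) does not call sqrt; it records the call as a task.
# The generator is consumed lazily and tasks are dispatched in batches
# to 2 workers; results come back in input order.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
```

The backend that runs the tasks can be changed without touching this loop, which is the design choice that lets scheduling be delegated to an external framework.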
2 Better serialization for better scaling
Serializing arbitrary Python objects: cloudpickle
e.g. dispatch estimators across the network
Python 3.8 improvements:
Subclassable C persister ⇒ much faster
Out-of-band serialization (PEP 574) ⇒ no memory copies when serializing numpy & arrow
Language-agnostic predictor representation: ONNX
sklearn-onnx can convert trained models to other runtimes
Working on guaranteeing compliance
Useful for deployment to production
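Out-of-band serialization (PEP 574) can be sketched with the standard pickle module, assuming Python 3.8 or later and a numpy version that supports pickle protocol 5. The array size is an arbitrary illustration.

```python
import pickle
import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# With protocol 5, large buffers are handed to buffer_callback instead of
# being copied into the pickle byte stream.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The receiver rebuilds the array from the small payload plus the buffers,
# which a transport layer could have sent without intermediate copies.
restored = pickle.loads(payload, buffers=buffers)
```
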
Scaling
Algorithmic improvement is top priority
External infrastructure helps scaling out
Tension with our mission of generic, reusable library
⇒ work on impedance-matching layer
3 Bridging to data engineering
3 Data assembly for statistics
“Dirty data” is a central problem
Merging data sources
Input errors
3 Machine learning versus data in the wild
numbers (in arrays)
arrays (of numbers)
arrays
strings
databases
schemas
A gap between statistics & data engineering
3 Machine learning versus data in the wild
Machine learning
Let $X \in \mathbb{R}^{n \times p}$
or a numpy array
Real life often as pandas dataframe
Gender  Date Hired  Employee Position Title
M       NA          Master Police Officer
F       09/12/1988  Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA          Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA          Library Assistant I
Non-numerical, heterogeneous data
Missing values
Non-normalized entries
3 Ingesting heterogeneous data: the column transformer
Applies different transformers to columns
These can be complex pipelines
column_trans = compose.make_column_transformer(
    (one_hot_enc, ['Gender', 'Employee Position Title']),
    (date_trans, 'Date First Hired'),
)
X = column_trans.fit_transform(df)
Dataframe in, array out
with heterogeneous preprocessing & feature engineering
Separating fitting from transforming
Can be applied to new data
Avoids data leakage
Model selection on dataframes
model = make_pipeline(column_trans,
HistGradientBoostingClassifier())
scores = cross_val_score(model, df, y)
Choose data-engineering operations to maximize prediction
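A self-contained sketch of the column transformer follows. The tiny dataframe, the hire_year helper, and the way date_trans is built are all hypothetical stand-ins, since the slides do not define one_hot_enc or date_trans. It also shows fitting on one dataframe and transforming new rows with the same fitted object.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Date First Hired": ["09/12/1988", "07/16/2007", "02/05/2007", "01/13/2014"],
    "Employee Position Title": [
        "Social Worker IV", "Police Officer III", "Police Aide", "Electrician I",
    ],
})

def hire_year(dates):
    # Hypothetical date feature: keep only the year, as one numeric column.
    parsed = pd.to_datetime(dates.iloc[:, 0], format="%m/%d/%Y")
    return parsed.dt.year.to_numpy().reshape(-1, 1)

one_hot_enc = OneHotEncoder(handle_unknown="ignore")
date_trans = FunctionTransformer(hire_year)

column_trans = make_column_transformer(
    (one_hot_enc, ["Gender", "Employee Position Title"]),
    (date_trans, ["Date First Hired"]),
)
X = column_trans.fit_transform(df)  # dataframe in, array out

# The fitted transformer applies to new data; unseen titles encode as zeros.
new_rows = pd.DataFrame({
    "Gender": ["M"],
    "Date First Hired": ["03/01/2020"],
    "Employee Position Title": ["Bus Operator"],
})
X_new = column_trans.transform(new_rows)
```
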
3 Machine learning with missing data
Imputation: replace NA by plausible values
Constant imputation: sklearn.impute.SimpleImputer
Replace by mean of feature
Conditional imputation (v0.21): sklearn.impute.IterativeImputer
Features as functions of the others
For prediction
If y depends on missingness, perfect imputation breaks prediction
⇒ add a missing indicator: IterativeImputer(add_indicator=True)
With constant imputation a powerful learner can model missing values
On the consistency of supervised learning with missing values, J Josse et al, arXiv 2019
NA in HistGradientBoosting v0.22
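The two imputers and the missing indicator can be sketched on a toy array (the values are made up for illustration). IterativeImputer is imported through scikit-learn's experimental switch, which is an assumption about the installed version that holds for the releases known at the time of writing.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
    [5.0, 6.0],
])

# Mean imputation; add_indicator appends one 0/1 column per feature
# that had missing values at fit time.
X_mean = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

# Conditional imputation: each feature is modeled from the others.
X_iter = IterativeImputer(add_indicator=True, random_state=0).fit_transform(X)
```

The indicator columns let a downstream learner use the missingness itself as a signal, which matters when y depends on it.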
3 Encoding dirty categories
Digression: not in scikit-learn
One-hot encoding
                   ...  Police Officer I  Police Oficer II  Police Officer II
Police Officer II  ...         0                 0                  1
Police Oficer II   ...         0                 1                  0
Police Officer I   ...         1                 0                  0
$X \in \mathbb{R}^{n \times p}$: p grows fast
new categories?
link categories?
Employee Position Title
master police officer
social worker III
Police Officer III
Social Worker II
Police Oficer II
Bus Operator
Bus Opperator
Electrician
Library Assistant I
Social Work IV
Library Manager
Traditional view: data cleaning, feature engineering
Position        Rank
Police Officer  Master
Social Worker   III
Police Officer  II
Social Worker   II
Police Officer  III
...
3 Encoding dirty categories https://project.inria.fr/dirtydata
Digression: not in scikit-learn
Similarity encoding
                   ...  Police Officer I  Police Oficer II  Police Officer II
Police Officer II  ...        0.9               0.8                 1
Police Oficer II   ...        0.8                1                 0.9
Police Officer I   ...         1                0.9                0.8
string_distance(Police Officer II, Police Oficer II)
https://dirty-cat.github.io
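To make the idea concrete, here is a pure-Python sketch that replaces the 0/1 entries of one-hot encoding with a string similarity. The character 3-gram Jaccard similarity and every function name below are illustrative assumptions; this is not the dirty-cat implementation.

```python
def ngrams(text, n=3):
    # Character n-grams of the lower-cased string, padded with spaces.
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    # Jaccard similarity between the two n-gram sets, in [0, 1].
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)

def similarity_encode(values, prototypes, n=3):
    # One row per value, one column per reference category.
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]

prototypes = ["Police Officer I", "Police Oficer II", "Police Officer II"]
encoded = similarity_encode(["Police Officer II", "Bus Operator"], prototypes)
```

A misspelled entry then lands close to its correctly spelled neighbor instead of in an unrelated column, and the number of columns stays fixed when new categories appear.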
3 Encoding dirty categories https://project.inria.fr/dirtydata
Digression: not in scikit-learn
Modeling substrings
[Figure: categories against feature names built from substrings, such as "library", "operator", "specialist", "warehouse", "manager", "community", "rescue", "officer"]
Categories:
Legislative Analyst II
Legislative Attorney
Equipment Operator I
Transit Coordinator
Bus Operator
Senior Architect
Senior Engineer Technician
Financial Programs Manager
Capital Projects Manager
Mechanic Technician II
Master Police Officer
Police Sergeant
@GaelVaroquaux
Democratizing machine learning
Machine learning for everyone
– from beginner to expert
Agile development, good numerics, collaboration & user focus
Scalability via light coupling to infrastructure and ecosystem
Ongoing research on machine learning with dirty data
Sustainability: community + sponsors

 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Democratizing machine learning: perspective from scikit-learn

  • 1. Democratizing machine learning: perspective from scikit-learn Gaël Varoquaux, scikit machine learning in Python
  • 3. scikit-learn From nerds to an industry standard [Figure: number of monthly users, 2010 to 2018; y-axis ticks from 200,000 to 800,000]
  • 4. scikit-learn We were not aiming for the enterprise but rather ourselves scientists students
  • 5. scikit-learn Data-science for the many, not only the mighty The news:
  • 6. scikit-learn Data-science for the many, not only the mighty Data scientists: Largest data processed (poll by KDnuggets) [Figure: share of respondents, 0% to 20%, by largest dataset processed, in bins from less than 1 MB to over 100 PB] huge = 10 to 100GB
  • 7. scikit-learn Data-science for the many, not only the mighty Data scientists: [Figure: the same poll for 2013, 2014, 2015, 2016 and 2018] no increase with time
  • 8. Challenges to data science www.kaggle.com/ash316/novice-to-grandmaster 1. Dirty data 2. Talent 3. Money
  • 9. Challenges to data science www.kaggle.com/ash316/novice-to-grandmaster 1. Dirty data 2. Talent 3. Money Databricks survey: 90% of organizations invest in AI, few succeed. Challenges reported: 98%: preparation and aggregation of large datasets; 96%: data exploration and iterative model training https://databricks.com/company/newsroom/press-releases/databricks-survey-gets-to-the-heart-of-the-ai-dilemma-nearly-90-of-organizations-investing-in-ai-very-few-succeeding
  • 10. Democratizing machine learning 1 Building a toolkit for all 2 Tackling scalability 3 Bridging to data engineering
  • 11. 1 Building a toolkit for all The scikit-learn story
  • 12. 1 Embracing the Python stack Python Interactive, easy General-purpose Crucially, Python was made for embedding ⇒ simple virtual machine (ref-counting garbage collection, no transactional memory)
  • 13. 1 Embracing the Python stack Python Interactive, easy General-purpose Crucially, Python was made for embedding ⇒ simple virtual machine (ref-counting garbage collection, no transactional memory) numpy [Figure: array of numbers] Numerical operations Continuous-memory model (float*)
  • 14. 1 Embracing the Python stack Python Interactive, easy General-purpose Crucially, Python was made for embedding ⇒ simple virtual machine (ref-counting garbage collection, no transactional memory) numpy [Figure: array of numbers] Numerical operations Continuous-memory model (float*) Enables bridging across languages (eg for lapack), Cython
  • 15. 1 Focus on usability API design Grey box: all models interchangeable, but still inspectable Documentation & examples Good documentation required to add a feature Easy-to-understand examples guide API design Teach statistical learning, rather than code Models, solvers, hyperparameters Choices that do not require tinkering Lots of use-case-driven empirical testing
  • 16. 1 Community-driven development Our DNA: distributed development & decision making Gave manpower [Figure: number of monthly contributors, 2010 to 2018; y-axis 0 to 50] and the right focus People fix & improve what’s important to them
  • 17. 1 Community-driven development Our DNA: distributed development & decision making Gave manpower [Figure: number of monthly contributors, 2010 to 2018; y-axis 0 to 50] and the right focus People fix & improve what’s important to them Open source has won But it needs sustainability and investment
  • 18. 1 Community-driven development Our DNA: distributed development & decision making Gave manpower [Figure: number of monthly contributors, 2010 to 2018; y-axis 0 to 50] and the right focus People fix & improve what’s important to them Open source has won But it needs sustainability and investment mid-2018: A foundation for scikit-learn + the community
  • 19. 2 Tackling scalability A new challenge
  • 20. 2 Algorithmic improvements PCA Cost: n p min(n, p) Randomized PCA (simplified intuitions) 1. loop: take a random fraction of the data 2. small PCA on that fraction 3. aggregate results via PCA across results svd_solver='auto' Up to ×10 speedup
  • 21. 2 Algorithmic improvements PCA aggregate across sub-samples ⇒ ×10 Logistic regression Gradient descent on error measure: w_{i+1} = w_i + α ∇_w f Large n = costly gradient computation Full gradient: costly Sub-sampling in gradient: finicky Sub-sampling + noise reduction (solver='saga'): fast & easy
  • 22. 2 Algorithmic improvements PCA aggregate across sub-samples ⇒ ×10 Logistic regression sub-sampling + noise reduction Gradient-boosted trees fit on sufficient summary Succession of decision trees that enrich each other [Figure: iterations 1, 2 and 3] Speedup: bin data and compute histograms HistGradientBoostingRegressor (v0.21) catches up with XGBoost & LightGBM
  • 23. 2 Algorithmic improvements PCA aggregate across sub-samples ⇒ ×10 Logistic regression sub-sampling + noise reduction Gradient-boosted trees fit on sufficient summary Fit on several subsamples / chunks + aggregation or variance reduction Fit on summary statistics
  • 24. 2 Scaling out: parallel computing Simple parallel computing schemes limiting data transfer Data parallel [Figure: array of numbers] mostly in inner loops Model parallel [Figure: several copies of the array of numbers] for model selection eg GridSearchCV
  • 25. 2 Scaling out: parallel computing Simple parallel computing schemes limiting data transfer Data parallel [Figure: array of numbers] mostly in inner loops Model parallel [Figure: several copies of the array of numbers] for model selection eg GridSearchCV Real-life machine-learning [Figure: array of numbers]
  • 26. 2 Scaling out: parallel computing Implementations (used by scikit-learn) Inner loops (fast) OS threads OpenMP (from GCC, ICC, clang) all in same process Large-scale parallelism Across Python VMs, Across computers Transfer Synchronization [Figure: array of numbers]
  • 27. 2 Scaling out: parallel computing Implementations (used by scikit-learn) Inner loops (fast) OS threads OpenMP (from GCC, ICC, clang) all in same process Large-scale parallelism Across Python VMs, Across computers Transfer Synchronization Real life = A merry mess: oversubscription, inefficient transfer A scheduling problem But: need simple API to focus on algorithmics scikit-learn is a library: doesn’t own the “main”
  • 28. 2 Our abstraction: joblib’s parallel for joblib.Parallel()(joblib.delayed(f)(i) for i in ...) lazy evaluation Multiprocessing / loky backend manages a pool of Python VMs segfault resilient lazy loop consumption to limit memory usage auto-bunching dispatch to lower overhead limits # threads in sub-process (threadpoolctl)
  • 29. 2 Our abstraction: joblib’s parallel for joblib.Parallel()(joblib.delayed(f)(i) for i in ...) lazy evaluation Multiprocessing / loky backend manages a pool of Python VMs segfault resilient lazy loop consumption to limit memory usage auto-bunching dispatch to lower overhead limits # threads in sub-process (threadpoolctl) Extendable backend API (eg dask) delegates scheduling (eg to a framework) still a dispatch / receive queue overflows the memory of greedy schedulers
  • 30. 2 Better serialization for better scaling Serializing arbitrary Python objects cloudpickle eg dispatch estimators across the network Python 3.8 improvements Subclassable C persister ⇒ Much faster Out of band serialization ⇒ no memory copies when serializing numpy & arrow PEP 574
  • 31. 2 Better serialization for better scaling Serializing arbitrary Python objects cloudpickle eg dispatch estimators across the network Python 3.8 improvements Subclassable C persister ⇒ Much faster Out of band serialization ⇒ no memory copies when serializing numpy & arrow PEP 574 Language-agnostic predictor representation ONNX sklearn-onnx can convert trained models to other runtimes Working on guaranteeing compliance Useful for deployment to production
  • 32. Scaling Algorithmic improvement is top priority External infrastructure helps scaling out Tension with our mission of generic, reusable library ⇒ work on impedance matching layer
  • 33. 3 Bridging to data engineering
  • 34. 3 Data assembly for statistics “Dirty data” is a central problem Merging data sources Input errors
  • 35. 3 Machine learning versus data in the wild numbers (in arrays) arrays (of numbers) arrays strings databases schemas A gap between statistics & data engineering
  • 36. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array
  • 37. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array Real life often as pandas dataframe (columns: Gender | Date Hired | Employee Position Title) M | NA | Master Police Officer; F | 09/12/1988 | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I
  • 38. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array Real life often as pandas dataframe (columns: Gender | Date Hired | Employee Position Title) M | NA | Master Police Officer; F | 09/12/1988 | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I Non-numerical, heterogeneous data
  • 39. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array Real life often as pandas dataframe (columns: Gender | Date Hired | Employee Position Title) M | NA | Master Police Officer; F | 09/12/1988 | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I Non-numerical, heterogeneous data Missing values
  • 40. 3 Machine learning versus data in the wild Machine learning Let X ∈ R^{n×p} or a numpy array Real life often as pandas dataframe (columns: Gender | Date Hired | Employee Position Title) M | NA | Master Police Officer; F | 09/12/1988 | Social Worker IV; M | 07/16/2007 | Police Officer III; F | 02/05/2007 | Police Aide; M | 01/13/2014 | Electrician I; M | 04/28/2002 | Bus Operator; M | NA | Bus Operator; F | 06/26/2006 | Social Worker III; F | 01/26/2000 | Library Assistant I; M | NA | Library Assistant I Non-numerical, heterogeneous data Missing values Non-normalized entries
  • 41. 3 Ingesting heterogeneous data: the column transformer Applies different transformers to columns These can be complex pipelines column_trans = compose.make_column_transformer((one_hot_enc, ['Gender', 'Employee Position Title']), (date_trans, 'Date First Hired'),) X = column_trans.fit_transform(df) Dataframe in, array out with heterogeneous preprocessing & feature engineering
  • 42. 3 Ingesting heterogeneous data: the column transformer Applies different transformers to columns These can be complex pipelines column_trans = compose.make_column_transformer((one_hot_enc, ['Gender', 'Employee Position Title']), (date_trans, 'Date First Hired'),) X = column_trans.fit_transform(df) Dataframe in, array out with heterogeneous preprocessing & feature engineering Separating fitting from transforming Can be applied to new data Avoids data leakage Model selection on dataframes model = make_pipeline(column_trans, HistGradientBoostingClassifier()) scores = cross_val_score(model, df, y) Choose data-engineering operations to maximize prediction
  • 43. 3 Machine learning with missing data Imputation replace NA by plausible values Constant imputation sklearn.impute.SimpleImputer Replace by mean of feature Conditional imputation v0.21 sklearn.impute.IterativeImputer Features as functions of others
  • 44. 3 Machine learning with missing data Imputation replace NA by plausible values Constant imputation sklearn.impute.SimpleImputer Replace by mean of feature Conditional imputation v0.21 sklearn.impute.IterativeImputer Features as functions of others For prediction If y depends on missingness, perfect imputation breaks prediction ⇒ add a missing indicator: IterativeImputer(add_indicator=True) With constant imputation a powerful learner can model missing values On the consistency of supervised learning with missing values, J Josse et al, arXiv 2019 NA in HistGradientBoosting v0.22
  • 45. 3 Encoding dirty categories Digression: not in scikit-learn One-hot encoding (columns: Police Officer I | Police Oficer II | Police Officer II) Police Officer II → 0 0 1; Police Oficer II → 0 1 0; Police Officer I → 1 0 0. X ∈ R^{n×p}: p grows fast, new categories? link categories? Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, Bus Operator, Bus Opperator, Electrician, Library Assistant I, Social Work IV, Library Manager
  • 46. 3 Encoding dirty categories Digression: not in scikit-learn Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ... Traditional view: Data cleaning, feature engineering (columns: Position | Rank) Police Officer | Master; Social Worker | III; Police Officer | II; Social Worker | II; Police Officer | III; ...
  • 47. 3 Encoding dirty categories https://project.inria.fr/dirtydata Digression: not in scikit-learn Similarity encoding (columns: Police Officer I | Police Oficer II | Police Officer II) Police Officer II → 0.9 0.8 1; Police Oficer II → 0.8 1 0.9; Police Officer I → 1 0.9 0.8. string_distance(Police Officer II, Police Oficer II) https://dirty-cat.github.io Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ...
  • 48. 3 Encoding dirty categories https://project.inria.fr/dirtydata Digression: not in scikit-learn Modeling substrings [Figure: categories (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) against feature names built from substrings; the feature labels are truncated in the extraction: "ssistant, library", "uipment, operator", "ation, specialist", "worker, warehouse", "program, manager", "chanic, community", ", rescuer, rescue", "rrection, officer"] Employee Position Title: master police officer, social worker III, Police Officer III, Social Worker II, Police Oficer II, ...
  • 49. @GaelVaroquaux Democratizing machine learning Machine learning for everyone – from beginner to expert Agile development, good numerics, collaboration & user focus Scalability via light coupling to infrastructure and ecosystem Ongoing research on machine learning with dirty data Sustainability: community + sponsors
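
Slide 20 describes randomized PCA and the `svd_solver` option. The sketch below is a minimal illustration of that option, assuming NumPy and scikit-learn are installed. The synthetic low-rank data, the helper name `compare_pca_solvers` and all sizes are assumptions made for this example and do not come from the talk.

```python
import numpy as np
from sklearn.decomposition import PCA


def compare_pca_solvers(n_samples=500, n_features=50, rank=5, seed=0):
    """Fit PCA with the exact and the randomized solver on toy data."""
    rng = np.random.default_rng(seed)
    # Hypothetical data: a low-rank signal plus a little noise.
    signal = rng.normal(size=(n_samples, rank)) @ rng.normal(size=(rank, n_features))
    X = signal + 0.01 * rng.normal(size=(n_samples, n_features))

    full = PCA(n_components=rank, svd_solver="full").fit(X)
    randomized = PCA(
        n_components=rank, svd_solver="randomized", random_state=seed
    ).fit(X)
    # Variance captured by each of the leading components, per solver.
    return full.explained_variance_, randomized.explained_variance_


if __name__ == "__main__":
    exact, approx = compare_pca_solvers()
    print("exact     :", np.round(exact, 2))
    print("randomized:", np.round(approx, 2))
```

On data with a clear low-rank structure the two solvers should agree closely on the leading components; the speed benefit of the randomized solver only shows on much larger matrices than this toy one.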
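
The parallel-for expression quoted on slide 28, `joblib.Parallel()(joblib.delayed(f)(i) for i in ...)`, can be sketched as a complete script. This assumes joblib is installed; the task function `square`, the wrapper `run_squares` and the choice of `n_jobs=2` are hypothetical and only serve the illustration.

```python
from joblib import Parallel, delayed


def square(i):
    """Toy task standing in for an expensive computation."""
    return i * i


def run_squares(n, n_jobs=2):
    # delayed(square)(i) records the call without running it. Parallel
    # consumes the generator and dispatches the recorded calls to workers,
    # returning the results in the order the tasks were submitted.
    return Parallel(n_jobs=n_jobs)(delayed(square)(i) for i in range(n))


if __name__ == "__main__":
    print(run_squares(5))
```

Because the loop body is only a generator of recorded calls, the same code runs unchanged when `n_jobs` or the backend changes, which is the "simple API" the slides ask for.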
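
Slide 44 recommends adding a missing indicator alongside imputation. A minimal sketch of that idea follows, assuming NumPy and scikit-learn 0.21 or later. It uses `SimpleImputer` instead of the slide's `IterativeImputer` to keep the example short, and the toy matrix is invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix; np.nan marks the missing entries.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0],
])

# Mean imputation, plus one binary "was missing" column for each feature
# that had missing values when the imputer was fitted.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The output keeps the two imputed features and appends the indicator columns, so a downstream learner can still tell which values were originally missing.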
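
The similarity encoding of slide 47 replaces the 0/1 entries of one-hot encoding with string similarities. The sketch below illustrates the principle with a character 3-gram Jaccard similarity, using only the standard library. It is not the dirty-cat implementation: the function names `ngrams`, `ngram_similarity` and `similarity_encode` and the choice of similarity measure are assumptions for this example.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its similarity to every prototype category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


if __name__ == "__main__":
    prototypes = ["Police Officer I", "Police Oficer II", "Police Officer II"]
    values = ["Police Officer II", "Police Oficer II", "Bus Operator"]
    for value, row in zip(values, similarity_encode(values, prototypes)):
        print(value, [round(s, 2) for s in row])
```

A misspelled category then lands close to its intended category instead of opening an unrelated new column, which addresses the "p grows fast" and "link categories?" questions raised on slide 45.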