3. "Move data out of people's labs and minds, and into online tools that
help us structure and reuse the information in new ways" - M. Nielsen
Designed serendipity
Broadcast data, so people may use it in unexpected ways
Broadcast questions, combine many minds to solve them
Somewhere, someone has just the right data or ideas to help
Requires frictionless collaboration
Open, accessible, organized data (collective working memory)
Contribute new data/ideas in seconds, not days
NETWORKED SCIENCE
4. DEMOCRATIZE MACHINE LEARNING
Machine Learning: far from frictionless
Lots of manual work, trial-and-error:
• How much data do I need, and which?
• How should I clean the data?
• How should I represent the data (features)?
• How should I preprocess the data?
• Which algorithms should I use?
• How should I tune all algorithm parameters?
• How do I properly evaluate the models?
• What if the data evolves over time?
5. DEMOCRATIZE MACHINE LEARNING
Empower anybody to do data science well
Provide easy access to reproducible data/code/models
Build easily on prior work:
Similar datasets, promising pipelines/models, prior evaluations,…
Allow frictionless collaboration
Crowdsource hard problems, share results easily (automatically)
Organized, reproducible, open results (trust, reputation)
Simplify and speed up the data science process (AutoML)
Automate drudge work (e.g. hyperparameter optimization)
Learn from the past to build ever better algorithms and models:
Meta-learning, pipeline generation, neural architecture search,…
7. WHAT IF WE COULD ANALYSE DATA COLLABORATIVELY
8. WHAT IF WE COULD ANALYSE DATA COLLABORATIVELY, ON WEB SCALE
9. WHAT IF WE COULD ANALYSE DATA COLLABORATIVELY, ON WEB SCALE, IN REAL TIME
10. OpenML: Frictionless, collaborative,
and automated machine learning
Build ML models using many open-source libraries
Run locally (or wherever), auto-upload all results
Auto-annotated, organized, well-formatted datasets.
Bring your own data for easier analysis
Auto-generate clear (machine-readable) tasks
What are your goals, evaluation metrics?
Find out what works (and what doesn’t)
Write scripts (bots) that automatically find best models
(with or without a PhD)
14. Data can live in existing repositories (e.g. Zenodo)
-> API allows integrations (auto-sharing, DOIs,…)
-> registered via URL, transparent to users
Interoperability: limited data formats: ARFF (CSV + header)
-> CSV importer (auto-annotates features)
-> working on FrictionlessData support
15. Tasks contain data, goals, and procedures.
Auto-build and evaluate models correctly, so you can focus on the science.
Example task: predict target T, optimize accuracy, 10-fold cross-validation.
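A task like the one above can be sketched as a small machine-readable record. This is a minimal Python sketch: the field names are illustrative assumptions, not the exact OpenML schema.

```python
# A machine-readable OpenML-style task, sketched as a plain dictionary.
# Field names here are illustrative, not the exact OpenML schema.
task = {
    "task_type": "supervised classification",
    "dataset": "credit-g",                # which data to use
    "target": "class",                    # predict target T
    "evaluation_measure": "accuracy",     # what to optimize
    "estimation_procedure": "10-fold crossvalidation",  # how to evaluate
}

def describe(task):
    """Render the task as a one-line, human-readable summary."""
    return (f"Predict '{task['target']}' on '{task['dataset']}', "
            f"optimize {task['evaluation_measure']} "
            f"({task['estimation_procedure']})")
```

Because the task fixes the goal and the evaluation procedure, any two runs of the same task are directly comparable.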
16. Collaborate in real time online
25. OPENML API: REST (XML/JSON), Python, R, Java, …
• Get Data, Flows, Models, Evaluations
• Download data, meta-data and run files
• Upload new data, flows, runs
• Upload/download optimization traces
• Upload/download meta-models
• Build AutoML bots that run against OpenML
Interact with OpenML any way you want
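As a sketch of how a client might address the REST API's JSON flavor, the helper below builds entity URLs. The base URL follows the public v1 API; treat the exact endpoint paths as assumptions to verify against the official API docs.

```python
from urllib.parse import urljoin

# Sketch of addressing the OpenML REST API (JSON flavor).
# Endpoint paths are assumptions based on the public v1 API.
BASE = "https://www.openml.org/api/v1/json/"

def endpoint(kind, entity_id):
    """Build the URL for fetching one entity (data, task, flow, run)."""
    if kind not in {"data", "task", "flow", "run"}:
        raise ValueError(f"unknown entity kind: {kind}")
    return urljoin(BASE, f"{kind}/{entity_id}")
```

For example, `endpoint("data", 61)` yields `https://www.openml.org/api/v1/json/data/61`; language bindings (Python, R, Java) wrap calls like this behind typed functions.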
26. Learning to learn
Bots that learn from all prior experiments
Automate drudge work, help people build models
27. Join us! (and change the world)
OpenML
Active open source community
Regular hackathons
We need bright people
- ML experts
- Developers
- UX
28. AUTOMATIC MACHINE LEARNING
Can algorithms be trained to automatically build end-to-end machine learning systems?
Use machine learning to do better machine learning.
Can we turn
Solution = data + manual exploration + computation
into
Solution = data + computation (x100)?
29. EXPLOSION OF COMMERCIAL INTEREST
(examples include IBM and Microsoft Azure)
34. EXPLOSION OF FUNDAMENTAL RESEARCH, HIGH-QUALITY OPEN SOURCE TOOLS
We'll focus on these (and the concepts behind them)
35. AUTOMATIC MACHINE LEARNING
Not about automating data scientists
• Efficient exploration of techniques
• Automate the tedious aspects (inner loop)
• Make every data scientist a super data scientist
• Democratisation
• Allow individuals, small companies to use machine
learning effectively (at lower cost)
• Open source tools and platforms
• Data Science
• Better understand algorithms, develop better ones
• Self-learning algorithms
36. MACHINE LEARNING PIPELINES
43. BLACK-BOX OPTIMIZATION
• Typically, the AutoML problem is represented as a large n-dimensional (Euclidean) search space: a configuration is a point in that space
45. HYPERPARAMETER SPACE: Grid search
46. HYPERPARAMETER SPACE: Random search
47. HYPERPARAMETER SPACE: Guided (model-based) search
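To make the first two strategies concrete, here is a stdlib-only sketch comparing grid and random search on a toy objective. The objective, parameter names, and value ranges are all made up for illustration; in practice the score would come from cross-validating a model.

```python
import itertools, random

# Toy objective over two hyperparameters; stands in for
# cross-validated accuracy. Best point is (lr=0.3, reg=0.7).
def score(lr, reg):
    return -((lr - 0.3) ** 2 + (reg - 0.7) ** 2)

def grid_search(values_lr, values_reg):
    """Exhaustively score every point on a fixed grid."""
    return max(itertools.product(values_lr, values_reg),
               key=lambda cfg: score(*cfg))

def random_search(n, rng):
    """Score n configurations sampled uniformly at random."""
    cands = [(rng.random(), rng.random()) for _ in range(n)]
    return max(cands, key=lambda cfg: score(*cfg))

rng = random.Random(0)
best_grid = grid_search([0.0, 0.25, 0.5, 0.75, 1.0],
                        [0.0, 0.25, 0.5, 0.75, 1.0])
best_rand = random_search(25, rng)   # same budget: 25 evaluations
```

With the same budget, random search is not locked to the grid's resolution, which is why it often wins when only a few hyperparameters really matter.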
48. BAYESIAN OPTIMIZATION
• Start with a few (random) hyperparameter configurations
• Build a surrogate model to predict how well other configurations will work (probabilistic regression)
• Most popular: Gaussian processes, random forests
• To avoid a greedy search, use an acquisition function to trade off exploration and exploitation
• The next configuration to evaluate is the best one under that function
49. BAYESIAN OPTIMIZATION (continued)
• Repeat until a stopping criterion is met:
• Fixed budget (time, evaluations)
• Minimum distance between configurations
• Threshold on the acquisition function
• …
• Overfits quite easily
(animation: https://raw.githubusercontent.com/mlr-org/mlrMBO/master/docs/articles/helpers/animation-.gif)
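The loop described on the last two slides can be sketched in a few lines. This is a deliberately crude stand-in: a real implementation would fit a Gaussian process or random forest surrogate, whereas here the "surrogate" predicts the value of the nearest evaluated point and uses its distance as the uncertainty in an upper-confidence-bound acquisition.

```python
import random

# Minimal Bayesian-optimization-style loop over one hyperparameter.
def objective(x):              # stands in for a cross-validated score
    return -(x - 0.3) ** 2

def ucb(x, evaluated, kappa=2.0):
    """Acquisition: predicted mean plus kappa times 'uncertainty'.
    Toy surrogate: mean = value of nearest evaluated point,
    uncertainty = distance to it."""
    nearest_x, nearest_y = min(evaluated, key=lambda p: abs(p[0] - x))
    return nearest_y + kappa * abs(nearest_x - x)

rng = random.Random(42)
candidates = [i / 100 for i in range(101)]
# Start with a few random configurations
evaluated = [(x, objective(x)) for x in rng.sample(candidates, 3)]

for _ in range(10):                  # fixed budget as stopping criterion
    tried = {x for x, _ in evaluated}
    x_next = max((x for x in candidates if x not in tried),
                 key=lambda x: ucb(x, evaluated))
    evaluated.append((x_next, objective(x_next)))

best_x, best_y = max(evaluated, key=lambda p: p[1])
```

Early on the distance term dominates and the loop space-fills; as gaps shrink, it exploits the region around the best observed value.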
50. SURROGATE MODELS
• Lots of research into surrogate models:
• Gaussian processes: extrapolate well, but not scalable
• Sparse GPs (select 'inducing points' that minimize information loss)
• Multi-task GPs (transfer surrogate models from other tasks)
• Random forests: more scalable, but extrapolate badly
• Bayesian neural networks (expensive, sensitive to hyperparameters)
• …
51. SURROGATE MODELS
• SMAC: random forests (with random configurations in between); plots show minimization
52. SURROGATE MODELS
• After 12 iterations: extrapolation remains an issue
53. GENETIC PROGRAMMING (Randall Olson et al., GECCO 2016)
55. SPEEDING UP EVALUATIONS
Testing on a smaller sample of the data is faster and may give a good estimate of the final performance (or it may not).
56. SPEEDING UP EVALUATIONS
Hyperband: based on 'successive halving'
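Successive halving, the routine Hyperband builds on, is simple to sketch: give every configuration a small budget, keep the better half, double the budget, and repeat. The "learning curve" below is faked for illustration; real use would partially train models.

```python
import math

def validation_score(config, budget):
    """Pretend score: each config approaches its own ceiling as
    the budget grows (a fake, illustrative learning curve)."""
    return config["ceiling"] * (1 - math.exp(-budget / 10))

def successive_halving(configs, min_budget=1):
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs,
                        key=lambda c: validation_score(c, budget),
                        reverse=True)
        configs = scored[: max(1, len(scored) // 2)]  # keep better half
        budget *= 2                                   # double the budget
    return configs[0]

configs = [{"name": f"c{i}", "ceiling": i / 10} for i in range(8)]
winner = successive_halving(configs)
```

Hyperband then runs several such brackets with different trade-offs between the number of configurations and the starting budget, hedging against curves that cross late.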
58. SPEEDING UP EVALUATIONS
DAUB: computes upper bounds
59. META-LEARNING
• Learn across datasets
• Warm-starting: start with configurations/pipelines/architectures that worked well on similar datasets
• Meta-models:
• Describe datasets with characteristics (metafeatures)
• Predict performance, runtime,… for new datasets
• Active testing:
• Don't use metafeatures; consider datasets similar if algorithms behave similarly on them
• Learn dataset similarity and try promising configurations at the same time
• Transfer learning: reuse surrogate models from similar datasets
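Warm-starting can be sketched with a nearest-neighbor lookup over metafeatures. The metafeatures (rows, features, classes), distance, and stored "best configurations" below are made-up illustrations of the idea, not real OpenML meta-data.

```python
import math

# Made-up prior knowledge: metafeatures = (n_rows, n_features, n_classes)
# and the best-known configuration found on each dataset.
PRIOR = {
    "dataset_a": {"metafeatures": (1000, 20, 2),
                  "best_config": {"C": 1.0}},
    "dataset_b": {"metafeatures": (50000, 300, 10),
                  "best_config": {"C": 0.01}},
}

def distance(mf1, mf2):
    # log-scale the counts so raw dataset size doesn't dominate
    return math.dist([math.log1p(v) for v in mf1],
                     [math.log1p(v) for v in mf2])

def warm_start(new_metafeatures):
    """Return the best-known config of the most similar prior dataset."""
    nearest = min(PRIOR.values(),
                  key=lambda d: distance(d["metafeatures"],
                                         new_metafeatures))
    return nearest["best_config"]
```

A new small dataset (say 900 rows, 15 features, 2 classes) lands nearest `dataset_a`, so the search would start from `{"C": 1.0}` instead of from scratch.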
60. META-LEARNING + OPENML
• Find similar datasets
• 20,000+ datasets, each with 130+ meta-features
• Reuse (millions of) prior model evaluations:
• Meta-models: E.g. predict performance or training time
• Reuse results on many hyperparameter settings
• Surrogate models: predict best hyperparameter settings
• Model the effect/importance of hyperparameters
• Deeper understanding of algorithm performance:
• WHY models perform well (or not)
• How is performance related to dataset properties
• Never-ending AutoML:
• AutoML methods built on top of OpenML get
increasingly better as more meta-data is added
61. META-LEARNING: WARM-STARTING
Auto-sklearn: find similar datasets on OpenML and start with their best configurations (Matthias Feurer et al. (2016) NIPS)
Adaptive Bayesian linear regression:
- build surrogate models from prior data on many OpenML tasks
- coupled through a neural net trained on evaluations on the new dataset
(Valerio Perrone et al. (2017) MetaLearning@NIPS)
62. META-LEARNING: HYPERPARAMETER IMPORTANCE
SVM hyperparameter importance according to OpenML (Jan van Rijn et al. (2017) AutoML@ICML)
Many other possibilities:
- Learn which hyperparameters should be tuned at all
- Learn good value ranges
- Learn good default values
63. META-LEARNING: PROFILING
Track the score while adding:
- irrelevant features (dotted)
- redundant features (solid)
Findings:
- Redundant features hurt (collinearity): SVMs, naive Bayes
- Random forests and boosting are very sensitive to both
- Tuning hyperparameters doesn't help -> select features
64. META-LEARNING: PROFILING
Add white noise to the features:
- Random forests, boosting, naive Bayes are very sensitive
- kNN and SVMs much less so
- Hyperparameter tuning helps (regularization)!
65. META-LEARNING: PROFILING
Add noise to the labels:
- All algorithms are hurt
- Naive Bayes and kNN are slightly more robust
- Tuning doesn't help
66. META-LEARNING: RUNTIME PREDICTION
• 100+ dataset metafeatures and 100,000+ experiment runtimes
• Build meta-models (random forests work well)
• Some algorithms are more predictable than others
67. AUTO-NOTEBOOKS
Automatically generate notebooks with interesting analyses for any OpenML dataset (in progress)
68. REAL-WORLD EXAMPLE: DRUG DISCOVERY
• Predict how well a drug inhibits protein function (and the diseases that depend on it)
• Problem: lab work (data) is expensive
• Solution:
• use all available data on all available protein targets
• use meta-learning to learn how to build optimal models for specific proteins
69. • Auto-builds significantly better pipelines/models (Olier et al., Machine Learning 107(1), 2018)
• When data is scarce, build better models by transferring from genetically similar targets (Sadawi et al., ChemInformatics, under review)
• Use trained models to build better molecule representations
• Collaboration with Exscientia Ltd (partner of GlaxoSmithKline)
ChEMBL DB: 1.4M compounds, 10k proteins, 12.8M activities
Molecule representations (excerpt):
MW      LogP    TPSA    b1 b2 b3 b4 b5 b6 b7 b8 b9
377.435  3.883   77.85  1  1  0  0  0  0  0  0  0
341.361  3.411   74.73  1  1  0  1  0  0  0  0  0  …
197.188 -2.089  103.78  1  1  0  1  0  0  0  1  0
346.813  4.705   50.70  1  0  0  1  0  0  0  0  0
(workflow: 16,000 regression datasets x 52 pipelines on OpenML -> meta-model -> all data on new protein -> optimal models to predict activity)