3. "Move data out of people's labs and minds, and into online tools that
help us structure and reuse the information in new ways" - M. Nielsen
Designed serendipity
Broadcast data, so people may use it in unexpected ways
Broadcast questions, combine many minds to solve them
Somewhere, someone has just the right data or ideas to help
Requires frictionless collaboration
Open, accessible, organized data (collective working memory)
Contribute new data/ideas in seconds, not days
NETWORKED SCIENCE
4. DEMOCRATIZE MACHINE LEARNING
Machine Learning: far from frictionless
Lots of manual work, trial-and-error:
• How much data do I need, and which?
• How should I clean the data?
• How should I represent the data (features)?
• How should I preprocess the data?
• Which algorithms should I use?
• How should I tune all algorithm parameters?
• How do I properly evaluate the models?
• What if the data evolves over time?
5. DEMOCRATIZE MACHINE LEARNING
Empower anybody to do data science well
Provide easy access to reproducible data/code/models
Build easily on prior work:
Similar datasets, promising pipelines/models, prior evaluations,…
Allow frictionless collaboration
Crowdsource hard problems, share results easily (automatically)
Organized, reproducible, open results (trust, reputation)
Simplify and speed up the data science process (AutoML)
Automate drudge work (e.g. hyperparameter optimization)
Learn from the past to build ever better algorithms and models:
Meta-learning, pipeline generation, neural architecture search,…
7. WHAT IF WE COULD ANALYSE DATA COLLABORATIVELY
8. WHAT IF WE COULD ANALYSE DATA COLLABORATIVELY, ON WEB SCALE
9. WHAT IF WE COULD ANALYSE DATA COLLABORATIVELY, ON WEB SCALE, IN REAL TIME
10. OpenML: Frictionless, collaborative,
and automated machine learning
Build ML models using many open-source libraries
Run locally (or wherever), auto-upload all results
Auto-annotated, organized, well-formatted datasets.
Bring your own data for easier analysis
Auto-generate clear (machine-readable) tasks
What are your goals, evaluation metrics?
Find out what works (and what doesn’t)
Write scripts (bots) that automatically find best models
(with or without a PhD)
14. Data can live in existing repositories (e.g. Zenodo)
-> API allows integrations (auto-sharing, DOIs,…)
-> registered via URL, transparent to users
Interoperability: limited data formats: ARFF (CSV + header)
-> CSV importer (auto-annotates features)
-> working on FrictionlessData support
15. Tasks contain data, goals, and procedures.
Auto-build and evaluate models correctly, so you can focus on the science.
Example task: predict target T, optimize accuracy, 10-fold cross-validation.
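A task like the one above can be sketched as a small machine-readable record. This is a minimal Python sketch: the field names are illustrative assumptions, not the exact OpenML schema.

```python
# A machine-readable OpenML-style task, sketched as a plain dictionary.
# Field names here are illustrative, not the exact OpenML schema.
task = {
    "task_type": "supervised classification",
    "dataset": "credit-g",                # which data to use
    "target": "class",                    # predict target T
    "evaluation_measure": "accuracy",     # what to optimize
    "estimation_procedure": "10-fold crossvalidation",  # how to evaluate
}

def describe(task):
    """Render the task as a one-line, human-readable summary."""
    return (f"Predict '{task['target']}' on '{task['dataset']}', "
            f"optimize {task['evaluation_measure']} "
            f"({task['estimation_procedure']})")
```

Because the task fixes the goal and the evaluation procedure, any two runs of the same task are directly comparable.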
16. Collaborate in real time online
25. OPENML API: REST (XML/JSON), Python, R, Java, …
• Get Data, Flows, Models, Evaluations
• Download data, meta-data and run files
• Upload new data, flows, runs
• Upload/download optimization traces
• Upload/download meta-models
• Build AutoML bots that run against OpenML
Interact with OpenML any way you want
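As a sketch of how a client might address the REST API's JSON flavor, the helper below builds entity URLs. The base URL follows the public v1 API; treat the exact endpoint paths as assumptions to verify against the official API docs.

```python
from urllib.parse import urljoin

# Sketch of addressing the OpenML REST API (JSON flavor).
# Endpoint paths are assumptions based on the public v1 API.
BASE = "https://www.openml.org/api/v1/json/"

def endpoint(kind, entity_id):
    """Build the URL for fetching one entity (data, task, flow, run)."""
    if kind not in {"data", "task", "flow", "run"}:
        raise ValueError(f"unknown entity kind: {kind}")
    return urljoin(BASE, f"{kind}/{entity_id}")
```

For example, `endpoint("data", 61)` yields `https://www.openml.org/api/v1/json/data/61`; language bindings (Python, R, Java) wrap calls like this behind typed functions.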
26. Learning to learn
Bots that learn from all prior experiments
Automate drudge work, help people build models
27. Join us! (and change the world)
OpenML
Active open source community
Regular hackathons
We need bright people
- ML experts
- Developers
- UX
28. AUTOMATIC MACHINE LEARNING
Can algorithms be trained to automatically build end-to-end machine learning systems?
Use machine learning to do better machine learning.
Can we turn
Solution = data + manual exploration + computation
into
Solution = data + computation (x100)?
29. EXPLOSION OF COMMERCIAL INTEREST
(examples include IBM and Microsoft Azure)
34. EXPLOSION OF FUNDAMENTAL RESEARCH, HIGH-QUALITY OPEN SOURCE TOOLS
We'll focus on these (and the concepts behind them)
35. AUTOMATIC MACHINE LEARNING
Not about automating data scientists
• Efficient exploration of techniques
• Automate the tedious aspects (inner loop)
• Make every data scientist a super data scientist
• Democratisation
• Allow individuals, small companies to use machine
learning effectively (at lower cost)
• Open source tools and platforms
• Data Science
• Better understand algorithms, develop better ones
• Self-learning algorithms
36. MACHINE LEARNING PIPELINES
43. BLACK-BOX OPTIMIZATION
• Typically, the AutoML problem is represented as a large n-dimensional (Euclidean) search space: a configuration is a point in that space
45. HYPERPARAMETER SPACE: Grid search
46. HYPERPARAMETER SPACE: Random search
47. HYPERPARAMETER SPACE: Guided (model-based) search
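To make the first two strategies concrete, here is a stdlib-only sketch comparing grid and random search on a toy objective. The objective, parameter names, and value ranges are all made up for illustration; in practice the score would come from cross-validating a model.

```python
import itertools, random

# Toy objective over two hyperparameters; stands in for
# cross-validated accuracy. Best point is (lr=0.3, reg=0.7).
def score(lr, reg):
    return -((lr - 0.3) ** 2 + (reg - 0.7) ** 2)

def grid_search(values_lr, values_reg):
    """Exhaustively score every point on a fixed grid."""
    return max(itertools.product(values_lr, values_reg),
               key=lambda cfg: score(*cfg))

def random_search(n, rng):
    """Score n configurations sampled uniformly at random."""
    cands = [(rng.random(), rng.random()) for _ in range(n)]
    return max(cands, key=lambda cfg: score(*cfg))

rng = random.Random(0)
best_grid = grid_search([0.0, 0.25, 0.5, 0.75, 1.0],
                        [0.0, 0.25, 0.5, 0.75, 1.0])
best_rand = random_search(25, rng)   # same budget: 25 evaluations
```

With the same budget, random search is not locked to the grid's resolution, which is why it often wins when only a few hyperparameters really matter.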
48. BAYESIAN OPTIMIZATION
• Start with a few (random) hyperparameter configurations
• Build a surrogate model to predict how well other configurations will work (probabilistic regression)
• Most popular: Gaussian processes, random forests
• To avoid a greedy search, use an acquisition function to trade off exploration and exploitation
• The next configuration to evaluate is the best one under that function
49. BAYESIAN OPTIMIZATION (continued)
• Repeat until a stopping criterion is met:
• Fixed budget (time, evaluations)
• Minimum distance between configurations
• Threshold on the acquisition function
• …
• Overfits quite easily
(animation: https://raw.githubusercontent.com/mlr-org/mlrMBO/master/docs/articles/helpers/animation-.gif)
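The loop described on the last two slides can be sketched in a few lines. This is a deliberately crude stand-in: a real implementation would fit a Gaussian process or random forest surrogate, whereas here the "surrogate" predicts the value of the nearest evaluated point and uses its distance as the uncertainty in an upper-confidence-bound acquisition.

```python
import random

# Minimal Bayesian-optimization-style loop over one hyperparameter.
def objective(x):              # stands in for a cross-validated score
    return -(x - 0.3) ** 2

def ucb(x, evaluated, kappa=2.0):
    """Acquisition: predicted mean plus kappa times 'uncertainty'.
    Toy surrogate: mean = value of nearest evaluated point,
    uncertainty = distance to it."""
    nearest_x, nearest_y = min(evaluated, key=lambda p: abs(p[0] - x))
    return nearest_y + kappa * abs(nearest_x - x)

rng = random.Random(42)
candidates = [i / 100 for i in range(101)]
# Start with a few random configurations
evaluated = [(x, objective(x)) for x in rng.sample(candidates, 3)]

for _ in range(10):                  # fixed budget as stopping criterion
    tried = {x for x, _ in evaluated}
    x_next = max((x for x in candidates if x not in tried),
                 key=lambda x: ucb(x, evaluated))
    evaluated.append((x_next, objective(x_next)))

best_x, best_y = max(evaluated, key=lambda p: p[1])
```

Early on the distance term dominates and the loop space-fills; as gaps shrink, it exploits the region around the best observed value.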
50. SURROGATE MODELS
• Lots of research into surrogate models:
• Gaussian processes: extrapolate well, but not scalable
• Sparse GPs (select 'inducing points' that minimize information loss)
• Multi-task GPs (transfer surrogate models from other tasks)
• Random forests: more scalable, but extrapolate badly
• Bayesian neural networks (expensive, sensitive to hyperparameters)
• …
51. SURROGATE MODELS
• SMAC: random forests (with random configurations in between); plots show minimization
52. SURROGATE MODELS
• After 12 iterations: extrapolation remains an issue
53. GENETIC PROGRAMMING (Randall Olson et al., GECCO 2016)
55. SPEEDING UP EVALUATIONS
Testing on a smaller sample of the data is faster and may give a good estimate of the final performance (or it may not).
56. SPEEDING UP EVALUATIONS
Hyperband: based on 'successive halving'
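Successive halving, the routine Hyperband builds on, is simple to sketch: give every configuration a small budget, keep the better half, double the budget, and repeat. The "learning curve" below is faked for illustration; real use would partially train models.

```python
import math

def validation_score(config, budget):
    """Pretend score: each config approaches its own ceiling as
    the budget grows (a fake, illustrative learning curve)."""
    return config["ceiling"] * (1 - math.exp(-budget / 10))

def successive_halving(configs, min_budget=1):
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs,
                        key=lambda c: validation_score(c, budget),
                        reverse=True)
        configs = scored[: max(1, len(scored) // 2)]  # keep better half
        budget *= 2                                   # double the budget
    return configs[0]

configs = [{"name": f"c{i}", "ceiling": i / 10} for i in range(8)]
winner = successive_halving(configs)
```

Hyperband then runs several such brackets with different trade-offs between the number of configurations and the starting budget, hedging against curves that cross late.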
58. SPEEDING UP EVALUATIONS
DAUB: computes upper bounds
59. META-LEARNING
• Learn across datasets
• Warm-starting: start with configurations/pipelines/architectures that worked well on similar datasets
• Meta-models:
• Describe datasets with characteristics (metafeatures)
• Predict performance, runtime,… for new datasets
• Active testing:
• Don't use metafeatures; consider datasets similar if algorithms behave similarly on them
• Learn dataset similarity and try promising configurations at the same time
• Transfer learning: reuse surrogate models from similar datasets
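Warm-starting can be sketched with a nearest-neighbor lookup over metafeatures. The metafeatures (rows, features, classes), distance, and stored "best configurations" below are made-up illustrations of the idea, not real OpenML meta-data.

```python
import math

# Made-up prior knowledge: metafeatures = (n_rows, n_features, n_classes)
# and the best-known configuration found on each dataset.
PRIOR = {
    "dataset_a": {"metafeatures": (1000, 20, 2),
                  "best_config": {"C": 1.0}},
    "dataset_b": {"metafeatures": (50000, 300, 10),
                  "best_config": {"C": 0.01}},
}

def distance(mf1, mf2):
    # log-scale the counts so raw dataset size doesn't dominate
    return math.dist([math.log1p(v) for v in mf1],
                     [math.log1p(v) for v in mf2])

def warm_start(new_metafeatures):
    """Return the best-known config of the most similar prior dataset."""
    nearest = min(PRIOR.values(),
                  key=lambda d: distance(d["metafeatures"],
                                         new_metafeatures))
    return nearest["best_config"]
```

A new small dataset (say 900 rows, 15 features, 2 classes) lands nearest `dataset_a`, so the search would start from `{"C": 1.0}` instead of from scratch.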
60. META-LEARNING + OPENML
• Find similar datasets
• 20,000+ datasets, each with 130+ meta-features
• Reuse (millions of) prior model evaluations:
• Meta-models: E.g. predict performance or training time
• Reuse results on many hyperparameter settings
• Surrogate models: predict best hyperparameter settings
• Model the effect/importance of hyperparameters
• Deeper understanding of algorithm performance:
• WHY models perform well (or not)
• How is performance related to dataset properties
• Never-ending AutoML:
• AutoML methods built on top of OpenML get
increasingly better as more meta-data is added
61. META-LEARNING: WARM-STARTING
Auto-sklearn: find similar datasets on OpenML and start with their best configurations (Matthias Feurer et al. (2016) NIPS)
Adaptive Bayesian linear regression:
- build surrogate models from prior data on many OpenML tasks
- coupled through a neural net trained on evaluations on the new dataset
(Valerio Perrone et al. (2017) MetaLearning@NIPS)
62. META-LEARNING: HYPERPARAMETER IMPORTANCE
SVM hyperparameter importance according to OpenML (Jan van Rijn et al. (2017) AutoML@ICML)
Many other possibilities:
- Learn which hyperparameters should be tuned at all
- Learn good value ranges
- Learn good default values
63. META-LEARNING: PROFILING
Track the score while adding:
- irrelevant features (dotted)
- redundant features (solid)
Findings:
- Redundant features hurt (collinearity): SVMs, naive Bayes
- Random forests and boosting are very sensitive to both
- Tuning hyperparameters doesn't help -> select features
64. META-LEARNING: PROFILING
Add white noise to the features:
- Random forests, boosting, naive Bayes are very sensitive
- kNN and SVMs much less so
- Hyperparameter tuning helps (regularization)!
65. META-LEARNING: PROFILING
Add noise to the labels:
- All algorithms are hurt
- Naive Bayes and kNN are slightly more robust
- Tuning doesn't help
66. META-LEARNING: RUNTIME PREDICTION
• 100+ dataset metafeatures and 100,000+ experiment runtimes
• Build meta-models (random forests work well)
• Some algorithms are more predictable than others
67. AUTO-NOTEBOOKS
Automatically generate notebooks with interesting analyses for any OpenML dataset (in progress)
68. REAL-WORLD EXAMPLE: DRUG DISCOVERY
• Predict how well a drug inhibits protein function (and the diseases that depend on it)
• Problem: lab work (data) is expensive
• Solution:
• use all available data on all available protein targets
• use meta-learning to learn how to build optimal models for specific proteins
69. • Auto-builds significantly better pipelines/models (Olier et al., Machine Learning 107(1), 2018)
• When data is scarce, build better models by transferring from genetically similar targets (Sadawi et al., ChemInformatics, under review)
• Use trained models to build better molecule representations
• Collaboration with Exscientia Ltd (partner of GlaxoSmithKline)
ChEMBL DB: 1.4M compounds, 10k proteins, 12.8M activities
Molecule representations (excerpt):
MW      LogP    TPSA    b1 b2 b3 b4 b5 b6 b7 b8 b9
377.435  3.883   77.85  1  1  0  0  0  0  0  0  0
341.361  3.411   74.73  1  1  0  1  0  0  0  0  0  …
197.188 -2.089  103.78  1  1  0  1  0  0  0  1  0
346.813  4.705   50.70  1  0  0  1  0  0  0  0  0
(workflow: 16,000 regression datasets x 52 pipelines on OpenML -> meta-model -> all data on new protein -> optimal models to predict activity)