DEMOCRATIZING & AUTOMATING MACHINE LEARNING
dr. ir. Joaquin Vanschoren, TU/e
@open_ml, @joavanschoren, www.openml.org
SLOAN DIGITAL SKY SURVEY
“Move data out of people’s labs and minds, and into online tools that
help us structure and reuse the information in new ways.” - M. Nielsen
Designed serendipity
Broadcast data, so people may use it in unexpected ways
Broadcast questions, combine many minds to solve them
Somewhere, someone has just the right data or ideas to help
Requires frictionless collaboration
Open, accessible, organized data (collective working memory)
Contribute new data/ideas in seconds, not days
NETWORKED SCIENCE
DEMOCRATIZE MACHINE LEARNING
Machine Learning: far from frictionless
Lots of manual work, trial-and-error:
• How much, which data do I need?
• How should I clean the data?
• How should I represent the data (features)?
• How should I preprocess the data?
• Which algorithms should I use?
• How should I tune all algorithm parameters?
• How do I properly evaluate the models?
• What if the data evolves over time?
DEMOCRATIZE MACHINE LEARNING
Empower anybody to do data science well
Provide easy access to reproducible data/code/models
Build easily on prior work:
Similar datasets, promising pipelines/models, prior evaluations,…
Allow frictionless collaboration
Crowdsource hard problems, share results easily (automatically)
Organized, reproducible, open results (trust, reputation)
Simplify, speed up data science process (AutoML)
Automate drudge work (e.g. hyperparameter optimization)
Learn from past, build ever better algorithms, models:
Meta-learning, pipeline generation, neural architecture search,…
Use machine learning to help
people do machine learning
WHAT IF WE COULD ANALYSE DATA COLLABORATIVELY, ON WEB SCALE, IN REAL TIME?
OpenML: Frictionless, collaborative,
and automated machine learning
Build ML models using many open-source libraries
Run locally (or wherever), auto-upload all results
Auto-annotated, organized, well-formatted datasets.
Bring your own data for easier analysis
Auto-generate clear (machine-readable) tasks
What are your goals, evaluation metrics?
Find out what works (and what doesn’t)
Write scripts (bots) that automatically find best models
(with or without a PhD)
It starts with data
Data (tabular) easily uploaded or referenced (URL)
auto-versioned, analysed, organised online
Data can live in existing repositories (e.g. Zenodo)
-> API allows integrations (auto-sharing, DOIs,…)
-> registered via URL, transparent to users
interoperability
Limited data formats: ARFF (CSV + header)
-> CSV importer (auto-annotates features)
-> Working on FrictionlessData support
Tasks contain data, goals, and procedures
Auto-build + evaluate models correctly, so you can focus on the science
Why analyze this data? E.g.: predict target T, optimize accuracy, 10-fold cross-validation
Collaborate in real time online
optimize accuracy
Predict target T
10-fold Xval
Auto-run algorithms/workflows on any task
How? Which algorithms?
Integrated in many machine learning tools (+ APIs)
Run locally (or wherever you want)
reproducible, linked to data, flows, authors
and all other experiments
Experiments auto-uploaded, evaluated online
What works?
Publish, and track your impact
OpenML Community
5000+ registered users,
10000+ 30-day active users
www.openml.org
OPENML API
REST (XML/JSON), Python, R, Java, …
• Get Data, Flows, Models, Evaluations
• Download data, meta-data and run files
• Upload new data, flows, runs
• Upload/download optimization traces
• Upload/download meta-models
• Build AutoML bots that run against OpenML
Interact with OpenML any way you want
Learning to learn
Bots that learn from all prior experiments
Automate drudge work, help people build models
Join us! (and change the world)
OpenML
Active open source community
Regular hackathons
We need bright people
- ML experts
- Developers
- UX
AUTOMATIC MACHINE LEARNING
Can algorithms be trained to automatically build end-to-end machine learning systems?
Use machine learning to do better machine learning
Can we turn
Solution = data + manual exploration + computation
into
Solution = data + computation (×100)?
EXPLOSION OF COMMERCIAL INTEREST
(e.g. IBM, Microsoft Azure)
EXPLOSION OF FUNDAMENTAL RESEARCH, HIGH-QUALITY OPEN SOURCE TOOLS
We’ll focus on these (and the concepts behind them)
AUTOMATIC MACHINE LEARNING
Not about automating data scientists
• Efficient exploration of techniques
• Automate the tedious aspects (inner loop)
• Make every data scientist a super data scientist
• Democratisation
• Allow individuals, small companies to use machine
learning effectively (at lower cost)
• Open source tools and platforms
• Data Science
• Better understand algorithms, develop better ones
• Self-learning algorithms
MACHINE LEARNING PIPELINES
AUTOMATING MACHINE LEARNING PIPELINES
NEURAL ARCHITECTURES
NEURAL ARCHITECTURE SEARCH
AdaNet (Google)
Different types, ranges, dependencies. E.g. for a neural net:
Everything the user has to decide before running the algorithm
HYPERPARAMETERS
Complex interactions, different effects on performance
HYPERPARAMETERS
HYPERPARAMETER SPACE
BLACK BOX OPTIMIZATION
• Typically, represent the AutoML problem as a large n-dimensional
(Euclidean) search space: configuration is a point in that space
HYPERPARAMETER SPACE (2D)
HYPERPARAMETER SPACE
Grid search
Random search
Guided (model-based) search
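The difference between grid and random search can be sketched on a toy 2-D space; the objective, its peak, and the value ranges below are all hypothetical stand-ins for a real validation score:

```python
import itertools
import math
import random

def objective(lr, reg):
    """Hypothetical validation score, peaking at lr=0.1, reg=0.01."""
    return math.exp(-((math.log10(lr) + 1) ** 2 + (math.log10(reg) + 2) ** 2))

def grid_search(n):
    # n values per dimension, log-spaced in [1e-4, 1e0] -> n*n evaluations
    exps = [-4 + 4 * i / (n - 1) for i in range(n)]
    return max((objective(10 ** a, 10 ** b), 10 ** a, 10 ** b)
               for a, b in itertools.product(exps, exps))

def random_search(budget, rng):
    # Same budget, but every trial draws a fresh value in each dimension
    best = (0.0, None, None)
    for _ in range(budget):
        lr, reg = 10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-4, 0)
        cand = (objective(lr, reg), lr, reg)
        if cand[0] > best[0]:
            best = cand
    return best

g = grid_search(4)                       # 16 evaluations
r = random_search(16, random.Random(0))  # 16 evaluations
print(f"grid best score:   {g[0]:.3f}")
print(f"random best score: {r[0]:.3f}")
```

The usual argument for random search: with the same budget it tries many distinct values per dimension, so it tends to do better when only a few hyperparameters really matter.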
• Start with a few (random) hyperparameter configurations
• Build a surrogate model to predict how well other
configurations will work (probabilistic, regression)
• Most popular: Gaussian processes, Random Forests
• To avoid a greedy search, use an acquisition function to trade
off exploration and exploitation.
• The next configuration is the best one under that function
BAYESIAN OPTIMIZATION
• Repeat
• Stopping criterion:
• Fixed budget
(time, evaluations)
• Min. distance
between configs
• Threshold for
acquisition function
• …
• Overfits quite easily
BAYESIAN OPTIMIZATION
(animation: https://raw.githubusercontent.com/mlr-org/mlrMBO/master/docs/articles/helpers/animation-.gif)
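The loop above can be sketched on a 1-D toy problem with a hand-rolled Gaussian-process surrogate and expected improvement as the acquisition function; the objective, kernel length-scale, and candidate grid are all illustrative choices, not a reference implementation:

```python
import math
import numpy as np

def f(x):
    """Toy black-box objective to minimize (stands in for a CV error)."""
    return np.sin(3 * x) + x ** 2 - 0.7 * x

def rbf(a, b, ls=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, jitter=1e-6):
    # Standard zero-mean GP regression equations with an RBF kernel
    K = rbf(X, X) + jitter * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    v = np.linalg.solve(K, Ks)
    var = np.ones(len(Xs)) - np.sum(Ks * v, axis=0)  # k(x,x)=1 for RBF
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    # Expected improvement below the best observed value (minimization)
    z = (best - mu) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (best - mu) * Phi + sigma * phi

rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, 3)           # a few random initial configurations
y = f(X)
grid = np.linspace(-1, 2, 201)      # candidate configurations
for _ in range(10):
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
print(f"best found: f({X[np.argmin(y)]:.2f}) = {y.min():.3f}")
```

Each iteration refits the surrogate on all observations and evaluates the single point that maximizes expected improvement, which is exactly the exploration/exploitation trade-off described above.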
• Lot of research into surrogate models:
• Gaussian processes: extrapolate well, not scalable
• Sparse GPs (select ‘inducing points’ that minimize info loss)
• Multi-task GPs (transfer surrogate models from other tasks)
• RandomForest: more scalable, extrapolates badly
• Bayesian Neural Networks (expensive, sensitive to hyperparams)
• …
SURROGATE MODELS
• SMAC: Random Forests (with random configs in between):
SURROGATE MODELS
(minimizing)
• After 12 iterations: extrapolation remains an issue
SURROGATE MODELS
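A random-forest surrogate in the spirit of SMAC can be sketched as follows, using the spread of per-tree predictions as a rough uncertainty estimate; the toy objective is hypothetical, and real SMAC additionally interleaves random configurations and uses expected improvement rather than this simple lower-confidence-bound pick:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def error(cfg):
    """Toy validation error over a 2-D configuration space, minimum at (0.3, 0.7)."""
    return (cfg[..., 0] - 0.3) ** 2 + (cfg[..., 1] - 0.7) ** 2

X = rng.uniform(0, 1, (8, 2))   # initial random configurations
y = error(X)
for _ in range(15):
    rf = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
    cand = rng.uniform(0, 1, (256, 2))            # random candidate configs
    per_tree = np.stack([t.predict(cand) for t in rf.estimators_])
    mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0)
    x_next = cand[np.argmin(mu - sigma)]          # optimistic pick
    X, y = np.vstack([X, x_next]), np.append(y, error(x_next))
print(f"best error after {len(y)} evaluations: {y.min():.4f}")
```

Unlike a GP, the forest handles discrete and conditional hyperparameters naturally, at the price of the poor extrapolation noted above.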
Randall Olson et al. (2016) GECCO
GENETIC PROGRAMMING
SPEEDING UP EVALUATIONS
Testing on a smaller sample of the data is faster, may give a good
estimate of final performance (or maybe not)
SPEEDING UP EVALUATIONS
Hyperband: based on ‘successive halving’
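Successive halving, the building block of Hyperband, can be sketched as follows; the noisy toy evaluation stands in for training on a growing data sample or epoch budget, and all names and constants are illustrative:

```python
import math
import random

evaluations = []  # (config id, budget) log, to show where the budget goes

def evaluate(cfg, budget, rng):
    """Toy: observed score = true quality + noise that shrinks with budget."""
    evaluations.append((cfg["id"], budget))
    return cfg["quality"] + rng.gauss(0, 0.3 / math.sqrt(budget))

def successive_halving(configs, min_budget=1, eta=2, rng=None):
    rng = rng or random.Random(0)
    budget = min_budget
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, budget, rng), reverse=True)
        configs = ranked[: max(1, len(configs) // eta)]  # keep the top 1/eta
        budget *= eta                                     # survivors get more budget
    return configs[0]

rng = random.Random(42)
configs = [{"id": i, "quality": rng.random()} for i in range(16)]
best = successive_halving(configs, rng=random.Random(0))
print(f"winner: config {best['id']} "
      f"({len(evaluations)} evaluations, total budget {sum(b for _, b in evaluations)})")
```

Here 16 configurations cost a total budget of 64 instead of the 128 that evaluating all of them at full budget would cost; Hyperband simply runs several such brackets with different trade-offs between the number of configurations and the minimum budget.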
SPEEDING UP EVALUATIONS
DAUB: computes upper bounds
META-LEARNING
• Learn across datasets
• Warm-starting: start with configurations/pipelines/
architectures that worked well on similar datasets
• Meta-models:
• Describe datasets with characteristics (metafeatures)
• Predict performance, runtime,… for new datasets
• Active testing:
• Don’t use meta-features, consider datasets to be
similar if algorithms behave similarly
• Learn dataset similarity and try promising
configurations at the same time
• Transfer learning: reuse surrogate models from similar
datasets
META-LEARNING + OPENML
• Find similar datasets
• 20,000+ datasets, each with 130+ meta-features
• Reuse (millions of) prior model evaluations:
• Meta-models: E.g. predict performance or training time
• Reuse results on many hyperparameter settings
• Surrogate models: predict best hyperparameter settings
• Model the effect/importance of hyperparameters
• Deeper understanding of algorithm performance:
• WHY models perform well (or not)
• How is performance related to dataset properties
• Never-ending AutoML:
• AutoML methods built on top of OpenML get
increasingly better as more meta-data is added
Auto-sklearn: find similar datasets on OpenML, start with their best
configurations (Matthias Feurer et al., NIPS 2016)
Adaptive Bayesian Linear Regression:
- build surrogate models from prior
data on many OpenML tasks
- coupled through a neural net
trained on evaluations on the new
dataset
(Valerio Perrone et al., MetaLearning@NIPS 2017)
META-LEARNING: WARM STARTING
SVM hyperparameter
importance according to
OpenML
Jan van Rijn et al. (2017) AutoML@ICML
Many other possible ways:
- Learn which hyperparameters should be tuned at all
- Learn good value ranges
- Learn good default values
META-LEARNING: HYPERPARAMETER IMPORTANCE
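The analysis behind such importance plots can be approximated by fitting a random forest to (configuration, score) pairs and reading off its feature importances, in the spirit of functional ANOVA; the hyperparameter names and the response surface below are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
log_gamma = rng.uniform(-5, 0, n)               # log10(gamma)
log_C = rng.uniform(-2, 3, n)                   # log10(C)
shrinking = rng.integers(0, 2, n).astype(float) # binary flag
# Hypothetical response surface: gamma dominates, C matters a bit,
# shrinking barely matters at all.
score = (-(log_gamma + 2) ** 2 + 0.3 * log_C + 0.01 * shrinking
         + rng.normal(0, 0.1, n))
X = np.column_stack([log_gamma, log_C, shrinking])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, score)
importance = dict(zip(["gamma", "C", "shrinking"], rf.feature_importances_))
print(importance)
```

On OpenML this is done across many datasets at once, so the conclusion ("tune gamma first") generalizes beyond a single task.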
Track score while adding:
- irrelevant features (dotted)
- redundant features (full)
- Redundant features are bad
(collinearity): SVMs, NBayes
- RandomForests, Boosting
very sensitive to both!
- Tuning hyperparameters
doesn't help -> select features
META-LEARNING: PROFILING
Add white noise to features.
- RandomForests, Boosting,
NBayes very sensitive
- kNN, SVM much less
- Hyperparameter tuning
helps (regularization)!
META-LEARNING: PROFILING
Add noise to labels
- All algorithms suffer
- NBayes, kNN slightly
more robust
- Tuning doesn’t help.
META-LEARNING: PROFILING
• 100+ dataset meta-features and 100,000+ experiment runtimes
• Build meta-models (Random Forest works well)
• Some algorithms more predictable than others
META-LEARNING: RUNTIME PREDICTION
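Such a runtime meta-model can be sketched as a regression from dataset metafeatures to (log) runtime; the meta-dataset below is synthetic, standing in for the runtimes harvested from real experiments:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Hypothetical meta-dataset: metafeatures -> log10(runtime in seconds)
log_instances = rng.uniform(2, 6, 400)    # log10 of 100 .. 1e6 instances
log_features = rng.uniform(0.5, 3, 400)   # log10 of ~3 .. 1000 features
log_runtime = (1.2 * log_instances + 0.8 * log_features - 4
               + rng.normal(0, 0.2, 400))
X = np.column_stack([log_instances, log_features])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, log_runtime)
pred = model.predict([[4.0, 2.0]])[0]     # ~10k instances, 100 features
print(f"predicted runtime: ~{10 ** pred:.0f} s")
```

Predicting in log-space matters here: runtimes span orders of magnitude, and a model trained on raw seconds would be dominated by the largest experiments.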
AUTO-NOTEBOOKS
Automatically generate notebooks with interesting
analyses for any OpenML dataset (in progress)
REAL-WORLD EXAMPLE: DRUG DISCOVERY
• Predict how well a drug
inhibits protein function (and
diseases depending on it)
• Problem: lab work (data) is
expensive
• Solution:
• use all available data on all
available protein targets
• use meta-learning to learn
how to build optimal
models for specific proteins
• Auto-builds significantly better pipelines/models
(Olier et al., Machine Learning 107(1), 2018)
• When data is scarce, build better models by transfer from
genetically similar targets (Sadawi et al, ChemInformatics, under review)
• Use trained models to build better molecule representations
• Collaboration with Exscientia Ltd (partner of GlaxoSmithKline)
ChEMBL DB: 1.4M compounds, 10k proteins, 12.8M activities
Molecule representations:

MW       LogP    TPSA    b1  b2  b3  b4  b5  b6  b7  b8  b9  …
377.435  3.883   77.85   1   1   0   0   0   0   0   0   0
341.361  3.411   74.73   1   1   0   1   0   0   0   0   0   …
197.188  -2.089  103.78  1   1   0   1   0   0   0   1   0
346.813  4.705   50.70   1   0   0   1   0   0   0   0   0
…
16,000 regression datasets × 52 pipelines (on OpenML) -> meta-model
all data on a new protein -> optimal models to predict activity
Thank You
@open_ml
OpenML
Questions?

Contenu connexe

Plus de Joaquin Vanschoren

Plus de Joaquin Vanschoren (12)

OpenML Reproducibility in Machine Learning ICML2017
OpenML Reproducibility in Machine Learning ICML2017OpenML Reproducibility in Machine Learning ICML2017
OpenML Reproducibility in Machine Learning ICML2017
 
OpenML DALI
OpenML DALIOpenML DALI
OpenML DALI
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015
 
OpenML Tutorial: Networked Science in Machine Learning
OpenML Tutorial: Networked Science in Machine LearningOpenML Tutorial: Networked Science in Machine Learning
OpenML Tutorial: Networked Science in Machine Learning
 
Data science
Data scienceData science
Data science
 
OpenML 2014
OpenML 2014OpenML 2014
OpenML 2014
 
Open Machine Learning
Open Machine LearningOpen Machine Learning
Open Machine Learning
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop sensordata part2
Hadoop sensordata part2Hadoop sensordata part2
Hadoop sensordata part2
 
Hadoop sensordata part1
Hadoop sensordata part1Hadoop sensordata part1
Hadoop sensordata part1
 
Hadoop sensordata part3
Hadoop sensordata part3Hadoop sensordata part3
Hadoop sensordata part3
 

Dernier

Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 

Dernier (20)

Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 

AutoML Sao Paulo

  • 1. D E M O C R AT I Z I N G & A U T O M AT I N G M A C H I N E L E A R N I N G dr. ir. Joaquin Vanschoren, TU/e @open_ml, @joavanschorenwww.openml.org
  • 2. S L O A N D I G I TA L S K Y S U R V E Y
  • 3. move data out of people’s labs and minds, and into online tools that help us structure and reuse the information in new ways -M. Nielsen Designed serendipity Broadcast data, so people may use it in unexpected ways Broadcast questions, combine many minds to solve it Somewhere, someone has just the right data or ideas to help Requires frictionless collaboration Open, accessible, organized data (collective working memory) Contribute new data/ideas in seconds, not days N E T W O R K E D S C I E N C E
  • 4. D E M O C R AT I Z E M A C H I N E L E A R N I N G Machine Learning: far from frictionless Lots of manual work, trial-and-error: • How much, which data do I need? • How should I clean the data? • How should I represent the data (features)? • How should I preprocess the data? • Which algorithms should I use? • How should I tune all algorithm parameters? • How do I properly evaluate the models? • What if the data evolves over time?
  • 5. D E M O C R AT I Z E M A C H I N E L E A R N I N G Empower anybody to do data science well Provide easy access to reproducible data/code/models Build easily on prior work: Similar datasets, promising pipelines/models, prior evaluations,… Allow frictionless collaboration Crowdsource hard problems, share results easily (automatically) Organized, reproducible, open results (trust, reputation) Simplify, speed up data science process (AutoML) Automate drudge work (e.g. hyperparameter optimization) Learn from past, build ever better algorithms, models: Meta-learning, pipeline generation, neural architecture search,…
  • 6. Use machine learning to help people do machine learning
  • 7. W H AT I F W E C A N A N A LY S E D ATA C O L L A B O R AT I V E LY
  • 8. W H AT I F W E C A N A N A LY S E D ATA C O L L A B O R AT I V E LY O N W E B S C A L E
  • 9. W H AT I F W E C A N A N A LY S E D ATA C O L L A B O R AT I V E LY O N W E B S C A L E I N R E A L T I M E
  • 10. OpenML: Frictionless, collaborative, and automated machine learning Build ML models using many open-source libraries Run locally (or wherever), auto-upload all results Auto-annotated, organized, well-formatted datasets. Bring your own data for easier analysis Auto-generate clear (machine-readable) tasks What are your goals, evaluation metrics? Find out what works (and what doesn’t) Write scripts (bots) that automatically find best models (with or without a PhD)
  • 12. Data (tabular) easily uploaded or referenced (URL) It starts with data
  • 14. Data can live in existing repositories (e.g. Zenodo) -> API allows integrations (auto-sharing, DOIs,…) -> registered via URL, transparent to users interoperability Limited data formats: ARFF (CSV + header) -> CSV importer (auto-annotates features) -> Working on FrictionlessData support
  • 15. Tasks contain data, goals, procedures. Auto-build + evaluate models correctly so you can focus on the science optimize accuracy Predict target T Why analyze this data? 10-fold Xval
  • 16. Collaborate in real time online optimize accuracy Predict target T 10-fold Xval
  • 17. Auto-run algorithms/workflows on any task Integrated in many machine learning tools (+ APIs) How? Which algorithms?
  • 18. Integrated in many machine learning tools (+ APIs)
  • 19. Run locally (or wherever you want) Integrated in many machine learning tools (+ APIs)
  • 20. reproducible, linked to data, flows, authors and all other experiments Experiments auto-uploaded, evaluated online What works?
  • 22. Publish, and track your impact
  • 23. OpenML Community 5000+ registered users, 10000+ 30-day active users
  • 25. R E S T ( X M L / J S O N ) , P Y T H O N , R , J AVA , … O P E N M L A P I • Get Data, Flows, Models, Evaluations • Download data, meta-data and run files • Upload new data, flows, runs • Upload/download optimization traces • Upload/download meta-models • Build AutoML bots that run against OpenML Interact with OpenML any way you want
  • 26. Learning to learn Bots that learn from all prior experiments Automate drudge work, help people build models
  • 27. Join us! (and change the world) OpenML Active open source community Regular hackathons We need bright people - ML experts - Developers - UX
  • 28. A U T O M AT I C M A C H I N E L E A R N I N G Can algorithms be trained to automatically build end- to-end machine learning systems? Use machine learning to do better machine learning Can we turn Solution = data + manual exploration + computation Solution = data + computation (x100) Into
  • 29. E X P L O S I O N O F C O M M E R C I A L I N T E R E S T
  • 30. E X P L O S I O N O F C O M M E R C I A L I N T E R E S T
  • 31. E X P L O S I O N O F C O M M E R C I A L I N T E R E S T
  • 32. E X P L O S I O N O F C O M M E R C I A L I N T E R E S T (IBM)
  • 33. E X P L O S I O N O F C O M M E R C I A L I N T E R E S T (MicroSoft Azure)
  • 34. E X P L O S I O N O F F U N D A M E N TA L R E S E A R C H , H I G H - Q U A L I T Y O P E N S O U R C E T O O L S We’ll focus on these (and the concepts behind them)
  • 35. A U T O M AT I C M A C H I N E L E A R N I N G Not about automating data scientists • Efficient exploration of techniques • Automate the tedious aspects (inner loop) • Make every data scientist a super data scientist • Democratisation • Allow individuals, small companies to use machine learning effectively (at lower cost) • Open source tools and platforms • Data Science • Better understand algorithms, develop better ones • Self-learning algorithms
  • 36. M A C H I N E L E A R N I N G P I P E L I N E S
  • 37. machine learning pipelinesA U T O M AT I N G M A C H I N E L E A R N I N G P I P E L I N E S
  • 38. N E U R A L A R C H I T E C T U R E S
  • 39. N E U R A L A R C H I T E C T U R E S E A R C H AdaNet (Google)
  • 40. Different types, ranges, dependencies. E.g. neural net: Everything user has to decide before running the algorithm H Y P E R PA R A M E T E R S
  • 41. Complex interactions, different effects on performance H Y P E R PA R A M E T E R S
  • 42. H Y P E R PA R A M E T E R S PA C E
  • 43. B L A C K B O X O P T I M I Z AT I O N • Typically, represent the AutoML problem as a large n-dimensional (Euclidean) search space: configuration is a point in that space
  • 44. H Y P E R PA R A M E T E R S PA C E ( 2 D )
  • 45. H Y P E R PA R A M E T E R S PA C E Grid search
  • 46. H Y P E R PA R A M E T E R S PA C E Random search
  • 47. H Y P E R PA R A M E T E R S PA C E Guided (model-based) search
  • 48. • Start with a few (random) hyperparameter configurations • Build a surrogate model to predict how well other configurations will work (probabilistic, regression) • Most popular: Gaussian processes, Random Forests • To avoid a greedy search, use an acquisition function to trade off exploration and exploitation. • The next configuration is the best one under that function B AY E S I A N O P T I M I Z AT I O N
  • 49. • Repeat • Stopping criterion: • Fixed budget (time, evaluations) • Min. distance between configs • Threshold for acquisition function • … • Overfits quite easily B AY E S I A N O P T I M I Z AT I O N https://raw.githubusercontent.com/ mlr-org/mlrMBO/master/docs/ articles/helpers/animation-.gif
  • 50. • Lot of research into surrogate models: • Gaussian processes: extrapolate well, not scalable • Sparse GPs (select ‘inducing points’ that minimize info loss) • Multi-task GPs (transfer surrogate models from other tasks) • RandomForest: more scalable, extrapolates badly • Bayesian Neural Networks (expensive, sensitive to hyperparams) • … S U R R O G AT E M O D E L S
  • 51. • SMAC: Random Forests (with random configs in between): S U R R O G AT E M O D E L S (minimizing)
  • 52. • SMAC: Random Forests (with random configs in between): • After 12 iterations: extrapolation remains an issue S U R R O G AT E M O D E L S
  • 53. Randall Olson et al. (2016) GECCO G E N E T I C P R O G R A M M I N G
  • 54. G E N E T I C P R O G R A M M I N G
  • 55. S P E E D I N G U P E VA L U AT I O N S Testing on a smaller sample of the data is faster, may give a good estimate of final performance (or maybe not)
  • 56. S P E E D I N G U P E VA L U AT I O N S Hyperband: based on ‘successive halving’
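Successive halving, the core of Hyperband, can be sketched directly: evaluate many configurations on a small budget, keep the top fraction, multiply the budget, and repeat until one survives. The `evaluate` function below is a hypothetical stand-in whose noise shrinks as the budget grows, mimicking how low-budget estimates are cheap but unreliable.

```python
import math
import random

def evaluate(config, budget):
    # Stand-in for training `config` with `budget` resources (epochs,
    # data size, ...): the noisy estimate sharpens as the budget grows.
    # Reseeding the global RNG keeps the toy deterministic.
    random.seed(hash((config, budget)) % (2 ** 32))
    true_quality = config / 100            # hypothetical: higher is better
    noise = random.gauss(0, 1.0 / math.sqrt(budget))
    return true_quality + noise

def successive_halving(configs, min_budget=1, eta=2):
    budget = min_budget
    while len(configs) > 1:
        scores = {c: evaluate(c, budget) for c in configs}
        configs.sort(key=lambda c: scores[c], reverse=True)
        configs = configs[: max(1, len(configs) // eta)]  # keep top 1/eta
        budget *= eta                                     # grow the budget
    return configs[0]

winner = successive_halving(list(range(64)))
print("winner:", winner)
```

Hyperband proper runs several such brackets with different trade-offs between the number of configurations and the starting budget, hedging against killing slow starters too early.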
  • 58. S P E E D I N G U P E VA L U AT I O N S DAUB (Data Allocation Using Upper Bounds): train on growing data subsets, and allocate more data only to the configurations whose projected (upper-bound) accuracy remains promising
  • 59. M E TA - L E A R N I N G • Learn across datasets • Warm-starting: start with configurations/pipelines/ architectures that worked well on similar datasets • Meta-models: • Describe datasets with characteristics (metafeatures) • Predict performance, runtime,… for new datasets • Active testing: • Don’t use meta-features, consider datasets to be similar if algorithms behave similarly • Learn dataset similarity and try promising configurations at the same time • Transfer learning: reuse surrogate models from similar datasets
  • 60. M E TA - L E A R N I N G + O P E N M L • Find similar datasets • 20,000+ datasets, each with 130+ meta-features • Reuse (millions of) prior model evaluations: • Meta-models: E.g. predict performance or training time • Reuse results on many hyperparameter settings • Surrogate models: predict best hyperparameter settings • Model the effect/importance of hyperparameters • Deeper understanding of algorithm performance: • WHY models perform well (or not) • How is performance related to dataset properties • Never-ending AutoML: • AutoML methods built on top of OpenML get increasingly better as more meta-data is added
  • 61. Auto-sklearn: find similar datasets on OpenML, warm-start with their best configurations (Matthias Feurer et al. (2016) NIPS). Adaptive Bayesian Linear Regression: build surrogate models from prior data on many OpenML tasks, coupled through a neural net trained on evaluations on the new dataset (Valerio Perrone et al. (2017) MetaLearning@NIPS). M E TA - L E A R N I N G : WA R M S TA RT I N G
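The k-nearest-datasets warm-start idea can be sketched in a few lines: describe each dataset by metafeatures, rank past datasets by similarity, and seed the search with the configurations that won there. The meta-database, metafeatures, and configuration names below are hypothetical stand-ins for OpenML's real meta-data.

```python
import math

# Hypothetical meta-database: per-dataset metafeatures
# (n_instances, n_features, class_balance) and the configurations that
# worked best there (stand-ins for OpenML's real meta-data).
META = {
    "d1": {"features": (1000, 20, 0.5),    "best": ["cfgA", "cfgB"]},
    "d2": {"features": (50000, 300, 0.1),  "best": ["cfgC", "cfgA"]},
    "d3": {"features": (1200, 25, 0.45),   "best": ["cfgB", "cfgD"]},
}

def distance(f, g):
    # Euclidean distance on log-scaled metafeatures, so dataset size
    # does not dominate the comparison.
    return math.dist([math.log1p(x) for x in f],
                     [math.log1p(x) for x in g])

def warm_start(new_features, k=2):
    # Rank known datasets by similarity, then seed the search with the
    # best configurations of the k most similar ones (deduplicated,
    # most-similar dataset first).
    ranked = sorted(META,
                    key=lambda d: distance(new_features, META[d]["features"]))
    seeds = []
    for d in ranked[:k]:
        for cfg in META[d]["best"]:
            if cfg not in seeds:
                seeds.append(cfg)
    return seeds

print(warm_start((900, 18, 0.48)))
```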
  • 62. SVM hyperparameter importance according to OpenML Jan van Rijn et al. (2017) AutoML@ICML Many other possible ways: - Learn which hyperparameters should be tuned at all - Learn good value ranges - Learn good default values M E TA - L E A R N I N G H Y P E R PA R A M E T E R I M P O RTA N C E
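The importance analysis above is based on functional ANOVA over OpenML results; a radically simplified, variance-based sketch of the same idea fits in a few lines. The score table below is an illustrative stand-in for measured SVM performance, constructed so that gamma matters far more than C.

```python
from itertools import product
from statistics import mean, pvariance

# Hypothetical performance table over a small 2-hyperparameter grid
# (values are illustrative stand-ins for measured scores).
C_vals = [0.1, 1, 10]
gamma_vals = [0.001, 0.01, 0.1]
score = {(C, g): 0.6 + 0.02 * C_vals.index(C) + 0.10 * gamma_vals.index(g)
         for C, g in product(C_vals, gamma_vals)}

def marginal_importance(axis):
    # fANOVA-style idea, much simplified: what fraction of the total
    # score variance is explained by this hyperparameter alone,
    # averaging over all values of the other one?
    grid = [C_vals, gamma_vals]
    marginals = []
    for v in grid[axis]:
        rows = [s for cfg, s in score.items() if cfg[axis] == v]
        marginals.append(mean(rows))
    return pvariance(marginals) / pvariance(score.values())

print("importance of C:    ", round(marginal_importance(0), 3))
print("importance of gamma:", round(marginal_importance(1), 3))
```

Because the toy table has no interaction term, the two importances sum to one; real fANOVA also attributes variance to interactions between hyperparameters.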
  • 63. Track score while adding: - irrelevant features (dotted) - redundant features (solid) - Redundant features are bad (collinearity): SVMs, NBayes - RandomForests, Boosting very sensitive to both! - Tuning hyperparameters doesn't help -> select features instead M E TA - L E A R N I N G : P R O F I L I N G
  • 64. Add white noise to features. - RandomForests, Boosting, NBayes very sensitive - kNN, SVM much less - Hyperparameter tuning helps (regularization)! M E TA - L E A R N I N G : P R O F I L I N G
  • 65. Add noise to labels - All algorithms are hurt - NBayes, kNN slightly more robust - Tuning doesn't help. M E TA - L E A R N I N G : P R O F I L I N G
  • 66. • 100+ dataset meta-features and 100,000+ experiment runtimes • Build meta-models (Random Forest works well) • Some algorithms are more predictable than others M E TA - L E A R N I N G : R U N T I M E P R E D I C T I O N
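A minimal sketch of such a runtime meta-model, using a k-nearest-neighbour regressor in log-metafeature space. The meta-database below is hypothetical (a handful of made-up runs for one algorithm); the real setup fits a Random Forest on 100+ metafeatures and 100,000+ observed runtimes.

```python
import math

# Hypothetical meta-database: (n_instances, n_features) -> observed
# training time in seconds for one algorithm (illustrative numbers).
runs = [
    ((1_000, 10),    0.5),
    ((10_000, 10),   4.8),
    ((1_000, 100),   3.9),
    ((100_000, 50), 210.0),
    ((50_000, 20),  60.0),
]

def predict_runtime(meta, k=2):
    # k-nearest-neighbour meta-model: average the runtimes of the k most
    # similar past runs, with similarity measured on log-scaled
    # metafeatures so dataset size doesn't dominate.
    def dist(a, b):
        return math.dist([math.log(x) for x in a],
                         [math.log(x) for x in b])
    nearest = sorted(runs, key=lambda r: dist(r[0], meta))[:k]
    return sum(t for _, t in nearest) / k

print(round(predict_runtime((20_000, 15)), 1))
```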
  • 67. A U T O - N O T E B O O K S Automatically generate notebooks with interesting analyses for any OpenML dataset (in progress):
  • 68. R E A L - W O R L D E X A M P L E : D R U G D I S C O V E RY • Predict how well a drug inhibits protein function (and diseases depending on it) • Problem: lab work (data) is expensive • Solution: • use all available data on all available protein targets • use meta-learning to learn how to build optimal models for specific proteins
  • 69. • Auto-builds significantly better pipelines/models (Olier et al., Machine Learning 107(1), 2018) • When data is scarce, build better models by transfer from genetically similar targets (Sadawi et al., ChemInformatics, under review) • Use trained models to build better molecule representations • Collaboration with Exscientia Ltd (partner of GlaxoSmithKline) • ChEMBL DB: 1.4M compounds, 10k proteins, 12.8M activities • [Figure: table of molecule representations with columns MW, LogP, TPSA, b1–b9] • Pipeline: 16,000 regression datasets x 52 pipelines (on OpenML) -> meta-model -> all data on new protein -> optimal models to predict activity