SlideShare une entreprise Scribd logo
1  sur  67
Télécharger pour lire hors ligne
T O WA R D S A
D ATA S C I E N C E
C O L L A B O R AT O RY
J O A Q U I N VA N S C H O R E N , B E R N D B I S C H L , F R A N K H U T T E R , M I C H E L E
S E B A G , B A L A Z S K E G L , M A T T H I A S S C H M I D , G I U L I O N A P O L I TA N O , K A T Y
W O L S T E N C R O F T, A L A N R . W I L L I A M S , N E I L L A W R E N C E
I D A 2 0 1 5
Data science ecosystem
Domain scientist: collects and
analyses scientific data to study
or discover phenomena
Data scientist: designs and
analyses algorithms to solve
specific problems or benchmark
them on a collection of datasets
Computer scientist:
implements, optimises and
maintains existing algorithms, for
robust use by many different
people (often industry)
Data
scientist
Domain
scientist
Computer
Scientists
Gap: domain and data scientists
Domain scientists:
- doubts about latest/best
methods
- simple models on complex data
Data scientists:
- don’t speak the language of
scientists, misses opportunities
- don’t know how to access rich
scientific databases
- complex models on simple
(old) data
Data
scientist
Domain
scientist
Computer
Scientists
M A C H I N E L E A R N I N G R E S E A R C H
Complex because we assume that the researcher is an expert
Modify learning
algorithm
Re-code
features
Tweak
parameters
Check
calibration
Discover
leakage
Upsample
minority class
Feature
selection Thanks to R. Caruana
Gap: domain and data scientists
Slows down research
- Knowledge transfer through
literature is slow (jargon, tacit
knowledge), limited
- Research output has little cross-
domain relevance
- Collaborations are small-scale
(biased), take time
- Friction: understanding data
formats, code, email, contracts
takes too long
- Lots of time spent on tasks that
others could do much faster, or
automated altogether
Data
scientist
Domain
scientist
Computer
Scientists
Gap: academia and industry
Innovative enterprises:
- debilitated by lack of access to
expertise and data
- large companies collect/buy data
and talent
- limited interaction (code, appl’s)
Healthcare:
- complex diseases (cancer,
Alzheimer’s, drug resistant viruses)
- heavily interconnected data
(genotype/phenotype), novel
methods needed (missing, hd data)
- requires large-scale, real-time
collaboration, privacy trade-offs
Data
scientist
Domain
scientist
Computer
Scientists
Challenge: manpower
Data scientist shortage
- McKinsey: 140k-190k data
scientists needed in USA, by 2018
- large companies buy talent,
inhibits societal and scientific goals
(healthcare, climate,…)
Data science needs to scale
- training armies of data scientists?
- enable frictionless collaboration:
easy access to best tools, data,
training materials
- democratize data science
- automate drudge work
Data
scientist
Domain
scientist
Computer
Scientists
85% medical research resources are wasted
- underpowered, irreproducible research
- associations/effects are false, exaggerated
- flexibility in experiment design, statistical analysis
- translation into applications is inefficient, impossible
Some subdomains, e.g. genetic epidemiology, score better:
- Large-scale, interdisciplinary collaboration
- Strong replication culture
- Open sharing of data, protocols, software
- Adopting better (statistical) methods
- Continued education in research methods, statistical literacy
magnetopause
bow
shock
solar wind
solar wind
magnetosheath
PLASMA PHYSICS
SURFACE WAVES AND
PROCESSES
Nicolas Aunai
LABELLING (CATALOGUING) OF EVENTS STILL DONE MANUALLY
Research different.
Polymaths: Solve math problems through
massive collaboration (not competition)
Broadcast question, combine
many minds to solve it
Solved hard problems in weeks
Many (joint) publications
Research different.
SDSS: Robotic telescope, data publicly online (SkyServer)
+1 million distinct users
vs. 10.000 astronomers
Broadcast data, allow many minds to ask the right questions
Thousands of papers, citations
Next: SynopticTelescope
Research different.
How do you label a million galaxies?
Offer right tools so that anybody can contribute, instantly
Many novel discoveries by scientists and citizens
More? Read ‘Networked science’ by M. Nielsen
Why? Designed serendipity
Broadcasting data fosters spontaneous,
unexpected discoveries
What’s hard for one scientist
is easy for another: connect minds
How? Remove friction
Organized body of compatible
scientific data (and tools) online
Micro-contributions: seconds, not days
Easy, organised communication
Track who did what, give credit
A data science collaboratory: inputs
Millions of real, open datasets are generated
• Drug activity, gene expressions, astronomical observations, text,…
Extensive toolboxes exist to analyse data
• MLR, SKLearn, RapidMiner, KNIME, WEKA, AzureML,…
Massive amounts of experiments are run, most of this
information is lost forever (in people’s heads)
• No learning across people, labs, fields
• Start from scratch, slows research and innovation
Computer scientists: Connect data science
environments to network, automate sharing
A data science collaboratory: actors
Data scientists: Design/run algorithms, share results.
Automate algorithm selection, optimisation
Domain scientists: Extract, share actionable datasets
from large (open) scientific databases
D A TA S C I E N C E C O L L A B O R A T O RY
Organized data: Data and tools easy to find, search. Experiments connected
to data, code, people anywhere
Easy to use: Automated, reproducible sharing from data science tools
Easy to contribute: Post single dataset, algorithm, experiment, comment
Reward structure: Build reputation and trust (e-citations, social interaction)
Change scale of collaboration
• ‘GitHub for data science’: start projects, commit data +
experiments, issue questions
• Scientists can issue (open or limited) invitation to work on
something: analyse given data, solve a problem,..
• Others respond by adding data, running experiments
showing how well given techniques work
• Collaboratory organises everything, everyone can build
directly on each other’s latest results
• Global organisation: state-of-the-art reference for
scientists, industry, students
Change speed of collaboration
• Easy to use so that researchers can focus on their work,
not on drudge work
• Share automatically from your favourite data science tool,
infrastructure
• Automated annotations of data, code (meta-data)
• Collaborate in real-time: stream results, discuss online
• `Bots’ that automatically run machine learning algorithms,
clean data, intelligently optimise parameter (lots of data
to push research in this direction)
New incentives
• Track and clearly show who contributed what to a study.
Link online study to paper publications.
• Build online reputation: Track reuse of shared data,
algorithms, results (e-citations)
• Save lots of time, collaborate more often, publish more
Pre-network era
• Analysing new data, she tries
many techniques (many fail),
and dives into literature.
• Runs expensive grid searches,
painstakingly keeping logs
• She hunts down and tries to
reproduce prior work
• Reviewers take months to
read, ask her to rerun things
because she made mistakes
• Annotate everything and
upload to central database
• 1 year later, fails to reproduce
Elisa’s story
Collaboratory
• She uploads data, gets automatic
analysis, list of similar data and
studies, prior results, people
• All experiments are automatically
uploaded, organised, compared
• Prior results are auto-linked, she
can reuse or reproduce them
• Elisa connects to trusted people
who comment on her work in real
time, reviewers are impressed
• A plugin auto-exports her data in
full detail
• Her work is easily reused, cited
Building blocks: data
• Scientific data registries
• Bio-informatics: Many databases,
common data formats (FASTA),
model formats (SBML)
• Pharmacology: data exploration
portals: OpenPHACTS, BioGPS,
MyGene.info
Building blocks: code
Code: GitHub, Python/Jupiter Notebooks, KnitR, GitXiv,
recomputation.org, Docker
Workflows: myExperiment, SHIWA, Galaxy Tool Shed
Platforms:
• BioSharing.org: registry for databases, standards,
policies in biological and biomedical sciences
• Kaggle: data mining challenge platform
• SEEK: store, link data and mathematical models,
ResearchObjects: bundles of paper, data, code
Initiatives:
• Data Carpentry: Workshops for new data scientists
• STRATOS Initiative: guidelines for design and analysis of
observational medical research
• ISA-tools: standards for describing and aggregating
studies
F R I C T I O N L E S S , N E T W O R K E D M A C H I N E L E A R N I N G
Easy to use: Integrated in ML environments.Automated, reproducible sharing
Organized data: Experiments connected to data, code, people anywhere
Easy to contribute: Post single dataset, algorithm, experiment, comment
Reward structure*: Build reputation and trust (e-citations, social interaction)
OpenML
Data (ARFF) uploaded or referenced, versioned
analysed, characterized, organised online
analysed, characterized, organised online
analysed, characterized, organised online
Tasks contain data, goals, procedures.
Readable by tools, automates experimentation
Train-test
splits
Evaluation
measure
Results organized online: realtime overview
Results organized online: realtime overview
Train-test
splits
Evaluation
measure
Flows (code) from various tools, versioned.
Integrations + APIs (REST, Java, R, Python,…)
Integrations + APIs (REST, Java, R, Python,…)
reproducible, linked to data, flows and authors
Experiments auto-uploaded, evaluated online
reproducible, linked to data, flows and authors
Experiments auto-uploaded, evaluated online
Experiments auto-uploaded, evaluated online
Explore
Reuse
Share
• Search by
keywords or
properties
• Filters
• Tagging
Data
• Wiki-like
descriptions
• Analysis and
visualisation of
features
Data
• Wiki-like
descriptions
• Analysis and
visualisation of
features
• Auto-calculation
of large range
of meta-features
Data
• Simple: Number/perc/log of instances, classes, features, dimensionality,
NAs, …
• Statistical: Default acc, Min/max/mean distinct values, skewness, kurtosis,
stdev, mutual information, …
• Information-theoretic: Entropy (class, num/nom feats), signal-noise ratio,…
• Landmarkers (AUC, ErrRate, Kappa): Decision stumps, Naive Bayes, kNN,
RandomTree, J48(pruned), JRip, NBTree, lin.SVM, …
• Subsampling landmarkers (partial learning curves)
• Streaming landmarkers: Evaluations in previous window, previous best,
per-algorithm significant win/loss
• Change detectors: HoeffdingAdwin, HoeffdingDDM
• Domain specific meta-features (e.g. molecular descriptors)
• Many not included yet
Meta-features (120+)
• Example: Classification
on click prediction
dataset, using 10-fold
CV and AUC
• People submit results
(e.g. predictions)
• Server-side evaluation
(many measures)
• All results organized
online, per algorithm,
parameter setting
• Online visualizations:
every dot is a run
plotted by score
Tasks
• Leaderboards visualize progress over time: who delivered breakthroughs
when, who built on top of previous solutions
• Collaborative: all code and data available, learn from others
• Real-time: clear who submitted first, others can improve immediately
Classroom challenges
Classroom challenges
• All results obtained with same flow organised online
• Results linked to data sets, parameter settings -> trends/comparisons
• Visualisations (dots are models, ranked by score, colored by parameters)
• Detailed run info
• Author, data, flow,
parameter settings,
result files, …
• Evaluation details
(e.g., results per
sample)
Share
Explore
Reuse
R API
Idem for Java, Python
Tutorial: http://www.openml.org/guide
List datasets List tasks
List flows
List runs and evaluations
R API
Idem for Java, Python
Tutorial: http://www.openml.org/guide
Download datasets Download tasks
Download flows
R API
Download run
Download run with predictions
Idem for Java, Python
Tutorial: http://www.openml.org/guide
Reuse
Explore
Share
WEKA
OpenML extension
in plugin manager
MOA
RapidMiner: 3 new operators
R API
Run a task
(with mlr):
And upload:
Python Notebooks
Directly load data in notebooks through API
Studies (e-papers)
- Online counterpart of a paper, linkable
- Add data, code, experiments (new or old)
- Public or shared within circle
Circles
Create collaborations with trusted researchers
Share results within team prior to publication
Altmetrics
- Measure real impact of your work
- Reuse, downloads, likes of data, code, projects,…
- Online reputation (more sharing)
Online collaboration (soon)
OpenML in drug discovery
ChEMBL database
SMILEs'
Molecular'
proper0es'
(e.g.'MW,'LogP)'
Fingerprints'
(e.g.'FP2,'FP3,'
FP4,'MACSS)'
MW #LogP #TPSA #b1 #b2 #b3 #b4 #b5 #b6 #b7 #b8 #b9#
!!
377.435 !3.883 !77.85 !1 !1 !0 !0 !0 !0 !0 !0 !0 !!
341.361 !3.411 !74.73 !1 !1 !0 !1 !0 !0 !0 !0 !0 !…!
197.188 !-2.089 !103.78 !1 !1 !0 !1 !0 !0 !0 !1 !0 !!
346.813 !4.705 !50.70 !1 !0 !0 !1 !0 !0 !0 !0 !0!
! ! ! ! ! ! !.!
! ! ! ! ! ! !: !!
10.000+ regression datasets
1.4M compounds,10k proteins,
12,8M activities
Predict which compounds (drugs) will inhibit certain proteins (and hence viruses, parasites,…)
2750 targets have >10 compounds, x 4 fingerprints
OpenML in drug discovery
cforest
ctree
earth
fnn
gbm
glmnet
lassolmnnetpcr
plsr
rforest
ridge
rtree
ECFP4_1024* FCFP4_1024* ECFP6_1024+ FCFP6_1024*
0.697* 0.427* 0.725+ 0.627*
|
msdfeat< 0.1729
mutualinfo< 0.7182
skewresp< 0.3315
nentropyfeat< 1.926
netcharge1>=0.9999
netcharge3< 13.85
hydphob21< −0.09666
glmnet
fnn pcr
rforest
ridge rforest
rforest ridge
Predict best algorithm with meta-models
MW #LogP #TPSA #b1 #b2 #b3 #b4 #b5 #b6 #b7 #b8 #b9#
!!
377.435 !3.883 !77.85 !1 !1 !0 !0 !0 !0 !0 !0 !0 !!
341.361 !3.411 !74.73 !1 !1 !0 !1 !0 !0 !0 !0 !0 !…!
197.188 !-2.089 !103.78 !1 !1 !0 !1 !0 !0 !0 !1 !0 !!
346.813 !4.705 !50.70 !1 !0 !0 !1 !0 !0 !0 !0 !0!
! ! ! ! ! ! !.!
! ! ! ! ! ! !: !!
Metafeatures:
- simple, statistical, info-theoretic, landmarkers
- target: aliphatic index, hydrophobicity, net
charge, mol. weight, sequence length, …
I. Olier et al. MetaSel@ECMLPKDD 2015
Best
algorithm
overall
(10f CV)
I. Olier et al. MetaSel@ECMLPKDD 2015
Random
Forest
meta-classifier
(10f CV)
OpenML Community
Jan-Jun 2015
Used all over the world
Users: 530+ registered, 400 7-day, 1700 30-day active
1000s of datasets, flows, 450000+ runs
- Open Source, on GitHub
- Regular workshops, hackathons
Join OpenML
Next workshop:
- Lorentz Center (Leiden),
14-18 March 2016
T H A N K Y O U
Joaquin Vanschoren
Jan van Rijn
Bernd Bischl
Matthias Feurer
Michel Lang
Nenad Tomašev
Giuseppe Casalicchio
Luis Torgo
You?
#OpenMLFarzan Majdani
Jakob Bossek

Contenu connexe

Plus de Joaquin Vanschoren

Plus de Joaquin Vanschoren (12)

Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine Learning
 
OpenML Reproducibility in Machine Learning ICML2017
OpenML Reproducibility in Machine Learning ICML2017OpenML Reproducibility in Machine Learning ICML2017
OpenML Reproducibility in Machine Learning ICML2017
 
OpenML DALI
OpenML DALIOpenML DALI
OpenML DALI
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
OpenML Tutorial: Networked Science in Machine Learning
OpenML Tutorial: Networked Science in Machine LearningOpenML Tutorial: Networked Science in Machine Learning
OpenML Tutorial: Networked Science in Machine Learning
 
Data science
Data scienceData science
Data science
 
OpenML 2014
OpenML 2014OpenML 2014
OpenML 2014
 
Open Machine Learning
Open Machine LearningOpen Machine Learning
Open Machine Learning
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop sensordata part2
Hadoop sensordata part2Hadoop sensordata part2
Hadoop sensordata part2
 
Hadoop sensordata part1
Hadoop sensordata part1Hadoop sensordata part1
Hadoop sensordata part1
 
Hadoop sensordata part3
Hadoop sensordata part3Hadoop sensordata part3
Hadoop sensordata part3
 

Dernier

Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 

Dernier (20)

Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 

Towards a Data Science Collaboratory

  • 1. T O WA R D S A D ATA S C I E N C E C O L L A B O R AT O RY J O A Q U I N VA N S C H O R E N , B E R N D B I S C H L , F R A N K H U T T E R , M I C H E L E S E B A G , B A L A Z S K E G L , M A T T H I A S S C H M I D , G I U L I O N A P O L I TA N O , K A T Y W O L S T E N C R O F T, A L A N R . W I L L I A M S , N E I L L A W R E N C E I D A 2 0 1 5
  • 2. Data science ecosystem Domain scientist: collects and analyses scientific data to study or discover phenomena Data scientist: designs and analyses algorithms to solve specific problems or benchmark them on a collection of datasets Computer scientist: implements, optimises and maintains existing algorithms, for robust use by many different people (often industry) Data scientist Domain scientist Computer Scientists
  • 3. Gap: domain and data scientists Domain scientists: - doubts about latest/best methods - simple models on complex data Data scientists: - don’t speak the language of scientists, misses opportunities - don’t know how to access rich scientific databases - complex models on simple (old) data Data scientist Domain scientist Computer Scientists
  • 4. M A C H I N E L E A R N I N G R E S E A R C H Complex because we assume that the researcher is an expert Modify learning algorithm Re-code features Tweak parameters Check calibration Discover leakage Upsample minority class Feature selection Thanks to R. Caruana
  • 5. Gap: domain and data scientists Slows down research - Knowledge transfer through literature is slow (jargon, tacit knowledge), limited - Research output has little cross- domain relevance - Collaborations are small-scale (biased), take time - Friction: understanding data formats, code, email, contracts takes too long - Lots of time spent on tasks that others could do much faster, or automated altogether Data scientist Domain scientist Computer Scientists
  • 6. Gap: academia and industry Innovative enterprises: - debilitated by lack of access to expertise and data - large companies collect/buy data and talent - limited interaction (code, appl’s) Healthcare: - complex diseases (cancer, Alzheimer’s, drug resistant viruses) - heavily interconnected data (genotype/phenotype), novel methods needed (missing, hd data) - requires large-scale, real-time collaboration, privacy trade-offs Data scientist Domain scientist Computer Scientists
  • 7. Challenge: manpower Data scientist shortage - McKinsey: 140k-190k data scientists needed in USA, by 2018 - large companies buy talent, inhibits societal and scientific goals (healthcare, climate,…) Data science needs to scale - training armies of data scientists? - enable frictionless collaboration: easy access to best tools, data, training materials - democratize data science - automate drudge work Data scientist Domain scientist Computer Scientists
  • 8.
  • 9. 85% medical research resources are wasted - underpowered, irreproducible research - associations/effects are false, exaggerated - flexibility in experiment design, statistical analysis - translation into applications is inefficient, impossible Some subdomains, e.g. genetic epidemiology, score better: - Large-scale, interdisciplinary collaboration - Strong replication culture - Open sharing of data, protocols, software - Adopting better (statistical) methods - Continued education in research methods, statistical literacy
  • 10. magnetopause bow shock solar wind solar wind magnetosheath PLASMA PHYSICS SURFACE WAVES AND PROCESSES Nicolas Aunai
  • 11. LABELLING (CATALOGUING) OF EVENTS STILL DONE MANUALLY
  • 12. Research different. Polymaths: Solve math problems through massive collaboration (not competition) Broadcast question, combine many minds to solve it Solved hard problems in weeks Many (joint) publications
  • 13. Research different. SDSS: Robotic telescope, data publicly online (SkyServer) +1 million distinct users vs. 10.000 astronomers Broadcast data, allow many minds to ask the right questions Thousands of papers, citations Next: SynopticTelescope
  • 14. Research different. How do you label a million galaxies? Offer right tools so that anybody can contribute, instantly Many novel discoveries by scientists and citizens More? Read ‘Networked science’ by M. Nielsen
  • 15. Why? Designed serendipity Broadcasting data fosters spontaneous, unexpected discoveries What’s hard for one scientist is easy for another: connect minds How? Remove friction Organized body of compatible scientific data (and tools) online Micro-contributions: seconds, not days Easy, organised communication Track who did what, give credit
  • 16. A data science collaboratory: inputs Millions of real, open datasets are generated • Drug activity, gene expressions, astronomical observations, text,… Extensive toolboxes exist to analyse data • MLR, SKLearn, RapidMiner, KNIME, WEKA, AzureML,… Massive amounts of experiments are run, most of this information is lost forever (in people’s heads) • No learning across people, labs, fields • Start from scratch, slows research and innovation
  • 17. Computer scientists: Connect data science environments to network, automate sharing A data science collaboratory: actors Data scientists: Design/run algorithms, share results. Automate algorithm selection, optimisation Domain scientists: Extract, share actionable datasets from large (open) scientific databases
  • 18. D A TA S C I E N C E C O L L A B O R A T O RY Organized data: Data and tools easy to find, search. Experiments connected to data, code, people anywhere Easy to use: Automated, reproducible sharing from data science tools Easy to contribute: Post single dataset, algorithm, experiment, comment Reward structure: Build reputation and trust (e-citations, social interaction)
  • 19. Change scale of collaboration • ‘GitHub for data science’: start projects, commit data + experiments, issue questions • Scientists can issue (open or limited) invitation to work on something: analyse given data, solve a problem,.. • Others respond by adding data, running experiments showing how well given techniques work • Collaboratory organises everything, everyone can build directly on each other’s latest results • Global organisation: state-of-the-art reference for scientists, industry, students
  • 20. Change speed of collaboration • Easy to use so that researchers can focus on their work, not on drudge work • Share automatically from your favourite data science tool, infrastructure • Automated annotations of data, code (meta-data) • Collaborate in real-time: stream results, discuss online • `Bots’ that automatically run machine learning algorithms, clean data, intelligently optimise parameter (lots of data to push research in this direction)
  • 21. New incentives • Track and clearly show who contributed what to a study. Link online study to paper publications. • Build online reputation: Track reuse of shared data, algorithms, results (e-citations) • Save lots of time, collaborate more often, publish more
  • 22. Pre-network era • Analysing new data, she tries many techniques (many fail), and dives into literature. • Runs expensive grid searches, painstakingly keeping logs • She hunts down and tries to reproduce prior work • Reviewers take months to read, ask her to rerun things because she made mistakes • Annotate everything and upload to central database • 1 year later, fails to reproduce Elisa’s story Collaboratory • She uploads data, gets automatic analysis, list of similar data and studies, prior results, people • All experiments are automatically uploaded, organised, compared • Prior results are auto-linked, she can reuse or reproduce them • Elisa connects to trusted people who comment on her work in real time, reviewers are impressed • A plugin auto-exports her data in full detail • Her work is easily reused, cited
  • 23. Building blocks: data • Scientific data registries • Bio-informatics: Many databases, common data formats (FASTA), model formats (SBML) • Pharmacology: data exploration portals: OpenPHACTS, BioGPS, MyGene.info
  • 24. Building blocks: code Code: GitHub, Python/Jupiter Notebooks, KnitR, GitXiv, recomputation.org, Docker Workflows: myExperiment, SHIWA, Galaxy Tool Shed
  • 25. Platforms: • BioSharing.org: registry for databases, standards, policies in biological and biomedical sciences • Kaggle: data mining challenge platform • SEEK: store, link data and mathematical models, ResearchObjects: bundles of paper, data, code Initiatives: • Data Carpentry: Workshops for new data scientists • STRATOS Initiative: guidelines for design and analysis of observational medical research • ISA-tools: standards for describing and aggregating studies
  • 26. F R I C T I O N L E S S , N E T W O R K E D M A C H I N E L E A R N I N G Easy to use: Integrated in ML environments.Automated, reproducible sharing Organized data: Experiments connected to data, code, people anywhere Easy to contribute: Post single dataset, algorithm, experiment, comment Reward structure*: Build reputation and trust (e-citations, social interaction) OpenML
  • 27. Data (ARFF) uploaded or referenced, versioned analysed, characterized, organised online
  • 30. Tasks contain data, goals, procedures. Readable by tools, automates experimentation Train-test splits Evaluation measure Results organized online: realtime overview
  • 31. Results organized online: realtime overview Train-test splits Evaluation measure
  • 32. Flows (code) from various tools, versioned. Integrations + APIs (REST, Java, R, Python,…)
  • 33. Integrations + APIs (REST, Java, R, Python,…)
  • 34. reproducible, linked to data, flows and authors Experiments auto-uploaded, evaluated online
  • 35. reproducible, linked to data, flows and authors Experiments auto-uploaded, evaluated online
  • 38.
  • 39. • Search by keywords or properties • Filters • Tagging Data
  • 40. • Wiki-like descriptions • Analysis and visualisation of features Data
  • 41. • Wiki-like descriptions • Analysis and visualisation of features • Auto-calculation of large range of meta-features Data
  • 42. • Simple: Number/perc/log of instances, classes, features, dimensionality, NAs, … • Statistical: Default acc, Min/max/mean distinct values, skewness, kurtosis, stdev, mutual information, … • Information-theoretic: Entropy (class, num/nom feats), signal-noise ratio,… • Landmarkers (AUC, ErrRate, Kappa): Decision stumps, Naive Bayes, kNN, RandomTree, J48(pruned), JRip, NBTree, lin.SVM, … • Subsampling landmarkers (partial learning curves) • Streaming landmarkers: Evaluations in previous window, previous best, per-algorithm significant win/loss • Change detectors: HoeffdingAdwin, HoeffdingDDM • Domain specific meta-features (e.g. molecular descriptors) • Many not included yet Meta-features (120+)
  • 43. • Example: Classification on click prediction dataset, using 10-fold CV and AUC • People submit results (e.g. predictions) • Server-side evaluation (many measures) • All results organized online, per algorithm, parameter setting • Online visualizations: every dot is a run plotted by score Tasks
  • 44. • Leaderboards visualize progress over time: who delivered breakthroughs when, who built on top of previous solutions • Collaborative: all code and data available, learn from others • Real-time: clear who submitted first, others can improve immediately
  • 47. • All results obtained with same flow organised online • Results linked to data sets, parameter settings -> trends/comparisons • Visualisations (dots are models, ranked by score, colored by parameters)
  • 48. • Detailed run info • Author, data, flow, parameter settings, result files, … • Evaluation details (e.g., results per sample)
  • 50. R API Idem for Java, Python Tutorial: http://www.openml.org/guide List datasets List tasks List flows List runs and evaluations
  • 51. R API Idem for Java, Python Tutorial: http://www.openml.org/guide Download datasets Download tasks Download flows
  • 52. R API Download run Download run with predictions Idem for Java, Python Tutorial: http://www.openml.org/guide
  • 55. MOA
  • 56. RapidMiner: 3 new operators
  • 57. R API Run a task (with mlr): And upload:
  • 58. Python Notebooks Directly load data in notebooks through API
  • 59. Studies (e-papers) - Online counterpart of a paper, linkable - Add data, code, experiments (new or old) - Public or shared within circle Circles Create collaborations with trusted researchers Share results within team prior to publication Altmetrics - Measure real impact of your work - Reuse, downloads, likes of data, code, projects,… - Online reputation (more sharing) Online collaboration (soon)
  • 60. OpenML in drug discovery ChEMBL database SMILEs' Molecular' proper0es' (e.g.'MW,'LogP)' Fingerprints' (e.g.'FP2,'FP3,' FP4,'MACSS)' MW #LogP #TPSA #b1 #b2 #b3 #b4 #b5 #b6 #b7 #b8 #b9# !! 377.435 !3.883 !77.85 !1 !1 !0 !0 !0 !0 !0 !0 !0 !! 341.361 !3.411 !74.73 !1 !1 !0 !1 !0 !0 !0 !0 !0 !…! 197.188 !-2.089 !103.78 !1 !1 !0 !1 !0 !0 !0 !1 !0 !! 346.813 !4.705 !50.70 !1 !0 !0 !1 !0 !0 !0 !0 !0! ! ! ! ! ! ! !.! ! ! ! ! ! ! !: !! 10.000+ regression datasets 1.4M compounds,10k proteins, 12,8M activities Predict which compounds (drugs) will inhibit certain proteins (and hence viruses, parasites,…) 2750 targets have >10 compounds, x 4 fingerprints
  • 61.
  • 62. OpenML in drug discovery cforest ctree earth fnn gbm glmnet lassolmnnetpcr plsr rforest ridge rtree ECFP4_1024* FCFP4_1024* ECFP6_1024+ FCFP6_1024* 0.697* 0.427* 0.725+ 0.627* | msdfeat< 0.1729 mutualinfo< 0.7182 skewresp< 0.3315 nentropyfeat< 1.926 netcharge1>=0.9999 netcharge3< 13.85 hydphob21< −0.09666 glmnet fnn pcr rforest ridge rforest rforest ridge Predict best algorithm with meta-models MW #LogP #TPSA #b1 #b2 #b3 #b4 #b5 #b6 #b7 #b8 #b9# !! 377.435 !3.883 !77.85 !1 !1 !0 !0 !0 !0 !0 !0 !0 !! 341.361 !3.411 !74.73 !1 !1 !0 !1 !0 !0 !0 !0 !0 !…! 197.188 !-2.089 !103.78 !1 !1 !0 !1 !0 !0 !0 !1 !0 !! 346.813 !4.705 !50.70 !1 !0 !0 !1 !0 !0 !0 !0 !0! ! ! ! ! ! ! !.! ! ! ! ! ! ! !: !! Metafeatures: - simple, statistical, info-theoretic, landmarkers - target: aliphatic index, hydrophobicity, net charge, mol. weight, sequence length, …
  • 63. I. Olier et al. MetaSel@ECMLPKDD 2015 Best algorithm overall (10f CV)
  • 64. I. Olier et al. MetaSel@ECMLPKDD 2015 Random Forest meta-classifier (10f CV)
  • 65. OpenML Community Jan-Jun 2015 Used all over the world Users: 530+ registered, 400 7-day, 1700 30-day active 1000s of datasets, flows, 450000+ runs
  • 66. - Open Source, on GitHub - Regular workshops, hackathons Join OpenML Next workshop: - Lorentz Center (Leiden), 14-18 March 2016
  • 67. T H A N K Y O U Joaquin Vanschoren Jan van Rijn Bernd Bischl Matthias Feurer Michel Lang Nenad Tomašev Giuseppe Casalicchio Luis Torgo You? #OpenMLFarzan Majdani Jakob Bossek