Data-driven research requires many people from different domains to collaborate efficiently. The domain scientist collects and analyzes scientific data, the data scientist develops new techniques, and the tool developer implements, optimizes and maintains existing techniques to be used throughout science and industry. Today, however, this data science expertise lies fragmented in loosely connected communities and scattered over many people, making it very hard to find the right expertise, data and tools at the right time. Collaborations are typically small and cross-domain knowledge transfer through the literature is slow. Although progress has been made, it is far from easy for one to build on the latest results of the other and collaborate effortlessly across domains. This slows down data-driven research and innovation, drives up costs and exacerbates the risks associated with the inappropriate use of data science techniques. We propose to create an open, online collaboration platform, a 'collaboratory' for data-driven research, that brings together data scientists, domain scientists and tool developers on the same platform. It will enable data scientists to evaluate their latest techniques on many current scientific datasets, allow domain scientists to discover which techniques work best on their data, and engage tool developers to share in the latest developments. It will change the scale of collaborations from small to potentially massive, and from periodic to real-time. This will be an inclusive movement operating across academia, healthcare, and industry, and empower more students to engage in data science.
Towards a Data Science Collaboratory
1. Towards a Data Science Collaboratory
Joaquin Vanschoren, Bernd Bischl, Frank Hutter, Michele Sebag, Balazs Kegl, Matthias Schmid, Giulio Napolitano, Katy Wolstencroft, Alan R. Williams, Neil Lawrence
IDA 2015
2. Data science ecosystem
- Domain scientist: collects and analyses scientific data to study or discover phenomena
- Data scientist: designs and analyses algorithms to solve specific problems, or benchmarks them on a collection of datasets
- Computer scientist: implements, optimises and maintains existing algorithms, for robust use by many different people (often in industry)
3. Gap: domain and data scientists
Domain scientists:
- have doubts about the latest/best methods
- apply simple models to complex data
Data scientists:
- don't speak the language of domain scientists, missing opportunities
- don't know how to access rich scientific databases
- apply complex models to simple (old) data
4. Machine learning research
Complex because we assume that the researcher is an expert:
- Modify the learning algorithm
- Re-code features
- Tweak parameters
- Check calibration
- Discover leakage
- Upsample the minority class
- Feature selection
(Thanks to R. Caruana)
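One of the steps above, upsampling the minority class, can be sketched in a few lines. This is an illustrative, library-free sketch of naive random oversampling (the function name and approach are my own, not from the slides):

```python
import random

def oversample_minority(X, y, seed=0):
    """Naively balance a binary dataset by duplicating random minority-class rows."""
    rng = random.Random(seed)
    counts = {label: y.count(label) for label in set(y)}
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    deficit = counts[majority] - counts[minority]
    # Draw random duplicates until both classes are equally represented
    extra = [rng.choice(minority_rows) for _ in range(deficit)]
    return X + extra, y + [minority] * deficit
```

In practice each such tweak forces another full round of experiments, which is exactly the drudge work the collaboratory aims to organise and share.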
5. Gap: domain and data scientists
Slows down research:
- Knowledge transfer through the literature is slow (jargon, tacit knowledge) and limited
- Research output has little cross-domain relevance
- Collaborations are small-scale (biased) and take time
- Friction: understanding data formats and code, email, contracts all take too long
- Lots of time is spent on tasks that others could do much faster, or that could be automated altogether
6. Gap: academia and industry
Innovative enterprises:
- debilitated by lack of access to expertise and data
- large companies collect/buy data and talent
- limited interaction (code, applications)
Healthcare:
- complex diseases (cancer, Alzheimer's, drug-resistant viruses)
- heavily interconnected data (genotype/phenotype); novel methods needed (missing, high-dimensional data)
- requires large-scale, real-time collaboration and privacy trade-offs
7. Challenge: manpower
Data scientist shortage:
- McKinsey: 140,000-190,000 data scientists needed in the USA by 2018
- large companies buy up talent, which inhibits societal and scientific goals (healthcare, climate, …)
Data science needs to scale:
- training armies of data scientists?
- enable frictionless collaboration: easy access to the best tools, data and training materials
- democratize data science
- automate drudge work
9. 85% of medical research resources are wasted
- underpowered, irreproducible research
- associations/effects are false or exaggerated
- flexibility in experiment design and statistical analysis
- translation into applications is inefficient or impossible
Some subdomains, e.g. genetic epidemiology, score better:
- Large-scale, interdisciplinary collaboration
- Strong replication culture
- Open sharing of data, protocols, software
- Adoption of better (statistical) methods
- Continued education in research methods and statistical literacy
12. Research different.
Polymath projects: solve math problems through massive collaboration (not competition)
- Broadcast a question, combine many minds to solve it
- Solved hard problems in weeks
- Many (joint) publications
13. Research different.
SDSS: robotic telescope, data publicly online (SkyServer)
- 1 million+ distinct users vs. ~10,000 professional astronomers
- Broadcast data, allow many minds to ask the right questions
- Thousands of papers and citations
- Next: the Large Synoptic Survey Telescope
14. Research different.
How do you label a million galaxies?
- Offer the right tools so that anybody can contribute, instantly
- Many novel discoveries by scientists and citizens
More? Read 'Networked Science' by M. Nielsen
15. Why? Designed serendipity
- Broadcasting data fosters spontaneous, unexpected discoveries
- What's hard for one scientist is easy for another: connect minds
How? Remove friction
- An organized body of compatible scientific data (and tools) online
- Micro-contributions: seconds, not days
- Easy, organised communication
- Track who did what, give credit
16. A data science collaboratory: inputs
Millions of real, open datasets are generated
• Drug activity, gene expressions, astronomical observations, text, …
Extensive toolboxes exist to analyse data
• mlr, scikit-learn, RapidMiner, KNIME, WEKA, AzureML, …
Massive amounts of experiments are run, and most of this information is lost forever (it stays in people's heads)
• No learning across people, labs, fields
• Everyone starts from scratch, which slows research and innovation
17. A data science collaboratory: actors
Domain scientists: extract and share actionable datasets from large (open) scientific databases
Data scientists: design/run algorithms, share results; automate algorithm selection and optimisation
Computer scientists: connect data science environments to the network, automate sharing
18. Data science collaboratory
- Organized data: data and tools easy to find and search; experiments connected to data, code and people anywhere
- Easy to use: automated, reproducible sharing from data science tools
- Easy to contribute: post a single dataset, algorithm, experiment or comment
- Reward structure: build reputation and trust (e-citations, social interaction)
19. Change scale of collaboration
• ‘GitHub for data science’: start projects, commit data +
experiments, issue questions
• Scientists can issue (open or limited) invitations to work on something: analyse given data, solve a problem, …
• Others respond by adding data, running experiments
showing how well given techniques work
• Collaboratory organises everything, everyone can build
directly on each other’s latest results
• Global organisation: state-of-the-art reference for
scientists, industry, students
20. Change speed of collaboration
• Easy to use so that researchers can focus on their work,
not on drudge work
• Share automatically from your favourite data science tool,
infrastructure
• Automated annotations of data, code (meta-data)
• Collaborate in real-time: stream results, discuss online
• 'Bots' that automatically run machine learning algorithms, clean data, and intelligently optimise parameters (lots of data to push research in this direction)
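A minimal sketch of what such a parameter-optimising bot might do, assuming plain random search (the slides do not specify the bot's actual strategy; the function and its signature are illustrative only):

```python
import random

def random_search(objective, param_space, n_trials=50, seed=1):
    """Randomly sample parameter settings and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per parameter from its candidate list
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Because every trial could be logged to the collaboratory, such bots would also generate exactly the kind of meta-data needed to make future optimisation smarter.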
21. New incentives
• Track and clearly show who contributed what to a study.
Link online study to paper publications.
• Build online reputation: Track reuse of shared data,
algorithms, results (e-citations)
• Save lots of time, collaborate more often, publish more
22. Elisa's story
Pre-network era:
• Analysing new data, she tries many techniques (many fail) and dives into the literature.
• She runs expensive grid searches, painstakingly keeping logs.
• She hunts down and tries to reproduce prior work.
• Reviewers take months to read her paper, and ask her to rerun things because she made mistakes.
• She annotates everything and uploads it to a central database.
• One year later, she fails to reproduce her own results.
Collaboratory:
• She uploads her data and gets automatic analysis, plus a list of similar data and studies, prior results and people.
• All experiments are automatically uploaded, organised and compared.
• Prior results are auto-linked; she can reuse or reproduce them.
• Elisa connects to trusted people who comment on her work in real time; reviewers are impressed.
• A plugin auto-exports her data in full detail.
• Her work is easily reused and cited.
23. Building blocks: data
• Scientific data registries
• Bioinformatics: many databases, common data formats (FASTA), model formats (SBML)
• Pharmacology: data exploration portals such as OpenPHACTS, BioGPS, MyGene.info
25. Platforms:
• BioSharing.org: registry for databases, standards and policies in the biological and biomedical sciences
• Kaggle: data mining challenge platform
• SEEK: store and link data and mathematical models; ResearchObjects: bundles of paper, data and code
Initiatives:
• Data Carpentry: workshops for new data scientists
• STRATOS Initiative: guidelines for the design and analysis of observational medical research
• ISA-tools: standards for describing and aggregating studies
26. OpenML: frictionless, networked machine learning
- Easy to use: integrated in ML environments; automated, reproducible sharing
- Organized data: experiments connected to data, code and people anywhere
- Easy to contribute: post a single dataset, algorithm, experiment or comment
- Reward structure: build reputation and trust (e-citations, social interaction)
27. Data (ARFF) is uploaded or referenced, versioned, analysed, characterized and organised online
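To make the ARFF format concrete, here is a minimal, illustrative parser for its three sections (`@relation`, `@attribute`, `@data`). This is a sketch of the format's basic structure, not OpenML's actual ingestion code, and it ignores ARFF features such as sparse data and quoted values:

```python
def parse_arff(text):
    """Parse a minimal ARFF string into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, dtype = line.split(None, 2)
            attributes.append((name, dtype))
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append(line.split(","))
    return relation, attributes, rows
```

Because every attribute carries a declared type, a server can automatically characterize an uploaded dataset (number of features, class distribution, etc.) without human annotation.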
43. Tasks
• Example: classification on a click-prediction dataset, using 10-fold CV and AUC
• People submit results (e.g. predictions)
• Server-side evaluation (many measures)
• All results organized online, per algorithm and parameter setting
• Online visualizations: every dot is a run, plotted by score
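Server-side evaluation means submitters only upload predictions; the server computes the measures. As an illustration, the AUC measure named above can be computed from scratch as the probability that a random positive outranks a random negative (a sketch of the standard rank-based definition, not OpenML's evaluation engine):

```python
def auc(scores, labels):
    """AUC: probability that a random positive is scored above a random negative.
    Ties count as half a win."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Evaluating centrally from raw predictions is what makes results comparable across submitters: everyone's AUC is computed by the same code on the same folds.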
44. • Leaderboards visualize progress over time: who delivered breakthroughs when, who built on top of previous solutions
• Collaborative: all code and data available, learn from others
• Real-time: clear who submitted first, others can improve immediately
47. • All results obtained with the same flow are organised online
• Results linked to datasets and parameter settings -> trends/comparisons
• Visualisations (dots are models, ranked by score, colored by parameter setting)
48. • Detailed run info: author, data, flow, parameter settings, result files, …
• Evaluation details (e.g., results per sample)
59. Online collaboration (soon)
Studies (e-papers):
- the online counterpart of a paper, linkable
- add data, code, experiments (new or old)
- public or shared within a circle
Circles:
- create collaborations with trusted researchers
- share results within the team prior to publication
Altmetrics:
- measure the real impact of your work
- reuse, downloads, likes of data, code, projects, …
- online reputation (more sharing)
63. I. Olier et al., MetaSel@ECMLPKDD 2015 [plot: best algorithm overall, 10-fold CV]
64. I. Olier et al., MetaSel@ECMLPKDD 2015 [plot: Random Forest meta-classifier, 10-fold CV]
65. OpenML community (Jan-Jun 2015)
- Used all over the world
- Users: 530+ registered; 400 7-day and 1,700 30-day active
- 1000s of datasets and flows, 450,000+ runs
66. Join OpenML
- Open source, on GitHub
- Regular workshops and hackathons
Next workshop: Lorentz Center (Leiden), 14-18 March 2016
67. Thank you
Joaquin Vanschoren, Jan van Rijn, Bernd Bischl, Matthias Feurer, Michel Lang, Nenad Tomašev, Giuseppe Casalicchio, Luis Torgo, Farzan Majdani, Jakob Bossek, You?
#OpenML