Reproducible, Open Data Science in the Life Sciences
1. Reproducible, Open Data Science in the Life Sciences
Digital Research 2013, 9th-10th September 2013
Eamonn Maguire
Lead Software Engineer
University of Oxford e-Research Centre
2. What is “data science”...
these days, all hard sciences are “data sciences”
“data science” - the storage, management and analysis of data sets
or...
“data science” - formalizing a hypothesis given a set of observations and assumptions, grabbing data about that hypothesis, testing it and analyzing it to either confirm or falsify the hypothesis.
we shift the focus of science from performing physical experiments where data is the by-product used to test a hypothesis, to working directly with the data
both definitions have different levels of validity in terms of the etymology of the word “science”, but in this presentation, both go very much hand in hand.
3. Why reproducible and open?
open
experiments are expensive...and often funded publicly.
data from experiments may extend way beyond the realms originally intended...
one without the other is really no use at all.
reproducible
findings need to be robust...and testable by the wider scientific community...
provided with data, metadata, analysis methods, algorithms
enabled by data, metadata, analysis methods, algorithms
5. Planning
What is your hypothesis?
What do you need to prove/disprove this?
You need an experimental design.
Preferably with balanced groups and enough samples to make the results statistically valid.
If you need to generate the data...there’s an app for that.
Design Wizard
Plan your experiment by answering questions about what you want to measure and how you want to measure it...then let the tool create the design.
Creates the ISA-Tab stub, leaving you to fill in which files match which biological samples.
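The idea of a balanced design with a stub left for the experimenter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name balanced_design, the group labels and the three-column row layout are hypothetical, and this is not how the Design Wizard itself is implemented.

```python
import random

def balanced_design(sample_ids, groups, seed=0):
    """Randomly assign samples to groups so group sizes differ by at most one."""
    rng = random.Random(seed)          # fixed seed keeps the assignment reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled samples round-robin across the groups.
    return {sample: groups[i % len(groups)] for i, sample in enumerate(shuffled)}

samples = [f"SAMPLE {i}" for i in range(1, 13)]
design = balanced_design(samples, ["control", "treated"])

# Hypothetical stub rows: sample, group, and an empty data file column
# that the experimenter fills in once the files exist.
stub_rows = [(s, design[s], "") for s in samples]
for row in stub_rows:
    print("\t".join(row))
```

Fixing the random seed means the same assignment can be regenerated later, which fits the reproducibility theme of the talk.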
6. Planning
You also need a data management strategy.
Which ontologies, minimum information checklists and exchange formats can be used for my domain?
What are the requirements of my funder for data deposition?
Which databases support my data?
7. Data Collection
Perform new experiment
Collect the data and metadata from an experiment
Use existing data
Or use existing data and metadata to test a hypothesis...
[Figure: biological samples (SAMPLE 1 to SAMPLE 11) matched to data files (FILE 1 to FILE 8, further file labels truncated)]
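Matching every sample to a data file is easy to check mechanically once the metadata is in a tab-delimited table. The sketch below assumes a hypothetical two-column table; the column names "Sample Name" and "Data File" and the helper unmatched_samples are illustrative only and are not taken from the ISA tools.

```python
import csv
import io

# Hypothetical tab-delimited table; SAMPLE 3 has no data file yet.
TABLE = (
    "Sample Name\tData File\n"
    "SAMPLE 1\tFILE 1\n"
    "SAMPLE 2\tFILE 2\n"
    "SAMPLE 3\t\n"
)

def unmatched_samples(text):
    """Return the names of samples whose data file column is empty."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row["Sample Name"] for row in reader
            if not (row["Data File"] or "").strip()]

missing = unmatched_samples(TABLE)
print(missing)
```

Running a check like this before analysis or deposition catches samples that were collected but never linked to their data.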
8. Data Collection
New experiment data
Excel
Create templates to fit the type of experiments to be described
Curate your experiment using a desktop-based, platform-independent tool.
Describe & curate your experiment with geographically distributed collaborators
Check out http://isa-tools.org to download.
Enables the creation of meaningful experiment data in a simple, extendable format
12. Analysis
The interesting bit...doing something with our data and metadata...
Analysis of ISA-Tab data in the R language. Brings together the context and data to enable more meaningful analysis.
Also suggests packages to use for analysis based on the data types in the ISA-Tab file.
Publication coming soon...
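Bringing together context and data can be illustrated by grouping data files according to an experimental factor recorded in the metadata. This is a minimal Python sketch, not the API of the R package mentioned above; the table contents, the "Data File" column name and the helper files_by_factor are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical metadata table with one experimental factor per sample.
TABLE = (
    "Sample Name\tFactor Value[treatment]\tData File\n"
    "SAMPLE 1\tcontrol\tFILE 1\n"
    "SAMPLE 2\ttreated\tFILE 2\n"
    "SAMPLE 3\tcontrol\tFILE 3\n"
)

def files_by_factor(text, factor_column):
    """Group data files by the value of one factor column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        groups[row[factor_column]].append(row["Data File"])
    return dict(groups)

grouped = files_by_factor(TABLE, "Factor Value[treatment]")
print(grouped)
```

With the files grouped this way, a downstream comparison (for example control versus treated) can be driven by the metadata rather than by file naming conventions.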
Analysis of ISA-Tab data in the Galaxy Environment. Creates Galaxy Library objects from ISA-Tab files.
Analysis of ISA-Tab data in the GenomeSpace Environment. Load and edit files stored on distributed servers.
Created by Brad Chapman at the Harvard School of Public Health
13. Visualization
Check out your experiment, visualize experimental design
Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs
Maguire et al, 2013
IEEE TVCG
Taxonomy-based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments
Maguire et al, 2012
IEEE TVCG
15. Publication - current work
Visualization
See http://www.slideshare.net/GigaScience/scott-edmunds-ismb-talk-on-big-data-publishing for a use case showing how we achieve this.
Analysis results
Data files
Publications
Metadata
Encapsulates all the information about the experiment, providing links to the data files, publications and analysis protocols
Analysis workflows in the Galaxy Environment
Workflows
Presentations
Logs
16. Box it all up
The role of a data scientist (or, in the life sciences, a bioinformatician) is manifold.
We’ve presented a suite of tools to help the data scientist manage their data and support the creation of open, meaningful life science data.
[Diagram: the Data Scientist surrounded by six stages: a. Planning, b. Data Collection, c. Data Management, d. Analysis, e. Visualization, f. Publication]
“data science” - the storage, management and analysis of data sets
"data science" - formalizing a hypothesis given a set of observations and assumptions, grabbing
data about that hypothesis, testing it and analyzing it to either confirm or falsify the hypothesis.
With the systems we have in place for data discovery, paired with data already created with the ISA suite of tools, we make data integration possible.
17. The workflow of a “data scientist”...
[Diagram: the Data Scientist at the centre of a cycle of Planning, Data Collection (use existing data or perform new experiment), Data Management, Analysis, Visualization and Publication]
Making science with data possible...
18. The workflow of a “data scientist”...
And this...
Making science with data possible...
19. The workflow of a “data scientist”...
Making science with data possible...
20. The workflow of a “data scientist”...
...
Making science with data possible...
21. Recent Publications
Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs
Maguire et al, 2013
IEEE TVCG
Taxonomy-based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments
Maguire et al, 2012
IEEE TVCG
ISA software suite: Overview of ISA-Tab and first set of tools
Rocca-Serra et al, 2010
Bioinformatics
Towards interoperable bioscience data: Presenting the ISA Commons, authored by more than 50 collaborators at over 30 scientific organizations around the globe.
Sansone et al, 2012
Nature Genetics
OntoMaton: a BioPortal-powered ontology widget for Google Spreadsheets.
Maguire et al, 2013
Bioinformatics
The Harvard Stem Cell Discovery Engine: an integrated repository and analysis system for
Ho Sui et al, 2012
Nucleic Acids Research
Taxonomy-based Glyph Design: Visualizing (ISA based) workflows of biological experiments
Maguire et al, 2012
IEEE TVCG
Standardizing data: ISA-Tab-Nano, an ISA-Tab extension for nanotechnology applications, authored by over 20 organizations incl. government agencies, academia and industry.
Baker et al, 2013
Nature Nanotechnology
MetaboLights: an open-access repository for metabolomics at EBI powered by ISA.
Haug et al, 2013
Nucleic Acids Research
The ToxBank Data Warehouse: a research cluster of 7 EU FP7 Health systems toxicology and toxicogenomics projects develops the ISAtoRDF module
Kohonen et al, 2013
Molecular Informatics
22. Thanks to
ISA team
Susanna-Assunta Sansone
Philippe Rocca-Serra
Eamonn Maguire
Alejandra Gonzalez-Beltran
Contributors
Marco Brandizi
Natalija Sklyar
Brad Chapman
Bob MacCallum
Kenneth Haug
Pablo Conesa
Audrey Kauffman
Funders
& Our Many Collaborators!
Stem Cell Commons
Nanotechnology Informatics Working Group