The Refinery Platform (http://www.refinery-platform.org) is a web-based data visualization and analysis system for epigenomic and genomic data designed to support reproducible biomedical research. The analysis backend employs the Galaxy Workbench and connects to a data repository based on the ISA-Tab data description format. In my talk I will discuss the exploratory visualization tools that we have integrated into Refinery.
Visualization Tools for the Refinery Platform - Supporting reproducible research with provenance visualization
1. Visualization Tools for the Refinery Platform
Nils Gehlenborg, PhD
HARVARD MEDICAL SCHOOL・CENTER FOR BIOMEDICAL INFORMATICS
SUPPORTING REPRODUCIBLE RESEARCH WITH PROVENANCE VISUALIZATION
REPRODUCIBLE RESEARCH
3. Health & Science
The new scientific revolution: Reproducibility at last
By Joel Achenbach, January 27
Diederik Stapel, a professor of social psychology in the Netherlands, had been a rock-star scientist, regularly appearing on television and publishing in top journals. Among his striking discoveries was that people exposed to litter and abandoned objects are more likely to be bigoted.
And yet there was often something odd about Stapel’s research. When students asked to see the data behind his work, he couldn’t produce it readily. And colleagues would sometimes look at his data and think: It’s beautiful. Too beautiful. Most scientists have messy data, contradictory data, incomplete data, ambiguous data. This data was too good to be true.
In late 2011, Stapel admitted that he’d been fabricating data for many years.
The Stapel case was an outlier, an extreme example of scientific fraud. But this and several other high-profile cases of misconduct resonated in the scientific community because of a much broader, more pernicious problem: Too often, experimental results can’t be reproduced.
That doesn’t mean the results are fraudulent or even wrong. But in science, a result is supposed to be verifiable by a subsequent experiment. An irreproducible result is inherently squishy.
And so there’s a movement afoot, and building momentum rapidly. Roughly four centuries after the invention of the scientific method, the leaders of the scientific community are recalibrating their requirements, pushing for the sharing of data and greater experimental transparency.
Top-tier journals, such as Science and Nature, have announced new guidelines for the research they publish.
“We need to go back to basics,” said Ritu Dhand, the editorial director of the Nature group of journals. “We need to train our students over what is okay and what is not okay, and not assume that they know.”
The pharmaceutical companies are part of this movement. Big Pharma has massive amounts of money at stake and wants to see more rigorous pre-clinical results from outside laboratories. The academic laboratories act as
5. CHALLENGES FOR REPRODUCIBILITY
1. Statistical issues
2. No access to data
3. No access to software
4. Insufficient description of experimental protocols
5. Insufficient description of data analysis process
…
6. REPRODUCIBLE AND INTEGRATIVE ANALYSIS
N Gehlenborg et al., manuscript in preparation
Refinery Platform
DATA REPOSITORY | VISUALIZATION TOOLS | ANALYSIS PIPELINES
7. DATA REPOSITORY
8. DATA REPOSITORY Meta Data
9. DATA REPOSITORY Meta Data
TREATMENT
CELL LINE
TIME POINT
…
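The meta data attributes on this slide (treatment, cell line, time point) are what make samples in the repository searchable. A minimal Python sketch, assuming a tab-delimited study table whose column headers follow the ISA-Tab pattern of `Source Name`, `Sample Name`, `Characteristics[...]` and `Factor Value[...]`. The rows, the values and the `select_samples` helper are hypothetical and not taken from Refinery:

```python
import csv
import io

# Hypothetical ISA-Tab-style study table (tab-delimited); all values are made up.
STUDY_TABLE = (
    "Source Name\tSample Name\tCharacteristics[cell line]\t"
    "Factor Value[treatment]\tFactor Value[time point]\n"
    "donor1\tsample1\tK562\tcontrol\t0h\n"
    "donor1\tsample2\tK562\tdrugA\t24h\n"
    "donor2\tsample3\tHeLa\tdrugA\t24h\n"
)


def select_samples(table_text, **criteria):
    """Return the names of samples whose attributes match every criterion.

    Keyword names use underscores for spaces (cell_line -> "cell line") and
    are matched against the text inside the square brackets of a header.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    selected = []
    for row in reader:
        # Map "Characteristics[cell line]" -> "cell line", and so on.
        attributes = {
            header[header.index("[") + 1 : header.rindex("]")]: value
            for header, value in row.items()
            if "[" in header and header.endswith("]")
        }
        if all(attributes.get(k.replace("_", " ")) == v for k, v in criteria.items()):
            selected.append(row["Sample Name"])
    return selected


print(select_samples(STUDY_TABLE, treatment="drugA", cell_line="K562"))
# prints ['sample2']
```

Keeping the attribute name inside the column header, as ISA-Tab does, means new attributes can be added to the table without changing the filtering code.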
11. DATA REPOSITORY
Source, Sample, Assay, Raw Data, Derived Data, Derived Data
Meta Data
12. DATA REPOSITORY
Source, Sample, Assay, Raw Data, Derived Data, Derived Data
Provenance
Meta Data
PROTOCOLS
13. DATA REPOSITORY
Source, Sample, Assay, Raw Data, Derived Data, Derived Data
Provenance
Meta Data
PROTOCOLS
ALGORITHMS
14. DATA REPOSITORY
Source, Sample, Assay, Raw Data, Derived Data, Derived Data
Experiment Graph
Provenance
Meta Data
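The experiment graph on these slides links each derived data file back through raw data, assay and sample to its source, with protocols and algorithms describing the steps in between. A minimal sketch of that idea as a directed acyclic graph in plain Python. The node names, the `derived_from` and `produced_by` mappings and the `lineage` helper are illustrative assumptions, not Refinery's data model:

```python
# Each node maps to the list of nodes it was directly derived from.
# All names are hypothetical placeholders.
derived_from = {
    "source1": [],
    "sample1": ["source1"],
    "assay1": ["sample1"],
    "raw.fastq": ["assay1"],
    "aligned.bam": ["raw.fastq"],
    "peaks.bed": ["aligned.bam"],
}

# Provenance annotations: the protocol or algorithm that produced each node.
produced_by = {
    "sample1": "protocol: sample collection",
    "assay1": "protocol: sequencing",
    "raw.fastq": "protocol: sequencing",
    "aligned.bam": "algorithm: read alignment",
    "peaks.bed": "algorithm: peak calling",
}


def lineage(node, graph):
    """Return every ancestor of `node`, found by a depth-first walk."""
    ancestors = []
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in ancestors:
            ancestors.append(current)
            stack.extend(graph[current])
    return ancestors


# Walk from a derived file back to its origin, printing how each step arose.
for step in ["peaks.bed"] + lineage("peaks.bed", derived_from):
    print(step, "<-", produced_by.get(step, "(origin)"))
```

Storing only the direct parents of each node keeps the graph small, and the full provenance of any file can still be recovered by walking the edges.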
18. REPRODUCIBLE AND INTEGRATIVE ANALYSIS
Refinery Platform
DATA REPOSITORY | VISUALIZATION TOOLS | ANALYSIS PIPELINES
20. ANALYSIS PIPELINES
DATA REPOSITORY | VISUALIZATION TOOLS | ANALYSIS PIPELINES
GALAXY
21. ANALYSIS PIPELINES
DATA REPOSITORY | VISUALIZATION TOOLS | ANALYSIS PIPELINES
GALAXY
REST API
23. ANALYSIS PIPELINES
DATA REPOSITORY | VISUALIZATION TOOLS | ANALYSIS PIPELINES
GALAXY
Tools
REST API
24. ANALYSIS PIPELINES
DATA REPOSITORY | VISUALIZATION TOOLS | ANALYSIS PIPELINES
GALAXY Toolshed
Tools
REST API
25. ANALYSIS PIPELINES
DATA REPOSITORY | VISUALIZATION TOOLS | ANALYSIS PIPELINES
GALAXY Toolshed
Workflow Editor
Tools
REST API
26. ANALYSIS PIPELINES
DATA REPOSITORY | VISUALIZATION TOOLS | ANALYSIS PIPELINES
GALAXY Toolshed
Workflow Editor
Tools
REST API
Workflow Inputs
Workflow Outputs
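The slides above show analysis pipelines talking to Galaxy through its REST API, with data sets mapped onto workflow inputs. A hedged sketch of what such a call could look like, using only the Python standard library and building the request without sending it. The endpoint path, the `x-api-key` header and the payload fields are assumptions modelled on Galaxy's workflow invocation API and should be checked against the Galaxy API documentation for the version in use; the server URL, key and ids are hypothetical, and this is not Refinery's own code:

```python
import json
import urllib.request


def build_invocation_request(galaxy_url, api_key, workflow_id, history_id, inputs):
    """Build (but do not send) a request that would start a workflow run.

    `inputs` maps a workflow input step index to a dataset id.
    """
    payload = {
        "history_id": history_id,
        # Assumed Galaxy convention: "hda" marks a dataset stored in a history.
        "inputs": {str(k): {"src": "hda", "id": v} for k, v in inputs.items()},
    }
    return urllib.request.Request(
        url=f"{galaxy_url.rstrip('/')}/api/workflows/{workflow_id}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# Hypothetical server, key and ids, for illustration only.
request = build_invocation_request(
    "https://galaxy.example.org", "SECRET", "wf123", "hist456", {0: "dataset789"}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Separating request construction from sending makes the input mapping easy to inspect and to store alongside the results before anything runs on the server.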
27. Experiment Graph
Source, Sample, Assay, Raw Data, Derived Data
ANALYSIS PIPELINES
Derived Data
30. Experiment Graph
Source, Sample, Assay, Raw Data, Derived Data
ANALYSIS PIPELINES
Derived Data
Derived Data
32. Experiment Graph
Source, Sample, Assay, Raw Data, Derived Data
ANALYSIS PIPELINES
Derived Data
WORKFLOW & PARAMETERS
Derived Data
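These slides show each analysis run adding new derived data to the experiment graph, annotated with the workflow and parameters that produced it. A minimal sketch of how such a record could be kept, in plain Python. The `AnalysisRecord` class, the `record_analysis` helper, the fingerprint check and all names and values are hypothetical illustrations rather than Refinery's implementation:

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class AnalysisRecord:
    """Provenance for one analysis run: what ran, on what, with which settings."""

    workflow: str
    parameters: dict
    inputs: tuple
    outputs: tuple

    def fingerprint(self):
        """Digest of workflow, parameters and inputs (outputs are excluded)."""
        canonical = json.dumps(
            {
                "workflow": self.workflow,
                "parameters": self.parameters,
                "inputs": sorted(self.inputs),
            },
            sort_keys=True,
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_analysis(graph, provenance, record):
    """Add the outputs of a run to the graph and attach its provenance."""
    for output in record.outputs:
        graph[output] = list(record.inputs)  # output was derived from the inputs
        provenance[output] = record


graph = {"raw.fastq": []}
provenance = {}
run = AnalysisRecord("alignment-workflow", {"mismatches": 2}, ("raw.fastq",), ("aligned.bam",))
rerun = AnalysisRecord("alignment-workflow", {"mismatches": 2}, ("raw.fastq",), ("aligned_v2.bam",))
record_analysis(graph, provenance, run)

print(graph["aligned.bam"])
print(run.fingerprint() == rerun.fingerprint())
```

Comparing fingerprints is one possible way to tell whether two derived files came from the same workflow, parameters and inputs, which is the information needed to repeat an analysis.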
35. ANALYSIS PIPELINES
36. REPRODUCIBLE AND INTEGRATIVE ANALYSIS
Refinery Platform
DATA REPOSITORY | VISUALIZATION TOOLS | ANALYSIS PIPELINES
ISA-TAB ISA-TAB
39. REPRODUCIBLE AND INTEGRATIVE ANALYSIS
40. REPRODUCIBLE AND INTEGRATIVE ANALYSIS
USE CASES
1. Collaboration between computational and experimental labs
2. Repository for large-scale, data-generating projects
3. Integration with existing repositories
78. REPRODUCIBLE AND INTEGRATIVE ANALYSIS
79. Acknowledgements
HARVARD MEDICAL SCHOOL: Richard Park, Psalm Haseley, Anton Xue, Peter J Park
HARVARD CHAN SCHOOL OF PUBLIC HEALTH: Ilya Sytchev, Shannan Ho Sui, Winston Hide
JOHANNES KEPLER UNIVERSITY LINZ: Stefan Luger, Samuel Gratzl, Holger Stitz, Marc Streit
Funding: NIH/NHGRI K99 HG007583 & Harvard Stem Cell Institute
80. Methods to Enhance the Reproducibility of Precision Medicine
Pacific Symposium on Biocomputing
The Big Island of Hawaii
January 4-8, 2016
people.fas.harvard.edu/~manrai/
http://bit.ly/patient-driven http://bit.ly/psb16-reproducibility
WE ARE HIRING!
http://j.mp/refinery-developer-jr
GREAT CONFERENCES!