Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Scott Edmunds, GigaScience/BGI Hong Kong
ICG7, Hong Kong, 1st December 2012

www.gigasciencejournal.com

The challenges integrating papers + data:
Technical issues:
•Data volumes: (1.2 zettabytes generated globally each year)

•>Exponential growth of genomics data

•Technical challenges (VMs/cloud, compression)

Cultural issues:
•Lack of incentives (Data DOIs)

•Data licensing (CC-BY, CC0)

•Journal/funder policies
Source: 1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.

* T-Shirts available from Graham Steel / http://www.zazzle.co.uk/steelgraham

Why is this important?
• Transparency
• Reproducibility
• Re-use

“Faked research is endemic in China”

Source: New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html

Why is this important?

“Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.”

“There have been widespread complaints from scientists inside and outside China about this lack of transparency.”

“Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.”
Source: Nature 475, 267 (2011) http://www.nature.com/news/2011/110720/full/475267a.html?

Global Issue: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with higher impact factor

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

Global Issue: unrepeatability of scientific results
Out of 18 microarray papers, results
from 10 could not be reproduced

Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses.
Nature Genetics 41: 149-155.

Sharing aids authors…

Sharing Detailed Research Data Is Associated with Increased Citation Rate.
Piwowar HA, Day RS, Fridsma DB (2007)
PLoS ONE 2(3): e308.
doi:10.1371/journal.pone.0000308

Every 10 datasets collected contribute to at least 4 papers in the following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011). Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a

Rice v Wheat: consequences of publicly available genome data.

[Chart: rice vs. wheat, vertical axis 0 to 700]

Our first DOI:

To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J;
Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y;
Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X;
Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4
TY-2482 isolate genome sequencing consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen.
doi:10.5524/100001
http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

Downstream consequences:

1. Citations (~100)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”

1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.

Not just (data) quantity, but quality
1. Lack of sufficient metadata

2. Lack of interoperability

3. Long tail of curation (“Democratization” of “Big-Data”)

Better handling of metadata…
Novel tools/formats for data interoperability/handling.
Cloud solutions?


Tools making work more easily reproducible…

Interoperability/Ease of use Workflows

Data quality assessment

Large-Scale Data
Journal/Database
In conjunction with:

Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Commissioning Editor: Nicole Nogoy, PhD
Lead Curator: Tam Sneddon, DPhil
Data Platform: Peter Li, PhD

Addressing the reproducibility gap:
Computable methods/workflow systems
Bioinformatics
Development Biomedical and bioinformatics research Publishing

Redefining what is a paper in the era of big-data?

goal: Executable Research Objects

Citable DOI

Integrating workflows into papers…

Anatomy of a Publication
Idea

Study

Metadata

Data
Analysis

Answer

Anatomy of a Data Publication
Idea

Study

Metadata

Data
Analysis

Answer

Publication

• Background

• Methods

• Results (Data)

• Conclusions/Discussion

doi:10.1186/2047-217X-1-3

Data
Publication

• Background

• Methods

• Results (Data)
doi:10.5524/100035

doi:10.1186/2047-217X-1-3

Methods +
Data +
Publication

• Background

• Methods DOI for workflows?

• Results (Data)
doi:10.5524/100035

doi:10.1186/2047-217X-1-3

Data Methods Analysis

doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3

DOI: A + DOI: X = DOI: 1


DOI: B + DOI: X = DOI: 2


DOI: A + DOI: Y = DOI: 3


A, B, C… X, Y, Z… = 4, 5, 6…
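The combinations above (one dataset reused with different methods, one method applied to different datasets) can be sketched as a lookup keyed on the pair of identifiers. This is only an illustrative sketch: the `register` and `outputs_reusing` helpers and the registry structure are hypothetical, and "DOI:A", "DOI:X" and so on are the placeholder labels from the slide, not real DOIs.

```python
from typing import Dict, List, Tuple

# Keys are (data DOI, methods DOI) pairs; values are the resulting publication DOI.
# All identifiers below are the slide's placeholder labels, not registered DOIs.
Registry = Dict[Tuple[str, str], str]


def register(registry: Registry, data_doi: str, methods_doi: str,
             publication_doi: str) -> None:
    """Record that one data object plus one methods object yielded a publication."""
    key = (data_doi, methods_doi)
    if key in registry:
        raise ValueError(f"{key} already maps to {registry[key]}")
    registry[key] = publication_doi


def outputs_reusing(registry: Registry, data_doi: str) -> List[str]:
    """List every publication built on the given data object, in sorted order."""
    return sorted(pub for (data, _), pub in registry.items() if data == data_doi)


registry: Registry = {}
register(registry, "DOI:A", "DOI:X", "DOI:1")
register(registry, "DOI:B", "DOI:X", "DOI:2")
register(registry, "DOI:A", "DOI:Y", "DOI:3")

print(outputs_reusing(registry, "DOI:A"))  # ['DOI:1', 'DOI:3']
```

Keying on the pair rather than on the dataset alone means each new data/method combination counts as its own citable output, which is the point the slide makes.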

Different shaped publishable objects
Data
Papers

Executable
(Methods)
Papers

Analysis
Papers

Different shaped publishable objects
Different levels of granularity

Experiment (e.g. ACRG project): e.g. doi:10.5524/100001 = Papers

Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2 = Data/Micropubs

Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz

Smaller still? Facts/Assertions (~10^14 in literature) = Nanopubs
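A minimal sketch of how the nested identifiers above could be handled in code, assuming (as the example DOIs suggest, but the slides do not state) that a finer-grained object is named by appending `-` or `_` plus a sub-identifier to the experiment-level DOI. The `split_doi` and `parent_doi` helpers are hypothetical and do not describe any GigaDB or DataCite API.

```python
import re
from typing import Optional, Tuple

# Assumed shape: optional "doi:" label, registrant prefix, numeric base,
# then an optional "-" or "_" followed by a sub-identifier.
_DOI_PATTERN = re.compile(r"^(?:doi:)?(10\.\d+)/(\d+)(?:[-_](\w+))?$")


def split_doi(doi: str) -> Tuple[str, str, Optional[str]]:
    """Return (prefix, base, sub_id); sub_id is None for an experiment-level DOI."""
    match = _DOI_PATTERN.match(doi.strip())
    if match is None:
        raise ValueError(f"unrecognised DOI shape: {doi!r}")
    return match.group(1), match.group(2), match.group(3)


def parent_doi(doi: str) -> str:
    """Return the experiment-level DOI that a dataset or sample DOI belongs to."""
    prefix, base, _ = split_doi(doi)
    return f"{prefix}/{base}"


print(split_doi("doi:10.5524/100001-2"))   # ('10.5524', '100001', '2')
print(parent_doi("10.5524/100001_xyz"))    # 10.5524/100001
```

The suffix alone does not distinguish a dataset from a sample (`-2` and `-2000` have the same shape), so a real system would need that level recorded in metadata rather than inferred from the identifier.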

Adding “value” publishing data

• Scope for different shaped publishable objects
• Scope for publishing methods/executable papers
• Peer review of data problematic
– Post publication peer review
– Change criteria (assess on transparency/access only)
– Better use of workflows/cloud/VMs

DOIs are cheap*, data is precious: maximise its use
* ish

• Transparency
• Reproducibility
• Re-use
= Credit

Thanks to:
Laurie Goodman, Tam Sneddon, Nicole Nogoy, Alexandra Basford, Peter Li, Jesse Si Zhe
Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST), Cogini
Contact us: editorial@gigasciencejournal.com, database@gigasciencejournal.com

Follow us: @gigascience, facebook.com/GigaScience, blogs.openaccesscentral.com/blogs/gigablog/

www.gigadb.org
