3. It's all about the re-use
Take-home message:
To do this, everything needs to be free
and accessible to be read by humans &
machines*
* See: http://www.biomedcentral.com/about/datamining
4. Challenges/Opportunities in the Data-Driven Era
Challenges:
Quick response to climate change, food security & disease outbreaks
Enables:
Using networking power of the internet to tackle problems
Can ask new questions & find hidden patterns & connections
Build on each other's efforts quicker & more efficiently
More collaborations across more disciplines
Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding
Enabled by:
Removing silos, standards/formats, open-access/data
5. Not enabled by: paywalls, silos, dead trees
1812 1665 1869
• “Scholarly articles are merely advertisement of scholarship.
The actual scholarly artefacts, i.e. the data and
computational methods, which support the scholarship,
remain largely inaccessible” (Jon B. Buckheit and David L.
Donoho, WaveLab and reproducible research, 1995)
• Lack of transparency, lack of credit for anything other than
“regular” dead tree publication
• If there is interest in data, it is only to monetise & repackage it
6. Problem: growing replication gap
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results
from 10 could not be reproduced
7. Growing Issue: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with
higher impact factor
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract
At the current rate of increase, by 2045 as
many papers will be retracted as
published!
9. GigaSolution: Deconstructing the paper
www.gigadb.org
www.gigasciencejournal.com
Utilizes big-data infrastructure and expertise from:
Combines and integrates:
Open-access journal
Data Publishing Platform
Data Analysis Platform
10. New incentives/credit
• Data
• Software
• Review
• Re-use…
= Credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with
a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set
would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
11. Anatomy of a Publication
Data
Idea
Study
Analysis
Answer
Metadata
12. Anatomy of a Data Publication
Data
Idea
Study
Analysis
Answer
Metadata
13. Submission Workflow
• Submitter logs in to GigaDB website and uploads Excel submission file
• Fail – submitter is provided error report
• Pass – dataset is uploaded to GigaDB
• Curator review; DOI assigned
• Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues)
• Submitter provides files by ftp or Aspera
• DataCite XML file is generated and registered with DataCite
• Curator makes dataset public (can be set as future date if required)
• Public GigaDB dataset, e.g. DOI 10.5524/100003: Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
See: http://database.oxfordjournals.org/content/2014/bau018.abstract
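The submission workflow on this slide can be sketched as a short sequence of state changes. This is only an illustrative sketch, not GigaDB's actual code: the `Submission` class, the `run_workflow` function, and all field names are hypothetical, and the ordering of stages is inferred from the slide's flowchart labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Submission:
    """Hypothetical record for one dataset moving through the workflow."""
    title: str
    spreadsheet_valid: bool          # outcome of checking the Excel submission
    doi: Optional[str] = None
    files_received: bool = False
    datacite_registered: bool = False
    public: bool = False
    log: List[str] = field(default_factory=list)


def run_workflow(sub: Submission, doi: str) -> Submission:
    """Walk a submission through the stages named on the slide (assumed order)."""
    # Fail branch: the submitter only gets an error report.
    if not sub.spreadsheet_valid:
        sub.log.append("fail: error report provided to submitter")
        return sub
    sub.log.append("pass: dataset uploaded to GigaDB")

    # Curator review, DOI assignment, and contact with the submitter.
    sub.doi = doi
    sub.log.append(f"curator review: DOI {doi} assigned")
    sub.log.append("curator contacts submitter with DOI citation")

    # File transfer, then metadata registration.
    sub.files_received = True
    sub.log.append("submitter provides files by ftp or Aspera")
    sub.datacite_registered = True
    sub.log.append("XML generated and registered with DataCite")

    # Final step: the curator releases the dataset.
    sub.public = True
    sub.log.append("curator makes dataset public")
    return sub


ok = run_workflow(Submission("Macaque genomic data", True), "10.5524/100003")
bad = run_workflow(Submission("Broken spreadsheet", False), "10.5524/100003")
print(ok.public, ok.doi)    # True 10.5524/100003
print(bad.public, bad.doi)  # False None
```

The early return on a failed spreadsheet mirrors the slide's Fail/Pass split: nothing downstream (DOI, file transfer, DataCite registration) happens until the submission file passes.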
17. BGI Datasets Get DOIs
Plants
Chinese cabbage
Cucumber
Foxtail millet
Pigeonpea
Potato
Sorghum
Wheat A+B
Rice
Microbe/metagenomics
E. coli O104:H4 TY-2482
T2D gut metagenome
Bulk pooled insects
T. tengcongensis proteome
Cell-Lines
Chinese Hamster Ovary
Mouse methylomes
Cancer quantitative proteomics
Human
Asian individual (YH)
- DNA Methylome
- Genome Assembly v1+2
- Transcriptome
Cancer (14TB)
Single cell bladder cancer
HBV infected exomes
Ancient DNA
- Saqqaq Eskimo
- Aboriginal Australian
Vertebrates
Darwin’s Finch
Giant panda
Macaque
- Chinese rhesus
- Crab-eating
Mini-Pig
Naked mole rat
Parrot, Puerto Rican
Penguin
- Emperor penguin
- Adelie penguin
Pigeon, domestic
Polar bear
DA and F344 rats
Sheep
Tibetan antelope
Other
fMRI & Retinal waves
Invertebrate
Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
Roundworm
Schistosoma
Silkworm
Parasitic nematode
Pacific oyster
Released pre-publication
Paper Published in GigaScience
20. Our first DOI:
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang,
J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J;
Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X;
Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the
Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium
(2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Shenzhen. doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
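The citation pattern on this slide (authors, year, title, publisher, DOI, plus a resolver link) can be assembled programmatically. The following is a minimal sketch under stated assumptions: the function names `format_dataset_citation` and `doi_url` are hypothetical, the layout simply copies the slide's own wording, and the author list in the example is shortened for illustration.

```python
from typing import Sequence


def doi_url(doi: str) -> str:
    """Build a resolver link in the same form as the one shown on the slide."""
    return "http://dx.doi.org/" + doi


def format_dataset_citation(
    authors: Sequence[str], year: int, title: str, publisher: str, doi: str
) -> str:
    """Join the citation parts in the order used on the slide."""
    if not authors:
        raise ValueError("at least one author is required")
    author_part = "; ".join(authors)
    return f"{author_part} ({year}) {title}. {publisher}. doi:{doi}"


citation = format_dataset_citation(
    ["Li, D", "Xi, F"],  # shortened author list, for illustration only
    2011,
    "Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    "BGI Shenzhen",
    "10.5524/100001",
)
print(citation)
print(doi_url("10.5524/100001"))
```

Keeping the DOI as a plain string and deriving the link from it means the same identifier can be reused in the citation text and in the resolver URL.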
24. Figure labels: SOURCE; USE/REUSE; PUBLISH; INTEGRATION WITH DOMAIN-SPECIFIC DATABASES VIA ISA-TOOLS; NARRATIVE DATA; (SOCIAL) MEDIA; DATA PRODUCTION
Sneddon, T.P., Zhe, X.S., Edmunds, S.C., et al. GigaDB: promoting data dissemination and
reproducibility. Database (2014) Vol. 2014: article ID bau018; doi:10.1093/database/bau018
27. How are we supporting data
reproducibility?
Data sets
Analyses
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
~21,000 accesses
Open-Code
8 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/
~21,000 downloads
Enabled code to be picked apart by bloggers in wiki
http://homolog.us/wiki/index.php?title=SOAPdenovo2
28. New & more transparent peer-review:
The GigaScience way:
8 referees downloaded & tested data, then signed reports
29. New & more transparent peer-review:
The GigaScience way:
Real-time open-review = paper in arXiv + blogged reviews
30. Implement workflows in a community-accepted format
http://galaxyproject.org
Over 36,000 main
Galaxy server users
Over 1000 papers
citing Galaxy use
Over 55 Galaxy
servers deployed
Open source
32. SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
Implemented entire workflow in our Galaxy server, inc.:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
Also will be available to download by >36K Galaxy users in
37. Thanks to:
Case study:
Ruibang Luo (BGI/HKU)
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)
Our team:
Peter Li
Chris Hunter
Jesse Si Zhe
Nicole Nogoy
Laurie Goodman
Rob Davidson
Amye Kenall (BMC)
Our collaborators:
Marco Roos (LUMC)
Mark Thompson (LUMC)
Jun Zhao (Lancaster)
Susanna Sansone (Oxford)
Philippe Rocca-Serra (Oxford)
Alejandra Gonzalez-Beltran (Oxford)
CBIIT
Funding from:
@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
Editor's notes
These are examples of datasets we have in GigaDB.
Quite a few of them were released pre-publication.
We want to push for better-quality metadata.
Working with ISA (Investigation/Study/Assay) Commons to enable this.
Good to be a leader in this field: NPG are following in our footsteps!