Scott Edmunds: Data Dissemination in the era of "Big-Data"
1. Bio-IT World Asia Meeting, 7th June 2012 Scott Edmunds
Data dissemination in the era of “big data”
William Gibson: “Information is the currency of the future world”
Sir Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves”
www.gigasciencejournal.com
2. Is data “the new oil”?
1.2 zettabytes (1.2 × 10^21 bytes) of electronic data generated each year [1]
Data
Deluge?
1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.
3. Global Sequencing Capacity
Data Production
5.6 Tb / day
> 1500X of human genome / day
Multiple Supercomputing Centers
157 TFlops
20 TB Memory
14.7 PB Storage
4. BGI Sequencing Capacity
Sequencers:
• 137 Illumina/HiSeq 2000
• 27 LifeTech/SOLiD 4
• 1 454 GS FLX+
• 2 Illumina iScan
• 1 Illumina MiSeq
• 1 Ion Torrent
Data Production:
• 5.6 Tb / day
• > 1500X of human genome / day
Multiple Supercomputing Centers:
• 157 TFlops
• 20 TB Memory
• 14.7 PB Storage
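The “> 1500X of human genome / day” figure can be sanity-checked with a line of arithmetic. The sketch below is only an illustration and rests on two assumptions that the slide does not state: that “Tb” means terabases (10^12 bases) rather than terabytes, and that a haploid human genome is roughly 3.1 billion bases. The function and constant names are made up for this example.

```python
# Assumption: "Tb" on the slide means terabases (10**12 bases), not terabytes.
BASES_PER_DAY = 5.6e12

# Assumption: a haploid human genome is roughly 3.1 billion bases.
HUMAN_GENOME_BASES = 3.1e9


def genome_equivalents_per_day(bases_per_day: float, genome_size: float) -> float:
    """Return how many genome-sized units of raw sequence are produced per day."""
    return bases_per_day / genome_size


fold = genome_equivalents_per_day(BASES_PER_DAY, HUMAN_GENOME_BASES)
print(f"about {fold:.0f}x human genome per day")
```

Under these assumptions the result lands comfortably above the 1500X quoted on the slide; a slightly larger genome size estimate would lower it but not below that threshold.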
5. Now taking submissions…
Large-Scale Data:
Journal/Database/Platform
In conjunction with:
Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD
Lead BioCurator: Tam Sneddon, DPhil
Data Platform: Peter Li, PhD
9. There are many hurdles…
Technical: too large volumes
too heterogeneous
no home for many data types
too time consuming
Cultural: inertia
no incentives to share
unaware of how
?
19. Incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to
public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a
particular data set would enable appropriate attribution for those
who share.”
Nature Biotechnology 27, 579 (2009)
Prepublication data sharing
(Toronto International Data Release Workshop)
“Data producers benefit from creating a citable reference, as it can
later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)
20. Datacitation: DataCite and DOIs
Digital Object Identifiers (DOIs) offer a solution:
• Most widely used identifier for scientific articles
• Researchers, authors, publishers know how to use them
• Put datasets on the same playing field as articles
Dataset:
Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
Aims to:
“increase acceptance of research data as legitimate, citable contributions to the scholarly record”.
“data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
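To make the idea of a citable dataset concrete, here is a minimal sketch of turning a DOI into a resolvable link and formatting a data citation in the same shape as the PANGAEA example on this slide. Everything in it is illustrative: the function names are hypothetical, the DOI pattern is a loose assumption rather than an official validation rule, and the citation layout simply mirrors the slide rather than any mandated style.

```python
import re

# Public DOI resolver; http://dx.doi.org/ is an older form of the same service.
DOI_RESOLVER = "https://doi.org/"

# Assumption: a loose pattern ("10.", registrant code, "/", suffix), not an official rule.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Labels and resolver prefixes that people commonly paste in front of a DOI.
_KNOWN_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def normalize_doi(raw: str) -> str:
    """Strip a leading 'doi:' label or resolver URL and check the remainder."""
    doi = raw.strip()
    for prefix in _KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not _DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognisable DOI: {raw!r}")
    return doi


def doi_to_url(raw: str) -> str:
    """Build a resolver link for a DOI given in any of the accepted forms."""
    return DOI_RESOLVER + normalize_doi(raw)


def format_data_citation(authors: str, year: int, title: str,
                         publisher: str, raw_doi: str) -> str:
    """Format a dataset citation as 'Authors (Year). Title. Publisher. doi:...'."""
    return f"{authors} ({year}). {title}. {publisher}. doi:{normalize_doi(raw_doi)}"


print(doi_to_url("doi:10.1594/PANGAEA.587840"))
print(format_data_citation("Yancheva et al", 2007,
                           "Analyses on sediment of Lake Maar",
                           "PANGAEA", "10.1594/PANGAEA.587840"))
```

Because the dataset is cited in the same form as an article, it can sit in a reference list and be picked up by the same tools that already handle article DOIs.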
21. Datacitation: Datacite and DOIs
Central metadata repository:
• >1 million entries to date
• Stability
• Data discoverability
• Open & harvestable
• Potential to track & credit use
22. Data publishing/DOI
New journal format combines standard manuscript
publication with an extensive database to host all
associated data, and integrated tools.
Data hosting will follow standard funding agency
and community guidelines.
DOI assignment available for submitted data to
allow ease of finding and citing datasets, as well as for
citation tracking.
24. BGI Datasets Get DOI®s
Many released pre-publication…
Invertebrate:
• Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
• Roundworm
• Schistosoma
• Silkworm
Plants:
• Chinese cabbage
• Cucumber
• Foxtail millet
• Pigeonpea
• Potato
• Sorghum
Vertebrates:
• Giant panda
• Macaque
- Chinese rhesus
- Crab-eating
• Mini-Pig
• Naked mole rat
• Penguin
- Emperor penguin
- Adelie penguin
• Pigeon, domestic
• Polar bear
• Sheep
• Tibetan antelope
Human:
• Asian individual (YH)
- DNA Methylome
- Genome Assembly
- Transcriptome
• Cancer (14TB)
• Ancient DNA
- Saqqaq Eskimo
- Aboriginal Australian
Microbe:
• E. coli O104:H4 TY-2482
Cell-Line:
• Chinese Hamster Ovary
doi:10.5524/100004
25. For data citation to work, needs:
• Proven utility/potential user base.
• Acceptance/inclusion by journals.
• Data+Citation: inclusion in the references.
• Tracking by citation indexes.
• Usage of the metrics by the community…
27. • Data submitted to NCBI databases:
- Raw data: SRA:SRA046843
- Assemblies of 3 strains: GenBank:AHAO00000000-AHAQ00000000
- SNPs: dbSNP:1056306
- CNVs, InDels, SV: dbVar:nstd63
• Submission to public databases complemented by
its citable form in GigaDB (doi:10.5524/100012).
33. Datacitation: tracking?
DataCite metadata in harvestable form (OAI-PMH)
Plans in 2012 to link central metadata repository with WoS
- Will finally track and credit use!
To be continued…
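“Harvestable” here means that a client can request the metadata in bulk over OAI-PMH, a simple HTTP protocol with a small set of verbs. The following is a rough sketch of how such requests are put together; it only builds request URLs and does not contact any server. The endpoint address is an assumption and should be checked against DataCite's own documentation, and the function names are hypothetical. The `verb`, `metadataPrefix`, `set`, `from`, `until` and `resumptionToken` arguments come from the OAI-PMH 2.0 protocol.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: endpoint address; verify against DataCite's documentation before use.
OAI_ENDPOINT = "https://oai.datacite.org/oai"


def list_records_url(endpoint: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None,
                     until_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request for the first page of results."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # restrict the harvest to one set
    if from_date:
        params["from"] = from_date      # e.g. "2012-01-01" (UTC datestamp)
    if until_date:
        params["until"] = until_date
    return endpoint + "?" + urlencode(params)


def resume_url(endpoint: str, token: str) -> str:
    """Build the follow-up request for the next page of a partial response.

    In OAI-PMH the resumptionToken is an exclusive argument: it is sent
    with the verb only, without repeating the original filters.
    """
    return endpoint + "?" + urlencode({"verb": "ListRecords",
                                       "resumptionToken": token})


print(list_records_url(OAI_ENDPOINT, from_date="2012-01-01"))
```

A harvester would fetch the first URL, read the `resumptionToken` element from the XML response if one is present, and keep calling the follow-up URL until no token is returned.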
36. Our first DOI:
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang,
Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun,
Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y;
Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482
isolate genome sequencing consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen.
doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
39. “The way that the genetic data of the 2011 E. coli strain were disseminated
globally suggests a more effective approach for tackling public health
problems. Both groups put their sequencing data on the Internet, so scientists
the world over could immediately begin their own analysis of the bug's
makeup. BGI scientists also are using Twitter to communicate their latest
findings.”
“German scientists and their colleagues at the Beijing Genomics Institute in China have
been working on uncovering secrets of the outbreak. BGI scientists revised their draft
genetic sequence of the E. coli strain and have been sharing their data with dozens of
scientists around the world as a way to "crowdsource" this data. By publishing their data
publicly and freely, these other scientists can have a look at the genetic structure, and try
to sort it out for themselves.”
41. Downstream consequences:
1. Therapeutics (primers, antimicrobials)
2. Platform Comparisons (Loman et al., Nature Biotech 2012)
3. Speed/legal-freedom
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli
strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days
for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could
use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that
allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and
publish their work without wasting time on legal wrangling.”
44. The era of the data consumer?
Free access to data – but analysis hubs/nodes will form around it
?
45. GDSAP: Genomic Data Submission and Analytical platform
Tin-Lap Lee, CUHK
[Diagram: Big data from the “Sequencing Oil Field” (“Data, Data, Data…”); stages: Data, Modeling, Pipeline design, Validation, Commercial applications (“Apps”)]
48. Papers in the era of big-data
$1000 genome = million $ peer-review?
To review: (>6TBp, >1500 datasets)
S3 = $15,000
EC2 (BLASTx) = $500,000
Source: Folker Meyer/Wilkening et al. 2009, CLUSTER'09. IEEE International Conference on Cluster Computing and Workshops
49. Papers in the era of big-data
goal: Executable Research Objects
Citable DOI
50. Papers in the era of big-data
goal: Executable Research Objects
Stage 1: Wilson GA, Dhami P, Feber A, Cortázar D, Suzuki Y, Schulz R, Schär P, Beck S:
Resources for methylome analysis suitable for gene knockout studies of
potential epigenome modifiers. GigaScience 2012, 1:3. (in press)
GigaDB hosting all data + tools (84GB total): doi:10.5524/100035
+
Partial (~80%) integration of workflow into our data platform.
(all the data processing steps, but not the enrichment analysis)
Stage 2: Papers fully integrating all data + all workflows in our platform.
51. Papers in the era of big-data
Interested in Reproducible Research?
Take part in our session on: “Cloud and workflows for reproducible bioinformatics”
Submit to:
• Rapid review/Open Access/High-visibility
• Article Processing Charge covered by BGI
• Hosting of any test datasets/workflows in GigaDB
52. Thanks to:
Laurie Goodman Alexandra Basford
Tam Sneddon Peter Li
Tin-Lap Lee (CUHK) Qiong Luo (HKUST)
Contact us:
scott@gigasciencejournal.com
editorial@gigasciencejournal.com
Follow us:
@gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/
www.gigasciencejournal.com
Editor's notes
Our facilities feature Sanger and next-generation sequencing technologies, providing the highest throughput sequencing capacity in the world. Powered by 137 Illumina HiSeq 2000 instruments and 27 Applied Biosystems SOLiD™ 4 Systems, we provide high-quality sequencing results with industry-leading turnaround time. As of December 2010, our sequencing capacity is 5 Tb raw data per day, supported by several supercomputing centers with a total peak performance up to 102 Tflops, 20 TB of memory, and 10 PB storage. We provide stable and efficient resources to store and analyze massive amounts of data generated by next-generation sequencing.
Helps reproducibility, but some debate over whether it can help that much regarding scaling.
Raw data has been submitted to the SRA, the assembly to GenBank (no number), and SV data to dbVar (it’s the first plant data they’ve received). This complements the traditional public databases by hosting all these “extra” data types in one place, in citable form.