This document discusses a new open-access journal called GigaScience that publishes large biological datasets. It aims to improve data sharing by assigning digital object identifiers (DOIs) to published datasets to make them easily citable and trackable. The journal faces challenges regarding reproducibility, usability, and adherence to standards. It works to address these by providing tools for data access, encouraging standards compliance, and integrating datasets into its expanding repository.
Alexandra Basford, InCoB 2011: A Journal’s Perspective on Data Standards and Biocuration
1. A Journal’s Perspective on Data Standards and Biocuration
Alexandra Basford, PhD
www.gigasciencejournal.com
2. Overview
• Overview/Introduction
• The Curation Challenges of a Journal/Database
• Reproducibility/Reuse
• Data Publishing
• Utility/Usability
• Our DOI Adventures
• Standards/Searchability/Sharing
3. Overview
• Overview/Introduction
• The Curation Challenges of a Journal/Database
• How do we deal with “big data”?
• Reproducibility/Reuse
• Data Publishing
• Utility/Usability
• Our DOI Adventures
• Standards/Searchability/Sharing
7. GigaScience is a new open-access, open-data journal for the publication of all types of biological studies that use or create large-scale data sets
The scope spans the biomedical and life sciences,
including:
- “Omics”
- Imaging
- Neuroscience
- Ecology
- Medicine
- Systems biology
… “big and sharable”
Published by
in partnership with
8. Editorial Board – International
Stephan Beck, UK Stephen O'Brien, USA
Alvis Brazma, UK Hanchuan Peng, USA
Ann-Shyn Chiang, Taiwan Russell Poldrack, USA
Richard Durbin, UK Ming Qi, China/USA
Paul Flicek, UK Susanna-Assunta Sansone, UK
Robert Hanner, Canada Michael Schatz, USA
Yoshihide Hayashizaki, Japan David Schwartz, USA
Henning Hermjakob, UK Fritz Sommer, USA
Wolfgang Huber, Germany Lincoln Stein, Canada
Gary King, USA Sumio Sugano, Japan
Tin-Lap Lee, Hong Kong Thomas Wachtler, Germany
Donald Moerman, Canada Jun Wang, China
Karen Nelson, USA Alistair Young, New Zealand
Francis Ouellette, Canada Zang Yufeng, China
Lennart Hammarström, Sweden Marie Zins, France
Paul Horton, Japan
9. Editorial Board – Multidisciplinary
Stephan Beck, Epigenomics Stephen O'Brien, Genomics
Alvis Brazma, Transcriptomics Hanchuan Peng, Imaging/Neuro
Ann-Shyn Chiang, Neuroscience Russell Poldrack, Neuroscience
Richard Durbin, Genetics/Genomics Ming Qi, Genetics
Paul Flicek, Genomics Susanna-Assunta Sansone, Standards
Robert Hanner, DNA Barcoding/Ecology Michael Schatz, Cloud Computing
Yoshihide Hayashizaki, Genomics David Schwartz, Optical Mapping
Henning Hermjakob, Proteomics Fritz Sommer, Neuroscience
Wolfgang Huber, Functional Genomics Lincoln Stein, Cloud Computing
Gary King, Medicine Sumio Sugano, Genomics
Tin-Lap Lee, Genomics Thomas Wachtler, Neuroscience
Donald Moerman, Functional Genomics Jun Wang, Genomics
Karen Nelson, Metagenomics Alistair Young, Medical Imaging
Francis Ouellette, Genomics Zang Yufeng, Neuroscience
Lennart Hammarström, Immuno/Genetics Marie Zins, Medicine
Paul Horton, Genetics/Tools
14. An Unusual Format
• GigaScience combines standard manuscript publication with an ever-expanding database
• Evolving data repository
– Integrating tools for public access, viewing, and analysis of
the stored data
– Improvements driven by community input
• All datasets are assigned digital object identifiers (DOIs) to make them easy to access, track, and cite
15. Data Sharing Hurdles
• Technical
– too large volumes
– too heterogeneous
– no home for many data types
• Economic
– too expensive
– no long-term funding
• Cultural
– inertia
– no incentives to share
– unaware of how?
– too time consuming
16. Changing Trends
Cultural shift towards data sharing.
Growing/widening user base.
The long tail of new “big-data” producers?
Curation, curation, curation
17. Use of Data = Importance + Usability
• Importance: subjective?
• Usability: easier to assess
18. Challenges for a Journal/Database
• Reproducibility/Reuse
• Utility/Usability
• Standards/Searchability/Sharing
• Data publishing/DOI®
19. Why DOI®s?
• Guarantee of permanency
• Clear method for data tracking and data citation,
allowing:
– Increased searchability (and hopefully use) of data
– Credit for data production, making it clear who produced
the data and when
– Credit to original authors for their data’s use
– The ability to track and receive feedback on data usage
– A data citation metric potentially rivaling and
complementary to the impact factor
– The potential to make the data available and receive credit for it earlier, then publish papers on the dataset later
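As a rough illustration of the data tracking idea above, the sketch below tallies how many citing articles mention a given data DOI in their reference lists. It is a minimal example under stated assumptions, not a description of how GigaScience or any citation index actually tracks usage: the function names, the loose DOI pattern, and the article DOI `10.1000/xyz123` are all hypothetical. Only the prefix `10.5524/` and the dataset DOI `10.5524/100001` come from the slides.

```python
import re
from collections import Counter

# Loose DOI matcher (an assumption, not the formal DOI syntax):
# "10.", a 4-9 digit registrant code, "/", then a run of non-space characters.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(text):
    """Return DOIs found in free text, lowercased, with trailing punctuation trimmed."""
    return [match.rstrip(".,;)").lower() for match in DOI_RE.findall(text)]


def count_dataset_citations(reference_lists, prefix="10.5524/"):
    """Count, per dataset DOI, how many reference lists cite it.

    Each reference list is one string (one citing article); a dataset is
    counted at most once per article.
    """
    counts = Counter()
    for refs in reference_lists:
        cited = {doi for doi in extract_dois(refs) if doi.startswith(prefix)}
        counts.update(cited)
    return counts


# Hypothetical reference lists from two citing articles.
reference_lists = [
    "Li D, et al. (2011) Genomic data from Escherichia coli O104:H4 "
    "isolate TY-2482. BGI Shenzhen. doi:10.5524/100001.",
    "See http://dx.doi.org/10.5524/100001 and doi:10.1000/xyz123",
]
counts = count_dataset_citations(reference_lists)
print(counts)
```

Counting once per citing article keeps a dataset that is mentioned several times in one paper from inflating the tally.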
20. Largest Sequencing Capacity in the World
Sequencers
• 137 Illumina HiSeq 2000
• 27 LifeTech SOLiD 4
• 16 AB 3730xl + 110 MegaBACEs
• 2 Illumina iScan
Data Production
• 5.6 Tb / day
• > 1500X of human genome / day
Multiple Supercomputing Centers
• 157 Tflops
• 20 TB memory
• 12.6 PB storage
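The "> 1500X of human genome / day" figure can be sanity-checked with a line of arithmetic. The sketch below assumes that "Tb" means terabases and that the haploid human genome is about 3.1 billion bases; neither assumption is stated on the slide, and the function name is illustrative.

```python
# Assumption: haploid human genome size of roughly 3.1 gigabases.
HUMAN_GENOME_BASES = 3.1e9


def genome_equivalents_per_day(bases_per_day, genome_size=HUMAN_GENOME_BASES):
    """Return how many times the genome could be covered by one day of output."""
    return bases_per_day / genome_size


# Slide figure, read as 5.6 terabases of raw sequence per day.
daily_output_bases = 5.6e12
fold = genome_equivalents_per_day(daily_output_bases)
print(f"about {fold:.0f}X human genome coverage per day")
```

Under these assumptions the result is roughly 1800X, which is consistent with the slide's "> 1500X" claim.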
23. Datasets
Vertebrates
• Giant panda
• Macaque
- Chinese rhesus
- Crab-eating
• Naked mole rat
• Penguin
- Emperor penguin
- Adelie penguin
• Pigeon, domestic
• Polar bear
• Sheep
• Tibetan antelope
Invertebrates
• Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
• Roundworm
• Silkworm
Plants
• Chinese cabbage
• Cucumber
• Foxtail millet
• Pigeonpea
• Potato
• Sorghum
Human
• Asian individual (YH)
- DNA Methylome
- Genome Assembly
- Transcriptome
• Ancient DNA (coming soon)
- Saqqaq Eskimo
- Aboriginal Australian
Microbe
• E. coli O104:H4 TY-2482
Cell Line
• Chinese Hamster Ovary
25. Our First DOI®
To maximize its utility to the research community and aid those fighting the current
epidemic, genomic data is released here into the public domain under a CC0
license. Until the publication of research papers on the assembly and whole-
genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen.
doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring
rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
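The citation on this slide follows an authors, year, title, publisher, DOI pattern. A minimal sketch of assembling such a string from structured fields is shown below. The `DatasetCitation` class, its field names, and the "et al." truncation are assumptions made for illustration, not GigaDB's actual schema or citation style, and the author list is deliberately shortened to the first four names from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetCitation:
    """Hypothetical container for the fields visible in the slide's citation."""
    authors: List[str]
    year: int
    title: str
    publisher: str
    doi: str

    def to_text(self, max_authors: int = 3) -> str:
        """Render the citation, abbreviating long author lists with 'et al.'."""
        names = "; ".join(self.authors[:max_authors])
        if len(self.authors) > max_authors:
            names += "; et al."
        return f"{names} ({self.year}) {self.title}. {self.publisher}. doi:{self.doi}"


citation = DatasetCitation(
    authors=["Li, D", "Xi, F", "Zhao, M", "Liang, Y"],  # truncated for brevity
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)
print(citation.to_text())
```

Keeping the DOI as a separate field means the same record can produce either the `doi:` form or a resolver URL without re-parsing the citation text.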
34. • Data also submitted to NCBI (including SV data to dbVar)
• Submission to public databases complemented by its citable form in GigaDB:
- Assemblies of three strains
- SNPs
- CNVs
- Raw data
- InDels
- SV
38. Progress!
• July: We begin issuing data DOIs
• August: Journals accept articles with data that have data DOIs
• October: Data DOIs listed in journal articles
• November: Data DOIs are properly cited in the reference section of journal articles
(It’s been a busy year.)
39. Challenges for a Journal/Database
• Reproducibility/Reuse
• Utility/Usability
• Standards/Searchability/Sharing
• Data publishing/DOI®
40. Challenges for a Journal/Database
• Reproducibility/Reuse
• Utility/Usability
• Standards/Searchability/Sharing
• ✔ Data publishing/DOI®
41. Reproducibility/Reuse
• BGI Cloud Computing resources for
handling and analyzing large-scale data.
• Integrated tools to promote more
widespread access, viewing, and analysis
of data.
• Encourage and aid use of workflow
systems for methods (e.g. submission of
Galaxy XML files).
42. Utility/Usability = ease of access
• Special series/hub for cloud-based tools
- Technical notes: test tools in the BGI-Cloud.
- Tools + test data (BGI or user) in one place.
- Aids reproducibility.
- Aids reviewers (free)
- Aids authors: visibility (PubMed, etc.) and hosting (included/free offers)
- Contact us: editorial@gigasciencejournal.com
44. Standards/Searchability/Sharing
• ISA-Tab compatibility to aid and promote
best practice in metadata reporting.
• All supporting data must be publicly available.
• Ask for MIBBI compliance and use of
reporting checklists.
• Part of the BioSharing network and the
International Neuroinformatics
Coordinating Facility.
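Asking for checklist compliance amounts to verifying that required metadata fields are present before a dataset is accepted. The sketch below shows that idea in its simplest form. The field names in `REQUIRED_FIELDS` are hypothetical and are not taken from MIBBI, ISA-Tab, or GigaDB; a real checklist would define its own fields and value constraints.

```python
# Hypothetical checklist: these field names are illustrative only and do not
# come from MIBBI, ISA-Tab, or any GigaDB schema.
REQUIRED_FIELDS = ("title", "organism", "data_type", "license", "doi")


def missing_fields(metadata, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank, in checklist order."""
    return [
        field for field in required
        if not str(metadata.get(field) or "").strip()
    ]


# Example record built from values on the slides; "data_type" is left out
# on purpose to show what a failed check looks like.
record = {
    "title": "Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    "organism": "Escherichia coli",
    "license": "CC0",
    "doi": "10.5524/100001",
}
missing = missing_fields(record)
print(missing)
```

Returning the list of missing fields, rather than a simple pass/fail flag, gives submitters something concrete to fix.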
45. Big Data
• Initiated 505 plant and animal genome projects
• Completed fine or draft genome maps for over 100 species
• Finished the sequencing of about 200 species
ldl.genomics.cn
46. Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD
Contact: editorial@gigasciencejournal.com
Follow GigaScience on Twitter @GigaScience
www.gigasciencejournal.com
www.gigaDB.org
Editor's notes
Integrated tools to promote more widespread access, viewing, and analysis of the stored data. BGI Cloud Computing resources for handling and analyzing large-scale data. All data are given a DOI to make datasets easy to find and cite, and to allow citation tracking.
Our facilities feature Sanger and next-generation sequencing technologies, providing the highest throughput sequencing capacity in the world. Powered by 137 Illumina HiSeq 2000 instruments and 27 Applied Biosystems SOLiD™ 4 Systems, we provide high-quality sequencing results with industry-leading turnaround time. As of December 2010, our sequencing capacity is 5 Tb of raw data per day, supported by several supercomputing centers with a total peak performance of up to 102 Tflops, 20 TB of memory, and 10 PB of storage. We provide stable and efficient resources to store and analyze the massive amounts of data generated by next-generation sequencing.
Raw data has been submitted to the SRA, the assembly to GenBank (no number), and SV data to dbVar (it’s the first plant data they’ve received). GigaDB complements the traditional public databases by hosting all these “extra” data types: it’s all in one place, and it’s citable.
Have all of the metadata fields, working on integrating the tools.