Alexandra Basford, InCoB 2011: A Journal’s Perspective on Data Standards and Biocuration

A Journal’s Perspective on Data Standards and Biocuration
Alexandra Basford, PhD

www.gigasciencejournal.com

Overview
• Overview/Introduction
• The Curation Challenges of a Journal/Database
• Reproducibility/Reuse
• Data Publishing
• Utility/Usability
• Standards/Searchability/Sharing
• Our DOI Adventures

Overview
• Overview/Introduction
• The Curation Challenges of a Journal/Database
• How do we deal with “big data”?
• Data Publishing
• Utility/Usability
• Standards/Searchability/Sharing
• Our DOI Adventures

www.gigasciencejournal.com

GigaScience is a new open-access, open-data journal for the publication of all types of
biological studies that use or create large-scale data sets

The scope spans the biomedical and life sciences,
including:
- “Omics”
- Imaging
- Neuroscience
- Ecology
- Medicine
- Systems biology

… “big and sharable”
Published by
in partnership with

Editorial Board – International

Stephan Beck, UK
Alvis Brazma, UK
Ann-Shyn Chiang, Taiwan
Richard Durbin, UK
Paul Flicek, UK
Robert Hanner, Canada
Yoshihide Hayashizaki, Japan
Henning Hermjakob, UK
Wolfgang Huber, Germany
Gary King, USA
Tin-Lap Lee, Hong Kong
Donald Moerman, Canada
Karen Nelson, USA
Francis Ouellette, Canada
Lennart Hammarström, Sweden
Paul Horton, Japan
Stephen O'Brien, USA
Hanchuan Peng, USA
Russell Poldrack, USA
Ming Qi, China/USA
Susanna-Assunta Sansone, UK
Michael Schatz, USA
David Schwartz, USA
Fritz Sommer, USA
Lincoln Stein, Canada
Sumio Sugano, Japan
Thomas Wachtler, Germany
Jun Wang, China
Alistair Young, New Zealand
Zang Yufeng, China
Marie Zins, France

Editorial Board – Multidisciplinary

Stephan Beck, Epigenomics
Alvis Brazma, Transcriptomics
Ann-Shyn Chiang, Neuroscience
Richard Durbin, Genetics/Genomics
Paul Flicek, Genomics
Robert Hanner, DNA Barcoding/Ecology
Yoshihide Hayashizaki, Genomics
Henning Hermjakob, Proteomics
Wolfgang Huber, Functional Genomics
Gary King, Medicine
Tin-Lap Lee, Genomics
Donald Moerman, Functional Genomics
Karen Nelson, Metagenomics
Francis Ouellette, Genomics
Lennart Hammarström, Immuno/Genetics
Paul Horton, Genetics/Tools
Stephen O'Brien, Genomics
Hanchuan Peng, Imaging/Neuro
Russell Poldrack, Neuroscience
Ming Qi, Genetics
Susanna-Assunta Sansone, Standards
Michael Schatz, Cloud Computing
David Schwartz, Optical Mapping
Fritz Sommer, Neuroscience
Lincoln Stein, Cloud Computing
Sumio Sugano, Genomics
Thomas Wachtler, Neuroscience
Jun Wang, Genomics
Alistair Young, Medical Imaging
Zang Yufeng, Neuroscience
Marie Zins, Medicine

An Unusual Format
• GigaScience combines standard manuscript
publication with an ever-expanding database
• Evolving data repository
– Integrating tools for public access, viewing, and analysis of
the stored data
– Improvements driven by community input
• All datasets are assigned digital object
identifiers (DOIs) to make them easy to access, track,
and cite


Data Sharing Hurdles
• Technical
– data volumes too large
– data too heterogeneous
– no home for many data types
• Economic
– too expensive
– no long-term funding
• Cultural
– inertia
– no incentives to share
– unaware of how to share
– too time-consuming

Changing Trends

Cultural shift towards data sharing.

Growing/widening user base.

The long tail of new “big-data” producers?

Curation, curation, curation

?

Use of Data = Importance + Usability
• Importance: subjective?
• Usability: easier to assess

Challenges for a Journal/Database


Utility/Usability

Standards/Searchability/Sharing

Data publishing/DOI®

Why DOI®s?
• Guarantee of permanency
• Clear method for data tracking and data citation,
allowing:
– Increased searchability (and hopefully use) of data
– Credit for data production, making it clear who produced
the data and when
– Credit to original authors for their data’s use
– The ability to track and receive feedback on data usage
– A data citation metric potentially rivaling and
complementary to the impact factor
– The potential to make the data available and receive credit
for it earlier, then later publish papers on the dataset

Largest Sequencing Capacity in the World

Sequencers
• 137 Illumina HiSeq 2000
• 27 LifeTech SOLiD 4
• 16 AB 3730xl + 110 MegaBACEs
• 2 Illumina iScan
Data Production
• 5.6 Tb / day
• > 1500X of human genome / day
Multiple Supercomputing Centers
• 157 TFlops
• 20 TB Memory
• 12.6 PB Storage
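
The two data production figures above can be cross-checked with a little arithmetic. The sketch below assumes that "Tb" means terabases and that a human genome is roughly 3.1 gigabases; neither assumption is stated on the slide, and the function name is illustrative.

```python
# Assumption (not from the slide): "Tb" is read as terabases (10^12 bases).
DAILY_OUTPUT_BASES = 5.6e12

# Assumption (not from the slide): haploid human genome of roughly 3.1 Gb.
HUMAN_GENOME_BASES = 3.1e9


def fold_coverage(bases_per_day: float, genome_size: float) -> float:
    """Return how many times the genome is covered by one day of output."""
    return bases_per_day / genome_size


coverage = fold_coverage(DAILY_OUTPUT_BASES, HUMAN_GENOME_BASES)
print(f"Daily output is about {coverage:.0f}X of a human genome")
```

Under these assumptions the daily output works out to about 1,800-fold coverage, which is consistent with the "> 1500X" figure quoted.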

Datasets

Vertebrates
• Giant panda
• Macaque
– Chinese rhesus
– Crab-eating
• Naked mole rat
• Penguin
– Emperor penguin
– Adelie penguin
• Pigeon, domestic
• Polar bear
• Sheep
• Tibetan antelope

Invertebrates
• Ant
– Florida carpenter ant
– Jerdon’s jumping ant
– Leaf-cutter ant
• Roundworm
• Silkworm

Plants
• Chinese cabbage
• Cucumber
• Foxtail millet
• Pigeonpea
• Potato
• Sorghum

Human
• Asian individual (YH)
– DNA Methylome
– Genome Assembly
– Transcriptome
• Ancient DNA (coming soon)
– Saqqaq Eskimo
– Aboriginal Australian

Microbe
• E. coli O104:H4 TY-2482

Cell Line
• Chinese Hamster Ovary

Our First DOI®

To maximize its utility to the research community and aid those fighting the current
epidemic, genomic data is released here into the public domain under a CC0
license. Until the publication of research papers on the assembly and whole-
genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang,
Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun,
Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y;
Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482
isolate genome sequencing consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen.
doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring
rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
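
The citation above follows the pattern "Authors (year) Title. Publisher. doi:..." with a resolver URL. A minimal sketch of producing such a string from dataset metadata might look like the following; the `DatasetRecord` class, its field names, and the helper functions are hypothetical, and the author list is deliberately truncated for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal metadata for a data citation (hypothetical field names)."""
    authors: list[str]
    year: int
    title: str
    publisher: str
    doi: str


def normalize_doi(doi: str) -> str:
    """Strip a leading 'doi:' label or resolver prefix, leaving the bare DOI."""
    doi = doi.strip()
    for prefix in ("doi:", "http://dx.doi.org/", "https://doi.org/"):
        if doi.lower().startswith(prefix):
            return doi[len(prefix):]
    return doi


def format_citation(rec: DatasetRecord) -> str:
    """Render 'Authors (year) Title. Publisher. doi:<DOI> <resolver URL>'."""
    bare = normalize_doi(rec.doi)
    authors = "; ".join(rec.authors)
    title = rec.title.rstrip(".")
    return (
        f"{authors} ({rec.year}) {title}. {rec.publisher}. "
        f"doi:{bare} http://dx.doi.org/{bare}"
    )


record = DatasetRecord(
    authors=["Li, D", "Xi, F", "Zhao, M"],  # truncated list, illustration only
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="doi:10.5524/100001",
)
citation = format_citation(record)
print(citation)
```

Normalizing the DOI first means the same record can be stored with or without the `doi:` label or a resolver prefix and still yield one consistent citation string.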

N Engl J Med 2011; 365:718-724.

Sorghum as the New Gold Standard

• Data also submitted to NCBI (including SV data
to dbVar)
• Submission to public databases complemented
by its citable form in GigaDB:
- Assemblies of three strains
- SNPs
- CNVs
- Raw data
- InDels
- SV

Progress!
• July: We begin issuing data DOIs
• August: Journals accept articles with data that have data DOIs
• October: Data DOIs listed in journal articles
• November: Data DOIs are properly cited in the reference section of journal articles
(It’s been a busy year.)

Challenges for /


Utility/Usability

Standards/Searchability/Sharing

✔ Data publishing/DOI®

• BGI Cloud Computing resources for
handling and analyzing large-scale data.
• Integrated tools to promote more
widespread access, viewing, and analysis
of data.
• Encourage and aid use of workflow
systems for methods (e.g. submission of
Galaxy XML files).

Utility/Usability = ease of access
• Special series/hub for cloud-based tools
- Technical notes: test tools in the BGI-Cloud.
- Tools + test data (BGI or user) in one place.
- Aids reproducibility.
- Aids reviewers (free)
- Aids authors: visibility (PubMed, etc.) and hosting (included/free offers)
- Contact us: editorial@gigasciencejournal.com
Oledoe flickr cc

Utility/Usability = tools

Tin-Lap Lee, CUHK

Standards/Searchability/Sharing
• ISA-Tab compatibility to aid and promote
best practice in metadata reporting.
• All supporting data must be publicly
available.
• Ask for MIBBI compliance and use of
reporting checklists.
• Part of the BioSharing network and the
International Neuroinformatics
Coordinating Facility.
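
As an illustration of what a reporting checklist can mean in practice, the sketch below checks a tab-delimited sample table for required columns and empty values. The column names follow ISA-Tab naming style, but the required set and the `check_sample_table` function are assumptions made for this example, not a GigaScience or MIBBI rule.

```python
import csv
import io

# Hypothetical checklist: which columns a submission must fill in.
REQUIRED_COLUMNS = ["Source Name", "Characteristics[organism]", "Sample Name"]


def check_sample_table(text: str) -> dict:
    """Report missing required columns and rows with empty required fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    incomplete = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if column in header and not (row.get(column) or "").strip():
                incomplete.append((row_number, column))
    return {"missing_columns": missing, "incomplete_rows": incomplete}


example = (
    "Source Name\tCharacteristics[organism]\tSample Name\n"
    "source1\tSorghum bicolor\tsample1\n"
    "source2\t\tsample2\n"
)
report = check_sample_table(example)
print(report)
```

Here the second data row lacks an organism, so it is reported as incomplete while the header itself passes.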

Big Data
• Initiated 505 plant and animal genome
projects
• Completed fine or draft genome maps for
over 100 species
• Finished the sequencing of about 200
species

ldl.genomics.cn

Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD

Contact: editorial@gigasciencejournal.com
Follow GigaScience on Twitter @GigaScience

www.gigasciencejournal.com
www.gigaDB.org
