NIH iDASH meeting on data sharing - BioSharing, ISA and Scientific Data
1. The rise of the data-centric research and publication enterprises
Susanna-Assunta Sansone, PhD
Data Consultant; Honorary Academic Editor; Associate Director; Principal Investigator
Board of Directors; Technical Advisory Board; Coordinating Editors; Sector Lead
iDASH meeting, San Diego, Sept 15-16, 2014
@biosharing @isatools @scientificdata
4. Worldwide movement for FAIR data
Credit: Barend Mons
http://bd2k.nih.gov/workshops.html#ADDS
5. Doing my fair share of work
Increase the level of annotation at the source, tracking provenance and using community standards
• Notes and narrative: notes in lab books (information for humans)
• Spreadsheets and tables (the compromise)
• Linked data and nanopublications: facts as RDF statements (information for machines)
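As a rough illustration of the step from a spreadsheet row to machine-readable facts, one row can be restated as subject-predicate-object statements. The sketch below uses only the Python standard library and writes N-Triples-style lines; the `http://example.org/` namespace, the row contents and the function name `row_to_ntriples` are assumptions made for illustration, not identifiers from any real dataset.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def row_to_ntriples(row_id, row):
    """Turn one spreadsheet row (a dict) into N-Triples-style lines.

    Each column becomes a predicate and each cell a plain string literal.
    """
    subject = f"<{BASE}sample/{quote(row_id)}>"
    lines = []
    for column, value in row.items():
        predicate = f"<{BASE}property/{quote(column)}>"
        # Escape backslashes and double quotes so the literal stays valid.
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} {predicate} "{literal}" .')
    return lines


# A made-up row, as it might appear in a lab spreadsheet.
triples = row_to_ntriples("S1", {"organism": "Mus musculus", "tissue": "liver"})
for line in triples:
    print(line)
```

A real pipeline would more likely point predicates and values at terms from community ontologies rather than at plain strings; the sketch only shows the change of shape from table cell to statement.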
Working with and for:
6. The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta
Sansone www.ebi.ac.uk/net-project
To make any dataset ‘FAIR’, one must have standards, tools and best practices to:
• report sufficient details
• capture all salient features of the experimental workflow
• make annotation explicit and discoverable
• structure the descriptions for consistency
• ensure/regulate access
• deposit and publish
• etc.
7. …breadth and depth of the experimental context is pivotal
8. sample characteristic(s)
experimental design
experimental variable(s)
technology(s)
measurement(s)
protocol(s)
data file(s)
......
9. The role of reporting or content standards
Community-developed “norms” set to structure and enrich the
description of datasets, facilitating understanding, sharing and reuse
• Including minimum information reporting requirements, or checklists to report the same core, essential information
• Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’
• Including conceptual model, conceptual schema from which an exchange format is derived to allow data to flow from one system to another
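The first two kinds of content standard can be sketched in a few lines: a checklist says which fields must be reported, and a controlled vocabulary maps different wordings onto one agreed term. The field names, the tiny organism vocabulary and the function `check_record` below are all hypothetical and chosen only for illustration; they do not reproduce any specific community checklist or ontology. Exchange formats, the third kind, are not covered here.

```python
# Hypothetical minimum-information checklist: fields every record must report.
REQUIRED_FIELDS = ["organism", "technology", "protocol", "data_file"]

# Hypothetical controlled vocabulary mapping free-text synonyms to one term.
ORGANISM_TERMS = {
    "mouse": "Mus musculus",
    "mus musculus": "Mus musculus",
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}


def check_record(record):
    """Return (normalised_record, problems) for one dataset description."""
    # Checklist: every required field must be present and non-empty.
    problems = [f"missing: {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    normalised = dict(record)
    organism = record.get("organism")
    if organism:
        # Vocabulary: replace the free-text value with the agreed term.
        term = ORGANISM_TERMS.get(organism.strip().lower())
        if term is None:
            problems.append(f"unknown organism term: {organism}")
        else:
            normalised["organism"] = term
    return normalised, problems


record = {"organism": "Mouse", "technology": "microarray", "data_file": "a.txt"}
normalised, problems = check_record(record)
print(normalised["organism"])
print(problems)
```

Here "Mouse" is normalised to the agreed term and the missing protocol is reported, which is the kind of consistency the checklists and terminologies are meant to bring.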
10. A community mobilization - some examples
de jure: standard organizations
de facto: grass-roots groups
Nanotechnology Working Group
11. Organizational and operational structures - quite diverse
de jure: standard organizations
de facto: grass-roots groups
Nanotechnology Working Group
12. Fragmentation, duplications and gaps
Technologically-delineated views of the world: arrays, scanning, MS, gels, columns, NMR, FTIR
Biologically-delineated views of the world: transcriptomics, proteomics, metabolomics, plant biology, epidemiology, microbiology
Generic features (‘common core’):
- description of source biomaterial
- experimental design components
To compare and integrate data we need interoperable standards
13. Growing number of reporting standards
~ 156
~ 70
~ 334
Source: BioPortal
Databases, annotation, curation tools implementing standards
Minimum information checklists and reporting guidelines: MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, REMARK, MIQE, CONSORT, MISFISHIE…
Formats: MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, SBRML, MzML, GELML, SEDML, ISA-Tab, CML, MITAB…
Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, TEDDY, BTO, IDO, XAO, PRO, DO, VO…
15. BioSharing works to map the landscape of content standards in the life sciences, broadly covering biological, natural and biomedical sciences
The web-based, curated and searchable registry works to ensure the standards are informative and discoverable, monitoring their development and evolution, as well as their use in databases and adoption in data policies.
16. BioSharing’s goal is to assist stakeholders to make informed decisions:
• researchers, developers and curators, who lack support and guidance on how best to navigate and select the various content standards and understand their maturity, or find databases that implement them;
• funders, journals and librarians, who do not have enough information to make informed decisions on which content standards or databases should be recommended in their policies, funded or implemented.
18. Core functionalities:
• search and filtering
• submission forms to add new records
• “claim” functionality for existing records
• person’s profile (as maintainer of records) associated with the ORCID profile
• visualization and views of content
Current content:
• Over 500
• Over 600
19. Registering and cataloging is just step one; the next steps include:
• Develop assessment criteria for usability and popularity of standards
CTSA Omics
Data Standards
Working Group
20. Registering and cataloging is just step one; the next steps include:
• Develop assessment criteria for usability and popularity of standards
• Associate standards to data policies and databases
• Assemble journal and funder policies re data storage
• Make fully cross-searchable
• Continue to embed it in the ecosystem of complementary registries
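What associating standards with policies and databases, and making them cross-searchable, could look like is sketched below as a tiny in-memory registry. This is an illustrative assumption, not BioSharing's actual data model or API: the `Record` class, the `related` function and every record name and link are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    name: str
    kind: str  # "standard", "database" or "policy"
    links: set = field(default_factory=set)  # names of related records


# Hypothetical registry content; the links are invented for illustration.
REGISTRY = [
    Record("ExampleChecklist", "standard"),
    Record("ExampleFormat", "standard"),
    Record("ExampleDB", "database", {"ExampleChecklist", "ExampleFormat"}),
    Record("ExampleJournalPolicy", "policy", {"ExampleChecklist", "ExampleDB"}),
]


def related(name, kind=None):
    """Names of records linked to `name` in either direction, optionally by kind."""
    by_name = {r.name: r for r in REGISTRY}
    # Records that point at `name`.
    hits = {r.name for r in REGISTRY if name in r.links}
    # Records that `name` itself points at.
    if name in by_name:
        hits |= by_name[name].links
    return sorted(n for n in hits if kind is None or by_name[n].kind == kind)


print(related("ExampleChecklist"))
print(related("ExampleDB", kind="standard"))
```

Following links in both directions is what lets a search that starts from a standard reach the databases and policies that refer to it, and the other way round.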
23.
24.
25. General-purpose, configurable format for
the description of experimental metadata
Designed to support:
• provenance tracking
• use of community minimal reporting
guidelines and terminologies
- reference system to link to (CDISC)
SDTM files; further connections
explored via
Designed to be converted to:
• a growing number of other metadata
formats, e.g. used by EBI repositories
• RDF representation with mapping to
several ontologies, incl. PROV-O to
deliver
analysis
method
script
Data file or
record in a
database
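The provenance tracking that the format is designed to support can be pictured as a chain linking each data file back to the material it came from. The sketch below is plain Python, not ISA syntax; the node names, process labels and the function `trace_provenance` are hypothetical and serve only to show the idea of walking such a chain.

```python
# Each entry: node -> (what it was derived from, process that produced it).
# All names are hypothetical and only illustrate the idea of a provenance chain.
DERIVED_FROM = {
    "results.csv": ("raw_scan.dat", "analysis script"),
    "raw_scan.dat": ("extract-1", "measurement"),
    "extract-1": ("sample-1", "extraction protocol"),
    "sample-1": ("source-1", "sample collection"),
}


def trace_provenance(node):
    """Walk back from a data file to its original source material."""
    steps = []
    seen = set()
    while node in DERIVED_FROM and node not in seen:
        seen.add(node)  # guard against accidental cycles
        parent, process = DERIVED_FROM[node]
        steps.append((node, process, parent))
        node = parent
    return steps


for child, process, parent in trace_provenance("results.csv"):
    print(f"{child} <- [{process}] <- {parent}")
```

Recording the process on every link is what allows a reader to answer, for any file, which protocol or script produced it and from what.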
26.
27. ISA powers data collection, curation resources and repositories, e.g.:
28. Embedding and in activities
CEDAR:
Centre for Extended Data Annotation and Retrieval
(PI: Musen; pending notification of award)
The centre will take advantage of the recent growth in
community-driven metadata standards to develop
innovative methods to facilitate the annotation,
cataloguing, and retrieval of dataset collections.
(pending final decision and notification of award)
29. Role of publishers as “agents of change”
• Data has to become an integral part of scholarly communication
• Responsibilities lie across several
stakeholder groups: researchers,
data centers, librarians, funding
agencies and publishers
• Publishers occupy a leverage point
in this process
30. Launched on May 27th, 2014
• Credit for sharing your data
• Focused on reuse and reproducibility
• Peer reviewed, curated
• Promoting Community Data Repositories
• Open Access
A new online-only publication for descriptions of scientifically valuable datasets
in the life, environmental and biomedical sciences, but not limited to these
Supported by:
31. Data Descriptor: narrative and structure
• Article or narrative component (PDF and HTML)
• Experimental metadata or structured component (in-house curated, machine-readable formats)
33. Data Descriptor - focus on reuse
Detailed descriptions of methods and technical analyses supporting quality
of the measurements; does not contain tests of new scientific hypotheses
Sections:
• Title
• Abstract
• Background & Summary
• Methods
• Technical Validation
• Data Records
• Usage Notes
• Figures & Tables
• References
• Data Citations
In traditional publications this information is not provided in sufficient detail. However, it is essential for understanding, reusing, and reproducing datasets.
34. Relation with traditional articles - content
Scientific hypotheses:
• Synthesis
• Analysis
• Conclusions
Methods and technical analyses supporting the quality of the measurements:
• What did I do to generate the data?
• How was the data processed?
• Where is the data?
• Who did what, and when?
35. Relation with traditional articles - time
BEFORE: get your data to the community as soon as possible (see NPG pre-publication policy)
AT THE SAME TIME: publish your Data Descriptor(s) alongside research article(s)
AFTER: expand on your research articles, adding further information for reuse of the data
36. Citations of and links to data files - databases
Joint Declaration of Data Citation Principles by
the Data Citation Synthesis Group
37. Value added component integrated in a
growing ecosystem
We currently recognize over
50 public data repositories
Research papers, Data Descriptors, Data records
38. Peer review process focused on quality and reuse
Evaluation is not based on the perceived impact or novelty of the findings
• Experimental rigour and technical data quality
o Methodologically sound
o Technical validation experiments and statistical analyses
o Depth, coverage, size, and/or completeness of data sufficient for the types
of applications
• Completeness of the description
o Sufficient details to allow others to reproduce the results, reuse or
integrate it with other data
o Compliance with relevant minimum information or reporting standards
• Integrity of the data files and repository record
o Data files match the descriptions in the Data Descriptor
o Deposited in the most appropriate available data repository
39. Current content is diverse - bimonthly releases
• Neuroscience, ecology, epidemiology, environmental science, functional genomics, metabolomics, toxicology etc.
• New and previously published individual datasets, curated aggregations and citizen science:
o a fuller, more in-depth look at the data processing steps, supported by additional data files and code from each step
o additional tutorial-like information for scientists interested in reusing or integrating the data with their own
• Datasets in figshare, Dryad and domain-specific databases
• Code deposited in figshare and GitHub
• First collection:
41. Acknowledgements
Advisory Boards and Collaborators
Philippe Rocca-Serra, PhD
Alejandra Gonzalez-Beltran, PhD
Eamonn Maguire
Milo Thurston, PhD
Visit: nature.com/scientificdata
Email: scientificdata@nature.com
Tweet: @ScientificData
Honorary Academic Editor: Susanna-Assunta Sansone, PhD
Managing Editor: Andrew L Hufton, PhD
Editorial Curator: Victoria Newman
Advisory Panel and Editorial Board including
senior researchers, funders, librarians and curators