My FAIR share of the work - Diamond Light Source - Dec 2018
1. The FAIR principles: theory and practices
My fair share of the work to enable FAIRness
Susanna-Assunta Sansone, PhD
Diamond Light Source, Oxfordshire, UK, 12 December 2018 Slides at: https://www.slideshare.net/SusannaSansone
Associate Professor, Associate Director
ORCiD: 0000-0001-5306-5690
Twitter: @SusannaASansone
Consultant and Founding Academic Editor
2. • Increasing number of discoveries made using other people’s data
Better data = better science and more efficiently
Datasets, SOPs, Figures, Photos, Workflows, Slides, Codes, Tools, Databases, Algorithms, Document
3. • Increasing number of discoveries made using other people’s data
• We need data that are
§ Discoverable by humans and machines
§ Retrievable and structured in standard format(s)
§ Self-described so that third parties can make sense of it
§ Intended to outlive the experiment for which they were collected
Better data = better science and more efficiently
4. Key problems with data:
low findability and understandability
• Not always well cited and stored
o True for any other digital asset
• Poorly described for third party reuse
o Different levels of detail and annotation
• Reporting and annotation activities are perceived as time-consuming
o Often rushed and minimally done
6. • A crisis in confidence in research integrity in certain fields
Driving forces of change
https://retractionwatch.com/2011/05/04/the-importance-of-being-reproducible-keith-baggerly-tells-the-anil-potti-story/
https://doi.org/10.1371/journal.pmed.0020124
Crimes and misdemeanors of science
7. • A crisis in confidence in research integrity in certain fields
• New data types and multidisciplinary activities
Engineering the Imagination: Disability, Prostheses and the Body
Engineering and cultural studies
Exploring Water Re-use - the nexus of politics, technology and economics
Before and After Halley: Medieval Visions of Modern Science
Astrophysics and medieval studies
The ontogeny of bone microstructure as a model of programmed transformation in 4D materials
Archaeology, anthropology and mechanical engineering
How can we improve Healthcare IT when most people are blind to its poor engineering?
ICT, medicine and engineering
People, Pollinators & Pesticides in Peri-Urban Farming
Biology, zoology, law & policy
Systemic Risk: Mathematical Modelling and Interdisciplinary Approaches
Mathematics and economics
Driving forces of change
8. • A crisis in confidence in research integrity in certain fields
• New data types and multidisciplinary activities
• The changing world of scholarly publishing
Driving forces of change
9. • A crisis in confidence in research integrity in certain fields
• New data types and multidisciplinary activities
• The changing world of scholarly publishing
• Data-related mandates and policies by funders
• Data management in a regulatory context
• The need for recognition and credit
Driving forces of change
10. A set of principles to
enhance the value of all
digital resources
Developed and endorsed by researchers,
service providers, publishers, funding agencies,
industry partners; including but not limited to
individuals part of:
2014
2016
11. Findable
• Globally unique, resolvable, and persistent identifiers
• Machine-readable metadata to support structured search
Accessible
• Clearly defined access and security protocols
Interoperable
• Extensible machine interpretable formats for data + metadata
• Linked to other resources
Reusable
• Provide licensing and provenance, and use community standards
The FAIR Principles – in a nutshell
Emphasis is on enhancing the ability of machines to automatically find
and use the data, in addition to supporting its reuse by individuals
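As a rough illustration of what "machine-actionable" means for the four principles, the sketch below checks a metadata record for the elements listed on this slide. The field names (identifier, access_protocol, and so on) and the example record are hypothetical; the FAIR principles do not prescribe them.

```python
# Hypothetical field names per principle; the FAIR principles do not
# prescribe these, they are assumptions made for this sketch.
FAIR_CHECKS = {
    "Findable": ["identifier", "metadata"],
    "Accessible": ["access_protocol"],
    "Interoperable": ["format", "links"],
    "Reusable": ["license", "provenance"],
}


def fair_gaps(record):
    """Return, per principle, the fields that are missing or empty."""
    gaps = {}
    for principle, fields in FAIR_CHECKS.items():
        missing = [field for field in fields if not record.get(field)]
        if missing:
            gaps[principle] = missing
    return gaps


# Made-up record: it has no links to other resources and no provenance.
record = {
    "identifier": "https://doi.org/10.1234/example",  # made-up DOI
    "metadata": {"title": "Example dataset"},
    "access_protocol": "https",
    "format": "text/csv",
    "links": [],
    "license": "CC-BY-4.0",
}

print(fair_gaps(record))
```

A check like this only tests for the presence of descriptors; whether an identifier resolves or a licence is appropriate still needs evidence from the resource provider.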
12. The invisible machinery
• Identifiers and metadata to be implemented by technical
experts in tools, registries, catalogues, databases, services
• It is essential to make standards ‘invisible’ to lay users, who
often have little or no familiarity with them
13. • Descriptors for a digital object that help to understand what it
is, where to find it, how to access it etc.
• The type of metadata depends also on the digital object
• The depth and breadth of metadata varies according to
their purpose
▪ e.g. reproducibility requires richer metadata than citation
Metadata – fundamentals
Illustration by Jørgen Stamp
digitalbevaring.dk CC BY 2.5 Denmark
15. “….We support effort to promote voluntary knowledge diffusion and technology transfer on mutually
agreed terms and conditions. Consistent with this approach, we support appropriate efforts to promote
open science and facilitate appropriate access to publicly funded research results on findable,
accessible, interoperable and reusable (FAIR) principles….”
http://europa.eu/rapid/press-release_STATEMENT-16-2967_en.htm
G20 Leaders’ Communique Hangzhou Summit
18. FAIR-supporting tools and services being developed in
major EU and USA biomedical infrastructure
programmes, e.g.
€19 million, 2015 - 2019
$95.5 million, 2017 - 2020
€3.3 billion, 2014 - 2020
€20 million, 2018 - 2022
19. • Publishers occupy a “leverage point” in this process
• Data has become an integral part of scholarly communications
• ….and FAIR has opened up new (business) opportunities for some….
FAIR-enabling data journals and publishers’ services
21. Notes in Lab Books (information for humans)
Spreadsheets and Tables (the compromise)
Facts as RDF statements (information for machines)
Notes and narrative; spreadsheets and tables; linked data and data publication
Increase the level of annotation at the source, tracking
provenance and using community standards
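To make the step from a spreadsheet row to "facts as RDF statements" concrete, here is a minimal sketch that writes two statements in N-Triples syntax using only the standard library. The example.org namespace, the sample and study names, and the choice of Dublin Core terms are assumptions for illustration, not the group's actual pipeline.

```python
def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


EX = "http://example.org/"         # hypothetical namespace
DCT = "http://purl.org/dc/terms/"  # Dublin Core terms

# A spreadsheet row (the compromise) ...
row = {"sample": "sample12", "date": "2018-12-12", "study": "study1"}

# ... expressed as explicit statements (information for machines).
triples = [
    (iri(EX + row["sample"]), iri(DCT + "date"), literal(row["date"])),
    (iri(EX + row["sample"]), iri(DCT + "isPartOf"), iri(EX + row["study"])),
]

print(to_ntriples(triples))
```

Each line names its subject, property and value explicitly, so a machine does not have to guess what a column header meant.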
Our group’s R&D activities
Information management, and interoperability of applications;
data reproducibility and the evolution of scholarly publishing
22. ISA model and related formats
Initiated in
2003
Helps researchers to:
describe multi-modal experiments
follow community-developed standards
curate, analyze, release, share and publish
23. • Domain-level descriptors that are essential for interpretation,
verification and reproducibility of datasets
• The depth and breadth of descriptors vary according to the
type of study performed, generally allowing
▪ experimental components (e.g., design, conditions, parameters),
▪ fundamental biological entities and biomaterial (e.g., samples,
genes, cells),
▪ complex concepts (such as bioprocesses, tissues and diseases),
▪ instruments, analytical processes and mathematical models, and
▪ their instantiation in computational simulations (from the molecular
level through to whole populations of individuals)
to be harmonized with respect to structure, format and
annotation
Richer metadata: for Interoperability and Reuse
24. Nowadays ISA
(format and/or tools)
powers 28 public resources, e.g.,
as well as a number of
‘internal’ resources
ISA is now also a native
Galaxy Data Type
and data journals e.g.,
31. [Timeline figure: Mar15, Jun15, Aug15, Dec15, May16, Jun16, Sep16, Mar17, May17]
Our community engagement: input, feedback and links
Phase 1: Design and development
Phase 2: Evaluation & iterative refinement
Phase 3: Continued evaluation & consolidation
Participants: primarily metadata modelers; primarily implementers
Engagement: use cases workshop; 1st DATS workshop; 2nd DATS workshop; WG3 formed, telecons start, dissemination; WG7 formed, telecons start; WG12 formed, telecons start
Outputs: SOP and metadata strawman; DATS name; metadata specification v1.0 with JSON schema; DATS v1.1; DATS v2.0 (with access metadata, WG7); DATS v2.1 (schema.org JSON-LD); DATS v2.2
Defining the model: technical and social engineering
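The timeline mentions a schema.org JSON-LD serialization. As a hedged illustration of what such a dataset description can look like, the sketch below builds a generic schema.org Dataset record. It uses only well-known schema.org properties and made-up values; it is not the DATS schema itself, and the helper name dataset_jsonld is hypothetical.

```python
import json


def dataset_jsonld(name, description, identifier, license_url, creators):
    """Build a generic schema.org Dataset description as a JSON-LD dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": identifier,
        "license": license_url,
        "creator": [{"@type": "Person", "name": c} for c in creators],
    }


# All values below are made up for illustration.
doc = dataset_jsonld(
    name="Example dataset",
    description="Illustrative record, not a real deposit.",
    identifier="https://doi.org/10.1234/example",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creators=["A. Researcher"],
)

print(json.dumps(doc, indent=2))
```

Embedding a block like this in a dataset landing page is one common way to expose metadata to search engines and harvesters.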
32. Metadata elements identified by combining the two complementary approaches
USE CASES: top-down approach SCHEMAS: bottom-up approach
(v1.0, v1.1, v2.0, v2.1, v2.2)
The development process in a nutshell
❖ BioProject
❖ BioSample
❖ PRIDE-ml
❖ MAGE-tab
❖ GA4GH schema
❖ SRA xml
❖ ISA
❖ ….
❖ DataCite
❖ RIF-CS
❖ DCAT
❖ PROV
❖ VOID
❖ Dublin Core
❖ schema.org
❖ ….
33. 12 international
teams, plus
commercial cloud
service providers
(Amazon, Google,
Microsoft)
Who we are:
What is a Data
Commons:
A collection of technical
components (software, protocols,
standards, tools) that:
• work together and connect
directly to the cloud
• permit access, use, and
analysis of data to support
biomedical research
Test data
34. Domain-specific metadata standards for datasets
MIAME, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, …, REMARK, CONSORT
SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, …, GELML, ISA, CML, MITAB
AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, …, TEDDY, PRO, XAO, DO, VO
de jure: standard organizations
de facto: grass-roots groups
350+
150+
700+
~1300
Formats Terminologies Guidelines Identifiers
9
36. • Perspective and focus vary, ranging:
§ from standards with a specific biological or clinical domain of study
(e.g. neuroscience) or significance (e.g. model processes)
§ to the technology used (e.g. imaging modality)
• Motivation is different, spanning:
§ creation of new standards (to fill a gap)
§ mapping and harmonization of complementary or contrasting efforts
§ extensions and repurposing of existing standards
• Stakeholders are diverse, including those:
§ involved in managing, serving, curating, preserving, publishing or
regulating data and/or other digital objects
§ academia, industry, governmental sectors, and funding agencies
§ producers but also consumers of the standards, as domain (and
not just technical) expertise is a must
A complex landscape
37. Technologically-delineated views of the world: Arrays & Scanning, Columns, Gels, MS, FTIR, NMR, …
Biologically-delineated views of the world: transcriptomics, proteomics, metabolomics, plant biology, epidemiology, neuroscience
Generic features (common core)
- description of source biomaterial
- experimental design components
Fragmentation, duplications and gaps
38. Arrays & Scanning, Columns, Gels, MS, FTIR, NMR, …
transcriptomics, proteomics, metabolomics
plant biology, epidemiology, neuroscience
Modularization to combine and validate
Proteomics-based investigations of neurodegenerative diseases
Proteomics and metabolomics-based investigations of neurodegenerative diseases
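The idea of modularization can be sketched as a common core plus technology and domain modules that are combined per investigation and then used to validate a record. The module contents and field names below are hypothetical placeholders, not taken from any published checklist.

```python
# Hypothetical required fields; real checklists define their own.
CORE = {"source_biomaterial", "experimental_design"}

TECHNOLOGY_MODULES = {
    "MS": {"instrument", "ionization_mode"},
    "NMR": {"instrument", "field_strength"},
}

DOMAIN_MODULES = {
    "proteomics": {"protein_database"},
    "metabolomics": {"metabolite_library"},
    "neuroscience": {"brain_region"},
}


def required_fields(technologies, domains):
    """Combine the common core with the selected modules."""
    fields = set(CORE)
    for technology in technologies:
        fields |= TECHNOLOGY_MODULES[technology]
    for domain in domains:
        fields |= DOMAIN_MODULES[domain]
    return fields


def validate(record, technologies, domains):
    """Return the sorted list of required fields absent from the record."""
    return sorted(required_fields(technologies, domains) - set(record))


# A proteomics and metabolomics investigation in neuroscience, using MS.
record = {
    "source_biomaterial": "tissue",
    "experimental_design": "case-control",
    "instrument": "example spectrometer",
    "ionization_mode": "positive",
    "protein_database": "example database",
}

print(validate(record, ["MS"], ["proteomics", "metabolomics", "neuroscience"]))
```

Because the shared core is defined once, adding a new technology or domain means adding one module rather than writing another stand-alone checklist.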
42. Accelerate the discovery, selection and use of these resources
Increase their visibility, reuse, adoption and citation
43. Mapping the landscape of these resources
Databases and data repositories
Community standards, focusing on metadata and identifier schemas: Formats, Terminologies, Guidelines, Identifiers
Data policies by funders, journals and other organizations
• Providing functionalities to search, visualize and create custom views
• Working to assess the FAIRness of these digital resources
44. Tracking maturity and evolution
Databases and data repositories
Community standards, focusing on metadata and identifier schemas: Formats, Terminologies, Guidelines, Identifiers
Data policies by funders, journals and other organizations
Ready for use, implementation, or recommendation
In development
Status uncertain
Deprecated as subsumed or superseded
All records are manually curated
in-house, verified and claimed by the
community behind each resource
48. Ensures that standards, databases, repositories, policies are:
• Findable, e.g., by providing DOIs, functionality to register, claim, maintain,
interlink, classify, search and discover them
• Accessible, e.g., identifying their level of openness and/or licence type
• Interoperable, e.g., highlighting which repositories implement the same
standards to structure and exchange data
• Reusable, e.g., knowing the coverage of a standard and its level of
endorsement by a number of repositories should encourage its use or
extension in neighbouring domains, rather than reinvention
FAIRsharing enables the FAIR principles
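The Interoperable point above (highlighting which repositories implement the same standards) amounts to inverting a repository-to-standards mapping. The sketch below shows that with an in-memory list; the repository names and their pairing with standards are invented for illustration, and this is not the FAIRsharing data model or API.

```python
from collections import defaultdict

# Hypothetical registry records; the repository names are made up.
REGISTRY = [
    {"name": "repo-a", "implements": ["ISA", "MIAME"]},
    {"name": "repo-b", "implements": ["ISA"]},
    {"name": "repo-c", "implements": ["DICOM"]},
]


def repositories_by_standard(records):
    """Invert the mapping: standard -> repositories implementing it."""
    index = defaultdict(list)
    for record in records:
        for standard in record["implements"]:
            index[standard].append(record["name"])
    return dict(index)


def shared_standards(records):
    """Keep only standards implemented by more than one repository."""
    index = repositories_by_standard(records)
    return {std: names for std, names in index.items() if len(names) > 1}


print(shared_standards(REGISTRY))
```

The same index also gives a crude measure of endorsement for the Reusable point: the number of repositories listed against each standard.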
50. Metrics to assess FAIRness
• A proposed core set of 14 semi-quantitative metrics (measurable
indicators) for the evaluation of FAIRness
• FAIRness is an aspirational target and reflects the extent to which a
digital resource addresses the FAIR principles as per the expectations
defined by a community
FAIRmetrics.org
also part of:
51. • 14 universal metrics covering each of the FAIR sub-principles
• The metrics demand evidence from the community, some of which may
require specific new actions
• Digital resource providers must provide a web-accessible document with
machine-readable metadata (FM-F2, FM-F3), detail identifier
management (FM-F1B), metadata longevity (FM-A2), and any
additional authorization procedures (FM-A1.2)
• They must ensure the public registration of their identifier
schemes (FM-F1A), (secure) access protocols (FM-A1.1),
knowledge representation languages (FM-I1), licenses
(FM-R1.1), provenance specifications (FM-R1.2)
• They must provide evidence of ability to find the digital resource in
search results (FM-F4), linking to other resources (FM-I3), FAIRness of
linked resources (FM-I2), and meeting community standards (FM-R1.3)
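One way to picture how evidence against these metrics could be tallied is the sketch below, which groups a subset of the metric identifiers named on this slide by principle and counts how many have supporting evidence. The pass/fail scoring and the evidence dictionary are assumptions for illustration; the metrics themselves are described as semi-quantitative, and this is not the FAIRmetrics.org implementation.

```python
# A subset of the metric identifiers named on the slide.
METRIC_IDS = [
    "FM-F1A", "FM-F1B", "FM-F2", "FM-F3", "FM-F4",
    "FM-A1.1", "FM-A2",
    "FM-I1", "FM-I2", "FM-I3",
    "FM-R1.1", "FM-R1.2", "FM-R1.3",
]


def principle_of(metric_id):
    """Extract the principle letter, e.g. 'FM-F1A' -> 'F'."""
    return metric_id.split("-", 1)[1][0]


def summarize(evidence):
    """Map each principle to (metrics with evidence, metrics assessed).

    `evidence` maps a metric id to True when the provider has supplied
    the required evidence; missing ids count as no evidence.
    """
    summary = {}
    for metric_id in METRIC_IDS:
        principle = principle_of(metric_id)
        passed, total = summary.get(principle, (0, 0))
        summary[principle] = (passed + (1 if evidence.get(metric_id) else 0), total + 1)
    return summary


# Made-up evidence for a hypothetical resource.
evidence = {"FM-F1A": True, "FM-F2": True, "FM-A1.1": True, "FM-R1.1": True}

print(summarize(evidence))
```

Reporting per-principle counts rather than a single score keeps visible where a resource falls short, which fits the view of FAIRness as an aspirational target.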
52. Currently, two prototypes to assess FAIRness
• FAIRsharing works to serve as:
• Registry to describe digital assets, such as databases/repositories, standards and
policies, enhancing their discoverability (schema.org) and citability (DOIs)
• Look-up service for identifier schemas and standards (phase 1: now)
• Validation service against metadata standards (phase 2: planned)
53. Pre-print at: https://doi.org/10.1101/245183
Authored by 68 authors, representing the
FAIRsharing community of core adopters, advisory
board members, and key collaborators, who are
stakeholders from academia, industry, funding
agencies, standards organizations, infrastructure
providers and scholarly publishers
RDA FAIRsharing WG:
https://rd-alliance.org/group/fairsharing-registry-connecting-data-policies-standards-databases.html
accepted by
54. More work planned in funded ELIXIR-related projects
• H2020 “EOSC-Life”
brings together the 13 Biological and Medical ESFRI
research infrastructures to create an open
collaborative digital space
§ FAIRification guidance by UK (Oxford)
• Innovative Medicines Initiative “FAIRplus”:
brings together representatives of several ELIXIR Nodes,
Janssen, AZ, Eli Lilly, GSK, Novartis, Bayer and BI to address
a specific IMI call
§ FAIRification cookbook by UK (Oxford)
55. • Better data = better science
§ improving FAIRness of data will increase potential for reuse
• A variety of activities are ongoing to support FAIRness
§ work in progress …..on all fronts
• This is not just about technology challenges
§ we need FAIR-supportive data policies and culture changes
Summary
56. Better data = better science
Philippe Rocca-Serra, PhD, Senior Research Lecturer
Alejandra Gonzalez-Beltran, PhD, Research Lecturer
Massimiliano Izzo, PhD, Research Software Engineer
Peter McQuilton, PhD, Knowledge Engineer
Allyson Lister, PhD, Knowledge Engineer
Melanie Adekale, PhD, Biocurator Contractor
Delphine Dauga, PhD, Biocurator Contractor
Susanna-Assunta Sansone, PhD, Associate Professor, Associate Director
Ramon Granell, Research Software and Knowledge Engineer
Dominique Batista, Research Software and Knowledge Engineer
Milo Thurston, DPhil, Research Software Engineer
We work with and for
to make data and other digital research outputs