This document discusses the need for persistent identifiers (PIDs) in structural biology research. It notes that structural biology projects using electron microscopes and synchrotrons produce large amounts of data from complex experiments and workflows. PIDs are needed to identify samples, experiments, datasets and software, and to link their metadata together, improving the findability, accessibility and reproducibility of the data in line with the FAIR principles. However, minting large quantities of PIDs in near real time for structural biology projects poses challenges around metadata schemas, linking data together, and integrating existing facility solutions.
Why we need PIDs for Structural Biology - EOSC Symposium, Budapest, 2019
1. Why we need PIDs for Structural Biology
Marcus Povey
EOSC Symposium, Budapest
November 2019
2. Instruct-ERIC
• Instruct-ERIC helps facilitate access to
cutting edge research infrastructure
within the domain of Structural Biology
• We have centres all around Europe
• 13 member countries, with 2 observers
• Funded through direct member country
contribution at the ministerial level
4. Our Mission
• ACCESS – Facilitating access to cutting
edge research infrastructure and
methods
• FACILITY – Helping research
infrastructures manage their equipment,
and representing their interests
• COMMUNITY – Contributing to the wider
scientific community as a whole, and
helping researchers, projects and
infrastructures work better together
• DATA – Improving access to research data,
and facilitating Open Access
• ARIA Cloud!
5. Structural Biologists use Microscopes
• 12 Samples (grids) in a loader
• Each grid can potentially have multiple
structures that are of interest (projects
with 96 well grids are underway)
• Outputs ~1-3TB of HD Video per day
6. Electron Microscopy
• Researcher submits a proposal for access
• Researcher produces a sample locally
• Sample is loaded onto a grid
• Grid goes into Electron Microscope
• Micrographs go into pre-processor
• Particle picking, auto & manual processing
• Datasets are analysed by 10s of software packages
• 3D structure determined
• Structure deposited into PDB/EM-DB
• Researcher submits a publication to journal
7. There are a lot of things to track…
• Number of sample grids
• Potentially multiplied by samples on
a grid
• Multiplied by grids in a microscope
• Multiplied by frames of video
• Multiplied by number of microscopes per
facility
• … multiplied by the number of facilities.
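The multiplication on this slide can be made concrete with a rough sketch. Only the 12 grids per loader comes from the slides; every other number below is a hypothetical placeholder chosen to show the shape of the calculation, not a measured figure.

```python
from math import prod

# Only grids_per_loader comes from the slides; all other values are
# hypothetical placeholders chosen purely for illustration.
factors = {
    "samples_per_grid": 4,          # assumption
    "grids_per_loader": 12,         # from the slides
    "frames_per_grid": 1_000,       # assumption
    "microscopes_per_facility": 2,  # assumption
    "facilities": 10,               # assumption
}


def identifiers_needed(factors: dict[str, int]) -> int:
    """Multiply the factors together for an upper bound on items to track."""
    return prod(factors.values())


total = identifiers_needed(factors)
print(f"Up to {total:,} items to identify across all facilities")
```

Because the factors multiply, even modest placeholder values push the total close to a million items, which is why the volume of identifiers matters.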
8. ... But wait, there’s more!
• We need to know the data processing
workflows used
• We need to identify samples and
associated metadata
• We need to know a given machine’s
configuration
• Software and software versions used to
process and analyse data
• Researchers involved in project
• Funding applications (proposals)
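One way to picture the items listed above is as a single metadata record that points at the other identified things. The following is a minimal sketch under stated assumptions: the `PidRecord` class, the relation names and every identifier string are hypothetical and do not come from any existing PID schema or from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class PidRecord:
    """Hypothetical metadata record attached to a single identifier."""

    pid: str
    kind: str  # e.g. "dataset", "sample", "software"
    links: dict[str, list[str]] = field(default_factory=dict)

    def link(self, relation: str, target_pid: str) -> None:
        """Record a typed, machine-readable link to another identifier."""
        self.links.setdefault(relation, []).append(target_pid)


# All identifiers and relation names below are made up for illustration.
dataset = PidRecord(pid="example:dataset/1", kind="dataset")
dataset.link("derivedFrom", "example:sample/1")
dataset.link("acquiredWith", "example:microscope-config/1")
dataset.link("processedBy", "example:software/picker@1.0")
dataset.link("processedBy", "example:software/refiner@2.3")
dataset.link("createdBy", "example:researcher/1")
dataset.link("fundedBy", "example:proposal/1")

for relation, targets in dataset.links.items():
    print(relation, "->", ", ".join(targets))
```

Keeping the software version inside the identifier is only one possible design; a real schema might model versions as separate linked records.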
10. Crystallography
• Researcher submits a proposal for access
• Researcher produces a sample locally
• Sample added to crystal plate
• Crystal plate imaged regularly
• Crystals loaded onto pins
• Crystals shot with X-Rays at synchrotron
• Diffraction pattern auto-analysis and re-running
• 3D structure determined
• Structure deposited into PDB
• Researcher submits a publication to journal
11. Why Identify? – Improving workflows
• Different samples need to be prepared in
different ways
• Not all experiments are successful
• Do-overs are expensive!
12. Why Identify? - FAIR
• Want to be FAIR! (improve findability,
interoperability and reproducibility of
data sets)
• Commitment to Open Access
• But… Data sets are too large to practically
move about
• Machine configurations are often only
available on the machine itself
• Software gets modified
• How do we make this findable, accessible
and reusable?
13. Some Problems to Consider
• Will require minting of large quantities of PIDs
• … In near real time
• Metadata schemas around existing PIDs seem focussed on
publications
• ... But we’d need to extend them (and make them machine readable)
• … ditto how best to link / graph data together
• Some facilities have rolled their own solutions, how to cooperate?
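The first and fifth problems above (minting in near real time, and linking data into a graph) can be sketched together. This is an illustration only, assuming identifiers are minted locally from random UUIDs under a made-up `example-facility` prefix and registered with a resolver later; it is not how any particular facility or PID provider works, and all names are hypothetical.

```python
import uuid
from collections import defaultdict, deque

PREFIX = "example-facility"  # hypothetical namespace, not a real PID prefix


def mint(kind: str) -> str:
    """Mint an identifier locally, with no round trip to a central service."""
    return f"{PREFIX}:{kind}/{uuid.uuid4()}"


class PidGraph:
    """Directed graph of typed links between identifiers."""

    def __init__(self) -> None:
        # pid -> list of (relation, target pid)
        self.edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def link(self, source: str, relation: str, target: str) -> None:
        self.edges[source].append((relation, target))

    def provenance(self, start: str) -> list[str]:
        """Breadth-first walk returning every PID reachable from start."""
        seen, queue, order = {start}, deque([start]), []
        while queue:
            current = queue.popleft()
            order.append(current)
            for _, target in self.edges.get(current, []):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return order


# Mint one identifier per stage of the electron microscopy workflow.
proposal, sample, grid, micrograph, structure = (
    mint(kind) for kind in ("proposal", "sample", "grid", "micrograph", "structure")
)

graph = PidGraph()
graph.link(structure, "derivedFrom", micrograph)
graph.link(micrograph, "recordedFrom", grid)
graph.link(grid, "holds", sample)
graph.link(sample, "fundedBy", proposal)

# Walk from the deposited structure back to the original proposal.
for pid in graph.provenance(structure):
    print(pid)
```

Minting locally avoids a network call per identifier, but it leaves open how such identifiers would be registered and resolved across facilities, which is the cooperation question raised in the last bullet.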