A presentation I gave at the 2018 Molecular Med Tri-Con in San Francisco, February 2018. It addresses the general challenge of biomedical data management and some of the things to consider when evaluating solutions, and concludes with a brief summary of some of the tools and platforms in this space.
1. Converged IT and Data Commons
Simon Twigger, Ph.D.
Molecular Med Tri-Con
February 13th 2018
2. BioTeam
…est, Objective, Vendor and Technology agnostic
… years bridging the gap between IT and Science
… by scientists forced to … IT to get the job done
…ual company with …
3. About Me
Strategic assessments,
Cloud (AWS/Google),
DevOps, Data Commons,
software development,
Whole genome sequencing resource identifies 18 new
candidate genes for autism spectrum disorder
Nature Neuroscience 20, 602–611 (2017)
4. Overview
‣ Scope
• Organization’s perspective
• Planning & implementation considerations
‣ Strategy around ‘data commons’
• How might it support the ‘bigger picture’?
‣ Implementation of a data commons
• What does it involve, what tools/tech might be useful
6. Generic Scientist’s Data ‘Journey’
[Figure: stages of the data journey plotted against experience (+ / Neutral / -)]
Stages: Preliminary Data, Supporting Data, Run Experiments, Raw Data Management, Analyses, Archive Data, Publish, Reuse Data
Annotations on the figure:
‣ Likely external; download as needed
‣ Have equipment & instruments
‣ Not consistent; rarely any plan; data spread out; no structure unless using core facility; instrument backup uncertain
‣ Compute OK; have software; have tools
‣ Backup?; hard to find; can’t track across project; storage limits; save long term; physical data - slides
‣ Rarely reused; may not be readily reusable by others; hard to find data; rely on original person or manual hunting
‣ Can find own data; may not use others
‣ No real issues; may store pub’d data in one spot; submit to GEO, etc.
7. What is a data commons?
An integrated (converged) environment that
provides access to shared data, compute and
analytic tools at a scale (or convenience)
greater than that typically available
Data Commons
Data Compute Tools
8. Where are the key areas for your research?
What is in common for you?
Data Commons
Data
Compute
Tools
9. Do you need a commons….?
Lots of need for
compute, just need a
cluster?
Data Commons
Lots of data, not much
in common
Just need more
storage, and/or data
management?
Data Commons
Significant amounts of data in common, plus compute and tools
10. What problem are we solving, is a commons the
answer…
Strategy
‣ What does our data/compute/tools usage look like?
‣ What are the common issues that a Commons might
help with?
‣ What should our Commons contain?
‣ How does a Data Commons fit with our longer term
goals?
‣ How will we measure success?
11. To get some real information on what’s going on,
what the real problems (opportunities!) are
Digital Asset Inventory
‣ Experimental Approaches
• What type of analyses are they doing, what obstacles are getting in
their way?
• What data is the input, what is the output, file formats?
• Data volumes, storage & compute requirements
‣ Data Management
• Data management plan (ha!), “Wild West”?
• Metadata, descriptors, ontologies?
‣ Search/Retrieval/Sharing
• How do they go back and find old data, what do they search on
• What do they share (if anything), with whom
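A first pass at the "data volumes" and "file formats" questions can often be automated before the interviews start. The following is a minimal sketch, not a prescribed tool: it assumes the data lives on a filesystem you can read, and the function names `inventory_by_extension` and `format_report` are hypothetical, introduced only for this illustration.

```python
import os
from collections import defaultdict
from pathlib import Path


def inventory_by_extension(root):
    """Walk `root` and return {extension: {"files": n, "bytes": total}}.

    Only the final suffix is used, so "reads.fastq.gz" is counted under ".gz".
    """
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                size = path.stat().st_size
            except OSError:
                continue  # unreadable or vanished file: skip rather than fail
            ext = path.suffix.lower() or "(none)"
            summary[ext]["files"] += 1
            summary[ext]["bytes"] += size
    return dict(summary)


def format_report(summary):
    """Return one text line per extension, largest total size first."""
    rows = sorted(summary.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    return [
        f"{ext}\t{info['files']} files\t{info['bytes']} bytes"
        for ext, info in rows
    ]
```

Running `format_report(inventory_by_extension("/path/to/lab/share"))` on a lab share (the path is a placeholder) gives a rough picture of which formats dominate storage, which is a useful conversation starter but no substitute for asking people how they actually use the data.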
12. To get some real information on what’s going on,
what the real problems (opportunities!) are
Digital Asset Inventory
‣ Informatics/Core groups
• Algorithms, software (version control!), pipelines, data types,
data volumes, software packaging & deployment
• Workflow, workflow tools, data movement
• Data sharing, Data archiving & retrieval
‣ IT staff
• Current storage, rate of data growth, data lifecycle
• Compute resources, usage; Network, data flow
‣ Leadership
• Primary goals, data management strategy, budget, risks
13. It could help address a number of challenges for a
variety of audiences
Reasons for a Commons
‣ Scientists
• Manage data, find data, ideally share data
• Democratize access to data & associated compute
‣ IT
• Manage storage, reduce duplication, ensure backups and DR
• Security - ensure the environment is appropriate for the data being used (e.g. clinical)
• Consolidate compute into fewer environments, converge towards a
common platform…
‣ Organizational Leadership
• Promote management/sharing/reuse of data, leverage existing data for new discoveries, reduce risk
15. General Commons Data Flow
Data is generated, added to a ‘commons’ environment for others to use
[Diagram labels: Lab, Core, Raw Data, Final Data, Data Commons, Publish; Lab2, Raw Data, Final Data, Reuse]
16. Potential Stakeholders
‣ Scientists
• Division/Group Head
• PI/Lab Head/Lab manager
• Lab Tech
• Postdoc, Student, etc.
• Collaborator
‣ Informatics Team Members
• Informatics team lead
• Data scientist
‣ Core Labs
• Head of Core Lab
• Core lab manager (if different from the Head of the Core)
• Scientist within the core lab
‣ Information Technology Team Members
• Person in charge of compute, HPC, VMs, Containers, Cloud, etc.
• Person in charge of storage, etc.
• Person in charge of managing backups, replication, and archiving
• Person in charge of storage capacity planning
• Person in charge of network, data movement to and from HPC, storage
• Person in charge of maintaining commons-related systems, deployment, updates, maintenance
‣ Security and Compliance Office
‣ Leadership
• Persons responsible for strategic IT decisions and purchasing
‣ Billing - to assign storage costs to specific groups/users
‣ Legal - to be able to find data to respond to formal requests for information (e.g. FOIA), institute legal holds, data retention policies
‣ Non-human users (scripts, etc.)
• Scripts written to find data, add metadata, move data, catalog usage, etc.
17. Define use cases, stories, competency questions
Nail down the details
As a: Scientist
I want: as much useful metadata associated with my data files as
possible, while doing as little extra work as possible (preferably no
extra work) to add this metadata.
So that: I can benefit from searching, reporting, organization, etc.
that comes with high quality metadata without having to take away
time and effort from research to add the metadata manually.
(the defining) User Story…
Example Competency questions
https://biocaddie.org/workgroup-3-group-links
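One way to approach the "no extra work" part of the user story is to harvest whatever metadata the file itself already provides. The sketch below is an illustration under stated assumptions, not a description of any particular commons platform: the function name `harvest_metadata` and the field names in the returned record are hypothetical, and domain metadata (sample, assay, ontology terms) would still have to come from elsewhere.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def harvest_metadata(path, chunk_size=1 << 20):
    """Collect metadata that can be derived from a file with no user input."""
    path = Path(path)
    stat = path.stat()

    # Stream the file in chunks so large data files do not need to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)

    return {
        "name": path.name,
        "extension": path.suffix.lower(),
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "sha256": digest.hexdigest(),
    }
```

A record like this could be written to a catalog whenever a file lands in the commons, so that search and reporting work on day one while richer, curated metadata is added over time.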
25. Considerations
‣ Define the Commons for you
• Address real pain points for your community
• What does success look like?
‣ It’s a complex engineering challenge
• Databases, containers, compute, network, storage, etc.
• ‘Just clone the repo’ never quite works as hoped…
‣ It’s a complex social engineering challenge
• Common metadata, formats, sharing, collaboration
• Scientists would rather share their toothbrush than…
‣ It’s (ideally) a long term commitment
• Funding, ‘evolvability’ to avoid technology lock-in
26. Metrics for these…?
Goals for a Commons
‣ Scientists
• Can manage data, find data, securely share data
• Have ready access to data & associated compute
‣ IT
• Has visibility into storage, has reduced duplication
• Has ensured backups and enabled (and tested) DR
• Is confident that the environment is appropriate for the data being used (e.g. clinical)
• Has consolidated compute into fewer environments and is converging towards a common platform…
‣ Organizational Leadership
• Can demonstrate sharing/reuse of data
• Have examples of leveraging existing data for new discoveries
• Can quantify the reduction in risk
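As one possible metric for the "reduced duplication" goal, the bytes held in redundant copies can be measured before and after the commons goes live. This is a minimal sketch assuming a readable filesystem and exact byte-for-byte duplicates; the names `file_sha256` and `duplicate_report` are hypothetical, and hashing every file may be slow on very large stores.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_report(root):
    """Group identical files under `root` and total the redundant bytes."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_hash[file_sha256(path)].append(path)
            except OSError:
                continue  # unreadable file: leave it out of the report

    groups = [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
    # Every copy beyond the first in a group counts as redundant storage.
    redundant = sum(
        os.path.getsize(paths[0]) * (len(paths) - 1) for paths in groups
    )
    return {"duplicate_groups": groups, "redundant_bytes": redundant}
```

Tracking `redundant_bytes` over time gives IT a concrete number to report, though it only captures exact copies and says nothing about near-duplicates such as re-compressed or re-aligned versions of the same data.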
Editor's Notes
Scope - your institution or company has decided that a Data Commons is needed - now what?
What problem are we solving?
General view of scientist’s data journey - many areas are OK, things are getting done, however, room for improvement in many areas, and particularly in data management and reuse
Greater Scale = Data is key, but it’s not just more data, compute or tools; it’s also more access for more people who couldn’t get at these types of resources previously
One or more of these needs to be in common and significantly ‘big’/important/painful for a commons approach to make sense
A data commons really requires a reasonable amount of Data in Common (common formats, commonly used, commonly accessed, …)
A commons needs to address a real problem and have a demonstrable impact on something important. How might you find out what these things are? You can’t guess; you have to ask people, and one way to do that is to conduct a digital asset inventory.
Product development: talk to the users and find out what their problems are, particularly as they relate to issues that a Data Commons might help with. Here are some questions for the research staff.
Product development, Talk to the users, find out what their problems are, particularly as it relates to issues that a Data Commons might help with (more storage, compute, analysis, more access to all of the above). IT is a critical partner as they need to be on board, will have much to contribute and potentially have much to benefit from this type of environment. Leadership can help articulate the bigger vision, what their main goals are, primary concerns, budget, etc.
Lots of potential benefits from a Commons environment, however, which one(s) are relevant and important to your organization/constituency?
Lots of groups/people to consider with a project of this nature
What metadata attributes are needed, what terms will be used, what QC is necessary
Note - scientists aren’t great at metadata…
This is a really good set of high quality open source platforms
Docker, DockStore, Packer, etc. are all great ways to go, either to reuse the Gen3 platform or to use to create your own environment.
Dockstore - OICR, interesting
Shifter - Containers on HPC, Docker-compatible
Singularity - Containers on HPC
Complex engineering - do you have the staff with the skills to pull this off?
Social - Build it and they probably won’t come unless there’s clearly something in it for them
Long term - it’s not just the first, sexy, 3-6m commons initiative, it’s the long term dedication to data management, data sharing