Converged IT and Data Commons

Simon Twigger, Ph.D.
Molecular Med Tri-Con
February 13th 2018

BioTeam
Honest, Objective, Vendor and Technology agnostic
… years bridging the gap between IT and Science
… by scientists forced to … IT to get the job done
Virtual company with …

About Me
Strategic assessments,
Cloud (AWS/Google),
DevOps, Data Commons,
software development,
Whole genome sequencing resource identifies 18 new
candidate genes for autism spectrum disorder
Nature Neuroscience 20, 602–611 (2017)

Overview
‣ Scope
• Organization’s perspective
• Planning & implementation considerations
‣ Strategy around ‘data commons’
• How might it support the ‘bigger picture’?
‣ Implementation of a data commons
• What does it involve, what tools/tech might be useful

Generic Scientist’s Data ‘Journey’
[Figure: experience (+ / Neutral / -) plotted for each stage of the journey]
‣ Stages
• Preliminary Data, Supporting Data, Run Experiments, Raw Data Management, Analyses, Archive Data, Publish, Reuse Data
‣ Annotations on the figure
• Likely external; download as needed
• Have equipment & instruments
• Not consistent; rarely any plan; data spread out; no structure unless using core facility; instrument backup uncertain
• Compute OK; have software; have tools; backup?
• Hard to find; can't track across project; storage limits; save long term; physical data (slides)
• Rarely reused; may not be readily reusable by others
• Hard to find data; rely on original person or manual hunting; can find own data; may not use others
• No real issues; may store published data in one spot; submit to GEO, etc.

What is a data commons?
An integrated (converged) environment that
provides access to shared data, compute and
analytic tools at a scale (or convenience)
greater than that typically available
[Diagram: Data Commons combining Data, Compute and Tools]

Where are the key areas for your research?
What is in common for you?
[Diagram: Data Commons combining Data, Compute and Tools]

Do you need a commons….?
‣ Lots of need for compute: just need a cluster?
‣ Lots of data, not much in common: just need more storage and/or data management?
‣ Significant amounts of data in common, plus compute and tools: Data Commons

What problem are we solving? Is a commons the answer?
Strategy
‣ What does our data/compute/tools usage look like?
‣ What are the common issues that a Commons might
help with?
‣ What should our Commons contain?
‣ How does a Data Commons fit with our longer term
goals?
‣ How will we measure success?

To get some real information on what’s going on and
what the real problems (opportunities!) are
Digital Asset Inventory
‣ Experimental Approaches
• What type of analyses are they doing, what obstacles are getting in
their way?
• What data is the input, what is the output, file formats?
• Data volumes, storage & compute requirements
‣ Data Management
• Data management plan (ha!), “Wild West”?
• Metadata, descriptors, ontologies?
‣ Search/Retrieval/Sharing
• How do they go back and find old data, and what do they search on?
• What do they share (if anything), and with whom?
‣ Informatics/Core groups
• Algorithms, software (version control!), pipelines, data types,
data volumes, software packaging & deployment
• Workflow, workflow tools, data movement
• Data sharing, Data archiving & retrieval
‣ IT staff
• Current storage, rate of data growth, data lifecycle
• Compute resources, usage; Network, data flow
‣ Leadership
• Primary goals, data management strategy, budget, risks
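A first pass at the data-volume and file-format questions in the inventory can be scripted. The sketch below is only an illustration and not part of the original talk: it assumes the data sits on a mounted file system, and the function name `summarize_by_extension` is hypothetical. It walks a directory tree and totals file counts and bytes per file extension.

```python
import os
from collections import defaultdict


def summarize_by_extension(root):
    """Walk `root` and return {extension: (file_count, total_bytes)}.

    Hypothetical helper for a digital asset inventory; extensions are
    lower-cased and files without one are grouped under "(none)".
    """
    totals = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            ext = os.path.splitext(name)[1].lower() or "(none)"
            totals[ext][0] += 1
            totals[ext][1] += size
    return {ext: (count, size) for ext, (count, size) in totals.items()}
```

Sorting the returned dictionary by total bytes gives a quick view of which formats dominate storage, which is a reasonable starting point for the conversations with scientists, core groups and IT.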

It could help address a number of challenges for a
variety of audiences
Reasons for a Commons
‣ Scientists
• Manage data, find data, ideally share data
• Democratize access to data & associated compute
‣ IT
• Manage storage, reduce duplication, ensure backups and DR
• Security - ensure the environment is appropriate for the data being
used (e.g. clinical)
• Consolidate compute into fewer environments, converge towards a
common platform…
‣ Organizational Leadership
• Promote management/sharing/reuse of data, leverage existing data
for new discoveries, reduce risk

Implementation considerations

Data is generated, added to a ‘commons’
environment for others to use
General Commons Data Flow
[Diagram: Lab and Core generate Raw Data and Final Data, which feed the Data Commons; outputs are Publish and Reuse by a second lab (Lab2)]

Potential Stakeholders
‣ Scientists
• Division/Group Head
• PI/Lab Head/Lab manager
• Lab Tech
• Postdoc, Student, etc.
• Collaborator
‣ Informatics Team Members
• Informatics team lead
• Data scientist
‣ Core Labs
• Head of Core Lab
• Core lab manager (if different from the Head of the Core)
• Scientist within the core lab
‣ Information Technology Team Members
• Person in charge of compute, HPC, VMs, Containers, Cloud, etc.
• Person in charge of storage, etc.
• Person in charge of managing backups, replication, and archiving
• Person in charge of storage capacity planning
• Person in charge of network, data movement to and from HPC, storage
• Person in charge of maintaining commons-related systems, deployment, updates, maintenance
‣ Security and Compliance Office
‣ Leadership
• Persons responsible for strategic IT decisions and purchasing
‣ Billing - to assign storage costs to specific groups/users
‣ Legal - to be able to find data to respond to formal requests for information (e.g. FOIA), institute legal holds, data retention policies
‣ Non-human users (scripts, etc.)
• Scripts written to find data, add metadata, move data, catalog usage, etc.

Define use cases, stories, competency questions
Nail down the details
As a: Scientist
I want: as much useful metadata associated with my data files as
possible, while doing as little extra work as possible (preferably no
extra work) to add this metadata.
So that: I can benefit from searching, reporting, organization, etc.
that comes with high quality metadata without having to take away
time and effort from research to add the metadata manually.
(the defining) User Story…
Example Competency questions
https://biocaddie.org/workgroup-3-group-links
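One way to approach the user story above (useful metadata with no extra work for the scientist) is to harvest what the file system already knows about each file. The following is a minimal sketch under that assumption, not a method described in the talk; the function name `describe_file` and the field names in the record are hypothetical.

```python
import hashlib
import os
from datetime import datetime, timezone


def describe_file(path, chunk_size=1 << 20):
    """Build a metadata record from information available without user input.

    Hypothetical helper: the record layout is an illustration, not a standard.
    """
    stat = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "path": os.path.abspath(path),
        "name": os.path.basename(path),
        "extension": os.path.splitext(path)[1].lower(),
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
        "sha256": digest.hexdigest(),
    }
```

Records like this cover only technical metadata. Scientific descriptors (sample, assay, ontology terms) would still need to come from instruments, LIMS exports or the scientists themselves.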

Generic Commons Architecture

Data Processing for the Commons

Tools and Technologies
https://cdis.uchicago.edu/gen3

Containers, workflows, containers on HPC
https://dockstore.org/
https://github.com/NERSC/shifter
http://singularity.lbl.gov/
http://geekyap.blogspot.ch/2016/11/docker-vs-singularity-vs-shifter-in-hpc.html
http://biocontainers.pro/

Data and storage management, metadata
https://irods.org/
http://www.arcitecta.com/ (Mediaflux)
http://www.starfishstorage.com/

Considerations
‣ Define the Commons for you
• Address real pain points for your community
• What does success look like?
‣ It's a complex engineering challenge
• Databases, containers, compute, network, storage, etc.
• ‘Just clone the repo’ never quite works as hoped…
‣ It's a complex social engineering challenge…
• Common metadata, formats, sharing, collaboration
• Scientists would rather share their toothbrush than…
‣ It's (ideally) a long term commitment
• Funding, ‘evolvability’ to avoid technology lock-in

Metrics for these…?
Goals for a Commons
‣ Scientists
• Can manage data, find data, securely share data
• Have ready access to data & associated compute
‣ IT
• Has visibility into storage, has reduced duplication
• Has ensured backups and enabled (and tested) DR
• Is confident that the environment is appropriate for the data being used (e.g. clinical)
• Has consolidated compute into fewer environments and is converging towards a
common platform…
‣ Organizational Leadership
• Can demonstrate sharing/reuse of data
• Has examples of leveraging existing data for new discoveries
• Can quantify the reduction in risk
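As one possible metric for the "has reduced duplication" goal, the bytes held in redundant copies can be measured before and after a commons is introduced. The sketch below is an assumption about how this could be done, not something the talk prescribes; `find_duplicates` is a hypothetical name. Files are grouped by size first so that only potential duplicates are hashed.

```python
import hashlib
import os
from collections import defaultdict


def _sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return (groups, wasted_bytes) for byte-identical files under `root`.

    `groups` is a list of path lists; `wasted_bytes` counts every copy
    beyond the first in each group. Empty files and symlinks are ignored.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_hash[(size, _sha256(path))].append(path)

    groups = [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
    wasted = sum(
        size * (len(paths) - 1)
        for (size, _digest), paths in by_hash.items()
        if len(paths) > 1
    )
    return groups, wasted
```

Tracking `wasted_bytes` over time gives a number to report against the duplication goal, although deliberate copies such as backups would need to be excluded from the scan.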
