Converged IT and Data Commons
Simon Twigger, Ph.D.
Molecular Med Tri-Con
February 13th 2018
BioTeam
est, Objective, Vendor and
nology agnostic
ears bridging the gap
een IT and Science
by scientists forced to
IT to get the job done
ual company with
About Me
Strategic assessments,
Cloud (AWS/Google),
DevOps, Data Commons,
software development,
Whole genome sequencing resource identifies 18 new
candidate genes for autism spectrum disorder
Nature Neuroscience 20, 602–611 (2017)
Overview
‣ Scope
• Organization’s perspective
• Planning & implementation considerations
‣ Strategy around ‘data commons’
• How might it support the ‘bigger picture’?
‣ Implementation of a data commons
• What does it involve, what tools/tech might be useful?
Strategy for a Data Commons
Generic Scientist’s Data ‘Journey’
[Figure: the scientist’s experience (+, Neutral, -) at each stage of the data journey]
‣ Stages: Preliminary Data, Supporting Data, Run Experiments, Raw Data Management, Analyses, Archive Data, Publish, Reuse Data
‣ Annotations on the figure
• Likely external; download as needed
• Have equipment & instruments
• Not consistent; rarely any plan; data spread out; no structure unless using core facility; instrument backup uncertain
• Compute OK; have software; have tools; backup?
• Hard to find; can’t track across project; storage limits; save long term; physical data (slides)
• Rarely reused; may not be readily reusable by others
• Hard to find data; rely on original person or manual hunting
• Can find own data; may not use others’
• No real issues; may store pub’d data in one spot; submit to GEO, etc.
What is a data commons?
An integrated (converged) environment that
provides access to shared data, compute and
analytic tools at a scale (or convenience)
greater than that typically available
Data Commons
Data Compute Tools
Where are the key areas for your research?
What is in common for you?
Data Commons
Data
Compute
Tools
Do you need a commons….?
‣ Lots of need for compute, just need a cluster?
‣ Lots of data, not much in common: just need more storage, and/or data management?
‣ Significant amounts of data in common, plus compute and tools: Data Commons
What problem are we solving, is a commons the answer…
Strategy
‣ What does our data/compute/tools usage look like?
‣ What are the common issues that a Commons might
help with?
‣ What should our Commons contain?
‣ How does a Data Commons fit with our longer term
goals?
‣ How will we measure success?
Digital Asset Inventory
To get some real information on what’s going on, what the real problems (opportunities!) are
‣ Experimental Approaches
• What type of analyses are they doing, what obstacles are getting in their way?
• What data is the input, what is the output, file formats?
• Data volumes, storage & compute requirements
‣ Data Management
• Data management plan (ha!), “Wild West”?
• Metadata, descriptors, ontologies?
‣ Search/Retrieval/Sharing
• How do they go back and find old data, what do they search on?
• What do they share (if anything), with whom?
‣ Informatics/Core groups
• Algorithms, software (version control!), pipelines, data types,
data volumes, software packaging & deployment
• Workflow, workflow tools, data movement
• Data sharing, Data archiving & retrieval
‣ IT staff
• Current storage, rate of data growth, data lifecycle
• Compute resources, usage; Network, data flow
‣ Leadership
• Primary goals, data management strategy, budget, risks
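The inventory questions about file formats and data volumes can often be given a first, rough answer by scanning existing storage. The following is a minimal sketch, assuming the data sits on a filesystem that Python can walk; the function names `inventory_by_extension` and `format_report` are hypothetical and not part of any tool named in this deck.

```python
import os
from collections import defaultdict


def inventory_by_extension(root):
    """Walk `root` and return {extension: (file_count, total_bytes)}."""
    totals = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip rather than fail
            # Lower-case the extension so .FASTQ and .fastq are counted together.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            totals[ext][0] += 1
            totals[ext][1] += size
    return {ext: (count, size) for ext, (count, size) in totals.items()}


def format_report(summary):
    """Return report lines sorted by total bytes, largest first."""
    rows = sorted(summary.items(), key=lambda item: item[1][1], reverse=True)
    return [f"{ext:10s} {count:8d} files {size:14d} bytes" for ext, (count, size) in rows]
```

Running this against a lab share and printing `format_report(...)` gives a starting point for the storage and data-growth conversation with IT; it says nothing about metadata quality or who owns the files, which still has to come from interviews.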
Reasons for a Commons
It could help address a number of challenges for a variety of audiences
‣ Scientists
• Manage data, find data, ideally share data
• Democratize access to data & associated compute
‣ IT
• Manage storage, reduce duplication, ensure backups and DR
• Security - ensure the environment is appropriate for the data being used (e.g. clinical)
• Consolidate compute into fewer environments, converge towards a common platform…
‣ Organizational Leadership
• Promote management/sharing/reuse of data, leverage existing data for new discoveries, reduce risk
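"Reduce duplication" can be made concrete by checking how much duplicated content already exists. The sketch below is one possible approach, assuming files are reachable on a local or mounted filesystem: it groups files by size and then by SHA-256 digest. The helper names `file_digest` and `find_duplicates` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under `root` whose contents are identical."""
    # Group by size first: files of different sizes cannot be identical,
    # so only same-size files need to be hashed.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Hashing large sequencing files is slow, so in practice a scan like this would probably be run per project or on a sample of directories rather than across all storage at once.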
Implementation considerations
General Commons Data Flow
Data is generated, added to a ‘commons’ environment for others to use
[Figure labels: Lab, Core, Raw Data, Final Data, Data Commons, Publish, Lab2, Reuse]
Potential Stakeholders
‣ Scientists
• Division/Group Head
• PI/Lab Head/Lab manager
• Lab Tech
• Postdoc, Student, etc.
• Collaborator
‣ Informatics Team Members
• Informatics team lead
• Data scientist
‣ Core Labs
• Head of Core Lab
• Core lab manager (if different from the Head of the Core)
• Scientist within the core lab
‣ Information Technology Team Members
• Person in charge of compute, HPC, VMs, Containers, Cloud, etc.
• Person in charge of storage, etc.
• Person in charge of managing backups, replication, and archiving
• Person in charge of storage capacity planning
• Person in charge of network, data movement to and from HPC, storage
• Person in charge of maintaining commons-related systems, deployment, updates, maintenance
‣ Security and Compliance Office
‣ Leadership
• Persons responsible for strategic IT decisions and purchasing
‣ Billing - to assign storage costs to specific groups/users
‣ Legal - to be able to find data to respond to formal requests for information (e.g. FOIA), institute legal holds, data retention policies
‣ Non-human users (scripts, etc.)
• Scripts written to find data, add metadata, move data, catalog usage, etc.
Define use cases, stories, competency questions
Nail down the details
As a: Scientist
I want: as much useful metadata associated with my data files as
possible, while doing as little extra work as possible (preferably no
extra work) to add this metadata.
So that: I can benefit from searching, reporting, organization, etc.
that comes with high quality metadata without having to take away
time and effort from research to add the metadata manually.
(the defining) User Story…
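One way to approach the "no extra work" part of this story is to harvest whatever metadata the filesystem already knows and store it alongside the data. The sketch below is an illustration only: the `.meta.json` sidecar convention and the function names `harvest_metadata` and `write_sidecar` are assumptions, not features of any product mentioned in this deck.

```python
import json
import os
from datetime import datetime, timezone


def harvest_metadata(path):
    """Collect metadata that can be derived without asking the scientist."""
    info = os.stat(path)
    directory, filename = os.path.split(os.path.abspath(path))
    return {
        "filename": filename,
        "directory": directory,
        "extension": os.path.splitext(filename)[1].lower(),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }


def write_sidecar(path):
    """Write the harvested metadata next to the data file as JSON."""
    sidecar_path = path + ".meta.json"  # hypothetical naming convention
    with open(sidecar_path, "w", encoding="utf-8") as handle:
        json.dump(harvest_metadata(path), handle, indent=2, sort_keys=True)
    return sidecar_path
```

Filesystem facts alone will not answer scientific questions such as sample, assay, or organism; those fields would have to come from instrument output, directory conventions, or a LIMS, which is where most of the real effort lies.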
Example Competency questions
https://biocaddie.org/workgroup-3-group-links
NIH Data Commons Stack
Generic Commons Architecture
Data Processing for the Commons
Tools and Technologies
https://cdis.uchicago.edu/gen3
‣ Containers, workflows, containers on HPC
• https://dockstore.org/
• https://github.com/NERSC/shifter
• http://singularity.lbl.gov/
• http://geekyap.blogspot.ch/2016/11/docker-vs-singularity-vs-shifter-in-hpc.html
• http://biocontainers.pro/
‣ Data and storage management, metadata
• https://irods.org/
• http://www.arcitecta.com/ (Mediaflux)
• http://www.starfishstorage.com/
Final Thoughts
Considerations
‣ Define the Commons for you
• Address real pain points for your community
• What does success look like?
‣ It’s a complex engineering challenge
• Databases, containers, compute, network, storage, etc.
• ‘Just clone the repo’ never quite works as hoped…
‣ It’s a complex social engineering challenge…
• Common metadata, formats, sharing, collaboration
• Scientists would rather share their toothbrush than…
‣ It’s (ideally) a long term commitment
• Funding, ‘evolvability’ to avoid technology lock-in
Goals for a Commons
Metrics for these…?
‣ Scientists
• Can manage data, find data, securely share data
• Have ready access to data & associated compute
‣ IT
• Has visibility into storage, has reduced duplication
• Has ensured backups and enabled (and tested) DR
• Is confident that the environment is appropriate for the data being used (e.g. clinical)
• Has consolidated compute into fewer environments and is converging towards a common platform…
‣ Organizational Leadership
• Can demonstrate sharing/reuse of data
• Has examples of leveraging existing data for new discoveries
• Can quantify the reduction in risk

Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 

Converged IT and Data Commons

  • 1. Converged IT and Data Commons Simon Twigger, Ph.D. Molecular Med Tri-Con February 13th 2018
  • 2. BioTeam 2 est, Objective, Vendor and nology agnostic ears bridging the gap een IT and Science by scientists forced to IT to get the job done ual company with
  • 3. About Me Strategic assessments, Cloud (AWS/Google), DevOps, Data Commons, software development, Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder Nature Neuroscience 20, 602–611 (2017)
  • 4. Overview ‣ Scope • Organization’s perspective • Planning & implementation considerations ‣ Strategy around ‘data commons’ • How might it support the ‘bigger picture’? ‣ Implementation of a data commons • What does it involve, what tools/tech might be useful
  • 5. Strategy for a Data Commons
  • 6. + - Preliminary Data Supporting Data Run Experiments Raw Data Management Analyses Archive Data Publish Reuse Data Generic Scientist’s Data ‘Journey’ Neutral Experience Likely external Download as needed. Have Equipment & instruments Not consistent Rarely any plan Data spread out No structure unless using core facility Instrument backup uncertain Compute OK Have software Have tools Backup ? Hard to find Can’t track across project Storage limits Save long term Physical data - slides Rarely reused May not be readily reusable by others Hard to find data Rely on original person or manual hunting Can find own data May not use others No real issues May store pub’d data in one spot Submit to GEO, etc
  • 7. What is a data commons? An integrated (converged) environment that provides access to shared data, compute and analytic tools at a scale (or convenience) greater than that typically available Data Commons Data Compute Tools
  • 8. Where are the key areas for your research? What is in common for you? Data Commons Data Compute Tools
  • 9. Do you need a commons….? Lots of need for compute, just need a cluster? Data Commons Lots of data, not much in common Just need more storage, and/or data management? Data Commons Significant amounts of data in common, plus compute and tools
  • 10. What problem are we solving, is a commons the answer… Strategy ‣ What does our data/compute/tools usage look like? ‣ What are the common issues that a Commons might help with? ‣ What should our Commons contain? ‣ How does a Data Commons fit with our longer term goals? ‣ How will we measure success?
  • 11. To get some real information on what’s going on, what the real problems (opportunities!) are Digital Asset Inventory ‣ Experimental Approaches • What type of analyses are they doing, what obstacles are getting in their way? • What data is the input, what is the output, file formats? • Data volumes, storage & compute requirements ‣ Data Management • Data management plan (ha!), “Wild West”? • Metadata, descriptors, ontologies? ‣ Search/Retrieval/Sharing • How do they go back and find old data, what do they search on • What do they share (if anything), with whom
  • 12. To get some real information on what’s going on, what the real problems (opportunities!) are Digital Asset Inventory ‣ Informatics/Core groups • Algorithms, software (version control!), pipelines, data types, data volumes, software packaging & deployment • Workflow, workflow tools, data movement • Data sharing, Data archiving & retrieval ‣ IT staff • Current storage, rate of data growth, data lifecycle • Compute resources, usage; Network, data flow ‣ Leadership • Primary goals, data management strategy, budget, risks
  • 13. It could help address a number of challenges for a variety of audiences Reasons for a Commons ‣ Scientists • manage data, find data, ideally share data • Democratize access to data & associated compute ‣ IT • Manage storage, reduce duplication, ensure backups and DR • Security - ensure the environment is appropriate for the data being used (e.g. clinical) • Consolidate compute into fewer environments, converge towards a common platform… ‣ Organizational Leadership • Promote management/sharing/reuse of data, leverage existing data for new discoveries, reduce risk
  • 15. General Commons Data Flow: Data is generated, added to a ‘commons’ environment for others to use. [Diagram labels: Lab, Lab2, Core, Raw Data, Final Data, Data Commons, Publish, Reuse]
  • 16. Potential Stakeholders ‣ Scientists ‣ Division/Group Head ‣ PI/Lab Head/Lab manager ‣ Lab Tech ‣ Postdoc, Student, etc. ‣ Collaborator ‣ Informatics Team Members ‣ Informatics team lead ‣ Data scientist ‣ Core Labs ‣ Head of Core Lab ‣ Core lab manager (if different from the Head of the Core) ‣ Scientist within the core lab ‣ Information Technology Team Members ‣ Person in charge of compute, HPC, VMs, Containers, Cloud, etc ‣ Person in charge of storage, etc. ‣ Person in charge of managing backups, replication, and archiving ‣ Person in charge of storage capacity planning ‣ Person in charge of network, data movement to and from HPC, storage ‣ Person in charge of maintaining commons-related systems, deployment, updates, maintenance. ‣ Security and Compliance Office ‣ Leadership ‣ Persons responsible for strategic IT decisions and purchasing ‣ Billing - to assign storage costs to specific groups/users ‣ Legal - to be able to find data to respond to formal requests for information (e.g. FOIA), institute legal holds, data retention policies ‣ Non-human users (scripts, etc.) ‣ Scripts written to find data, add metadata, move data, catalog usage, etc.
  • 17. Define use cases, stories, competency questions Nail down the details As a: Scientist I want: as much useful metadata associated with my data files as possible, while doing as little extra work as possible (preferably no extra work) to add this metadata. So that: I can benefit from searching, reporting, organization, etc. that comes with high quality metadata without having to take away time and effort from research to add the metadata manually. (the defining) User Story… Example Competency questions https://biocaddie.org/workgroup-3-group-links
  • 18. NIH Data Commons Stack
  • 20. Data Processing for the Commons
  • 22. Tools and Technologies https://dockstore.org/ https://github.com/NERSC/shifter Containers, workflows, containers on HPC http://singularity.lbl.gov/ http://geekyap.blogspot.ch/2016/11/docker-vs-singularity-vs-shifter-in-hpc.html http://biocontainers.pro/
  • 25. Considerations ‣ Define the Commons for you • Address real pain points for your community • What does success look like? ‣ It’s a complex engineering challenge • Databases, containers, compute, network, storage, etc. • ‘Just clone the repo’ never quite works as hoped… ‣ It’s a complex social engineering challenge • Common metadata, formats, sharing, collaboration • Scientists would rather share their toothbrush than… ‣ It’s (ideally) a long term commitment • Funding, ‘evolvability’ to avoid technology lock-in
  • 26. Metrics for these…? Goals for a Commons ‣ Scientists • Can manage data, find data, securely share data • Have ready access to data & associated compute ‣ IT • Has visibility into storage, has reduced duplication • Have ensured backups and enabled (and tested) DR • Are confident that the environment is appropriate for the data being used (e.g. clinical) • Have consolidated compute into fewer environments and are converging towards a common platform… ‣ Organizational Leadership • Can demonstrate sharing/reuse of data • Have examples of leveraging existing data for new discoveries • Can quantify the reduction in risk
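
The Digital Asset Inventory on slides 11 and 12 asks about file formats, data volumes and storage requirements. Interviews are the main source, but a quick scan of an existing storage area can give a first quantitative picture to bring to those conversations. The following is a minimal sketch, assuming the data sits on a locally mounted file system; the function name `inventory_by_extension` and the use of the file extension as a stand-in for "file format" are illustrative assumptions, not something taken from the talk.

```python
import os
from collections import defaultdict
from pathlib import Path


def inventory_by_extension(root):
    """Return {extension: {"files": n, "bytes": total}} for every file under root."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                size = path.stat().st_size
            except OSError:
                continue  # skip files that vanish or cannot be read
            # Assumption: the lower-cased extension approximates the file format.
            ext = path.suffix.lower() or "(none)"
            summary[ext]["files"] += 1
            summary[ext]["bytes"] += size
    return dict(summary)
```

Calling `inventory_by_extension("/path/to/lab/share")` (a placeholder path) returns a dictionary that can be sorted by `bytes` to see which formats dominate storage. It says nothing about who uses the data or why, which is what the interviews are for.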

Editor’s notes

  1. Scope - your institution or company has decided that a Data Commons is needed - now what?
  2. What problem are we solving?
  3. General view of scientist’s data journey - many areas are OK, things are getting done, however, room for improvement in many areas, and particularly in data management and reuse
  4. Greater Scale = Data is key, but it’s not just more data, compute or tools, also more access for more people who couldn’t get at these types of resources previously
  5. One or more of these needs to be in common and significantly ‘big’/important/painful for a commons approach to make sense
  6. Data commons really requires a reasonable amount of Data in Common (common formats, commonly used, commonly accessed,
  7. Needs to address a real problem, have a demonstrable impact on something important. How might you find what these things are - can’t guess, you have to ask people and one way is to conduct a digital asset inventory
  8. Product development, Talk to the users, find out what their problems are, particularly as it relates to issues that a Data Commons might help with - Here’s some questions the research staff
  9. Product development, Talk to the users, find out what their problems are, particularly as it relates to issues that a Data Commons might help with (more storage, compute, analysis, more access to all of the above). IT is a critical partner as they need to be on board, will have much to contribute and potentially have much to benefit from this type of environment. Leadership can help articulate the bigger vision, what their main goals are, primary concerns, budget, etc.
  10. Lots of potential benefits from a Commons environment, however, which one(s) are relevant and important to your organization/constituency?
  11. Lots of groups/people to consider with a project of this nature
  12. What metadata attributes are needed, what terms will be used, what QC is necessary? Note - scientists aren’t great at metadata…
  13. This is a really good set of high quality open source platforms. Docker, Dockstore, Packer, etc. are all great ways to go, either to reuse the Gen3 platform or to use to create your own environment.
  14. Dockstore - OICR, interesting. Shifter - containers on HPC, Docker-compatible. Singularity - containers on HPC.
  15. This is a really good set of high quality open source platforms
  16. Complex engineering - do you have the staff with the skills to pull this off? Social - Build it and they probably won’t come unless there’s clearly something in it for them. Long term - it’s not just the first, sexy, 3-6 month commons initiative, it’s the long term dedication to data management, data sharing
  17. Lots of potential benefits from a Commons environment, however, which one(s) are relevant and important to your organization/constituency?
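
The user story on slide 17 asks for as much metadata as possible with little or no extra work from the scientist. One place to start is the metadata a file already carries, which a script (one of the "non-human users" on slide 16) can harvest without anyone filling in a form. The sketch below is illustrative only: the function name `auto_metadata` and the chosen fields are assumptions, and a real commons would add domain terms such as sample, assay and ontology identifiers on top.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def auto_metadata(path, chunk_size=1 << 20):
    """Build a metadata record from the file itself, with no manual input."""
    path = Path(path)
    stat = path.stat()
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files need not be loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "name": path.name,
        "extension": path.suffix.lower(),
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "sha256": digest.hexdigest(),
    }
```

The checksum serves two purposes: two files with the same `sha256` value are almost certainly identical copies, which supports the "reduce duplication" goal listed for IT, and it lets a later reader verify that archived data has not changed.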