Bonazzi commons bd2 k ahm 2016 v2

The Data Commons
An introduction & Overview
BD2K AHM, November 29, 2016
Vivien Bonazzi (ADDS)

Outline
• What’s driving the need for a Data Commons?
• Development of the Data Commons at NIH
• Current Data Commons Pilots
• Next steps
• Considerations & Concluding Thoughts

What’s driving the need for a
Data Commons?

Convergence of factors
Mountains of Data
Increasing need and support for Data sharing
Availability of digital technologies and
infrastructures that support Data at scale

https://gds.nih.gov/
Went into effect January 25, 2015
NCI guidance:
http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data
Requires public sharing of genomic data sets

Recommendation #4: A national cancer data ecosystem for sharing and analysis.
Create a National Cancer Data Ecosystem to collect, share, and interconnect a broad
array of large datasets so that researchers, clinicians, and patients will be able to both
contribute and analyze data, facilitating discovery that will ultimately improve patient
care and outcomes.

Challenges with Biomedical Data
The Journal Article is the end goal
Data is a means to an end (low value)
Data is not FAIR
Findable, Accessible, Interoperable, Reusable
Limited e-infrastructures to support FAIR data

What’s
Changing?
Digital
ecosystems

Development of the
NIH Data Commons

• How do we find data, software, standards?
• How can we make (large) data, annotations, software, metadata accessible?
• How do we reuse data, tools and standards?
• How do we make more data machine readable?
• How do we leverage existing digital technologies, systems, and infrastructures?
• How do we collaborate?
• How do we enable a digital ecosystem?
Changing the conversation around
Data sharing and access
NIH Data Commons

Data Commons
enabling data driven science
Enable investigators to leverage all possible data and tools
in the effort to accelerate biomedical discoveries, therapies
and cures
by
driving the development of data infrastructure and data
science capabilities through collaborative research and
robust engineering
Matthew Trunnell, FHC

Developing a Data Commons
• Treats products of research – data, methods, papers etc. – as digital objects
• These digital objects exist in a shared virtual space
  • Find, Deposit, Manage, Share, and Reuse data, software, metadata and workflows
• Digital object compliance through FAIR principles:
  • Findable
  • Accessible (and usable)
  • Interoperable
  • Reusable
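
As a rough illustration of what treating a research product as a digital object might look like, the sketch below models one object as a small metadata record and checks it against a deliberately simple reading of the four FAIR principles. The field names (identifier, access_url, media_type, license) and the one-field-per-principle checks are assumptions made for this example; the slides do not define a Commons compliance schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DigitalObject:
    """Hypothetical metadata record for one digital object (data, software, workflow)."""
    identifier: Optional[str] = None   # persistent unique ID  -> Findable (assumed check)
    access_url: Optional[str] = None   # retrieval location    -> Accessible (assumed check)
    media_type: Optional[str] = None   # standard format       -> Interoperable (assumed check)
    license: Optional[str] = None      # terms of reuse        -> Reusable (assumed check)


def fair_report(obj: DigitalObject) -> Dict[str, bool]:
    """Return a per-principle pass/fail map under the illustrative rules above."""
    return {
        "findable": bool(obj.identifier),
        "accessible": bool(obj.access_url),
        "interoperable": bool(obj.media_type),
        "reusable": bool(obj.license),
    }


def is_compliant(obj: DigitalObject) -> bool:
    """An object passes only if every principle's check passes."""
    return all(fair_report(obj).values())
```

A real compliance check would need far richer criteria (for example, the FAIRness metrics discussed by the working groups); the point here is only that compliance can be evaluated per object and per principle.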

The Data Commons
is a framework
that supports
FAIR data access and sharing
and
fosters the development
of a digital ecosystem
https://datascience.nih.gov/commons

The Data Commons Framework
[Diagram: framework layers]
• Compute Platform: Cloud
• Services: APIs, Containers, Indexing
• Software: Services & Tools (scientific analysis tools/workflows)
• Data: “Reference” Data Sets, User defined data
• Digital Object Compliance
• App store/User Interface
• Service models: PaaS, SaaS, IaaS
https://datascience.nih.gov/commons

Mapping BD2K Activities and Commons Pilots to the Commons Framework
[Diagram: activities placed on the framework]
• NIH + Community defined data sets
• BD2K Centers, MODS, HMP & Interoperability Supplements
• Cloud credits model (CCM)
• BioCADDIE/Other Indexing
• NCI & NIAID Cloud Pilots + GDC
• Compute Platform: Cloud or HPC
• Data
• User defined data

Current Data Commons Pilots
Explore feasibility of the Commons Framework
Facilitate collaboration and interoperability
Making large and/or high impact NIH funded data sets and tools
accessible in the cloud
Developing Data and Software indexing methods
Leveraging BD2K Efforts: bioCADDIE and others.
Collaborating with external groups
Provide access to cloud (IaaS) and PaaS/SaaS via credits
Connecting credits to the grants system

Reference Data Sets Pilot
Large, High-Impact Datasets in the Cloud
Vivien Bonazzi

[Diagram: Mapping to the Commons Framework. Large, High-Impact Data Sets in the Cloud; Data; User defined data]

Overview:
Large, High-Impact Datasets in the Cloud - Populating the Commons
• Make large, high impact, NIH funded data sets available in the cloud/commons
• Co-locate large datasets and compute power to improve access, use, re-use, and sharing of data and tools
• Kick-start the Commons with Commons-compliant data and tools
• Data must adhere to Commons compliance/FAIR principles
• Provide indexable test data sets for bioCADDIE (and other indexing efforts)

What will we learn:
Large, High-Impact Datasets in the Cloud - Populating the Commons
This pilot project will inform NIH on:
• Which Clouds are most functional, practical, and cost effective?
• What is involved in moving data resources to the Cloud?
• What will it cost?
• How to manage challenges associated with both open access and controlled access data?
• How do we find data and resources across clouds?
• How do we compute across clouds?

Proposed Components:
Large, High-Impact Datasets in the Cloud
• Biomedical data resources and tools
  • Support to migrate large, high-impact datasets and associated tools into multiple cloud providers
  • Data and tool sets must be FAIR
• Cloud Infrastructure
  • Support for cloud storage and architectural engineering to support data and tools
• Coordination
  • Facilitate activities across the biomedical data resources and cloud providers
  • Development of market place/app store approaches
  • Auth: Authorization & Access controls
  • Tracking metrics (cost, usage etc.) and impact of the overall project

Reference Data Sets – Next Steps
• NIH Data Task Force
  • Chaired by Francis Collins
  • Involves many NIH ICs
  • Developing some shorter term preliminary pilots for larger NIH funded data sets in the cloud
  • Expect to see some announcements in Jan/Feb 2017
• RFI – engage in dialogue with the community
  • Planned Winter 2017
• FOAs – Supporting large high impact data sets in the cloud
  • Spring 2017

Commons Framework Pilots
Exploring feasibility of the Commons Framework : Software and Services layer
Valentina Di Francesco

Commons Framework Pilots (CFPs)
• Exploring feasibility of the Commons Framework
• Facilitating connectivity, interoperability and access to digital objects
• Providing digital research objects to populate the Commons

PI (Parent grant’s IC): Project description
TOGA (NIBIB)
  • Cloud-hosted data publication system
  • Allows the automatic creation and publication of data a personalized data repository
MUSEN (NIAID)
  • Smart APIs – improved handling for metadata within APIs
  • Ontological support for metadata within an API
  • Improving smart API discoverability: a registry of APIs
HAN (NIGMS)
  • Docker container hub for BD2K community
  • Docker containers for genomic analysis applications and pipelines
  • Benchmark, Evaluation & best practices
COOPER/KOHANE (NHGRI)
  • Cloud based authenticated API access and exchange of causal modeling data, tools + genomic and phenomic data (PIC)
  • Docker containers for CCD tools available in AWS
HAUSSLER (NHGRI)
  • Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations
  • (GA4GH) API: being able to query this data and metadata
Ohno-Machado (NHLBI)
  • Development of an ecosystem for repeatable science
  • Easy reuse of data AND software; tracking of provenance.
  • Use of container technologies for software and data reuse.
White (NHGRI)
  • The entire HMP1 data set made accessible on AWS
  • Analysis tools for microbiome data in AWS
Ma’ayan (NHLBI)
  • A Cloud-Based Microscopy Imaging Commons Portal with microscopy data and metadata
Sternberg (NHGRI)
  • Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites.
  • An API to provide programmatic access to the relevant papers in PMC
MODs PIs (NHGRI)
  • Development of a common data model for the MODs
  • Development of APIs accessing data across the MODs

• APIs
• Containerization:
• Docker containers, guidelines, registry store
• Workbenches, Connectors
• Indexing
• Market Place/App Store

Mapping the Commons Framework Pilots to the Commons Framework
[Diagram: pilots placed on the framework: White (HMP), Musen, Ma’ayan, Cooper, Han, Haussler, MODs, Sternberg, Ohno-Machado, Toga; Data; User defined data]

Commons Framework Pilots : Updates
Sept. 2015 – First set of CFPs awarded
Nov. 2015 – CFPs participated in the AHM and the Commons breakout session
Feb. 2016 – Established Commons Framework Working Group (CFWG)
• CFWG members: Pilots’ PIs and/or technical leads; a few PIs of the BD2K interoperability projects
• Meeting in person on March 1, 2016
Commons Framework Pilots : Updates
March 2016 – CFPs meeting in person
• To develop an initial plan for the implementation of the Commons Framework
• Meeting presentations here
• A manuscript describing the outcomes of the meeting was submitted
• Established the Commons Framework Working Group (CFWG) and sub-WGs on the following topics:
  • FAIRness Metrics (Neil McKenna & Michel Dumontier)
  • Data-object registry (Lucila Ohno-Machado, Michel Dumontier, Wei Wang)
  • Interoperability of APIs (Michel Dumontier)
  • Workflow sharing and docker registry (Umberto Ravaioli & Brian O’Connor)
  • Commons Framework Publications (Owen White)
Nov 28, 2016 – Held a CFWG meeting in person
These groups will present a report of their activities at the Commons Session tomorrow at 10:30am

Commons Framework WG - Next Steps
GET INVOLVED: See Valentina Di Francesco or WG leads for details
• A broad announcement to the BD2K research community went out in late summer – we are seeking more participants
• Contribute to the implementation of the Commons Framework
• Suggest other scientific areas of interest that need coordination
• Generate guidelines that all of our peers will use as we begin to jumpstart the NIH Commons
• Participate in meetings of the CFWG and hear the latest news

Commons Framework – Next Steps
• FOA: Support investigator-initiated projects to further develop the Data Commons Framework
  • Could leverage and expand upon resources developed with the Reference data sets
  • Planned Fall 2017
• FOA: Making existing data and tools Commons Compliant/FAIR
  • Competitive Supplements to existing NIH Awards.
  • Provide support to existing projects to make current digital resources FAIR & Commons Compliant
  • Digital resources could include: data, analytical software, or workflows
  • Planned Fall 2017

Resource Search & Indexing
Discoverability of data and software
Ian Fore, Ron Margolis, Alison Yao, Claire Schulkey, Dawei Lin

[Diagram: Commons Indexing; Data; User defined data]

An Indexing Ecosystem for the Commons:
a virtual environment for ‘FIND’
• Enable biomedical research by providing scientists with the ability to FIND digital resources
• Establish a mature resource discovery tool(s) that can be sustained as long as the need for it exists
• Focus on characteristics of the tool as infrastructure
• Maintains a defined level of service
• Contribute to a Commons that is reliable, available, easy to use, and adaptable
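
To make the ‘FIND’ idea concrete, here is a minimal sketch of a keyword index over metadata records for digital objects. It is not how bioCADDIE or any other indexing effort works; the record fields (id, description) and the function names build_index and find are hypothetical, chosen only to show the basic idea of looking objects up by their metadata regardless of where they are stored.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(records: List[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each lowercase word of a record's description to the IDs of records containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for word in record["description"].lower().split():
            index[word].add(record["id"])
    return index


def find(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return IDs of records whose descriptions contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)
```

For example, after indexing two records described as "Human microbiome sequence data" and "Breast cancer variant data", the query "microbiome data" would match only the first, while "data" would match both.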

Current Activities
• Identify indexing activities in and outside NIH
  • BD2K: bioCADDIE, Centers of Excellence
  • ICs: NLM, NCI, NHGRI, other
  • Non-BD2K: Elixir (EBI), Publishers (Elsevier), Repositories, schema.org
• Compare ongoing activities and identify needs
  • Benchmarking
  • Identify gaps in strategy
  • Dimensions to consider: Content, Metadata, Platform/Technology
• Coordinate with other BD2K PMWGs
  • Standards
  • Specific Center WGs

Cloud Credits Model
George Komatsoulis

[Diagram: Commons Cloud Credits Pilot; Data; User defined data]

[Diagram: Commons credits workflow, steps 1-8]
Actors: Investigator; Investigator Institution [OPTIONAL]; NIH; CMS FFRDC; The Commons; Cloud Providers A, B, C
Actions shown: Requests Credits; Approves Credit Request; CMS FFRDC Review & Selection; Delivers Funding Recommendation; Review & Approval; Directs reseller to distribute credits; Distributes; Uses credits

How do credits work from the point of view of an investigator?
• Investigators receive credits worth a certain amount (in dollars) that can be used at the conformant provider(s) of their choice
• Credits are pre-purchased and applied to the account of the investigator with the relevant provider(s)
• As the investigator uses services with a conformant provider, the provider debits the value of the investigator’s usage against the pre-loaded credits
• INVESTIGATORS ARE NOT BILLED BY PROVIDERS AS LONG AS THEY DO NOT EXCEED THEIR CREDIT ALLOCATION.
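
The debit mechanism described above can be sketched as a simple ledger. This is an illustration only: the class name CreditAccount, the use of integer cents, and the treatment of usage beyond the allocation (returned as an amount not covered by credits) are assumptions; the slides say only that investigators are not billed while they stay within their allocation.

```python
class CreditAccount:
    """Hypothetical ledger for one investigator's credits at one conformant provider."""

    def __init__(self, allocation_cents: int) -> None:
        if allocation_cents < 0:
            raise ValueError("allocation must be non-negative")
        # Credits are pre-purchased, so the full allocation is available up front.
        self.balance_cents = allocation_cents

    def debit(self, usage_cents: int) -> int:
        """Debit usage against the pre-loaded credits.

        Returns the portion of the usage NOT covered by credits
        (0 while the investigator stays within the allocation).
        """
        if usage_cents < 0:
            raise ValueError("usage must be non-negative")
        covered = min(usage_cents, self.balance_cents)
        self.balance_cents -= covered
        return usage_cents - covered
```

With a 100-dollar allocation, a 25-dollar charge leaves 75 dollars of credit and nothing uncovered; a later 80-dollar charge would exhaust the credits and leave 5 dollars uncovered, which is the situation the last bullet warns investigators to avoid.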

Commons Credits Model Pilot
• 3-year pilot to test this business model to facilitate researcher use of cloud resources (enhance data sharing and potentially reduce costs).
• Contract with the CMS Alliance to Modernize Healthcare (CAMH) Federally Funded Research and Development Center (FFRDC) managed by the MITRE Corporation
  • FFRDCs are special purpose, government-owned but contractor-managed entities that meet R&D needs that can’t be well managed by traditional grants and contracts
  • Examples: National Labs and organizations like RAND
• Pilot will not directly interact with the existing grant system.
  • Instead is modeled on the mechanisms being used to gain access to NSF and DOE national resources (HPC, light sources, etc.)
• The only required qualification for applying for credits will be that the investigator must have an existing NIH grant

Commons Credits Model Pilot
• Current List of Approved Vendors
  • DLT = Amazon Web Services Reseller
  • IBM
  • Onix = Google Reseller
  • Broad and ISB NCI Cloud Pilots accessible via Google
  • Two more approved but negotiating participation agreement
• First batch of credits issued Sep 29, 2016
  • 8 Investigators (cohort 1) that are part of an ‘alpha test’
  • Only IBM/AWS at the time
  • 93% AWS, 7% IBM
  • First credits have been used, usage information coming
• First “production” credit request period opening this month

Considerations and
Concluding Thoughts

Considerations
• Communication
• Metrics – Understanding and accounting of data usage patterns
• Cost
  • Cloud Storage
  • Pay for use cloud compute (NIH credits pilot)
  • Indirect costs for cloud
• Hybrid Clouds – Institution (private) and commercial (public) clouds
• Managing Open vs Controlled access data
  • Auth: single sign on - dreams/nightmares?
• Archive vs Working Copies of data
• Interoperability with other Commons (clouds)

• Standards – Metadata, UIDs, APIs
• Discoverability – Finding digital objects across clouds
• Interfaces – For users with different needs and capabilities
• Consent – Reconsenting data, Dynamic consents?
• Policies
  • Data sharing policies that are useful and effective
  • Keep pace with use of technology (e.g. dbGAP data in the Cloud)
• Incentives
  • Access to, and shareability of FAIR Data as part of NIH grant review criteria
• Governance – Community involvement in governance models
• Sustainability – Long term support

Summary
• We need an unprecedented level of convergence and collaboration to drive biomedical science to the next level.
• Supporting this model of data-intensive collaborative science requires a shift in academic research culture and new investments in data infrastructure and capabilities.
Matthew Trunnell, FHC

Acknowledgments
• ADDS Office: Jennie Larkin, Phil Bourne, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)
• NCBI: George Komatsoulis
• NHGRI: Valentina Di Francesco
• NIGMS: Susan Gregurick
• CIT: Andrea Norris, Debbie Sinmao
• NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
• NCI Cloud Pilots/GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
• Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)
• RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)
• OSP: Dina Paltoo, Kris Langlais, Erin Luetkemeier, Agnes Rooke
• Research and Industry: Matthew Trunnell (FHC), Bob Grossman (Chicago), Toby Bloom (NYGC)

Acknowledgements- CFPs
NIH CFPs WG
• Valentina Di Francesco
• Sam Moore
• Vivien Bonazzi
• Allen Dearry
• Maria Giovanni
• Susan Gregurick
• Weiniu Gan
• James Luo
• Stacia Friedman-Hill
• Ajay Pillai
• Leslie Derr
• Debbie Sinmao
• Eric Choi
• Claire Schulkey
• George Komatsoulis
CFWG
• Owen White
• Neil McKenna
• Michel Dumontier
• Umberto Ravaioli
• Brian O’Connor
• Lucila Ohno-Machado
• Wei Wang
• All the other members

Acknowledgements - Credits Model
• ADDS Office
• Vivien Bonazzi
• Phil Bourne
• Jennie Larkin
• Mark Guyer
• MITRE
• Ari Abrams-Kudan
• Wenling (Eileen) Chang
• Peter Gutgarts
• Lynette Hirschman
• William Kim
• Eldred Rubeiro
• Bruce Shirk
• David Tanenbaum
• Lisa Tutterow
• Grant Thornton
• Katie Beringer
• Mike Clifford
• Tamara Reynolds
• NIH
• Tanja Davidsen (NCI)
• Valentina Di Francesco (NHGRI)
• Susan Gregurick (NIGMS)
• David Lipman (NCBI)
• Vivek Navale (CIT)
• Jim Ostell (NCBI)
• Debbie Sinmao (CIT)
• Nick Weber (NIAID)
• NITRD
• Peter Lyster

Stay in
Touch
QR Business Card
LinkedIn
@Vivien.Bonazzi
Slideshare
Blog
(Coming soon!)
