The Cancer Genomics Cloud (CGC) Pilots NIH IC Show and Tell
1. The NCI Cancer Genomics Cloud (CGC) Pilots
NIH IC Show and Tell
Tuesday, November 8, 2016, from 10:00am – 12:00pm
Steve Tsang
Durga Addepalli
Sean Davis
2. Cancer Genomic Data Challenges
● > 2.5 PB of TCGA data (WXS, RNASeq, WGS)
● Fragmentary repositories of cancer genomic data
○ TCGA, TARGET and CGCI have their own data repositories (DCCs)
○ Sequencing data: BAM files are at CGHub, while VCF/MAF files are at the DCCs
● Assuming the 2.5 PB TCGA data set
○ Storage and Data Protection cost approximately $2,000,000 per year
○ Downloading TCGA data at 10 Gb/sec = 23 days
○ Only large institutions have the ability to utilize this data
○ These data types will continue to grow
Slide Courtesy of Tanja Davidsen, NCI
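The transfer-time and storage figures on this slide can be checked with a short back-of-envelope calculation. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 Gb = 10^9 bits) and an ideal, fully utilized link with no protocol overhead; the function and constant names are illustrative and not from the original slides.

```python
# Back-of-envelope check of the slide's figures (assumes decimal units
# and an ideal link; real transfers would be slower).
DATASET_BYTES = 2.5e15             # 2.5 PB of TCGA data, from the slide
LINK_BITS_PER_SECOND = 10e9        # 10 Gb/sec, from the slide
ANNUAL_STORAGE_COST_USD = 2_000_000  # approximate yearly cost, from the slide
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time in days for a dataset over a dedicated link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


def cost_per_tb_year(annual_cost_usd: float, size_bytes: float) -> float:
    """Yearly storage cost per terabyte (1 TB = 10^12 bytes)."""
    return annual_cost_usd / (size_bytes / 1e12)


days = transfer_days(DATASET_BYTES, LINK_BITS_PER_SECOND)
per_tb = cost_per_tb_year(ANNUAL_STORAGE_COST_USD, DATASET_BYTES)

print(f"Ideal transfer time: {days:.1f} days")
print(f"Storage cost: ${per_tb:,.0f} per TB per year")
```

Under these assumptions the transfer takes roughly 23 days, matching the slide, and the quoted storage budget works out to about $800 per TB per year.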
5. http://firecloud.org FireCloud Concepts
● Data Files reside in Google Cloud Storage
● Workspaces
● Tasks and Workflows
● Method Repositories
● Provenance captured for every analysis run (i.e. what version of what method was run on what data at what time)
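To make the provenance idea concrete, here is a minimal sketch of a record that captures the four things the slide lists: which method, which version, which data, and when. This is not FireCloud's actual data structure or API; `ProvenanceRecord`, `record_run`, and the example `gs://` path are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record of one analysis run (not FireCloud's schema)."""
    method_name: str               # what method was run
    method_version: str            # what version of that method
    input_uris: Tuple[str, ...]    # what data it was run on
    run_at: datetime               # at what time (UTC)


def record_run(method_name: str, method_version: str,
               input_uris: Iterable[str]) -> ProvenanceRecord:
    """Build an immutable provenance record stamped with the current UTC time."""
    return ProvenanceRecord(
        method_name=method_name,
        method_version=method_version,
        input_uris=tuple(input_uris),
        run_at=datetime.now(timezone.utc),
    )


# Example usage with made-up values.
rec = record_run("align-reads", "1.2.0", ["gs://example-bucket/sample1.bam"])
print(rec.method_name, rec.method_version, rec.input_uris, rec.run_at.isoformat())
```

Making the record immutable (`frozen=True`) reflects the idea that provenance is written once per run and not edited afterwards.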
6. FireCloud Overview
● The Workspace is the organizing principle for FireCloud
○ When a workspace is created, a Google bucket is automatically attached to that workspace
● The Data Model is the backbone within the workspace
○ Holds meta-data, and bucket pointers to input and output
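The relationship between a workspace, its attached bucket, and the data model can be sketched as follows. This is an illustrative model only, not FireCloud's implementation: `Workspace`, `create_workspace`, `add_entity`, `bucket_pointers`, and the `gs://example-bucket-...` naming scheme are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Workspace:
    """Hypothetical workspace: one bucket plus a data model of entities."""
    name: str
    bucket_uri: str
    # Data model: entity name -> attributes (metadata and bucket pointers).
    data_model: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add_entity(self, entity_name: str, attributes: Dict[str, str]) -> None:
        """Store a copy of the entity's attributes in the data model."""
        self.data_model[entity_name] = dict(attributes)

    def bucket_pointers(self) -> List[str]:
        """Return attribute values that point into this workspace's bucket."""
        prefix = self.bucket_uri + "/"
        return sorted(
            value
            for attrs in self.data_model.values()
            for value in attrs.values()
            if value.startswith(prefix)
        )


def create_workspace(name: str) -> Workspace:
    """Create a workspace with a bucket attached automatically (made-up URI)."""
    return Workspace(name=name, bucket_uri=f"gs://example-bucket-{name}")


ws = create_workspace("demo")
ws.add_entity("sample1", {
    "tissue": "lung",                                # plain metadata
    "bam": "gs://example-bucket-demo/sample1.bam",   # pointer into the bucket
})
print(ws.bucket_pointers())
```

The data model here stores only pointers (URIs), not file contents, which mirrors the slide's description of files residing in cloud storage while the workspace tracks them.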
7. http://cgc.systemsbiology.net/
… is to make TCGA data, together with tools and compute-power, available and accessible to a broad range of users using multiple access modes:
❏ Interactive web application
❏ Scripting languages: R, Python, SQL
❏ Direct programmatic access
8. ❏ Build an open platform that can grow and evolve to satisfy a broad range of users and use-cases
❏ Leverage the best existing tools and technologies, as they are released
❏ Collaborate with the research community in areas of data standards, containers, workflows, etc.
❏ Provide a range of examples and tutorials to get newcomers up and running quickly
9. http://www.cancergenomicscloud.org/
❖The CGC aims to provide a collaborative environment where researchers can take advantage of co-localized public data (like TCGA) and public tools; but also recombine these with their private data and tools.
❖Guiding Principles
➢ Making data available isn’t enough to make it usable.
➢ The best science happens in teams.
➢ Reproducibility shouldn’t be hard.
➢ The impact of TCGA is extended by new data & tools
Seven Bridges Genomics CGC Objectives
10. ❖Explore processed TCGA data for mutations, copy number variations and expression levels
❖Analyze data from their private cohorts alongside TCGA data.
❖Use standard bioinformatics pipelines to perform analyses.
❖Bring their own analysis tools directly to the TCGA dataset.
❖Collaborate with researchers around the world.
❖Access storage and compute resources on the cloud on demand.
❖Access the CGC using the API as
Seven Bridges Genomics CGC Features
11. Acknowledgement
Team CGC - https://goo.gl/f21Lqq
National Cancer Institute CBIIT
CGC Fact sheet - https://cbiit.nci.nih.gov/sites/nci-cbiit/files/Cloud_Pilot_Handout.pdf
Access Cloud Pilots - https://cbiit.nci.nih.gov/ncip/nci-cancer-genomics-cloud-pilots/access-the-cloud-pilot-platforms
Broad Institute - FireCloud - http://firecloud.org
Institute for Systems Biology - Cancer Genomics Cloud - http://cgc.systemsbiology.net/
Seven Bridges Genomics - Cancer Genomics Cloud - http://www.cancergenomicscloud.org/
Attain, LLC - http://www.attain.com/
Editor's notes
This is good, but I would focus on how the native Google platform has been fully exploited: BigQuery and Google Genomics in addition to Google Cloud Storage.
It would be nice to have a visual of the case explorer or something else.
Do you plan to explain why there are 3 pilots, and what was uniquely evaluated in each of the three?
Also, do you plan a concluding slide:
- on next steps from the program's perspective and how these would become part of the Commons vision, or something like that
- a call to action for those who want to use it to access cancer data, the availability of free credits, and/or mimicking it for their ICs using the platforms' open source code, which is available for others to use.