Cloud Native Analysis Platform optimized for user-friendly transfer of large data sets from Dropbox to cloud infrastructure for data processing and analysis. It is particularly tailored for easy transfer of Next Generation Sequencing (NGS) fastq files for rapid exome, RNASeq, small RNASeq, and amplicon analysis.
Cloud Native Analysis Platform for NGS analysis
1. CCCB RNA-Seq DGE
Analysis Made Easy
Center for Cancer Computational Biology (SM822)
Bioinformatics Team
Homepage: https://cccb.dfci.harvard.edu/
Twitter: @CCCBseq
2. So why are we here...
You have RNA-Seq data generated but...
○ uploading to the Galaxy public server for analysis takes forever
○ my bioinformaticians cannot process it today
○ sequence alignment is taking forever
○ you want to make additional differential expression contrasts
○ formatting DGE results for GSEA analysis somehow doesn’t work
○ I am the bioinformatician and don’t have the time to process all this data
(for others and for free)
○ bioinformatics core services can be expensive and take time
○ The Cancer Genomics Cloud, while powerful, requires a good understanding of
Amazon or Google Cloud systems to manage projects and payment for the
computing cost
3. CCCB Cloud System can help
Fast
○ Scalable infrastructure with virtually no computing resource limitation
○ Minimal queue time to get data analyzed
Secure
○ Google Cloud Platform (GCP) is covered by the Google-DFCI BAA to ensure
HIPAA-compliant security
Convenient
○ Simplified upload and download of large data through parallelized
direct cloud-to-cloud transfer between Dropbox and GCP, reducing data
transfer time from hours to minutes
○ As on other cloud platforms, users pay for overhead and
computing time, but without a steep learning curve to request a project or
manage payment
4. RNA-Seq DGE analysis should not be difficult
Most RNA-Seq data can be aligned and quantified using the same settings for
initial DGE analysis
The technical bottleneck is often gathering enough computing power and setting up a proper
analysis environment… after the data transfer problem is solved
Fastq files → Align → Quantify → DGE → Clustering → Func. Enrichment
5. Please use your Gmail account to log
into https://cccb-analysis.tm4.org/
and upload the fastq data files
6. CCCB Cloud System- authentication
1. Use Incognito/Private Browser Session
2. Sign-in to https://cccb-analysis.tm4.org
with provided Google account
- @gmail.com address
- DFCI Gsuite email
(first_last@mail.dfci.harvard.edu)
7. CCCB Cloud System- analysis setup
3. Click on ‘Upload files’ on analysis homepage
- All analysis projects associated with your email
- Projects created on your behalf by CCCB
- Status messages, Click on next steps
8. CCCB Cloud System- analysis setup
4. Choose your reference genome
5. Edit the project name to something meaningful
9. CCCB Cloud System- file uploads
6. Upload Files
a. Dropbox
- Preferred method
- Log in again into Dropbox
- Select files and upload
b. From local computer
- File chooser
- Drag/drop interface
- Slow transfer through https
File naming instructions
- Email notification when transfer is
complete.
10. CCCB Cloud System- file uploads
7. After receiving the email (if using
Dropbox), refresh the page.
Uploaded files will be visible
11. CCCB Cloud System- Assign Sample Name
8. Set Sample Names
Sample names are inferred from
sequencing file names. You can create
new samples or remove existing ones.
- Drag/drop files to the proper
sample
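The slides do not say how sample names are inferred from file names. The sketch below shows one plausible approach, assuming Illumina-style names such as `KO_1_S3_L001_R1_001.fastq.gz`. The pattern and the function names `infer_sample_name` and `group_by_sample` are hypothetical illustrations, not the platform's actual rule.

```python
import re
from collections import defaultdict

# Assumed Illumina-style suffix: optional _S<n> and _L<lane>, then _R1/_R2,
# optional _001, then .fastq or .fq with optional .gz.
# The platform's real inference rule is not documented in these slides.
_SUFFIX = re.compile(r"(_S\d+)?(_L\d{3})?_R[12](_\d{3})?\.(fastq|fq)(\.gz)?$")


def infer_sample_name(filename):
    """Return the sample name implied by a fastq file name, or None."""
    match = _SUFFIX.search(filename)
    if match is None:
        return None
    return filename[:match.start()]


def group_by_sample(filenames):
    """Map each inferred sample name to its sorted list of fastq files."""
    groups = defaultdict(list)
    for name in filenames:
        sample = infer_sample_name(name)
        if sample is not None:
            groups[sample].append(name)
    return {sample: sorted(files) for sample, files in groups.items()}
```

Files that do not match the assumed pattern are left out of the grouping, which corresponds to the cases where the drag/drop step is needed to assign a file by hand.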
14. Alignment is a Computationally Intensive Process
Running on Local Computing
● Requires knowledge of Unix and high-performance computing
● Requires powerful computing infrastructure (e.g. a 64-bit machine with 30+ GB RAM)
● Requires the ability to write scripts and programs
● Requires an understanding of how to run the alignment program
Running on Public Web Servers
● Wait time for most public web servers such as Galaxy (https://galaxyproject.org/)
and Genboree (http://genboree.org/) increases with the number of users
● Most of them use the https protocol and allow only 1 fastq file upload at a time.
● The Cancer Genomics Cloud (http://www.cancergenomicscloud.org/) requires a good
understanding of Amazon or Google Cloud systems to set up projects and payment
15. Typical RNASeq DGE Experimental Design
It is difficult to estimate the minimum number of biological replicates required, but
a typical rule of thumb is:
● 3+ for cell lines
● 5+ for inbred lines of model organisms
● As many samples as possible for human
A single RNASeq experiment usually has between 6 and 20+ samples, and the wait time
for upload, run-time, and download increases linearly on a public web server, with
the risk of a broken connection
18. Scaling
Application: “Align N samples”
Independent nodes/images
- Each node needs a large amount of
data (e.g. index files for aligners)
- Pre-built images minimize data
transfer
- Communication about status
Pulls raw data from and pushes
processed data to Google
Storage buckets
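The fan-out pattern on this slide can be sketched as follows. This is a minimal illustration, not the CCCB implementation: local threads stand in for independent cloud nodes, and `align_sample`, `align_n_samples`, and the `gs://example-...` bucket names are hypothetical. No real download, aligner, or upload is invoked.

```python
from concurrent.futures import ThreadPoolExecutor


def align_sample(sample, fastq_uris, output_bucket):
    """Stand-in for one worker node: pull raw data, align, push results.

    Hypothetical: a real node would boot from a pre-built image that already
    holds the aligner index, download the fastq files from a storage bucket,
    run the aligner, and upload the result. Here only the output location is
    computed, so the sketch runs without cloud credentials.
    """
    return {
        "sample": sample,
        "inputs": list(fastq_uris),
        "output": f"{output_bucket}/{sample}.bam",
        "status": "done",
    }


def align_n_samples(samples, output_bucket, max_workers=8):
    """Run one independent job per sample and collect the status reports."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            sample: pool.submit(align_sample, sample, uris, output_bucket)
            for sample, uris in samples.items()
        }
        # Each job reports back independently; result() waits for its status.
        return {sample: future.result() for sample, future in futures.items()}
```

Because the jobs share nothing except the input and output buckets, adding samples adds workers rather than queue time, which is the point of the design shown on the slide.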
19. Task management for data download
“Transfer these 50 fastq files (>2 GB each) to
my Partner’s Dropbox!”
Application
20. Fast download for output files using
Dropbox
Save output by direct download or
Dropbox transfer:
- Authenticated: only those
logged-in as your Google user
can access files
- Direct transfer to Dropbox
storage for fast data transfer
and backup
- Email notification after transfer
is complete
- A master directory called
“cccb_transfers/” will be
created in Dropbox and
organized by projects
22. Standard RNA-Seq DGE Output
Custom report
Basic figures
Output files
Raw counts, normalized counts,
Differential expression results
Files for GSEA analysis
23. Gene Set Enrichment Analysis
Broad Institute GSEA (http://software.broadinstitute.org/gsea/)
The CCCB Cloud Platform DGE analysis results include support files (the normalized
count matrix file and groups.cls) that can be imported directly into Broad Gene
Set Enrichment Analysis (GSEA) for use with MSigDB gene sets
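For reference, a categorical `.cls` file as read by GSEA has three lines: the sample count, class count, and a literal `1`; then the class names after a `#`; then one class label per sample. The sketch below builds such text from a list of group labels. The function name `format_cls` is hypothetical, and the platform's own `groups.cls` may be produced differently.

```python
def format_cls(sample_groups):
    """Build the text of a categorical .cls file from per-sample group labels.

    sample_groups holds one label per sample, in the same order as the
    sample columns of the expression matrix.
    """
    # Unique class names in order of first appearance.
    classes = list(dict.fromkeys(sample_groups))
    # Labels are whitespace-delimited in the file, so each must be one token.
    if any(len(label.split()) != 1 for label in classes):
        raise ValueError("group labels must be non-empty and contain no whitespace")
    header = f"{len(sample_groups)} {len(classes)} 1"
    names = "# " + " ".join(classes)
    labels = " ".join(sample_groups)
    return "\n".join([header, names, labels]) + "\n"
```

For example, two control and two treated samples give the lines `4 2 1`, `# control treated`, and `control control treated treated`.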
24. RNASeq Data Visualization
Multi-experiment viewer (WebMEV)-- http://mev.tm4.org
Import the raw count matrix from the CCCB Cloud Platform directly to run more
advanced analyses, including:
- Clustering (hierarchical, k-means, PCA, etc.)
- GO enrichment, pathway enrichment analyses
27. Pricing Structure for RNASeq DGE
DFCI/BWH: $18 per sample
External Academia: $24 per sample
Industry: Inquire
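As a rough illustration of the per-sample rates above, a project's cost can be estimated as below. This is only a sketch: the function name `rnaseq_dge_cost` is hypothetical, and it assumes the listed rate is the only charge, which the slide does not state.

```python
# Per-sample rates in US dollars, taken from the slide.
# Industry pricing is by inquiry, so it has no entry here.
RATE_PER_SAMPLE_USD = {
    "DFCI/BWH": 18,
    "External Academia": 24,
}


def rnaseq_dge_cost(n_samples, affiliation):
    """Estimate the RNASeq DGE cost, assuming the per-sample rate is the only fee."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if affiliation not in RATE_PER_SAMPLE_USD:
        raise ValueError(f"no listed rate for {affiliation!r}; please inquire")
    return n_samples * RATE_PER_SAMPLE_USD[affiliation]
```

Under these assumptions, a 12-sample experiment would come to $216 for DFCI/BWH users and $288 for external academic users.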
28. CCCB Cloud Platform Road Map
GATK v3 (Live)/ v4 (May)
- Germline Mutation Calling for DNA-Seq
Mutect2 (April):
- Somatic Mutation Calling for tumor/normal paired DNA-Seq
Small non-coding RNA (April):
- Mapping and quantification of small non-coding RNA classes (miRNA, piRNA,
tRNA, snoRNA)
Transcript Isoform (May):
- Novel transcript isoform identification and quantification
29. Important accounts and where to get them
DFCI G Suite Account (or just Google Account)
Google accounts linked with organization emails are preferred, even though any
Google account can be used. For the DFCI community, please request a DFCI
Google account (user@mail.dfci.harvard.edu) through the Research Computing
website: http://rc.dfci.harvard.edu/contact-research-computing
Partners Dropbox
Any Dropbox account will work with our system. Partners Health provides virtually
unlimited encrypted storage on Dropbox Business for all Partners community
members (anyone with partners.org email) for free. Information is available here:
https://rc.partners.org/kb/collaboration/dropbox?article=2062
Agilent CrossLab (a.k.a. iLab Solutions)
Like most cores and centers around DFCI, we use iLab to track all of our projects.
A free account can be requested at https://dfci.ilab.agilent.com/account/login
31. Request iLab Account and Project
For more info: http://cccb.dfci.harvard.edu/project-request
CCCB
32. Request iLab Account and Project
For more info: http://cccb.dfci.harvard.edu/project-request
Analysis
Pipeline
33. Moving Beyond Excel: Data Wrangling with R
This introductory course is designed for investigators looking to improve their data analysis skills and move beyond
Excel. Participants will be introduced to the R language and its basic capabilities for data processing, motivated by
practical examples with high-throughput sequencing data such as differential expression or variant analyses.
No prior experience with R (or programming in general) is necessary.
Topics include:
● Introduction to R and the command line
● The power and ease of programming for consistent, reproducible research
● Reading and writing formatted datasets
● Filtering
● Data “cleaning”
● Data merging
● (If time permits) Basic plotting