Cloud Native Analysis Platform for NGS analysis

CCCB RNA-Seq DGE
Analysis Made Easy
Center for Cancer Computational Biology (SM822)
Bioinformatics Team
Homepage: https://cccb.dfci.harvard.edu/
Twitter: @CCCBseq

So why are we here...
You have RNA-Seq data generated but...
○ uploading to the public Galaxy server for analysis takes forever
○ my bioinformaticians cannot process it today
○ sequence alignment is taking forever
○ I want to make additional differential expression contrasts
○ formatting DGE results for GSEA analysis somehow doesn’t work
○ I am the bioinformatician and don’t have the time to process all this data
(for others and for free)
○ bioinformatics core services can be expensive and take time
○ The Cancer Genomics Cloud, while powerful, requires a good understanding of
the Amazon or Google cloud platforms to manage projects and pay for
computing costs

CCCB Cloud System can help
Fast
○ Scalable infrastructure with virtually no computing resource limitation
○ Minimal queue time to get data analyzed
Secure
○ Google Cloud Platform (GCP) is covered by a Google-DFCI BAA to ensure
HIPAA-compliant security
Convenient
○ Simplified upload and download of large data sets through parallelized,
direct cloud-to-cloud transfer between Dropbox and GCP, reducing data
transfer time from hours to minutes
○ As on other cloud platforms, users are set up to pay for overhead and
computing time, but without a steep learning curve to request a project or
manage payment

RNA-Seq DGE analysis should not be difficult
Most RNA-Seq data can be aligned and quantified using the same settings for
initial DGE analysis
The technical bottleneck is often gathering enough computing power and setting up
a proper analysis environment… after the data transfer problem is solved
Fastq files ---> Align ---> Quantify ---> DGE ---> Clustering ---> Func. Enrichment

Please use your Gmail account to log
into https://cccb-analysis.tm4.org/
and upload the FASTQ data files

CCCB Cloud System- authentication
1. Use Incognito/Private Browser Session
2. Sign in to https://cccb-analysis.tm4.org
with the provided Google account
- @gmail.com address
- DFCI G Suite email
(first_last@mail.dfci.harvard.edu)

CCCB Cloud System- analysis setup
3. Click on ‘Upload files’ on analysis homepage
- All analysis projects associated with your email
- Projects created on your behalf by CCCB
- Status messages, Click on next steps

CCCB Cloud System- analysis setup
4. Choose your reference genome
5. Edit the project name to something meaningful

CCCB Cloud System- file uploads
6. Upload Files
a. Dropbox
- Preferred method
- Log in to Dropbox again
- Select files and upload
b. From local computer
- File chooser
- Drag/drop interface
- Slow transfer over HTTPS
File naming instructions
- Email notification when transfer is
complete.

CCCB Cloud System- file uploads
7. After receiving the email (if using
Dropbox), refresh the page.
Uploaded files will then be visible

CCCB Cloud System- Assign Sample Name
8. Set Sample Names
Sample names are inferred from
sequencing file names. You can create
new samples or remove existing ones.
- Drag/drop files to the proper
sample
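
How a sample name might be inferred from a sequencing file name can be sketched as below. This is only an illustration: the slides do not state the platform's actual rule, so the Illumina-style suffix pattern (`_S1`, `_L001`, `_R1`, `_001`) and the function names `infer_sample_name` and `group_by_sample` are assumptions.

```python
import re
from collections import defaultdict

# Assumed pattern (not from the slides): optional _S<n> and _L<nnn> fields,
# a mandatory _R1/_R2 read tag, an optional _<nnn> chunk, then .fastq/.fq(.gz).
_SUFFIX = re.compile(r"(_S\d+)?(_L\d{3})?_R[12](_\d{3})?\.(fastq|fq)(\.gz)?$")


def infer_sample_name(filename):
    """Return the part of the file name before the read suffix, or None."""
    name = filename.rsplit("/", 1)[-1]  # drop any directory prefix
    match = _SUFFIX.search(name)
    if match is None:
        return None
    return name[:match.start()]


def group_by_sample(filenames):
    """Group file names by inferred sample; unmatched files go to 'unassigned'."""
    groups = defaultdict(list)
    for filename in filenames:
        sample = infer_sample_name(filename)
        groups[sample if sample is not None else "unassigned"].append(filename)
    return dict(groups)
```

Files that land under "unassigned" correspond to the case where a file would have to be dragged to the proper sample by hand.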

CCCB Cloud System- Align and Quantify

RNA-Seq DGE Analysis Under the Hood
- Parallelized:
  - Alignment (STAR aligner) ---> BAM files
  - Sort, primary-alignment filtering, duplicate evaluation (Samtools, Picard)
  - Quantification (featureCounts)
- Merging:
  - Overall “raw” (not normalized) count matrix
  - Differential expression testing with DESeq2
  - Plots/figures
Diagram labels: Master; Sample 1, Sample 2, ... Sample N
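
The merging step can be pictured with a minimal sketch that combines per-sample gene counts into one raw count matrix. It assumes each sample's counts have already been parsed into a dictionary; the names `merge_counts` and `to_tsv` are hypothetical and this is not the platform's own code.

```python
def merge_counts(per_sample):
    """Merge {sample: {gene: count}} into (genes, samples, rows).

    rows[i][j] is the raw count of genes[i] in samples[j]; a gene missing
    from a sample is filled with 0.
    """
    samples = sorted(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    rows = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows


def to_tsv(genes, samples, rows):
    """Render the matrix as tab-separated text with a header line."""
    lines = ["gene\t" + "\t".join(samples)]
    for gene, row in zip(genes, rows):
        lines.append(gene + "\t" + "\t".join(str(count) for count in row))
    return "\n".join(lines) + "\n"
```

A matrix like this, still unnormalized, is the kind of input that differential expression testing with DESeq2 expects.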

Alignment is a Computationally Intensive Process
Running on Local Computing
● Requires knowledge of Unix and high-performance computing
● Requires powerful computing infrastructure (e.g. a 64-bit machine with 30+ GB RAM)
● Requires the ability to write scripts and programs
● Requires an understanding of how to run the alignment program
Running on Public Web Servers
● Wait time on most public web servers, such as Galaxy (https://galaxyproject.org/)
and Genboree (http://genboree.org/), increases with the number of users
● Most of them use the HTTPS protocol and allow only one FASTQ file upload at a time
● The Cancer Genomics Cloud (http://www.cancergenomicscloud.org/) requires a good
understanding of the Amazon or Google cloud platforms to set up projects and payment

Typical RNASeq DGE Experimental Design
It is difficult to estimate the minimum number of biological replicates required,
but a typical rule of thumb is:
● 3+ for cell lines
● 5+ for inbred lines of model organisms
● As many samples as possible for human studies
A single RNA-Seq experiment usually has between 6 and 20+ samples. On a public web
server, the wait time for upload, run time, and download increases linearly with the
number of samples, with the added risk of a broken connection

CCCB Cloud Infrastructure
Diagram labels: Users; CCCB Bioinformatics; CCCB Sequencing

Data Upload
Application
“Download 50 fastq files!”
Pulls raw data from Dropbox and
pushes it into Google Storage buckets

Scaling
Application: “Align N samples”
Independent nodes/images
- Each node needs a large amount of
data (e.g. index files for aligners)
- Pre-built images minimize data
transfer
- Communication about status
Pulls raw data from and pushes
processed data to Google
Storage buckets
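
The "Align N samples" fan-out can be sketched as one independent task per sample, each reporting its status back. In this illustration local threads stand in for the independent cloud nodes, and `align_sample` is a hypothetical placeholder rather than the platform's worker.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def align_sample(sample):
    """Placeholder for per-node work: pull raw data, align, push results."""
    return {"sample": sample, "status": "done"}


def run_all(samples, worker=align_sample, max_workers=4):
    """Dispatch one task per sample and collect a status string for each."""
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, sample): sample for sample in samples}
        for future in as_completed(futures):
            sample = futures[future]
            try:
                statuses[sample] = future.result()["status"]
            except Exception as exc:  # one failed sample must not stop the rest
                statuses[sample] = f"failed: {exc}"
    return statuses
```

Because the samples do not depend on one another, a failure in one task is recorded and the others still run to completion.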

Task management for data download
“Transfer these 50 FASTQ files (>2 GB each) to
my Partners Dropbox!”
Application

Fast download for output files using
Dropbox
Save output by direct download or
Dropbox transfer:
- Authenticated: only those
logged-in as your Google user
can access files
- Direct transfer to Dropbox
storage for fast data transfer
and backup
- Email notification after transfer
is complete
- A master directory called
“cccb_transfers/” will be
created in Dropbox and
organized by projects

Straightforward differential analysis
- Available processed samples
- Human-readable contrast name
- Thresholds used for creating heatmaps and volcano plots
- Drag/drop samples into contrast groups
- Can rename groups
Standard RNA-Seq DGE Output
Custom report
Basic figures
Output files
Raw counts, normalized counts,
Differential expression results
Files for GSEA analysis

Gene Set Enrichment Analysis
Broad Institute GSEA (http://software.broadinstitute.org/gsea/)
The CCCB Cloud Platform DGE analysis results include support files (the normalized
count matrix file and groups.cls) that can be imported directly into Broad Gene
Set Enrichment Analysis (GSEA) for use with MSigDB
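
For reference, a categorical GSEA class file (.cls) has three lines: the sample count, class count and a literal 1; a `#` followed by the class names; and one class label per sample in the same order as the columns of the expression matrix. The sketch below writes that layout. The function name `make_cls` is hypothetical, and the exact contents of the platform's groups.cls are not shown in the slides.

```python
def make_cls(sample_groups):
    """Build categorical .cls text from a list of per-sample group labels."""
    if not sample_groups:
        raise ValueError("at least one sample is required")
    for label in sample_groups:
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"class label must be non-empty without spaces: {label!r}")
    classes = []
    for label in sample_groups:  # class names in order of first appearance
        if label not in classes:
            classes.append(label)
    header = f"{len(sample_groups)} {len(classes)} 1"
    names = "# " + " ".join(classes)
    labels = " ".join(sample_groups)
    return "\n".join([header, names, labels]) + "\n"
```

The labels must be listed in the same sample order as the normalized count matrix, otherwise GSEA will assign samples to the wrong phenotype.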

RNASeq Data Visualization
Multi-experiment viewer (WebMEV)-- http://mev.tm4.org
Import the raw count matrix from the CCCB Cloud Platform directly to run more
advanced analyses, including:
- Clustering (hierarchical, k-means, PCA, etc.)
- GO enrichment, pathway enrichment analyses

For more information on Pipeline Services

Pricing Structure for RNASeq DGE
DFCI/BWH: $18 per sample
External Academia: $24 per sample
Industry: Inquire
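
A project budget follows directly from the per-sample rates above. The sketch below uses only the two listed rates; the function name `estimate_cost` is hypothetical, and industry pricing is left out because it is quoted on inquiry.

```python
# Per-sample rates in USD, copied from the pricing list above.
RATES_USD = {"DFCI/BWH": 18, "External Academia": 24}


def estimate_cost(n_samples, tier):
    """Return the RNA-Seq DGE cost in USD for n_samples at the given tier."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if tier not in RATES_USD:
        raise ValueError(f"no listed rate for tier {tier!r}; inquire with CCCB")
    return n_samples * RATES_USD[tier]
```

For example, 12 samples cost 12 × $18 = $216 at the DFCI/BWH rate and 12 × $24 = $288 at the external academic rate.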

CCCB Cloud Platform Road Map
GATK v3 (Live) / v4 (May):
- Germline Mutation Calling for DNA-Seq
Mutect2 (April):
- Somatic Mutation Calling for tumor/normal paired DNA-Seq
Small non-coding RNA (April):
- Mapping and quantification of small non-coding RNA classes (miRNA, piRNA,
tRNA, snoRNA)
Transcript Isoform (May):
- Novel transcript isoform identification and quantification

Important accounts and where to get them
DFCI G Suite Account (or just Google Account)
Google accounts linked to organization emails are preferred, although any
Google account can be used. Members of the DFCI community should request a DFCI
Google account (user@mail.dfci.harvard.edu) through the Research Computing
website: http://rc.dfci.harvard.edu/contact-research-computing
Partners Dropbox
Any Dropbox account will work with our system. Partners Health provides virtually
unlimited encrypted storage on Dropbox Business for all Partners community
members (anyone with a partners.org email) for free. Information is available here:
https://rc.partners.org/kb/collaboration/dropbox?article=2062
Agilent CrossLab (a.k.a. iLab Solutions)
Like most cores and centers around DFCI, we use iLab to track all of our projects.
A free account can be requested at https://dfci.ilab.agilent.com/account/login

Request Project through iLab
For more info: http://cccb.dfci.harvard.edu/project-request

Request iLab Account and Project
CCCB

Request iLab Account and Project
Analysis
Pipeline

Moving Beyond Excel: Data Wrangling with R
This introductory course is designed for investigators looking to improve their data analysis skills and move beyond
Excel. Participants will be introduced to the R language and its basic capabilities for data processing, motivated by
practical examples with high-throughput sequencing data such as differential expression or variant analyses.
No prior experience with R (or programming in general) is necessary.
Topics include:
● Introduction to R and the command line
● The power and ease of programming for consistent, reproducible research
● Reading and writing formatted datasets
● Filtering
● Data “cleaning”
● Data merging
● (If time permits) Basic plotting
