CCCB RNA-Seq DGE
Analysis Made Easy
Center for Cancer Computational Biology (SM822)
Bioinformatics Team
Homepage: https://cccb.dfci.harvard.edu/
Twitter: @CCCBseq
So why are we here...
You have generated RNA-Seq data, but...
○ uploading to the Galaxy public server for analysis takes forever
○ my bioinformaticians cannot process it today
○ sequence alignment is taking forever
○ I want to make additional differential expression contrasts
○ formatting DGE results for GSEA analysis somehow doesn’t work
○ I am the bioinformatician and don’t have the time to process all this data
(for others and for free)
○ bioinformatics core services can be expensive and take time
○ The Cancer Genomics Cloud, while powerful, requires a good understanding of
Amazon or Google Cloud systems to manage projects and payment for the
computing cost
CCCB Cloud System can help
Fast
○ Scalable infrastructure with virtually no computing resource limitation
○ Minimal queue time to get data analyzed
Secure
○ Google Cloud Platform (GCP) is covered by Google-DFCI BAA to ensure
HIPAA compliance security
Convenient
○ Simplified upload and download of large data through parallelized,
direct cloud-to-cloud transfer between Dropbox and GCP, reducing data
transfer time from hours to minutes
○ As on other cloud platforms, users are set up to pay for overhead and
computing time, but without a steep learning curve to request a project or
manage payment
RNA-Seq DGE analysis should not be difficult
Most RNA-Seq data can be aligned and quantified using the same settings for
initial DGE analysis
The technical bottleneck is often gathering enough computing power and setting up a proper
analysis environment… once the data transfer problem is solved
Workflow: Fastq files ---> Align ---> Quantify ---> DGE ---> Clustering ---> Func. Enrichment
Please use your Gmail account to log
in to https://cccb-analysis.tm4.org/
and upload the fastq data files
CCCB Cloud System- authentication
1. Use Incognito/Private Browser Session
2. Sign-in to https://cccb-analysis.tm4.org
with provided Google account
- @gmail.com address
- DFCI Gsuite email
(first_last@mail.dfci.harvard.edu)
CCCB Cloud System- analysis setup
3. Click on ‘Upload files’ on analysis homepage
- All analysis projects associated with your email
- Projects created on your behalf by CCCB
- Status messages, Click on next steps
CCCB Cloud System- analysis setup
4. choose your reference genome
5. Edit the project name to something meaningful
CCCB Cloud System- file uploads
6. Upload Files
a. Dropbox
- Preferred method
- Log in again into Dropbox
- Select files and upload
b. From local computer
- File chooser
- Drag/drop interface
- Slow transfer through https
File naming instructions
- Email notification when transfer is
complete.
CCCB Cloud System- file uploads
7. After receiving email (if using
Dropbox), refresh.
Uploaded files will be visible
CCCB Cloud System- Assign Sample Name
8. Set Sample Names
Sample names are inferred from
sequencing file names. You can create
new samples or remove existing ones.
- Drag/drop files to the proper
sample
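As a rough illustration of how sample names could be inferred from sequencing file names, here is a minimal Python sketch. The naming pattern (`<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz`) and the function name `group_fastq_by_sample` are assumptions made for this example; the platform's actual naming rules are given in its own file naming instructions, not here.

```python
import re
from collections import defaultdict

# Hypothetical naming pattern: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
# This is an assumption for illustration, not the platform's documented rule.
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.(fastq|fq)(\.gz)?$")


def group_fastq_by_sample(filenames):
    """Return ({sample: sorted fastq files}, [files that did not match])."""
    samples = defaultdict(list)
    unmatched = []
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match:
            samples[match.group("sample")].append(name)
        else:
            unmatched.append(name)
    return {s: sorted(files) for s, files in samples.items()}, unmatched


files = ["ctrl_1_R1.fastq.gz", "ctrl_1_R2.fastq.gz", "kd_1_R1.fastq.gz", "notes.txt"]
grouped, leftover = group_fastq_by_sample(files)
```

Files that do not match the pattern are returned separately, which mirrors the idea that a user can still drag them onto the proper sample by hand.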
CCCB Cloud System- Align and Quantify
RNA-Seq DGE Analysis Under the Hood
- Parallelized:
- alignment (STAR aligner) ---> BAM
Files
- Sort, primary-alignment filtering,
duplicate evaluation (Samtools,
Picard)
- Quantification (featureCounts)
- Merging:
- Overall “raw” (not normalized) count
matrix
- Differential expression testing
with DESeq2
- Plots/figures
Diagram: Master ---> Sample 1, Sample 2, … Sample N
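The merging step can be pictured as combining one count table per sample into a single raw count matrix. The sketch below is illustrative only: it assumes featureCounts-style tab-separated tables (an optional `#` comment line, a header row starting with `Geneid`, and the count in the last column), and the helper names `read_counts` and `merge_counts` are invented for this example rather than taken from the CCCB pipeline.

```python
import csv
import io


def read_counts(text):
    """Parse one featureCounts-style table into {gene_id: count}.

    Assumes an optional '#' comment line, a header row starting with
    'Geneid', and the read count in the last column of each data row.
    """
    counts = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#") or row[0] == "Geneid":
            continue
        counts[row[0]] = int(row[-1])
    return counts


def merge_counts(per_sample):
    """Combine {sample: {gene: count}} into a header and one row per gene."""
    samples = sorted(per_sample)
    genes = sorted(set().union(*(per_sample[s].keys() for s in samples)))
    header = ["gene"] + samples
    # Genes absent from a sample are filled with 0 (a choice made for this sketch).
    body = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return header, body


sample_a = (
    "# Program:featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ta.bam\n"
    "G1\tchr1\t1\t100\t+\t100\t5\n"
    "G2\tchr1\t200\t300\t-\t101\t0\n"
)
sample_b = (
    "# Program:featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tb.bam\n"
    "G1\tchr1\t1\t100\t+\t100\t7\n"
    "G3\tchr2\t10\t90\t+\t81\t2\n"
)
header, matrix = merge_counts({"A": read_counts(sample_a), "B": read_counts(sample_b)})
```

When every sample is quantified against the same annotation, the gene lists are identical and the zero-fill never triggers; it is included only so the sketch tolerates mismatched inputs.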
Alignment is a Computationally Intensive Process
Running on Local Computing
● Requires knowledge of Unix and high-performance computing
● Requires powerful computing infrastructure (i.e. 64 bit machine with 30+ GB RAM)
● Requires the ability to write scripts and programs
● Requires an understanding of how to run the alignment program
Running on Public Web Servers
● Wait time for most public web servers such as Galaxy (https://galaxyproject.org/)
and Genboree (http://genboree.org/) increases with the number of users
● Most of them use the https protocol and allow only 1 fastq file upload at a time.
● The Cancer Genomics Cloud (http://www.cancergenomicscloud.org/) requires a good
understanding of Amazon or Google Cloud systems to set up projects and payment
Typical RNASeq DGE Experimental Design
The minimum number of biological replicates required is difficult to estimate, but
a typical rule of thumb is:
● 3+ for cell lines
● 5+ for inbred lines of model organisms
● As many samples as possible for human studies
A single RNASeq experiment usually has between 6 and 20+ samples, and wait times
for upload, run-time, and download increase linearly on a public web server, with
the risk of a broken connection
CCCB Cloud Infrastructure
Users
CCCB Bioinformatics
CCCB Sequencing
Data Upload Application
“Download 50 fastq files!”
Pulls raw data from Dropbox and
pushes it into Google Storage buckets
Scaling Application
“Align N samples”
Independent nodes/images
- Each node needs a large amount of
data (e.g. index files for aligners)
- Pre-built images minimize data
transfer
- Communication about status
Pulls raw data from and pushes
processed data to Google
Storage buckets
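The “Align N samples” fan-out can be sketched as a scatter-gather pattern: each sample is handed to an independent worker and a status is collected per sample. This is only a local illustration using threads; the real system uses independent cloud nodes, and `process_sample` and `run_all` are hypothetical names with a placeholder body.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample):
    """Placeholder for the work one node would do for one sample."""
    # Hypothetical stand-in: a real worker would pull raw data, align,
    # quantify, and push processed data back to storage.
    return sample, "done"


def run_all(samples, max_workers=4):
    """Dispatch every sample independently and collect a status per sample."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_sample, samples))


status = run_all(["Sample 1", "Sample 2", "Sample N"])
```

Because the samples do not depend on each other, the wall-clock time is governed by the slowest sample rather than by the sum over all samples, provided enough workers are available.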
Task management for data download
“Transfer these 50 fastq files (>2Gb each) to
my Partners Dropbox!”
Application
Fast download for output files using
Dropbox
Save output by direct download or
Dropbox transfer:
- Authenticated: only those
logged-in as your Google user
can access files
- Direct transfer to Dropbox
storage for fast data transfer
and backup
- Email notification after transfer
is complete
- A master directory called
“cccb_transfers/” will be
created in Dropbox and
organized by projects
Straightforward differential analysis
Available
processed samples
Human-readable
contrast name
Thresholds used for
creating heatmaps and
volcano plots
Drag/drop samples
into contrast groups
Can rename groups
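To make the role of the thresholds concrete, here is a small sketch of how genes might be selected for a heatmap or highlighted in a volcano plot. It assumes DESeq2-style result columns (`log2FoldChange`, `padj`); the `gene` key, the function name, and the default cutoffs are illustrative assumptions, not the platform's actual settings.

```python
def significant_genes(results, padj_max=0.05, abs_lfc_min=1.0):
    """Return gene ids that pass both thresholds.

    `results` is a list of dicts with keys 'gene', 'log2FoldChange' and
    'padj'. A padj of None stands in for the NA values DESeq2 can report.
    The default cutoffs are placeholders chosen for this example.
    """
    keep = []
    for row in results:
        padj = row["padj"]
        if padj is None:
            continue  # no adjusted p-value available for this gene
        if padj <= padj_max and abs(row["log2FoldChange"]) >= abs_lfc_min:
            keep.append(row["gene"])
    return keep


example = [
    {"gene": "A", "log2FoldChange": 2.1, "padj": 0.001},
    {"gene": "B", "log2FoldChange": -1.5, "padj": 0.2},
    {"gene": "C", "log2FoldChange": 0.3, "padj": 0.01},
    {"gene": "D", "log2FoldChange": -3.0, "padj": None},
    {"gene": "E", "log2FoldChange": -1.2, "padj": 0.04},
]
selected = significant_genes(example)
```

Using the absolute fold change keeps both up- and down-regulated genes, which is what a volcano plot displays on its two sides.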
Standard RNA-Seq DGE Output
Custom report
Basic figures
Output files
Raw counts, normalized counts,
Differential expression results
Files for GSEA analysis
Gene Set Enrichment Analysis
Broad Institute GSEA (http://software.broadinstitute.org/gsea/)
The CCCB Cloud Platform DGE analysis results include support files (the
normalized count matrix file and groups.cls) that can be imported directly into
Broad Gene Set Enrichment Analysis (GSEA) for use with MSigDB gene sets
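For readers who need to build a class file by hand, for example for an additional contrast, the sketch below writes the three-line categorical CLS layout that GSEA reads: a count line, a `#` line naming the classes, and one label per sample. The function name `make_cls` and the sample and group names are made up for the example; it is not the code that produces groups.cls on the platform.

```python
def make_cls(sample_groups):
    """Build categorical CLS text from an ordered list of (sample, group).

    The sample order must match the column order of the expression
    matrix, and group names must not contain spaces.
    """
    class_names = []
    for _, group in sample_groups:
        if group not in class_names:
            class_names.append(group)
    labels = [group for _, group in sample_groups]
    lines = [
        f"{len(labels)} {len(class_names)} 1",  # samples, classes, always 1
        "# " + " ".join(class_names),           # class names in order of appearance
        " ".join(labels),                       # one label per sample
    ]
    return "\n".join(lines) + "\n"


cls_text = make_cls([("s1", "ctrl"), ("s2", "ctrl"), ("s3", "kd"), ("s4", "kd")])
```

A common cause of a failed import is a mismatch between the number or order of labels and the sample columns in the count matrix, so it is worth checking both files together.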
RNASeq Data Visualization
Multi-experiment viewer (WebMEV)-- http://mev.tm4.org
Import the raw count matrix from the CCCB Cloud Platform directly to run more
advanced analyses, including:
- Clustering (hierarchical, k-means, PCA, etc)
- GO enrichment, pathway enrichment analyses
Backup Slides
For more information on Pipeline Services
Pricing Structure for RNASeq DGE
DFCI/BWH: $18 per sample
External Academia: $24 per sample
Industry: Inquire
CCCB Cloud Platform Road Map
GATK v3 (Live)/ v4 (May)
- Germline Mutation Calling for DNA-Seq
Mutect2 (April):
- Somatic Mutation Calling for tumor/normal paired DNA-Seq
Small non-coding RNA (April):
- Mapping and quantification of small non-coding RNA classes (miRNA, piRNA,
tRNA, snoRNA)
Transcript Isoform (May):
- Novel transcript isoform identification and quantification
Important accounts and where to get them
DFCI G Suite Account (or just Google Account)
Google accounts linked with organization emails are preferred, even though any
Google account can be used. For the DFCI community, please request a DFCI
Google account (user@mail.dfci.harvard.edu) through the Research Computing
website: http://rc.dfci.harvard.edu/contact-research-computing
Partners Dropbox
Any Dropbox account will work with our systems. Partners Health provides virtually
unlimited encrypted storage on Dropbox Business for all Partners community
members (anyone with a partners.org email) for free. Information is available here:
https://rc.partners.org/kb/collaboration/dropbox?article=2062
Agilent CrossLab (a.k.a iLab Solutions)
Like most cores and centers around DFCI, we use iLab to track all of our projects.
A free account can be requested at https://dfci.ilab.agilent.com/account/login
Request Project through iLab
For more info: http://cccb.dfci.harvard.edu/project-request
Request iLab Account and Project
For more info: http://cccb.dfci.harvard.edu/project-request
[Screenshot labels: CCCB; Analysis Pipeline]
Moving Beyond Excel: Data Wrangling with R
This introductory course is designed for investigators looking to improve their data analysis skills and move beyond
Excel. Participants will be introduced to the R language and its basic capabilities for data processing, motivated by
practical examples with high-throughput sequencing data such as differential expression or variant analyses.
No prior experience with R (or programming in general) is necessary.
Topics include:
● Introduction to R and the command line
● The power and ease of programming for consistent, reproducible research
● Reading and writing formatted datasets
● Filtering
● Data “cleaning”
● Data merging
● (If time permits) Basic plotting

 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Cloud Native Analysis Platform for NGS analysis

  • 1. CCCB RNA-Seq DGE Analysis Made Easy Center for Cancer Computational Biology (SM822) Bioinformatics Team Homepage: https://cccb.dfci.harvard.edu/ Twitter: @CCCBseq
  • 2. So why are we here... You have RNA-Seq data generated but... ○ uploading to a Galaxy public server for analysis takes forever ○ my bioinformaticians cannot process it today ○ sequence alignment is taking forever ○ you want to make additional differential expression contrasts ○ formatting DGE results for GSEA analysis somehow doesn’t work ○ I am the bioinformatician and don’t have the time to process all this data (for others and for free) ○ bioinformatics core services can be expensive and take time ○ The Cancer Genomics Cloud, while powerful, requires a good understanding of Amazon or Google Cloud systems to manage projects and payment for the computing cost
  • 3. CCCB Cloud System can help Fast ○ Scalable infrastructure with virtually no computing resource limitation ○ Minimal queue time to get data analyzed Secure ○ Google Cloud Platform (GCP) is covered by the Google-DFCI BAA to ensure HIPAA compliance Convenient ○ Simplified upload and download of large data through parallelized, direct cloud-to-cloud transfer between Dropbox and GCP, reducing data transfer time from hours to minutes ○ As on other cloud platforms, users are set up to pay for overhead and computing time, but without a steep learning curve to request a project or manage payment
  • 4. RNA-Seq DGE analysis should not be difficult Most RNA-Seq data can be aligned and quantified using the same settings for initial DGE analysis. The technical bottleneck is often gathering enough computing power and setting up a proper analysis environment… after the data transfer problem is solved. Workflow diagram: Fastq files → Align → Quantify → DGE → Clustering → Func. Enrichment
  • 5. Please use your Gmail account to log in to https://cccb-analysis.tm4.org/ and upload the fastq data files
  • 6. CCCB Cloud System- authentication 1. Use an Incognito/Private Browser Session 2. Sign in to https://cccb-analysis.tm4.org with the provided Google account - @gmail.com address - DFCI G Suite email (first_last@mail.dfci.harvard.edu)
  • 7. CCCB Cloud System- analysis setup 3. Click on ‘Upload files’ on analysis homepage - All analysis projects associated with your email - Projects created on your behalf by CCCB - Status messages, Click on next steps
  • 8. CCCB Cloud System- analysis setup 4. Choose your reference genome 5. Edit the project name to something meaningful
  • 9. CCCB Cloud System- file uploads 6. Upload Files a. Dropbox - Preferred method - Log in to Dropbox again - Select files and upload b. From local computer - File chooser - Drag/drop interface - Slow transfer through HTTPS File naming instructions - Email notification when transfer is complete.
  • 10. CCCB Cloud System- file uploads 7. After receiving email (if using Dropbox), refresh. Uploaded files will be visible
  • 11. CCCB Cloud System- Assign Sample Name 7. Set Sample Names Sample names are inferred from sequencing file names. Can create new samples or remove existing ones. - Drag/drop files to the proper sample
  • 12. CCCB Cloud System- Align and Quantify
  • 13. RNA-Seq DGE Analysis Under the Hood - Parallelized: - Alignment (STAR aligner) → BAM files - Sort, primary-alignment filtering, duplicate evaluation (Samtools, Picard) - Quantification (featureCounts) - Merging: - Overall “raw” (not normalized) count matrix - Differential expression testing with DESeq2 - Plots/figures (Diagram labels: Master; Sample 1, Sample 2, ..., Sample N)
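The merging step on this slide, combining per-sample quantification into one "raw" count matrix, can be sketched roughly as follows. This is a minimal illustration and not the CCCB implementation: it assumes each sample was quantified separately with featureCounts, whose tab-delimited output starts with a `#` comment line, then a header row, with the count for a single BAM file in the seventh column. The sample names, gene IDs and function names are hypothetical.

```python
import csv
import io


def read_featurecounts(text):
    """Parse single-sample featureCounts output text into {gene_id: count}."""
    counts = {}
    rows = csv.reader(
        (line for line in io.StringIO(text) if not line.startswith("#")),
        delimiter="\t",
    )
    next(rows)  # skip header: Geneid, Chr, Start, End, Strand, Length, <bam>
    for row in rows:
        counts[row[0]] = int(row[6])
    return counts


def merge_counts(per_sample):
    """Merge {sample: {gene: count}} into (header, rows).

    Genes absent from a sample are given a count of 0 (an assumption;
    in practice every sample is usually quantified against the same annotation).
    """
    samples = sorted(per_sample)
    genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    header = ["gene"] + samples
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return header, rows


# Hypothetical miniature outputs for two samples.
SAMPLE_1 = (
    "# Program:featureCounts (hypothetical example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample1.bam\n"
    "GENE_A\tchr1\t100\t900\t+\t801\t12\n"
    "GENE_B\tchr2\t50\t450\t-\t401\t0\n"
)
SAMPLE_2 = (
    "# Program:featureCounts (hypothetical example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample2.bam\n"
    "GENE_A\tchr1\t100\t900\t+\t801\t7\n"
    "GENE_C\tchr3\t10\t210\t+\t201\t3\n"
)

header, rows = merge_counts(
    {
        "sample1": read_featurecounts(SAMPLE_1),
        "sample2": read_featurecounts(SAMPLE_2),
    }
)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

The merged matrix is kept unnormalized because DESeq2, which the slide names for differential expression testing, expects raw integer counts as input.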
  • 14. Alignment is a Computationally Intensive Process Running on Local Computing ● Requires knowledge of Unix and high-performance computing ● Requires powerful computing infrastructure (i.e. a 64-bit machine with 30+ GB RAM) ● Requires the ability to write scripts and programs ● Requires an understanding of the process to run the alignment program Running on Public Web Servers ● Wait time for most public web servers such as Galaxy (https://galaxyproject.org/) and Genboree (http://genboree.org/) increases with the number of users ● Most of them use the HTTPS protocol and allow only 1 fastq file upload at a time. ● The Cancer Genomics Cloud (http://www.cancergenomicscloud.org/) requires a good understanding of Amazon or Google Cloud systems to set up projects and payment
  • 15. Typical RNASeq DGE Experimental Design It is difficult to estimate the minimum number of biological replicates required, but a typical rule of thumb is: ● 3+ for cell lines ● 5+ for inbred lines of model organisms ● As many samples as possible for human A single RNASeq experiment usually has between 6 and 20+ samples, and the wait time for upload, run-time, and download increases linearly on a public web server, with the risk of a broken connection
  • 16. CCCB Cloud Infrastructure Users CCCB Bioinformatics CCCB Sequencing
  • 17. Data Upload Application “Download 50 fastq files!” Pulls raw data from Dropbox and pushes it into Google Storage buckets
  • 18. Scaling Application “Align N samples” Independent nodes/images - Each node needs a large amount of data (e.g. index files for aligners) - Pre-built images minimize data transfer - Communication about status Pulls raw data and pushes processed data to/from Google Storage buckets
  • 19. Task management for data download “Transfer these 50 fastQ files (>2Gb each) to my Partner’s Dropbox!” Application
  • 20. Fast download for output files using Dropbox Save output by direct download or Dropbox transfer: - Authenticated: only those logged-in as your Google user can access files - Direct transfer to Dropbox storage for fast data transfer and backup - Email notification after transfer is complete - A master directory called “cccb_transfers/” will be created in Dropbox and organized by projects
  • 21. Straightforward differential analysis Available processed samples Human-readable contrast name Thresholds used for creating heatmaps and volcano plots Drag/drop samples into contrast groups Can rename groups
  • 22. Standard RNA-Seq DGE Output Custom report Basic figures Output files Raw counts, normalized counts, Differential expression results Files for GSEA analysis
  • 23. Gene Set Enrichment Analysis Broad Institute GSEA (http://software.broadinstitute.org/gsea/) The normalized count matrix file and groups.cls from the CCCB Cloud Platform DGE analysis result are support files that can be imported directly into Broad Gene Set Enrichment Analysis (GSEA) on MSigDB
  • 24. RNASeq Data Visualization Multi-experiment viewer (WebMEV): http://mev.tm4.org Directly import the raw count matrix from CCCB Cloud Platform to do more advanced analyses including: - Clustering (hierarchical, k-means, PCA, etc) - GO enrichment, pathway enrichment analyses
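For readers who want to see what a file like groups.cls contains, here is a minimal sketch of writing a categorical class file in the CLS layout documented for Broad GSEA: a first line with the number of samples, the number of classes and the constant 1; a second line starting with `#` that names the classes; and a third line with one label per sample, in the same order as the columns of the expression matrix. The slides do not show the file the CCCB platform produces, so the group names and the `make_cls` helper are hypothetical.

```python
def make_cls(sample_groups):
    """Return CLS-format text for an ordered list of per-sample group labels."""
    # Unique class names in order of first appearance.
    classes = list(dict.fromkeys(sample_groups))
    for name in classes:
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"class name must be non-empty without spaces: {name!r}")
    lines = [
        f"{len(sample_groups)} {len(classes)} 1",  # samples, classes, always 1
        "# " + " ".join(classes),                  # class names
        " ".join(sample_groups),                   # one label per sample
    ]
    return "\n".join(lines) + "\n"


# Hypothetical design: three control and three treated samples.
cls_text = make_cls(["control"] * 3 + ["treated"] * 3)
print(cls_text, end="")
```

The labels must follow the sample order of the count matrix, which is why the function takes an ordered list rather than a mapping from group to samples.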
  • 26. For more information on Pipeline Services
  • 27. Pricing Structure for RNASeq DGE DFCI/BWH: $18 per sample External Academia: $24 per sample Industry: Inquire
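As a quick worked example of the per-sample rates on this slide, the sketch below estimates the cost of an experiment. It assumes a flat per-sample charge with no additional fees, which the slide does not state either way; industry pricing is left out because the slide says to inquire. The `estimate_cost` helper and the affiliation keys are hypothetical names.

```python
# Per-sample rates in USD, taken from the pricing slide.
RATES_USD = {"dfci_bwh": 18, "external_academia": 24}


def estimate_cost(n_samples, affiliation):
    """Estimated cost in USD, assuming a flat per-sample rate."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return n_samples * RATES_USD[affiliation]


# A 12-sample experiment.
print(estimate_cost(12, "dfci_bwh"))           # 12 * 18 = 216
print(estimate_cost(12, "external_academia"))  # 12 * 24 = 288
```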
  • 28. CCCB Cloud Platform Road Map GATK v3 (Live)/ v4 (May) - Germline Mutation Calling for DNA-Seq Mutect2 (April): - Somatic Mutation Calling for tumor/normal paired DNA-Seq Small non-coding RNA (April): - Mapping and quantification of small non-coding RNA classes (miRNA, piRNA, tRNA, snoRNA) Transcript Isoform (May): - Novel transcript isoform identification and quantification
  • 29. Important accounts and where to get them DFCI G Suite Account (or just Google Account) Google accounts linked with organization emails are preferred, even though any Google account can be used. Members of the DFCI community should request a DFCI Google account (user@mail.dfci.harvard.edu) through the Research Computing website: http://rc.dfci.harvard.edu/contact-research-computing Partners Dropbox Any Dropbox account will work with our systems. Partners Health provides virtually unlimited encrypted storage on Dropbox Business for all Partners community members (anyone with a partners.org email) for free. Information is available here: https://rc.partners.org/kb/collaboration/dropbox?article=2062 Agilent CrossLab (a.k.a. iLab Solutions) Like most cores and centers around DFCI, we use iLab to track all of our projects. A free account can be requested at https://dfci.ilab.agilent.com/account/login
  • 30. Request Project through iLab For more info: http://cccb.dfci.harvard.edu/project-request
  • 31. Request iLab Account and Project For more info: http://cccb.dfci.harvard.edu/project-request CCCB
  • 32. Request iLab Account and Project For more info: http://cccb.dfci.harvard.edu/project-request Analysis Pipeline
  • 33. Moving Beyond Excel: Data Wrangling with R This introductory course is designed for investigators looking to improve their data analysis skills and move beyond Excel. Participants will be introduced to the R language and its basic capabilities for data processing, motivated by practical examples with high-throughput sequencing data such as differential expression or variant analyses. No prior experience with R (or programming in general) is necessary. Topics include: ● Introduction to R and the command line ● The power and ease of programming for consistent, reproducible research ● Reading and writing formatted datasets ● Filtering ● Data “cleaning” ● Data merging ● (If time permits) Basic plotting