1 of 40
This presentation is available under the Creative Commons
Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts
hereof.
RNA-seq for DE analysis training
Generating the count table
and validating assumptions
Joachim Jacob
22 and 24 April 2014
2 of 40
Overview
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
3 of 40
Bioinformatics analysis will take most of your time
● Quality control (QC) of raw reads
● Preprocessing: filtering of reads and read parts, to help our goal of differential detection
● QC of preprocessing
● Mapping to a reference genome (alternative: to a transcriptome)
● QC of the mapping
● Count table extraction
● QC of the count table
● DE test
● Biological insight
4 of 40
Goal
We need to summarize the read counts per
gene from a mapping result.
The outcome is a raw count table on which
we can perform some QC, to validate the
experimental setup.
This table is used by the differential
expression algorithm to detect DE genes.
5 of 40
Status
[Figure: status of the workflow so far, labelled with read numbers (20M, 25M, 15M) and percentages (~16%, ~5%, ~10%)]
6 of 40
Tools to count 'features'
● 'Features' = type of annotation on a
genome = exons in our case.
● Different tools exist to accomplish this
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting
7 of 40
The challenge in counting
'Exons' are the type of features used here.
They are summarized per 'gene'
Concept:
GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads
GeneB = exon 1 + exon 2 + exon 3 = 180 reads
No normalization yet! Just pure counts, aka 'raw counts'.
Overlaps no feature
Alt splicing
Mapping result of RNA-seq data
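The concept on this slide can be sketched in a few lines of Python. The per-exon read counts below are hypothetical and were chosen only so that they add up to the gene totals of the example (215 and 180); the function name is also just for illustration.

```python
# Hypothetical per-exon read counts; only the gene totals match the slide.
exon_counts = {
    "GeneA": {"exon1": 60, "exon2": 75, "exon3": 50, "exon4": 30},
    "GeneB": {"exon1": 90, "exon2": 40, "exon3": 50},
}


def raw_gene_counts(exon_counts):
    """Sum exon-level counts per gene; no normalization is applied."""
    return {gene: sum(exons.values()) for gene, exons in exon_counts.items()}


gene_counts = raw_gene_counts(exon_counts)
print(gene_counts)
```

The result is one raw count per gene, which is the kind of number that goes into the count table.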
8 of 40
Dealing with ambiguity
● Genes often consist of different isoforms. These
contain different exons, some shared between
them, some not. Furthermore...
● Reads that do not overlap a feature, but
appear in introns. Take into account?
● Reads that align to more than one gene?
Transcripts can overlap, perhaps on
different strands. (Paired-end (PE) reads and
strandedness information can partially resolve this.)
● Reads that partially overlap a feature, not
following known annotations.
9 of 40
The tool HTSeq-count has 3 modes
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
HTSeq-count recommends the 'union mode'. But depending on your genome,
you may opt for the 'intersection_strict mode'.
Galaxy allows experimenting!
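To get a feeling for how the three modes differ, here is a minimal Python sketch of the overlap rules as described in the HTSeq-count documentation. It is not HTSeq's own code: the function `assign_read` and its input format (one set of gene IDs per aligned base of a read) are assumptions made for illustration.

```python
def assign_read(position_features, mode="union"):
    """Assign one read to a gene, to "no_feature" or to "ambiguous".

    position_features: one set of gene IDs per aligned base of the read
    (an empty set means the base overlaps no feature).
    """
    if mode == "union":
        # All genes touched by any base of the read.
        candidates = set().union(*position_features)
    elif mode == "intersection-strict":
        # Only genes covering every base of the read.
        candidates = set.intersection(*position_features) if position_features else set()
    elif mode == "intersection-nonempty":
        # Like strict, but bases that overlap nothing are ignored.
        nonempty = [s for s in position_features if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# A read hanging over the edge of GeneA into unannotated sequence.
overhang = [{"GeneA"}, {"GeneA"}, set()]
# A read inside GeneA that partly overlaps GeneB as well.
shared = [{"GeneA"}, {"GeneA", "GeneB"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_read(overhang, mode), assign_read(shared, mode))
```

The overhanging read is counted for GeneA in 'union' mode but discarded in the strict mode, while the read in the overlap region is ambiguous in 'union' mode and assigned to GeneA in both intersection modes.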
10 of 40
Indicate the SE or PE nature of your data
(note: mate-pair is not
appropriate naming here)
The annotation file with the coordinates
of the features to be counted
mode
Check with mapping QC (see earlier)
For RNA-seq DE we summarize over
'exons' grouped by 'gene_id'. Make sure
these fields are correct in your GTF file.
Reverse stranded: check with mapping viz
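Before counting, it can help to verify that the GTF file really contains 'exon' features carrying a 'gene_id' attribute. The following Python sketch does that on a few hypothetical GTF lines; the function name `summarize_gtf` and the example records are made up, and only the standard 9-column, tab-separated GTF layout is assumed.

```python
import re
from collections import Counter


def summarize_gtf(lines, feature_type="exon", id_attribute="gene_id"):
    """Count feature types, and how many features of the wanted type carry the ID attribute."""
    types = Counter()
    with_id = 0
    pattern = re.compile(rf'\b{re.escape(id_attribute)} "([^"]+)"')
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        types[fields[2]] += 1
        if fields[2] == feature_type and pattern.search(fields[8]):
            with_id += 1
    return types, with_id


# Hypothetical records; the last exon lacks a gene_id and would not be usable.
gtf_lines = [
    '# example annotation',
    'chrI\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "GeneA"; transcript_id "GeneA.1";',
    'chrI\tdemo\tCDS\t100\t200\t.\t+\t0\tgene_id "GeneA";',
    'chrI\tdemo\texon\t300\t400\t.\t-\t.\ttranscript_id "orphan.1";',
]
types, exons_with_gene_id = summarize_gtf(gtf_lines)
print(dict(types), exons_with_gene_id)
```

If the number of exons with a gene_id is much lower than the number of exons, the feature type or ID attribute chosen in the tool form probably does not match the file.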
11 of 40
Resulting count table column
One sample !
12 of 40
Merging to create experiment count table
Tool 'Column join'
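What the column join does can be sketched in Python as follows. This is only an illustration of the idea, not the Galaxy tool itself: the function name, the sample names and the counts are hypothetical, and filling in 0 for a gene that is missing from a sample is a choice made here.

```python
def join_count_columns(per_sample):
    """Merge {sample: {gene: count}} into a header and rows keyed by gene ID.

    Genes missing from a sample get a count of 0.
    """
    samples = list(per_sample)
    genes = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    header = ["gene_id"] + samples
    rows = [[gene] + [per_sample[s].get(gene, 0) for s in samples] for gene in genes]
    return header, rows


# Hypothetical single-sample count columns.
per_sample = {
    "sampleA": {"GeneA": 215, "GeneB": 180},
    "sampleB": {"GeneA": 190, "GeneC": 12},
}
header, rows = join_count_columns(per_sample)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Each sample becomes one column and each gene one row, which is the layout of the experiment count table.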
13 of 40
Resulting count table
14 of 40
Quality control of count table
In the end, we used about 70% of the reads. Check for your experiment.
Relative numbers Absolute numbers
15 of 40
Quality control of count table
2 types of QC:
● General metrics
● Sample-specific quality control
16 of 40
QC: general metrics
● General numbers
Total number of counted reads
17 of 40
QC: general metrics
● General numbers
18 of 40
QC: general metrics
Which genes are most highly present?
Which fractions do they occupy?
42 genes (0.63%)
of the 6665 genes
take 25% of all
counts.
This graph can be
constructed from
the count table.
Gene Counts
TEF1alpha, putative ribo prot,...
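The question of which fraction the top genes occupy can be answered directly from one column of the count table. Below is a minimal Python sketch; the function name and the toy counts are hypothetical, and the last line only re-computes the percentage for the 42 of 6665 genes mentioned on the slide.

```python
def genes_needed_for_share(counts, share=0.25):
    """Return how many of the most highly counted genes are needed to reach
    the given share of all counts, and that number as a fraction of all genes."""
    total = sum(counts)
    running = 0
    for n, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= share * total:
            return n, n / len(counts)
    return len(counts), 1.0


# Toy counts for six genes; one gene dominates.
toy_counts = [500, 300, 100, 50, 30, 20]
n_genes, fraction = genes_needed_for_share(toy_counts, share=0.25)
print(n_genes, round(fraction, 3))

# 42 of 6665 genes, expressed as a percentage.
print(round(42 / 6665 * 100, 2))
```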
19 of 40
QC: general metrics
● We can plot the counts per sample: filter
out the zeros and apply a log2 transformation.
log2(count)
The bulk of the genes have counts
in the hundreds.
Few are extremely highly expressed
A minority have extremely low counts
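The filtering and transformation behind such a graph can be sketched in Python. A simple histogram stands in for the density curve here; the function name, the bin width and the toy counts are assumptions for illustration.

```python
import math
from collections import Counter


def log2_histogram(counts, bin_width=1.0):
    """Drop zero counts, log2-transform the rest and bin the values."""
    logged = [math.log2(c) for c in counts if c > 0]
    bins = Counter(math.floor(v / bin_width) * bin_width for v in logged)
    return dict(sorted(bins.items()))


# Toy counts for one sample: a zero, two very low counts,
# a bulk in the hundreds and one extremely high count.
toy_sample = [0, 1, 3, 120, 250, 400, 512, 90000]
histogram = log2_histogram(toy_sample)
for left_edge, n in histogram.items():
    print(f"log2 in [{left_edge:g}, {left_edge + 1:g}): {n}")
```

Zeros have to be removed first because log2(0) is undefined.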
20 of 40
QC: log2 density graph
● We can do this for all samples, and merge
Strange deviation here
All samples show nice overlap, peaks are similar
21 of 40
QC: log2 merging samples
Here, we take one sample and plot the log2 density graph.
We then add the counts of another sample and plot again,
add the counts of another sample, and so on, until
we have merged all samples.
A horizontal shift of the graph and a vertical shift
lead to different conclusions.
22 of 40
QC: rarefaction curve
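Code:
library(ggplot2)
# nonzero_counts: data frame with the columns 'total' (total number of
# sequenced reads) and 'counts' (number of genes with counts > 0)
ggplot(data = nonzero_counts, aes(x = total, y = counts)) +
  geom_line() +
  labs(x = "total number of sequenced reads",
       y = "number of genes with counts > 0")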
What is the total number of detected features, and how does
the feature space increase with each additional sample?
There should be saturation, but here there is none.
23 of 40
QC: rarefaction curve
Saturation: OK!
….
SampleA
SampleA+sampleB
SampleA+sampleB+sampleC
Etc.
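The points of a rarefaction curve can be computed by adding the samples one at a time, as in the sequence above. The Python sketch below assumes each sample is a dictionary of gene counts; the function name and the toy values are hypothetical.

```python
def rarefaction_points(samples):
    """samples: list of {gene: count} dicts, added one at a time.

    Returns one (total reads so far, genes with count > 0 so far) pair per step.
    """
    merged = {}
    points = []
    for counts in samples:
        for gene, count in counts.items():
            merged[gene] = merged.get(gene, 0) + count
        total = sum(merged.values())
        detected = sum(1 for count in merged.values() if count > 0)
        points.append((total, detected))
    return points


# Toy data: three samples, three genes.
sample_a = {"Gene1": 10, "Gene2": 0, "Gene3": 5}
sample_b = {"Gene1": 3, "Gene2": 2, "Gene3": 0}
sample_c = {"Gene1": 1, "Gene2": 1, "Gene3": 4}
points = rarefaction_points([sample_a, sample_b, sample_c])
print(points)
```

In this toy case the number of detected genes stops increasing after the second sample, which is what saturation looks like.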
24 of 40
Alternative to log2 transformations
● Log2 transformations suffer from bloated
variance, especially for genes with low counts.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
VST, rLog, Log2
Not normalizations!
http://www.biomedcentral.com/1471-2105/14/91
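Why low counts are the problem for log2 can be illustrated with a small calculation. The Python sketch below assumes pure Poisson sampling noise, which is a simplification of real RNA-seq data, and a pseudocount of 1; the function name is made up. It computes the standard deviation of log2(count + 1) exactly by summing over the Poisson probabilities.

```python
import math


def sd_log2_poisson(mean, pseudocount=1.0):
    """Standard deviation of log2(X + pseudocount) for X ~ Poisson(mean),
    computed by summing over the probability mass function."""
    upper = int(mean + 12 * math.sqrt(mean) + 30)  # far into the upper tail
    m1 = m2 = 0.0
    for k in range(upper + 1):
        # Poisson probability of observing k, evaluated on the log scale.
        p = math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))
        y = math.log2(k + pseudocount)
        m1 += p * y
        m2 += p * y * y
    return math.sqrt(max(m2 - m1 * m1, 0.0))


sd_low = sd_log2_poisson(5)      # a gene with about 5 reads
sd_high = sd_log2_poisson(1000)  # a gene with about 1000 reads
print(round(sd_low, 3), round(sd_high, 3))
```

On the log2 scale, the gene with few reads is far noisier than the gene with many reads, so low-count genes can dominate distance-based QC plots.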
25 of 40
QC: count transformations
● Other transformations do not have this
behavior, especially VST.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
VST, rLog, Log2
Not normalizations!
http://www.biomedcentral.com/1471-2105/14/91
26 of 40
Alternative to log2 transformations
Regularized log (rLog) and 'Variance Stabilizing Transformation'
(VST) as alternatives to log2.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
rLog VST
27 of 40
Beyond simple metrics QC
● We can also include condition
information, to interpret our QC better.
For this, we need to gather sample
information.
● Make a separate file
in which sample info
is provided (metadata)
28 of 40
QC with condition information
What do the differences in counts between
samples depend on? Here: counts
depend on the treatment and the strain.
These factors must match
the sample descriptions file.
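A quick consistency check between the sample descriptions file and the count table could look like the Python sketch below. The tab-separated layout, the column names ('sample', 'treatment', 'strain') and the sample names are assumptions for illustration.

```python
import csv
import io


def check_sample_sheet(sheet_text, count_columns):
    """Parse a tab-separated sample description file and compare its
    sample names with the columns of the count table."""
    rows = list(csv.DictReader(io.StringIO(sheet_text), delimiter="\t"))
    described = [row["sample"] for row in rows]
    missing = [s for s in count_columns if s not in described]
    unused = [s for s in described if s not in count_columns]
    return rows, missing, unused


# Hypothetical sample descriptions and count table columns.
sheet = (
    "sample\ttreatment\tstrain\n"
    "WT_ctrl\tcontrol\tWT\n"
    "WT_trt\ttreated\tWT\n"
    "mut_ctrl\tcontrol\tmutant\n"
)
count_columns = ["WT_ctrl", "WT_trt", "mut_trt"]
rows, missing, unused = check_sample_sheet(sheet, count_columns)
print("in count table but not described:", missing)
print("described but not in count table:", unused)
```

Both lists should be empty before the condition information is used in any QC plot.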
29 of 40
QC with condition info
Clustering of the distance between samples based on
transformed counts can reveal sample errors.
VST transformed rLog transformed
Colour scale of the distance measure between samples.
Similar conditions should cluster together.
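The distance measure behind such a heatmap can be sketched in Python. Euclidean distance on transformed counts is used here as one common choice; the function name and the toy values are hypothetical.

```python
import math


def sample_distances(table):
    """table: {sample: [transformed count per gene]}, same gene order for all.

    Returns the Euclidean distance for every pair of samples.
    """
    names = list(table)
    return {
        (a, b): math.dist(table[a], table[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy transformed counts for three genes in three samples.
transformed = {
    "WT_1": [8.0, 5.0, 2.0],
    "WT_2": [8.0, 5.0, 3.0],
    "mut_1": [4.0, 9.0, 2.0],
}
distances = sample_distances(transformed)
for pair, d in distances.items():
    print(pair, round(d, 2))
```

The two WT samples are much closer to each other than either is to the mutant sample, which is the pattern expected when samples of the same condition cluster together.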
30 of 40
QC with condition info
Clustering of transformed counts can reveal sample
errors.
VST transformed rLog transformed
Biological samples should cluster together
31 of 40
QC with condition info
Principal component (PC) analysis allows us to
display the samples in a 2D scatterplot based on
the variability between the samples. Samples close to
each other resemble each other more.
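How a sample gets its position on a principal component can be sketched without any plotting library. The Python code below finds the first principal component by power iteration on the centred sample-by-sample matrix; this is a teaching sketch, not what the QC tools do internally, and the function name and toy values are hypothetical.

```python
import math


def pc1_scores(table, iterations=200):
    """table: {sample: [transformed count per gene]}.

    Returns each sample's coordinate on the first principal component.
    """
    names = list(table)
    n = len(names)
    n_genes = len(table[names[0]])
    # Centre every gene across the samples.
    means = [sum(table[s][g] for s in names) / n for g in range(n_genes)]
    centred = [[table[s][g] - means[g] for g in range(n_genes)] for s in names]
    # Sample-by-sample matrix of inner products (Gram matrix).
    gram = [[sum(x * y for x, y in zip(a, b)) for b in centred] for a in centred]
    # Start from a non-constant vector: a constant vector is mapped to zero.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # all samples identical: nothing to separate
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n)) for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return {s: scale * v[i] for i, s in enumerate(names)}


# Toy transformed counts: two WT and two mutant samples, three genes.
toy_table = {
    "WT_1": [8.0, 5.0, 2.0],
    "WT_2": [8.0, 5.0, 3.0],
    "mut_1": [4.0, 9.0, 2.0],
    "mut_2": [4.0, 9.0, 3.0],
}
scores = pc1_scores(toy_table)
for sample, score in scores.items():
    print(sample, round(score, 2))
```

The sign of a principal component is arbitrary, so only the relative positions matter: the two WT samples land on one side and the two mutant samples on the other.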
32 of 40
Collect enough metadata
Principal component (PC) analysis allows us to
display the samples in a 2D scatterplot based on
the variability between the samples. Samples close to
each other resemble each other more.
Why do these lie so close together?
33 of 40
You can never collect enough
During library preparation, collect as much
information as possible, to add to the sample
descriptions. Pay particular attention to differences
between samples: e.g. day of preparation,
centrifuges used, ...
34 of 40
Collect enough metadata
In the QC of the count table, you can map this
additional info to the PC graph. In this case, library
prep on a different day had an effect on the WT
samples (batch effect).
Additional metadata
35 of 40
Collect enough metadata
In the QC of the count table, you can map this
additional info to the PC graph. In this case, library
prep on a different day had an effect on the WT
samples (batch effect).
Day 1
Day 2
36 of 40
Collect enough metadata
Days are included and give us more insight
37 of 40
Next step
Now that we know our data inside out, we
can run a DE algorithm on the count table!
38 of 40
Keywords
Raw counts
Count table
Overlapping features
Density graph
Rarefaction curve
Count transformation
VST
Sample metadata
PCA plot
Write in your own words what the terms mean
39 of 40
Exercises
● → Extracting counts and doing QC
40 of 40
Break

pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdf
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 

Part 4 of RNA-seq for DE analysis: Extracting count table and QC

  • 1. This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof. RNA-seq for DE analysis training: Generating the count table and validating assumptions. Joachim Jacob, 22 and 24 April 2014.
  • 3. Bioinformatics analysis will take most of your time. Steps: quality control (QC) of raw reads; preprocessing (filtering of reads and read parts, to help our goal of differential detection); QC of preprocessing; mapping to a reference genome (alternative: to a transcriptome); QC of the mapping; count table extraction; QC of the count table; DE test; biological insight.
  • 4. Goal: We need to summarize the read counts per gene from a mapping result. The outcome is a raw count table on which we can perform some QC to validate the experimental setup. This table is used by the differential expression algorithm to detect DE genes.
  • 6. Tools to count 'features' ● 'Features' = a type of annotation on a genome = exons in our case. ● Different tools exist to accomplish this: http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting
  • 7. The challenge in counting. 'Exons' are the type of features used here. They are summarized per 'gene'. Concept: GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads; GeneB = exon 1 + exon 2 + exon 3 = 180 reads. No normalization yet! Just pure counts, aka 'raw counts'. [Figure: mapping result of RNA-seq data, with labels 'Overlaps no feature' and 'Alt splicing']
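The per-gene summary on slide 7 can be sketched in a few lines of Python. This is only an illustration: the per-exon numbers below are invented so that the totals match the slide (215 and 180), and `summarize_per_gene` is a hypothetical helper, not part of any counting tool.

```python
from collections import defaultdict

# Hypothetical per-exon read counts as (gene, exon, reads).
# The split over exons is invented; only the gene totals come from the slide.
exon_counts = [
    ("GeneA", "exon 1", 60), ("GeneA", "exon 2", 55),
    ("GeneA", "exon 3", 70), ("GeneA", "exon 4", 30),
    ("GeneB", "exon 1", 80), ("GeneB", "exon 2", 40),
    ("GeneB", "exon 3", 60),
]


def summarize_per_gene(rows):
    """Add up the raw exon counts of each gene; no normalization is applied."""
    totals = defaultdict(int)
    for gene, _exon, reads in rows:
        totals[gene] += reads
    return dict(totals)


gene_counts = summarize_per_gene(exon_counts)
print(gene_counts)
```

The result is one raw count per gene, which is exactly what goes into one column of the count table.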
  • 8. Dealing with ambiguity ● Genes often consist of different isoforms. These contain different exons, some shared between them, some not. Furthermore... ● Reads that do not overlap a feature, but appear in introns: take them into account? ● Reads that align to more than one gene? Transcripts can overlap, perhaps on different strands (PE and strandedness can partially resolve this). ● Reads that partially overlap a feature, not following known annotations.
  • 9. The tool HTSeq-count has 3 modes: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html HTSeq-count recommends the 'union' mode. But depending on your genome, you may opt for the 'intersection_strict' mode. Galaxy allows experimenting!
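To make the difference between the modes concrete, here is a simplified sketch of the decision rule as described in the HTSeq-count documentation. It is not HTSeq code: `assign_read` is a hypothetical function, a read is reduced to a list of sets (the gene ids annotated at each aligned base), and the mode spellings and the labels "no_feature" and "ambiguous" are assumptions for this illustration.

```python
def assign_read(position_features, mode="union"):
    """Decide which gene a read counts for.

    position_features: list of sets, one per aligned base of the read,
    each holding the gene ids annotated at that base (empty set = none).
    """
    if not position_features:
        return "no_feature"
    if mode == "union":
        # Every gene touched anywhere along the read is a candidate.
        candidates = set().union(*position_features)
    elif mode == "intersection_strict":
        # Only genes covering every base of the read are candidates.
        candidates = set.intersection(*position_features)
    elif mode == "intersection_nonempty":
        # Like strict, but bases without any annotation are ignored.
        non_empty = [s for s in position_features if s]
        candidates = set.intersection(*non_empty) if non_empty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# A read that hangs over the edge of geneA (last base is unannotated).
overhang = [{"geneA"}, {"geneA"}, set()]
# A read inside geneA that also touches an overlapping geneB.
overlap = [{"geneA"}, {"geneA", "geneB"}]

for mode in ("union", "intersection_strict", "intersection_nonempty"):
    print(mode, assign_read(overhang, mode), assign_read(overlap, mode))
```

The overhanging read is counted in 'union' mode but dropped in the strict mode, which is why the choice of mode matters for genomes with incomplete annotation.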
  • 10. Tool settings (screenshot annotations): indicate the SE or PE nature of your data (note: 'mate-pair' is not appropriate naming here); the annotation file with the coordinates of the features to be counted; the mode; check with the mapping QC (see earlier). For RNA-seq DE we summarize over 'exons' grouped by 'gene_id'. Make sure these fields are correct in your GTF file. Reverse stranded: check with the mapping visualization.
  • 11. Resulting count table column: one sample!
  • 12. Merging to create the experiment count table. Tool: 'Column join'.
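Outside Galaxy, the merge step amounts to joining the per-sample columns on the gene identifier. The sketch below is not the 'Column join' tool itself; `column_join`, the sample names and the counts are hypothetical, and filling a missing gene with 0 is an assumption of this example.

```python
def column_join(per_sample):
    """Join per-sample {gene: count} columns into one table keyed on gene id."""
    samples = list(per_sample)
    genes = sorted(set().union(*(col.keys() for col in per_sample.values())))
    header = ["gene"] + samples
    # Assumption: a gene absent from a sample's column gets a count of 0.
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return header, rows


# Hypothetical single-sample count columns.
columns = {
    "sampleA": {"gene1": 215, "gene2": 180, "gene3": 0},
    "sampleB": {"gene1": 198, "gene2": 240},
}

header, rows = column_join(columns)
print("\t".join(header))
for row in rows:
    print("\t".join(str(v) for v in row))
```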
  • 13. Resulting count table
  • 14. Quality control of count table. In the end, we used about 70% of the reads. Check for your experiment. [Figure panels: relative numbers; absolute numbers]
  • 15. Quality control of count table. 2 types of QC: ● General metrics ● Sample-specific quality control
  • 16. QC: general metrics ● General numbers: total number of counted reads
  • 17. QC: general metrics ● General numbers
  • 18. QC: general metrics. Which genes are most highly present? Which fractions do they occupy? 42 genes (0.63%) of the 6665 genes take 25% of all counts. This graph can be constructed from the count table. [Figure: gene vs. counts; TEF1alpha, putative ribo prot,...]
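The statement on slide 18 can be derived from any count column by sorting genes by count and accumulating until the requested share is reached. A minimal sketch, with a hypothetical helper `genes_covering` and invented counts; the last line only rechecks the slide's percentage.

```python
def genes_covering(counts, fraction=0.25):
    """Return the most abundant genes that together reach `fraction` of all counts."""
    total = sum(counts.values())
    running = 0
    top = []
    for gene, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= fraction * total:
            break
        top.append(gene)
        running += n
    return top


# Invented counts for five genes (1000 reads in total).
toy_counts = {"g1": 500, "g2": 300, "g3": 100, "g4": 60, "g5": 40}
top = genes_covering(toy_counts, fraction=0.25)
print(top, f"{100 * len(top) / len(toy_counts):.1f}% of genes")

# The slide's numbers: 42 of 6665 genes, expressed as a percentage.
print(round(100 * 42 / 6665, 2))
```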
  • 19. QC: general metrics ● We can plot the counts per sample: filter out the zeros and apply a log2 transformation. [Figure: density of log2(count)] The bulk of the genes have counts in the hundreds. A few are extremely highly expressed. A minority have extremely low counts.
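The filtering and transformation behind the density graph can be sketched as follows. The counts are invented, and the coarse histogram (one bin per log2 unit) is only a text stand-in for the density plot on the slide.

```python
import math
from collections import Counter


def log2_nonzero(counts):
    """Drop zero counts (log2(0) is undefined) and log2-transform the rest."""
    return [math.log2(c) for c in counts if c > 0]


def histogram(values, bin_width=1.0):
    """Count values per bin; a bin is identified by its lower edge."""
    bins = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Invented raw counts for one sample.
sample_counts = [0, 1, 4, 250, 300, 512, 0, 100000]
transformed = log2_nonzero(sample_counts)
for lower_edge, n in histogram(transformed).items():
    print(f"log2(count) in [{lower_edge:g}, {lower_edge + 1:g}): {'#' * n}")
```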
  • 20. QC: log2 density graph ● We can do this for all samples, and merge. [Figure annotations: 'Strange deviation here'; 'All samples show nice overlap, peaks are similar']
  • 21. QC: log2 merging samples. Here, we take one sample, plot the log2 density graph, add the counts of another sample and plot again, add the counts of another sample, etc., until we have merged all samples. You can draw different conclusions depending on whether a horizontal or a vertical shift of the graph appears.
  • 22. QC: rarefaction curve. Code: ggplot(data = nonzero_counts, aes(total, counts)) + geom_line() + labs(x = "total number of sequenced reads", y = "number of genes with counts > 0") What is the total number of detected features, and how does the feature space increase with each additional sample added? There should be saturation, but here there is none.
  • 23. QC: rarefaction curve. Saturation: OK! [x-axis: SampleA; SampleA+sampleB; SampleA+sampleB+sampleC; etc.]
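The points of such a rarefaction curve can be computed by merging the samples one at a time and recording, after each merge, the total number of reads and the number of genes with a count above zero. This is a sketch under assumptions: `rarefaction_points` is a hypothetical helper, the counts are invented, and the total here is the number of counted reads rather than all sequenced reads.

```python
def rarefaction_points(samples):
    """Return (total reads, genes with count > 0) after each cumulative merge.

    samples: list of (name, {gene: count}) in the order they are merged.
    """
    merged = {}
    points = []
    for _name, counts in samples:
        for gene, n in counts.items():
            merged[gene] = merged.get(gene, 0) + n
        total = sum(merged.values())
        detected = sum(1 for n in merged.values() if n > 0)
        points.append((total, detected))
    return points


# Invented counts: SampleA, then SampleA+sampleB, then SampleA+sampleB+sampleC.
toy_samples = [
    ("sampleA", {"g1": 10, "g2": 0, "g3": 5}),
    ("sampleB", {"g1": 7, "g2": 3, "g3": 0}),
    ("sampleC", {"g1": 1, "g2": 1, "g3": 1}),
]
points = rarefaction_points(toy_samples)
print(points)
```

Saturation shows up as the second value no longer increasing while the first keeps growing.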
  • 24. Alternative to log2 transformations ● Log2 transformations suffer from bloated variance. http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html [Panels: VST, rLog, Log2] Not normalizations! http://www.biomedcentral.com/1471-2105/14/91
  • 25. QC: count transformations ● Other transformations do not have this behavior, especially VST. http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html [Panels: VST, rLog, Log2] Not normalizations! http://www.biomedcentral.com/1471-2105/14/91
  • 26. Alternative to log2 transformations. Regularized log (rLog) and 'Variance Stabilizing Transformation' (VST) as alternatives to log2. http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html [Panels: rLog, VST]
  • 27. Beyond simple metrics QC ● We can also include condition information, to interpret our QC better. For this, we need to gather sample information. ● Make a separate file in which sample info is provided (metadata).
  • 28. QC with condition information. What do the differences in counts between samples depend on? Here: counts depend on the treatment and the strain. [Screenshot annotation: must match the sample descriptions file.]
  • 29. QC with condition info. Clustering of the distances between samples based on transformed counts can reveal sample errors. [Panels: VST transformed; rLog transformed] The colour scale shows the distance measure between samples. Similar conditions should cluster together.
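The distances behind such a heatmap can be sketched as pairwise Euclidean distances between samples on transformed counts. Note the substitution: the slides use VST or rLog from DESeq2, whereas this illustration uses a plain log2(count + 1) to stay self-contained. The sample names and counts are invented.

```python
import math


def log2p1(counts):
    """Stand-in transformation (NOT VST or rLog): log2(count + 1)."""
    return [math.log2(c + 1) for c in counts]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(samples):
    """Pairwise Euclidean distances between samples on transformed counts."""
    transformed = {name: log2p1(counts) for name, counts in samples.items()}
    return {
        (a, b): euclidean(transformed[a], transformed[b])
        for a in transformed
        for b in transformed
    }


# Invented counts for four genes in three samples.
toy = {
    "wt_1": [100, 50, 0, 800],
    "wt_2": [110, 45, 1, 790],
    "mut_1": [10, 400, 30, 820],
}
dist = distance_matrix(toy)
for (a, b), d in dist.items():
    if a < b:
        print(f"{a} vs {b}: {d:.2f}")
```

In this toy data the two wild-type samples are much closer to each other than to the mutant; a sample that sits far from its own condition would be a candidate for a labelling or preparation error.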
  • 30. QC with condition info. Clustering of transformed counts can reveal sample errors. [Panels: VST transformed; rLog transformed] Biological samples should cluster together.
  • 31. QC with condition info. Principal component (PC) analysis allows displaying the samples in a 2D scatterplot based on the variability between the samples. Samples close to each other resemble each other more.
  • 32. Collect enough metadata. Principal component (PC) analysis allows displaying the samples in a 2D scatterplot based on the variability between the samples. Samples close to each other resemble each other more. Why do these lie so close together?
  • 33. You can never collect enough. During library preparation, collect as much information as possible to add to the sample descriptions. Pay particular attention to differences between samples: e.g. day of preparation, centrifuges used, ...
  • 34. Collect enough metadata. In the QC of the count table, you can map this additional info to the PC graph. In this case, library prep on a different day had an effect on the WT samples (batch effect). [Screenshot annotation: additional metadata]
  • 35. Collect enough metadata. In the QC of the count table, you can map this additional info to the PC graph. In this case, library prep on a different day had an effect on the WT samples (batch effect). [Legend: Day 1; Day 2]
  • 36. Collect enough metadata. Days are included and give us more insight.
  • 37. Next step. Now that we know our data inside out, we can run a DE algorithm on the count table!
  • 38. Keywords: raw counts; count table; overlapping features; density graph; rarefaction curve; count transformation; VST; sample metadata; PCA plot. Write in your own words what the terms mean.
  • 39. Exercises ● → Extracting counts and doing QC