Part 4 of RNA-seq for DE analysis: Extracting count table and QC

This presentation is available under the Creative Commons
Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts
hereof.
RNA-seq for DE analysis training
Generating the count table
and validating assumptions
Joachim Jacob
22 and 24 April 2014

2 of 40
Overview
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html

3 of 40
Bioinformatics analysis will take most of your time
● Quality control (QC) of raw reads
● Preprocessing: filtering of reads and read parts, to help our goal of differential detection
● QC of preprocessing
● Mapping to a reference genome (alternative: to a transcriptome)
● QC of the mapping
● Count table extraction
● QC of the count table
● DE test
● Biological insight

4 of 40
Goal
We need to summarize the read counts per
gene from a mapping result.
The outcome is a raw count table on which
we can perform some QC, to validate the
experimental setup.
This table is used by the differential
expression algorithm to detect DE genes.

5 of 40
Status
[Figure: workflow status, labelled with read numbers 20M, 25M and 15M and percentages ~16%, ~5% and ~10%]

6 of 40
Tools to count 'features'
● 'Features' = type of annotation on a
genome = exons in our case.
● Different tools exist to accomplish this
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting

7 of 40
The challenge in counting
'Exons' are the type of features used here.
They are summarized per 'gene'
Concept:
GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads
GeneB = exon 1 + exon 2 + exon 3 = 180 reads
No normalization yet! Just pure counts, aka 'raw counts'.
[Figure: mapping result of RNA-seq data, showing reads that overlap no feature and reads from alternative splicing]
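The concept above can be sketched in a few lines of Python. Only the gene totals (215 and 180) come from the slide. The split of reads over the individual exons is invented for illustration, and the function name `summarize_per_gene` is hypothetical.

```python
from collections import defaultdict

# Hypothetical per-exon raw counts. Only the gene totals (215 and 180)
# come from the slide; the split over exons is invented.
exon_counts = [
    ("GeneA", "exon1", 60),
    ("GeneA", "exon2", 75),
    ("GeneA", "exon3", 50),
    ("GeneA", "exon4", 30),
    ("GeneB", "exon1", 90),
    ("GeneB", "exon2", 40),
    ("GeneB", "exon3", 50),
]


def summarize_per_gene(rows):
    """Sum raw exon counts per gene; no normalization is applied."""
    totals = defaultdict(int)
    for gene, _exon, count in rows:
        totals[gene] += count
    return dict(totals)


gene_counts = summarize_per_gene(exon_counts)
print(gene_counts)
```

The result is one raw count per gene, which is the form the DE algorithm expects.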

8 of 40
Dealing with ambiguity
● Genes often consist of different isoforms. These
contain different exons, some shared between
them, some not. Furthermore...
● Reads that do not overlap a feature, but
appear in introns: should they be taken into account?
● Reads that align to more than one gene?
Transcripts can overlap, perhaps on
different strands. (Paired-end (PE) reads and
strand information can partially resolve this.)
● Reads that partially overlap a feature, not
following known annotations.

9 of 40
The tool HTSeq-count has 3 modes
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
HTSeq-count recommends the 'union' mode. But depending on your genome,
you may opt for the 'intersection_strict' mode.
Galaxy allows experimenting!
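The difference between the modes can be sketched as set logic, following the description in the HTSeq-count documentation linked above. This is a simplified illustration, not HTSeq's own code: a read is represented as a list of sets, one set per aligned position, each holding the features that overlap that position. The function name `assign_read` and the mode strings are assumptions made for this sketch (the third mode is called 'intersection_nonempty' here).

```python
def assign_read(position_sets, mode="union"):
    """Return the feature a read is counted for, 'no_feature' or 'ambiguous'."""
    sets = [set(s) for s in position_sets]
    if mode == "union":
        # Every feature touched anywhere along the read.
        features = set().union(*sets)
    elif mode == "intersection_strict":
        # Only features that cover every position of the read.
        features = set.intersection(*sets) if sets else set()
    elif mode == "intersection_nonempty":
        # As above, but positions without any feature are ignored.
        nonempty = [s for s in sets if s]
        features = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not features:
        return "no_feature"
    if len(features) > 1:
        return "ambiguous"
    return next(iter(features))


# A read that partly hangs over the edge of geneA.
overhanging = [{"geneA"}, set()]
# A read that lies in geneA and partly also in an overlapping geneB.
overlapping = [{"geneA"}, {"geneA", "geneB"}]

for mode in ("union", "intersection_strict", "intersection_nonempty"):
    print(mode, assign_read(overhanging, mode), assign_read(overlapping, mode))
```

The two example reads show why the choice matters: the overhanging read is counted in 'union' mode but lost in the strict mode, while the overlapping read is ambiguous in 'union' mode but assigned to geneA in the other two.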

10 of 40
Indicate the SE or PE nature of your data
(note: mate-pair is not
appropriate naming here)
The annotation file with the coordinates
of the features to be counted
mode
Check with mapping QC (see earlier)
For RNA-seq DE we summarize over
'exons' grouped by 'gene_id'. Make sure
these fields are correct in your GTF file.
Reverse stranded: check with mapping visualization
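One way to make sure these fields are correct is to scan the GTF file before counting. The sketch below assumes the standard nine-column, tab-separated GTF layout with attributes written as `gene_id "..."`. The function name `check_gtf` and the example lines are hypothetical.

```python
import re
from collections import Counter

# Attribute pattern as written in standard GTF files: gene_id "some_id";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def check_gtf(lines, feature_type="exon"):
    """Count features of the given type per gene_id and those lacking one."""
    per_gene = Counter()
    missing = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue  # column 3 holds the feature type
        match = GENE_ID.search(fields[8])  # column 9 holds the attributes
        if match:
            per_gene[match.group(1)] += 1
        else:
            missing += 1
    return per_gene, missing


# Hypothetical annotation lines, for illustration only.
example_gtf = [
    'chrI\texample\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chrI\texample\texon\t300\t400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chrI\texample\tCDS\t120\t200\t.\t+\t0\tgene_id "geneA"; transcript_id "geneA.1";',
    'chrI\texample\texon\t900\t950\t.\t-\t.\ttranscript_id "geneB.1";',
]

exons_per_gene, exons_without_id = check_gtf(example_gtf)
print(dict(exons_per_gene), exons_without_id)
```

If no exons are found at all, or many exons lack a `gene_id`, the counting tool is likely to return an empty or incomplete count table.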

11 of 40
Resulting count table column
One sample!

12 of 40
Merging to create experiment count table
Tool 'Column join'
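Outside Galaxy, the same merge can be sketched in Python. This is an illustration of the idea, not the 'Column join' tool itself: the function name `join_count_columns`, the sample names and the counts are hypothetical, and filling a gene that is missing from a sample with 0 is an assumption.

```python
def join_count_columns(per_sample):
    """Merge per-sample {gene: count} columns into one table keyed on gene ID."""
    samples = list(per_sample)
    genes = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    table = [["gene_id"] + samples]
    for gene in genes:
        # Assumption: a gene absent from a sample is given a count of 0.
        table.append([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return table


# Hypothetical single-sample count columns.
columns = {
    "sample1": {"GeneA": 215, "GeneB": 180},
    "sample2": {"GeneA": 198, "GeneB": 240, "GeneC": 3},
}

count_table = join_count_columns(columns)
for row in count_table:
    print("\t".join(str(cell) for cell in row))
```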

13 of 40
Resulting count table

14 of 40
Quality control of count table
In the end, we used about 70% of the reads. Check for your experiment.
[Figures: relative numbers; absolute numbers]

15 of 40
Quality control of count table
2 types of QC:
● General metrics
● Sample-specific quality control

16 of 40
QC: general metrics
● General numbers
Total number of counted reads

17 of 40
QC: general metrics
● General numbers

18 of 40
QC: general metrics
Which genes are most highly present?
Which fractions do they occupy?
42 genes (0.63%) of the 6665 genes account for 25% of all counts.
This graph can be constructed from the count table.
[Table: gene, counts; top entries include TEF1alpha, putative ribosomal proteins, ...]
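A minimal sketch of how such a number can be derived from one column of the count table: sort the genes by count and accumulate until the requested fraction of all counts is reached. The function name `genes_needed_for_fraction` and the example counts are hypothetical. The last line recomputes the percentage for the 42 of 6665 genes mentioned above.

```python
def genes_needed_for_fraction(counts, fraction=0.25):
    """Number of top-ranked genes that together hold `fraction` of all counts."""
    total = sum(counts.values())
    running = 0
    ranked = sorted(counts.values(), reverse=True)
    for n_genes, count in enumerate(ranked, start=1):
        running += count
        if running >= fraction * total:
            return n_genes
    return len(counts)


# Hypothetical counts for one sample.
example_counts = {"g1": 500, "g2": 300, "g3": 100, "g4": 60, "g5": 40}

top_25 = genes_needed_for_fraction(example_counts, 0.25)
top_80 = genes_needed_for_fraction(example_counts, 0.80)
print(top_25, top_80)

# 42 genes out of 6665, expressed as a percentage.
share = round(100 * 42 / 6665, 2)
print(share)
```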

19 of 40
QC: general metrics
● We can plot the counts per sample: filter
out the zeros and apply a log2 transformation.
[Figure: density of log2(count)]
The bulk of the genes have counts
in the hundreds.
A few are extremely highly expressed.
A minority have extremely low counts.
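The filtering and transformation can be sketched as follows. Instead of a smoothed density, this example bins the log2 values into whole-number bins, which is enough to see where the bulk of the genes lies. The function name `log2_histogram` and the counts are hypothetical.

```python
import math
from collections import Counter


def log2_histogram(counts):
    """Drop zero counts, log2-transform the rest, and bin by integer part."""
    bins = Counter()
    for count in counts:
        if count > 0:  # log2(0) is undefined, so zeros are filtered out
            bins[math.floor(math.log2(count))] += 1
    return dict(sorted(bins.items()))


# Hypothetical raw counts for one sample.
sample_counts = [0, 1, 3, 120, 150, 300, 480, 510, 70000]

histogram = log2_histogram(sample_counts)
for log2_bin, n_genes in histogram.items():
    print(f"log2 bin {log2_bin:>2}: {'#' * n_genes}")
```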

20 of 40
QC: log2 density graph
● We can do this for all samples, and merge the results.
[Figure annotations: one plot shows a strange deviation; in the other, all samples overlap nicely and the peaks are similar]

21 of 40
QC: log2 merging samples
Here, we take one sample and plot the log2 density
graph, add the counts of another sample and plot
again, and so on until we have merged all samples.
A horizontal shift of the graph and a vertical shift
lead to different conclusions.

22 of 40
QC: rarefaction curve
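Code:
library(ggplot2)
# Assumed layout of nonzero_counts: column 'total' holds the cumulative
# number of reads, column 'counts' the number of genes with counts > 0.
ggplot(data = nonzero_counts, aes(x = total, y = counts)) +
  geom_line() +
  labs(x = "total number of sequenced reads",
       y = "number of genes with counts > 0")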
What is the total number of detected features, and
how does the feature space grow with each
additional sample?
There should be saturation, but here there is none.

23 of 40
QC: rarefaction curve
Saturation: OK!
….
SampleA
SampleA+sampleB
SampleA+sampleB+sampleC
Etc.
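The points of such a curve can be computed by adding samples one at a time and recording, after each addition, the total number of reads and the number of genes with a count above zero. The sketch below is one possible way to do this. The function name `rarefaction_points` and the counts are hypothetical.

```python
def rarefaction_points(samples):
    """Return (total reads, genes with count > 0) after each added sample."""
    merged = {}
    points = []
    for counts in samples.values():
        for gene, count in counts.items():
            merged[gene] = merged.get(gene, 0) + count
        total_reads = sum(merged.values())
        detected = sum(1 for c in merged.values() if c > 0)
        points.append((total_reads, detected))
    return points


# Hypothetical count columns, added in this order.
samples = {
    "sampleA": {"g1": 10, "g2": 0, "g3": 5, "g4": 0},
    "sampleB": {"g1": 8, "g2": 2, "g3": 0, "g4": 0},
    "sampleC": {"g1": 12, "g2": 1, "g3": 4, "g4": 0},
}

points = rarefaction_points(samples)
print(points)
```

In this toy example the number of detected genes stops increasing after the second sample, which is the saturation one hopes to see.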

24 of 40
Alternative to log2 transformations
● Log2 transformations suffer from inflated
variance, especially at low counts.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
[Figure panels: VST, rLog, log2]
These are transformations, not normalizations!
http://www.biomedcentral.com/1471-2105/14/91
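Why the log2 transformation behaves this way can be illustrated with a small calculation. The sketch assumes that counts follow a Poisson distribution (real RNA-seq counts are usually more variable) and adds a pseudocount of 1 so that zeros can be transformed. Both are simplifying assumptions, and the function name `sd_log2_poisson` is hypothetical. It computes the standard deviation of log2(count + 1) for genes with different mean counts.

```python
import math


def sd_log2_poisson(mu, pseudocount=1.0):
    """Standard deviation of log2(X + pseudocount) for X ~ Poisson(mu)."""
    upper = int(mu + 10 * math.sqrt(mu) + 20)  # far into the upper tail
    mean = 0.0
    mean_sq = 0.0
    for k in range(upper + 1):
        # Poisson probability of observing k, computed in log space.
        p = math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
        y = math.log2(k + pseudocount)
        mean += p * y
        mean_sq += p * y * y
    return math.sqrt(max(mean_sq - mean * mean, 0.0))


for mu in (2, 20, 200, 2000):
    print(mu, round(sd_log2_poisson(mu), 3))
```

Under these assumptions the spread on the log2 scale is largest for genes with low counts and shrinks as the mean count grows, so low-count genes dominate the variance after a plain log2 transformation.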

25 of 40
QC: count transformations
● Other transformations do not have this
behavior, especially VST.
[Figure panels: VST, rLog, log2]
These are transformations, not normalizations!
http://www.biomedcentral.com/1471-2105/14/91

26 of 40
Alternative to log2 transformations
Regularized log (rLog) and 'Variance Stabilizing Transformation'
(VST) as alternatives to log2.
[Figures: rLog; VST]

27 of 40
Beyond simple metrics QC
● We can also include condition
information, to interpret our QC better.
For this, we need to gather sample
information.
● Make a separate file
in which sample info
is provided (metadata)
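A minimal sketch of such a file and a basic consistency check. The tab-separated layout, the column names, the sample names and the values are all hypothetical, as are the function names `read_metadata` and `unmatched_samples`. The check reports sample names that occur in only one of the two places, the metadata file or the count table header.

```python
import csv
import io

# Hypothetical sample description file (tab-separated).
sample_sheet = io.StringIO(
    "sample\tstrain\ttreatment\n"
    "sample1\tWT\tcontrol\n"
    "sample2\tWT\ttreated\n"
    "sample3\tmutant\tcontrol\n"
    "sample4\tmutant\ttreated\n"
)


def read_metadata(handle):
    """Read the sample sheet into {sample name: row as dict}."""
    return {row["sample"]: row for row in csv.DictReader(handle, delimiter="\t")}


def unmatched_samples(metadata, count_table_columns):
    """Sample names present in only one of metadata and count table."""
    return sorted(set(metadata) ^ set(count_table_columns))


metadata = read_metadata(sample_sheet)
# Hypothetical count table header with one misnamed sample.
columns = ["sample1", "sample2", "sample3", "sample5"]
print(unmatched_samples(metadata, columns))
```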

28 of 40
QC with condition information
What do the differences in counts
between samples depend on?
Here, counts depend on the treatment
and the strain. These factors must match
the sample descriptions file.

29 of 40
QC with condition info
Clustering of the distance between samples based on
transformed counts can reveal sample errors.
[Figures: VST transformed; rLog transformed]
The colour scale shows the distance measure between samples.
Samples from similar conditions should cluster together.
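The distances behind such a heatmap can be sketched as follows. The slide does not say which distance measure is used, so the Euclidean distance here is an assumption, as are the function name `sample_distances` and the transformed values.

```python
import math
from itertools import combinations


def sample_distances(transformed):
    """Pairwise Euclidean distance between samples (assumed distance measure)."""
    return {
        (a, b): math.dist(transformed[a], transformed[b])
        for a, b in combinations(transformed, 2)
    }


# Hypothetical transformed counts (one value per gene) for four samples.
transformed = {
    "WT_1": [5.0, 8.0, 3.0],
    "WT_2": [5.2, 8.1, 3.1],
    "KO_1": [9.0, 4.0, 3.0],
    "KO_2": [9.3, 3.8, 3.2],
}

distances = sample_distances(transformed)
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 3))
```

Samples of the same condition should end up with the smallest distances. A sample that sits closer to another condition than to its own replicates is a candidate for a sample mix-up.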

30 of 40
Clustering of transformed counts can reveal sample
errors.
[Figures: VST transformed; rLog transformed]
Biological samples should cluster together.

31 of 40
Principal component (PC) analysis allows us to
display the samples in a 2D scatterplot based on
the variability between the samples. Samples close to
each other resemble each other more.
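To make the idea concrete, here is a small sketch that computes the coordinates of each sample on the first two principal components using only the Python standard library. It centers each gene, builds the sample-by-sample matrix of inner products and extracts its leading eigenvectors by power iteration. This is one simple way to obtain PCA scores for a handful of samples and is not necessarily what any particular tool does. The function name `pca_scores` and the data are hypothetical.

```python
import math


def pca_scores(samples, n_components=2, n_iter=500):
    """Scores of each sample on the first principal components."""
    names = list(samples)
    n = len(names)
    n_genes = len(samples[names[0]])
    # Center every gene across the samples.
    means = [sum(samples[s][g] for s in names) / n for g in range(n_genes)]
    x = [[samples[s][g] - means[g] for g in range(n_genes)] for s in names]
    # Sample-by-sample matrix of inner products (Gram matrix).
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    scores = {s: [] for s in names}
    for _component in range(n_components):
        v = [float(i + 1) for i in range(n)]  # non-constant start vector
        for _step in range(n_iter):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(value * value for value in w))
            if norm == 0:
                break
            v = [value / norm for value in w]
        eigval = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n))
                     for i in range(n))
        for i, name in enumerate(names):
            scores[name].append(v[i] * math.sqrt(max(eigval, 0.0)))
        # Remove the component just found before looking for the next one.
        gram = [[gram[i][j] - eigval * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return scores


# Hypothetical transformed counts (one value per gene) for four samples.
transformed = {
    "WT_1": [5.0, 8.0, 3.0],
    "WT_2": [5.2, 8.1, 3.1],
    "KO_1": [9.0, 4.0, 3.0],
    "KO_2": [9.3, 3.8, 3.2],
}

scores = pca_scores(transformed)
for name, (pc1, pc2) in scores.items():
    print(name, round(pc1, 2), round(pc2, 2))
```

The sign of a principal component is arbitrary, so only the relative positions matter: in this toy data the two WT samples fall on one side of PC1 and the two KO samples on the other.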

32 of 40
Collect enough metadata
Principal component (PC) analysis allows us to
display the samples in a 2D scatterplot based on
the variability between the samples. Samples close to
each other resemble each other more.
Why do
these lie so close together?

33 of 40
You can never collect enough
During library preparation, collect as much
information as possible to add to the sample
descriptions. Pay particular attention to differences
between samples: e.g. day of preparation,
centrifuges used, ...

34 of 40
In the QC of the count table, you can map this
additional info to the PC graph. In this case, library
prep on a different day had an effect on the WT
samples (batch effect).
Additional metadata

35 of 40
In the QC of the count table, you can map this
additional info to the PC graph. In this case, library
prep on a different day had an effect on the WT
samples (batch effect).
Day 1
Day 2

36 of 40
Days are included and give us more insight.

37 of 40
Next step
Now that we know our data inside out, we
can run a DE algorithm on the count table!

38 of 40
Keywords
Raw counts
Count table
Overlapping features
Density graph
Rarefaction curve
Count transformation
VST
Sample metadata
PCA plot
Write in your own words what the terms mean

39 of 40
Exercises
● → Extracting counts and doing QC
