The document discusses RNA-Seq data analysis. Some key points:
- RNA-Seq involves sequencing steady-state RNA in a sample without prior knowledge of the organism. It can uncover novel transcripts and isoforms.
- Making sense of the large and complex RNA-Seq data depends on the scientific question, such as finding transcribed SNPs for allele-specific expression or novel transcripts in cancer samples.
- Common applications of RNA-Seq include abundance estimation, alternative splicing detection, RNA editing discovery, and finding novel transcripts and isoforms.
- Analysis steps include mapping reads to a reference genome/transcriptome, generating mapping statistics and quality metrics, differential expression analysis, clustering, and pathway analysis using tools like
2. Transcriptome Sequencing
Sequencing the steady-state RNA in a sample is known as
RNA-Seq. It is free of limitations such as the need for
prior knowledge about the organism.
RNA-Seq is useful for unravelling otherwise inaccessible
complexities of transcriptomics, such as finding novel
transcripts and isoforms.
The data set produced is large and complex; interpretation
is not straightforward.
3. Making sense of RNA-Seq data…
Depends upon the scientific question of interest.
For example, allele-specific expression requires accurate
determination of the transcribed SNPs.
Finding novel transcripts will help in finding fusion gene
events and aberrations in cancer samples.
4. Applications of RNA-Seq
1. Abundance estimation
2. Alternative splicing
3. RNA editing
4. Finding novel transcripts
5. Finding isoforms
And many more…
5. From RNA-seq reads to differential expression results
Oshlack et al. Genome Biology 2010, 11:220
6. Mapping Reads to Reference: CLC bio Workbench
The RNA-Seq analysis is done in several steps: First, all genes
are extracted from the reference genome (using annotations of
type gene). Other annotations on the gene sequences are
preserved (e.g. CDS information about coding sequences etc).
Next, all annotated transcripts (using annotations of type
mRNA) are extracted. If there are several annotated splice
variants, they are all extracted. Note that the mRNA
annotation type is used for extracting the exon-exon
boundaries.
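As a rough illustration of what "extracting the exon-exon boundaries" from mRNA annotations could look like, the sketch below derives junctions from exon coordinates. It is not the CLC bio implementation; the function name, the coordinate convention (1-based, inclusive) and the two example splice variants are assumptions made for this example.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed 1-based inclusive reference coordinates


def exon_exon_boundaries(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return (last base of upstream exon, first base of downstream exon) pairs."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


# Two hypothetical splice variants annotated on the same gene.
variant_a = [(100, 200), (300, 400), (500, 600)]
variant_b = [(100, 200), (500, 600)]  # skips the middle exon

junctions = {
    "variant_a": exon_exon_boundaries(variant_a),
    "variant_b": exon_exon_boundaries(variant_b),
}
```

Because every annotated variant is processed, a junction that exists only in one variant (here 200 to 500) is still available when reads are mapped.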
8. The mapping parameters
Maximum number of mismatches: applies to short reads (shorter than 56
nucleotides, except for color space data, which are always treated as
long reads). This is the maximum number of mismatches
allowed. The maximum value is 3, except for color space, where it is 2.
Minimum length fraction: the default is 0.9, which means that at
least 90% of the bases need to align to the reference.
Minimum similarity fraction: the default is 0.8. Together with the
default length fraction, this means that 90% of the read must
align with 80% similarity for the read to be included.
Maximum number of hits for a read: a read that matches more
distinct places in the reference than the 'Maximum number of hits
for a read' specified will not be mapped.
Strand-specific alignment: mapping reads to a specific strand.
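To make the interaction of these thresholds concrete, here is a minimal sketch of a filter that accepts or rejects one candidate alignment. It illustrates the parameter definitions only and is not how the workbench is implemented. The function name and its inputs are hypothetical, similarity is assumed to be measured over the aligned part of the read, and the `max_hits` default of 10 is an illustrative value that does not come from the slides.

```python
def passes_mapping_filters(
    read_length: int,
    aligned_length: int,
    matching_bases: int,
    hit_count: int,
    min_length_fraction: float = 0.9,
    min_similarity_fraction: float = 0.8,
    max_hits: int = 10,  # illustrative value, not taken from the slides
) -> bool:
    """Return True if a candidate alignment satisfies all three thresholds."""
    if read_length <= 0 or aligned_length <= 0:
        return False
    length_fraction = aligned_length / read_length  # share of the read that aligned
    similarity = matching_bases / aligned_length    # identity within the aligned part
    return (
        length_fraction >= min_length_fraction
        and similarity >= min_similarity_fraction
        and hit_count <= max_hits
    )


# A 100-base read: 95 bases aligned, 80 of them matching, one hit in the reference.
accepted = passes_mapping_filters(100, 95, 80, hit_count=1)
# Same read, but only 85 bases aligned (below the 0.9 length fraction).
too_short = passes_mapping_filters(100, 85, 80, hit_count=1)
# Same read, but it matches 11 distinct places (above max_hits).
too_many_hits = passes_mapping_filters(100, 95, 80, hit_count=11)
```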
15. Summarization : Parameters
Transcripts: The number of transcripts based on the mRNA
annotations on the reference. Note that this is not based on the
sequencing data - only on the annotations already on the reference
sequence(s).
Exon length: The total length of all exons (not all transcripts).
Unique gene reads: This is the number of reads that match uniquely to
the gene.
Total gene reads: This is all the reads that are mapped to this gene:
both reads that map uniquely to the gene and reads that matched
more than one position in the reference (but fewer than the 'Maximum
number of hits for a read' parameter) and were assigned to this
gene.
RPKM: The expression value is measured in Reads Per Kilobase of exon
model per Million mapped reads (RPKM) [Mortazavi et al., 2008]:
RPKM = total exon reads / (mapped reads (millions) × exon length (kb))
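The RPKM definition can be checked with a small worked example. The sketch below assumes the usual reading of the formula, in which total exon reads are divided by the product of mapped reads (in millions) and exon length (in kilobases). The function name and the example numbers are made up for illustration.

```python
def rpkm(total_exon_reads: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and mapped read count must be positive")
    mapped_millions = total_mapped_reads / 1_000_000
    exon_length_kb = exon_length_bp / 1_000
    return total_exon_reads / (mapped_millions * exon_length_kb)


# Hypothetical gene: 500 exon reads, 2,000 bp of exons, 10 million mapped reads.
example_value = rpkm(500, 2_000, 10_000_000)  # 500 / (10 * 2) = 25.0
```

Normalising by both exon length and sequencing depth is what makes values comparable between genes of different sizes and between samples with different numbers of mapped reads.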
18. Basic Statistics Summary
The Basic Statistics module generates some simple
composition statistics for the file analysed.
Filename: The original filename of the file which was analysed.
File type: Says whether the file appeared to contain actual base calls or
colorspace data which had to be converted to base calls.
Total Sequences: A count of the total number of sequences processed.
There are two values reported, actual and estimated.
Sequence Length: Provides the length of the shortest and longest
sequence in the set. If all sequences are the same length only one value
is reported.
%GC: The overall %GC of all bases in all sequences.
Warning
Basic Statistics never raises a warning.
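The statistics listed above are simple to compute by hand, which can help when sanity-checking a report. The following is a sketch and not FastQC's own code: it assumes a plain four-line-per-record FASTQ file, ignores colorspace data and the actual/estimated distinction, and uses a hypothetical function name and a tiny in-memory example.

```python
import io
from typing import Dict, TextIO, Union


def basic_statistics(handle: TextIO) -> Dict[str, Union[int, float]]:
    """Count sequences, length range and overall %GC of a 4-line-record FASTQ."""
    total = 0
    min_len = None
    max_len = 0
    gc_bases = 0
    all_bases = 0
    for line_number, line in enumerate(handle):
        if line_number % 4 != 1:  # the sequence is the second line of each record
            continue
        seq = line.strip().upper()
        total += 1
        min_len = len(seq) if min_len is None else min(min_len, len(seq))
        max_len = max(max_len, len(seq))
        gc_bases += seq.count("G") + seq.count("C")
        all_bases += len(seq)
    return {
        "total_sequences": total,
        "min_length": min_len or 0,
        "max_length": max_len,
        "percent_gc": 100 * gc_bases / all_bases if all_bases else 0.0,
    }


# Two made-up reads: GGCC (all GC) and ATATAT (no GC).
fastq_text = "@r1\nGGCC\n+\nIIII\n@r2\nATATAT\n+\nIIIIII\n"
stats = basic_statistics(io.StringIO(fastq_text))
```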
19.
This view shows an overview of the range of
quality values across all bases at each position
in the FastQ file. For each position a
BoxWhisker type plot is drawn. The elements
of the plot are as follows:
The central red line is the median value.
The yellow box represents the inter-quartile range (25-75%).
The upper and lower whiskers represent the 10% and 90% points.
The blue line represents the mean quality. The y-axis on the graph shows the
quality scores. The higher the score the better the base call. The background of the
graph divides the y axis into very good quality calls (green), calls of reasonable
quality (orange), and calls of poor quality (red). The quality of calls on most
platforms will degrade as the run progresses, so it is common to see base calls
falling into the orange area towards the end of a read. It should be mentioned that
there are a number of different ways to encode a quality score in a FastQ file.
FastQC attempts to automatically determine which encoding method was used;
the title of the graph will describe the encoding FastQC thinks your file used.
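The values behind each box in the plot can be reproduced with a few lines of code. The sketch below is an approximation under stated assumptions: quality strings are taken to be Phred+33 encoded (no automatic encoding detection is attempted), percentiles use simple linear interpolation, which may differ slightly from FastQC's own calculation, and all function names are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List


def decode_phred33(quality_string: str) -> List[int]:
    """Convert a quality string to scores, assuming the Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def percentile(sorted_values: List[int], fraction: float) -> float:
    """Linear-interpolation percentile of an already sorted, non-empty list."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * weight


def per_base_quality(quality_strings: List[str]) -> List[Dict[str, float]]:
    """Summarise quality scores at each read position (one dict per position)."""
    columns: List[List[int]] = []
    for quality_string in quality_strings:
        for pos, score in enumerate(decode_phred33(quality_string)):
            if pos == len(columns):
                columns.append([])
            columns[pos].append(score)
    summary = []
    for scores in columns:
        scores.sort()
        summary.append({
            "mean": mean(scores),                       # blue line
            "median": median(scores),                   # central red line
            "lower_quartile": percentile(scores, 0.25),  # bottom of yellow box
            "upper_quartile": percentile(scores, 0.75),  # top of yellow box
            "p10": percentile(scores, 0.10),             # lower whisker
            "p90": percentile(scores, 0.90),             # upper whisker
        })
    return summary


# Three made-up reads of length 3 ('I' = 40, '?' = 30, '5' = 20, '+' = 10).
example = per_base_quality(["II5", "I5+", "I?5"])
```

In this toy input the first position is uniformly high quality while the later positions are lower, mirroring the degradation towards the end of a read described above.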
20. The per sequence quality score report allows you
to see if a subset of your sequences have
universally low quality values. It is often the case
that a subset of sequences will have universally poor
quality, often because they are poorly imaged (on the
edge of the field of view etc); however, these should
represent only a small percentage of the total
sequences. If a significant proportion of the sequences
in a run have overall low quality then this could
indicate some kind of systematic problem - possibly
with just part of the run (for example one end of a
flowcell).
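A per-sequence view can be approximated by averaging the quality scores of each read and counting how many reads fall below a cutoff. The sketch below is only an illustration: it assumes Phred+33 encoding, the cutoff of 20 is an arbitrary example value and not a FastQC setting, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def mean_quality(quality_string: str, offset: int = 33) -> float:
    """Mean quality score of one read, assuming Phred+33 by default."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def per_sequence_quality(
    quality_strings: List[str],
    low_quality_threshold: float = 20.0,  # arbitrary example cutoff
) -> Tuple[Counter, float]:
    """Return a histogram of mean read qualities and the fraction below the cutoff."""
    means = [mean_quality(q) for q in quality_strings]
    histogram = Counter(int(m) for m in means)  # mean quality -> number of reads
    low_count = sum(1 for m in means if m < low_quality_threshold)
    return histogram, low_count / len(means)


# Four made-up reads with mean qualities 40, 40, 10 and 30.
histogram, low_fraction = per_sequence_quality(["IIII", "IIII", "++++", "I5I5"])
```

A small `low_fraction` is expected in most runs; a large one would point to the kind of systematic problem described above.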