7. Functional Genomics, SS2014, Tuesday, 25 March 2014
Systematic evaluation of spliced alignment programs for RNA-seq data
Mapping strategies depend on read length
● Read length < 50 bp → Short (unspliced) aligners
● Read length > 50 bp → Spliced alignment programs
● In mRNA sequences the introns have been removed
BWA, Bowtie
GSNAP, MapSplice, STAR, PALMapper, TopHat, ReadsMap, PASS, SMALT
10.
Outline
Challenges in RNA sequence alignment
The aim of this paper
Existing spliced-alignment software
Conclusions
18.
Challenges in RNA-seq alignment
Large number of reads
RNA splicing / alternative splicing: a single gene may code for multiple proteins
22.
Challenges in RNA-seq alignment
Large number of reads
RNA splicing / alternative splicing
Paired read separation issue
Pseudogenes: pseudogenes often have sequences highly similar to functional, intron-containing genes, so RNA reads can be incorrectly mapped to them. The human genome contains over 14,000 pseudogenes [Pei et al. Genome Biol 2012].
24.
Challenges in RNA-seq alignment
Large number of reads
RNA splicing / alternative splicing
Paired read separation issue
Pseudogenes
Duplications: may correspond to biased PCR amplification of particular fragments
26.
The aim of this paper
Assess the performance of 26 RNA-seq alignment protocols, based on 11 programs, on real and simulated human and mouse transcriptomes.
Alignment protocols were evaluated on Illumina 76-nucleotide paired-end RNA-seq data from:
the human leukemia cell line K562 (1.3 × 10^9 reads)
mouse brain (1.1 × 10^8 reads)
and two simulated data sets
27.
Outline
Challenges in RNA sequence alignment
The aim of this paper
Existing spliced-alignment software
TopHat
MapSplice
STAR
GSNAP
Conclusions
30.
TopHat
Trapnell, Pachter, and Salzberg (2009)
Unspliced alignment, setting aside:
- reads that map to more than 10 locations
- reads that have more than a few mismatches
Assemble islands of sequences
31.
TopHat
Trapnell, Pachter, and Salzberg (2009)
Unspliced alignment
Assemble
Such an approach will identify only known or predicted combinations of exons
35.
TopHat
Trapnell, Pachter, and Salzberg (2009)
If an alignment extends into an intron region, realign the reads to the adjacent exons instead
Known junction signals: GT-AG, GC-AG, and AT-AC
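The junction signals listed above can be checked with a few lines of Python. This is only a minimal sketch, not TopHat's implementation: the function names, the 0-based half-open intron coordinates and the toy genome are assumptions made for illustration, and only the forward strand is considered.

```python
# Donor/acceptor dinucleotide pairs named on the slide.
KNOWN_SIGNALS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def junction_signal(genome, intron_start, intron_end):
    """Return the (donor, acceptor) dinucleotides of a candidate intron.

    intron_start is the 0-based index of the first intron base and
    intron_end is one past the last intron base (hypothetical convention).
    """
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry both signals")
    return intron[:2], intron[-2:]


def has_known_signal(genome, intron_start, intron_end):
    """True if the candidate intron is flanked by one of the known signals."""
    return junction_signal(genome, intron_start, intron_end) in KNOWN_SIGNALS


# Toy genome: exon, intron (GT...AG), exon.
exon1, intron, exon2 = "ACGTTG", "GTAAGTCCCTTTAG", "GCATCA"
genome = exon1 + intron + exon2
start, end = len(exon1), len(exon1) + len(intron)

print(junction_signal(genome, start, end))
print(has_known_signal(genome, start, end))
print(has_known_signal(genome, start + 1, end))  # shifted by one base
```

Shifting the candidate intron by a single base changes the donor dinucleotide, which is why such signals help to pin down the exact junction position.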
36.
Outline
Challenges in sequence alignment
What the paper is about
Existing software
TopHat
MapSplice
STAR
GSNAP
Conclusions
Future work
37.
MapSplice
Wang et al. (2010)
Similar to TopHat
Reads = tags
A tag T has an ‘exonic alignment’ if it can be aligned in its entirety to a consecutive sequence of nucleotides in the genome G.
T has a ‘spliced alignment’ if its alignment to G requires one or more gaps.
39.
MapSplice
Wang et al. (2010)
Step 2: spliced alignment
● Find the spliced alignment of segment t_{j+1} to the genomic interval between anchors t_j and t_{j+2}
● Consider all possible positions of the splice site and choose the mapping according to the Hamming distance
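The search over splice-site positions can be sketched as follows. This is a simplified illustration under stated assumptions, not MapSplice's code: the names `best_split`, `left_start` and `right_end` are hypothetical, the segment is assumed to start right after anchor t_j and end right before anchor t_{j+2}, and ties are broken by taking the leftmost split.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_split(segment, genome, left_start, right_end):
    """Try every splice position k inside the segment.

    The first k bases are placed at genome[left_start:], the remaining
    bases are placed so that they end at genome[:right_end].
    Returns (k, mismatches) for the split with the smallest Hamming distance.
    """
    m = len(segment)
    if right_end - left_start < m:
        raise ValueError("genomic interval shorter than the segment")
    best = None
    for k in range(m + 1):
        left = genome[left_start:left_start + k]
        right = genome[right_end - (m - k):right_end]
        d = hamming(segment[:k], left) + hamming(segment[k:], right)
        if best is None or d < best[1]:
            best = (k, d)
    return best


# Toy example: anchor, exon end, intron, exon start, anchor.
anchor_l, exon_end, intron, exon_start, anchor_r = (
    "AAAA", "CCGA", "GT" + "T" * 7 + "AG", "TTCA", "GGGG")
genome = anchor_l + exon_end + intron + exon_start + anchor_r
segment = exon_end + exon_start
left_start = len(anchor_l)
right_end = len(genome) - len(anchor_r)

print(best_split(segment, genome, left_start, right_end))
```

Because both ends of the segment are fixed by the anchors, only the split position has to be searched, so the cost is linear in the number of candidate positions times the segment length.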
42.
STAR
Dobin et al. (2012)
Maximal Mappable Prefix (read location i) = the longest read substring starting at position i that exactly matches one or more substrings of the reference genome
poor genomic alignment
Detect:
(a) splice junctions
(b) mismatches
(c) tails
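The Maximal Mappable Prefix definition can be illustrated with a naive search. This sketch is an assumption-laden toy: it scans the genome with plain string search rather than an index, and the helper names and the rule of skipping a single unmappable base are made up for illustration.

```python
def find_all(text, pattern):
    """All 0-based start positions of pattern in text."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def maximal_mappable_prefix(read, genome, i=0):
    """Longest prefix of read[i:] with an exact match in the genome.

    Returns (length, positions); length is 0 if even one base fails.
    """
    length, positions = 0, []
    for n in range(1, len(read) - i + 1):
        hits = find_all(genome, read[i:i + n])
        if not hits:
            break
        length, positions = n, hits
    return length, positions


def split_read(read, genome):
    """Apply the MMP search repeatedly to the unmapped rest of the read."""
    i, parts = 0, []
    while i < len(read):
        n, positions = maximal_mappable_prefix(read, genome, i)
        if n == 0:
            n, positions = 1, []  # skip one unmappable base (toy rule)
        parts.append((i, n, positions))
        i += n
    return parts


# Toy genome: exon, intron, exon; the read spans the junction.
genome = "TTACGGATCC" + "GTAAG" + "A" * 4 + "AG" + "CATGCTTAGC"
read = "GGATCC" + "CATGCT"

print(split_read(read, genome))
```

In the toy example the first prefix stops at the exon boundary and the second search resumes in the next exon, which is how a splice junction shows up as two consecutive mappable pieces.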
44.
GSNAP
Wu and Nacu (2010)
Efficient detection of indels and splice pairs:
For large genomes, it is more efficient to preprocess the genome rather than the reads to create genomic index files, which provide the genomic positions for a given prefix/suffix.
Works with candidate regions in the reference genome (keeps track of the read locations of the 12-residue oligomers that support each candidate region).
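The idea of indexing the genome once and then collecting supporting oligomers per candidate region can be sketched like this. It is a minimal illustration, not GSNAP's data structure: the in-memory dictionary, the function names and the use of k = 4 in the example (instead of 12) are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def build_kmer_index(genome, k=12):
    """Map every k-mer of the genome to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def candidate_regions(read, index, k=12):
    """Group k-mer hits by the genomic start they imply for the read.

    Returns {implied_read_start: [read offsets of supporting k-mers]}.
    """
    support = defaultdict(list)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            support[pos - offset].append(offset)
    return dict(support)


genome = "ACGTACGGTCATTGCA"
read = "GGTCATTG"
index = build_kmer_index(genome, k=4)

print(candidate_regions(read, index, k=4))
```

The index is built once per genome and reused for every read, which is the efficiency argument made above; the stored read offsets show which parts of the read support each candidate region.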
46.
For a more powerful use of the algorithms:
use available gene annotations, which help to avoid erroneously mapping reads to pseudogenes
use the information about the mates of paired-end reads
48.
Conclusions
Mismatches and basewise accuracy
MapSplice, PASS and TopHat display a low tolerance for mismatches. Consequently, a large proportion of reads with low base-call quality scores were not mapped by these methods.
49.
Conclusions
Mismatches and basewise accuracy
● GSNAP, GSTRUCT, MapSplice, PASS, SMALT and STAR allow mismatches and can also output an incomplete alignment when they are unable to map an entire sequence
50.
Conclusions
Mismatches and basewise accuracy
Reads from mouse were mapped (against the mouse reference assembly [17]) at a greater rate and with fewer mismatches than those from K562 (the cancer cell line K562 has accumulated many mutations relative to the human reference assembly).
51.
Conclusions
Indel frequency and accuracy
● GSTRUCT produced the most uniform distribution of indels (coefficient of variation (CV) = 0.32)
● TopHat produced the most variable distribution (CV = 1.5 and 1.1 splice junctions)
● GEM and PALMapper output included more indels than any other method
[Figure: Size distribution of indels for the human K562 data set]
[Figure: Precision and recall, stratified by indel size]
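The coefficient of variation quoted above is the standard deviation divided by the mean. The sketch below shows the arithmetic on made-up counts; the numbers are hypothetical, and the use of the population standard deviation is an assumption, since the slides do not say which estimator was used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population form, an assumption)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return pstdev(values) / m


# Hypothetical indel counts per bin, not taken from the paper.
uniform_counts = [10, 10, 10, 10]
variable_counts = [2, 4, 4, 4, 5, 5, 7, 9]

print(coefficient_of_variation(uniform_counts))
print(coefficient_of_variation(variable_counts))
```

A CV near zero means the indels are spread evenly over the bins, while larger values indicate that a few bins hold most of the indels.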
52.
Conclusions
Indel frequency and accuracy
● GEM and PALMapper report many false indels (low precision)
● GSNAP and GSTRUCT exhibit high sensitivity for deletions, independent of size (recall)
● The TopHat2 protocol is the most sensitive method for long insertions (recall)
[Figure: Precision and recall, stratified by indel size]
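Precision and recall stratified by indel size can be computed as in the sketch below. It is an illustration only: the representation of an indel as a (chromosome, position, size) tuple, the exact-match criterion and the example data are assumptions, not the evaluation procedure of the paper.

```python
from collections import defaultdict


def precision_recall(reported, truth):
    """Precision = TP / reported, recall = TP / truth, for two sets."""
    tp = len(reported & truth)
    precision = tp / len(reported) if reported else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


def stratify_by_size(indels):
    """Group (chrom, pos, size) tuples by absolute indel size."""
    groups = defaultdict(set)
    for indel in indels:
        groups[abs(indel[2])].add(indel)
    return groups


def stratified_precision_recall(reported, truth):
    """Return {size: (precision, recall)} for every size seen in either set."""
    rep, tru = stratify_by_size(reported), stratify_by_size(truth)
    return {size: precision_recall(rep.get(size, set()), tru.get(size, set()))
            for size in sorted(set(rep) | set(tru))}


# Hypothetical data: one false and one missed 1-bp indel, one correct 5-bp indel.
truth = {("chr1", 100, 1), ("chr1", 200, 1), ("chr1", 300, 5)}
reported = {("chr1", 100, 1), ("chr1", 250, 1), ("chr1", 300, 5)}

print(stratified_precision_recall(reported, truth))
```

Many false indels lower precision, while missed indels lower recall, so the two numbers separate the behaviors described in the bullets above.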
53.
Conclusions
Spliced alignment
● High junction discovery accuracy for ReadsMap, GSNAP, GSTRUCT, MapSplice and TopHat
● The number of false junction calls was greatly reduced when junctions were filtered by supporting alignment counts (plot c)
● Protocols using annotation recovered nearly all of the known junctions in expressed transcripts (plot d)
● For novel-junction discovery, GSTRUCT outperformed other methods
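Filtering junctions by the number of supporting alignments can be sketched as follows. The junction representation as a (chromosome, donor, acceptor) tuple, the threshold of two alignments and the example data are assumptions for illustration; the slides do not state the thresholds used.

```python
from collections import Counter


def filter_junctions(junction_per_alignment, min_support=2):
    """Keep junctions reported by at least min_support spliced alignments.

    junction_per_alignment holds one (chrom, donor, acceptor) tuple
    for every spliced alignment that crosses that junction.
    """
    counts = Counter(junction_per_alignment)
    return {junction: n for junction, n in counts.items() if n >= min_support}


# Hypothetical alignments: one junction seen three times, one seen once.
alignments = [("chr1", 100, 500)] * 3 + [("chr1", 100, 900)]

print(filter_junctions(alignments, min_support=2))
```

A junction supported by a single alignment is more likely to be an alignment artifact, so raising the threshold trades some sensitivity for fewer false calls.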
54.
Conclusions
GSNAP, GSTRUCT, MapSplice and STAR compared favorably to the other methods.
MapSplice seems to be a conservative aligner with respect to mismatch frequency, indel and exon junction calls.
The most significant issue with GSNAP, GSTRUCT and STAR is the presence of many false exon junctions in the output.
Both GSNAP and GSTRUCT require considerable computing time when parameterized for sensitive spliced alignment.
56.
Remaining challenges:
Remaining challenges include exploiting gene annotation without introducing bias, correctly placing multimapped reads, achieving optimal yet fast alignment around gaps and mismatches, and reducing the number of false exon junctions reported.
Ongoing developments in sequencing technology will demand efficient processing of longer reads with higher error rates and will require more extensive spliced alignment as reads span multiple exon junctions.
57.
Some RNA-seq aligners, including GSNAP [5], RUM [6], and STAR [7], map reads independently of the alignments of other reads, which may explain their lower sensitivity for these spliced reads.
GSNAP [5] and STAR [7] also make use of annotation, although they use it in a more limited fashion in order to detect splice sites.
58.
have shown how suffix arrays (Manber and Myers, 1990), compressed using a Burrows-Wheeler Transform (BWT) (Burrows and Wheeler, 1994), can rapidly map reads that are exact matches or have a few mismatches or short insertions or deletions (indels) relative to the reference.
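Exact-match lookup with a suffix array can be sketched in a few lines. This is a deliberately simplified illustration: it uses an uncompressed suffix array built by naive sorting, without the BWT compression mentioned above, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(read, text, sa):
    """All positions where read occurs in text, via two binary searches."""
    m = len(read)

    def prefix(rank):
        # First m characters of the suffix with the given rank.
        return text[sa[rank]:sa[rank] + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound: first suffix whose prefix is >= read
        mid = (lo + hi) // 2
        if prefix(mid) < read:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # upper bound: first suffix whose prefix is > read
        mid = (lo + hi) // 2
        if prefix(mid) <= read:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


reference = "GATTACAGATTA"
sa = build_suffix_array(reference)

print(find_exact("ATTA", reference, sa))
print(find_exact("CCC", reference, sa))
```

All suffixes that start with the read are adjacent in the sorted order, so a lookup needs only a logarithmic number of comparisons in the reference length, independent of how many reads are mapped.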
59.
A third approach, provided by the QPALMA program (Bona et al., 2008), can align individual reads across exon–exon junctions using Smith–Waterman-type alignments and a specifically trained splice site model.
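For orientation, a plain Smith–Waterman local alignment score can be computed as below. This sketch covers only the generic dynamic program: the scoring values are arbitrary assumptions, and it contains no intron state and no trained splice site model, so it is not QPALMA's algorithm.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("TTACGTTT", "ACGT"))
print(smith_waterman_score("AAAA", "TTTT"))
```

A spliced aligner of this type would extend the recurrence with an additional intron state whose entry and exit are scored by the splice site model, which is the part this sketch leaves out.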
Editor's notes
RNA splicing, introns: mRNA transcripts do not include introns, so the alignment program must handle gapped (or spliced) alignment with very large gaps.