Abstract: This session focuses on the differences between standard DNA mapping and RNAseq-specific transcript mapping: identifying splice variants and isoforms. Transcript quantification and the genomic variants that can be identified from RNAseq data will also be discussed.
Transcript detection in RNAseq
1. [by Joseph Robertson] RNAseq analysis: Transcript detection (1/2). August 11, 2011
2. Quick recap: Production informatics. Sequencing -> Images -> Conversion (demultiplexing). Resulting file type: FASTQ, “having raw sequence reads and quality scores”. [Figure labels: Sequencing, Image, Fastq, Quality Control, Projects]
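A minimal sketch of what "raw sequence reads and quality scores" looks like when read from a FASTQ file. It assumes the common four-line record layout and the Phred+33 quality encoding; Illumina pipelines of that period often used Phred+64, so the offset is an assumption, and the function name `parse_fastq` is hypothetical.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores


# Toy record: one read of four bases with decreasing quality.
example = ["@read1", "ACGT", "+", "II5!"]
records = list(parse_fastq(example))
```

Each record pairs every base with one quality score, which is what the later quality-control step works on.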
3. Objective & Challenges. Objective: study the active transcriptome of the cell. Problems: The RNA content of a cell is dominated by tRNA, rRNA and housekeeping genes. The flowcell has only finite real estate, most of which would be occupied by these mainly invariant transcripts. How to focus the sequencing on the “interesting” part of the transcriptome: mRNA and ncRNA?
4. What RNAseq protocols are there? RNAseq: total RNA, tRNA/rRNA removed + polyA-tail filtered. Good for studying protein-coding genes, e.g. gene expression, isoforms, expression of variant alleles, RNA editing events. RNA-DNA differences in the human transcriptome provide a yet-unexplored aspect of genome variation. Small RNAseq: total RNA, size selection for small RNA molecules. Good for small ncRNA, e.g. miRNAs, snoRNA. Duplex-specific thermostable nuclease (DSN) guided RNAseq normalization: total RNA, highly abundant transcripts are digested. Good for studying all transcripts. Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011. PMID: 21596952. Christodoulou DC, Gorham JM, Herman DS, Seidman JG. Construction of normalized RNA-seq libraries for next-generation sequencing using the crab duplex-specific nuclease. Curr Protoc Mol Biol. 2011. PMID: 21472699.
5. RNA-seq workflow: select polyA-tail + remove tRNA/rRNA; fragment RNA; make cDNA (caution: you may lose strand info); sequence; map reads; identify transcripts; quantify transcripts; identify differences between conditions. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008. PMID: 18516045.
6. Production Informatics and Bioinformatics. Basic production informatics: produce raw sequence reads. Advanced production informatics: map to genome and generate raw genomic features (e.g. SNPs). Bioinformatics research: analyze the data; uncover the biological meaning. (Per one-flowcell project.)
7. Challenges for RNAseq read mapping: losing reads because they do not match the ref. genome (reads spanning exon junctions; RNA editing events). Approaches: align to a ref. transcriptome library; exon-first, e.g. TopHat; seed-extend methods, e.g. GSNAP. [Figure labels: sequencing reads, DNA, gRNA, mRNA, editing event] Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011. PMID: 21623353. Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011. PMID: 21596952.
8. Exon-first approach: align reads to the ref. genome; chop up unaligned reads and try to identify matching regions; find splice junctions around the matches. Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011. PMID: 21623353.
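The three exon-first steps can be sketched on a toy genome. This is an illustration under strong assumptions (exact matching only, one intron per read, first hit wins), not the algorithm of TopHat or any other tool; the function name `align_exon_first` and the segment length are hypothetical.

```python
from typing import Dict, Optional


def align_exon_first(genome: str, read: str, seg_len: int = 4) -> Optional[Dict]:
    """Toy exon-first alignment: unspliced first, then split the leftovers."""
    # Step 1: try to align the whole read to the genome (exact match only).
    pos = genome.find(read)
    if pos != -1:
        return {"type": "unspliced", "start": pos}
    # Step 2: chop the unaligned read and anchor its first and last segment.
    left = genome.find(read[:seg_len])
    right = genome.find(read[-seg_len:], left + seg_len) if left != -1 else -1
    if left == -1 or right == -1:
        return None
    right_end = right + seg_len
    # Step 3: look for a split point where both read halves match the genome.
    for split in range(seg_len, len(read) - seg_len + 1):
        donor = left + split                        # first intronic base
        acceptor = right_end - (len(read) - split)  # first base of next exon
        if (acceptor > donor
                and genome[left:donor] == read[:split]
                and genome[acceptor:right_end] == read[split:]):
            return {"type": "spliced", "start": left, "intron": (donor, acceptor)}
    return None


# Toy genome: exon (10 bp) + intron GT...AG (10 bp) + exon (10 bp).
genome = "TTACGGATCA" + "GTAAAAAAAG" + "CCTGCATGGT"
junction_read = "GATCA" + "CCTGC"  # spans the exon-exon junction
spliced = align_exon_first(genome, junction_read)
unspliced = align_exon_first(genome, "ACGGAT")
```

Here the junction read fails the unspliced step and is placed with an intron at genome positions 10 to 20 (0-based, end exclusive). Real aligners also tolerate mismatches and check splice-site motifs, which this sketch leaves out.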
9. Seed-extend approach: break reads into smaller k-mers and find matches; iteratively extend the k-mers to identify the exact spliced alignment. Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011. PMID: 21623353.
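A minimal sketch of the seed-extend idea, assuming exact matches and a greedy left-to-right extension. It is not how GSNAP works internally; the helper names (`build_kmer_index`, `extend_right`, `seed_extend`) and the choice of k are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_kmer_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the genome to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def extend_right(genome: str, read: str, g_pos: int, r_pos: int) -> int:
    """Count how many bases match when extending from a seed to the right."""
    n = 0
    while (g_pos + n < len(genome) and r_pos + n < len(read)
           and genome[g_pos + n] == read[r_pos + n]):
        n += 1
    return n


def seed_extend(genome: str, read: str, k: int = 4) -> Optional[List[Tuple[int, int, int]]]:
    """Return aligned blocks as (genome_start, read_start, length), or None."""
    index = build_kmer_index(genome, k)
    blocks = []
    r, min_g = 0, 0
    while r < len(read):
        seed = read[r:r + k]
        # Only accept seed hits downstream of the previous block.
        hits = [g for g in index.get(seed, []) if g >= min_g]
        if len(seed) < k or not hits:
            return None
        g = hits[0]
        length = extend_right(genome, read, g, r)
        blocks.append((g, r, length))
        r += length
        min_g = g + length
    return blocks


genome = "TTACGGATCA" + "GTAAAAAAAG" + "CCTGCATGGT"
blocks = seed_extend(genome, "GATCACCTGC")
```

Two blocks separated by a gap on the genome indicate a spliced alignment. The greedy extension can overrun a junction when the intron happens to start with the same base as the next exon, which is one reason real tools score alternative split points.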
10. Which method? Exon-first: less computationally intensive. The additional exon junctions found by seed-extend have not (yet) been demonstrated to be real. Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011. PMID: 21623353.
11. Challenges for transcript detection: Identifying isoforms is difficult. Transcript abundance is volatile. Most reads are not helpful (reads from exons) or even misleading (incompletely spliced precursor RNA). Genes can have many isoforms. Approaches: ignore isoforms; genome-guided reconstruction, e.g. Cufflinks; genome-independent reconstruction, e.g. Trinity. [Figure label: QBI data] Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011. PMID: 21623353.
12. Genome-guided reconstruction: use reads spanning splice junctions to assemble the transcript paths; work out the minimal possible set of paths so that all reads are visited (graph theory); if there is more than one such set, use read counts to pick the most probable. [Figure labels: reads aligned to the genome; isoforms] Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011. PMID: 21623353.
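The minimal-path-set idea can be sketched by brute force on a small splice graph. This is an illustration only and not the Cufflinks algorithm: exons are nodes, junction-spanning read counts are edge weights, "all reads visited" is simplified to "all observed junctions covered", and ties are broken by summing each path's weakest junction count. All names and counts are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Junction = Tuple[str, str]  # (upstream exon, downstream exon)


def all_paths(edges: Set[Junction], node: str) -> List[List[str]]:
    """Enumerate every path from `node` to an exon without outgoing junctions."""
    nexts = sorted(b for a, b in edges if a == node)
    if not nexts:
        return [[node]]
    return [[node] + rest for nxt in nexts for rest in all_paths(edges, nxt)]


def path_junctions(path: List[str]) -> List[Junction]:
    return [(path[i], path[i + 1]) for i in range(len(path) - 1)]


def minimal_isoform_set(junction_counts: Dict[Junction, int]) -> List[List[str]]:
    """Smallest set of paths covering all junctions; ties broken by read support."""
    edges = set(junction_counts)
    downstream = {b for _, b in edges}
    sources = sorted({a for a, _ in edges} - downstream)
    paths = [p for s in sources for p in all_paths(edges, s)]

    def support(path_set) -> int:
        # Assumption: a path is only as well supported as its weakest junction.
        return sum(min(junction_counts[j] for j in path_junctions(p)) for p in path_set)

    for size in range(1, len(paths) + 1):
        candidates = [c for c in combinations(paths, size)
                      if {j for p in c for j in path_junctions(p)} == edges]
        if candidates:
            return [list(p) for p in max(candidates, key=support)]
    return []


# Hypothetical junction read counts for a gene with two alternative regions.
counts = {("E1", "E2"): 30, ("E2", "E4"): 30, ("E4", "E5"): 28,
          ("E1", "E3"): 5, ("E3", "E4"): 5, ("E4", "E6"): 6}
isoforms = minimal_isoform_set(counts)
```

Two different pairs of paths cover all six junctions here; the read counts favour pairing the highly covered junctions into one isoform and the weakly covered ones into the other. The enumeration is exponential in the number of paths, so it only works for toy examples.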
13. Genome-independent reconstruction: break reads into k-mers and find their mutual overlaps to build a de Bruijn graph; find probable paths through the graph by using read counts; map the consensus assembly to the genome. Iyer MK, Chinnaiyan AM. RNA-Seq unleashed. Nat Biotechnol. 2011. PMID: 21747384.
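A minimal sketch of the first two steps: building a de Bruijn graph from read k-mers and following the best-supported edges. It assumes error handling by count alone and a known start node, and it is not Trinity's implementation; the function names and the toy reads are hypothetical. Mapping the resulting contig back to the genome is not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Dict[str, int]]:
    """Nodes are (k-1)-mers; each k-mer adds a prefix->suffix edge weighted by its count."""
    kmer_counts = Counter(read[i:i + k] for read in reads
                          for i in range(len(read) - k + 1))
    graph: Dict[str, Dict[str, int]] = defaultdict(dict)
    for kmer, count in kmer_counts.items():
        graph[kmer[:-1]][kmer[1:]] = count
    return graph


def greedy_contig(graph: Dict[str, Dict[str, int]], start: str) -> str:
    """Walk from `start`, always taking the edge with the highest k-mer count."""
    contig, node, seen = start, start, {start}
    while graph.get(node):
        nxt = max(graph[node], key=graph[node].get)
        if nxt in seen:  # stop instead of looping forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping reads from one toy transcript plus one read with an error.
reads = ["ATGGCGTG", "GGCGTGCA", "GTGCAAT", "CGTGAA"]
graph = build_de_bruijn(reads, k=4)
contig = greedy_contig(graph, "ATG")
```

At the branch after `GTG` the walk follows the edge seen in two reads rather than the one seen once, so the erroneous read does not end up in the contig. Alternative isoforms would show up as branches with comparable counts, which a greedy walk cannot resolve on its own.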
14. Which method? De novo methods are very computationally intensive. However, they are able to find alternative isoforms and promoters, and structural variation. [Figure legend: deletions (yellow), chimeras (green)] Iyer MK, Chinnaiyan AM. RNA-Seq unleashed. Nat Biotechnol. 2011. PMID: 21747384.
15. What are real transcripts? Even the most sophisticated computational method can’t tell you what is a real transcript. [Figure panels: Roberts et al.; QBI data] Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011. PMID: 21697122.
16. Solution: biological replicates. Significant findings (here: new isoforms) in small sample sets can be due to technical errors, biological variability, or population outliers. Sequencing experiments are subject to the same issues (even though they are more expensive than arrays). Replicates are necessary to build confidence in your results! Hansen KD, Wu Z, Irizarry RA, Leek JT. Sequencing technology does not eliminate biological variability. Nat Biotechnol. 2011. PMID: 21747377.
17. Three things to remember: Methods for analyzing RNAseq data are not yet as mature as expression array analysis tools. Identifying transcript isoforms is especially difficult. Replicates are crucial to account for biological variability.
18. Next Week: Abstract: This session will follow up on transcript quantification of RNAseq data and discuss statistical means of identifying differentially regulated transcripts and isoforms, and contrast these with microarray analysis approaches.
In DSN normalization, double-stranded cDNA is denatured, then allowed to partially re-anneal, and the most abundant species, which re-anneal most rapidly, are digested with crab duplex-specific nuclease.