RNA Sequencing for Full Length Transcript Discovery
1. RNA-Sequencing for Full-length Transcript Discovery
Lab Meeting
2/10/14
Anne Deslattes Mays
Mentor: Anton Wellstein, MD, PhD
Special Recognition: Marcel Schmidt, PhD
4/18/2014
Wellstein/Riegel Laboratory, Lombardi
Cancer Center, Washington DC 20007
2. Discovery of homing gene fragments using bone marrow-derived monocytes
Questions:
1. Which proteins drive organ homing of hematopoietic cells?
2. Are there distinct homing proteins for diseased organs (cancer, wound healing, ischemia, infection)?
Approaches:
1. Use a human bone marrow (BM) cDNA library that displays large proteins from bone marrow and precursor cells on the phage surface
2. In vivo selection of homing proteins from target organs or vessels in animal models (normal or diseased)
3. This approach selects for gene fragments coding for homing proteins
full length transcripts
from source material
3. Experimental Objective
We aim to identify the full-length transcripts using 2nd and 3rd generation
sequencing methods for genes whose fragments were discovered through the
phage display experiments nearly a decade ago.
9. Four Sequencing Experiments
Second Generation Sequencing
1. Total.bm.random – total bone marrow sequenced mate paired
non-strand specific randomly primed ~ 180 million reads
11. Experiment 1 Results
Genome-aligned (TopHat (Bowtie2)/Cufflinks) and de novo (Trinity (GSNAP & BLAT)) assemblies using the read information
Wellstein Genome – created a sub-genome from excised regions around the phage, in the hope of discovering the underlying isoform and gene structure
BLAT/BLAST of the short reads against this region still left open questions:
• Hits that included phage gave ambiguous information about isoforms and gene structure
• Structure of the transcript was not clear
• Strand of the aligned reads was not clear
Next Steps
• Design another experiment on the same cell population, this time targeted (including the original phage primers often used in both the lineage-negative and total bone marrow experiments) and strand specific
• Create a custom long-transcript library primed to include full-length phage transcripts
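The "sub-genome" idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lab's actual pipeline: the function names `excise_region` and `to_fasta`, the flank size, and the 0-based half-open coordinates are all hypothetical choices made for the example.

```python
def excise_region(chrom_seq, hit_start, hit_end, flank):
    """Cut out the phage hit plus `flank` bases on each side.

    Coordinates are 0-based, half-open (an assumption for this sketch).
    Returns the offset of the excised region and its sequence, so that
    alignments against the sub-genome can be mapped back to the chromosome.
    """
    if not 0 <= hit_start < hit_end <= len(chrom_seq):
        raise ValueError("hit coordinates fall outside the chromosome")
    start = max(0, hit_start - flank)            # clip at the chromosome start
    end = min(len(chrom_seq), hit_end + flank)   # clip at the chromosome end
    return start, chrom_seq[start:end]


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy chromosome; a real run would read the sequence from a genome FASTA.
    offset, region = excise_region("ACGTACGTAC", 4, 6, 2)
    print(offset, region)
    print(to_fasta("phage_region", region), end="")
```

Keeping the offset alongside the excised sequence is the design choice that matters here: any read aligned to the small reference can be translated back to genome coordinates by adding the offset.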
13. Random RNA-Sequencing vs Strand-specific Targeted RNA-sequencing
15. Initial G12 Gene Model from the Total Bone Marrow
16. Design targeted primers and create custom long reaction cDNA library
17. Results and pre-sequencing fragmentation
18. Experiment 2 Results
Genome-aligned (TopHat (Bowtie2)/Cufflinks) and de novo (Trinity (GSNAP & BLAT)) assemblies using the read information
Wellstein Genome – created a sub-genome from excised regions around the phage, in the hope of discovering the underlying isoform and gene structure
BLAT/BLAST of the short reads against this region still left open questions:
• Hits that included phage gave ambiguous information about isoforms and gene structure
• Strand information was now known, and yet
• Structure of the transcript was not clear
• Was it the depth? Was it the cell population? Was it mistargeted regions?
Next Steps
• Design another experiment, now looking only at the lineage-negative cell population, where the phage are known to be enriched
• Return to randomly primed reads
• Sequence at a depth similar to the original total bone marrow experiment (100 million reads)
19. Four Sequencing Experiments
Second Generation Sequencing
1. Total.bm.random – total bone marrow non-strand specific
randomly primed ~ 180 million reads
2. Total.bm.ss.targeted – total bone marrow strand specific targeted
primed to a depth ~ 20 million reads
3. Lin.neg.ss.random – lineage-negative strand specific randomly
primed ~ 111 million reads
21. Negative Selection:
Human Progenitor Cell Enrichment Kit with Platelet Depletion
to Isolate the Lineage Negative sub population from total bone marrow
24. Negative Control: CD14 (should be highest in Total Bone Marrow)
Peak read count: 109
Peak read count: 6318
Peak read count: 48
Peak read count: 21
25. Negative Control: CD34 (should be highest in Lineage Negative)
Peak read count: 169
Peak read count: 43
Peak read count: 386
Peak read count: 10
26. What’s Wrong With Illumina Reads
Uniformity of Read Coverage*
• An aligned read can be represented as an integer point in R^2 as follows: the ‘t-coordinate’ corresponding to the read is its left-end point, while the ‘l-coordinate’ is the length of the fragment. In Evans et al. (2010), it is shown that for any choice of fragment length distribution, the collection of points {(t, l)} from a sequencing experiment forms a two-dimensional Poisson process. This principle guides our further analysis of these points {(t, l)}, as we test for uniformity in both the t and l coordinates. The output of ReadSpy is a list of test statistics and P-values for each transcript. A statistically significant (low) P-value means we reject the fact that the dataset is uniform on that transcript. Thus, a higher P-value corresponds to a set of reads sampled uniformly, which is desired. In the next two sections, we describe the statistical test applied to each transcript. The test is formulated in terms of the genomic segment [a, b].
*Hower, Valerie, Richard Starfield, Adam Roberts, and Lior Pachter. "Quantifying uniformity of mapped reads." Bioinformatics 28, no. 20 (2012): 2680-2682.
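To make the idea of "testing for uniformity on a segment [a, b]" concrete, here is a minimal sketch that tests only the t-coordinates (read start positions) with a one-sample Kolmogorov-Smirnov test. This is a swapped-in, textbook technique chosen for illustration; it is not claimed to be the statistic ReadSpy computes, and the function name `ks_uniform_starts` is hypothetical.

```python
import math


def ks_uniform_starts(starts, a, b):
    """Kolmogorov-Smirnov test of read start positions against Uniform[a, b].

    Returns (D, p): the KS statistic and an asymptotic P-value.
    A low P-value suggests the starts are not uniform on [a, b].
    """
    n = len(starts)
    if n == 0 or b <= a:
        raise ValueError("need at least one read and a non-empty segment")
    # Rescale positions to [0, 1], where the uniform CDF is F(x) = x.
    xs = sorted((s - a) / (b - a) for s in starts)
    d = 0.0
    for i, x in enumerate(xs):
        # Largest gap between the empirical CDF and the uniform CDF.
        d = max(d, (i + 1) / n - x, x - i / n)
    # Asymptotic Kolmogorov distribution with a small-sample correction.
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        return d, 1.0  # the series below converges poorly here; p is ~1
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


if __name__ == "__main__":
    even = list(range(0, 1000, 10))   # starts spread across the segment
    piled = list(range(100))          # starts piled up at the 5' end
    print(ks_uniform_starts(even, 0, 1000))
    print(ks_uniform_starts(piled, 0, 1000))
```

In this toy run the evenly spread starts give a high P-value and the piled-up starts give a P-value near zero, which matches the reading on the slide: a low P-value rejects uniformity.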
28. Experiment 3 Results
Genome-aligned (TopHat (Bowtie2)/Cufflinks) and de novo (Trinity (GSNAP & BLAT)) assemblies using the read information
Wellstein Genome – created a sub-genome from excised regions around the phage, in the hope of discovering the underlying isoform and gene structure
BLAT/BLAST of the short reads against this region still left open questions:
• Hits that included phage gave ambiguous information about isoforms and gene structure
• Strand information was now known, and yet
• Enrichment in the population is evident
• Unambiguous structure of the phage transcripts is still not clear
• Finding known genes can be done; even de novo assembly of novel transcripts is done on a regular basis
• But with these phage, only a fragment is known -- how do we find the full-length structure of this phage?
• What if the phage transcripts were in the targeted full-length library but were lost in the fragmentation? Is there a way to sequence without fragmentation?
Next Steps
• Use new 3rd generation technology to do full-length transcript sequencing without fragmentation
29. Source: Iso-Seq webinar by Liz Tseng, Pacific Biosciences
https://github.com/PacificBiosciences/cDNA_primer/wiki/Understanding-PacBio-transcriptome-data
30. Four Sequencing Experiments
Second Generation Sequencing
1. Total.bm.random – total bone marrow sequenced non-strand
specific randomly primed ~ 180 million reads
2. Total.bm.ss.targeted – total bone marrow sequenced strand
specific targeted primed to a depth ~ 20 million reads
3. Lin.neg.ss.random – lin- sequenced strand specific randomly
primed ~ 111 million reads
Third Generation Sequencing
4. Lin.neg Pac Bio Long reads –
6 million CCS Filtered SubReads ~ 277,000 readsOfInserts
39. Phage: B9 10x larger region (~9kb) centered on phage evidence
Peak read count: 10
Peak read count: 16
Peak read count: 10
Peak read count: 10
40. Reports for Job readsofinsert: Reads Of Insert
[Plots: Read Length Of Insert; Read Quality Of Insert; Number Of Passes]
Movie                                                           Reads Of Insert  Read Bases Of Insert  Mean Read Length Of Insert  Read Accuracy Of Insert  Mean Number Of Passes
m131214_160008_42177R_c100597152550000001823102305221422_s1_p0  47,762           61,257,390            1,282                       97.96%                   11.01
m131212_234151_42177R_c100597412550000001823102305221473_s1_p0  23,360           33,092,110            1,416                       98.39%                   11.65
m131214_092100_42177R_c100597152550000001823102305221420_s1_p0  36,623           59,671,472            1,629                       98.41%                   10.78
m131214_124034_42177R_c100597152550000001823102305221421_s1_p0  49,710           63,809,739            1,283                       98.04%                   11.26
m131213_232025_42177R_c100597412550000001823102305221475_s1_p0  30,720           37,357,905            1,216                       97.49%                   10.75
m131213_030106_42177R_c100597412550000001823102305221474_s1_p0  24,284           34,943,462            1,438                       98.49%                   11.85
m131214_060132_42177R_c100597412550000001823102305221477_s1_p0  32,492           39,813,943            1,225                       97.49%                   10.54
m131214_023937_42177R_c100597412550000001823102305221476_s1_p0  32,210           39,536,384            1,227                       97.57%                   10.74
Generated by SMRT® Portal. Thu Feb 06 13:30:44 UTC 2014
Source: self-install SMRT Portal – reads of insert
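As a quick cross-check of the report, the per-movie counts can be totalled. The sketch below copies the eight (reads of insert, read bases of insert) pairs from the table and sums them; the variable names are hypothetical and nothing here comes from SMRT Portal itself.

```python
# (reads of insert, read bases of insert) for each of the eight movies,
# copied from the SMRT Portal report above.
MOVIES = [
    (47762, 61257390),
    (23360, 33092110),
    (36623, 59671472),
    (49710, 63809739),
    (30720, 37357905),
    (24284, 34943462),
    (32492, 39813943),
    (32210, 39536384),
]

total_reads = sum(reads for reads, _ in MOVIES)
total_bases = sum(bases for _, bases in MOVIES)
overall_mean_length = total_bases / total_reads

if __name__ == "__main__":
    print("total reads of insert:", total_reads)
    print("total bases:", total_bases)
    print("overall mean read length: %.1f" % overall_mean_length)
```

The total comes to 277,161 reads of insert, which is the denominator used in the primer summaries on the following slides.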
42. Summary of reads.
------ 5' primer seen summary ----
Per subread: 258835/277161 (93.4%)
Per ZMW: 258835/277161 (93.4%)
Per ZMW first-pass: 258835/277161 (93.4%)
------ 3' primer seen summary ----
Per subread: 1361/277161 (0.5%)
Per ZMW: 1361/277161 (0.5%)
Per ZMW first-pass: 1361/277161 (0.5%)
------ 5'&3' primer seen summary ----
Per subread: 1341/277161 (0.5%)
Per ZMW: 1341/277161 (0.5%)
Per ZMW first-pass: 1341/277161 (0.5%)
------ 5'&3'&polyA primer seen summary ----
Per subread: 18/277161 (0.0%)
Per ZMW: 18/277161 (0.0%)
Per ZMW first-pass: 18/277161 (0.0%)
------ Primer Match breakdown ----
F0/R0: 258855 (100.0%)
Source: output of summarize_results.py (Liz Tseng)
43. But this is not good: the primers were chosen incorrectly. The best way to find the primers actually used is as follows:
>cat reads_of_insert.fasta | grep -A1 "AAAAAAAAAAAAAAAAA" | more
GGCTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT
--
AACATTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTAACTCTGCGTTGATACCACTGCTT
--
TGTTTTATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT
--
TTACAATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT
--
GAGCCCTTACCGAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT
--
GTGGTGATTGTTTACTAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT
--
GACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT
--
TTTCCCGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT
--
CTTACTTACGTAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT
--
GCCCCATCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT
>cat reads_of_insert.fasta | grep -A1 "TTTTTTTTTTTT" | more
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATTTGGCTTGAT
--
AAGCAGTTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTGATTTCCAT
--
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACTTGGGATCTTT
--
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATTTTTTTTTTTTTT
--
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACCCATCAGCG
--
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTGGTATTTGTTTGTTTCTG
--
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTATTTTTTTTTTTTTTTTT
--
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGACATAAACAC
--
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACTAAGCATATT
T
Now my primers are:
>F0
AAGCAGTGGTATCAACGCAGAGTAC
>R0
GTAACTCTGCGTTGATACCACTGCTT
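The grep search above can also be sketched in Python: find a long poly-A run in each read and tally the bases that follow it, so the most frequent tail stands out as the candidate 3' primer. This is a hedged illustration only. The function names, the 17-base minimum run (taken from the grep pattern), the 25-base tail length, and the three example reads are assumptions made for the sketch, not part of summarize_results.py or any PacBio tool.

```python
import re
from collections import Counter


def tally_tails_after_polya(reads, min_run=17, tail_len=25):
    """Count the sequences that immediately follow a poly-A run in each read."""
    polya = re.compile("A{%d,}" % min_run)
    tails = Counter()
    for read in reads:
        read = read.upper()
        match = polya.search(read)
        if match:
            # Bases right after the (greedy) poly-A run.
            tail = read[match.end():match.end() + tail_len]
            if tail:
                tails[tail] += 1
    return tails


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


if __name__ == "__main__":
    # Illustrative reads modelled on the grep output, not real data.
    reads = [
        "GGCTT" + "A" * 30 + "GTACTCTGCGTTGATACCACTGCTT",
        "TTTCCCGC" + "A" * 28 + "GTACTCTGCGTTGATACCACTGCTT",
        "AACATTG" + "A" * 29 + "GTAACTCTGCGTTGATACCACTGCTT",
    ]
    tails = tally_tails_after_polya(reads)
    best_tail, count = tails.most_common(1)[0]
    print(best_tail, count)
    print(reverse_complement(best_tail))
```

Taking the reverse complement of the most common tail is a useful sanity check, because the sequence after a poly-A run should mirror the sequence that precedes the poly-T run at the other end of the insert.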
44. Summary of reads with the corrected primers
------ 5' primer seen summary ----
Per subread: 256672/277161 (92.6%)
Per ZMW: 256672/277161 (92.6%)
Per ZMW first-pass: 256672/277161 (92.6%)
------ 3' primer seen summary ----
Per subread: 208877/277161 (75.4%)
Per ZMW: 208877/277161 (75.4%)
Per ZMW first-pass: 208877/277161 (75.4%)
------ 5'&3' primer seen summary ----
Per subread: 207111/277161 (74.7%)
Per ZMW: 207111/277161 (74.7%)
Per ZMW first-pass: 207111/277161 (74.7%)
------ 5'&3'&polyA primer seen summary ----
Per subread: 100863/277161 (36.4%)
Per ZMW: 100863/277161 (36.4%)
Per ZMW first-pass: 100863/277161 (36.4%)
------ Primer Match breakdown ----
F0/R0: 258438 (100.0%)
Source: output of summarize_results.py (Liz Tseng)
49. Phage 14-10: 100% identity and alignment to 19 full-length reads of insert
[Figure: UCSC Genome Browser view (hg19, 200-base scale) around chr11:1,774,050-1,774,450. The "Your Sequence from Blat Search" track shows phage 14-10 and 19 CCS reads of insert, alongside MOB2 and CTSD in the gene tracks and the standard UCSC tracks (RefSeq Genes, human mRNAs, spliced ESTs, ENCODE H3K27Ac, DNaseI hypersensitivity clusters, transcription factor ChIP-seq, 100-vertebrate conservation, dbSNP 138, RepeatMasker).]
50. Phage 14-10: 100% aligned to CTSD, 2 possibly 3 splice variants in lineage negative cell population – structure fully resolved
[Figure: UCSC Genome Browser view (hg19, 5 kb scale) around chr11:1,775,000-1,785,000. The "Your Sequence from Blat Search" track shows phage 14-10 and the same 19 CCS reads of insert, alongside MOB2, IFITM10 and CTSD in the gene tracks and the standard UCSC tracks (RefSeq Genes, human mRNAs, spliced ESTs, ENCODE H3K27Ac, DNaseI hypersensitivity clusters, transcription factor ChIP-seq, 100-vertebrate conservation, dbSNP 138, RepeatMasker).]
51. Conclusions:
• Full-length transcript discovery is achieved with the Pacific Biosciences RS sequencer, using size selection in library preparation prior to sequencing and the Reads Of Insert algorithm
• Even before the release of the ReadsOfInsert approach, the subreads produced by the sequencing could already reveal the structure of the complete transcript
• An error rate of 15% seems daunting, but the random nature of the errors and the length of the reads provided the complete structure in a way that no short-read second generation sequencing could
• When one is searching for the complete structure, perfection in the parts is of no consequence
• NO ASSEMBLY is REQUIRED
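Why random errors are less daunting than a 15% error rate sounds can be illustrated with a toy calculation. The sketch below assumes each pass over a base is independently wrong with probability 0.15 and that the consensus is a simple majority vote. Both are simplifying assumptions for illustration: real PacBio errors are mostly insertions and deletions, and the Reads Of Insert consensus is not claimed to work this way. The function name is hypothetical.

```python
from math import comb


def majority_correct_probability(passes, per_pass_error):
    """Probability that a strict majority of independent passes is correct."""
    p_correct = 1.0 - per_pass_error
    needed = passes // 2 + 1  # smallest strict majority
    return sum(
        comb(passes, k) * p_correct ** k * per_pass_error ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    for passes in (1, 3, 5, 11):
        prob = majority_correct_probability(passes, 0.15)
        print("%2d passes: %.4f" % (passes, prob))
```

Under these assumptions a single pass is right 85% of the time, while about 11 passes (the mean number of passes in the reads of insert report) push the per-base majority well above 99%, because random errors rarely fall on the same base in most passes.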
52. Next Steps:
1. Complete the reads of insert approach with 75% accuracy and a minimum of 1 pass
2. Identify additional full-length structures (if possible with the sample reads)
3. Write up the results
4. (Next paper) If no additional phage are found, sequence an enriched population with confirmed phage evidence at full length in another PacBio sequencing run
5. Use Illumina reads to correct errors and recover more reads
6. Use greater PacBio sequencing depth
53. Acknowledgements
Dr. Anton Wellstein
Dr. Anna Riegel
Dr. Elena Tassi
Dr. Marcel Schmidt
The entire lab: Elena, Virginie, Ghada, Ivana, Eveline, Khalid, Khaled, Eric, Nitya, the entire
Wellstein/Riegel laboratory
My Committee
Dr. Yuri Gusev
Dr. Anatoly Dritschilo
Dr. Michael Johnson
Dr. Christopher Loffredo
Dr. Habtom Ressom
Dr. Terry Ryan (external committee member)
Robert Sebra, Mt. Sinai PacBio Sequencing
Liz Tseng, Pacific Biosciences
Eric Schadt, Mt. Sinai PacBio Sequencing
Brian Haas, author of the Trinity suite
54. CD11: New Evidence of an Exon From all Samples, confirmed by PacBio
Peak read count: 16
Peak read count: 1925
Peak read count: 639
Peak read count: 121
Fragments are important and interesting. Sequencing is cheap and should reveal our fragments: as shown, they are expressed at high levels relative to actin. For an annotation experiment, the recommendation is paired-end stranded sequencing.
Figure 5 – Random RNA-Seq vs Strand-Specific Targeted RNA-Sequencing. Panel A shows the typical RNA-seq experiment. It begins with cDNA library preparation from the tissue of choice, randomly primed, and includes second-strand cDNA synthesis. Panel B shows the steps in a strand-specific targeted RNA-sequencing experiment. Primers are targeted and the second-strand cDNA is not synthesized.
Figure 2 – Step 1 – Assemble known information. Both the novel transcript fragments discovered through phage display experiments and additional transcript data gathered from a random RNA-seq experiment were mapped to the genome. Step 2 – Create gene model. Step 3 – Primer design. Primers were designed to be unique to the genome, specific, and antisense to the gene. Step 4 – Perform targeted RNA-seq; this step involves fragmentation (see Figure 5). Step 5 – Reassemble the fragmented transcript data into full-length transcripts. Step 6 – Confirm the full-length transcript.
Figure 3 – Map phage cDNA fragment information together with random RNA-seq reads. In steps 1 and 2 of our workflow we map all known information to the genome and create a putative gene model. Mapping of short reads is a crucial and not always unambiguous step. Read mapping with BLAT and read mapping with Bowtie2 do not give identical results. The gene model in step 3 was created using BLAT-mapped reads. Abundance and known transcript information were used to select novel and specific transcript data to create our initial putative G12 gene model.
Figure 4 – Primer design and custom cDNA library creation. Primers were designed specific to the gene model created. Panel A shows G12.1, G12.2, G12.3, G12.4, G12.6, G12.7, G12.transcript.1, G12.transcript.2, G12.transcript.3. These are the primers that were designed to the G12 gene model. 119 primers were designed to 23 genes discovered in the initial surgical experiments. An average of 6 primers were designed to each of the genes, including the 3’-most putative exon. To create the custom targeted cDNA library, a pooling strategy was employed, separating chromosomes and primers to each of the genes in such a way that the reverse transcriptase reaction could occur as specifically as possible in 24 separate reactions (Panel B). The cDNA library was synthesized in a long reaction (> 12 hours) on sample freshly harvested from bone marrow with a RIN quality of greater than 9.
Figure 6 – Results and pre-sequencing fragmentation. Panel A shows the results from our long reverse transcriptase reaction (12-16 hours) in our cDNA library creation. On average, the transcripts are 3671 base pairs in length. Panel B shows the results from the pre-sequencing step. The purpose of this latter step is to fragment the full-length transcripts to an average length of 300 base pairs due to sequencing length limitations. The electropherogram reveals an average length of 333 base pairs for these fragments.