Comparison between RNASeq and Microarray for Gene Expression Analysis
1. Yaoyu E. Wang, Ph.D
Center for Cancer Computational Biology, DFCI
SPECSII webinar
June 05, 2013
2. - Transcriptome profiling represents a static gene expression
state of a biological sample across the genome
- Allows for direct genomic comparisons across multiple samples
to determine genes that exhibit differential expression in
different states (e.g. normal vs. tumor)
- Allows for hypothesis generation on molecular abnormalities
and mechanisms that may contribute to the tumor phenotype
- Provides information on molecular subtypes, the development
of prognostic and predictive molecular signatures
- Two main technologies:
a. Microarray
b. RNA-Sequencing (RNASeq) using next generation
sequencing
4. Blencowe B J et al. Genes Dev. 2009;23:1379-1386
Illumina HiSeq
5. .bcl files: raw files containing base calls and quality scores
CASAVA processing
•Demultiplexing
•Fastq file generation
•Sequencing filtering (Illumina defined quality filters)
Split into Project and Sample Folders, each sample with its own Fastq files
Jones_Lab: ChIP_A, ChIP-B
Marcus_Lab: RNA-SeqA, RNA-SeqB, RNA-SeqC
Williams_Lab: Exome1, Exome2
6. Haas & Zody. Nature Biotechnology 28, 421–423 (2010)
Using known
annotations
And compare to
known annotations
•Differential Expression
•Differential Isoform Abundance
•RNA editing
•SNP, indel detection
7. Technology | RNASeq | Microarray
High run-to-run reproducibility | Yes | Yes
Dynamic range | Comparable to actual transcript abundance (>8000-fold) | Hundred-fold
Able to detect alternative splice sites and novel isoforms | Yes | No
De novo analysis of samples without reference genome | Yes | No
Multiplexing samples in one run | Yes | No
Required amount of total RNA | >100 ng | ~1 µg
Re-analyzable data | Yes | No
8. Technology | RNASeq | Microarray
Heterogeneity of read coverage across an expressed region | Yes | No
Well understood sources of experimental bias | No | Yes
Data portable on a flash drive (~4 GB) | No | Yes
Data is analyzable by any PC | No | Yes
Cheaper cost per sample | No(?) | Yes(?)
13. Mooney M, PLoS ONE (2013)
10 Lymphoma (3 T-cell, 7 B-cell)
4 Normal lymph node
Total RNA
PE100 run
50-100 million
mapped reads
Compare 15,092 annotated genes on chip
17. RNA-Seq and tiling arrays
Tiling Array
Microarray
Maximum
Sensitivity
RNASeq 11-plex
RNASeq 6-plex
Agarwal, BMC Genomics (2010)
18.
19. Per Sample Microarray Illumina HiSeq
1 per Chip/Lane $670 $4,010.00
2-plex NA $2,097.50
4-plex NA $1,141.25
6-plex NA $822.50
8-plex NA $663.13
6-plex
11-plex
20. Per Sample Microarray Illumina HiSeq
1 per Chip/Lane $670 $4,010.00
2-plex NA $2,097.50
4-plex NA $1,141.25
6-plex NA $822.50
8-plex NA $663.13
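The per-sample HiSeq prices in the table above fit a simple pattern: a fixed per-sample amount plus one lane cost shared by every sample in the lane. The sketch below assumes a $3,825 lane cost and a $185 fixed per-sample cost. Both values are back-calculated from the table and are not stated in the slides, and the function names are illustrative only.

```python
import math

# Assumed values, back-calculated from the table (not stated in the slides).
LANE_COST = 3825.00        # shared cost of one HiSeq lane
PER_SAMPLE_COST = 185.00   # fixed cost per sample, e.g. library preparation
MICROARRAY_COST = 670.00   # per-sample microarray price from the table

# Per-sample HiSeq prices as listed in the table, keyed by plex level.
TABLE = {1: 4010.00, 2: 2097.50, 4: 1141.25, 6: 822.50, 8: 663.13}


def rnaseq_cost_per_sample(plex: int) -> float:
    """Per-sample cost when `plex` samples share one lane."""
    if plex < 1:
        raise ValueError("plex must be at least 1")
    return PER_SAMPLE_COST + LANE_COST / plex


def break_even_plex(target: float = MICROARRAY_COST) -> int:
    """Smallest plex level at which RNASeq costs no more than `target`."""
    if target <= PER_SAMPLE_COST:
        raise ValueError("target is below the fixed per-sample cost")
    return math.ceil(LANE_COST / (target - PER_SAMPLE_COST))


if __name__ == "__main__":
    for plex, listed in TABLE.items():
        model = rnaseq_cost_per_sample(plex)
        print(f"{plex}-plex: model {model:.2f}, table {listed:.2f}")
    print("break-even plex vs. microarray:", break_even_plex())
```

Under these assumptions the model agrees with every row of the table to within a cent, and RNASeq stops being more expensive than the microarray at 8-plex.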
21.
22. Platform | Data per sample | Time to download 1 sample | Time to download 100 samples | Cost to store on the cloud per month
RNASeq | 30-65 GB | 1 hr | 6 days | $270
Microarray | 30 MB | 5 seconds | 8 minutes | $0.30
http://www.ncbi.nlm.nih.gov/genbank/statistics
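The download and storage figures above can be approximated with back-of-envelope arithmetic. The sketch below is only an illustration: the 6 MB/s transfer rate is inferred from the microarray row (30 MB in 5 seconds), and the $0.10 per GB per month price is a hypothetical value chosen because it reproduces the $0.30 microarray figure if the cost column covers 100 samples. Neither number is stated in the slides.

```python
# Assumed values: neither is stated in the slides.
RATE_MB_PER_S = 30 / 5        # implied by "30 MB in 5 seconds"
PRICE_PER_GB_MONTH = 0.10     # hypothetical cloud storage price in USD


def download_seconds(size_mb: float, n_samples: int = 1) -> float:
    """Time to download n_samples files of size_mb each at the assumed rate."""
    return size_mb * n_samples / RATE_MB_PER_S


def monthly_storage_usd(size_gb: float, n_samples: int) -> float:
    """Monthly cost of storing n_samples files of size_gb each."""
    return size_gb * n_samples * PRICE_PER_GB_MONTH


if __name__ == "__main__":
    # Microarray: 30 MB (0.03 GB) per sample.
    print("microarray, 100 samples:",
          round(download_seconds(30, 100) / 60, 1), "min,",
          round(monthly_storage_usd(0.03, 100), 2), "USD/month")

    # RNASeq: 30 to 65 GB per sample, taking 1 GB as 1000 MB.
    for size_gb in (30, 65):
        hours = download_seconds(size_gb * 1000) / 3600
        days = download_seconds(size_gb * 1000, 100) / 86400
        cost = monthly_storage_usd(size_gb, 100)
        print(f"RNASeq at {size_gb} GB: {hours:.1f} h per sample, "
              f"{days:.1f} days and {cost:.0f} USD/month for 100 samples")
```

With these assumptions the microarray row is reproduced closely (about 8 minutes and $0.30 for 100 samples), while the RNASeq estimates land in the same range as the slide: roughly 1.4 to 3 hours per sample, 6 to 12.5 days for 100 samples, and a few hundred dollars per month.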
23. - Applications with a user interface for RNA-Seq analysis (e.g. Galaxy) can only
handle very few samples
- Knowledge of a Linux server, a scripting language, and a programming language is
absolutely REQUIRED
- Lack of detailed understanding of NGS technology and data leads to
diverse bioinformatics tools with different characteristics
Law WC, Voom!, Bioconductor (2013)
24.
25. The answer is Yes
- Transcriptome profiles generated by microarray and RNASeq
are in strong concordance
- Microarray data generated in the last decades is durable
- RNASeq offers a lot more biological information
than microarray, and its data is re-analyzable
- NGS is getting cheaper
However, the devil is in the data
- NGS data is a lot more expensive to store and analyze
- Specialized computing infrastructure and personnel are
required to take advantage of the information from NGS data
Editor's notes
The basic concept behind the use of GeneChip arrays for gene expression is simple: labeled cDNA or cRNA targets derived from the mRNA of an experimental sample are hybridized to nucleic acid probes attached to the solid support. By monitoring the amount of dye label associated with each DNA location, it is possible to infer the abundance of each mRNA species represented. For transcriptome profiling, the input is usually about 1 µg of total RNA, which is poly(A)-selected to ensure only mature mRNA is being assayed.
Poly(A)+ mRNA is purified, fragmented, and then converted to a cDNA library with 5′ and 3′ adapter sequences. Short sequence reads are generated from the cDNA library. Normally, reads are mapped to previously annotated known transcripts, and the pile of unmapped reads is kept. Reads can also map to novel expressed sequences, including alternative exons and the corresponding splice junction sequences.
Two RNA sample types, MAQC brain and Universal Human Reference RNA, were processed using 5 technical replicates on both microarray and RNA-Seq. Once the data were generated, the microarray data were processed using MAQC. For RNA-Seq, the sample cDNA libraries were prepared with the Illumina protocol and sequenced to a depth of ~30 million mapped reads.
This is the scatter plot of technical replicates of the samples analyzed by RNA-Seq and microarray. The false positive rates are comparable between the two methods, and both methods have extremely high correlation between replicates (R>0.99). The plots demonstrate that RNA-Seq identifies more genes and spans a wider dynamic range compared to the microarray.
Scatterplot of fold change per gene as measured by RNASeq and microarray. Genes identified as differentially expressed by both platforms are plotted in red, genes identified only by RNASeq in blue, only by microarray in yellow, and by neither in green. While the correlation between the two platforms in identifying differentially expressed genes is high, this figure clearly indicates a discrepancy between the platforms in their ability to identify genes as differentially expressed. The gene subset segmentation reveals that RNA-Seq counts identified significantly more differentially expressed genes. However, microarray does detect gene expression differences. In further validation on a subset of 1000 genes for which PCR data are available, RNASeq data show higher concordance with PCR results than microarray data.
A study by Mooney et al. used a paired RNA sequencing (RNA-Seq)/microarray analysis of a set of 4 normal canine lymph nodes and 10 canine lymphoma fine needle aspirates to identify technical biases and variation between the technologies, comparing the 15,092 annotated genes on the chip.
Both RNA-Seq and microarray observations provide present detection calls for 15,092 genes in each of the 14 samples. The percent present detection calls provided by the two technologies agreed with high frequency (73%) and were statistically associated (Table 3; p < 10^-15, odds ratio > 40). Among genes probed by both methods, percent present detection frequencies of 69% and 44% were obtained by RNA-Seq and microarray, respectively. Among genes called present using microarray, over 97% were detected using RNA-Seq.
Variation among expression profiles obtained using RNA-Seq is similar to that obtained using microarray after removing contributions of the first surrogate variable [42]. Each letter denotes a sample from a dog having a normal (N), B-cell (B), or T-cell (T) diagnosis as in the legend, with subscript 'm' run on the microarray platform and subscript 'r' run on the NGS platform. a) Principal component scores b) Hierarchical clustering
Here, we compare these two platforms using matched samples of poly(A)-enriched RNA isolated from C. elegans at the second larval stage (L2) and the young adult (YA) stage. The plot covers all genes; each point represents a gene from the composite model. RNA-Seq expression levels per gene were measured using RPKM, and tiling array levels were measured using the mean intensity of probes falling within composite exons. The Spearman's coefficient is 0.90, indicating that the platforms correlate well on identical samples. The disproportionate number of genes in the upper left likely represents cross-hybridization.
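For readers unfamiliar with the RPKM measure mentioned in this note, here is a minimal sketch of the standard definition (reads per kilobase of transcript per million mapped reads). The function name and the example counts are hypothetical and are not taken from the study.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads),
    which normalizes a raw count for both gene length and sequencing depth.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical counts: the same 500 reads on a 1 kb gene and a 2 kb gene,
    # in a library of one million mapped reads.
    print(rpkm(500, 1000, 1_000_000))
    print(rpkm(500, 2000, 1_000_000))
```

The same read count on a gene twice as long yields half the RPKM, which is why the measure allows expression levels to be compared across genes of different lengths.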
Differential expression of genes between the L2 and YA stages. (a) Correlation of log2(YA/L2) ratios between RNA-Seq and tiling arrays. Black: not significantly differentially expressed between samples. Blue: significantly differentially expressed (q ≤ 0.01). The ratio of expression levels is well-correlated, but RNA-Seq has a larger dynamic range. (b) Venn diagram of genes called differentially expressed by each platform. There is significant overlap (8,976) between the two platforms, but more genes were called differentially expressed by RNA-Seq (14,201) than by tiling arrays (10,283), likely reflecting its greater dynamic range. A total of 4,326 genes were not called differentially expressed by either technology.
ROC curve analysis. Black: tiling array. Red: RNA-Seq with all 32 million reads. It is evident that the RNA-Seq substantially outperforms the tiling array with consistently higher sensitivity at lower FPR. Remaining curves are for RNA-Seq with only a subset of reads utilized. At an FPR = 0.05, just 4 million reads (blue) are required to attain the same sensitivity as two tiling array replicates.