High Throughput Sequencing Technologies: What We Can Know
1. High Throughput Sequencing Technologies: What We Can Know
Brian Krueger, PhD
Duke University
Center for Human Genome Variation
2. 2nd Generation Sequencing Overview
[Figure: sequencing workflow]
Genomic DNA
Fragmented DNA
Ligate Adaptors
Bind Library and create clusters
Sequencing Cycle: Add Bases, Wash, Image, Cleave, Wash
Repeat hundreds of times on billions of clusters
Align reads to a reference genome
3. 2nd Generation Sequencing Advances
• V3 System Chemistry
– 300GB per Flowcell
– 11 Days to Data
– Genome: $4700, Exome: $790
• V4 System Chemistry
– 600GB per Flowcell
– 6 Days to Data
– Genome: $3000, Exome: $640
• X System Chemistry
– 1TB per Patterned Flowcell
– 3 Days to Data
– Genome: $1500, Exome: $500
4. Techniques for Acquiring Data
• Whole Genome Sequencing
– Obtain whole blood or tissue sample
– Create sequencing libraries of all DNA fragments
• Whole Exome Sequencing
– Utilizes a selection protocol to fish out ONLY coding
DNA sequences
– Create sequencing libraries from enriched DNA
– Reduces cost and analysis time
• Custom Capture
– Same protocol as Exome sequencing
– Only target desired DNA sequences
• Amplicon Sequencing
– Use PCR to amplify target DNA
– Sequence amplified DNA (Amplicon)
• RNA-Seq
– Extract RNA, capture mRNA, convert to cDNA
– Used for differential gene expression analyses, RNA
isoform detection
5. Common DNA Mutations
[Figure: a chromosome with segments A B C D, showing sequence variants and structural variants]
Sequence variants
• Reference: ATCGGGTCATGTCA
• Single nucleotide variant: ATCGGGTCATATCA
• Small insertion: ATCGGGTCATGACGTCA
• Small deletion: ATCGGGTCAT
Structural variants
• Reference: A B C D
• Deletion: A C D
• Duplication: A B C C D
• Inversion: A B D C
• Translocation: A B E F G
Credit: Elizabeth Ruzzo, PhD, CHGV
6. Disadvantages of Current Techniques
• Amplification errors
– All polymerases have an inherent error rate (10^-6 to 10^-7)
• GC bias
– PCR bias against GC rich sequences
– Exome capture bias against GC rich sequences
• Trouble detecting small insertions and deletions
– Capture baits may not hybridize well
– Capture cannot be used to reliably detect large CNVs
• Cannot be used for De novo assembly
– Read length too short to span long repeat regions
– Not good for detecting trinucleotide repeat
expansions
• Miss large structural variations
– Translocations and inversions likely will be missed
– Require significant read depth at break points for
these variations to be detected
• Trouble with RNA-seq isoform detection
– Like large structural variations, hard to accurately
detect all splice isoforms using short read technology
[Figure: repeat expansions (A B B B C D; A B B B B C D; A B B B B B B C D) and structural variants marked as missed (A C D; A E F G)]
7. Solutions!
• Solutions for many of these problems exist
– As always, come at a cost
• Whole Genome Sequencing - $1500
– Reduce Exome Artifacts
• Better Indel Detection and higher
coverage in high GC regions
• Can be used to detect large copy
number variations
• PCR Free Whole Genome Sequencing
– Reduces amplification bias and polymerase
error artifacts
• WGS will miss large structural variations
(Inversions, Translocations, microsatellites)
– Combine with long read technologies
– Added cost of $1000-$10,000
– Higher cost = better detection
8. Long-ish Read Sequencing Technologies
• Mate-Pair Sequencing
– Insert size increased from 300bp to 3-8KB
– Sequence ends of mate-pairs to pair reads
over much longer distances
– Use short reads to fill gaps
– Adds $1000 to Genome cost
9. Long-ish Read Sequencing Technologies
• Illumina Synthetic Long Reads
– Fragment Genomic DNA to 10KB
– Dilute across a 384 well plate
– Fragment clonal 10KB fragments into
300bp fragments and barcode
– Sequence fragments and use barcodes to
re-create the long reads synthetically
– Use as a short read scaffold to perform De
Novo sequencing
– Has been used in HLA sequencing and De
Novo assembly of the Drosophila genome
including accurate mapping of 80% of the
transposable elements
– Adds $1800 to Genome cost
[Figure: synthetic long read workflow: 10kb fragmentation, barcoding and clonal amplification, Nextera prep, sequencing]
10. True Long Read Sequencing Technologies
• Defined as single molecule sequencing
• Less complex sample prep and much longer read length
(1-100kb) compared to 200-400bp for 2nd Gen
• Two categories
– Sequencing by synthesis
• Pioneered by Pacific Biosciences
• Sequencer uses super microscopes and polymerase bound
nanowells to WATCH DNA as it is sequenced in real time
• Nanowells filled with DNA bases
• Fluorescence of base only detected at the polymerase
– Direct sequencing by passing DNA through a nanopore
• Bases fed through a membrane bound nanopore
• Ionic difference between both sides of the membrane
• Detect how ion flow changes at the pore as each base passes
through
• Oxford Nanopore, Base4, Stratos Genomics, Genia
• Bleeding edge technology
– Many technical hurdles with very high error rates (10-40%)
– Current best use is to create scaffolds for De Novo assembly
– Very expensive technology
• Costs 3-10x as much as Illumina to do whole genome
sequencing
[Figures: PacBio and Oxford Nanopore sequencing]
11. Questions??
• Reading/Viewing Material:
• Sequencing Methods Ecosystem -
http://res.illumina.com/documents/products/research_reviews/sequencing-methods-review.pdf
• Illumina TruSeq synthetic long-reads empower de novo assembly and resolve
complex, highly repetitive transposable elements -
http://biorxiv.org/content/early/2014/01/19/001834
• Characterization of the human ESC transcriptome by hybrid sequencing -
http://www.pnas.org/content/110/50/E4821.short
• Nanopore Sequencing Web Conference - http://www.youtube.com/watch?v=UtXlr19xTh8
Editor's notes
And while the title of my talk is High Throughput Sequencing Technologies: What We Can Know, a better subtitle would be “How much would you like to spend?” In many cases in genomics, the limitation on our knowledge isn’t the technology but how much money we can reasonably allocate to obtaining that knowledge.
First off, I wanted to start with a quick refresher. The dominant sequencing technology today is Illumina sequencing by synthesis, commonly referred to as second generation or “next” generation sequencing. In this system, genomic DNA is sheared to 300-400 base pair fragments. These fragments then undergo library prep to repair the sheared ends and add adaptors with known sequences that are used in the sequencing process. Library DNA is bound to a flowcell and clustered using bridge amplification, and then the sequencing cycle is initiated. In each cycle, fluorescently tagged nucleotides are added and a camera takes a picture of the flowcell. Because the clusters never change their position, we can determine the sequence of the bases by watching how the color of each cluster changes from cycle to cycle. This is done hundreds of times on billions of clusters to create the final sequenced short reads. This data is then aligned to a reference genome, and further downstream analyses are performed to pull out variants and structural defects.
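The cycle-by-cycle readout described above can be illustrated with a toy sketch. It assumes each cluster shows exactly one clean colour per cycle, and the colour names and the `COLOUR_TO_BASE` mapping are made up for illustration; real base callers work on raw intensity values and correct for signal cross-talk and phasing.

```python
# Toy illustration of a sequencing-by-synthesis readout.
# The colour names and the colour-to-base mapping are hypothetical.
COLOUR_TO_BASE = {"green": "A", "red": "C", "blue": "G", "yellow": "T"}


def call_reads(images):
    """Turn per-cycle images into one read per cluster.

    images[cycle][cluster] is the colour seen at a fixed cluster position.
    Because clusters do not move, reading one position across all cycles
    gives that cluster's sequence.
    """
    n_clusters = len(images[0])
    reads = []
    for cluster in range(n_clusters):
        bases = [COLOUR_TO_BASE[cycle_image[cluster]] for cycle_image in images]
        reads.append("".join(bases))
    return reads


images = [
    ["green", "red"],    # cycle 1: colours of cluster 0 and cluster 1
    ["yellow", "red"],   # cycle 2
    ["blue", "green"],   # cycle 3
]
reads = call_reads(images)
print(reads)
```

Each additional cycle adds one base to every read at once, which is why the same run scales to billions of clusters.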
There have been two exciting advances in this space over the past year. Prior to January, the chemistry run on the current iteration of the HiSeq system was version 3, which produced 300 gigabases of data per flowcell in 11 days. Using this chemistry, a 40X genome cost $4700 and a 100X exome ran about $800. In April of this year Illumina released its new version 4 chemistry, which generates twice as much data in about half the time. An added benefit of the new chemistry is that the reagent price dropped significantly, which resulted in a $1700 reduction in the cost of a genome and a $150 reduction in the cost of an exome. Illumina also announced this year that it was releasing a higher capacity system called the HiSeqX, which can generate twice as much data as a V4 system in half the time. This is achieved through the use of additional cameras for imaging and an improved clustering scheme. A genome on the HiSeqX costs about $1500 once bioinformatics is accounted for, and when Illumina allows these systems to be used for exomes, they should run about $500. Unfortunately, Illumina only allows the X systems to be purchased as a 10 pack, which limits access to this technology; however, our group should have one of these systems in place early next year through a collaborative endeavor.
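To put the flowcell yields in context, here is a back-of-the-envelope sketch of how many 40X genomes one flowcell can hold. The 3.1 gigabase genome size is an approximation that does not come from the talk, and the calculation ignores reads lost to quality filtering and duplicates, so treat the results as upper bounds.

```python
# Rough coverage arithmetic. The 3.1 Gb genome size is an approximation
# (an assumption, not a figure from the talk).
GENOME_SIZE_GB = 3.1      # approximate human genome size in gigabases
TARGET_COVERAGE = 40      # the "40X genome" quoted in the talk


def genomes_per_flowcell(flowcell_yield_gb, coverage=TARGET_COVERAGE):
    """Number of whole genomes one flowcell can cover at the given depth."""
    return int(flowcell_yield_gb // (GENOME_SIZE_GB * coverage))


v3 = genomes_per_flowcell(300)   # V3 chemistry: 300 Gb per flowcell
v4 = genomes_per_flowcell(600)   # V4 chemistry: 600 Gb per flowcell
print(v3, v4)
```

Doubling the yield per flowcell doubles the number of genomes per run, which is one reason the per-genome price falls with each chemistry release.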
While I only listed genome and exome prices on the previous slide, we also perform a wide variety of other assays in the sequencing lab. Whole genome sequencing is done by acquiring genomic DNA from whole blood or tissue and sequencing those libraries directly, while whole exome sequencing uses a selection protocol in which DNA baits pull out only the coding DNA sequences. Over the past year we have also performed over 10,000 custom capture sequencing runs. This process is similar to whole exome sequencing except that the capture size is smaller and only targets desired genomic locations. A lot of companies are now calling these custom panels. We have performed small amplicon sequencing projects in the past, but in most cases they’re expensive and don’t offer much of an added benefit over a custom capture. In the past year we’ve started doing more RNA-seq. This is typically combined with whole genome sequencing to see if we can correlate differential gene expression or differential isoform expression with genomic changes that occur outside the coding regions surveyed by exome sequencing.
Of course the reason we perform these assays is to determine the genomic make-up of a patient or study subject, with a focus on finding sequence variants. The techniques I mentioned on the previous slide have differing success in detecting these variations. For example, amplicon sequencing is best used for profiling SNVs or small indels in a small set of target amplicons, while custom capture and exome sequencing are best suited for discovering SNVs and small indels across the coding DNA. Unfortunately, the capture-based techniques will likely miss large structural variations such as deletions or duplications, due to a number of factors that I’ll discuss on the next slide. Whole genome sequencing does a better job of discovering indels along with deletions and duplications; however, because we use short reads that can’t span large regions, we will miss things like inversions, translocations or large repeat expansions.
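The structural variant classes from the slide can be written as simple operations on an ordered list of segments. This is only an illustrative sketch using the slide's A/B/C/D labelling; the function names are hypothetical, and a real inversion also reverse-complements the sequence inside the inverted region, which is not modelled here.

```python
# Segments follow the A/B/C/D labelling used on the slide.
reference = ["A", "B", "C", "D"]


def delete(segments, i):
    """Remove the segment at index i."""
    return segments[:i] + segments[i + 1:]


def duplicate(segments, i):
    """Repeat the segment at index i in place (tandem duplication)."""
    return segments[:i + 1] + segments[i:]


def invert(segments, i, j):
    """Reverse the order of segments i..j inclusive."""
    return segments[:i] + segments[i:j + 1][::-1] + segments[j + 1:]


def translocate(segments, i, other):
    """Replace everything after index i with segments from another chromosome."""
    return segments[:i + 1] + list(other)


print(delete(reference, 1))                        # deletion of B
print(duplicate(reference, 2))                     # duplication of C
print(invert(reference, 2, 3))                     # inversion of C-D
print(translocate(reference, 1, ["E", "F", "G"]))  # join to another chromosome
```

Note that the deletion and duplication change how much sequence is present, while the inversion and translocation only change its order, which is part of why the latter are hard to see from read depth.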
As I said, there are some major disadvantages to using the current sequencing technology. This isn’t meant to scare you, but it’s certainly something to keep in mind when reviewing your data. Second generation sequencing relies on a number of amplification steps, both after library preparation and while generating sequencing clusters. Because polymerases have an inherent error rate, an error could be introduced into the sequence every 1 to 10 million bases. The current techniques also suffer from a bias against GC rich sequences. This is both a PCR efficiency issue and, in the case of exome sequencing, a sequence capture issue: GC rich sequences do not amplify as efficiently and they are also harder to elute from capture baits. Because of these factors, GC rich sequences are usually underrepresented in datasets. Exome capture manufacturers have realized this is a problem, and the most recent versions of these kits try to account for it with varying degrees of success. Exome sequencing also has a problem detecting small insertions and deletions, depending on their size, because capture baits may not hybridize well to sequences that have a high degree of variation, causing these sequences to be underrepresented. Exome capture also can’t be used to reliably detect large CNVs because individual probes differ in capture efficiency, so it’s hard to tell whether read depth variation reflects true variation or just capture efficiency. Because the technology we use relies on short reads, we also can’t use it for de novo assembly of a genome, and it will miss trinucleotide repeat expansions. Additionally, the short reads make it hard to detect large structural variants such as inversions and translocations, because without significant coverage at break points these are very hard to detect as variations.
For this very same reason, short reads also aren’t ideal for RNA-seq when the goal is to detect isoforms, because if we don’t have reads that span all of the splice sites in a transcript, it is very hard to identify individual transcript isoforms.
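The isoform problem can be made concrete with a small sketch. The gene, its exon numbering and the three isoforms in `ISOFORMS` are invented for illustration: a read that covers a single exon-exon junction is compatible with more than one isoform, while a read spanning every junction identifies exactly one.

```python
# Hypothetical gene with exons 1-4 and three made-up isoforms.
ISOFORMS = {
    "iso_full": (1, 2, 3, 4),
    "iso_skip2": (1, 3, 4),
    "iso_skip3": (1, 2, 4),
}


def junctions(exons):
    """Set of exon-exon junctions in an exon chain."""
    return set(zip(exons, exons[1:]))


def compatible_isoforms(observed_junctions):
    """Names of isoforms that contain every junction seen in a read."""
    observed = set(observed_junctions)
    return sorted(
        name for name, exons in ISOFORMS.items()
        if observed <= junctions(exons)
    )


# A short read that only covers the exon 1 / exon 2 junction.
short_read = compatible_isoforms([(1, 2)])
# A full-length read that covers every junction in the transcript.
long_read = compatible_isoforms([(1, 2), (2, 3), (3, 4)])
print(short_read, long_read)
```

With short reads, the ambiguity has to be resolved statistically across many reads; a single long read resolves it directly.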
Solutions to most of these problems exist, but as with anything they come at a cost. One way to counter the indel and GC bias problems of exome sequencing is to perform whole genome sequencing. This has the added benefit of allowing us to detect many of the large structural variations. To further reduce the GC bias problem, there are now kits available for PCR free whole genome sequencing, which reduce PCR biases and polymerase artifacts. Since the cost of whole genome sequencing is dropping rapidly, it may make sense in the near future to perform whole genome sequencing instead of whole exome sequencing. Finally, to detect large structural variations including repeat expansions, translocations and inversions, we can use some of the newly developed long read technologies. We need to keep in mind, though, that these techniques have varying degrees of success, and generally the more expensive the technique the better the data quality. These can add anywhere from $1000 to $10,000 to the cost of a genome run, but in some cases the added cost may be worth it.
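GC bias is easier to reason about once GC content is measured. The sketch below computes the GC fraction of a sequence and lists windows above a cutoff; the window size and the 0.6 threshold are arbitrary illustrative choices, not values from the talk or from any capture kit.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_rich_windows(seq, window=10, threshold=0.6):
    """List (start, gc) for non-overlapping windows at or above the threshold.

    The window size and threshold are arbitrary values for illustration.
    """
    hits = []
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_fraction(seq[start:start + window])
        if gc >= threshold:
            hits.append((start, gc))
    return hits


# Reference sequence from the mutation slide: 7 of 14 bases are G or C.
print(gc_fraction("ATCGGGTCATGTCA"))
# An AT-only window followed by a GC-rich window.
print(gc_rich_windows("ATATATATAT" + "GCGCGCGCAT"))
```

Regions flagged this way are the ones where lower coverage from PCR or capture would be expected, so they are worth checking when a variant call depends on read depth.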
At CHGV we are currently reviewing two of the short read based technologies that allow some access to repetitive regions. We’re performing a side-by-side analysis of these technologies on a C9orf72 ALS sample. One of these techniques is mate-pair sequencing, which is essentially the same technique we use to create sequencing libraries except that the insert size, or space between sequenced regions, is much larger. The larger insert size allows us to span longer distances, with the hope that variant sequence will be detected to better inform the alignment of both the short and mate-pair reads.
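One common way paired reads with a known insert size are used is to flag pairs whose mapped positions disagree with that size. The sketch below shows the general idea only; it is not the analysis described in the talk, and the tuple layout, the 5 kb expected insert and the 2 kb tolerance are all assumptions chosen for illustration.

```python
def discordant_pairs(pairs, expected_insert, tolerance):
    """Return pairs whose mapped span differs from the expected insert size.

    pairs: list of (chrom1, pos1, chrom2, pos2) for the two mates.
    Each flagged pair is returned with a short reason string appended.
    """
    flagged = []
    for chrom1, pos1, chrom2, pos2 in pairs:
        if chrom1 != chrom2:
            # Mates on different chromosomes hint at a translocation.
            flagged.append((chrom1, pos1, chrom2, pos2, "different chromosomes"))
            continue
        span = abs(pos2 - pos1)
        if abs(span - expected_insert) > tolerance:
            # Mates too far apart or too close hint at a deletion or insertion.
            flagged.append((chrom1, pos1, chrom2, pos2, "unexpected span"))
    return flagged


pairs = [
    ("chr9", 10_000, "chr9", 15_000),   # 5 kb apart, as expected
    ("chr9", 20_000, "chr9", 45_000),   # 25 kb apart on the reference
    ("chr9", 30_000, "chr2", 1_000),    # mates on different chromosomes
]
flagged = discordant_pairs(pairs, expected_insert=5_000, tolerance=2_000)
for item in flagged:
    print(item)
```

A single discordant pair proves little; in practice many pairs need to agree on the same break point, which is why read depth at break points matters.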
The second technique we’re testing is a new protocol released by Illumina called synthetic long reads. This technique uses a dilution scheme to recreate 10 kilobase reads in silico from barcoded subfragments. In this scheme the genomic DNA is sheared to 10KB, diluted across a 384 well plate, fragmented again into short barcoded fragments and sequenced. Because we know the size of the original fragment and the source of the subfragments, we can use this information to recreate the original long reads from short reads. This technique has been used to sequence the HLA region and also to perform de novo sequencing of the Drosophila genome, including accurate mapping of 80% of the Drosophila transposable elements. We remain optimistic about the success of these techniques and hope they perform well on our C9orf72 sample, but we’ll see what the final result is after our evaluation of both protocols.
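The reconstruction step can be sketched as grouping short reads by barcode and then merging reads within each group wherever they overlap. This is a deliberately naive greedy merge for illustration, not Illumina's assembly method; the barcodes, sequences and the minimum overlap of 3 bases are made up, and real data would need error tolerance and far longer overlaps.

```python
from collections import defaultdict


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(reads, min_len=3):
    """Greedily merge the two reads with the longest overlap until no overlaps remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


def synthetic_long_reads(barcoded_reads):
    """Group (barcode, sequence) pairs by barcode and assemble each group separately."""
    by_barcode = defaultdict(list)
    for barcode, seq in barcoded_reads:
        by_barcode[barcode].append(seq)
    return {barcode: assemble(seqs) for barcode, seqs in by_barcode.items()}


barcoded_reads = [
    ("bc1", "ATCGGG"), ("bc2", "TTTTAC"), ("bc1", "GGGTCA"),
    ("bc1", "TCATGT"), ("bc2", "TACGGA"),
]
long_reads = synthetic_long_reads(barcoded_reads)
print(long_reads)
```

The barcode is what makes this tractable: reads are only ever compared with others from the same original fragment, so repeats elsewhere in the genome cannot confuse the merge.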
However, to get a more accurate view of these repeat expansions and large structural variations, we could use a true long read sequencing technology. These techniques have become more popular over the past few years as their base accuracy has improved and their cost has become more reasonable. These single molecule sequencing techniques offer some advantages over second generation sequencing in that their sample preparation is less complex and they provide much longer reads, on the order of 1 kilobase to 100 kilobases. There are currently two long read technologies available. One of them is Pacific Biosciences sequencing by synthesis, which uses a microscope to watch a polymerase as it sequences DNA and records the flash that appears in the active site. The second type of single molecule sequencing is nanopore based. In this system DNA is fed through a pore in a membrane and the DNA sequence is detected by sensing changes in the flow of ions through the pore. Because DNA bases are different sizes, they restrict the flow of ions to different degrees, and this can be used to determine the sequence of the bases. There are a few proof of concept nanopore systems in development, and I expect this technology to expand rapidly over the next 5 years. Oxford Nanopore has finally allowed us to start using its technology, and while the base accuracy is awful, the results are promising for a number of use cases, particularly in bacterial community profiling. Quality should improve significantly, either through Oxford Nanopore or through one of the other companies that are developing similarly ingenious technology. While the base accuracy of these techniques is currently on the order of 60-90% and the cost is 3-10 times as much as Illumina sequencing, these long reads can be used as a scaffold for highly accurate Illumina short reads and give us access to sequence information that is impossible to detect with Illumina chemistry alone.
To illustrate this, a researcher recently performed RNA-seq analysis using both Illumina and PacBio reads and showed that the Illumina system missed 90% of the RNA isoforms.
Finally, I’d like to end with a slide that offers some additional reading and viewing material. Garvan in Australia has released the first test run data from the HiSeqX and has made it available to the public. If you’re interested in playing with some new data, I recommend visiting their DNAnexus page and downloading the data. While I listed 5 techniques that we perform routinely in the sequencing lab, there is a wide array of other techniques in this space we’d be willing to explore. Many of these are detailed in the Sequencing Methods overview from Illumina. If you’re interested in synthetic long reads and how the technology works, you can read the paper detailing the use of this system for de novo sequencing of the Drosophila genome.