Genome assembly: then and now
Keith Bradnam
Image from Wellcome Trust
Author: Keith Bradnam, Genome Center, UC Davis
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


This was a talk given on 2014-06-19 for the Genome Center’s Bioinformatics Core as part of a 1 week workshop on using Galaxy.
Image from flickr.com/photos/dougitdesign/5613967601/
Contents
Sequencing 101
Genome assembly: then
Genome assembly: now
Assemblathon 1 & 2
Advice & Angst
The future
More info
✤ http://assemblathon.org
✤ http://gigasciencejournal.com
✤ http://twitter.com/assemblathon
The Assemblathon 2 paper has been reviewed; we are just dealing with the reviewers' comments.
Sequencing 101
A, C, G, T...
Image from nlm.nih.gov
Fred Sanger invented the sequencing technology that helped sequence most of the good-quality genomes that are out there. He also won two Nobel prizes.
Read
Most sequencing technologies start with a sequencing read. A read could be as short as 25 bp (Solexa sequencing from a few years ago), or >15,000 bp
(PacBio with latest chemistry). The record read length is currently held by PacBio and is over 50,000 bp.
Read pair
Most sequencing is done with pairs of connected reads, separated by a short interval whose approximate length is known. Not all reads will have this exact
‘insert size’. There can be a LOT of variation. Read pairs can also overlap with each other.
Read pair
Mate pair
Mate pairs, also known as jumping pairs, have much larger inserts (thousands or tens of thousands of bp), but it is hard to make good mate pair libraries.
Having very large inserts is very useful for the purposes of genome assembly. Again, there is a lot of variation in the actual size of inserts (as determined
by mapping mate pairs back to a known reference).
Sequence a whole lot of read pairs, and hopefully they will overlap with each other and allow you to start making contiguous sequences...
Contigs
...which are better known as contigs.
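The idea of overlapping reads growing into a contig can be sketched in a few lines of Python. This is only a toy illustration: the reads are invented, overlaps must match exactly, and the function names (`overlap_length`, `merge_reads`) are hypothetical. Real assemblers build overlap or de Bruijn graphs and have to cope with sequencing errors and repeats.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two overlapping reads, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Three invented reads that overlap end to end.
reads = ["ACGTACGGA", "CGGATTCA", "TTCAGGC"]

contig = reads[0]
for read in reads[1:]:
    contig = merge_reads(contig, read)

print(contig)
```

Each read extends the growing contig by the bases that lie beyond the shared overlap.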
Scaffold
NNNNNNNNNNNNNNNNNNN
Mate pairs — or other information — can hopefully be used to connect contigs together into scaffolds. The unknown gap between contigs is replaced with
unknown bases (Ns). Some scaffold sequences can therefore end up containing a lot of Ns.
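A minimal sketch of how contigs and gaps become a scaffold sequence. The contig sequences, the gap sizes and the function name `build_scaffold` are all made up for illustration; in practice the gap sizes are estimates derived from mate pair insert sizes.

```python
def build_scaffold(contigs, gap_sizes):
    """Join contigs in order, filling each estimated gap with N characters."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)


# Invented contigs and gap estimates.
contigs = ["ACGTAC", "GGATTC", "TTAGCA"]
gap_sizes = [5, 3]

scaffold = build_scaffold(contigs, gap_sizes)
print(scaffold)
print(scaffold.count("N"), "of", len(scaffold), "bases are N")
```

Even in this tiny example, almost a third of the scaffold length comes from Ns rather than from sequenced bases.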
Assembly size
[Figure: a toy assembly of 12 scaffolds with lengths of 70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5 and 5 Mbp, some containing runs of Ns. Total: 200 Mbp]
Assembly size is simply the sum of all scaffolds or contigs that are included in the final genome assembly. If you are calculating the assembly size from
scaffolds, then some fraction of that final size will come from the Ns in scaffold sequences.
Here we have a toy genome assembly, with 12 scaffolds totaling 200 Mbp.
N50 length
[Figure: the same toy assembly (12 scaffolds, 200 Mbp), with a running total that grows from 70 to 95 to 115 Mbp as the three longest scaffolds are added]
The most widely used measure to describe genome assemblies is the N50 length of scaffolds or contigs. This is essentially a weighted median, designed to
be more informative than a crude mean length (which is not very useful if you end up with thousands of very short scaffolds/contigs). To calculate the N50
scaffold length, start with the length of the longest scaffold (70 Mbp here).
If this length does not exceed 50% of the total assembly size (50% is why it is N50), proceed to the next longest scaffold, and add the length to a running
total (70 + 25 = 95 Mbp).
Adding the third longest scaffold gives 115 Mbp. Now we have exceeded 50% of the total assembly size.
The length of the contig or scaffold that takes you past 50% is what is reported as the N50 length. So here, we have an N50 length of 20 Mbp.
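The steps above can be sketched as a short Python function, using the scaffold lengths (in Mbp) of the toy assembly. One assumption to flag: the function stops as soon as the running total reaches at least half of the assembly size, which is a common convention; for this example it gives the same answer as strictly exceeding 50%.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths."""
    total = sum(lengths)
    running_total = 0
    # Walk through the scaffolds from longest to shortest.
    for length in sorted(lengths, reverse=True):
        running_total += length
        # Stop at the scaffold that takes us to at least 50% of the total.
        if running_total * 2 >= total:
            return length
    raise ValueError("no lengths given")


# Scaffold lengths (Mbp) of the toy assembly from the slides.
toy_assembly = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]

print("Assembly size:", sum(toy_assembly), "Mbp")
print("N50:", n50(toy_assembly), "Mbp")
```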
N50 length
[Figure: the toy assembly with its two shortest scaffolds (5 Mbp each) removed, leaving 10 scaffolds totaling 190 Mbp]
N50 may be more robust than using a simple mean length, but it can still be easily manipulated. What if we excluded the two shortest scaffolds from our
assembly?
Now the total assembly size is 10 Mbp smaller, which is only 5%, but the N50 increases to 25 Mbp...a 25% increase in size. If these were two different
assemblies and you only saw an N50 of 25 Mbp vs N50 of 20 Mbp, you might think the first assembly was much better.
N50 for two assemblies
[Figure: two fictional assemblies side by side. First: 208 Mbp, N50 = 15 Mbp. Second: 190 Mbp, N50 = 25 Mbp]
Here are another two fictional assemblies. The first assembly now has a lower N50 value, but this is purely because it contains more sequence (which comes
from very short scaffolds). Do you want more sequence in your assembly, or fewer but longer sequences?
NG50 for two assemblies
[Figure: the same two assemblies (208 Mbp and 190 Mbp) measured against an expected genome size of 250 Mbp. NG50 = 15 Mbp for both]
We prefer a measure called NG50. This does not use the assembly size, but instead uses the known (or estimated) genome size (the 'G' in NG50 refers to
the Genome). We first used this measure in the Assemblathon 1 paper and (thankfully) it has seen some adoption by the assembly community.
The NG50 of these two assemblies is now the same. We think that NG50 is a fairer way of comparing genome assemblies that might differ in their total
size.
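A rough sketch of the difference between N50 and NG50. The slides do not list the scaffold lengths of the 208 Mbp assembly, so this example instead uses the 200 Mbp toy assembly from earlier and its trimmed 190 Mbp version; the 250 Mbp genome size is the value from the slide. The helper name `length_at_half` is hypothetical, and the "at least half" stopping rule is an assumption (it is the rule that reproduces the N50 of 25 Mbp quoted for the trimmed assembly).

```python
def length_at_half(lengths, target_size):
    """Length of the scaffold at which the running total reaches half of target_size.

    Returns None if the assembly contains less than half of target_size.
    """
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total * 2 >= target_size:
            return length
    return None


def n50(lengths):
    """N50: the target is the size of the assembly itself."""
    return length_at_half(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: the target is the known (or estimated) genome size."""
    return length_at_half(lengths, genome_size)


# Scaffold lengths in Mbp.
full = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]   # 200 Mbp toy assembly
trimmed = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]      # two shortest scaffolds removed
genome_size = 250                                      # expected genome size from the slide

for name, lengths in [("full", full), ("trimmed", trimmed)]:
    print(name, sum(lengths), "Mbp",
          "N50 =", n50(lengths),
          "NG50 =", ng50(lengths, genome_size))
```

Dropping the two short scaffolds changes the N50 but not the NG50, because the target (half of the genome size) no longer depends on what was left in or out of the assembly.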
You should check that high N50 values are not simply due to lots of Ns in the scaffolds
You should always look at your assembly before you do anything with it!
Assembly 'x'
Size: 859 Mbp
Number of scaffolds: 28
N50 = 70.3 Mbp
Ns = 90.6% !!!
This assembly, which I was asked to run CEGMA on, turned out to be 91% N! It is not going to be good for anything. There are lots of
other ambiguity characters too (e.g. R for puRine).
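A quick check like this is easy to script. The sketch below counts the fraction of N characters in FASTA-formatted text; the function name `n_fraction` and the tiny example sequences are invented for illustration, and it only counts N (not other ambiguity codes such as R or Y).

```python
import io


def n_fraction(fasta_handle):
    """Return the fraction of bases that are N (or n) in FASTA-formatted input."""
    total = 0
    n_count = 0
    for line in fasta_handle:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        n_count += line.upper().count("N")
    if total == 0:
        raise ValueError("no sequence found")
    return n_count / total


# Invented two-scaffold example; a real run would pass an open file instead.
example = io.StringIO(
    ">scaffold_1\n"
    "ACGTNNNNNN\n"
    "NNNNACGTRY\n"
    ">scaffold_2\n"
    "NNNNNNNNNN\n"
)

print(f"Ns = {n_fraction(example):.1%}")
```

For a real assembly you would open the FASTA file and pass the file handle to the function, before running anything more expensive on it.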
Basic assembly metrics
Metric            Description
Assembly size     With or without very short contigs?
N50 / NG50        For contigs and/or scaffolds
Coverage          When compared to a reference sequence
Errors            Base errors from alignment to reference sequence and/or input read data
Number of genes   From comparison to reference transcriptome and/or set of known genes
And many, many more...
Apart from assembly size, and N50/NG50 length, there are many other ways to describe a genome assembly.
Genome assembly
Back in the day...
1998
How were genomes assembled back in the late 1990s when genome sequencing projects were starting to make the news?
Genome assembly: then
Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓
Genome sequencing projects often had a fantastic amount of supporting material which helped put the genome together. They were further helped by
targeting genomes which had low heterozygosity. And of course this was all done with Sanger sequencing which gave long, accurate reads.
So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads, and detailed maps???
So hopefully this gave us a useful set of finished genomes, right?
Arabidopsis thaliana
✤ 2000: published genome size = 125 Mbp
✤ 2007: genome size = 157 Mbp
✤ 2012: genome size = 135 Mbp
✤ Amount sequenced = 119 Mbp
✤ Ns = 0.2% of genome
Many published genome sizes are based on estimates, which can be wrong. As more and more of the Arabidopsis genome was sequenced, the estimate of its size
had to be revised. So between 2000 and 2007 they produced more sequence but paradoxically the genome became less complete because the estimate of
the size went up. Now it has come back down again. But the genome remains unfinished.
Two views of the same gene
Top: from genome sequence view on TAIR web site
Bottom: from gene sequence file on TAIR FTP site
The top sequence is taken from a web-based view of a gene in the Arabidopsis thaliana genome sequence (information taken from the TAIR database).
It is missing a G compared to the same gene sequence in a file of gene sequences that is available to download from the TAIR FTP site.
This is one of eight cases where sequencing has not confirmed the existence of a base, but without that base you get a frameshift and a truncated
protein. The G has been added by humans, and the difference between the two versions of the sequence is the sort of thing that gives bioinformaticians
nightmares. How are you meant to know what these differences mean and which is the correct one (unless you email a TAIR curator like I did)?
Drosophila melanogaster
✤ Genome published 1998
✤ Heterochromatin finished 2007
✤ Ns = 4% of genome
The fly genome was 'finished' in 1998. But this was only really the easy-to-sequence portion of the genome (the euchromatin). The trickier
heterochromatin was sequenced as a separate project that didn't finish until almost a decade later. The fly genome remains unfinished.
Caenorhabditis elegans
✤ Genome published 1998
✤ 2004: last N removed
✤ 1998–2014: genome sequence changes
  ✤ 558 insertions
  ✤ 230 deletions
  ✤ 614 substitutions
✤ Most recent changes: Nov 2012
The worm genome has no unknown bases in it. However, since the publication of the genome sequence the genome has continued to be refined as errors
are corrected. The last batch of changes all occurred recently (November 2012). So almost 15 years after the genome was published, it was still possible
to find over 1,400 errors in one of the best characterized genome sequences that exist.
Saccharomyces cerevisiae
✤ Genome published 1997
✤ 12 Mbp genome
✤ 1,653 changes to genome since 1997
✤ Last changes made in 2011
Likewise in yeast. The first eukaryotic genome sequence continues to receive fixes to correct the sequence. The last set of changes was made in 2011.
These changes affected coding sequences, not just intergenic and intronic DNA.
Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓
Genome assembly: then
And all of this was done in an era when we had all of these supporting materials.
Genetic maps ✗
Physical maps ✗
Understanding of target genome ✗
Haploid / low heterozygosity genome ✗
Accurate & long reads ✗
Resources (time, money, people) ✗
Genome assembly: now
We don't have these now! Genome sequencing no longer requires an international consortium; it could be a project for a grad student.
Assembling & finishing a genome is not easy!
It was never easy, even when we had access to lots of resources to help us put genomes together. And it is not easy now. Don't be fooled into thinking that
because there are many published genome sequences, these sequences represent the absolute ideal genome sequence.
And don’t be fooled into thinking that just because you can afford to sequence a genome, you will have the resources to make a useful assembly from that
sequence data.
Assemblathons
A new idea is born
Image from flickr.com/photos/dullhunk/4422952630
The Assemblathon was born out of the Genome 10K project.
If you sequence 10,000 genomes...
...you need to assemble 10,000 genomes
How many assembly tools are out there?
bambus2, Ray, Celera, MIRA, ALLPATHS-LG, SGA, Curtain, Metassembler, Phusion, ABySS, Amos, Arapan, CLC, Cortex, DNAnexus, DNA Dragon, Edena, Forge,
Geneious, IDBA, Newbler, PRICE, PADENA, PASHA, Phrap, TIGR, Sequencher, SeqMan NGen, SHARCGS, SOPRA, SSAKE, SPAdes, Taipan, VCAKE, Velvet, Arachne,
PCAP, GAM, Monument, Atlas, ABBA, Anchor, ATAC, Contrail, DecGPU, GenoMiner, Lasergene, PE-Assembler, Pipeline Pilot, QSRA, SeqPrep, SHORTY, fermi,
Telescoper, Quast, SCARPA, Hapsembler, HapCompass, HaploMerger, SWiPS, GigAssembler, MSR-CA, MaSuRCA, GARM, Cerulean, TIGRA, ngsShoRT, PERGA,
SOAPdenovo, REAPR, FRCBam, EULER-SR, SSPACE, Opera, mip, gapfiller, image, PBJelly, HGAP, FALCON, Dazzler, GGAKE, A5, CABOG, SHRAP, SR-ASM,
SuccinctAssembly, SUTTA, Ragout, Tedna, Trinity, SWAP-Assembler, SILP3, AutoAssemblyD, KGBAssembler, MetAMOS, iMetAMOS, MetaVelvet-SL, KmerGenie,
Nesoni, Pilon, Platanus, CGAL, GAGM, Enly, BESST, Khmer, GRIT, IDBA-MTP, dipSPAdes, WhatsHap, SHEAR, ELOPER, OMACC
There are many, many tools out there for assembling, or helping to assemble, a genome sequence (there are 114 on this page). People will not have the
time or patience (or skill) to try more than a handful of these. But…
Which is the best?
…people want to know, which is the best?
Comparing assemblers
✤ Can't fairly compare two assemblers if they:
  ✤ produced assemblies from different species
  ✤ assembled same species, but used sequence data from different sequencing technologies
  ✤ used same sequencing technologies but have different sequence libraries
✤ Even using different options for the same assembler may produce very different assemblies!
However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.
The PRICE genome assembler has 52 command-line options!!!
How many of them are you going to learn?
This assembler has 52 command-line options! Not all of these will affect the resulting assembly, but many of them will.
A genome assembly competition
That's where the Assemblathon came in.
An attempt to standardize some aspects of the genome assembly process
Genome assembly contests
Others have been trying to do the same thing, e.g. GAGE and dnGASP. If you can at least give different assemblers the same input sequence data, you
can start to take account of one of the biggest variables in genome assembly.
Assemblathon 1
✤ 2010–2011
✤ Used synthetic data
✤ Small genome (~100 Mbp)
✤ We knew the answer
It is easier to judge a tool when you know what the final answer should look like. However, many people that work on developing assemblers would prefer
to work with real data...
Here we go again
...which is where Assemblathon 2 came in.
                 Type of data   Number of genomes   Size of genomes   Do we know the answer?
Assemblathon 1   Synthetic      1                   Small             ✓
Assemblathon 2   Real           3                   Large             ✗
Assemblathon 2 became a much bigger contest compared to Assemblathon 1.
Melopsittacus undulatus
Maylandia zebra
Boa constrictor constrictor
A budgie, a cichlid fish from Lake Malawi, and a reptile.
Bird
Fish
Snake
Let's simplify the names for the rest of the talk.
Why these three species?
Because they were there
There is no special reason why these species were used. People had a need to sequence the genomes, and some companies were willing to donate
sequences.
Assemble this!
Species   Estimated genome size   Illumina              Roche 454           PacBio
Bird      1.2 Gbp                 285x (14 libraries)   16x (3 libraries)   10x (2 libraries)
Fish      1.0 Gbp                 192x (8 libraries)    -                   -
Snake     1.6 Gbp                 125x (4 libraries)    -                   -
Lots of sequence data was provided for the bird. Mate pair and read pair libraries were available for all Illumina datasets.
This probably doesn’t reflect a real world scenario. Not everyone can afford over 400x of sequence coverage!
Who took part?
Lots of teams took part. Not just from the big sequencing/genome centers.
Who took part?
Lots of teams took part. Not just from the big sequencing/genome centers.
Who took part?
21 teams
43 assemblies
52,013,623,777 bp of sequence
Lots of teams took part. Not just from the big sequencing/genome centers.
Entries
Species  Competitive entries  Evaluation entries
Bird     12                   3
Fish     10                   6
Snake    12                   0
Each team could submit only one competitive entry per species. Evaluation entries were allowed in addition, but were not eligible to be declared the winner.
Goals
✤ Assess 'quality' of assemblies
✤ Define quality!
✤ Produce ranking of assemblies for each species
✤ Produce ranking of assemblers across species?
Defining quality was really the toughest part of organizing the Assemblathon competitions. Lots of people have lots of (different) ideas as to what
contributes to assembly quality.
Who did what?
Person/group                   Jobs
Me, Ian Korf, and Joseph Fass  Perform various analyses of all assemblies
David Schwartz et al.          Produce & evaluate optical maps
Jay Shendure et al.            Produce Fosmid sequences (bird & snake only)
Martin Hunt & Thomas Otto      Perform REAPR analysis
Dent Earl & Benedict Paten     Help with meta-analysis of final rankings
Lots of different groups were involved in the organization and assessment of the Assemblathon 2 entries.
91 co-authors!
Image from flickr.com/photos/jamescridland/613445810
With so many co-authors, it was hard to get agreement on how best to interpret the results. Some analyses and interpretations in the Assemblathon 2 paper ended up being compromises.
Results!
Lots of results!
A screen grab of my master spreadsheet that contains all of the numerical results. Each row represents a submitted assembly, and each column represents
a different assembly metric.
102 different metrics!
There were a lot of metrics. Many of these were not important or highly informative (e.g. %N).
10 key metrics
We focused on 10 of 102 metrics that we thought were a) useful and b) captured different aspects of an assembly's quality.
Key metric  Description
1           NG50 scaffold length
2           NG50 contig length
3           Amount of assembly in 'gene-sized' scaffolds
4           Number of 'core genes' present
5           Fosmid coverage
6           Fosmid validity
7           Short-range scaffold accuracy
8           Optical map: level 1
9           Optical map: levels 1–3
10          REAPR summary score
The 10 key metrics.
In the remainder of this talk, I’ll just focus on some of these metrics. See the Assemblathon 2 paper for more details.
1) Scaffold NG50 lengths
✤ Can calculate NG50 length for each assembly
✤ But also calculate NG60, NG70 etc.
✤ Plot all results as a graph
An N50 (or NG50) value on its own doesn't tell you that much. Ideally you should always be aware of the total assembly size and the distribution of lengths when comparing assemblies. You can do this by calculating not only NG50, but NG1 through NG100. NG1 is the length of the scaffold at which the running total (summing scaffolds from longest to shortest) first reaches 1% of the estimated genome size.
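The NG1..NG100 idea can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the Assemblathon analysis; the function name `ng_values` and the toy scaffold lengths are made up for the example. An NGx of 0 is used here to mean that the assembly never reaches x% of the estimated genome size.

```python
def ng_values(scaffold_lengths, genome_size, percentiles=range(1, 101)):
    """Return {x: NGx length} for each x in percentiles.

    NGx is the length of the scaffold at which the running total, summed
    from longest to shortest, first reaches x% of the estimated genome size.
    A value of 0 means the assembly is too small to reach that fraction.
    """
    lengths = sorted(scaffold_lengths, reverse=True)
    results = {}
    for x in percentiles:
        running_total = 0
        ngx = 0
        for length in lengths:
            running_total += length
            # Integer comparison avoids floating-point edge cases.
            if running_total * 100 >= genome_size * x:
                ngx = length
                break
        results[x] = ngx
    return results


# Toy assembly (hypothetical): six scaffolds, estimated genome size of 100 bp.
toy_lengths = [40, 25, 15, 10, 5, 5]
ng = ng_values(toy_lengths, genome_size=100)
print(ng[1], ng[50], ng[100])  # 40 25 5
```

Plotting the 100 values for each assembly (scaffold length on a log axis) gives one curve per assembly, which shows the whole length distribution rather than a single number.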
1) Scaffold NG50 lengths
Scaffold length is on a log axis and team identifiers are shown in the legend.
The black dashed line shows the NG50 value, but the point where each series starts on the left shows the lengths of the longest scaffolds. Also, if the NG100 value is greater than zero, then that assembly is at least as big as the known/estimated genome size.
2) Contig vs scaffold NG50
We did the same thing for contig NG50 as well as scaffold NG50. The two measures are sometimes, but not always, correlated. The two highlighted data
points show outliers for bird assemblies, reflecting assemblies that are good at making long contigs *or* good at making long scaffolds, but not both.
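To compute a contig NG50 you first need contig lengths. The talk does not say how contigs were obtained; a common convention, assumed here, is to split each scaffold at runs of N (unknown bases). The function name and the `min_gap` threshold below are illustrative choices only.

```python
import re


def split_into_contigs(scaffold, min_gap=1):
    """Split a scaffold sequence into contigs at runs of >= min_gap N characters."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in gap.split(scaffold) if piece]


# Toy scaffold (hypothetical): two sequence blocks joined by a 10 bp gap,
# with a single unknown base inside the second block.
scaffold = "ACGTACGT" + "N" * 10 + "GGCCTTAA" + "N" + "ACGT"
contigs = split_into_contigs(scaffold, min_gap=5)
print([len(c) for c in contigs])  # [8, 13]
```

The resulting contig lengths can then go through exactly the same NG50 calculation as the scaffold lengths.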
3) Gene-sized scaffolds
✤ Some assembly folks get a little obsessed by length!
✤ How long is 'long enough' for a scaffold?
✤ What if you just wanted to find genes?
✤ Average vertebrate gene = ~25 Kbp
It is great to have long scaffolds, but maybe for many questions that you might be interested in (e.g. studying codon usage bias), you only need to have
scaffolds that have a good chance of capturing a full-length gene.
3) Gene-sized scaffolds
The red data series shows the bird assemblies ordered by their NG50 scaffold length. The blue line shows the percentage of the estimated genome size
that is present in scaffolds of 25 Kbp or longer. Most assemblies, even those with a much shorter *average* scaffold length, may contain many scaffolds
that are still long enough to contain a single gene.
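The 'gene-sized' metric is simple to compute. A minimal sketch, using the 25 Kbp threshold from the slide; the function name and the scaffold lengths are hypothetical.

```python
def percent_in_long_scaffolds(scaffold_lengths, genome_size, min_length=25_000):
    """Percentage of the estimated genome size held in scaffolds >= min_length."""
    total = sum(length for length in scaffold_lengths if length >= min_length)
    return 100.0 * total / genome_size


# Hypothetical scaffold lengths (bp) and genome size, for illustration only.
lengths = [300_000, 120_000, 40_000, 26_000, 24_000, 10_000]
percent = percent_in_long_scaffolds(lengths, genome_size=1_000_000)
print(f"{percent:.1f}% of the estimated genome is in gene-sized scaffolds")
```

Note that the denominator is the estimated genome size rather than the assembly size, so assemblies are compared against the same target.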
4) Core genes
✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach)
✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs)
✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens
✤ How many full-length CEGs are in each assembly?
A previously developed tool (CEGMA) was used to see how many 'core genes' (very highly conserved genes) are present in each assembly. These genes have been identified in six different eukaryotes and are expected to be present in all eukaryotes.
Note that CEGMA only finds a gene when a full-length (or nearly full-length) copy is present within a single scaffold. Many core genes might be present, but split across scaffolds.
4) Core genes
Core genes (out of 458)
Species  Best individual assembly  Across all assemblies
Bird     420                       442
Fish     436                       455
Snake    438                       454
In the three species, most of the core genes were present across all assemblies, but individual assemblies typically lacked several core genes.
4) Core genes
These results show the number of CEGMA genes that were present in any one assembly as a percentage of all possible CEGMA genes (i.e. those present
across all assemblies for each species).
ABYSS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
BCM MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CRACS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CURT MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
GAM MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
MERAC MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
PHUS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
RAY MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SGA MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SYMB MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
SOAP MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
************************************************ *****
!
ABYSS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
BCM FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CRACS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CURT FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
GAM FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNNLPHTHI
MERAC FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
PHUS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
RAY FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SGA FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SYMB FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SOAP FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
******************************************************
!
ABYSS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
BCM ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CRACS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CURT ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
GAM YGHALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLK------------------
MERAC ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
PHUS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
RAY ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SGA ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SYMB ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SOAP ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
***************************************
!
4) Core genes
Example of one core gene predicted in bird assemblies. CEGMA gene predictions are available as supplementary material with the paper.
8 & 9) Optical maps
✤ Stretch out DNA
✤ Cut with restriction enzymes
✤ Note lengths of fragments
✤ Compare to in silico digest of scaffolds
✤ Not all scaffolds suitable for analysis
For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites.
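The in silico side of this comparison can be sketched as follows. This is only an illustration: the enzyme is not named in the talk, so the EcoRI site (GAATTC) is an arbitrary stand-in, the cut is placed at the start of the site for simplicity, and the length and site-count thresholds are invented.

```python
def find_sites(scaffold, site):
    """Return the start position of every occurrence of site in scaffold."""
    positions = []
    start = scaffold.find(site)
    while start != -1:
        positions.append(start)
        start = scaffold.find(site, start + 1)
    return positions


def in_silico_digest(scaffold, site):
    """Return fragment lengths after cutting at the start of every site."""
    boundaries = [0] + find_sites(scaffold, site) + [len(scaffold)]
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


def suitable_for_optical_map(scaffold, site, min_length, min_sites):
    """A scaffold must be long enough *and* contain enough restriction sites."""
    return len(scaffold) >= min_length and len(find_sites(scaffold, site)) >= min_sites


SITE = "GAATTC"  # EcoRI recognition site, used purely as an example
toy = "AAAA" + SITE + "CCCCCCCC" + SITE + "GG"
print(in_silico_digest(toy, SITE))  # [4, 14, 8]
print(suitable_for_optical_map(toy, SITE, min_length=20, min_sites=2))  # True
```

The ordered list of predicted fragment lengths is what gets compared with the ordered fragment lengths measured on the optical map.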
8 & 9) Optical maps
Image from University of Wisconsin-Madison
An example of an optical map. After cutting, each DNA fragment is measured to estimate its length. Optical map results were divided into three categories
(levels 1–3).
8 & 9) Optical maps
White bars: total length of scaffolds that were suitable for optical map analysis. Dark blue: global alignments of scaffolds to maps (these are the best
quality). Light blue: global alignments with more permissive thresholds. Orange bars: local alignments. We used level 1 (dark blue) as one key metric and
levels 1+2+3 as a second key metric. The MLK assembly is good, *relatively* speaking (high percentage of suitable scaffolds are in level 1 category), but
we record scores on an absolute basis (MERAC highest for level 1, SOAP highest for levels 1+2+3).
8 & 9) Optical maps
Fish optical map results were much worse than in bird, with very few assemblies having scaffolds with 'level 1' global alignments to the optical map. SGA
had the most level 1 coverage, but a much lower amount of sequence that was alignable at any level (1, 2, or 3).
8 & 9) Optical maps
Snake optical map results were intermediate compared to bird and fish.
What does this all mean?
102 metrics per assembly
10 key metrics
1 final ranking
Using the 10 key metrics, we combined the results to produce a single score for each assembly by which to rank them.
Assembly  Number of core genes  Rank  Z-score
CRACS     438                   1     +0.68
SYMB      436                   2     +0.59
PHUS      435                   3     +0.54
BCM       434                   4     +0.49
SGA       433                   5     +0.44
MERAC     430                   6     +0.30
ABYSS     429                   7     +0.25
SOAP      428                   8     +0.21
RAY       422                   9     –0.08
GAM       415                   10    –0.41
CURT      360                   11    –3.02
Although we did take an average rank from the 10 individual rankings, we preferred to use a Z-score approach. Each assembly was scored based on the
total number of standard deviations from the average of each metric. This rewards/penalizes assemblies with very high/low scores in individual metrics.
The above results are from the CEGMA metric in bird.
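The Z-scores for a single metric can be reproduced from the core gene counts shown on this slide. The sketch below uses the population standard deviation; that choice is an inference from the fact that it reproduces the numbers on the slide, not something stated in the talk.

```python
from statistics import mean, pstdev

# Core gene counts as shown on the slide.
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}


def z_scores(values):
    """Map each assembly to (value - mean) / population standard deviation."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)
    return {name: (value - mu) / sigma for name, value in values.items()}


z = z_scores(core_genes)
for name, score in z.items():
    print(f"{name:6s} {score:+.2f}")
```

A rank treats the gap between 415 and 360 the same as the gap between 436 and 435. The Z-score does not, which is why the lowest-scoring assembly ends up at about three standard deviations below the mean.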
This graph shows the final rankings of bird assemblies based on their sum Z-scores. Assemblies in red are the evaluation entries. The error bars reflect
what would be the highest and lowest sum Z-score if we had used any of the possible combinations of 9 key metrics rather than 10. Note that the highest
ranked bird assembly was an evaluation assembly by Baylor College of Medicine (BCM); their competitive entry ranked 2nd.
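The sum Z-score and its error bars can be sketched like this. Using 9 of 10 metrics is the same as dropping each metric in turn, so there are only 10 combinations to check. The per-metric Z-scores below are invented for illustration and do not belong to any real assembly.

```python
def sum_z_with_range(z_by_metric):
    """Return (sum Z-score, lowest and highest sum when one metric is dropped)."""
    total = sum(z_by_metric.values())
    leave_one_out = [total - z for z in z_by_metric.values()]
    return total, min(leave_one_out), max(leave_one_out)


# Hypothetical Z-scores for one assembly across the 10 key metrics.
example = {
    "NG50 scaffold length": 1.5,
    "NG50 contig length": 0.5,
    "Gene-sized scaffolds": -0.5,
    "Core genes": 0.25,
    "Fosmid coverage": 2.0,
    "Fosmid validity": 1.0,
    "Short-range scaffold accuracy": 0.0,
    "Optical map: level 1": -1.0,
    "Optical map: levels 1-3": 0.75,
    "REAPR summary score": 0.5,
}

total, low, high = sum_z_with_range(example)
print(total, low, high)  # 5.0 3.0 6.0
```

A wide gap between the low and high values means the assembly's position depends heavily on one or two metrics.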
In fish, the BCM entry ranked 1st, though the error bars suggest there is much variability. The lack of Fosmid data means that there are only 7 key metrics rather than 10.
Snake seemed to be the only species where one assembler outperformed all others (SGA, in this case). We will return to this issue. Note that there were no evaluation entries for snake.
Another way of looking at all of this data is to plot the Z-scores for each metric as a heat map (red = higher Z-scores).
A parallel coordinates plot is another way of trying to show all of the information at once.
What does this all mean?
No really, what does this all mean?
Still a bit hard to make sense of the overall rankings. What are the main findings from our paper?
Some conclusions
✤ Very hard to find assemblers that performed well across all 10 key metrics
✤ Assemblers that perform well in one species do not always perform as well in another
✤ Bird & snake assemblies appear better than fish
✤ No real 'winner' for bird and fish
This type of news is perhaps disappointing to many.
SGA — best assembler for snake?
Even if we had happened to use 9 key metrics rather than 10, and even if we threw out the metric where SGA performed the best, it would still probably
rank 1st. So is that the end of the story?
Description                                   Rank of snake SGA assembly
NG50 scaffold length                          2
NG50 contig length                            5
Amount of assembly in 'gene-sized' scaffolds  7
Number of 'core genes' present                5
Fosmid coverage                               2
Fosmid validity                               2
Short-range scaffold accuracy                 3
Optical map: level 1                          2
Optical map: levels 1–3                       1
REAPR summary score                           2
SGA only ranked 1st in one of the ten key metrics and ranked 7th in another. So it is a good assembler *on average*. But if one of these metrics was highly
important to you, you may want to use an assembler that ranked higher in that metric.
Best assembler across species?
Assembler   Number of 1st places (out of 27)
BCM         5
Meraculous  4
Symbiose    4
Ray         3
Excluding evaluation entries
Not all teams entered assemblies for all three species, but many teams submitted entries for 2 or 3 of the species. In theory, if a team submitted an entry
for all species, and if their assembler ranked 1st in all metrics, they could achieve 1st place twenty-seven times (10 + 10 + 7 for fish). So what was the
best assembler across species, as judged by total number of 1st places? It is BCM. But Ray comes 4th with three 1st places.
Ray performance
Species Final ranking
Bird 7th
Fish 7th
Snake 9th
However, Ray ranks much lower when looking at its performance across all key metrics. So some assemblers do very well in specific measures and not so
well in others, while other assemblers do moderately well across lots of metrics (e.g. SGA).
We found it interesting that the best bird assembly was the evaluation entry by Baylor College of Medicine. What is different about this entry compared to
their competitive entry?
BCM bird assemblies
Assembler          Final rank  NGS data used in assembly  Coverage Z-score  Validity Z-score  NG50 contig Z-score
BCM - evaluation   1           Illumina + 454             +2.0              +1.4              +1.5
BCM - competitive  2           Illumina + 454 + PacBio    –0.3              –0.8              +2.7
The only difference is that the BCM competitive entry included PacBio data, and somehow this led to the paradoxical situation where including more
sequence in the assembly produced lower measures for coverage and validity (from the Fosmids), though one key metric (NG50 contig length) did
improve.
BCM evaluation scaffold
NNNNNNNNNNNNNNNNNNN
BCM competition scaffold
NNNNNNNNNNNNNNNNNNN
PacBio sequence
BCM used PacBio data to help fill in the gaps in their scaffolds.
BCM evaluation scaffold
NNNNNNNNNNNNNNNNNNN
BCM competition scaffold
CGTCGNNATCNNGGTTACG
Mismatches from PacBio sequence penalized alignment score more than matching unknown bases
Errors in the PacBio sequence were penalized by the choice of alignment program used to align Fosmid sequences to scaffolds.
The choice of one command-line option, used by one tool in the calculation of one key metric...
...probably made enough difference to drop the PacBio-containing assembly to 2nd place.
This was actually down to the use of a single command-line option in the lastz alignment program. If we had not chosen this option, the PacBio-containing
entry would have probably ranked 1st among all bird assemblies.
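A toy calculation shows how this can happen. The scoring values below (match +1, mismatch -3, unknown base 0) are assumptions chosen for illustration and are not lastz's actual scores or options; the Fosmid sequence is also made up. Only the two scaffold strings come from the slide.

```python
def score_alignment(reference, scaffold, match=1, mismatch=-3, unknown=0):
    """Score an ungapped, equal-length alignment column by column.

    The score values are illustrative assumptions, not lastz defaults.
    """
    if len(reference) != len(scaffold):
        raise ValueError("sequences must have the same length")
    score = 0
    for ref_base, scaf_base in zip(reference, scaffold):
        if scaf_base == "N":
            score += unknown      # unknown base: neither rewarded nor penalized
        elif scaf_base == ref_base:
            score += match
        else:
            score += mismatch     # sequencing error in the gap-filling read
    return score


fosmid = "CGACGTTATGCAGCTTAGG"        # hypothetical Fosmid sequence
evaluation = "N" * 19                  # gap left as unknown bases
competition = "CGTCGNNATCNNGGTTACG"    # gap filled with PacBio sequence

print(score_alignment(fosmid, evaluation))   # 0
print(score_alignment(fosmid, competition))  # -1
```

With these assumed scores, 11 correct bases are outweighed by 4 mismatches, so the gap-filled region scores lower than the region left as Ns, even though it contains more real sequence.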
Other conclusions
✤ Different metrics tell different stories
✤ Heterozygosity was a big issue for bird & fish assemblies
✤ Final rankings very sensitive to changes in metrics
✤ N50 is a semi-useful predictor of assembly quality
The last point may disappoint some. Despite looking at many different metrics, N50 scaffold length still does a reasonable job of predicting overall quality.
However...
...the outliers in this relationship should be noted. The highlighted bird assembly had the second highest scaffold N50 length, but ranked 6th among bird
assemblies.
Inter-specific differences matter
✤ The three species have genomes with different properties
  ✤ repeats
  ✤ heterozygosity
✤ The three genomes had very different NGS data sets
  ✤ Only bird had PacBio & 454 data
  ✤ Different insert sizes in short-insert libraries
Biological differences may account for differences in assembler performance between different species. However, the input data for each species was also
very different and this may play a role as well (some assemblers prefer certain short-insert sizes).
The Big Conclusion
"You can't always get what you want"
Sir Michael Jagger, 1969
People would like an assembler that consistently performs well across most (all?) metrics and across most species. We didn’t find such an assembler in the
Assemblathon 2 contest.
What comes next?
3?
There may one day be an Assemblathon 3 but there are no immediate plans (and no funding for us at UC Davis to do so).
A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to 'buy' resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
✤ Encourage experimental assemblies
✤ Use new FASTG genome assembly file format
✤ Get someone else to write the paper!
If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
Intermission
And now a break in the scheduled program in order to let me vent a little steam.
NGS must die!
‘NGS’ is used to refer to everything post-Sanger
Pyrosequencing was developed ~1996
Next-generation sequencing (NGS) is heavily used as a convenient label for modern sequencing technologies. But some of those technologies have
been in development since the mid-1990s. Do we refer to everything from the last 20 years as ‘next’ generation?
There are over 5,000 papers in Google Scholar which feature ‘Next-generation sequencing’ or ‘NGS’ in the title of the article. These do not help you if you are
trying to find papers that focus on pyrosequencing or nanopore sequencing. How could we improve these titles?
In many cases, including ‘next-generation’ adds nothing to the description of the paper.
NGS madness
Next generation sequencing
aka second generation sequencing
but there’s also: third generation sequencing
fourth generation sequencing
next-next generation sequencing
next-next-next generation sequencing
Some people have tried alternative names. These are all descriptions that have been used in published papers.
NGS madness
Technology          According to some papers…   According to other papers…
Complete Genomics   2nd generation              3rd generation
Ion Torrent         2nd generation              3rd generation
PacBio              2nd generation              3rd generation
Oxford Nanopore     3rd generation              4th generation
And of course, not everyone agrees on what is 2nd, 3rd, or 4th generation!
NGS madness
“PacBio is a 2.5th generation”
“Helicos lies between the transition of next-generation to third generation”
And of course, someone also has to be different!
NGS madness
There are different sequencing methodologies,
and there are different sequencing platforms.
Use one or the other.
Or just say ‘current sequencing technologies’.
I would suggest that it is more helpful to refer to different sequencing technologies by their methodology (sequencing by synthesis, pyrosequencing,
nanopore sequencing, etc.), or by the company developing the product (PacBio, Illumina, etc.).
Intermission
And now back to our scheduled programming.
My #1 piece of advice
flickr.com/julia_manzerova
If you ever have to work with genome assemblies, here is my top piece of advice.
flickr.com/thomashawk
Look at your data!
Look at your *input* data (what goes into the assembler) and *output* data (what comes out of the assembler). And really look at it (in a Unix terminal).
I am frequently asked to run CEGMA for people (to assess the completeness of their genome assembly). I track the CEGMA results (using a narrower set of
248 of the most conserved core genes) and also record the N50 scaffold length. Even with just two metrics, there is a lot of variation.
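As a reminder of what the N50 scaffold length measures, here is a minimal sketch in Python. It assumes the usual definition (the length L such that sequences of length L or longer contain at least half of the assembled bases); the function name and the example lengths are made up for illustration and are not taken from any of the assemblies discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is taken here as the length L such that sequences of
    length >= L together contain at least half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that takes us past the halfway point.
        if running >= half_total:
            return length
    raise ValueError("need at least one sequence length")


# Hypothetical scaffold lengths (bp), for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

N50 says nothing about whether the sequences are correct or complete, which is one reason to pair it with a gene-based measure such as CEGMA.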
I looked at the shortest 10 sequences in 34 different genome assemblies…
Genome assemblers can sometimes choose to exclude very short contigs/scaffolds from the final assembly. I looked at 34 assemblies to see whether the
shortest 10 sequences all had the same length (indicating a cutoff had been used). Five assemblies contained an abundance of 100 nt sequences. Are these
useful to anyone? Other assemblers had strange cutoff lengths (e.g. 141 bp).
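The check described above is easy to repeat on your own assembly. The following is a rough sketch, assuming a plain FASTA file; the function names and the toy sequences are hypothetical, and this is not the script that was used for the 34 assemblies.

```python
from collections import Counter


def fasta_lengths(lines):
    """Yield the length of each sequence in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header: emit the previous sequence, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def shortest_report(lengths, n=10):
    """Return the n shortest lengths plus the most common value among them."""
    shortest = sorted(lengths)[:n]
    if not shortest:
        return [], None, 0
    value, count = Counter(shortest).most_common(1)[0]
    return shortest, value, count


# Toy input, for illustration only.
toy_fasta = """>s1
ACGTACGTAC
>s2
ACGTACGTAC
>s3
ACGTACGTACGTACG
>s4
ACG
""".splitlines()

shortest, value, count = shortest_report(fasta_lengths(toy_fasta), n=3)
print(shortest, value, count)
```

For a real assembly you could pass an open file handle (for example a hypothetical `assembly.fa`) to `fasta_lengths` instead of the toy lines. If all, or nearly all, of the shortest sequences share one length, a minimum-length cutoff was probably applied.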
From a vertebrate genome assembly with 72,214 sequences…
Length of 10 shortest sequences:
100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp
In one particular assembly, nearly all of the sequence was represented by incredibly short scaffolds. The shortest sequence in the assembly was 3 bp.
Assemblies like this are not likely to be useful for anything. Unsurprisingly, this assembly didn’t contain any core genes.
For some of the CEGMA runs that I have made, I’ve noted which assembler was used…
These results show that any assembler can be used to make a bad genome assembly. There is no one assembler which consistently performs well (as
assessed by these two metrics). Note that these assemblies were generated from many different species.
Reasons to be cheerful
flickr.com/danielygo
After sounding quite pessimistic so far, here are some more positive reasons why genome assembly might be getting better.
Data from Lex Nederbragt’s blog, June 2014
Sequencing technologies continue to improve. 10,000 bp is sort of a ‘breakthrough’ length that would greatly assist genome assembly. Producing many
reads that are >10,000 bp means that you can sequence all the way through most eukaryotic repeats (which are one of the two major scourges for genome
assemblers).
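If you want to see how a read set measures up against that 10,000 bp mark, a quick summary like the one below is enough. This is only a sketch: the function name, the default threshold argument, and the toy read lengths are assumptions for illustration, not data from any sequencing run.

```python
def long_read_summary(read_lengths, threshold=10_000):
    """Summarise how much of a read set is longer than `threshold` bp."""
    reads = list(read_lengths)
    long_reads = [n for n in reads if n > threshold]
    total_bases = sum(reads)
    return {
        # Share of reads that exceed the threshold.
        "fraction_of_reads": len(long_reads) / len(reads) if reads else 0.0,
        # Share of all sequenced bases that sit in those long reads.
        "fraction_of_bases": sum(long_reads) / total_bases if total_bases else 0.0,
    }


# Toy read lengths (bp), for illustration only.
toy_lengths = [2_000, 4_000, 12_000, 22_000]
summary = long_read_summary(toy_lengths)
print(summary["fraction_of_reads"], summary["fraction_of_bases"])
```

The fraction of bases is usually the more informative number, because a small number of very long reads can carry a large share of the data.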
Long-read technology
Moleculo read data from Illumina BaseSpace, July 2013
Moleculo (now owned by Illumina) can take Illumina reads and somehow (not sure anyone knows the science behind how it works) combine them to make
much longer reads.
Long-read technology
From https://flxlexblog.wordpress.com (Lex Nederbragt's blog)
PacBio data
Library preparation is a hugely important part of the genome assembly process. Using the BluePippin for size selection during library prep greatly
increases the number of very long PacBio reads.
Long-read technology
MinION from Oxford Nanopore
Oxford Nanopore burst onto the scene and excited everyone, but it has been a long wait for people to get the chance to use MinION devices for
themselves. The UC Davis Genome Center recently received 3 MinIONs as part of the early access program.
Where is the data?
Nick Loman published the first real-world data on June 10th
Nick Loman was the first person to publish a ‘real world’ read from these devices.
He also shared the data from his entire run. Read length with this nanopore technology seems to be limited by how large your DNA fragments are, so it may
be possible to generate much longer reads.
Single chromosome assembly?
Breaking the problem up into smaller chunks may be one other way of tackling the genome assembly problem (though many single chromosomes in
eukaryotes are still very long).
Tackling heterozygosity
1000 Genomes Project plans to sequence 15 'trios' at high depth
The second major problem for genome assemblers is the heterozygosity that is present in most (diploid) genomes. The 1000 Genomes Project is trying
to tackle this by sequencing ‘trios’ (an individual plus both parents) and using the combined datasets to resolve the heterozygosity.
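To show why parents help, here is a toy sketch of the idea at a single site: if you know both parents' genotypes, you can often tell which of the child's two alleles came from which parent. This is an illustration only, with hypothetical function and variable names; it is not the 1000 Genomes Project's method, and it ignores sequencing errors and new mutations.

```python
def phase_child(child, mother, father):
    """Return (maternal_allele, paternal_allele), or None if it cannot be resolved.

    Each genotype is a pair of alleles at one site, e.g. ("A", "G").
    None is returned when the site is ambiguous or inconsistent with the parents.
    """
    a, b = child
    options = set()
    # Try both ways of assigning the child's alleles to the two parents.
    for maternal, paternal in ((a, b), (b, a)):
        if maternal in mother and paternal in father:
            options.add((maternal, paternal))
    if len(options) == 1:
        return options.pop()
    return None


# Mother can only give A, father can only give G: the site is resolved.
print(phase_child(("A", "G"), ("A", "A"), ("G", "G")))
# All three are heterozygous: this site alone cannot be resolved.
print(phase_child(("A", "G"), ("A", "G"), ("A", "G")))
```

Sites where all three individuals are heterozygous stay unresolved in this sketch, which is part of why heterozygosity remains hard even with trio data.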
Hi-C
✤ Nature Biotechnology, 31, 2013
✤ Burton et al.
✤ Selvaraj et al.
✤ Kaplan & Dekker
Hi-C is another new technology that might be able to improve the scaffolding step of genome assembly.
The future of genome assembly
Kwik-E-Assembler
Maybe one day, genome assembly will be as simple as downloading a sequence to your iPhone and clicking ‘assemble’. That day is still some time away.
The future of genome assembly
✤ At some point we will look back with embarrassment at this era.
✤ Assembly must, and will, get better, but...
✤ ...'perfect' genomes may remain elusive.
✤ Data management will remain an issue:
✤ the human genome -> human genomes -> tissue-specific genomes
Currently a lot of effort is spent generating huge datasets in order to produce a final genome assembly. There are hundreds of genome assemblies out
there which are very poor and incomplete. In many cases we don’t know just how good or bad they are. The pace of change in this field means it will often
be easier to simply resequence and reassemble a genome rather than attempt to work with a previous genome assembly.
Even if assembly improves, there will be lots of data to manage in the future. And people will have their genomes sequenced at different time points
throughout their lives (part of your ‘genome checkup’ at the doctor’s?).
Summary
✤ There is no real consensus on how to make a good genome assembly
✤ Try different assemblers, try different command-line options
✤ Decide what it is you want to get out of a genome assembly
✤ Look at your input and output data
✤ Wait 5 years and come back, we’ll (probably) have solved everything!
The last point on this slide is something that I repeat every 5 years.
Resources
✤ Lex Nederbragt’s blog - https://flxlexblog.wordpress.com
✤ Nick Loman’s blog - http://pathogenomics.bham.ac.uk/blog/
✤ Assemblathon twitter feed - https://twitter.com/assemblathon
These are good resources for staying on top of the latest and greatest news in the world of genome assembly.

This bioinformatics lesson is brought to you by the letter 'W'
 
This bioinformatics lesson is brought to you by the letter 'T'
This bioinformatics lesson is brought to you by the letter 'T'This bioinformatics lesson is brought to you by the letter 'T'
This bioinformatics lesson is brought to you by the letter 'T'
 
This bioinformatics lesson is brought to you by the letter 'D'
This bioinformatics lesson is brought to you by the letter 'D'This bioinformatics lesson is brought to you by the letter 'D'
This bioinformatics lesson is brought to you by the letter 'D'
 
Polish that presentation! 25 tips to bring clarity to your slides
Polish that presentation! 25 tips to bring clarity to your slidesPolish that presentation! 25 tips to bring clarity to your slides
Polish that presentation! 25 tips to bring clarity to your slides
 
10 tips for adding polish to presentations
10 tips for adding polish to presentations10 tips for adding polish to presentations
10 tips for adding polish to presentations
 
Database talk for Bits & Bites meeting
Database talk for Bits & Bites meetingDatabase talk for Bits & Bites meeting
Database talk for Bits & Bites meeting
 
Benchmarking short-read mapping programs
Benchmarking short-read mapping programsBenchmarking short-read mapping programs
Benchmarking short-read mapping programs
 
When is a genome finished?
When is a genome finished? When is a genome finished?
When is a genome finished?
 
Twitter 101 - an introduction to Twitter
Twitter 101  - an introduction to TwitterTwitter 101  - an introduction to Twitter
Twitter 101 - an introduction to Twitter
 

Dernier

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 

Dernier (20)

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 

Genome assembly: then and now — with notes — v1.1

  • 1. Genome assembly: then and now. Keith Bradnam. Image from Wellcome Trust. Author: Keith Bradnam, Genome Center, UC Davis. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This was a talk given on 2014-06-19 for the Genome Center’s Bioinformatics Core as part of a one-week workshop on using Galaxy.
  • 2. Contents: Sequencing 101; Genome assembly: then; Genome assembly: now; Assemblathon 1 & 2; Advice & Angst; The future. Image from flickr.com/photos/dougitdesign/5613967601/
  • 3. More info ✤ http://assemblathon.org ✤ http://gigasciencejournal.com ✤ http://twitter.com/assemblathon The Assemblathon 2 paper has been reviewed; we are just dealing with the reviewers' comments.
  • 4. Sequencing 101: A, C, G, T... Image from nlm.nih.gov. Fred Sanger invented the sequencing technology that helped sequence most of the good-quality genomes that are out there. He also won two Nobel prizes.
  • 5. Read Most sequencing technologies start with a sequencing read. A read could be as short as 25 bp (Solexa sequencing from a few years ago), or >15,000 bp (PacBio with latest chemistry). The record read length is currently held by PacBio and is over 50,000 bp.
  • 6. Read pair Most sequencing is done with pairs of connected reads, separated by a short interval whose approximate length is known. Not all read pairs will have exactly this ‘insert size’; there can be a LOT of variation. The two reads in a pair can also overlap with each other.
  • 7. Read pair Mate pair Mate pairs, also known as jumping pairs, have much larger inserts (thousands or tens of thousands of bp), but it is hard to make good mate pair libraries. Having very large inserts is very useful for the purposes of genome assembly. Again, there is a lot of variation in the actual size of inserts (as determined by mapping mate pairs back to a known reference).
  • 8. Sequence a whole lot of read pairs, and hopefully they will overlap with each other and allow you to start making contiguous sequences...
  • 9. Contigs ...which are better known as contigs.
  • 10-12. Scaffold [diagram: contigs joined by a run of Ns]. Mate pairs, or other information, can hopefully be used to connect contigs together into scaffolds. The gap of unknown sequence between contigs is filled with unknown bases (Ns). Some scaffold sequences can therefore end up containing a lot of Ns.
  • 13-14. Assembly size [diagram: 12 scaffolds, some containing runs of Ns; lengths in Mbp: 70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15, 5; total 200 Mbp]. Assembly size is simply the summed length of all scaffolds or contigs that are included in the final genome assembly. If you are calculating the assembly size from scaffolds, then some fraction of that final size will come from the Ns in scaffold sequences. Here we have a toy genome assembly, with 12 scaffolds totaling 200 Mbp.
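As a minimal sketch of the bookkeeping described in the notes above, assembly size can be reported both with and without the Ns that pad scaffold gaps. The function name and the tiny sequences are made up for illustration; a real assembly would be read from a FASTA file.

```python
def assembly_size(scaffolds, count_ns=True):
    """Sum the lengths of all scaffold sequences.

    With count_ns=False, N characters (unknown bases) are left out,
    which shows how much of the reported size is actual sequence.
    """
    total = 0
    for seq in scaffolds:
        seq = seq.upper()
        total += len(seq) if count_ns else len(seq) - seq.count("N")
    return total


# Hypothetical toy scaffolds, far shorter than anything real.
toy_scaffolds = ["ACGTNNNNACGT", "ACGTAC", "GGNNCC"]
size_with_ns = assembly_size(toy_scaffolds)
size_without_ns = assembly_size(toy_scaffolds, count_ns=False)
print(size_with_ns, size_without_ns)
```

Reporting both numbers side by side makes it obvious when a large share of an assembly is gap filler rather than sequence.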
  • 15-17. N50 length [same diagram; total 200 Mbp; running total so far: 70]. The most widely used measure to describe genome assemblies is the N50 length of scaffolds or contigs. This is essentially a length-weighted median, designed to be more informative than a crude mean length (which is not very useful if you end up with thousands of very short scaffolds/contigs). To calculate the N50 scaffold length, start with the length of the longest scaffold...
  • 18-19. N50 length [same diagram; total 200 Mbp; running total so far: 95]. If the running total has not reached 50% of the total assembly size (the 50% is why it is called N50), proceed to the next longest scaffold and add its length to the running total.
  • 22. N50 length [same diagram; total 200 Mbp]. The length of the contig or scaffold that takes the running total to 50% or beyond is what is reported as the N50 length. Here 70 + 25 + 20 = 115 Mbp, which passes the 100 Mbp halfway mark, so we have an N50 length of 20 Mbp.
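The procedure walked through on the N50 slides can be sketched in a few lines. This is an illustrative version rather than the code behind the talk; the function name is an assumption, and it uses the common convention that the running total only has to reach (not strictly exceed) half of the assembly size.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold or contig lengths.

    Walk through the lengths from longest to shortest, keeping a
    running total, and return the length that takes the running
    total to at least 50% of the summed length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached half of the assembly size
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


# Scaffold lengths (in Mbp) of the toy assembly on the slides.
toy_assembly = [70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15, 5]
toy_n50 = n50(toy_assembly)
print(sum(toy_assembly), toy_n50)
```

For the toy assembly the running total goes 70, 95, 115, so the third-longest scaffold (20 Mbp) is the one reported.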
  • 23-25. N50 length [same diagram, with two of the 5 Mbp scaffolds removed]. N50 may be more robust than using a simple mean length, but it can still be easily manipulated. What if we excluded the two shortest scaffolds from our assembly?
  • 26-28. N50 length [diagram: the remaining 10 scaffolds; total 190 Mbp]. Now the total assembly size is 10 Mbp smaller, which is only 5%, but the N50 increases to 25 Mbp, a 25% increase. If these were two different assemblies and you only saw an N50 of 25 Mbp vs. an N50 of 20 Mbp, you might think the assembly with the higher N50 was much better.
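The effect of dropping the two shortest scaffolds can be checked directly. The sketch below repeats a small N50 helper so that it runs on its own; as before, the names are illustrative, and it assumes the "reaches 50%" convention. That convention matters here: after trimming, 70 + 25 = 95 Mbp is exactly half of 190 Mbp.

```python
def n50(lengths):
    """Length that takes the sorted running total to >= 50% of the sum."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


full = [70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15, 5]  # Mbp, from the slides
trimmed = sorted(full)[2:]  # drop the two shortest scaffolds (5 Mbp each)

size_change = (sum(full) - sum(trimmed)) / sum(full)   # fraction of size lost
n50_change = (n50(trimmed) - n50(full)) / n50(full)    # fractional N50 gain
print(n50(full), n50(trimmed), size_change, n50_change)
```

Losing 5% of the assembly raises the N50 by 25%, which is why an N50 quoted without the assembly size says little.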
  • 29-31. N50 for two assemblies [diagram: a 208 Mbp assembly with N50 = 15 Mbp and a 190 Mbp assembly with N50 = 25 Mbp]. Here are another two fictional assemblies. The first assembly now has a lower N50 value, but this is purely because it contains more sequence (which comes from very short scaffolds). Do you want more sequence in your assembly, or fewer but longer sequences?
  • 32-34. NG50 for two assemblies [same diagram: 208 Mbp and 190 Mbp assemblies; expected genome size = 250 Mbp]. We prefer a measure called NG50. This does not use the assembly size, but instead uses the known (or estimated) genome size (the 'G' in NG50 refers to the Genome). We first used this measure in the Assemblathon 1 paper and (thankfully) it has seen some adoption by the assembly community.
  • 35-36. NG50 for two assemblies [same diagram; expected genome size = 250 Mbp; NG50 = 15 Mbp for both assemblies]. The NG50 of these two assemblies is now the same. We think that NG50 is a fairer way of comparing genome assemblies that might differ in their total size.
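NG50 differs from N50 only in the denominator: the halfway mark is taken from the expected genome size instead of the assembly size. The sketch below is illustrative (the function name and the `None` return for assemblies that never reach the halfway mark are assumptions). The scaffold lengths of the 208 Mbp assembly are not given on the slides, so the example uses the 200 Mbp and 190 Mbp toy assemblies from the N50 slides.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of an assembly given an expected genome size.

    Same walk as N50, but the target is 50% of the genome size.
    Returns None if the whole assembly covers less than half of
    the expected genome.
    """
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= genome_size:
            return length
    return None


genome_size = 250  # Mbp, the expected genome size from the slides
full = [70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15, 5]  # 200 Mbp assembly
trimmed = [70, 25, 20, 10, 10, 5, 5, 15, 15, 15]     # 190 Mbp assembly

print(ng50(full, genome_size), ng50(trimmed, genome_size))
```

Both toy assemblies need 70 + 25 + 20 + 15 = 130 Mbp to pass the 125 Mbp halfway mark, so they share an NG50 even though their N50 values differ.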
  • 37. You should check that high N50 values are not simply due to lots of Ns in the scaffolds! You should always look at your assembly before you do anything with it!
  • 38-41. Assembly 'x'. Size: 859 Mbp. Number of scaffolds: 28. N50 = 70.3 Mbp. Ns = 90.6% !!! This assembly, which I was asked to run CEGMA on, turned out to be 91% N! It is not going to be good for anything. There are lots of other ambiguity characters too (e.g. R for puRine).
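A quick composition check of the kind that would have caught assembly 'x' might look like the sketch below. It is a minimal illustration with invented sequences and an invented function name; it reports the fraction of Ns and, separately, the fraction of other non-ACGT characters such as R.

```python
from collections import Counter


def ambiguity_fractions(scaffolds):
    """Return (fraction of N, fraction of other non-ACGT characters)."""
    counts = Counter()
    for seq in scaffolds:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    n_count = counts["N"]
    # Anything that is not A, C, G, T or N is another ambiguity code.
    other_count = sum(c for base, c in counts.items() if base not in "ACGTN")
    return n_count / total, other_count / total


# Hypothetical toy scaffolds dominated by Ns.
toy_scaffolds = ["ACGTNNNNNN", "RYACNNNNNN"]
n_fraction, other_fraction = ambiguity_fractions(toy_scaffolds)
print(f"Ns: {n_fraction:.1%}, other ambiguity codes: {other_fraction:.1%}")
```

Running a check like this before looking at N50 values gives an early warning when long scaffolds are mostly gap.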
  • 42-44. Basic assembly metrics (metric: description). Assembly size: with or without very short contigs? N50 / NG50: for contigs and/or scaffolds. Coverage: when compared to a reference sequence. Errors: base errors from alignment to a reference sequence and/or the input read data. Number of genes: from comparison to a reference transcriptome and/or a set of known genes. And many, many more... Apart from assembly size and N50/NG50 length, there are many other ways to describe a genome assembly.
  • 45-46. Genome assembly: back in the day... (1998). How were genomes assembled back in the late 1990s, when genome sequencing projects were starting to make the news?
  • 47-53. Genome assembly: then. Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Haploid / low heterozygosity genome ✓ Accurate & long reads ✓ Resources (time, money, people) ✓ Genome sequencing projects often had a fantastic amount of supporting material which helped put the genome together. They were further helped by targeting genomes which had low heterozygosity. And of course this was all done with Sanger sequencing, which gave long, accurate reads.
  • 54. So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads, and detailed maps??? So hopefully this gave us a useful set of finished genomes, right?
  • 55-57. Arabidopsis thaliana ✤ 2000: published genome size = 125 Mbp ✤ 2007: genome size = 157 Mbp ✤ 2012: genome size = 135 Mbp ✤ Amount sequenced = 119 Mbp ✤ Ns = 0.2% of genome. Many published genome sizes are based on estimates, which can be wrong. As more and more of the Arabidopsis genome was sequenced, the estimate of its size had to be revised. So between 2000 and 2007 more sequence was produced, but paradoxically the genome became less complete because the estimate of its size went up. Now the estimate has come back down again. But the genome remains unfinished.
  • 58. Two views of the same gene. The top sequence is taken from a web-based view of a gene in the Arabidopsis thaliana genome sequence (information taken from the TAIR database). There is a G missing compared to the same gene sequence that is available in a file of gene sequences to download from their FTP site.
  • 59. Two views of the same gene. Top: from genome sequence view on TAIR web site. Bottom: from gene sequence file on TAIR FTP site. This is one of eight cases where sequencing has not confirmed the existence of a base, but leaving the base out leads to a frameshift and a truncated protein. The G has been added by humans, and the difference between the two versions of the sequence is the sort of thing that gives bioinformaticians nightmares. How are you meant to know what these differences mean and which version is the correct one (unless you email a TAIR curator like I did)?
  • 60-61. Drosophila melanogaster ✤ Genome published 1998 ✤ Heterochromatin finished 2007 ✤ Ns = 4% of genome. The fly genome was 'finished' in 1998. But this was only really the easy-to-sequence portion of the genome (the euchromatin). The trickier heterochromatin was sequenced as a separate project that didn't finish until almost a decade later. The fly genome remains unfinished.
  • 62-65. Caenorhabditis elegans ✤ Genome published 1998 ✤ 2004: last N removed ✤ 1998–2014: genome sequence changes ✤ November 2012: 558 insertions, 230 deletions, 614 substitutions. The worm genome has no unknown bases in it. However, since its publication the genome sequence has continued to be refined as errors are corrected. The last batch of changes was made recently (November 2012). So almost 15 years after the genome was published, it was still possible to find over 1,400 errors in one of the best-characterized genome sequences that exists.
  • 66. Saccharomyces cerevisiae ✤ Genome published 1997 ✤ 12 Mbp genome ✤ 1,653 changes to genome since 1997 Likewise in yeast. The first eukaryotic genome sequence continues to receive fixes to correct the sequence. The last set of changes was made in 2011. These changes affected coding sequences, not just intergenic and intronic DNA.
  • 67. Saccharomyces cerevisiae ✤ Genome published 1997 ✤ 12 Mbp genome ✤ 1,653 changes to genome since 1997 ✤ Last changes made in 2011 Likewise in yeast. The first eukaryotic genome sequence continues to receive fixes to correct the sequence. The last set of changes was made in 2011. These changes affected coding sequences, not just intergenic and intronic DNA.
  • 68. Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Haploid / low heterozygosity genome ✓ Accurate & long reads ✓ Resources (time, money, people) ✓ Genome assembly: then And all of this was done in an era when we had all of these supporting materials.
  • 69. Genetic maps ✗ Physical maps ✗ Understanding of target genome ✗ Haploid / low heterozygosity genome ✗ Accurate & long reads ✗ Resources (time, money, people) ✗ Genome assembly: now We don't have these now! Genome sequencing no longer requires an international consortium, rather it could be a project for a Grad student.
  • 70. Assembling & finishing a genome is not easy! It was never easy, even when we had access to lots of resources to help us put together genomes. And it is not easy now. Don't be fooled into thinking that because there are many published genome sequences, these sequences represent the absolute ideal genome sequence. And don't be fooled into thinking that just because you can afford to sequence a genome, you will have the resources to make a useful assembly from that sequence data.
  • 71. Assemblathons A new idea is born Image from flickr.com/photos/dullhunk/4422952630
  • 72. The Assemblathon was born out of the Genome 10K project.
  • 73. If you sequence 10,000 genomes... ...you need to assemble 10,000 genomes The Assemblathon was born out of the Genome 10K project.
  • 74. How many assembly tools are out there? There are many, many tools out there for assembling, or helping to assemble, a genome sequence (there are 114 on this page). People will not have the time or patience (or skill) to try more than a handful of these. But…
  • 75. How many assembly tools are out there? bambus2 Ray Celera MIRA ALLPATHS-LG SGA Curtain Metassembler Phusion ABySS Amos Arapan CLC Cortex DNAnexus DNA Dragon Edena Forge Geneious IDBA Newbler PRICE PADENA PASHA Phrap TIGR Sequencher SeqMan NGen SHARCGS SOPRA SSAKE SPAdes Taipan VCAKE Velvet Arachne PCAP GAM Monument Atlas ABBA Anchor ATAC Contrail DecGPU GenoMiner Lasergene PE-Assembler Pipeline Pilot QSRA SeqPrep SHORTY fermi Telescoper Quast SCARPA Hapsembler HapCompass HaploMerger SWiPS GigAssembler MSR-CA MaSuRCA GARM Cerulean TIGRA ngsShoRT PERGA SOAPdenovo REAPR FRCBam EULER-SR SSPACE Opera mip gapfiller image PBJelly HGAP FALCON Dazzler GGAKE A5 CABOG SHRAP SR-ASM SuccinctAssembly SUTTA Ragout Tedna Trinity SWAP-Assembler SILP3 AutoAssemblyD KGBAssembler MetAMOS iMetAMOS MetaVelvet-SL KmerGenie Nesoni Pilon Platanus CGAL GAGM Enly BESST Khmer GRIT IDBA-MTP dipSPAdes WhatsHap SHEAR ELOPER OMACC There are many, many tools out there for assembling, or helping to assemble, a genome sequence (there are 114 on this page). People will not have the time or patience (or skill) to try more than a handful of these. But…
  • 76. How many assembly tools are out there? …people want to know, which is the best?
  • 77. How many assembly tools are out there? [Tool-name cloud repeated from slide 75.] …people want to know, which is the best?
  • 78. How many assembly tools are out there? [Tool-name cloud repeated from slide 75.] Which is the best? …people want to know, which is the best?
  • 79. Comparing assemblers ✤ Can't fairly compare two assemblers if they: However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.
  • 80. Comparing assemblers ✤ Can't fairly compare two assemblers if they: ✤ produced assemblies from different species However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.
  • 81. Comparing assemblers ✤ Can't fairly compare two assemblers if they: ✤ produced assemblies from different species ✤ assembled same species, but used sequence data from different sequencing technologies However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.
  • 82. Comparing assemblers ✤ Can't fairly compare two assemblers if they: ✤ produced assemblies from different species ✤ assembled same species, but used sequence data from different sequencing technologies ✤ used same sequencing technologies but have different sequence libraries However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.
  • 83. Comparing assemblers ✤ Can't fairly compare two assemblers if they: ✤ produced assemblies from different species ✤ assembled same species, but used sequence data from different sequencing technologies ✤ used same sequencing technologies but have different sequence libraries ✤ Even using different options for the same assembler may produce very different assemblies! However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.
  • 84. The PRICE genome assembler has 52 command-line options!!! This assembler has 52 command-line options! Not all of these will affect the resulting assembly, but many of them will.
  • 85. The PRICE genome assembler has 52 command-line options!!! how many of them are you going to learn? This assembler has 52 command-line options! Not all of these will affect the resulting assembly, but many of them will.
  • 86. A genome assembly competition That's where the Assemblathon came in.
  • 87. Genome assembly contests An attempt to standardize some aspects of the genome assembly process Others have been trying to do the same thing, e.g. GAGE and dnGASP. If you can at least give different assemblers the same input sequence data, you can start to take account of one of the biggest variables in genome assembly.
  • 88. Assemblathon 1 ✤ 2010–2011 ✤ Used synthetic data ✤ Small genome (~100 Mbp) ✤ We knew the answer! It is easier to judge a tool when you know what the final answer should look like. However, many people who work on developing assemblers would prefer to work with real data...
  • 89. Here we go again ...which is where Assemblathon 2 came in.
  • 90. Type of data Number of genomes Size of genomes Do we know the answer? Assemblathon 1 Synthetic 1 Small ✓ Assemblathon 2 became a much bigger contest compared to Assemblathon 1.
  • 91. Type of data Number of genomes Size of genomes Do we know the answer? Assemblathon 1 Synthetic 1 Small ✓ Assemblathon 2 Real 3 Large ✗ Assemblathon 2 became a much bigger contest compared to Assemblathon 1.
  • 92. Melopsittacus undulatus, Boa constrictor constrictor, Maylandia zebra A budgie, a cichlid fish from Lake Malawi, and a reptile.
  • 93. Bird Snake Fish Let's simplify the names for the rest of the talk.
  • 94. Why these three species? There is no special reason why these species were used. People had a need to sequence the genomes, and some companies were willing to donate sequences.
  • 95. Why these three species? Because they were there There is no special reason why these species were used. People had a need to sequence the genomes, and some companies were willing to donate sequences.
  • 96. Species Bird Fish Snake Estimated genome size 1.2 Gbp 1.0 Gbp 1.6 Gbp Assemble this! Lots of sequence data was provided for the bird. Mate pair and read pair libraries were available for all Illumina datasets. This probably doesn’t reflect a real world scenario. Not everyone can afford over 400x of sequence coverage!
  • 97. Species Bird Fish Snake Estimated genome size 1.2 Gbp 1.0 Gbp 1.6 Gbp Illumina 285x (14 libraries) 192x (8 libraries) 125x (4 libraries) Assemble this! Lots of sequence data was provided for the bird. Mate pair and read pair libraries were available for all Illumina datasets. This probably doesn’t reflect a real world scenario. Not everyone can afford over 400x of sequence coverage!
  • 98. Species Bird Fish Snake Estimated genome size 1.2 Gbp 1.0 Gbp 1.6 Gbp Illumina 285x (14 libraries) 192x (8 libraries) 125x (4 libraries) Roche 454 16x (3 libraries) Assemble this! Lots of sequence data was provided for the bird. Mate pair and read pair libraries were available for all Illumina datasets. This probably doesn’t reflect a real world scenario. Not everyone can afford over 400x of sequence coverage!
  • 99. Species Bird Fish Snake Estimated genome size 1.2 Gbp 1.0 Gbp 1.6 Gbp Illumina 285x (14 libraries) 192x (8 libraries) 125x (4 libraries) Roche 454 16x (3 libraries) PacBio 10x (2 libraries) Assemble this! Lots of sequence data was provided for the bird. Mate pair and read pair libraries were available for all Illumina datasets. This probably doesn’t reflect a real world scenario. Not everyone can afford over 400x of sequence coverage!
  • 100. Who took part? Lots of teams took part. Not just from the big sequencing/genome centers.
  • 102. Who took part? 21 teams! 43 assemblies! 52,013,623,777 bp of sequence Lots of teams took part. Not just from the big sequencing/genome centers.
  • 103. Species Bird Fish Snake Competitive entries 12 10 12 Entries There were evaluation entries (not eligible to be declared the winner) allowed in addition to competition entries (only 1 per team).
  • 104. Species Bird Fish Snake Competitive entries 12 10 12 Evaluation entries 3 6 0 Entries There were evaluation entries (not eligible to be declared the winner) allowed in addition to competition entries (only 1 per team).
  • 105. Goals Defining quality was really the toughest part of organizing the Assemblathon competitions. Lots of people have lots of (different) ideas as to what contributes to assembly quality.
  • 106. Goals ✤ Assess 'quality' of assemblies Defining quality was really the toughest part of organizing the Assemblathon competitions. Lots of people have lots of (different) ideas as to what contributes to assembly quality.
  • 107. Goals ✤ Assess 'quality' of assemblies ✤ Define quality! Defining quality was really the toughest part of organizing the Assemblathon competitions. Lots of people have lots of (different) ideas as to what contributes to assembly quality.
  • 108. Goals ✤ Assess 'quality' of assemblies ✤ Define quality! ✤ Produce ranking of assemblies for each species Defining quality was really the toughest part of organizing the Assemblathon competitions. Lots of people have lots of (different) ideas as to what contributes to assembly quality.
  • 109. Goals ✤ Assess 'quality' of assemblies ✤ Define quality! ✤ Produce ranking of assemblies for each species ✤ Produce ranking of assemblers across species? Defining quality was really the toughest part of organizing the Assemblathon competitions. Lots of people have lots of (different) ideas as to what contributes to assembly quality.
  • 110. Who did what? (Person/group: jobs) Me, Ian Korf, and Joseph Fass: perform various analyses of all assemblies; David Schwartz et al.: produce & evaluate optical maps; Jay Shendure et al.: produce Fosmid sequences (bird & snake only); Martin Hunt & Thomas Otto: perform REAPR analysis; Dent Earl & Benedict Paten: help with meta-analysis of final rankings. Lots of different groups were involved in the organization and assessment of the Assemblathon 2 entries.
  • 111. 91 co-authors! flickr.com/photos/jamescridland/613445810 Hard to get agreement on how best to interpret the results. Some analyses and interpretations in the Assemblathon 2 paper end up being compromises.
  • 113. Lots of results! A screen grab of my master spreadsheet that contains all of the numerical results. Each row represents a submitted assembly, and each column represents a different assembly metric.
  • 114. There were a lot of metrics. Many of these were not important or highly informative (e.g. %N).
  • 115. 102 different metrics! There were a lot of metrics. Many of these were not important or highly informative (e.g. %N).
  • 116. 10 key metrics We focused on 10 of 102 metrics that we thought were a) useful and b) captured different aspects of an assembly's quality.
  • 117. Key Metric Description 1 NG50 scaffold length 2 NG50 contig length 3 Amount of assembly in 'gene-sized' scaffolds 4 Number of 'core genes' present 5 Fosmid coverage 6 Fosmid validity 7 Short-range scaffold accuracy 8 Optical map: level 1 9 Optical map: levels 1–3 10 REAPR summary score The 10 key metrics.
  • 118. Key Metric Description 1 NG50 scaffold length 2 NG50 contig length 3 Amount of assembly in 'gene-sized' scaffolds 4 Number of 'core genes' present 5 Fosmid coverage 6 Fosmid validity 7 Short-range scaffold accuracy 8 Optical map: level 1 9 Optical map: levels 1–3 10 REAPR summary score In the remainder of this talk, I’ll just focus on some of these metrics. See the Assemblathon 2 paper for more details.
  • 119. 1) Scaffold NG50 lengths ✤ Can calculate NG50 length for each assembly! ✤ But also calculate NG60, NG70 etc.! ✤ Plot all results as a graph An N50 (or NG50) value on its own doesn't tell you that much. Ideally you should always be aware of the total assembly size and the distribution of lengths when comparing assemblies. You can do this by not only calculating NG50, but NG1..NG100. NG1 would be the length of scaffold that captures 1% of the estimated genome size (when summing scaffolds from longest to shortest).
  • 120. 1) Scaffold NG50 lengths Scaffold length is on a log axis and team identifiers are shown in the legend. The black dashed line shows the NG50 value, but the point where each series starts on the left shows the length of the longest scaffold in that assembly. Also, if the NG100 value is greater than zero, then that assembly is at least as big as the known/estimated genome size.
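A minimal sketch of the NG1..NG100 calculation described in these notes. The function name `ng_length` and the toy scaffold lengths are made up for illustration; this is not the code used for the Assemblathon analysis. It sums scaffolds from longest to shortest and reports the length at which the running total reaches the requested percentage of the estimated genome size, returning 0 when the assembly is too small to get there.

```python
def ng_length(lengths, genome_size, percent):
    """Return the NG<percent> length, or 0 if the assembly never reaches it."""
    target = genome_size * percent / 100
    running_total = 0
    # Sum scaffolds from longest to shortest
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


# Toy data (hypothetical): five scaffolds and a 1,000 bp estimated genome size
scaffolds = [400, 300, 200, 100, 50]
genome_size = 1000

ng50 = ng_length(scaffolds, genome_size, 50)
# One value per percentage point, ready to plot as a curve
ng_curve = [ng_length(scaffolds, genome_size, p) for p in range(1, 101)]
```

Plotting `ng_curve` for each assembly (length on a log axis) gives the kind of graph shown on this slide. A curve that drops to zero before reaching 100% belongs to an assembly smaller than the estimated genome size.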
  • 121. 2) Contig vs scaffold NG50 We did the same thing for contig NG50 as well as scaffold NG50. The two measures are sometimes, but not always, correlated. The two highlighted data points show outliers for bird assemblies, reflecting assemblies that are good at making long contigs *or* good at making long scaffolds, but not both.
  • 124. 3) Gene-sized scaffolds It is great to have long scaffolds, but maybe for many questions that you might be interested in (e.g. studying codon usage bias), you only need to have scaffolds that have a good chance of capturing a full-length gene.
  • 125. 3) Gene-sized scaffolds ✤ Some assembly folks get a little obsessed by length! It is great to have long scaffolds, but maybe for many questions that you might be interested in (e.g. studying codon usage bias), you only need to have scaffolds that have a good chance of capturing a full-length gene.
  • 126. 3) Gene-sized scaffolds ✤ Some assembly folks get a little obsessed by length! ✤ How long is 'long enough' for a scaffold? It is great to have long scaffolds, but maybe for many questions that you might be interested in (e.g. studying codon usage bias), you only need to have scaffolds that have a good chance of capturing a full-length gene.
  • 127. 3) Gene-sized scaffolds ✤ Some assembly folks get a little obsessed by length! ✤ How long is 'long enough' for a scaffold? ✤ What if you just wanted to find genes? It is great to have long scaffolds, but maybe for many questions that you might be interested in (e.g. studying codon usage bias), you only need to have scaffolds that have a good chance of capturing a full-length gene.
  • 128. 3) Gene-sized scaffolds ✤ Some assembly folks get a little obsessed by length! ✤ How long is 'long enough' for a scaffold? ✤ What if you just wanted to find genes? ✤ Average vertebrate gene = ~25 Kbp It is great to have long scaffolds, but maybe for many questions that you might be interested in (e.g. studying codon usage bias), you only need to have scaffolds that have a good chance of capturing a full-length gene.
  • 129. 3) Gene-sized scaffolds The red data series orders the bird assemblies in order of their NG50 scaffold length. The blue line shows the percentage of the estimated genome size that is present in scaffolds of 25 Kbp or longer. Most assemblies, even if they have a much shorter *average* scaffold length, may contain many scaffolds that are still long enough to contain a single gene.
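The 'gene-sized scaffolds' metric can be sketched in a few lines. This assumes the 25 Kbp threshold from the slide; the function name and the toy lengths are hypothetical.

```python
def percent_in_long_scaffolds(lengths, genome_size, min_length=25_000):
    """Percentage of the estimated genome size held in scaffolds >= min_length."""
    long_total = sum(length for length in lengths if length >= min_length)
    return 100 * long_total / genome_size


# Toy data (hypothetical): three scaffolds pass the 25 Kbp cutoff, two do not
toy_scaffolds = [120_000, 60_000, 30_000, 24_999, 10_000]
pct = percent_in_long_scaffolds(toy_scaffolds, genome_size=300_000)
print(f"{pct:.1f}% of the estimated genome is in gene-sized scaffolds")
```

Because the denominator is the estimated genome size rather than the assembly size, an assembly that is much larger than the estimate could score above 100%.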
  • 130. 4) Core genes A previously developed tool (CEGMA) was used to see how many 'core genes' (extremely highly conserved) are present in each assembly. These genes have been identified in six different eukaryotes and are expected to be present in all eukaryotes. Note that CEGMA finds genes where a full-length (or nearly full-length) gene is present within a single scaffold. Many core genes might be present, but split across scaffolds.
  • 131. 4) Core genes ✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach) A previously developed tool (CEGMA) was used to see how many 'core genes' (extremely highly conserved) are present in each assembly. These genes have been identified in six different eukaryotes and are expected to be present in all eukaryotes. Note that CEGMA finds genes where a full-length (or nearly full-length) gene is present within a single scaffold. Many core genes might be present, but split across scaffolds.
  • 132. 4) Core genes ✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach) ✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs) A previously developed tool (CEGMA) was used to see how many 'core genes' (extremely highly conserved) are present in each assembly. These genes have been identified in six different eukaryotes and are expected to be present in all eukaryotes. Note that CEGMA finds genes where a full-length (or nearly full-length) gene is present within a single scaffold. Many core genes might be present, but split across scaffolds.
  • 133. 4) Core genes ✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach) ✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs) ✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens A previously developed tool (CEGMA) was used to see how many 'core genes' (extremely highly conserved) are present in each assembly. These genes have been identified in six different eukaryotes and are expected to be present in all eukaryotes. Note that CEGMA finds genes where a full-length (or nearly full-length) gene is present within a single scaffold. Many core genes might be present, but split across scaffolds.
  • 134. 4) Core genes ✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach) ✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs) ✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens ✤ How many full-length CEGs are in each assembly? A previously developed tool (CEGMA) was used to see how many 'core genes' (extremely highly conserved) are present in each assembly. These genes have been identified in six different eukaryotes and are expected to be present in all eukaryotes. Note that CEGMA finds genes where a full-length (or nearly full-length) gene is present within a single scaffold. Many core genes might be present, but split across scaffolds.
  • 135. 4) Core genes Species Bird Fish Snake Core genes (out of 458) Best individual assembly 420 436 438 In the three species, most of the core genes were present across all assemblies, but individual assemblies typically lacked several core genes.
  • 136. 4) Core genes Species Bird Fish Snake Core genes (out of 458) Best individual assembly 420 436 438 Across all assemblies 442 455 454 In the three species, most of the core genes were present across all assemblies, but individual assemblies typically lacked several core genes.
  • 137. 4) Core genes These results show the number of CEGMA genes that were present in any one assembly as a percentage of all possible CEGMA genes (i.e. those present across all assemblies for each species).
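A small sketch of the normalization described here: each assembly's core gene count is expressed as a percentage of the core genes found in at least one assembly for that species. The assembly names and gene identifiers are invented toy data, and this does not run CEGMA itself; it only assumes you already have the set of core genes reported for each assembly.

```python
# Hypothetical input: core genes reported as full-length in each assembly
cegs_found = {
    "asm_A": {"CEG1", "CEG2", "CEG3"},
    "asm_B": {"CEG2", "CEG3", "CEG4"},
    "asm_C": {"CEG1", "CEG4"},
}

# Core genes present in at least one assembly (the "all possible" set)
all_found = set().union(*cegs_found.values())

# Each assembly's count as a percentage of that union
percent_of_union = {
    name: 100 * len(found) / len(all_found)
    for name, found in cegs_found.items()
}

for name, pct in percent_of_union.items():
    print(f"{name}: {pct:.1f}% of core genes seen in any assembly")
```

Using the union as the denominator separates genes that are hard to find in that species at all from genes that one particular assembly failed to capture in a single scaffold.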
  • 138. 4) Core genes Example of one core gene predicted in bird assemblies. CEGMA gene predictions are available as supplementary material with the paper.
    ABYSS  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
    BCM    MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
    CRACS  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
    CURT   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
    GAM    MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
    MERAC  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
    PHUS   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
    RAY    MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
    SGA    MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
    SYMB   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
    SOAP   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
           ************************************************       *****

    ABYSS  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
    BCM    FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
    CRACS  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
    CURT   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
    GAM    FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNNLPHTHI
    MERAC  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
    PHUS   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
    RAY    FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
    SGA    FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
    SYMB   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
    SOAP   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
           ******************************************************

    ABYSS  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
    BCM    ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
    CRACS  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
    CURT   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
    GAM    YGHALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLK------------------
    MERAC  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
    PHUS   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
    RAY    ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
    SGA    ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
    SYMB   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
    SOAP   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
              ***************************************
  • 139. 8 & 9) Optical maps For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites.
  • 140. 8 & 9) Optical maps ✤ Stretch out DNA For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites.
  • 141. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites.
  • 142. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes ✤ Note lengths of fragments For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites.
  • 143. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes ✤ Note lengths of fragments ✤ Compare to in silico digest of scaffolds For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites.
  • 144. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes ✤ Note lengths of fragments ✤ Compare to in silico digest of scaffolds ✤ Not all scaffolds suitable for analysis For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites.
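The "in silico digest" step can be sketched as follows. This is a simplified illustration, not the optical mapping pipeline used in the paper: it cuts at the first base of each recognition site (a real enzyme cuts at a defined offset within or near the site), and the thresholds passed to `is_suitable` are made-up values, since the actual minimum length and site count are not given in these notes.

```python
def cut_sites(sequence, site):
    """Return the start position of every occurrence of `site` in `sequence`."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence, site):
    """Fragment lengths after cutting at each site (simplified cut position)."""
    boundaries = [0] + cut_sites(sequence, site) + [len(sequence)]
    # Skip zero-length fragments (e.g. a site at the very start of the scaffold)
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


def is_suitable(sequence, site, min_length, min_sites):
    """Scaffold must be long enough *and* contain enough restriction sites."""
    return len(sequence) >= min_length and len(cut_sites(sequence, site)) >= min_sites


# Toy scaffold (hypothetical) with two GAATTC sites
toy_scaffold = "AAAA" + "GAATTC" + "CCC" + "GAATTC" + "TT"
fragments = fragment_lengths(toy_scaffold, "GAATTC")
print(fragments)
```

The ordered list of fragment lengths is what would then be compared against the fragment lengths measured on the optical map; that alignment step is not shown here.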
  • 145. 8 & 9) Optical maps Image from University of Wisconsin-Madison An example of an optical map. After cutting, each DNA fragment is measured to estimate its length. Optical map results were divided into three categories (levels 1–3).
  • 146. 8 & 9) Optical maps White bars: total length of scaffolds that were suitable for optical map analysis. Dark blue: global alignments of scaffolds to maps (these are the best quality). Light blue: global alignments with more permissive thresholds. Orange bars: local alignments. We used level 1 (dark blue) as one key metric and levels 1+2+3 as a second key metric. The MLK assembly is good, *relatively* speaking (high percentage of suitable scaffolds are in level 1 category), but we record scores on an absolute basis (MERAC highest for level 1, SOAP highest for levels 1+2+3).
  • 147. 8 & 9) Optical maps Fish optical map results were much worse than in bird, with very few assemblies having scaffolds with 'level 1' global alignments to the optical map. SGA had the most level 1 coverage, but a much lower amount of sequence that was alignable at any level (1, 2, or 3).
  • 148. 8 & 9) Optical maps Snake optical map results were intermediate compared to bird and fish.
  • 149. What does this all mean?
  • 150. 102 metrics per assembly, 10 key metrics, 1 final ranking Using the 10 key metrics, we combined the results to produce a single score for each assembly by which to rank them.
  • 151. Assembly CRACS SYMB PHUS BCM SGA MERAC ABYSS SOAP RAY GAM CURT Number of core genes 438 436 435 434 433 430 429 428 422 415 360 Although we did take an average rank from the 10 individual rankings, we preferred to use a Z-score approach. Each assembly was scored based on the total number of standard deviations from the average of each metric. This rewards/penalizes assemblies with very high/low scores in individual metrics. The above results are from the CEGMA metric in bird.
  • 152. Assembly CRACS SYMB PHUS BCM SGA MERAC ABYSS SOAP RAY GAM CURT Number of core genes 438 436 435 434 433 430 429 428 422 415 360 Rank 1 2 3 4 5 6 7 8 9 10 11 Although we did take an average rank from the 10 individual rankings, we preferred to use a Z-score approach. Each assembly was scored based on the total number of standard deviations from the average of each metric. This rewards/penalizes assemblies with very high/low scores in individual metrics. The above results are from the CEGMA metric in bird.
  • 153. Assembly CRACS SYMB PHUS BCM SGA MERAC ABYSS SOAP RAY GAM CURT Number of core genes 438 436 435 434 433 430 429 428 422 415 360 Rank 1 2 3 4 5 6 7 8 9 10 11 Z-score +0.68 +0.59 +0.54 +0.49 +0.44 +0.30 +0.25 +0.21 –0.08 –0.41 –3.02 Although we did take an average rank from the 10 individual rankings, we preferred to use a Z-score approach. Each assembly was scored based on the total number of standard deviations from the average of each metric. This rewards/penalizes assemblies with very high/low scores in individual metrics. The above results are from the CEGMA metric in bird.
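A minimal sketch of the Z-score calculation, using the core gene counts from this slide. Using the population standard deviation (`statistics.pstdev`) is an assumption on my part: the notes do not say which form was used, but it reproduces the rounded Z-scores shown on the slide.

```python
from statistics import mean, pstdev

# Core gene counts from the slide (CEGMA metric)
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}

counts = list(core_genes.values())
mu = mean(counts)
sigma = pstdev(counts)  # assumption: population SD, see lead-in

# Number of standard deviations each assembly sits above or below the mean
z_scores = {name: (count - mu) / sigma for name, count in core_genes.items()}

for name, z in z_scores.items():
    print(f"{name:6s} {z:+.2f}")
```

Note how the one very low count (360) earns a Z-score of about -3, whereas a simple rank would only record it as last place, one step below the tenth-placed assembly. Repeating this for every key metric and summing the Z-scores per assembly gives the single score used for the final ranking.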
  • 154. This graph shows the final rankings of bird assemblies based on their sum Z-scores. Assemblies in red are the evaluation entries. The error bars reflect what would be the highest and lowest sum Z-score if we had used any of the possible combinations of 9 key metrics rather than 10. Note that the highest ranked bird assembly was an evaluation assembly by Baylor College of Medicine (BCM), their competitive entry ranked number 2.
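The error bars described for the ranking graph amount to a leave-one-out calculation: drop each metric in turn and see how far the sum Z-score moves. A minimal sketch, assuming one Z-score per key metric for a single assembly; the function name and the example values are hypothetical, not data from the paper.

```python
def sum_z_range(metric_z):
    """Return (sum of all Z-scores, lowest and highest sum with one metric left out)."""
    total = sum(metric_z)
    leave_one_out = [total - z for z in metric_z]
    return total, min(leave_one_out), max(leave_one_out)


# Hypothetical Z-scores for one assembly across 10 key metrics.
example = [2.0, 1.5, 1.5, 0.5, -0.5, 1.0, 0.25, 0.0, 0.75, 0.5]
total, low, high = sum_z_range(example)
print(f"sum Z = {total}, range with 9 metrics = {low} to {high}")
```

An assembly whose score depends heavily on one very good (or very bad) metric gets a long error bar, which is what makes the rankings look so variable.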
  • 159. In fish, the BCM entry ranked 1st, though the error bars suggest there is much variability. The lack of Fosmid data means that there are only 7 key metrics rather than 10.
  • 160. Snake seemed to be the only species where it looked like one assembler outperformed all others (SGA, in this case). We will return to this issue. Note that there were no evaluation entries for snake.
  • 161. Another way of looking at all of this data is to plot the Z-scores for each metric as a heat map (red = higher Z-scores).
  • 162. A parallel coordinates plot is another way of trying to show all of the information at once.
  • 163. What does this all mean?
  • 164. No really, what does this all mean? Still a bit hard to make sense of the overall rankings. What are the main findings from our paper?
  • 165. Some conclusions ✤ Very hard to find assemblers that performed well across all 10 key metrics ✤ Assemblers that perform well in one species do not always perform as well in another ✤ Bird & snake assemblies appear better than fish ✤ No real 'winner' for bird and fish. This type of news is perhaps disappointing to many.
  • 166. SGA — best assembler for snake? Even if we had happened to use 9 key metrics rather than 10, and even if we threw out the metric where SGA performed the best, it would still probably rank 1st. So is that the end of the story?
  • 168. Rank of the snake SGA assembly in each key metric:
      Description                                   Rank
      NG50 scaffold length                          2
      NG50 contig length                            5
      Amount of assembly in 'gene-sized' scaffolds  7
      Number of 'core genes' present                5
      Fosmid coverage                               2
      Fosmid validity                               2
      Short-range scaffold accuracy                 3
      Optical map: level 1                          2
      Optical map: levels 1–3                       1
      REAPR summary score                           2
    SGA ranked 1st in only one of the ten key metrics and ranked 7th in another. So it is a good assembler *on average*. But if one of these metrics is highly important to you, you may want to use an assembler that ranked higher in that metric.
  • 171. Best assembler across species? (Excluding evaluation entries)
      Assembler    Number of 1st places (out of 27)
      BCM          5
      Meraculous   4
      Symbiose     4
      Ray          3
    Not all teams entered assemblies for all three species, but many teams submitted entries for 2 or 3 of the species. In theory, if a team submitted an entry for all species, and if their assembler ranked 1st in all metrics, they could achieve 1st place twenty-seven times (10 + 10 + 7 for fish). So what was the best assembler across species, as judged by total number of 1st places? It is BCM. But Ray comes 4th with three 1st places.
  • 173. Ray performance
      Species   Final ranking
      Bird      7th
      Fish      7th
      Snake     9th
    However, Ray ranks much lower when looking at its performance across all key metrics. So some assemblers do very well in specific measures and not so well in others, while other assemblers do moderately well across lots of metrics (e.g. SGA).
  • 174. We found it interesting that the best bird assembly was the evaluation entry by Baylor College of Medicine. What is different about this entry compared to their competitive entry?
  • 180. BCM bird assemblies
      Assembler                   BCM - evaluation   BCM - competitive
      Final rank                  1                  2
      NGS data used in assembly   Illumina + 454     Illumina + 454 + PacBio
      Coverage Z-score            +2.0               –0.3
      Validity Z-score            +1.4               –0.8
      NG50 contig Z-score         +1.5               +2.7
    The only difference is that the BCM competitive entry included PacBio data, and somehow this led to the paradoxical situation where including more sequence in the assembly produced lower measures for coverage and validity (from the Fosmids), though one key metric (NG50 contig length) did improve.
  • 183. BCM evaluation scaffold: NNNNNNNNNNNNNNNNNNN. BCM competition scaffold: NNNNNNNNNNNNNNNNNNN, plus PacBio sequence. BCM used PacBio data to help fill in the gaps in their scaffolds.
  • 185. BCM evaluation scaffold: NNNNNNNNNNNNNNNNNNN. BCM competition scaffold: CGTCGNNATCNNGGTTACG. Mismatches from PacBio sequence penalized the alignment score more than matching unknown bases. Errors in the PacBio sequence were penalized by the choice of alignment program used to align Fosmid sequences to scaffolds.
  • 186. The choice of one command-line option, used by one tool in the calculation of one key metric... ...probably made enough difference to drop the PacBio-containing assembly to 2nd place. This was actually down to the use of a single command-line option in the lastz alignment program. If we had not chosen this option, the PacBio-containing entry would probably have ranked 1st among all bird assemblies.
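To see how a scoring choice can make a gap-filled scaffold look worse than a scaffold full of Ns, here is a toy sketch. It is not lastz and does not use lastz's scoring scheme: the scoring values are made up, the comparison is ungapped, and the Fosmid sequence is invented for illustration. Only the two scaffold strings come from the slide.

```python
def toy_score(scaffold, fosmid, match=1, mismatch=-5, unknown=0):
    """Score an ungapped, equal-length comparison base by base.

    Hypothetical scoring: an N in the scaffold scores `unknown`,
    identical bases score `match`, anything else scores `mismatch`.
    """
    if len(scaffold) != len(fosmid):
        raise ValueError("sequences must have the same length")
    score = 0
    for s, f in zip(scaffold.upper(), fosmid.upper()):
        if s == "N":
            score += unknown
        elif s == f:
            score += match
        else:
            score += mismatch
    return score


evaluation = "NNNNNNNNNNNNNNNNNNN"   # gap left as unknown bases (from the slide)
competition = "CGTCGNNATCNNGGTTACG"  # gap filled with PacBio sequence (from the slide)
fosmid = "CGACGTTATGCAGGATACG"       # invented 'true' sequence with 3 differences

print(toy_score(evaluation, fosmid))               # all Ns: nothing gained, nothing lost
print(toy_score(competition, fosmid))              # harsh mismatch penalty
print(toy_score(competition, fosmid, mismatch=-1)) # gentler mismatch penalty
```

With the harsh penalty the filled gap scores below the run of Ns; with the gentler penalty it scores well above it. One parameter flips which scaffold looks better, which is the point of this slide.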
  • 187. Other conclusions ✤ Different metrics tell different stories ✤ Heterozygosity was a big issue for bird & fish assemblies ✤ Final rankings very sensitive to changes in metrics ✤ N50 is a semi-useful predictor of assembly quality. The last point may disappoint some. Despite looking at many different metrics, N50 scaffold length still does a reasonable job of predicting overall quality. However...
  • 188. ...the outliers in this relationship should be noted. The highlighted bird assembly had the second highest scaffold N50 length, but ranked 6th among bird assemblies.
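For reference, N50 is the length of the sequence at which, going from longest to shortest, the running total first reaches half of the assembly size; NG50 uses half of the estimated genome size instead. A minimal sketch, with a made-up function name and example lengths, and with the choice to return 0 when the assembly covers less than half the genome being my own convention:

```python
def n50(lengths, genome_size=None):
    """Return N50 of the given lengths, or NG50 if genome_size is supplied.

    Returns 0 if the sequences never reach half of the target size
    (a convention chosen for this sketch).
    """
    target = (genome_size if genome_size is not None else sum(lengths)) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


# Hypothetical scaffold lengths (total = 300).
scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffolds))                   # N50, relative to the assembly size
print(n50(scaffolds, genome_size=400))  # NG50, relative to a 400 bp 'genome'
```

Because N50 depends only on the length distribution, an assembly can score well here while still being wrong, which is why the outliers matter.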
  • 192. Inter-specific differences matter ✤ The three species have genomes with different properties: repeats, heterozygosity ✤ The three genomes had very different NGS data sets: only bird had PacBio & 454 data; different insert sizes in short-insert libraries. Biological differences may account for differences in assembler performance between different species. However, the input data for each species was also very different and this may play a role as well (some assemblers prefer certain short-insert sizes).
  • 194. The Big Conclusion "You can't always get what you want" Sir Michael Jagger, 1969 People would like an assembler that consistently performs well across most (all?) metrics and across most species. We didn’t find such an assembler in the Assemblathon 2 contest.
  • 196. What comes next? There may one day be an Assemblathon 3 but there are no immediate plans (and no funding for us at UC Davis to do so).
  • 205. A wish list for Assemblathon 3 ✤ Only have 1 species ✤ Teams have to 'buy' resources using virtual budgets ✤ Factor in CPU time/cost? ✤ Agree on metrics before evaluating assemblies! ✤ Encourage experimental assemblies ✤ Use new FASTG genome assembly file format ✤ Get someone else to write the paper! If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
  • 206. Intermission And now a break in the scheduled program in order to let me vent a little steam.
  • 209. NGS must die! ‘NGS’ is used to refer to everything post-Sanger. Pyrosequencing was developed ~1996. Next-generation sequencing (NGS) is heavily used as a convenient label for modern sequencing technologies. But those technologies have, in some cases, been in development since the mid 1990s. Do we refer to everything from the last 20 years as ‘next’ generation?
  • 210. There are over 5,000 papers in Google Scholar which feature ‘Next-generation sequencing’ or ‘NGS’ in the title of the article. These do not help you if you are trying to find papers that focus on pyrosequencing or nanopore sequencing. How could we improve these titles?
  • 211. In many cases, including ‘next-generation’ adds nothing to the description of the paper.
  • 217. NGS madness: Next generation sequencing, aka second generation sequencing, but there’s also: third generation sequencing; fourth generation sequencing; next-next generation sequencing; next-next-next generation sequencing. Some people have tried alternative names. These are all descriptions that have been used in published papers.
  • 219. NGS madness
      Technology          According to some papers…   According to other papers…
      Complete Genomics   2nd generation              3rd generation
      Ion Torrent         2nd generation              3rd generation
      PacBio              2nd generation              3rd generation
      Oxford Nanopore     3rd generation              4th generation
    And of course, not everyone agrees on what is 2nd, 3rd, or 4th generation!
  • 220. NGS madness “PacBio is a 2.5th generation” “Helicos lies between the transition of next-generation to third generation” And of course, someone also has to be different!
  • 223. NGS madness There are different sequencing methodologies, and there are different sequencing platforms. Use one or the other. Or just say ‘current sequencing technologies’. I would suggest that it is more helpful to refer to different sequencing technologies by their methodology (sequencing by synthesis, pyrosequencing, nanopore sequencing, etc.) or by the company developing the product (PacBio, Illumina, etc.).
  • 224. Intermission And now back to our scheduled programming.
  • 225. My #1 piece of advice (image: flickr.com/julia_manzerova) If you ever have to work with genome assemblies, here is my top piece of advice.
  • 227. flickr.com/thomashawk Look at your data! Look at your *input* data (what goes into the assembler) and *output* data (what comes out of the assembler). And really look at it (in a Unix terminal).
  • 228. I am frequently asked to run CEGMA for people (to assess the completeness of their genome assembly). I track the CEGMA results (using a narrower set of 248 of the most conserved core genes) and also record the N50 scaffold length. Even with just two metrics, there is a lot of variation.
  • 231. I looked at the shortest 10 sequences in 34 different genome assemblies… Genome assemblers can sometimes choose to exclude very short contigs/scaffolds from the final assembly. I looked at 34 assemblies to see whether the shortest 10 sequences all had the same length (indicating a cutoff had been used). Five assemblies contained an abundance of 100 nt sequences. Are these useful to anyone? Other assemblers had strange cutoff lengths (e.g. 141 bp).
  • 240. From a vertebrate genome assembly with 72,214 sequences… Length of 10 shortest sequences: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp! In one particular assembly, nearly all of the sequence was represented by incredibly short scaffolds. The shortest sequence in the assembly was 3 bp. Assemblies like this are not likely to be useful for anything. Unsurprisingly, this assembly didn’t contain any core genes.
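Checking the shortest sequences in an assembly takes very little code. This is a minimal sketch, assuming a plain FASTA file; the function names and the tiny example are made up, and on real data you would pass an open file handle instead of the in-memory string used here.

```python
import io


def sequence_lengths(handle):
    """Yield the length of each record in a FASTA-formatted text stream."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def shortest(handle, n=10):
    """Return the n shortest sequence lengths, smallest first."""
    return sorted(sequence_lengths(handle))[:n]


def likely_cutoff(shortest_lengths):
    """Return the shared length if all of the shortest sequences are the same
    length (suggesting a minimum-length cutoff was applied), else None."""
    if shortest_lengths and len(set(shortest_lengths)) == 1:
        return shortest_lengths[0]
    return None


# Tiny made-up FASTA, standing in for a real assembly file.
demo = io.StringIO(">a\nACGT\nAC\n>b\nACG\n>c\nA\n")
demo_shortest = shortest(demo, n=2)
print(demo_shortest, likely_cutoff(demo_shortest))
```

If the ten shortest sequences all share one length, a cutoff was probably used; if they trail off to a few bp, as in the assembly above, nothing was filtered at all.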
  • 241. For some of the CEGMA runs that I have made, I’ve noted which assembler was used…
  • 242. These results show that any assembler can be used to make a bad genome assembly. There is no one assembler which consistently performs well (as assessed by these two metrics). Note that these assemblies were generated from many different species.
  • 243. Reasons to be cheerful flickr.com/danielygo After sounding quite pessimistic so far, here are some more positive reasons why genome assembly might be getting better.
  • 244. Data from Lex Nederbragt’s blog, June 2014 Sequencing technologies continue to improve. 10,000 bp is sort of a ‘breakthrough’ length that would greatly assist genome assembly. Producing many reads that are >10,000 bp means that you can sequence all the way through most eukaryotic repeats (which are one of the two major scourges for genome assemblers).
  • 246. Long-read technology Moleculo read data from Illumina BaseSpace, July 2013 Moleculo (now owned by Illumina) can take Illumina reads and somehow (not sure anyone knows the science behind how it works) combine them to make much longer reads.
  • 247. Long-read technology PacBio data, from https://flxlexblog.wordpress.com (Lex Nederbragt's blog). Library preparation is a hugely important part of the genome assembly process. The BluePippin library prep greatly improves the number of super long PacBio reads.
  • 248. Long-read technology MinION from Oxford Nanopore. Oxford Nanopore burst onto the scene and excited everyone. But there was a wait before people had the chance to use their MinION devices for themselves. The UC Davis Genome Center recently received 3 MinIONs as part of the early access program.
  • 252. Where is the data? Nick Loman published the first real-world data on June 10th. Nick Loman was the first person to publish a ‘real world’ read from these devices.
  • 253. He also shared the data from his entire run. This nanopore sequencing technology seems limited by how large your DNA fragments are. It may be possible to generate much longer reads.
  • 254. Single chromosome assembly? Breaking the problem up into smaller chunks may be one other way of tackling the genome assembly problem (though many single chromosomes in eukaryotes are still very long).
  • 257. Tackling heterozygosity The 1000 Genomes Project plans to sequence 15 'trios' at high depth. The second major problem for genome assemblers is the heterozygosity present in most (diploid) genomes. The 1000 Genomes Project is trying to tackle this by sequencing ‘trios’ (an individual plus their parents) and will try to use the combination of datasets to resolve the heterozygosity.
  • 258. Hi-C ✤ Nature Biotechnology, 31, 2013 ✤ Burton et al. ✤ Selvaraj et al. ✤ Kaplan & Dekker. Hi-C is another new technology that might be able to improve the scaffolding step of genome assembly.
  • 259. The future of genome assembly Maybe one day, genome assembly will be as simple as downloading a sequence to your iPhone and clicking ‘assemble’. That day is still some time away.
  • 266. The future of genome assembly ✤ At some point we will look back with embarrassment at this era. ✤ Assembly must, and will, get better, but... ✤ ...'perfect' genomes may remain elusive. ✤ Data management will remain an issue: ✤ the human genome -> human genomes -> tissue-specific genomes Currently a lot of effort is spent generating huge datasets in order to produce a final genome assembly. There are hundreds of genome assemblies out there which are very poor and incomplete. In many cases we don’t know just how good or bad they are. The pace of change in this field means it will often be easier to simply resequence and reassemble a genome rather than attempt to work with a previous genome assembly. Even if assembly improves, there will be lots of data to manage in future. And people will have their genomes sequenced at different time points throughout their lives (part of your ‘genome checkup’ at the doctor’s?).
  • 272. Summary ✤ There is no real consensus on how to make a good genome assembly ✤ Try different assemblers, try different command-line options ✤ Decide what it is you want to get out of a genome assembly ✤ Look at your input and output data ✤ Wait 5 years and come back, we’ll (probably) have solved everything! The last point on this slide is something that I repeat every 5 years.
  • 273. Resources ✤ Lex Nederbragt’s blog - https://flxlexblog.wordpress.com ✤ Nick Loman’s blog - http://pathogenomics.bham.ac.uk/blog/ ✤ Assemblathon twitter feed - https://twitter.com/assemblathon These are good resources for staying on top of the latest and greatest news in the world of genome assembly.