Meren's pirate presentation at the STAMPS course, covering the two basic concepts most binning algorithms use to organize contigs into genome bins: sequence composition and differential coverage.
1. These slides are from Meren’s pirate
presentation at the STAMPS 2017 course.
The purpose of this was to provide a very
broad introduction to the two essential
concepts behind automatic identification of
microbial genome bins in metagenomic
data.
If you have questions, let us know:
http://merenlab.org/people/
2. Recovering genomes from
metagenomes using short
sequencing reads can be
challenging. But its importance
pushes us to try harder.
Here are the major steps of
assembly-based, genome-
resolved metagenomics:
3.
4. Assembly and binning suffer
from many challenges, and we
often miss parts of genomes even
when the environments we study
are not very complex.
… but we do much worse when
population abundances are not
even, which usually is the case.
5.
6. Regardless, there are
tremendous benefits when we
can get population genomes
from metagenomes.
‘Binning’ is the step during
which we organize those
contigs in our assembly results
into population genomes.
7.
8. But how do we do that when
we know almost nothing about
the origins of our contigs at the
end of the assembly?
10. There are two aspects of data
we commonly exploit to
identify contigs in our
assemblies that likely belong
to the same population
genome in the environment
11. The first one is the ‘sequence
composition’, which requires
no prior understanding of the
likely origins of contigs
(and why this works is
fascinating for multiple reasons).
12.
13.
14.
15. Fine, but how do we even
compute k-mer frequencies?
The following example does it
for multiple sequences by
assuming k=2
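For readers without the slide images, here is a minimal sketch of the same idea in Python. The function name `kmer_frequencies` and the three toy sequences are made up for illustration; they are not taken from the slides. The sketch counts overlapping windows of size k and skips any window that contains a base other than A, C, G, or T.

```python
from itertools import product


def kmer_frequencies(sequence, k=2):
    """Return the relative frequency of every possible k-mer in a sequence."""
    sequence = sequence.upper()
    # Start every possible k-mer over A, C, G, T at zero so that all
    # sequences end up with the same set of keys.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    # Slide a window of size k across the sequence, one position at a time.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}


# Hypothetical toy sequences, for illustration only.
sequences = {"seq_1": "ATATATAT", "seq_2": "GCGCGCGC", "seq_3": "ATATGCGC"}
profiles = {name: kmer_frequencies(seq) for name, seq in sequences.items()}
```

For `ATATATAT` there are seven overlapping windows (four `AT` and three `TA`), so the frequency of `AT` is 4/7 and the other fourteen dinucleotides stay at zero.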
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28. Now this information can be
used to organize sequences
based on their compositional
similarities
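One way to picture "organizing by compositional similarity" is to treat each sequence as a point in a 16-dimensional space and measure distances between points. The sketch below uses Euclidean distance and a nearest-neighbor lookup; both are assumptions chosen for simplicity, not the method shown on the slides, and the contig names and sequences are invented.

```python
from itertools import product
from math import dist


def composition_vector(sequence, k=2):
    """Return k-mer frequencies as a list in a fixed k-mer order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k].upper()
        if window in counts:
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid dividing by zero
    return [counts[kmer] / total for kmer in kmers]


def closest_contig(name, vectors):
    """Return the name of the contig whose composition is nearest to `name`."""
    others = (other for other in vectors if other != name)
    return min(others, key=lambda other: dist(vectors[name], vectors[other]))


# Hypothetical toy contigs: two AT-rich, two GC-rich.
contigs = {
    "contig_a": "ATATATATAT",
    "contig_b": "TATATATATA",
    "contig_c": "GCGCGCGCGC",
    "contig_d": "CGCGCGCGCG",
}
vectors = {name: composition_vector(seq) for name, seq in contigs.items()}
```

Here `closest_contig("contig_a", vectors)` picks `contig_b`, and `contig_c` pairs with `contig_d`, because each pair shares the same dominant dinucleotides.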
29.
30.
31. Now that you know what
‘dinucleotide composition space’
is (i.e. k-mer frequencies for
k=2), the following papers will
probably make much more
sense.
32.
33.
34. When k=4, we call it the
tetranucleotide frequency
(which you may have heard
many times before as it is the
de facto standard for
characterizing sequence
composition)
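Written out, the frequency being computed is simple. The notation below is an assumption introduced here for illustration (it is not from the slides): $s$ is a sequence of length $L$ containing only A, C, G, and T, $w$ is one k-mer, and $c_s(w)$ is the number of overlapping windows in $s$ that equal $w$.

```latex
f_s(w) = \frac{c_s(w)}{L - k + 1},
\qquad w \in \{A, C, G, T\}^k,
\qquad \sum_{w} f_s(w) = 1
```

There are $4^k$ possible k-mers, so each sequence is described by $4^2 = 16$ numbers when k=2 and by $4^4 = 256$ numbers when k=4.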
39. But in most cases sequence
signatures are not enough to
resolve things accurately.
40. The second aspect of data that
improves the resolving power
of binning algorithms when
multiple samples are available
is the ‘differential coverage’ of
contigs across metagenomes.
41. But what is ‘coverage’?
Coverage is the average
number of short reads
mapping to each nucleotide
position throughout a contig:
42.
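As a rough sketch of that definition, the snippet below computes mean coverage from read alignments. It assumes the mapping step has already produced alignments as 0-based, end-exclusive `(start, end)` positions on the contig; the function name `mean_coverage` and the toy reads are hypothetical.

```python
def mean_coverage(contig_length, alignments):
    """Return (mean coverage, per-position depth) for one contig."""
    depth = [0] * contig_length
    for start, end in alignments:
        # Every position a read spans gets one more unit of depth;
        # coordinates are clipped to the contig boundaries.
        for pos in range(max(start, 0), min(end, contig_length)):
            depth[pos] += 1
    return sum(depth) / contig_length, depth


# A hypothetical 10-nucleotide contig with three mapped reads.
coverage, depth = mean_coverage(10, [(0, 4), (2, 6), (3, 9)])
```

The three reads cover 4 + 4 + 6 = 14 positions in total, so the mean coverage of this 10-nucleotide contig is 1.4, even though the depth at individual positions ranges from 0 to 3.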
43. Yeah. So if that is coverage, we
could use it the following way:
44.
45. And it would work even
if we knew nothing
about the contigs, or the
distribution patterns of the
individual populations they
belong to:
46.
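To make the idea concrete, here is a toy sketch that groups contigs whose coverage rises and falls together across samples. The Pearson correlation, the 0.9 threshold, the greedy grouping, and all coverage values are assumptions made up for this illustration; real binning tools use more careful statistics.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long coverage profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no differential signal
    return cov / sqrt(var_x * var_y)


def group_by_coverage(profiles, min_correlation=0.9):
    """Greedily put each contig into the first group it correlates with."""
    groups = []
    for name, profile in profiles.items():
        for group in groups:
            representative = profiles[group[0]]
            if pearson(profile, representative) >= min_correlation:
                group.append(name)
                break
        else:
            groups.append([name])  # no match: start a new group
    return groups


# Hypothetical mean coverage of four contigs across four metagenomes.
profiles = {
    "contig_1": [10, 50, 5, 30],
    "contig_2": [12, 48, 6, 33],
    "contig_3": [40, 5, 60, 8],
    "contig_4": [38, 6, 55, 10],
}
groups = group_by_coverage(profiles)
```

Nothing about the sequences themselves is used here: `contig_1` and `contig_2` end up together, as do `contig_3` and `contig_4`, purely because their abundances co-vary across samples.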
47. Modern algorithms often use
these two aspects of the data
to organize contigs into
genome bins automatically
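A simple way to see why both signals help is to join them into one feature vector per contig. The sketch below concatenates dinucleotide frequencies with log-scaled coverage values; the concatenation, the log transform, and the toy data are all assumptions for illustration, not a description of how any particular binning tool combines the two.

```python
from itertools import product
from math import dist, log


def composition_vector(sequence, k=2):
    """Return k-mer frequencies as a list in a fixed k-mer order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k].upper()
        if window in counts:
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in kmers]


def coverage_vector(coverages):
    """Log-transform coverages (pseudocount of 1) and scale them to sum to 1."""
    logged = [log(c + 1) for c in coverages]
    total = sum(logged) or 1
    return [value / total for value in logged]


def joint_vector(sequence, coverages, k=2):
    """Concatenate the composition and coverage features of one contig."""
    return composition_vector(sequence, k) + coverage_vector(coverages)


# Hypothetical contigs: (sequence, mean coverage in three metagenomes).
contigs = {
    "contig_1": ("ATATATATAT", [10, 50, 5]),
    "contig_2": ("TATATATATA", [12, 48, 6]),
    "contig_3": ("ATATATATAT", [40, 5, 60]),
}
features = {name: joint_vector(seq, cov) for name, (seq, cov) in contigs.items()}
```

In this toy data `contig_1` and `contig_3` have identical composition, so composition alone cannot separate them, but their joint vectors are far apart because their coverage patterns differ; `contig_1` stays close to `contig_2` on both signals.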
49. Believe it or not, recovering
population genomes from
metagenomes is not a new
thing…
50.
51. But luckily, there are many
algorithms to standardize the
way we can do automatic
binning.
52.
53. That being said, you should
think twice before putting your
absolute trust in any genome
bin you get from automatic
binning tools.
54. Metagenomic data is complex, and
things will often work less than
optimally.
Here is a blog post you may find
relevant if you are interested in
exploring how to refine metagenomic
bins:
http://merenlab.org/2017/05/11/anvi-refine-by-veronika/