K-mers in metagenomics
K-mers play a critical role in the exploration of metagenomic data. They have been widely used to assign taxonomic attributions to the short genomic fragments characteristic of shotgun (metagenomic) sequencing. These approaches provide an assembly-free method for profiling microbial communities, and have helped elucidate the factors driving microbial community composition across biogeochemical gradients. Advances in sequencing technology are now making it cost-effective to sequence microbial communities at sufficient depth to allow the assembly of high-quality contigs. This has made it possible to adopt k-mer-based approaches for reliable binning of contigs originating from a single microbial population within a community. In this session, I will give an overview of how k-mers can be used to assign taxonomic attributions to short metagenomic reads, and discuss how these approaches have advanced to the point where population genomes can be recovered en masse from even complex microbial communities.
5. exploiting genomic (K-mer) signatures
PhymmBL (K ≤ 8): interpolated Markov model
PhyloPythia (K ≈ 6): multiclass support vector machine
Naïve Bayes (K ≈ 15): probability of observing a K-mer
Kraken (K ≈ 31): exact K-mer matching
CLARK (K ≈ 31): exact matching of discriminative K-mers
dense profiles (small K) to sparse profiles (large K)
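The dense/sparse distinction on this slide can be illustrated with a minimal Python sketch. It is not taken from any of the tools listed, and the function names are illustrative. With small K, most of the 4^K possible K-mers occur in a sequence, so a frequency vector over all of them is a natural (dense) profile. With large K, almost none occur, so the profile is effectively a sparse set of observed K-mers that suits exact matching.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping K-mer in seq (forward strand only, for simplicity)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def dense_profile(seq, k):
    """Frequency vector over all 4**k possible K-mers, in lexicographic order."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values()) or 1  # avoid dividing by zero on very short input
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def occupancy(seq, k):
    """Fraction of the 4**k possible K-mers that actually occur in seq."""
    return len(kmer_counts(seq, k)) / 4 ** k


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # toy sequence, not real data
    print(len(dense_profile(toy, 2)))  # 16 entries, one per possible 2-mer
    print(occupancy(toy, 2), occupancy(toy, 8))
```

Storing a full vector is only practical for small K: at K = 31 there are 4^31 possible K-mers, which is why exact-matching tools keep only the K-mers they have actually seen.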
6. Kraken: K-mer LCA database
Wood and Salzberg, Genome Biology, 2014
Reference genomes (2,256 RefSeq genomes)
→ Extract K-mers (default, K = 31)
→ Lowest common ancestor (LCA) database:

K-mer        LCA
ACC … GT     g__Escherichia
ACG … GT     s__E. coli
AGT … AA     p__Proteobacteria
…
TGA … TT     d__Bacteria
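The idea behind the K-mer to LCA database can be sketched as follows. This is a toy illustration under stated assumptions, not Kraken's implementation: the taxonomy is a hand-written, simplified child-to-parent map, the genomes are made-up strings, K is shortened to 4, and reverse complements and the compact on-disk layout are ignored.

```python
# Toy taxonomy (child -> parent); ranks are simplified and purely illustrative.
PARENT = {
    "s__E. coli": "g__Escherichia",
    "s__E. albertii": "g__Escherichia",
    "g__Escherichia": "p__Proteobacteria",
    "s__B. subtilis": "p__Firmicutes",
    "p__Proteobacteria": "d__Bacteria",
    "p__Firmicutes": "d__Bacteria",
    "d__Bacteria": None,
}


def lineage(taxon):
    """Return the path from taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors_of_a = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors_of_a:
            return taxon
    raise ValueError("taxa do not share a root")


def build_lca_database(genomes, k=31):
    """Map each K-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


if __name__ == "__main__":
    # Made-up sequences; K = 4 only so the toy genomes share K-mers.
    genomes = {
        "s__E. coli": "ACGTAC",
        "s__E. albertii": "CGTACC",
        "s__B. subtilis": "GTACGG",
    }
    for kmer, taxon in sorted(build_lca_database(genomes, k=4).items()):
        print(kmer, taxon)
```

A K-mer unique to one genome keeps that genome's taxon, while a K-mer shared across genomes is pushed up the tree to their common ancestor, which is how entries such as `g__Escherichia` or `d__Bacteria` arise in the table above.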
7. Kraken: classification tree
Wood and Salzberg, Genome Biology, 2014
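As described by Wood and Salzberg, each K-mer in a read is looked up in the LCA database, the hit taxa and their ancestors form a classification tree weighted by hit counts, and the read is assigned to the leaf of the highest-scoring root-to-leaf path. The sketch below illustrates that scoring on a toy example. The taxonomy, the database contents, and K = 4 are assumptions made for illustration, and ties are broken arbitrarily here rather than as in the published algorithm.

```python
from collections import Counter

# Toy taxonomy (child -> parent) and toy K-mer -> LCA database; both hypothetical.
PARENT = {
    "s__E. coli": "g__Escherichia",
    "s__E. albertii": "g__Escherichia",
    "g__Escherichia": "p__Proteobacteria",
    "p__Proteobacteria": "d__Bacteria",
    "d__Bacteria": None,
}
DB = {
    "ACGT": "s__E. coli",
    "CGTA": "g__Escherichia",
    "GTAC": "d__Bacteria",
    "TACC": "s__E. albertii",
}


def classify(read, db, k):
    """Return the taxon at the end of the highest-scoring root-to-leaf path."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    hits = Counter(db[kmer] for kmer in kmers if kmer in db)
    if not hits:
        return None  # no K-mer matched: the read stays unclassified

    def path_score(taxon):
        # Sum the hit counts of the taxon and all of its ancestors.
        score = 0
        while taxon is not None:
            score += hits[taxon]
            taxon = PARENT[taxon]
        return score

    # A hit taxon with a hit descendant always scores lower than that
    # descendant, so the maximum over hit taxa is reached at a leaf.
    return max(hits, key=path_score)


if __name__ == "__main__":
    print(classify("ACGTAC", DB, 4))  # K-mers hit species, genus and domain
    print(classify("CGTAC", DB, 4))   # no species-level K-mer in this read
    print(classify("GGGGG", DB, 4))   # nothing matches
```

Because ancestors contribute to the score of every path beneath them, a read can be placed at species level even when most of its K-mers are only informative at higher ranks.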
8. assessment of methods
Results from Ounit et al., BMC Genomics, 2015
and Wood and Salzberg, Genome Biology, 2014
Classifier             Precision (%)   Sensitivity (%)   Speed (reads/min)
Megablast              99.0            79.0              -
Naïve Bayes (K = 15)   82.3            82.3              8
Naïve Bayes (K = 11)   59.0            59.0              20
PhymmBL                82.3            82.3              -
CLARK                  99.3            77.2              3.1 million
Kraken (K = 31)        99.3            77.8              2.3 million
Kraken (K = 20)        80.2            82.7              1.5 million

Precision: (correct classifications) / (total classifications)
Sensitivity: (correct classifications) / (total reads)
Speed: reads per minute
Results for simple simulated dataset
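The two definitions above differ only in the denominator, which a short Python sketch makes explicit. The counts in the usage example are made up for illustration and are not taken from either paper.

```python
def precision_sensitivity(n_correct, n_classified, n_reads):
    """Return (precision, sensitivity) as percentages.

    precision   = correct classifications / total classifications
    sensitivity = correct classifications / total reads
    """
    precision = 100.0 * n_correct / n_classified
    sensitivity = 100.0 * n_correct / n_reads
    return precision, sensitivity


if __name__ == "__main__":
    # Hypothetical counts: 1,000 reads, 800 classified, 780 of those correct.
    print(precision_sensitivity(780, 800, 1000))
```

Since a classifier cannot classify more reads than it is given, sensitivity can never exceed precision, and the two coincide when every read receives a classification. That is consistent with the rows in the table where both columns show the same value.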
9. impact of K and reference database size
Classifier             Precision (%)   Sensitivity (%)   Speed (reads/min)
Megablast              99.0            79.0              -
Naïve Bayes (K = 15)   82.3            82.3              8
Naïve Bayes (K = 11)   59.0            59.0              20
PhymmBL                82.3            82.3              -
CLARK                  99.3            77.2              3.1 million
Kraken (K = 31)        99.3            77.8              2.3 million
Kraken (K = 20)        80.2            82.7              1.5 million
Kraken-GB (K = 31)     99.5            93.8              -

Performance is sensitive to K
Kraken-GB: 8,517 reference genomes instead of 2,256
10. impact of taxonomic novelty
Results from Wood and Salzberg, Genome Biology, 2014
                 Taxonomic Novelty
Measured Rank    Species   Genus   Family
Domain           24.4      7.9     2.8
Phylum           23.9      7.2     2.5
Class            24.7      7.1     2.0
Order            24.1      6.8     2.0
Family           25.4      8.5     -
Genus            26.3      -       -

Sensitivity decreases rapidly with taxonomic novelty
11. Kraken: some practical numbers
Applied to metagenome from coalbed methane well
~82 million paired-end reads (2 × 100 bp)
~30 minutes to process with 8 threads
Reference database requires ~70 GB of RAM
Classified 7.7% of reads
[Figure: relative abundance (%), 16S profile vs. Kraken]
12. take away points
K-mers widely used to assign taxonomy to metagenomic reads
Active area of research
Resolution limited by reference genomes
16S profiling still the gold standard
change is coming…
13. Recovering Population Genomes from Metagenomic Data
metagenome → shotgun sequencing → reads → assembly → contigs → bin contigs into genomes (genome-centric metagenomics)
14. recovering genomes from metagenomic data
metagenome → shotgun sequencing → reads → assembly → contigs → binning (classify using coverage and k-mer profiles) → population genomes → identify strain-specific SNPs
15. differential coverage signal
contigs with similar coverage profiles likely belong to the same genome!
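One simple way to quantify "similar coverage profiles" is to correlate each contig's coverage across samples. The sketch below assumes a coverage profile is a list of mean read depths, one per sample. The contig names and values are made up, and Pearson correlation is just one possible similarity measure, not necessarily the one used by any particular binning tool.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical mean coverage of three contigs in three samples.
COVERAGE = {
    "contig_1": [10.0, 42.0, 5.0],
    "contig_2": [11.0, 40.0, 6.0],
    "contig_3": [30.0, 3.0, 28.0],
}

if __name__ == "__main__":
    print(pearson(COVERAGE["contig_1"], COVERAGE["contig_2"]))  # close to 1
    print(pearson(COVERAGE["contig_1"], COVERAGE["contig_3"]))  # negative
```

In this toy data, contig_1 and contig_2 rise and fall together across samples and would be candidates for the same bin, while contig_3 follows the opposite pattern.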
19. MetaBAT: statistical model of tetranucleotide signatures
Empirical parameters from ~1500 reference genomes
Posterior probability that two contigs are from different genomes:
$$P(\mathrm{inter} \mid D) = \frac{\alpha P(D \mid \mathrm{inter})}{\alpha P(D \mid \mathrm{inter}) + P(D \mid \mathrm{intra})}$$
Kang et al., bioRxiv, 2014
[Figure: probability P(inter | D) versus tetranucleotide distance, D; contig size = 10 kb]
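The posterior above is a direct application of Bayes' rule and can be evaluated in a few lines of Python. The slide does not define α, so this sketch treats it as a plain weighting parameter on the inter-genome term. The likelihood values in the usage example are placeholders, not MetaBAT's empirical parameters.

```python
def posterior_inter(p_d_given_inter, p_d_given_intra, alpha=1.0):
    """P(inter | D) = alpha * P(D|inter) / (alpha * P(D|inter) + P(D|intra))."""
    weighted_inter = alpha * p_d_given_inter
    return weighted_inter / (weighted_inter + p_d_given_intra)


if __name__ == "__main__":
    # Placeholder likelihoods for a single tetranucleotide distance D.
    print(posterior_inter(0.2, 0.8))             # distance is typical within a genome
    print(posterior_inter(0.5, 0.5, alpha=3.0))  # a larger alpha favours "inter"
```

A small posterior means the observed distance is better explained by two contigs from the same genome, which supports placing them in the same bin.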
20. rapidly filling out tree of life
60 bacterial phyla
>3000 population genomes
23 habitats
51 phyla with population genome representatives
21. take away points
Population genomes can be recovered from metagenomic samples
K-mer profiles complement differential coverage signal
Rapidly expanding reference genomes
Improve gene-centric metagenomics
25. identifying potential contamination
95th percentile
outliers… treat with caution
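A percentile cut-off of this kind can be sketched as follows. The sketch assumes, as one reading of the slide, that each contig already has a tetranucleotide distance to its genome and that the threshold is the 95th percentile of intra-genome distances observed in reference genomes. The function names, the nearest-rank percentile definition, and all values are illustrative.

```python
def nearest_rank_percentile(values, q):
    """q-th percentile (integer q, 1..100) using the nearest-rank definition."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # ceil(q * n / 100) with integers
    return ordered[rank - 1]


def flag_outliers(contig_distances, reference_distances, q=95):
    """Names of contigs whose distance exceeds the q-th reference percentile."""
    threshold = nearest_rank_percentile(reference_distances, q)
    return sorted(name for name, d in contig_distances.items() if d > threshold)


if __name__ == "__main__":
    # Hypothetical reference distribution and contig distances (arbitrary units).
    reference = list(range(1, 101))
    contigs = {"contig_1": 40, "contig_2": 96, "contig_3": 95}
    print(flag_outliers(contigs, reference))
```

By construction, about 5% of genuine contigs are expected to fall above such a threshold, so a flagged contig is a candidate for inspection rather than proof of contamination.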
26. K-mer modeling: impact of evolution
Bacteria vs. Archaea
(Intra-genome 95th percentile; K=4)
Classes of Proteobacteria
(Intra-genome 95th percentiles; K=4)
27. final thoughts
K-mers widely used in gene- and genome-centric metagenomics
Population genomes substantially improving diversity of available reference genomes
Big win for taxonomic attribution methods
And CheckM, and many other bioinformatic programs
How best to exploit population genomes
Looking at 100,000+ reference genomes in next few years
Issues in terms of scalability
Using ‘noisy’ population genomes raises interesting questions
Basic metagenomics workflow
gene- and genome-centric metagenomics
Goal: assign taxonomy to metagenomic reads
Challenge:
reads are short (currently 100 to 300 bp)
>>100 million reads
limited reference genomes (~2,000 finished; ~25,000 draft)
Uses:
profiling of microbial communities
preprocessing for assembly
Show benefits of combining signals
Show results of alternative K values
Lots of approaches
Naïve Bayes vs. IMM
Ideally, contigs from the same genome would have the same coverage and genomic signature.
Of course, there is variation that needs to be modelled, which leads to an interesting unsupervised or semi-supervised clustering problem.
All these methods are unsupervised clustering algorithms that use differential coverage, k-mer profiles, and occasionally GC content as features.
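The feature set named in these notes can be assembled per contig as in the sketch below. This is an assumption about how such features might be laid out, not the scheme of any specific binning tool. Real methods typically also normalise or weight the features before clustering, and the function names are illustrative.

```python
from itertools import product

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 4-mers


def gc_content(seq):
    """Fraction of G and C bases in the contig."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def tetranucleotide_freqs(seq):
    """Relative frequency of each 4-mer; windows with other symbols are skipped."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in TETRAMERS]


def contig_features(seq, coverage_per_sample):
    """Feature vector: per-sample coverage, then GC content, then 4-mer profile."""
    return list(coverage_per_sample) + [gc_content(seq)] + tetranucleotide_freqs(seq)


if __name__ == "__main__":
    # Made-up contig and coverage values.
    features = contig_features("ACGTACGT", [12.0, 3.5])
    print(len(features))  # 2 coverage values + 1 GC value + 256 frequencies
```

Any clustering algorithm can then be run on these vectors. The choice of distance and feature weighting is where the individual methods differ.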