K-mers in Metagenomics
by Donovan Parks
2 of 27
metagenomics
environmental sample
→ extract and sequence DNA
→ QC and error-correct reads (K-mers!)
→ assemble (K-mers!)
→ bin genomes (K-mers!)
→ assign taxonomy (and function) (K-mers!)
→ refine genomes (K-mers!)
Assigning Taxonomic Labels to
Metagenomic DNA Sequences
4 of 27
a plethora of approaches
• Homology: BLAST, MEGAN
• Composition: Kraken, CLARK, Naïve Bayes
• Hybrid: PhymmBL, FCP, PhyloPythia
• Phylogenetic: Treephyler, AMPHORA, GraftM
• Marker genes: 16S profiling, MetaPhlAn, PhyloSift
[Figure labels: classify all reads; classify subset]
5 of 27
exploiting genomic (K-mer) signatures
• PhymmBL (K ≤ 8): interpolated Markov model
• PhyloPythia (K ≈ 6): multiclass support vector machine
• Naïve Bayes (K ≈ 15): probability of observing a K-mer
• Kraken (K ≈ 31): exact K-mer matching
• CLARK (K ≈ 31): exact matching of discriminative K-mers
[Figure labels: dense profiles; sparse profiles]
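To make the dense/sparse distinction concrete, here is a minimal Python sketch (not code from any of the tools listed). It assumes a dense profile is a fixed-length vector of frequencies over all 4^K possible K-mers, which is practical for small K, and a sparse profile is simply the set of K-mers actually observed, which is the only practical option at K ≈ 31. The function names and the toy sequence are illustrative assumptions.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Yield every overlapping K-mer in seq, skipping any with non-ACGT bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def dense_profile(seq, k=4):
    """Frequencies of all 4**k possible K-mers, as a fixed-length list."""
    counts = Counter(kmers(seq, k))
    total = sum(counts.values()) or 1  # avoid dividing by zero on short input
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def sparse_profile(seq, k=31):
    """Set of K-mers observed in seq; nearly all 4**k K-mers are absent."""
    return set(kmers(seq, k))


seq = "ACGT" * 20  # toy 80 bp sequence, purely illustrative
tetra = dense_profile(seq, k=4)          # 256 frequencies that sum to 1
long_kmers = sparse_profile(seq, k=31)   # a handful of distinct 31-mers
print(len(tetra), len(long_kmers))
```

A dense vector can be fed to a classifier such as an SVM, whereas a sparse set lends itself to exact lookups in a hash table.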
6 of 27
Kraken: K-mer LCA database
Wood and Salzberg, Genome Biology, 2014
Reference genomes (2,256 RefSeq genomes)
→ extract K-mers (default, K = 31)
→ lowest common ancestor (LCA) database

K-mer       LCA
ACC … GT    g__Escherichia
ACG … GT    s__E. coli
AGT … AA    p__Proteobacteria
…
TGA … TT    d__Bacteria
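The database idea can be sketched in a few lines of Python. This is a toy illustration, not Kraken's implementation: the taxonomy, the sequences, and all function names are made up, K is set to 5 for readability, and details such as canonical (strand-independent) K-mers and compact storage are ignored.

```python
# Hypothetical toy taxonomy: child -> parent (the root's parent is None).
PARENT = {
    "d__Bacteria": None,
    "p__Proteobacteria": "d__Bacteria",
    "g__Escherichia": "p__Proteobacteria",
    "s__E. coli": "g__Escherichia",
    "s__E. albertii": "g__Escherichia",
}


def lineage(taxon):
    """Return the path from taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors_of_a = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors_of_a:
            return taxon
    raise ValueError("taxa do not share a root")


def build_lca_database(genomes, k=31):
    """Map each K-mer to the LCA of every genome that contains it."""
    database = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            database[kmer] = lca(database[kmer], taxon) if kmer in database else taxon
    return database


# Tiny made-up sequences; K = 5 keeps the example readable.
genomes = {"s__E. coli": "ACGTACC", "s__E. albertii": "ACGTAGG"}
db = build_lca_database(genomes, k=5)
print(db["ACGTA"])  # K-mer shared by both species, so it maps to their genus
print(db["CGTAC"])  # K-mer seen only in the E. coli toy sequence
```

K-mers unique to one genome keep a species-level label, while shared K-mers are pushed up the taxonomy, which is why some rows in the table above sit at genus, phylum, or domain level.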
7 of 27
Kraken: classification tree
Wood and Salzberg, Genome Biology, 2014
8 of 27
assessment of methods
Results from Ounit et al., BMC Genomics, 2015
and Wood and Salzberg, Genome Biology, 2014
Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
Megablast              99.0           79.0             -
Naïve Bayes (K = 15)   82.3           82.3             8
Naïve Bayes (K = 11)   59.0           59.0             20
PhymmBL                82.3           82.3             -
CLARK                  99.3           77.2             3.1 million
Kraken (K = 31)        99.3           77.8             2.3 million
Kraken (K = 20)        80.2           82.7             1.5 million

• Precision: (correct classifications) / (total classifications)
• Sensitivity: (correct classifications) / (total reads)
• Speed: reads per minute
• Results for simple simulated dataset
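The two accuracy definitions can be written out as a short Python sketch. The read names and labels are invented for illustration, and representing an unclassified read as None is an assumption of this example, not something taken from the cited papers.

```python
def precision_and_sensitivity(predicted, truth):
    """Compute precision and sensitivity as defined on this slide.

    predicted: read name -> predicted label, or None if left unclassified
    truth:     read name -> true label (one entry per read)
    """
    classified = {read: label for read, label in predicted.items() if label is not None}
    correct = sum(1 for read, label in classified.items() if truth[read] == label)
    precision = correct / len(classified) if classified else 0.0
    sensitivity = correct / len(truth) if truth else 0.0
    return precision, sensitivity


# Four toy reads: three classified (two correctly), one left unclassified.
truth = {"r1": "E. coli", "r2": "E. coli", "r3": "B. subtilis", "r4": "B. subtilis"}
predicted = {"r1": "E. coli", "r2": "B. subtilis", "r3": "B. subtilis", "r4": None}
precision, sensitivity = precision_and_sensitivity(predicted, truth)
print(precision, sensitivity)
```

Because unclassified reads count against sensitivity but not precision, a conservative classifier can score high precision and lower sensitivity, as the K = 31 rows in the table do.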
9 of 27
impact of K and reference database size
Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
Megablast              99.0           79.0             -
Naïve Bayes (K = 15)   82.3           82.3             8
Naïve Bayes (K = 11)   59.0           59.0             20
PhymmBL                82.3           82.3             -
CLARK                  99.3           77.2             3.1 million
Kraken (K = 31)        99.3           77.8             2.3 million
Kraken (K = 20)        80.2           82.7             1.5 million
Kraken-GB (K = 31)     99.5           93.8             -

• Performance is sensitive to K
• Kraken-GB: 8,517 reference genomes instead of 2,256
10 of 27
impact of taxonomic novelty
Results from Wood and Salzberg, Genome Biology, 2014
                Taxonomic novelty
Measured rank   Species   Genus   Family
Domain          24.4      7.9     2.8
Phylum          23.9      7.2     2.5
Class           24.7      7.1     2.0
Order           24.1      6.8     2.0
Family          25.4      8.5     -
Genus           26.3      -       -

• Sensitivity decreases rapidly with taxonomic novelty
11 of 27
Kraken: some practical numbers
• Applied to metagenome from coalbed methane well
• ~82 million paired-end reads (2 × 100 bp)
• ~30 minutes to process with 8 threads
• Reference database requires ~70 GB of RAM
• Classified 7.7% of reads
[Figure: relative abundance (%), 16S profile vs. Kraken]
12 of 27
take away points
• K-mers widely used to assign taxonomy to metagenomic reads
• Active area of research
• Resolution limited by reference genomes
• 16S profiling still the gold standard
• change is coming…
Recovering Population Genomes from
Metagenomic Data
metagenome
→ shotgun sequencing → reads
→ assembly → contigs
→ bin contigs into genomes (genome-centric metagenomics)
14 of 27
recovering genomes from metagenomic data
metagenome
→ shotgun sequencing → reads
→ assembly → contigs
→ binning (classify using coverage and K-mer profiles) → population genomes
→ identify strain-specific SNPs
15 of 27
differential coverage signal
contigs with similar coverage profiles likely belong to the same genome!
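One simple way to quantify "similar coverage profiles" is to correlate each contig's coverage across several samples. The Python sketch below uses Pearson correlation on made-up coverage values; the contig names, the numbers, and the choice of Pearson correlation are assumptions for illustration, and real binning tools use more elaborate models.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Hypothetical mean coverage of three contigs in three samples.
coverage = {
    "contig_1": [10.0, 52.0, 5.0],
    "contig_2": [11.0, 49.0, 6.0],   # tracks contig_1 closely
    "contig_3": [40.0, 8.0, 33.0],   # opposite abundance pattern
}

same = pearson(coverage["contig_1"], coverage["contig_2"])
different = pearson(coverage["contig_1"], coverage["contig_3"])
print(round(same, 3), round(different, 3))
```

Contigs 1 and 2 rise and fall together across samples and are candidates for the same genome; contig 3 follows a different pattern and most likely belongs elsewhere.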
16 of 27
K-mers and coverage: complementary signals
microbial community from coalbed methane well
[Figure: contigs plotted by coverage vs. tetranucleotide signature (PC1)]

Genome                       Comp. (%)  Cont. (%)  Length (Mbp)
Archaea
  Methanobacteriaceae 1      98.4       1.6        2.32
  Methanobacteriaceae 2      96.8       0.8        2.23
  Methanobacteriaceae 3      88.6       0.0        1.57
  Methanobacteriaceae 4      96.0       0.0        1.71
Bacteria
  Actinobacteria 1           95.0       0.9        2.56
  Actinobacteria 2           90.5       2.7        2.72
  Actinobacteria 3           88.4       2.7        2.48
  Clostridiales 1            92.6       9.4        2.91
  Clostridiales 2            80.2       0.0        2.74
  Elusimicrobia              95.7       2.2        2.03
  Thermodesulfovibrionaceae  83.9       0.0        2.66
  Syntrophus                 92.9       0.8        2.31
  Rikenellaceae              86.7       2.3        2.72
  Candidate Phylum OP1       83.9       0.0        1.66
  Rhodocyclaceae             69.0       1.63       3.73
17 of 27
many ways to combine coverage + K-mer profiles
• GroopM: http://minillinim.github.io/GroopM/
• DBB: https://github.com/dparks1134/DBB
• CONCOCT: https://github.com/BinPro/CONCOCT
• MetaWatt: http://sourceforge.net/projects/metawatt/
• MetaBAT: https://bitbucket.org/berkeleylab/metabat
18 of 27
MetaBAT overview
Kang et al., bioRxiv, 2014
19 of 27
MetaBAT: statistical model of tetranucleotide signatures
• Empirical parameters from ~1500 reference genomes
• Posterior probability that two contigs are from different genomes:

$$P(\mathrm{inter} \mid D) = \frac{\alpha\, P(D \mid \mathrm{inter})}{\alpha\, P(D \mid \mathrm{inter}) + P(D \mid \mathrm{intra})}$$

[Figure: probability P(inter|D) versus tetranucleotide distance D; contig size = 10 kb]
Kang et al., bioRxiv, 2014
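As a worked illustration of the posterior on this slide, the Python sketch below evaluates it for given likelihood values. The function name and the numbers are hypothetical. The slide does not define α; if the formula is read as Bayes' rule, α plays the role of the prior odds of "inter" versus "intra", but that reading is an assumption here. In MetaBAT the likelihoods come from empirical distributions, which this sketch does not attempt to reproduce.

```python
def posterior_inter(p_d_given_inter, p_d_given_intra, alpha=1.0):
    """P(inter | D) = alpha * P(D|inter) / (alpha * P(D|inter) + P(D|intra)).

    alpha is the weight from the slide's formula; treating it as prior odds
    of inter- versus intra-genome pairs is an assumption of this sketch.
    """
    numerator = alpha * p_d_given_inter
    denominator = numerator + p_d_given_intra
    if denominator == 0:
        raise ValueError("both likelihoods are zero; posterior is undefined")
    return numerator / denominator


# Made-up likelihoods for one observed tetranucleotide distance D.
print(posterior_inter(0.2, 0.8))             # equal weighting of both hypotheses
print(posterior_inter(0.2, 0.8, alpha=4.0))  # larger alpha shifts weight to "inter"
```

A small posterior means the observed distance is better explained by two contigs from the same genome, which supports binning them together.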
20 of 27
rapidly filling out tree of life
60 bacterial phyla
>3000 population genomes
23 habitats
51 phyla with population genome representatives
21 of 27
take away points
• Population genomes can be recovered from metagenomic samples
• K-mer profiles complement differential coverage signal
• Rapidly expanding reference genomes
• Improve gene-centric metagenomics
Assessing and Refining
Population Genomes
23 of 27
estimating quality of population genomes
Additional markers refine quality estimates
Scaffolds: Gammaproteobacteria sp., 80% complete, 20% contaminated
• 105 bacterial marker genes: estimates 92% comp., 17% cont.
• 281 clade-specific marker genes: estimates 83% comp., 22% cont.
Parks et al., Genome Res., 2015
Estimates ± 5%
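The basic idea behind marker-based quality estimates can be sketched as follows. This is a deliberately simplified Python illustration, not the CheckM algorithm: it assumes every marker gene is expected exactly once per genome, counts each marker independently (CheckM groups collocated markers into sets), and uses invented marker names.

```python
def estimate_quality(marker_counts, n_markers):
    """Return (completeness %, contamination %) from single-copy marker counts.

    marker_counts: marker name -> number of copies found in the genome bin
    n_markers:     size of the marker set that was searched for
    """
    present = sum(1 for copies in marker_counts.values() if copies >= 1)
    extra = sum(copies - 1 for copies in marker_counts.values() if copies > 1)
    completeness = 100.0 * present / n_markers
    contamination = 100.0 * extra / n_markers
    return completeness, contamination


# Toy bin: 8 of 10 hypothetical markers found, 2 of them in two copies.
counts = {f"marker_{i}": 1 for i in range(8)}
counts["marker_0"] = 2
counts["marker_1"] = 2
print(estimate_quality(counts, n_markers=10))
```

Missing markers lower the completeness estimate and duplicated markers raise the contamination estimate; with only a few markers each one moves the estimate a lot, which is one reason a larger, clade-specific marker set can give tighter estimates.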
24 of 27
varying quality of recovered genomes
microbial community from coalbed methane well
[Figure: contigs plotted by coverage vs. tetranucleotide signature (PC1)]

Genome                       Comp. (%)  Cont. (%)  Length (Mbp)
Archaea
  Methanobacteriaceae 1      98.4       1.6        2.32
  Methanobacteriaceae 2      96.8       0.8        2.23
  Methanobacteriaceae 3      88.6       0.0        1.57
  Methanobacteriaceae 4      96.0       0.0        1.71
Bacteria
  Actinobacteria 1           95.0       0.9        2.56
  Actinobacteria 2           90.5       2.7        2.72
  Actinobacteria 3           88.4       2.7        2.48
  Clostridiales 1            92.6       9.4        2.91
  Clostridiales 2            80.2       0.0        2.74
  Elusimicrobia              95.7       2.2        2.03
  Thermodesulfovibrionaceae  83.9       0.0        2.66
  Syntrophus                 92.9       0.8        2.31
  Rikenellaceae              86.7       2.3        2.72
  Candidate Phylum OP1       83.9       0.0        1.66
  Rhodocyclaceae             69.0       1.63       3.73
25 of 27
identifying potential contamination
95th percentile
outliers… treat with caution
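A percentile cut-off of this kind can be sketched in Python as below. The slide only shows that contigs beyond a 95th-percentile boundary are flagged; where the reference distribution comes from, the nearest-rank percentile rule, and all names and numbers in this example are assumptions made for illustration.

```python
def percentile_threshold(values, percentile=95):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = -(-percentile * len(ordered) // 100)  # integer ceiling of p * n / 100
    return ordered[max(rank, 1) - 1]


def flag_outliers(contig_distances, reference_distances, percentile=95):
    """Names of contigs whose distance exceeds the reference percentile."""
    threshold = percentile_threshold(reference_distances, percentile)
    return sorted(name for name, dist in contig_distances.items() if dist > threshold)


# Hypothetical reference distribution of within-genome signature distances.
reference = [i / 100 for i in range(1, 101)]
# Hypothetical distances of two contigs from their bin's mean signature.
contigs = {"contig_a": 0.50, "contig_b": 0.97}
flagged = flag_outliers(contigs, reference)
print(flagged)
```

By construction, about 5% of genuine contigs also fall beyond a 95th-percentile boundary, so a flagged contig is a prompt for closer inspection rather than proof of contamination.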
26 of 27
K-mer modeling: impact of evolution
Bacteria vs. Archaea
(Intra-genome 95th percentile; K=4)
Classes of Proteobacteria
(Intra-genome 95th percentiles; K=4)
27 of 27
final thoughts
• K-mers widely used in gene- and genome-centric metagenomics
• Population genomes substantially improving diversity of available reference genomes
• Big win for taxonomic attribution methods
• And CheckM, and many other bioinformatic programs
• How best to exploit population genomes
• Looking at 100,000+ reference genomes in next few years
• Issues in terms of scalability
• Using ‘noisy’ population genomes raises interesting questions
Thank you!

Editor's notes

  1. Basic metagenomics workflow; gene- and genome-centric metagenomics.
  2. Goal: assign taxonomy to metagenomic reads. Challenges: reads are short (currently 100 to 300 bp); there are far more than 100 million reads; reference genomes are limited (~2,000 finished; ~25,000 draft). Uses: profiling of microbial communities; preprocessing for assembly.
  3. Show benefits of combining signals. Show results of alternative K values. Lots of approaches. Naïve Bayes vs. IMM.
  4. Ideally, contigs from the same genome would have the same coverage and genomic signature. Of course, there is variation that needs to be modelled, leading to an interesting unsupervised or semi-supervised clustering problem.
  5. All these methods are unsupervised clustering algorithms utilizing differential coverage, K-mer profiles, and occasionally GC as features.