Robustness, Reproducibility & Ecological Consistency in the Demarcation of Operational Taxonomic Units
1. Robustness, Reproducibility & Ecological Consistency in the Demarcation of Operational Taxonomic Units
Sebastian Schmidt
Institute for Molecular Life Sciences
University of Zürich
sebastian.schmidt@imls.uzh.ch
2. A general workflow in (targeted) metagenomics
ISME15, Seoul, 2014/08/29 sebastian.schmidt@imls.uzh.ch
3. A general workflow in (targeted) metagenomics
Sampling & Sequencing → “Making OTUs” → Understanding your data (hopefully)
Images: Jean Tinguely, “Heureka”; Lake Zürich
5. Concepts
replicability
robustness
reproducibility
ecological consistency
42
Life, the Universe and Everything?
6. Concepts
42
Life, Microbial Ecology and Everything?
8. The Human Skin Microbiome (HSM) dataset:
~115,000 full-length 16S sequences
sampled from 21 distinct body sites (Grice et al., Science, 2009)
clustered to 97% sequence identity
9. [Figure: OTUs obtained by UPARSE, UCLUST, CD-HIT, single linkage (SL), complete linkage (CL) and average linkage (AL); of the per-method OTU counts only “3,282 OTUs” (listed next to UCLUST) is legible]
OTU A (5,423 seq.): all methods agree (almost) perfectly
OTU B (8,465 seq.): lumping by SL
OTU C (sequence count illegible): splitting by CL
OTU D (2,692 seq.): splitting by UCLUST
Small OTUs (size cutoff illegible): methods provide different numbers of “small” OTUs
Schmidt et al., Environ Microbiol, in press
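The lumping and splitting behaviour described on this slide can be illustrated with a toy example. The sketch below is not the pipeline used in the study: it is a naive agglomerative clustering written for illustration, the function names `cluster_distance` and `agglomerate` are hypothetical, and the distance matrix is an invented chain of four sequences in which each neighbour differs by 2%. With a 3% cutoff (97% identity), single linkage chains everything into one OTU, while complete and average linkage stop at two OTUs.

```python
from itertools import combinations


def cluster_distance(dist, c1, c2, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [dist[i][j] for i in c1 for j in c2]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(dist, cutoff, linkage):
    """Merge the closest pair of clusters until none is within the cutoff."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        d, a, b = min(
            (cluster_distance(dist, clusters[a], clusters[b], linkage), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if d > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Toy data (invented): percent divergence along a chain of four sequences.
dist = [[2 * abs(i - j) for j in range(4)] for i in range(4)]

for linkage in ("single", "complete", "average"):
    print(linkage, agglomerate(dist, cutoff=3, linkage=linkage))
```

The same nominal threshold therefore yields different partitions depending only on how the distance between clusters is defined.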
11. [Figure: heatmap of pairwise Adjusted Mutual Information (colour scale 0.5 to 1.0) between OTU sets from UPARSE, UCLUST, CD-HIT, single linkage, complete linkage and average linkage across clustering thresholds (axis ticks 90, 95, 100)]
A ‘global’ 16S dataset
~1.1M full-length sequences
≥30k samples, diverse environments
Adjusted Mutual Information (AMI), a measure of partition similarity
high replicability
…when clustering twice to the exact same threshold
differential robustness
…to slight threshold changes
Schmidt et al., Environ Microbiol, in press
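To make the AMI measure concrete, here is a minimal pure-Python sketch of how it can be computed for two OTU assignments of the same sequences. It is an illustration, not the implementation used in the study. Two assumptions are made that the slides do not state: chance correction uses the expected mutual information under the hypergeometric (permutation) model, and normalization uses the arithmetic mean of the two entropies. All function names are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log


def log_factorial(k):
    return lgamma(k + 1)


def entropy(sizes, n):
    """Shannon entropy (nats) of a partition given its cluster sizes."""
    return -sum(s / n * log(s / n) for s in sizes)


def mutual_information(u, v):
    """Mutual information (nats) between two labelings of the same items."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(nij / n * log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())


def expected_mi(a_sizes, b_sizes, n):
    """Expected MI of random partitions with the given cluster sizes."""
    emi = 0.0
    for ai in a_sizes:
        for bj in b_sizes:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # log-probability of an overlap of nij (hypergeometric)
                log_p = (log_factorial(ai) + log_factorial(bj)
                         + log_factorial(n - ai) + log_factorial(n - bj)
                         - log_factorial(n) - log_factorial(nij)
                         - log_factorial(ai - nij) - log_factorial(bj - nij)
                         - log_factorial(n - ai - bj + nij))
                emi += nij / n * log(n * nij / (ai * bj)) * exp(log_p)
    return emi


def ami(u, v):
    """Adjusted Mutual Information; 1.0 means identical partitions."""
    n = len(u)
    a, b = list(Counter(u).values()), list(Counter(v).values())
    emi = expected_mi(a, b, n)
    denom = (entropy(a, n) + entropy(b, n)) / 2 - emi
    if abs(denom) < 1e-12:  # both partitions trivial
        return 1.0
    return (mutual_information(u, v) - emi) / denom


# Toy OTU labels (invented) for six sequences at two thresholds.
otus_97 = ["A", "A", "A", "B", "B", "C"]
otus_98 = ["A", "A", "D", "B", "B", "C"]
print(ami(otus_97, otus_97), ami(otus_97, otus_98))
```

In this framing, replicability corresponds to comparing two runs at the same threshold, and robustness to comparing runs of one method at neighbouring thresholds. The triple loop in `expected_mi` is fine for toy input but would need a vectorized implementation for datasets of the size discussed here.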
12. [Figure: AMI heatmap as on slide 11]
differential reproducibility
pairwise similarity maxima between methods off-diagonal
comparability of results across studies?
Schmidt et al., Environ Microbiol, in press
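What “off-diagonal maxima” means can be shown mechanically. In the sketch below, rows are thresholds of one method, columns are thresholds of another, and for each row we look up the column with the highest similarity. The matrix values are invented purely to show the mechanics and are not taken from the study; the function name `best_matches` is hypothetical.

```python
def best_matches(similarity, thresholds):
    """Map each row threshold to the column threshold with maximal similarity."""
    matches = {}
    for row_threshold, row in zip(thresholds, similarity):
        best_col = max(range(len(row)), key=row.__getitem__)
        matches[row_threshold] = thresholds[best_col]
    return matches


thresholds = [96, 97, 98]

# Invented similarity values (rows: method X, columns: method Y).
toy_similarity = [
    [0.80, 0.84, 0.78],
    [0.74, 0.82, 0.86],
    [0.70, 0.77, 0.85],
]

matches = best_matches(toy_similarity, thresholds)
# Rows whose best match is at a different nominal threshold.
off_diagonal = {t: m for t, m in matches.items() if t != m}
print(matches)
print(off_diagonal)
```

When maxima fall off the diagonal, the partition of one method at a given nominal threshold is best matched by the other method at a different threshold, which is why the same nominal cutoff may not be comparable across studies.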
13. [Figure: AMI heatmap as on slide 11, annotated: “Greengenes 97” vs. “SILVA 99”, AMI ~ 0.65]
Schmidt et al., Environ Microbiol, in press
14. But which method makes the ‘best’ OTUs?
15. ‘Good’ OTUs should correspond to ‘true’ bacterial lineages (‘species’)
they should comply with evolutionary theory of bacterial speciation
BUT: no unifying / commonly accepted bacterial species concept
Two main criteria for theory-compliant OTUs
phylogenetic consistency (represent monophyletic lineages)
ecological consistency (represent ecologically homogeneous groups of organisms)
Gevers et al., Nat Rev Microbiol, 2005
Cohan, Philos T R Soc B, 2006
Koeppel et al., PNAS, 2008
Hunt et al., Science, 2008
Fraser et al., Science, 2009
Vos, Trends Microbiol, 2011
Koeppel & Wu, NAR, 2013
Preheim et al., Appl Env Microbiol, 2013
[and many more…]
16. [Figure: word cloud. Terms shown:]
rumen, halotolerant, hypersaline, pathogenic, intestinal, infection, degradation, day, resistant, producing, gut, endosymbiont, deep, mat, thermophilic, high, metal, activated, cold, milk, soil, environmental, diverse, iron, diversity, sediment, water, community, marine, associated, acid, plant, sludge, anaerobic, field, sea, rhizosphere, lake, spring, halophilic, culture, consortium, extremely, archaeon, paddy, pesticide, activity, root, surface, production, contaminated, wastewater, structure, degrading, seawater, treatment, hydrothermal, oil, feces, hot, biofilm, waste, endophytic, nodule, freshwater, deepsea, reactor, vent, enrichment, microbiota, growth, disease, pathogen, salt, patient, aerobic, coastal, mine, host, fermented, culturable, habitat, archaeal, actinomycete, res, pond, lactic, forest, region, clinical, symbiont, biodegradation, temperature, skin, moderately, antarctic, methanogenic, swab, reveal, zone, ocean, tract, natural, control, bioreactor, river, sponge, produced, carbon, blood, fluid, coral, mud, food, shift, highly, leaf, ice, organic, rock, draft, diet, oral, tree, solar, stream, coast, wild, core, fed, low, grown, tidal, fecal, mineral, flat, compost, saline, symbiotic, content, saltern, alkaline, diseased, rhizobia, wound, active, intestine, traditional, sand, subsurface, antimicrobial, fermentation, effluent, comb, sewage, condition, caused, product, treating, sulfatereducing, ecology, purification, station, hydrocarbon, nitrogen, coidentity, degrade, resistance, mangrove, methane, polluted, acidic, antibiotic, cultivation, oxidation, probiotic, cultured, methanogen, process, revealed, tissue, agricultural, chemical, heterotrophic, biocontrol, alkaliphilic, legume, denitrifying, indigenous, industrial, correlate, defense, cluster, heavy, reduction, tolerant, aquifer, reservoir, wetland, diabetic, enriched, chloroplast, cultivated, cultureindependent, nitrogenfixing, prolonged, protease, basin, compound, mesophilic, microbiome, removal, formation, laboratory, adult, anoxic, petroleum, termite, functional, aquatic, association, factory, fresh, antifungal, korean, terrestrial, involved, promoting, geothermal, bay, black, island, sulfur, drainage, farm, groundwater, hydrogen
17. [Figure, panel A: Ecological Consistency Score (ECS; axis ticks 1000 to 6000) against number of OTUs (log scale; axis ticks 1000 to 100000) for average linkage, single linkage, complete linkage, UCLUST and CD-HIT; label: 97% nominal similarity]
Schmidt et al., PLOS Comp Biol, 2014
18. [Figure: Ecological Consistency Score (ECS) against number of OTUs (log scale) for average linkage, single linkage, complete linkage, UCLUST and CD-HIT; label: 97% nominal similarity. Panels: A (title not recoverable); B Archaea, ecological terms; C Eukarya, ecological terms; D Bacteria, sampling sites; E Bacteria, host taxonomy; F Bacteria, ENVO terms]
Schmidt et al., PLOS Comp Biol, 2014
19. Conclusions
replicability
clustering was generally replicable
robustness
AL, CL & CD-HIT were highly robust to (slightly) changing thresholds; UCLUST, UPARSE & SL were more sensitive
similar trends for robustness to clustering context and choice of subregion (not shown)
reproducibility
surprisingly discordant partitions by different methods
similarity maxima generally off-diagonal
AL and CD-HIT most similar pair
implications for reference-based OTU-binning: choice of reference clustering determines quality!
ecological consistency
CL provided most consistent OTU sets
implications for taxonomy and species definitions?