"Phylogeny-driven studies in genomics and metagenomics" talk by Jonathan Eisen at #CSMUBC2012
1. Phylogeny-Driven Approaches to
Genomics and Metagenomics
June 23, 2012
Canadian Society for Microbiology
Jonathan A. Eisen
University of California, Davis
@phylogenomics
2. Acknowledgements
• $$$
• DOE
• NSF
• GBMF
• Sloan
• DARPA
• DSMZ
• DHS
• People, places
• DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
• UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell
Neches, Jenna Morgan-Lang
• Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak,
Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward,
Hans-Peter Klenk
4. Phylogeny: What is it?
• Phylogeny is a description of
the evolutionary history of
relationships among organisms
(or their parts).
• This is frequently portrayed in
a diagram called a phylogenetic
tree.
• Phylogenies can be more
complex than a bifurcating tree
(e.g., lateral gene transfer,
recombination, hybridization)
5. Whatever the History:
Trying to Incorporate it is Critical
from Lake et al. doi: 10.1098/rstb.2009.0035
6. Phylogeny
• Applies to
• Species
• Genes
• Genomes
10. rRNA Phylotyping
[Figure: rRNA phylotyping workflow. DNA extraction → PCR (makes lots of copies of the rRNA genes in the sample) → sequence the rRNA genes → sequence alignment = data matrix → phylogenetic tree placing rRNA1 to rRNA4 alongside E. coli, yeast, and humans.]
Sequenced rRNA genes:
rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2 5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’
Sequence alignment = Data matrix:
rRNA1    A C A C A C
rRNA2    T A C A G T
rRNA3    C A C T G T
rRNA4    C A C A G T
E. coli  A G A C A G
Humans   T A T A G T
Yeast    T A C A G T
11. rRNA Phylotyping
• Collect DNA from
environment
• PCR amplify rRNA
genes using broad
(so-called universal)
primers
• Sequence
• Align to others
• Infer evolutionary tree
• Unknowns “identified”
by placement on tree
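As a rough illustration of the "align to others" and "identify by placement" steps, the sketch below counts mismatches between the toy aligned sequences from the slide 10 data matrix and reports the closest reference for each unknown. This is a deliberately simplified stand-in: real phylotyping infers a tree rather than picking the nearest sequence, and the function names here are hypothetical.

```python
# Toy data matrix copied from the slide 10 alignment (six aligned columns).
unknowns = {
    "rRNA1": "ACACAC",
    "rRNA2": "TACAGT",
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
}
references = {
    "E. coli": "AGACAG",
    "Humans": "TATAGT",
    "Yeast": "TACAGT",
}


def mismatches(seq_a, seq_b):
    """Count alignment columns where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def nearest_reference(query, refs):
    """Return the reference name with the fewest mismatches to the query.

    Assumption: nearest-sequence lookup stands in for tree placement.
    """
    return min(refs, key=lambda name: mismatches(query, refs[name]))


closest = {name: nearest_reference(seq, references)
           for name, seq in unknowns.items()}
for name, ref in closest.items():
    print(name, "->", ref)
```

A nearest-sequence lookup ignores rate variation and shared ancestry, which is exactly why the talk argues for placing unknowns on a tree instead.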
12. Era IV: Genomes in Environment
shotgun
sequence
Metagenomics
15. Weighted % of Clones
[Figure: bar chart of Sargasso phylotypes. Y axis: weighted % of clones (0 to 0.500). X axis: major phylogenetic group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
29. OTUs on Tree
[Figure: phylogenetic tree with tips OTU1, OTU5, OTU4, OTU6, OTU2, OTU3, OTU7, OTU9, OTU8, OTU10.]
30. OTUs on Tree
• Clades
• Rates of change
• LGT
• Convergence
• Character history
[Figure: phylogenetic tree with tips OTU1, OTU5, OTU4, OTU6, OTU2, OTU3, OTU7, OTU9, OTU8, OTU10.]
31. Unifrac
[Slide shows two overlapping page excerpts: one on weighted UniFrac and one on phylogenetic diversity (PD). The recoverable text follows; fragments cut off on the slide are marked [...].]

[...] typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).
Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are [...]

Weighted UniFrac. Weighted UniFrac is a new variant of the original unweighted UniFrac measure that weights the branches of a phylogenetic tree based on the abundance of information (Fig. 1B). Weighted UniFrac is thus a quantitative measure of β diversity that can detect changes in how many sequences from each lineage are present, as well as detect changes in which taxa are present. This ability is important because the relative abundance of different kinds of bacteria can be critical for describing community changes. In contrast, the original, unweighted UniFrac (Fig. 1A) is a qualitative β diversity measure because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them).
The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:

$$ u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right| $$

Here, n is the total number of branches in the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences that descend from branch i in communities A and B, respectively, and A_T and B_T are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, A_i and B_i are divided by A_T and B_T.
If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:

$$ D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right) $$

Here, d_j is the distance of sequence j from the root, A_j and B_j are the numbers of times the sequences were observed in communities A and B, respectively, and A_T and B_T are the total numbers of sequences from communities A and B, respectively.
Clustering with normalized u values treats each sample equally instead of [...]

FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.

Figure 1. Estimates of Phylogenetic Diversity (PD) and PD Gain (G) for the grey community. The boxes represent taxa from the black, white, and grey communities. (A) PD is the sum of the branches leading to the grey taxa. (B) G is the sum of the branches leading only to the grey taxa. (C) PD rarefaction curves showing the increase in branch length with sampling effort for the intestinal and stool bacteria from three healthy individuals. Aligned 16S rRNA sequences from the three individuals were available with the Supplementary Materials in (Eckburg, et al., 2005). The Arb parsimony insertion tool was used to add the sequences to a tree containing over 9,000 sequences (Hugenholtz, 2002) that is available for download at the rRNA Database Project II website (Maidak, et al., 2001). The curves represent the average values for 50 replicate trials.
FEMS Microbiol Rev. Author manuscript; available in PMC 2009 July 1.
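The raw weighted UniFrac value (u) and the scaling factor (D) described in the excerpt on this slide can be computed directly from a rooted tree and per-community sequence counts. The following is a minimal sketch under stated assumptions: the tree encoding (a `parent` map from each node to its parent and branch length), the function name, and the toy tree are all hypothetical, and both communities are assumed to be non-empty.

```python
from collections import defaultdict


def weighted_unifrac(parent, counts_a, counts_b):
    """Return (u, D) for two communities on a rooted tree.

    parent maps node -> (parent_node, branch_length); the root has no entry.
    counts_a / counts_b map tip name -> number of sequences observed.
    """
    total_a = sum(counts_a.values())  # A_T
    total_b = sum(counts_b.values())  # B_T
    below_a = defaultdict(float)      # A_i / A_T for the branch above each node
    below_b = defaultdict(float)      # B_i / B_T for the branch above each node
    scale = 0.0                       # D

    for tip in set(counts_a) | set(counts_b):
        frac_a = counts_a.get(tip, 0) / total_a
        frac_b = counts_b.get(tip, 0) / total_b
        node, dist_to_root = tip, 0.0
        # Walk from the tip to the root, crediting every branch on the way.
        while node in parent:
            below_a[node] += frac_a
            below_b[node] += frac_b
            node_parent, length = parent[node]
            dist_to_root += length
            node = node_parent
        scale += dist_to_root * (frac_a + frac_b)

    raw_u = sum(parent[n][1] * abs(below_a[n] - below_b[n]) for n in below_a)
    return raw_u, scale


# Hypothetical toy tree: root -> X (1.0), root -> t3 (2.0), X -> t1, t2 (1.0 each).
toy_parent = {"X": ("root", 1.0), "t3": ("root", 2.0),
              "t1": ("X", 1.0), "t2": ("X", 1.0)}
u, d = weighted_unifrac(toy_parent, {"t1": 1, "t2": 1}, {"t3": 2})
print(u / d)  # normalized distance for two non-overlapping communities
```

In this toy case the two communities share no tips, so the normalized value u / D comes out at 1, and comparing a community with itself gives 0, matching the behaviour the excerpt describes.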
33. RecA, RpoB in GOS
GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Wu et al PLoS One 2011
34. Uses of Phylogeny
in Genomics and Metagenomics
Example 2:
Functional Diversity and
Functional Predictions
35. Predicting Function
• Key step in genome projects
• More accurate predictions help guide
experimental and computational
analyses
• Many diverse approaches
• All are improved by “phylogenomic”
type analyses that integrate
evolutionary reconstructions and an
understanding of how new functions
evolve
36. Predicting Function
• Identification of motifs
– Short regions of sequence similarity that are indicative of
general activity
– e.g., ATP binding
• Homology/similarity based methods
– Gene sequence is searched against a database of other
sequences
– If significantly similar genes are found, their functional
information is used
• Problem
– Genes frequently have similarity to hundreds of motifs and
multiple genes, not all with the same function
37. From Eisen et al. 1997 Nature Medicine 3: 1076-1078.
38. Blast Search of H. pylori “MutS”
• Blast search pulls up Syn. sp MutS#2 with a much more significant p
value than other MutS homologs
• Based on this, TIGR predicted this species had mismatch
repair
• Assumes functional constancy
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
40. Overlaying Functions onto Tree
[Figure: gene tree of the MutS family with functions overlaid. Labelled subfamilies: MutS1, MutS2, MSH1, MSH2, MSH3, MSH4, MSH5, MSH6. Tip labels include Aquae, Strpy, Bacsu, Synsp, Deira, Helpy, Borbu, Metth, Trepa, Chltr, Theaq, Thema, Ecoli, Neigo, Yeast, Spombe, Neucr, Arath, Celeg, Fly, Xenla, Rat, Mouse, and Human.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
42. PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: the method applied to two examples. Example A: genes 1A, 2A, 3A, 1B, 2B, 3B from Species 1, Species 2, and Species 3, with a gene duplication. Example B: genes 1 to 6. Branches on the inferred gene trees are marked "Duplication?" and one inferred function is marked "Ambiguous".]
METHOD
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN)
Based on Eisen, 1998 Genome Res 8: 163-167.
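The last two steps of the method (overlay known functions onto the tree, then infer the likely function of the gene of interest) can be sketched as a walk over a gene tree. This is only an illustration under assumptions: the nested-tuple tree encoding, the function names, the "function A"/"function B" annotations, and the rule "use the smallest clade around the query that contains any annotated gene" are all hypothetical simplifications, not the procedure from the cited paper.

```python
def leaves(tree):
    """List the gene names at the tips of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaves(child))
    return names


def predict_function(tree, known, query):
    """Predict a function for `query` from the smallest annotated clade.

    Returns the shared function, "ambiguous" if annotated relatives in that
    clade disagree, or None if the query or any annotation is missing.
    """
    if isinstance(tree, str) or query not in leaves(tree):
        return None
    # Prefer the deepest (smallest) clade that still gives an answer.
    for child in tree:
        if not isinstance(child, str) and query in leaves(child):
            deeper = predict_function(child, known, query)
            if deeper is not None:
                return deeper
    labels = {known[g] for g in leaves(tree) if g in known and g != query}
    if not labels:
        return None
    return labels.pop() if len(labels) == 1 else "ambiguous"


# Gene names follow Example A on the slide; the annotations are made up.
gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known_functions = {"1A": "function A", "2A": "function A",
                   "1B": "function B", "3B": "function B"}
print(predict_function(gene_tree, known_functions, "3A"))
print(predict_function((gene_tree, "X"), known_functions, "X"))
```

The second call places a hypothetical gene X outside both subfamilies, so the only annotated clade around it mixes two functions and the sketch reports an ambiguous prediction rather than copying a function from the best hit.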
48. Uses of Phylogeny
in Genomics and Metagenomics
Example 3:
Selecting Organisms for Study
49. As of 2002
• At least 40 phyla of bacteria
[Figure: tree of bacterial phyla: Proteobacteria, TM6, OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi, Fibrobacteres, Marine GroupA, WS3, Gemmimonas, Firmicutes, Fusobacteria, Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres, Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes, Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7, Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria, Thermotogae, OP1, OP11.]
Based on Hugenholtz, 2002
50. As of 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
[Figure: tree of bacterial phyla, as on slide 49.]
Based on Hugenholtz, 2002
51. As of 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some studies in other phyla
[Figure: tree of bacterial phyla, as on slide 49.]
Based on Hugenholtz, 2002
52. As of 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
[Figure: tree of bacterial phyla, as on slide 49.]
Based on Hugenholtz, 2002
53. As of 2002
• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Viruses
[Figure: tree of bacterial phyla, as on slide 49.]
Based on Hugenholtz, 2002
56. GEBA: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan
Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne
Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla
Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen,
Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor
Markowitz, et al)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu,
Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain,
Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati,
Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)
57. GEBA Now
• 300+ genomes
• Rich sampling of major groups of
cultured organisms
58. GEBA Lesson 1:
The rRNA Tree of Life is a Useful Tool
From Wu et al. 2009 Nature 462, 1056-1060
59. GEBA Lesson 2:
The rRNA Tree of Life is not perfect ...
[Figure panels: 16S, WGT, 23S]
Badger et al. 2005 Int J System Evol Microbiol 55: 1021-1026.
60. GEBA Lesson 3:
Phylogeny improves genome annotation
• Took 56 GEBA genomes and compared results vs. 56
randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary”
based predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction
63. Phylogenetic Distribution Novelty:
Bacterial Actin Related Protein
[Figure: phylogenetic tree of eukaryotic actin and actin-related proteins (ACTIN, ARP1, ARP2, ARP3, ARP4, ARP5, ARP6, ARP7, ARP10) together with the bacterial sequence H. ochraceum gi227395998, labelled BARP. Tips carry species names and gi identifiers, branches carry support values, and the scale bar is 0.5.]
Haliangium ochraceum DSM 14365
Patrik D’haeseleer, Adam Zemla, Victor Kunin
Wu et al. 2009 Nature 462, 1056-1060. See also Guljamow et al. 2007 Current Biology.
64. Protein Family Rarefaction
• Take data set of multiple complete
genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
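The rarefaction idea above can be sketched as follows: given the set of protein families found in each genome, add genomes in random order and record how many distinct families have been seen so far, averaged over many orderings. This sketch assumes the family assignment (for example from MCL clustering) has already been done; the function name, the genome and family labels, and the default of 100 random orderings are hypothetical.

```python
import random


def family_rarefaction(genome_families, trials=100, seed=0):
    """Average number of distinct protein families seen after k genomes.

    genome_families maps genome name -> set of protein family identifiers.
    Returns a list whose entry k-1 is the mean family count for k genomes.
    """
    rng = random.Random(seed)
    names = list(genome_families)
    totals = [0] * len(names)
    for _ in range(trials):
        rng.shuffle(names)  # a new random sampling order each trial
        seen = set()
        for k, name in enumerate(names):
            seen |= genome_families[name]
            totals[k] += len(seen)
    return [total / trials for total in totals]


# Hypothetical toy input: three genomes, six families in total.
toy_genomes = {
    "genome1": {"fam1", "fam2", "fam3"},
    "genome2": {"fam2", "fam3", "fam4"},
    "genome3": {"fam1", "fam5", "fam6"},
}
curve = family_rarefaction(toy_genomes)
for n_genomes, n_families in enumerate(curve, start=1):
    print(n_genomes, n_families)
```

Plotting `curve` against the number of genomes gives the kind of rarefaction plot the slide describes; a curve that keeps climbing suggests that additional genomes are still contributing new protein families.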
72. Weighted % of Clones
GEBA Project improves metagenomic analysis
[Figure: "Metagenomic Phylotyping" bar chart of Sargasso phylotypes. Y axis: weighted % of clones (0 to 0.500). X axis: major phylogenetic group, from Alphaproteobacteria to Crenarchaeota. Series: EFG, EFTu, rRNA, RecA, RpoB, HSP70.]
Venter et al., Science 304: 66-74. 2004
73. Weighted % of Clones
But not a lot
[Figure: "Metagenomic Phylotyping" bar chart of Sargasso phylotypes. Y axis: weighted % of clones (0 to 0.500). X axis: major phylogenetic group, from Alphaproteobacteria to Crenarchaeota. Series: EFG, EFTu, rRNA, RecA, RpoB, HSP70.]
Venter et al., Science 304: 66-74. 2004
77. Major Issues in Phylotyping
Beyond Moore’s Law Metagenomics
Short reads
78. Major Issues in Phylotyping
Beyond Moore’s Law Metagenomics
Short reads
WE NEED NEW
METHODS
79. Method 1: Each is an island
• Each new sequence is an island
• Take reference data
• Build alignment, models, trees
• Add new sequence to reference alignment
and build tree
80. STAP ss-rRNA Taxonomy Pip
Each sequence analyzed separately
Figure 1. A flow chart of the STAP pipeline.
doi:10.1371/journal.pone.0002566.g001
[...] STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment [...]
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs.
doi:10.1371/journal.pone.0002566.g002
Wu et al. 2008 PLoS One
81. AMPHORA
[Figure label: Guide tree]
Wu and Eisen Genome Biology 2008 9:R151
doi:10.1186/gb-2008-9-10-r151
85. Phylogenetic Challenge
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxx xxxxxxxxxxxxx
xxxxxxxxxxxxxx
xxxxxxxxxxxxxx
A single tree with everything?
86. Phylogenetic Challenge
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxx xxxxxxxxxxxxx
xxxxxxxxxxxxxx
xxxxxxxxxxxxxx
A single tree with everything
(as long as there is a lot of overlap)
92. Weighted % of Clones
Protein vs. rRNA Sargasso Data
[Figure: bar chart of Sargasso phylotypes. Y axis: weighted % of clones (0 to 0.500). X axis: major phylogenetic group, from Alphaproteobacteria to Crenarchaeota. Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66-74. 2004
94. Method 3: All in the family
• Combine new sequences into one tree
• Take reference data
• Build alignment, models, trees
• Add all sequences to reference alignment
and build tree
97. PhylOTU Finding Metagenomic OT
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this general workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001
PhylOTU - Sharpton et al. PLoS Comp. Bio 2011
100. Method 4: All in the genome
• Combine new sequences from different
gene families into one tree
• Take reference data
• Build alignment, models
• Concatenate
• Add all sequences to reference alignment
and build tree
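The "Concatenate" step can be illustrated with a small sketch that joins per-gene alignments into one supermatrix, padding with gaps wherever a sequence source (a reference genome or a metagenomic read) has no sequence for a gene family. The function name, the gene and taxon labels, and the choice of "-" for missing data are assumptions made for illustration; this is not the pipeline used in the talk.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments into one row per taxon.

    gene_alignments maps gene name -> {taxon: aligned sequence}.
    Taxa missing from a gene are filled with gap characters.
    """
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    for gene in sorted(gene_alignments):  # fixed gene order for reproducibility
        alignment = gene_alignments[gene]
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError(f"alignment for {gene} has uneven row lengths")
        width = widths.pop()
        for taxon in taxa:
            supermatrix[taxon] += alignment.get(taxon, "-" * width)
    return supermatrix


# Hypothetical toy input: one read covers only recA, one reference only rpoB.
toy_alignments = {
    "recA": {"ref1": "ACGT", "readX": "AC-T"},
    "rpoB": {"ref1": "GGA", "ref2": "GGC"},
}
combined = concatenate_alignments(toy_alignments)
for taxon, row in combined.items():
    print(taxon, row)
```

Rows for short metagenomic reads end up mostly gaps, which is one reason the slides stress that a single combined tree works only when there is a lot of overlap.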
101. Challenge
• Each gene poorly sampled in
metagenomes
• Can we combine all into a single tree?
103. Kembel Combiner
[Figure: cropped page excerpt (VOL. 73, 2007), cut off at the right edge. TABLE 1 contrasts measures that consider only presence/absence of taxa with measures that additionally account for the number of times each taxon was observed. The visible text describes phylogenetic β diversity measures (unweighted UniFrac, the fixation index FST, and the P test) and introduces the quantitative “weighted UniFrac”. FIG. 1 shows the calculation of the unweighted and the weighted UniFrac measures.]
109. Sifting Families
[Figure: SFams workflow diagram. Boxes: Representative Genomes; Extract Protein Annotation; All v. All BLAST; Homology Clustering (MCL); Align; Build HMMs; SFams HMMs; New Genomes; Extract Protein Annotation; Screen for Homologs.]
Sharpton et al. submitted, Figure 1