"Phylogeny-driven studies in genomics and metagenomics" talk by Jonathan Eisen at #CSMUBC2012

Phylogeny-Driven Approaches to
Genomics and Metagenomics
June 23, 2012
Canadian Society for Microbiology

Jonathan A. Eisen
University of California, Davis
@phylogenomics

Acknowledgements

• $$$
• DOE
• NSF
• GBMF
• Sloan
• DARPA
• DSMZ
• DHS
• People, places
• DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
• UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell
Neches, Jenna Morgan-Lang
• Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak,
Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward,
Hans-Peter Klenk

Phylogeny: What is it?

• Phylogeny is a description of
the evolutionary history of
relationships among organisms
(or their parts).
• This is frequently portrayed in
a diagram called a phylogenetic
tree.
• Phylogenies can be more
complex than a bifurcating tree
(e.g., lateral gene transfer,
recombination, hybridization)

Whatever the History:
Trying to Incorporate it is Critical

from Lake et al. doi: 10.1098/rstb.2009.0035

Phylogeny

• Applies to
• Species
• Genes
• Genomes

Phylogeny: What is it good for?


Uses of Phylogeny
in Genomics and Metagenomics

Uses of Phylogeny

Example 1:

Phylotyping

rRNA Phylotyping

[Figure: DNA extraction → PCR (makes lots of copies of the rRNA genes in sample) → sequence rRNA genes → sequence alignment = data matrix → phylogenetic tree]

Sequences
rRNA1  5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2  5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3  5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4  5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’

Sequence alignment = Data matrix
rRNA1    A C A C A C
rRNA2    T A C A G T
rRNA3    C A C T G T
rRNA4    C A C A G T
E. coli  A G A C A G
Humans   T A T A G T
Yeast    T A C A G T

Phylogenetic tree: rRNA1, rRNA2, rRNA3, rRNA4 placed among E. coli, Humans, Yeast

rRNA Phylotyping
• Collect DNA from
environment
• PCR amplify rRNA
genes using broad
(so-called universal)
primers
• Sequence
• Align to others
• Infer evolutionary tree
• Unknowns “identified”
by placement on tree
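
A minimal sketch of the "align, then compare" part of this workflow, using the six-column toy data matrix from the figure slide. It only computes pairwise p-distances (the usual input to a distance-based tree method) and reports the closest reference for each unknown; that nearest-reference lookup is a crude stand-in for real tree placement, not the method used in the talk. The names `ALIGNMENT`, `REFERENCES`, `p_distance`, `distance_matrix`, and `closest_reference` are hypothetical.

```python
from itertools import combinations

# Toy data matrix transcribed from the slide (one row per aligned sequence).
ALIGNMENT = {
    "rRNA1": "ACACAC",
    "rRNA2": "TACAGT",
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "E. coli": "AGACAG",
    "Humans": "TATAGT",
    "Yeast": "TACAGT",
}
REFERENCES = ("E. coli", "Humans", "Yeast")


def p_distance(a, b):
    """Fraction of aligned columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(alignment):
    """Pairwise p-distances, keyed by an unordered pair of names."""
    return {
        frozenset((n1, n2)): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


def closest_reference(name, alignment, references):
    """Stand-in for tree placement: the reference with the smallest distance."""
    return min(references, key=lambda ref: p_distance(alignment[name], alignment[ref]))


if __name__ == "__main__":
    for unknown in ("rRNA1", "rRNA2", "rRNA3", "rRNA4"):
        print(unknown, "->", closest_reference(unknown, ALIGNMENT, REFERENCES))
```

In a real analysis the distance matrix (or the alignment itself) would be passed to a tree-inference program, and unknowns would be identified by where they fall on the resulting tree.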

Era IV: Genomes in Environment

shotgun
sequence

Metagenomics

rRNA Phylotyping in Sargasso

Venter et al., Science
304: 66. 2004

RecA Phylotyping in Sargasso Data

Venter et al., Science 304: 66. 2004

Sargasso Phylotypes

[Figure: bar chart. Y axis: Weighted % of Clones (0 to 0.500). X axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]

Venter et al., Science 304: 66-74. 2004

Binning challenge

Best binning method: reference genomes

Binning challenge

No reference genome? What do you do?

Binning challenge


Composition, Assembly, others

Binning challenge


Phylogeny

Sulcia makes amino acids

Baumannia makes vitamins and cofactors

Wu et al. 2006 PLoS Biology 4: e188.

rRNA survey

• Sequence rRNAs
• Cluster
• Identify “OTUs”

OTU1, OTU2, OTU3, OTU4, OTU5, OTU6, OTU7, OTU8, OTU9, OTU10
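
A minimal sketch of the clustering step, assuming the rRNA sequences are already aligned to equal length. It uses greedy centroid clustering at an identity threshold; both the greedy strategy and the 97% default are assumptions chosen for illustration, not something the slides specify, and `identity` and `greedy_otus` are hypothetical names.

```python
def identity(a, b):
    """Fraction of aligned columns at which two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Group sequences into OTUs.

    Each sequence joins the first existing OTU whose centroid (its first
    member) is at least `threshold` identical; otherwise it founds a new OTU.
    Returns a list of lists of sequence names.
    """
    otus = []  # list of (centroid_sequence, [member names])
    for name, seq in seqs.items():
        for centroid, members in otus:
            if identity(seq, centroid) >= threshold:
                members.append(name)
                break
        else:
            otus.append((seq, [name]))
    return [members for _, members in otus]


if __name__ == "__main__":
    toy = {
        "read1": "AAAAAAAAAA",
        "read2": "AAAAAAAAAT",  # 90% identical to read1
        "read3": "CCCCCCCCCC",
    }
    print(greedy_otus(toy, threshold=0.9))
```

Greedy clustering depends on input order, which is one reason the talk goes on to place OTUs on a tree rather than treating them as unrelated bins.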

OTUs on Tree

[Figure: tree with tips, top to bottom: OTU1, OTU5, OTU4, OTU6, OTU2, OTU3, OTU7, OTU9, OTU8, OTU10]

• Clades
• Rates of change
• LGT
• Convergence
• Character history

UniFrac

... typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).

Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are ...

Weighted UniFrac. Weighted UniFrac is a new variant of the original unweighted UniFrac measure that weights the branches of a phylogenetic tree based on the abundance of information (Fig. 1B). Weighted UniFrac is thus a quantitative measure of β diversity that can detect changes in how many sequences from each lineage are present, as well as detect changes in which taxa are present. This ability is important because the relative abundance of different kinds of bacteria can be critical for describing community changes. In contrast, the original, unweighted UniFrac (Fig. 1A) is a qualitative β diversity measure because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them).

The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:

$$u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$$

Here, n is the total number of branches in the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences that descend from branch i in communities A and B, respectively, and A_T and B_T are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, A_i and B_i are divided by A_T and B_T.

If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:

$$D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right)$$

Here, d_j is the distance of sequence j from the root, A_j and B_j are the numbers of times the sequences were observed in communities A and B, respectively, and A_T and B_T are the total numbers of sequences from communities A and B, respectively.

Clustering with normalized u values treats each sample equally instead of ...

FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.

Figure 1. Estimates of Phylogenetic Diversity (PD) and PD Gain (G) for the grey community. The boxes represent taxa from the black, white, and grey communities. (A) PD is the sum of the branches leading to the grey taxa. (B) G is the sum of the branches leading only to the grey taxa. (C) PD rarefaction curves showing the increase in branch length with sampling effort for the intestinal and stool bacteria from three healthy individuals. Aligned 16S rRNA sequences from the three individuals were available with the Supplementary Materials in (Eckburg, et al., 2005). The Arb parsimony insertion tool was used to add the sequences to a tree containing over 9,000 sequences (Hugenholtz, 2002) that is available for download at the rRNA Database Project II website (Maidak, et al., 2001). The curves represent the average values for 50 replicate trials.

FEMS Microbiol Rev. Author manuscript; available in PMC 2009 July 1.
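
The raw value u and the scaling factor D from the weighted UniFrac excerpt above can be sketched in a few lines. This is an illustration under assumptions, not the published implementation: the tree is represented as a flat list of `(branch_length, set_of_descendant_tips)` pairs, communities are dictionaries of tip counts, and the function name `weighted_unifrac` is hypothetical.

```python
def weighted_unifrac(branches, counts_a, counts_b, normalize=False):
    """Weighted UniFrac between two communities.

    branches : list of (length, tips) pairs, one per branch of a rooted tree,
               where `tips` is the set of tip names descending from the branch
    counts_a, counts_b : dict mapping tip name -> number of sequences observed
    normalize : if True, return u / D instead of the raw value u
    """
    total_a = sum(counts_a.values())  # A_T
    total_b = sum(counts_b.values())  # B_T

    u = 0.0
    root_distance = {}  # d_j: summed length of all branches above tip j
    for length, tips in branches:
        a_i = sum(counts_a.get(t, 0) for t in tips)
        b_i = sum(counts_b.get(t, 0) for t in tips)
        u += length * abs(a_i / total_a - b_i / total_b)
        for t in tips:
            root_distance[t] = root_distance.get(t, 0.0) + length

    if not normalize:
        return u

    d = sum(
        dist * (counts_a.get(t, 0) / total_a + counts_b.get(t, 0) / total_b)
        for t, dist in root_distance.items()
    )
    return u / d


if __name__ == "__main__":
    # Toy rooted tree ((t1:1,t2:1):1,(t3:1,t4:1):1), written branch by branch.
    tree = [
        (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t3"}), (1.0, {"t4"}),
        (1.0, {"t1", "t2"}), (1.0, {"t3", "t4"}),
    ]
    community_a = {"t1": 2, "t2": 2}
    community_b = {"t3": 1, "t4": 3}
    print(weighted_unifrac(tree, community_a, community_b))
    print(weighted_unifrac(tree, community_a, community_b, normalize=True))
```

On this ultrametric toy tree the two communities share no tips, so the normalized value reaches its maximum, matching the excerpt's description of nonoverlapping communities.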

Caveat: Not Everything in Groups

RecA, RpoB in GOS

GOS 1

GOS 2

GOS 3

GOS 4

GOS 5
Wu et al PLoS One 2011

Uses of Phylogeny

Example 2:

Functional Diversity and
Functional Predictions

Predicting Function

• Key step in genome projects
• More accurate predictions help guide
experimental and computational
analyses
• Many diverse approaches
• All are improved by “phylogenomic”-type
analyses that integrate
evolutionary reconstructions and an
understanding of how new functions
evolve

Predicting Function

• Identification of motifs
– Short regions of sequence similarity that are indicative of
general activity
– e.g., ATP binding
• Homology/similarity based methods
– Gene sequence is searched against a database of other
sequences
– If significantly similar genes are found, their functional
information is used
• Problem
– Genes frequently have similarity to hundreds of motifs and
multiple genes, not all with the same function

From Eisen et al.
1997 Nature
Medicine 3:
1076-1078.

BLAST Search of H. pylori “MutS”

• BLAST search pulls up Syn. sp MutS#2 with much higher p
value than other MutS homologs
• Based on this TIGR predicted this species had mismatch
repair
• Assumes functional constancy
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.

MutL??

Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.

Overlaying Functions onto Tree

[Figure: gene tree of MutS homologs with labeled subfamilies MutS1, MutS2, MSH1, MSH2, MSH3, MSH4, MSH5, and MSH6. Tip labels include Ecoli, Neigo, Synsp, Bacsu, Strpy, Thema, Theaq, Deira, Chltr, Trepa, Borbu, Aquae, Helpy, Metth, mSaco, Yeast, Spombe, Neucr, Arath, Celeg, Fly, Xenla, Rat, Mouse, Human.]

Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.

PHYLOGENETIC PREDICTION OF GENE FUNCTION

[Figure: flow chart of the method, illustrated with EXAMPLE A (genes 1A, 2A, 3A, 1B, 2B, 3B; Species 1, Species 2, Species 3) and EXAMPLE B (genes 1, 2, 3, 4, 5, 6). Tree labels: Duplication?, Ambiguous, Duplication.]

METHOD
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
• ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN)

Based on Eisen, 1998 Genome Res 8: 163-167.
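
The last two steps of the method (overlay known functions onto the tree, then infer the likely function of the gene of interest) can be sketched as follows. This is a simplified illustration, not the procedure from the cited paper: the gene tree is a nested tuple, the rule used is "take the function shared by the annotated genes in the smallest clade that contains the query and at least one annotated gene", and the tree shape, function labels, and names `leaves`, `clades_containing`, and `predict_function` are all hypothetical.

```python
def leaves(tree):
    """All tip names under a node (a tip is a string, a clade is a tuple)."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaves(child))
    return names


def clades_containing(tree, query):
    """Nodes on the path from the query tip up to the root, smallest first."""
    if isinstance(tree, str):
        return [tree] if tree == query else []
    for child in tree:
        path = clades_containing(child, query)
        if path:
            return path + [tree]
    return []


def predict_function(tree, query, known):
    """Infer a function for `query` from annotated relatives on the tree."""
    for clade in clades_containing(tree, query):
        functions = {known[g] for g in leaves(clade) if g in known and g != query}
        if functions:
            # One function in the clade: predict it. Several: do not guess.
            return functions.pop() if len(functions) == 1 else "ambiguous"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical gene tree for genes named as in Example A: an ancient
    # duplication separates the A and B paralogs.
    gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
    known = {"1A": "function A", "3A": "function A",
             "1B": "function B", "3B": "function B"}
    print(predict_function(gene_tree, "2A", known))
    print(predict_function(gene_tree, "2B", known))
```

The point of the design is that the prediction follows tree position rather than the top database hit, so a query that branches outside all annotated subfamilies is reported as ambiguous instead of inheriting a function by default.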

Diversity of Proteorhodopsins

Venter et al., 2004

Carboxydothermus sporulates

Wu et al. 2005 PLoS Genetics 1: e65.


Uses of Phylogeny

Example 3:

Selecting Organisms for Study

As of 2002

• At least 40 phyla of bacteria

[Figure: tree of bacterial phyla: Proteobacteria, TM6, OS-K, Acidobacteria, Termite Group, OP8, Nitrospira, Bacteroides, Chlorobi, Fibrobacteres, Marine Group A, WS3, Gemmimonas, Firmicutes, Fusobacteria, Actinobacteria, OP9, Cyanobacteria, Synergistes, Deferribacteres, Chrysiogenetes, NKB19, Verrucomicrobia, Chlamydia, OP3, Planctomycetes, Spirochaetes, Coprothermobacter, OP10, Thermomicrobia, Chloroflexi, TM7, Deinococcus-Thermus, Dictyoglomus, Aquificae, Thermodesulfobacteria, Thermotogae, OP1, OP11]

Based on Hugenholtz, 2002

• At least 40 phyla of bacteria
• Most genomes from three phyla
[Figure: tree of bacterial phyla, as on the "As of 2002" slide]
Based on Hugenholtz, 2002

• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some studies in other phyla
[Figure: tree of bacterial phyla, as on the "As of 2002" slide]
Based on Hugenholtz, 2002

• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
[Figure: tree of bacterial phyla, as on the "As of 2002" slide]
Based on Hugenholtz, 2002

• At least 40 phyla of bacteria
• Most genomes from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Viruses
[Figure: tree of bacterial phyla, as on the "As of 2002" slide]
Based on Hugenholtz, 2002

GEBA

http://www.jgi.doe.gov/programs/GEBA/pilot.html

GEBA: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan
Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne
Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla
Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen,
Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor
Markowitz, et al)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu,
Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain,
Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati,
Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)

GEBA Now

• 300+ genomes
• Rich sampling of major groups of
cultured organisms

GEBA Lesson 1:
The rRNA Tree of Life is a Useful Tool

From Wu et al. 2009 Nature 462, 1056-1060

GEBA Lesson 2:
The rRNA Tree of Life is not perfect ...

16S, WGT, 23S

Badger et al. 2005 Int J System Evol Microbiol 55: 1021-1026.

GEBA Lesson 3:
Phylogeny improves genome annotation

• Took 56 GEBA genomes and compared results vs. 56
randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary”
based predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction

GEBA Lesson 4:
Metadata Important

GEBA Lesson 5:
Improves discovering new genetic diversity

Phylogenetic Distribution Novelty:
Bacterial Actin Related Protein
[Figure: phylogenetic tree of actin and actin-related proteins with bootstrap support values and a 0.5 scale bar. Labeled clades: ACTIN, ARP1, ARP2, ARP3, ARP4, ARP5, ARP6, ARP7, ARP10. The H. ochraceum sequence (gi227395998) is labeled BARP.]

Haliangium ochraceum DSM 14365 Patrik D’haeseleer, Adam Zemla, Victor Kunin

Wu et al. 2009 Nature 462, 1056-1060 See also Guljamow et al. 2007 Current Biology.

Protein Family Rarefaction

• Take data set of multiple complete
genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families

Wu et al. 2009 Nature 462, 1056-1060
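
A minimal sketch of the rarefaction step, assuming protein families have already been assigned (for example by MCL, which is not reimplemented here). Genomes are added in random order and the cumulative number of distinct families is averaged over many orderings; the input layout, the permutation count, and the name `rarefaction_curve` are assumptions made for illustration.

```python
import random


def rarefaction_curve(genome_families, n_permutations=100, seed=0):
    """Average number of distinct protein families vs. number of genomes.

    genome_families : dict mapping genome name -> set of family identifiers
    Returns a list whose k-th entry (0-based) is the mean number of families
    seen after sampling k + 1 genomes in random order.
    """
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        rng.shuffle(genomes)
        seen = set()
        for k, genome in enumerate(genomes):
            seen |= genome_families[genome]
            totals[k] += len(seen)
    return [total / n_permutations for total in totals]


if __name__ == "__main__":
    toy = {
        "genome1": {"fam1", "fam2"},
        "genome2": {"fam2", "fam3"},
        "genome3": {"fam3", "fam4"},
    }
    for n, families in enumerate(rarefaction_curve(toy), start=1):
        print(n, families)
```

Plotting the returned values against genome count gives the curve described on the slide; a curve that keeps climbing indicates that additional genomes are still contributing new families.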

Synapomorphies exist

Wu et al. 2009 Nature 462, 1056-1060

GEBA Lesson 6:
Improves Analysis of Uncultured


Metagenomic Phylotyping

[Figure: Sargasso Phylotypes bar chart (scale 0 to 0.500) by major phylogenetic group, from Alphaproteobacteria to Crenarchaeota. Series: EFG, EFTu, rRNA, RecA, RpoB, HSP70.]

GEBA Project improves metagenomic analysis



Metagenomic Phylotyping

[Figure: Sargasso Phylotypes bar chart (scale 0 to 0.500) by major phylogenetic group, from Alphaproteobacteria to Crenarchaeota. Series: EFG, EFTu, rRNA, RecA, RpoB, HSP70.]

But not a lot


• AND THEN ALL OF THEM WERE
DECEIVED

• For each of these areas - need to do a
MUCH better job ...

Major Issues in Phylotyping
Beyond Moore’s Law Metagenomics

Short reads

Major Issues in Phylotyping
Beyond Moore’s Law Metagenomics

Short reads

WE NEED NEW
METHODS

Method 1: Each is an island

• Each new sequence is an island

• Take reference data
• Build alignment, models, trees
• Add new sequence to reference alignment
and build tree

STAP ss-rRNA Taxonomy Pip
Figure 1. A flow chart of the STAP pipeline.
doi:10.1371/journal.pone.0002566.g001

... STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment ...

Each sequence analyzed separately

Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs.
doi:10.1371/journal.pone.0002566.g002

Wu et al. 2008 PLoS One

AMPHORA

Guide tree

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151

Phylotyping w/ Proteins

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151

Whole Genome Tree

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151

Phylogenetic Challenge

xxxxxxxxxxxxxxxxxxxxxxx

xxxxxx xxxxxxxxxxxxx

xxxxxxxxxxxxxx

xxxxxxxxxxxxxx

A single tree with everything?


xxxxxxxxxxxxxxxxxxxxxxx

xxxxxx xxxxxxxxxxxxx

xxxxxxxxxxxxxx

xxxxxxxxxxxxxx

A single tree with everything
(as long as there is a lot of overlap)


A single tree with everything?

rRNA in Sargasso Metagenome

Venter et al., Science 304: 66. 2004

STAP All ss-rRNA Taxonomy Pip

Combine all into
one alignment

Figure 1. A flow chart of the STAP pipeline.

RecA in Sargasso

Venter et al., Science 304: 66. 2004


Protein vs. rRNA Sargasso Data

[Figure: Sargasso Phylotypes bar chart (scale 0 to 0.500) by major phylogenetic group, from Alphaproteobacteria to Crenarchaeota. Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]

Method 3: All in the family

• Combine new sequences into one tree

• Build alignment, models, trees
• Add all sequences to reference alignment
and build tree

PhylOTU Finding Metagenomic OT

Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this general workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001

PhylOTU - Sharpton et al. PLoS Comp. Bio 2011

Method 4: All in the genome

• Combine new sequences from different
gene families into one tree

• Build alignment, models
• Concatenate
• Add all sequences to reference alignment
and build tree
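
The "Concatenate" step can be sketched as building a supermatrix from per-gene alignments. This is an illustration only: it assumes each gene family has already been aligned, pads taxa that lack a gene with gap characters, and uses hypothetical names (`concatenate_alignments`, the toy `recA` and `rpoB` inputs). Tree building on the result is left to a dedicated phylogenetics program.

```python
def concatenate_alignments(alignments):
    """Join per-gene alignments into one row per taxon.

    alignments : list of dicts, each mapping taxon name -> aligned sequence
                 for one gene family (all rows in a dict have equal length)
    Taxa missing from a gene family are filled with '-' for that block.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    blocks = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("every row of an alignment must have the same length")
        width = lengths.pop()
        for taxon in taxa:
            blocks[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in blocks.items()}


if __name__ == "__main__":
    recA = {"taxonA": "MKT", "taxonB": "MRT"}
    rpoB = {"taxonA": "LL", "taxonC": "LV"}
    for taxon, row in concatenate_alignments([recA, rpoB]).items():
        print(taxon, row)
```

With metagenomic reads, most rows would be mostly gaps, which is the sparse-sampling problem the following "Challenge" slide raises.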

Challenge

• Each gene poorly sampled in
metagenomes
• Can we combine all into a single tree?

Kembel Combiner

Kembel et al. The phylogenetic diversity of metagenomes. PLoS One 2011

Kembel Combiner

[Page excerpt (VOL. 73, 2007), cropped at the right margin on the slide. Recoverable content: TABLE 1 contrasts measures in which only presence/absence of taxa is considered with measures that additionally account for the number of times each taxon was observed. The text introduces phylogenetic β diversity measures, the original unweighted UniFrac, the fixation index (FST), the phylogenetic test (P test), and weighted UniFrac. FIG. 1 shows calculation of the unweighted and the weighted UniFrac measures.]

Improving Phylotyping II

• We need to analyze more gene families

Families/PD not uniform
31

6

More Markers
Phylogenetic group       Genome number   Gene number   Marker candidates
Archaea                  62              145415        106
Actinobacteria           63              267783        136
Alphaproteobacteria      94              347287        121
Betaproteobacteria       56              266362        311
Gammaproteobacteria      126             483632        118
Deltaproteobacteria      25              102115        206
Epsilonproteobacteria    18              33416         455
Bacteroides              25              71531         286
Chlamydiae               13              13823         560
Chloroflexi              10              33577         323
Cyanobacteria            36              124080        590
Firmicutes               106             312309        87
Spirochaetes             18              38832         176
Thermi                   5               14160         974
Thermotogae              9               17037         684

Improving Functional Predictions

• We need to analyze even more gene
families

Sifting Families

[Figure 1: workflow diagram with boxes: Representative Genomes; New Genomes; Extract Protein Annotation (one box per genome set); All v. All BLAST; Homology Clustering (MCL); Screen for Homologs; SFams; Align; Build HMMs; HMMs]

Sharpton et al. submitted

[Figure: panels A, B, C]

Sharpton et al. submitted