"Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting
1. Phylogenomic Approaches to the
Study of Microbial Diversity
September 6, 2012
Bay Area Illumina User’s Meeting
Jonathan A. Eisen
University of California, Davis
@phylogenomics
Thursday, September 6, 12
2. Phylogenomic Approaches to
Studying Microbial Diversity
Example 1:
Phylotyping
and
Phylogenetic Diversity
3. rRNA Phylotyping
DNA extraction
PCR: Makes lots of copies of the rRNA genes in sample
Sequence rRNA genes
rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2 5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’
Sequence alignment = Data matrix
rRNA1    A C A C A C
rRNA2    T A C A G T
rRNA3    C A C T G T
rRNA4    C A C A G T
E. coli  A G A C A G
Humans   T A T A G T
Yeast    T A C A G T
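The data matrix on this slide can be explored with a few lines of Python. This is only an illustrative sketch. It scores the seven six-column rows shown on the slide by proportion of differing positions (p-distance), which is a crude stand-in for the tree-based placement the talk describes, not a substitute for it. The names `p_distance` and `closest_reference` are made up for this example.

```python
from itertools import combinations

# Rows of the slide's data matrix (six aligned columns per sequence).
alignment = {
    "rRNA1": "ACACAC",
    "rRNA2": "TACAGT",
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "E. coli": "AGACAG",
    "Humans": "TATAGT",
    "Yeast": "TACAGT",
}


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# All pairwise distances, keyed by (name1, name2) in the order listed above.
distances = {
    (n1, n2): p_distance(alignment[n1], alignment[n2])
    for n1, n2 in combinations(alignment, 2)
}


def closest_reference(query, references=("E. coli", "Humans", "Yeast")):
    """Return the reference row with the smallest p-distance to the query."""
    return min(references, key=lambda r: p_distance(alignment[query], alignment[r]))


for query in ("rRNA1", "rRNA2", "rRNA3", "rRNA4"):
    print(query, "is closest to", closest_reference(query))
```

With only six columns the distances are very coarse, which is one reason real phylotyping relies on long alignments and explicit trees.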
5. Phylotyping
E. coli Humans
Yeast
6. Phylotyping
E. coli Humans
Yeast
OTU2 OTU1
OTU4
OTU3
E. coli Humans
Yeast
7. Phylotyping
B
A
Cluster C
8. Phylotyping
B
A
Cluster C
B
A
OTUs C
9. Phylotyping
B
A
Cluster C
B
A
OTUs C
OTU1
OTU2
OTU3
OTU4
10. Phylotyping
B
A
Cluster C
B
A
OTUs C
OTU2 OTU1
OTU1 OTU4
OTU3
OTU2
OTU3 E. coli Humans
OTU4 Yeast
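The Cluster and OTUs steps on these slides can be sketched as a greedy, seed-based clustering. This is an assumption for illustration only. The slides do not say which clustering algorithm is used, and the reads, the 0.9 identity threshold and the function names below are all hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length aligned reads."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold):
    """Assign each read to the first OTU whose seed it matches at >= threshold.

    A read that matches no existing seed becomes the seed of a new OTU.
    """
    seeds = []        # list of (otu_name, seed_sequence)
    assignment = {}   # read name -> otu_name
    for name, seq in seqs.items():
        for otu, seed in seeds:
            if identity(seq, seed) >= threshold:
                assignment[name] = otu
                break
        else:
            otu = f"OTU{len(seeds) + 1}"
            seeds.append((otu, seq))
            assignment[name] = otu
    return assignment


# Hypothetical aligned reads, ten columns each.
reads = {
    "read1": "ACGTACGTAC",
    "read2": "ACGTACGTAT",
    "read3": "TTGGCCAATT",
    "read4": "TTGGCCAATA",
    "read5": "GGGGGCCCCC",
}

otus = greedy_otus(reads, threshold=0.9)
print(otus)
```

Greedy clustering depends on input order, so the same reads in a different order can yield different OTUs. That sensitivity is part of the motivation for placing reads on a phylogeny instead.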
11. Phylotyping
E. coli Humans
Yeast
12. Phylotyping
Just
E. coli Humans
Phylogeny
Yeast
13. Phylotyping
B
A
Cluster C
Just
B E. coli Humans
Phylogeny
A
Yeast
OTUs C
OTU2 OTU1
OTU1 OTU4
OTU3
OTU2
OTU3 E. coli Humans
OTU4 Yeast
14. Phylotyping
• OTUs
• Taxonomic lists
• Relative abundance of taxa
• Ecological metrics (alpha and beta diversity)
• Phylogenetic metrics
• Binning
• Identification of novel groups
• Clades
• Rates of change
• LGT
• Convergence
• PD
• Phylogenetic ecology (e.g., UniFrac)
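As one concrete reading of "ecological metrics (alpha and beta diversity)" in the list above, the sketch below computes the Shannon index for a single sample and the Bray-Curtis dissimilarity between two samples. The OTU counts and site names are hypothetical, and these two metrics are just common choices, not ones the slide names.

```python
import math


def shannon(counts):
    """Shannon index (natural log) of one sample's OTU counts (alpha diversity)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples' OTU counts (beta diversity)."""
    otus = set(a) | set(b)
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    return 1 - 2 * shared / (sum(a.values()) + sum(b.values()))


# Hypothetical OTU tables: an even community and one dominated by OTU1.
site_a = {"OTU1": 10, "OTU2": 10, "OTU3": 10, "OTU4": 10}
site_b = {"OTU1": 37, "OTU2": 1, "OTU3": 1, "OTU4": 1}

print("Shannon, site A:", shannon(site_a))
print("Shannon, site B:", shannon(site_b))
print("Bray-Curtis A vs B:", bray_curtis(site_a, site_b))
```

Both sites contain the same four OTUs, so richness alone cannot separate them. The abundance-aware metrics can.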
16. What’s New in Phylotyping I
• More PCR products
• Deeper sequencing
• The rare biosphere
• Relative abundance estimates
• More samples (with barcoding)
• Time series
• Spatially diverse sampling
• Fine scale sampling
19. Things You Could Do
• Mississippi River: 2320 miles long
20. Things You Could Do
• Mississippi River: 2320 miles long
• 1 site / mile
• 3 samples / site
• 6960 samples
• rRNA PCR w/ barcodes
• metagenomics w/ barcodes
• MiSeq run:
• 30 million sequence reads
• 4310 sequences / sample
• HiSeq 2000
• 6 billion sequence reads
• 862,068 sequences / sample
21. Things You Could Do
• Mississippi River: 12,249,600 feet long
• 1 site / 500 feet
• 3 samples / site
• 73497 samples
• rRNA PCR w/ barcodes
• metagenomics w/ barcodes
• MiSeq run:
• 30 million sequence reads
• 408 sequences / sample
• HiSeq 2000
• 6 billion sequence reads
• 81,635 sequences / sample
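The per-sample read counts on the two "Things You Could Do" slides follow from simple division. The sketch below reproduces them. The river length, samples per site and reads per run are the slide's figures. The function name `plan` is made up here, and rounding down to whole samples and whole reads is an assumption that happens to match the slide's numbers.

```python
FEET_PER_MILE = 5280
RIVER_FEET = 2320 * FEET_PER_MILE  # 12,249,600 feet, as on the slide
SAMPLES_PER_SITE = 3
RUN_READS = {"MiSeq": 30_000_000, "HiSeq 2000": 6_000_000_000}


def plan(feet_between_sites):
    """Return (number of samples, reads per sample for each run type)."""
    samples = RIVER_FEET * SAMPLES_PER_SITE // feet_between_sites
    per_sample = {run: reads // samples for run, reads in RUN_READS.items()}
    return samples, per_sample


for label, spacing in (("1 site / mile", FEET_PER_MILE), ("1 site / 500 feet", 500)):
    samples, per_sample = plan(spacing)
    print(label, "->", samples, "samples;", per_sample)
```

Even at one site every 500 feet, a single HiSeq 2000 run at the slide's 6 billion reads still leaves tens of thousands of reads per barcoded sample.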
22. What’s New in Phylotyping II
• Metagenomics avoids biases of rRNA PCR
shotgun sequence
23. Metagenomic Phylotyping
B
A
Cluster C
Just
B E. coli Humans
Phylogeny
A
Yeast
OTUs C
OTU2 OTU1
OTU1 OTU4
OTU3
OTU2
OTU3 E. coli Humans
OTU4 Yeast
28. Method 1: Each is an island
• Build alignment, models, trees for full length seqs
• Analyze fragmented reads one at a time
31. STAP ss-rRNA Taxonomy Pipeline
Each sequence analyzed separately
Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001
"... STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment ..."
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002
Wu et al. 2008 PLoS One
32. AMPHORA
Guide tree
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
33. Phylotyping w/ Proteins
Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151
36. Method 2: Most in family
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxx xxxxxxxxxxxxx
xxxxxxxxxxxxxx
xxxxxxxxxxxxxx
One tree for those w/ overlap
37. rRNA in Sargasso Metagenome
Venter et al., Science 304: 66. 2004
38. RecA Phylotyping in Sargasso Data
Venter et al., Science 304: 66. 2004
39. Sargasso Phylotyping
[Figure: bar chart "Sargasso Phylotypes". Y-axis: Weighted % of Clones (0 to 0.500). X-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66. 2004
40. STAP, QIIME, Mothur ss-rRNA Taxonomy Pipeline
Combine all into one alignment
Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001
41. Method 3: All in the family
44. rRNA analysis
B
A
Cluster C
Just
B E. coli Humans
Phylogeny
A
Yeast
OTUs C
OTU2 OTU1
OTU1 OTU4
OTU3
OTU2
OTU3 E. coli Humans
OTU4 Yeast
45. PhylOTU Finding Meta
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in [...] workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
46. RecA, RpoB in GOS
GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
47. Phylosift/ pplacer
Aaron Darling, Guillaume Jospin, Holly Bik, Erik Matsen, Eric
Lowe, and others
48. Method 4: All in the genome
49. Multiple Genes?
A single tree with everything?
50. Kembel Combiner
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS
ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
51. Kembel Combiner
"... typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13). Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both a ..."
FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
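The caption above describes both UniFrac variants in words. The sketch below shows how they might be computed on a toy tree. The tree, branch lengths and communities are hypothetical. The weighted version is the raw, unnormalized form, the sum over branches of branch length times the absolute difference in relative abundance. This is not the code of any published UniFrac implementation.

```python
# Hypothetical toy tree: each branch is (length, set of leaves descending from it).
branches = [
    (1.0, {"s1"}),
    (1.0, {"s2"}),
    (0.5, {"s1", "s2"}),
    (1.0, {"s3"}),
    (1.0, {"s4"}),
    (0.5, {"s3", "s4"}),
]


def unweighted_unifrac(branches, a, b):
    """Fraction of observed branch length leading to only one community."""
    unique = 0.0
    total = 0.0
    for length, leaves in branches:
        in_a = any(a.get(leaf, 0) > 0 for leaf in leaves)
        in_b = any(b.get(leaf, 0) > 0 for leaf in leaves)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    return unique / total


def weighted_unifrac(branches, a, b):
    """Raw weighted UniFrac: branch length times abundance difference, summed."""
    total_a, total_b = sum(a.values()), sum(b.values())
    return sum(
        length * abs(
            sum(a.get(leaf, 0) for leaf in leaves) / total_a
            - sum(b.get(leaf, 0) for leaf in leaves) / total_b
        )
        for length, leaves in branches
    )


# Two hypothetical communities given as leaf -> sequence count.
squares = {"s1": 1, "s3": 1}
circles = {"s2": 1, "s4": 1}

print("unweighted:", unweighted_unifrac(branches, squares, circles))
print("weighted:", weighted_unifrac(branches, squares, circles))
```

In this toy case the two internal branches are shared by both communities, so only the four tip branches contribute to either distance.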
52. Uses of Phylogeny
in Genomics and Metagenomics
Example 2:
Functional Diversity and
Functional Predictions
53. PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: flow chart with three columns, EXAMPLE A, METHOD and EXAMPLE B. Example A uses genes 1A, 2A, 3A, 1B, 2B, 3B; Example B uses genes 1 to 6. Tree nodes are marked "Duplication?", one inferred function is marked "Ambiguous", and the final panel is labeled Species 1, Species 2, Species 3 with a confirmed "Duplication".]
METHOD:
CHOOSE GENE(S) OF INTEREST
IDENTIFY HOMOLOGS
ALIGN SEQUENCES
CALCULATE GENE TREE
OVERLAY KNOWN FUNCTIONS ONTO TREE
INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN)
Based on Eisen, 1998 Genome Res 8: 163-167.
55. Improving Functional Predictions
• Same methods discussed for phylotyping
improve phylogenomic functional
prediction for protein families
• Increase in sequence diversity helps too
56. NMF in Metagenomes
Characterizing the niche-space distributions of components
Non-negative matrix factorization w/ Weitz, Dushoff, Langille, Neches, Levin, etc
Jiang et al. In press PLoS One.
[Figure: three row-aligned panels (a), (b), (c) over GOS sampling sites (rows labeled by region, site ID and habitat, e.g. Sargasso Sea GS001c Open Ocean). Panel (a) columns: Component 1 to Component 5. Panel (c) columns: Chlorophyll, Salinity, Temperature, Water Depth, Sample Depth, Insolation. Legend: General (High, Medium, Low, NA); Water depth (>4000m, 2000-4000m, 900-2000m, 100-200m, 20-100m, 0-20m).]
Figure 3: a) Niche-space distributions for our five components (H^T); b) the site-similarity matrix (Ĥ^T Ĥ); c) environmental variables for the sites. The matrices are aligned so that the same row corresponds to the same site in each matrix. Sites are ordered by applying spectral reordering to the similarity matrix (see Materials and Methods). Rows are aligned across the three matrices.
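For readers unfamiliar with non-negative matrix factorization, the sketch below factors a small non-negative matrix V into W and H using the standard Lee-Seung multiplicative updates for squared error. It is a generic illustration in pure Python, not the analysis from the cited work. The toy matrix, its orientation (rows as features, columns as sites, so that H^T has one row per site) and all function names are assumptions.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical 4 features x 4 sites matrix with two obvious blocks.
V = [
    [4, 4, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 1, 1],
]

W0, H0 = nmf(V, k=2, iterations=0)   # random starting point
W, H = nmf(V, k=2)                   # after 200 update rounds
print("error before:", squared_error(V, W0, H0))
print("error after:", squared_error(V, W, H))
```

Each column of `H` then describes how strongly one site loads on each component, which is the kind of per-site profile the figure's panel (a) displays.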
57. Uses of Phylogeny
in Genomics and Metagenomics
Example 3:
Selecting Organisms for Study
58. GEBA
http://www.jgi.doe.gov/programs/GEBA/pilot.html
59. GEBA: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan
Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne
Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla
Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen,
Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor
Markowitz, et al)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu,
Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain,
Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati,
Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)
60. GEBA Now
• 300+ genomes
• Rich sampling of major groups of
cultured organisms
62. Protein Family Rarefaction
• Take data set of multiple complete
genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
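The three steps above can be sketched as follows, assuming the protein families have already been identified (for example by MCL) and each genome is reduced to a set of family IDs. The genomes and family names are hypothetical, and averaging over random genome orderings is one common way to draw such a curve, not necessarily the procedure used in the talk.

```python
import random


def rarefaction_curve(genome_families, permutations=100, seed=0):
    """Mean number of distinct families seen after adding 1, 2, ... genomes."""
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0] * len(genomes)
    for _ in range(permutations):
        rng.shuffle(genomes)
        seen = set()
        for i, genome in enumerate(genomes):
            seen |= genome_families[genome]
            totals[i] += len(seen)
    return [t / permutations for t in totals]


# Hypothetical genomes, each mapped to the set of protein families it encodes.
genome_families = {
    "genomeA": {"fam1", "fam2", "fam3"},
    "genomeB": {"fam1", "fam2", "fam4"},
    "genomeC": {"fam1", "fam5", "fam6"},
}

curve = rarefaction_curve(genome_families)
for n, families in enumerate(curve, start=1):
    print(n, "genomes ->", families, "families on average")
```

A curve that keeps climbing as genomes are added indicates that new genomes are still contributing protein families not seen before.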
63-67. Wu et al. 2009 Nature 462, 1056-1060
70. Metagenomic Phylotyping
GEBA benefits phylotyping & functional prediction
[Figure: bar chart "Sargasso Phylotypes". Y-axis: Weighted % of Clones (0 to 0.500). X-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). Series: EFG, EFTu, rRNA, RecA, RpoB, HSP70.]
Venter et al., Science 304: 66-74. 2004
71. GEBA improves genome annotation
• Took 56 GEBA genomes and compared results vs. 56
randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary”
based predictions
• Conversion of hypotheticals into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction
72. Metagenomic Phylotyping
But not a lot
[Figure: bar chart "Sargasso Phylotypes". Y-axis: Weighted % of Clones (0 to 0.500). X-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). Series: EFG, EFTu, rRNA, RecA, RpoB, HSP70.]
Venter et al., Science 304: 66-74. 2004
74. Sifting Families
[Figure: workflow in three panels (A, B, C). Representative Genomes: Extract Protein Annotation, All v. All BLAST, Homology Clustering (MCL), SFams, Align & Build HMMs, HMMs. New Genomes: Extract Protein Annotation, Screen for Homologs.]
Sharpton et al. submitted Figure 1