"Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylogenomic Approaches to the
Study of Microbial Diversity
September 6, 2012
Bay Area Illumina User’s Meeting

Jonathan A. Eisen
University of California, Davis
@phylogenomics


Phylogenomic Approaches to
Studying Microbial Diversity

Example 1:

Phylotyping
and
Phylogenetic Diversity


rRNA Phylotyping

DNA extraction -> PCR (makes lots of copies of the rRNA genes in sample) -> Sequence rRNA genes

rRNA1  5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2  5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3  5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4  5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’

Sequence alignment = Data matrix

rRNA1    A C A C A C
rRNA2    T A C A G T
rRNA3    C A C T G T
rRNA4    C A C A G T
E. coli  A G A C A G
Humans   T A T A G T
Yeast    T A C A G T
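A minimal sketch of how a data matrix like the one on this slide can be turned into pairwise distances, the usual input for clustering or tree building. The rows are copied from the slide; the p-distance measure and the function names (`p_distance`, `distance_matrix`, `closest_reference`) are assumptions made for illustration, not the method used in the talk.

```python
from itertools import combinations

# Rows copied from the slide's data matrix (six aligned positions per row).
alignment = {
    "rRNA1":   "ACACAC",
    "rRNA2":   "TACAGT",
    "rRNA3":   "CACTGT",
    "rRNA4":   "CACAGT",
    "E. coli": "AGACAG",
    "Humans":  "TATAGT",
    "Yeast":   "TACAGT",
}

def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(aln):
    """Pairwise p-distances for every pair of rows in the alignment."""
    return {(m, n): p_distance(aln[m], aln[n]) for m, n in combinations(aln, 2)}

def closest_reference(query, aln, references):
    """Reference row with the smallest p-distance to the query row."""
    return min(references, key=lambda ref: p_distance(aln[query], aln[ref]))

dists = distance_matrix(alignment)
refs = ["E. coli", "Humans", "Yeast"]
for query in ["rRNA1", "rRNA2", "rRNA3", "rRNA4"]:
    print(query, "->", closest_reference(query, alignment, refs))
```

Nearest-reference lookup is only a stand-in here; a real phylotyping analysis would build a tree from the full alignment rather than pick the closest row.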


Phylotyping

[Figure, built up over several slides. Samples A, B, and C are clustered into OTUs (OTU1 to OTU4), and the OTUs are placed on a tree alongside E. coli, humans, and yeast. A second path, labeled "Just Phylogeny", places the sequences on the tree with E. coli, humans, and yeast without the OTU step.]


Phylotyping
• OTUs
• Taxonomic lists
• Relative abundance of taxa
• Ecological metrics (alpha and beta diversity)
• Phylogenetic metrics
• Binning
• Identification of novel groups
• Clades
• Rates of change
• LGT
• Convergence
• PD
• Phylogenetic ecology (e.g., UniFrac)
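Two of the ecological metrics in this list can be sketched in a few lines. The example below computes the Shannon index (an alpha diversity measure) and Bray-Curtis dissimilarity (a beta diversity measure) from an OTU count table. The table contents are invented for illustration, and these two measures are chosen as common examples; the slides do not say which metrics were used.

```python
import math

# Hypothetical OTU count table: sample -> {OTU: read count}.
otu_table = {
    "A": {"OTU1": 50, "OTU2": 30, "OTU3": 20, "OTU4": 0},
    "B": {"OTU1": 25, "OTU2": 25, "OTU3": 25, "OTU4": 25},
    "C": {"OTU1": 100, "OTU2": 0, "OTU3": 0, "OTU4": 0},
}

def relative_abundance(counts):
    """Convert raw read counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {otu: n / total for otu, n in counts.items()}

def shannon(counts):
    """Alpha diversity: Shannon index H = -sum(p * ln p) over OTUs with p > 0."""
    props = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in props if p > 0)

def bray_curtis(counts_1, counts_2):
    """Beta diversity: Bray-Curtis dissimilarity computed on relative abundances."""
    p1, p2 = relative_abundance(counts_1), relative_abundance(counts_2)
    otus = set(p1) | set(p2)
    shared = sum(min(p1.get(o, 0.0), p2.get(o, 0.0)) for o in otus)
    return 1.0 - shared

for name, counts in otu_table.items():
    print(name, "Shannon:", round(shannon(counts), 3))
print("Bray-Curtis A vs C:", round(bray_curtis(otu_table["A"], otu_table["C"]), 3))
```

Because both samples are normalized to proportions first, the Bray-Curtis value does not depend on sequencing depth, which matters when samples receive very different numbers of reads.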

What’s New in Phylotyping


What’s New in Phylotyping I

• More PCR products

• Deeper sequencing
• The rare biosphere
• Relative abundance estimates

• More samples (with barcoding)
• Time series
• Spatially diverse sampling
• Fine scale sampling


Earth Microbiome Project


Things You Could Do
• Mississippi River: 2320 miles long
• 1 site / mile
• 3 samples / site
• 6960 samples
• rRNA PCR w/ barcodes
• metagenomics w/ barcodes
• MiSeq run:
    • 30 million sequence reads
    • 4310 sequences / sample
• HiSeq 2000:
    • 6 billion sequence reads
    • 862,068 sequences / sample


Things You Could Do
• Mississippi River: 12,249,600 feet long
• 1 site / 500 feet
• 3 samples / site
• 73497 samples
• rRNA PCR w/ barcodes
• metagenomics w/ barcodes
• MiSeq run:
    • 30 million sequence reads
    • 408 sequences / sample
• HiSeq 2000:
    • 6 billion sequence reads
    • 81,635 sequences / sample
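The arithmetic behind the two "Things You Could Do" slides can be reproduced as follows. The river length, samples per site, and reads per run are taken from the slides; rounding down to whole sites and whole reads is an assumption that matches the slide numbers, and the function name `sampling_plan` is hypothetical.

```python
FEET_PER_MILE = 5280
RIVER_MILES = 2320            # river length used on the slides
SAMPLES_PER_SITE = 3
READS_PER_RUN = {"MiSeq": 30_000_000, "HiSeq 2000": 6_000_000_000}

def sampling_plan(river_feet, site_spacing_feet):
    """Return (number of samples, reads per sample for each platform)."""
    sites = river_feet // site_spacing_feet           # whole sites only
    samples = sites * SAMPLES_PER_SITE
    per_sample = {name: reads // samples for name, reads in READS_PER_RUN.items()}
    return samples, per_sample

river_feet = RIVER_MILES * FEET_PER_MILE              # 12,249,600 feet
per_mile = sampling_plan(river_feet, FEET_PER_MILE)   # 1 site / mile
per_500ft = sampling_plan(river_feet, 500)            # 1 site / 500 feet
print(per_mile)
print(per_500ft)
```

Both plans assume that every barcoded sample gets an equal share of a single run, which real runs only approximate.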


What’s New in Phylotyping II

• Metagenomics avoids biases of rRNA PCR

[Figure label: shotgun sequence]


Metagenomic Phylotyping

[Figure: the phylotyping diagram, here for metagenomic reads. Samples A, B, and C; Cluster; OTUs (OTU1 to OTU4); "Just Phylogeny"; tree with E. coli, humans, and yeast.]


Phylogenetic Challenge

??



Multiple approaches


Method 1: Each is an island

• Build alignment, models, trees for full length seqs
• Analyze fragmented reads one at a time
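The "one at a time" idea can be illustrated with a toy sketch in which every read is scored against a fixed set of full-length references, independently of all other reads. Best ungapped identity stands in here for the alignment and tree placement that the real pipelines perform; the sequences, names, and scoring rule are all assumptions made for illustration.

```python
def best_ungapped_match(read, reference):
    """Highest fraction of matching bases over all ungapped placements of the read."""
    best = 0.0
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        identity = sum(a == b for a, b in zip(read, window)) / len(read)
        best = max(best, identity)
    return best

def classify_reads_independently(reads, references):
    """Assign each read to its best-scoring reference, one read at a time."""
    assignments = {}
    for name, read in reads.items():
        scores = {ref: best_ungapped_match(read, seq) for ref, seq in references.items()}
        assignments[name] = max(scores, key=scores.get)
    return assignments

# Hypothetical full-length references and short fragmented reads.
references = {
    "ref_a": "ACACACATAGGTGGAGCTAGCGATCGATCGA",
    "ref_b": "ACGGCAAAATAGGTGGATTCTAGCGATATAGA",
}
reads = {
    "read1": "GGAGCTAGCG",
    "read2": "GGATTCTAGC",
}
assignments = classify_reads_independently(reads, references)
print(assignments)
```

The reference data are built once and never change, so the reads can be processed in any order or in parallel. The cost is that reads never inform each other, so a novel lineage sampled many times is still judged one fragment at a time.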


STAP ss-rRNA Taxonomy Pipeline

Figure 1. A flow chart of the STAP pipeline.
doi:10.1371/journal.pone.0002566.g001

Each sequence analyzed separately

Excerpt from the paper shown on the slide: "... STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment ..."

Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs.
doi:10.1371/journal.pone.0002566.g002

Wu et al. 2008 PLoS One


AMPHORA

Guide tree

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151

Phylotyping w/ Proteins

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151

Method 2: Most in the Family



xxxxxxxxxxxxxxxxxxxxxxx

xxxxxx xxxxxxxxxxxxx

xxxxxxxxxxxxxx

xxxxxxxxxxxxxx

??


Method 2: Most in family

xxxxxxxxxxxxxxxxxxxxxxx

xxxxxx xxxxxxxxxxxxx

xxxxxxxxxxxxxx

xxxxxxxxxxxxxx

One tree for those w/ overlap


rRNA in Sargasso Metagenome

Venter et al., Science
304: 66. 2004


RecA Phylotyping in Sargasso Data

Venter et al., Science 304: 66. 2004


Sargasso Phylotyping

[Figure: bar chart "Sargasso Phylotypes". Y axis: Weighted % of Clones (0 to 0.500). X axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). One series per marker gene: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]

Venter et al., Science 304: 66. 2004

STAP, QIIME, Mothur

Combine all into one alignment

[Figure: ss-rRNA Taxonomy Pipeline flow chart, doi:10.1371/journal.pone.0002566.g001]

Method 3: All in the family



A single tree with everything?


rRNA analysis

[Figure: the phylotyping diagram. Samples A, B, and C; Cluster; OTUs (OTU1 to OTU4); "Just Phylogeny"; tree with E. coli, humans, and yeast.]


PhylOTU

Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in ... workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001

Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061

RecA, RpoB in GOS

[Figure labels: GOS 1, GOS 2, GOS 3, GOS 4, GOS 5]

Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011


Phylosift/pplacer

Aaron Darling, Guillaume Jospin, Holly Bik, Erik Matsen, Eric Lowe, and others

Method 4: All in the genome


Multiple Genes?

A single tree with everything?


Kembel Combiner

Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS
ONE 6(8): e23214. doi:10.1371/journal.pone.0023214


Kembel Combiner

[Excerpt shown on the slide; the right edge of the text is cut off in the original.]
"... typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13). Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both a ..."

FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
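The two measures described in this caption can be sketched on a toy tree. The tree, leaf names, and counts below are invented for illustration; the tree is stored as a flat list of branches, each with its length and the leaves beneath it. The weighted value is the raw, unnormalized form (sum of branch length times the absolute difference in relative abundance), which is an assumption about which variant is meant.

```python
# Hypothetical tree ((x1:1.0, x2:1.0):0.5, x3:2.0), stored as
# (branch length, set of leaves below the branch).
branches = [
    (1.0, {"x1"}),
    (1.0, {"x2"}),
    (0.5, {"x1", "x2"}),
    (2.0, {"x3"}),
]

# Hypothetical sequence counts per leaf in each environment.
leaf_counts = {
    "x1": {"square": 2},
    "x2": {"circle": 1},
    "x3": {"square": 1, "circle": 1},
}

def env_count(leaves, env):
    """Number of sequences from one environment below a branch."""
    return sum(leaf_counts[leaf].get(env, 0) for leaf in leaves)

def unweighted_unifrac(env_a, env_b):
    """Fraction of branch length with descendants from exactly one environment."""
    unique = total = 0.0
    for length, leaves in branches:
        in_a = env_count(leaves, env_a) > 0
        in_b = env_count(leaves, env_b) > 0
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    return unique / total

def weighted_unifrac(env_a, env_b):
    """Raw weighted UniFrac: branch lengths weighted by abundance differences."""
    all_leaves = set(leaf_counts)
    total_a = env_count(all_leaves, env_a)
    total_b = env_count(all_leaves, env_b)
    return sum(
        length * abs(env_count(leaves, env_a) / total_a - env_count(leaves, env_b) / total_b)
        for length, leaves in branches
    )

print("unweighted:", unweighted_unifrac("square", "circle"))
print("weighted:", weighted_unifrac("square", "circle"))
```

In the unweighted measure the branch to `x3` counts as fully shared, while in the weighted measure it still contributes because the two environments have different relative abundances there.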

Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS
ONE 6(8): e23214. doi:10.1371/journal.pone.0023214


Uses of Phylogeny
in Genomics and Metagenomics

Example 2:

Functional Diversity and
Functional Predictions


PHYLOGENETIC PREDICTION OF GENE FUNCTION

[Figure: flow chart of the METHOD, shown side by side with EXAMPLE A (genes 1A, 2A, 3A, 1B, 2B, 3B from Species 1, 2, and 3) and EXAMPLE B (genes 1 to 6). Steps:]
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
[Final panel: ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Labels in the tree panels: "Duplication?", "Ambiguous", "Duplication".]

Based on Eisen, 1998 Genome Res 8: 163-167.
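The last two steps (overlay known functions, then infer the function of the gene of interest) can be sketched with a simple rule: look at the smallest clade that contains the query and at least one annotated gene, and report its function if all annotated members agree. This is a deliberate simplification for illustration; it ignores duplication events and branch lengths, and the tree, gene names, and function labels are hypothetical.

```python
def leaves(tree):
    """All leaf names in a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def predict_function(tree, query, known):
    """Function shared by annotated genes in the smallest informative clade around query.

    Returns None if no annotated relatives exist, or "ambiguous" if they disagree.
    """
    if query not in leaves(tree):
        raise ValueError("query gene is not in the tree")
    if isinstance(tree, str):
        return None  # reached the query leaf itself
    for child in tree:
        if query in leaves(child):
            result = predict_function(child, query, known)  # try the smaller clade first
            if result is not None:
                return result
    functions = {known[g] for g in leaves(tree) if g in known and g != query}
    if not functions:
        return None
    return functions.pop() if len(functions) == 1 else "ambiguous"

# Hypothetical gene tree with two subfamilies, A and B, and some known functions.
gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known = {"1A": "function X", "3A": "function X", "1B": "function Y", "2B": "function Y"}

print(predict_function(gene_tree, "2A", known))
print(predict_function(gene_tree, "3B", known))
```

The "ambiguous" outcome mirrors the "Ambiguous" label in the figure: when the nearest annotated relatives disagree, the sketch declines to choose rather than guess.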


Diversity of Proteorhodopsins

Venter et al., 2004.
Science 304: 66.

Improving Functional Predictions

• Same methods discussed for phylotyping
improve phylogenomic functional
prediction for protein families
• Increase in sequence diversity helps too


NMF in Metagenomes
Characterizing the niche-space distributions of components

Non-negative matrix factorization
w/ Weitz, Dushoff, Langille, Neches, Levin, etc
Jiang et al. In press PLoS One.

[Figure: three heat maps, (a), (b), and (c), whose rows are ocean sampling sites (e.g. Sargasso Sea_GS001c_Open Ocean, Galapagos Islands_GS030_Warm Seep, North American East Coast_GS006_Estuary). Columns of (a): Component 1 to Component 5. Columns of (c): Chlorophyll, Salinity, Temperature, Water Depth, Sample Depth, Insolation. Legends: General (High, Medium, Low, NA); Water depth (>4000m, 2000-4000m, 900-2000m, 100-200m, 20-100m, 0-20m).]

Figure 3: a) Niche-space distributions for our five components (H^T); b) the site-similarity matrix (Ĥ^T Ĥ); c) environmental variables for the sites. The matrices are aligned so that the same row corresponds to the same site in each matrix. Sites are ordered by applying spectral reordering to the similarity matrix (see Materials and Methods). Rows are aligned across the three matrices.

Uses of Phylogeny
in Genomics and Metagenomics

Example 3:

Selecting Organisms for Study


GEBA

http://www.jgi.doe.gov/programs/GEBA/pilot.html

GEBA: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan
Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne
Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla
Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen,
Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor
Markowitz, et al)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu,
Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain,
Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati,
Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)


GEBA Now

• 300+ genomes
• Rich sampling of major groups of
cultured organisms


GEBA Lesson 1


Protein Family Rarefaction

• Take data set of multiple complete
genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
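A minimal sketch of the rarefaction calculation described in these bullets. It assumes the clustering step has already produced, for each genome, the set of protein family IDs it contains; the input dictionary, the family IDs, and the function name `family_rarefaction` are hypothetical. Plotting is left out; the returned list is the y-values for genomes 1..N.

```python
import random

def family_rarefaction(genome_families, n_permutations=100, seed=0):
    """Mean number of distinct protein families after adding 1..N genomes in random order."""
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        rng.shuffle(genomes)                  # new random order of genomes
        seen = set()
        for i, genome in enumerate(genomes):
            seen |= genome_families[genome]   # families discovered so far
            totals[i] += len(seen)
    return [t / n_permutations for t in totals]

# Hypothetical genome -> protein family IDs (e.g. cluster IDs from MCL).
genome_families = {
    "genome1": {1, 2, 3},
    "genome2": {2, 3, 4},
    "genome3": {5},
}
curve = family_rarefaction(genome_families)
for n, value in enumerate(curve, start=1):
    print(n, "genomes:", value, "families")
```

Averaging over random orderings removes the effect of which genome happens to be added first. A curve that keeps climbing indicates that new genomes are still contributing new families.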


Wu et al. 2009 Nature 462, 1056-1060


Synapomorphies exist

Wu et al. 2009 Nature 462, 1056-1060


GEBA Lesson 2



GEBA benefits phylotyping & functional prediction

[Figure: bar chart "Sargasso Phylotypes". Y axis: 0 to 0.500. X axis: major phylogenetic groups (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). One series per marker gene: EFG, EFTu, rRNA, RecA, RpoB, HSP70.]

Venter et al., Science 304: 66-74. 2004

GEBA improves genome annotation

• Took 56 GEBA genomes and compared results vs. 56
randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary”
based predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction



But not a lot

[Figure: bar chart "Sargasso Phylotypes". Y axis: 0 to 0.500. X axis: major phylogenetic groups (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). One series per marker gene: EFG, EFTu, rRNA, RecA, RpoB, HSP70.]

Venter et al., Science 304: 66-74. 2004

Improving Functional Predictions


Sifting Families

[Figure: workflow diagram with panels A, B, and C. Box labels: Representative Genomes; New Genomes; Extract Protein Annotation (twice); All v. All BLAST; Homology Clustering (MCL); Screen for Homologs; Align & Build HMMs; SFams; HMMs.]

Sharpton et al. submitted, Figure 1


Improving Phylotyping


More Markers

Phylogenetic group       Genome Number   Gene Number   Marker Candidates
Archaea                  62              145415        106
Actinobacteria           63              267783        136
Alphaproteobacteria      94              347287        121
Betaproteobacteria       56              266362        311
Gammaproteobacteria      126             483632        118
Deltaproteobacteria      25              102115        206
Epsilonproteobacteria    18              33416         455
Bacteroides              25              71531         286
Chlamydiae               13              13823         560
Chloroflexi              10              33577         323
Cyanobacteria            36              124080        590
Firmicutes               106             312309        87
Spirochaetes             18              38832         176
Thermi                   5               14160         974
Thermotogae              9               17037         684


Better Reference Tree

Morgan et al.
submitted

GEBA Lesson 3

We have still only scratched the
surface of microbial diversity


PD: All

From Wu et al. 2009 Nature 462, 1056-1060

GEBA uncultured

Number of SAGs from Candidate Phyla

                                        OD1   OP11   OP3   SAR406
Site A: Hydrothermal vent               4     1      -     -
Site B: Gold Mine                       6     13     2     -
Site C: Tropical gyres (Mesopelagic)    -     -      -     2
Site D: Tropical gyres (Photic zone)    1     -      -     -

Sample collections at 4 additional sites are underway.

Phil Hugenholtz


GEBA Lesson 4

Need Experiments from Across
the Tree of Life too


Conclusion


MICROBES


Acknowledgements

• $$$
• DOE
• NSF
• GBMF
• Sloan
• DARPA
• DSMZ
• DHS
• People, places
• DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
• UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell
Neches, Jenna Morgan-Lang
• Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak,
Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward,
Hans-Peter Klenk


"Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting
