This document discusses the analysis of microbial communities through sequencing of the 16S rRNA gene. It presents WATERS, a workflow system that automates and bundles various software tools for analyzing 16S rRNA sequence data. The goals of WATERS are to simplify the analysis process for users without specialized bioinformatics expertise and to facilitate reproducibility through tracking of data provenance. WATERS guides users through the typical sequence analysis steps of alignment, chimera filtering, OTU clustering, taxonomy assignment, phylogeny tree building, and ecological analyses and visualization. By integrating existing tools into a single automated workflow, WATERS aims to reduce the effort required for 16S rRNA data analysis and allow researchers to focus on biological interpretation of results.
Evolution of microbiomes and the evolution of the study and politics of micro...
Diversity Diversity Diversity Diversity ....
1. Diversity Diversity Diversity
SSE 2015
Jonathan A. Eisen
@phylogenomics
University of California, Davis
17. Phylotyping via rRNA PCR: One Taxon
DNA fragments shown on the slide (repeated three times): ACTGC, ACCTAT, CGTTCG
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACTGCACCTATCGTTCG
Eukaryotes Bacteria Archaea
Many sequences from one sample all point to the same branch on the tree
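The idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: it replaces tree placement with a plain nearest-match search (Hamming distance over the aligned characters), and the function names `hamming` and `closest_reference` are made up for the example. The sequences are copied from the slide's table.

```python
# Reference taxa and their aligned characters, copied from the slide's table.
REFERENCES = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}


def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_reference(query, references=REFERENCES):
    """Return (name, distance) of the reference with the fewest mismatches.

    Nearest match is a stand-in for placing the query on a tree (assumption).
    """
    name = min(references, key=lambda ref: hamming(query, references[ref]))
    return name, hamming(query, references[name])


new1 = "ACTGCACCTATCGTTCG"  # the "New1" row from the slide
print(closest_reference(new1))
```

With the slide's data, New1 is identical to B1, so it lands on the bacterial branch.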
25. Approaching to NGS
1953: Discovery of DNA structure (Cold Spring Harb. Symp. Quant. Biol. 1953;18:123-31)
1977: Sanger sequencing method by F. Sanger (PNAS, 1977, 74: 560-564)
1983: PCR by K. Mullis (Cold Spring Harb Symp Quant Biol. 1986;51 Pt 1:263-73)
1993: Development of pyrosequencing (Anal. Biochem., 1993, 208: 171-175; Science, 1998, 281: 363-365)
1998: Single molecule emulsion PCR
Human Genome Project (Nature, 2001, 409: 860–92; Science, 2001, 291: 1304–1351)
2000: Founded 454 Life Sciences
2005: 454 GS20 sequencer (first NGS sequencer)
1998: Founded Solexa
2006: Solexa Genome Analyzer (first short-read NGS sequencer)
2008: GS FLX sequencer (NGS with 400-500 bp read length)
2010: Hi-Seq2000 (200 Gbp per flow cell)
2006: Illumina acquires Solexa (Illumina enters the NGS business)
2007: ABI SOLiD (short-read sequencer based upon ligation)
2007: Roche acquires 454 Life Sciences (Roche enters the NGS business)
2008: NGS Human Genome sequencing (first human genome sequencing based upon NGS technology)
From Slideshare presentation of Cosentino Cristian
http://www.slideshare.net/cosentia/high-throughput-equencing
Miseq, Roche Jr, Ion Torrent, PacBio, Oxford
Drowning in Data
AAATCGCTAGCGC
CGGCGAGCTAGC
CGAGCGATCGAGC
CGAGCATCGAGTA
26. Hartman et al. BMC Bioinformatics 2010, 11:317
http://www.biomedcentral.com/1471-2105/11/317
Software (Open Access)
Introducing W.A.T.E.R.S.: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences
Amber L Hartman†1,3, Sean Riddle†2, Timothy McPhillips2, Bertram Ludäscher2 and Jonathan A Eisen*1
Abstract
Background: For more than two decades microbiologists have used a highly conserved microbial gene as a phylogenetic marker for bacteria and archaea. The small-subunit ribosomal RNA gene, also known as 16S rRNA, is encoded by ribosomal DNA, 16S rDNA, and has provided a powerful comparative tool to microbial ecologists. Over time, the microbial ecology field has matured from small-scale studies in a select number of environments to massive collections of sequence data that are paired with dozens of corresponding collection variables. As the complexity of data and tool sets has grown, the need for flexible automation and maintenance of the core processes of 16S rDNA sequence analysis has increased correspondingly.
Results: We present WATERS, an integrated approach for 16S rDNA analysis that bundles a suite of publicly available 16S rDNA analysis software tools into a single software package. The "toolkit" includes sequence alignment, chimera removal, OTU determination, taxonomy assignment, and phylogenetic tree construction, as well as a host of ecological analysis and visualization tools. WATERS employs a flexible, collection-oriented 'workflow' approach using the open-source Kepler system as a platform.
Conclusions: By packaging available software tools into a single automated workflow, WATERS simplifies 16S rDNA analyses, especially for those without specialized bioinformatics or programming expertise. In addition, WATERS, like some of the newer comprehensive rRNA analysis tools, allows researchers to minimize the time dedicated to carrying out tedious informatics steps and to focus their attention instead on the biological interpretation of the results. One advantage of WATERS over other comprehensive tools is that the use of the Kepler workflow system facilitates result interpretation and reproducibility via a data provenance sub-system. Furthermore, new "actors" can be added to the workflow as desired, and we see WATERS as an initial seed for a sizeable and growing repository of interoperable, easy-to-combine tools for asking increasingly complex microbial ecology questions.
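To make the "OTU determination" step of the toolkit concrete, here is a minimal sketch of greedy clustering at a fixed identity threshold. It is an assumption-laden toy: the excerpt does not say which clustering tool WATERS calls, the sequences are assumed to be already aligned to equal length, and the names `identity` and `greedy_otus` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Greedy clustering: each sequence joins the first OTU whose seed
    (its first member) is at least `threshold` identical, else starts a new OTU.

    This is an illustrative stand-in, not the algorithm of any specific tool.
    """
    otus = []
    for seq in sequences:
        for otu in otus:
            if identity(seq, otu[0]) >= threshold:
                otu.append(seq)
                break
        else:
            otus.append([seq])
    return otus


# Invented toy data: the second sequence is 98% identical to the first.
toy = ["A" * 100, "A" * 98 + "CC", "C" * 100]
print(len(greedy_otus(toy, threshold=0.97)))
print(len(greedy_otus(toy, threshold=0.99)))
```

Raising the threshold from 97% to 99% splits the first two toy sequences into separate OTUs, which is why the choice of cutoff matters for downstream diversity estimates.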
Background
Microbial communities and how they are surveyed
Microbial communities abound in nature and are crucial
for the success and diversity of ecosystems. There is no
end in sight to the number of biological questions that
can be asked about microbial diversity on earth. From
animal and human guts to open ocean surfaces and deep
sea hydrothermal vents, to anaerobic mud swamps or
boiling thermal pools, to the tops of the rainforest canopy
and the frozen Antarctic tundra, the composition of
microbial communities is a source of natural history,
intellectual curiosity, and reservoir of environmental
health [1]. Microbial communities are also mediators of
insight into global warming processes [2,3], agricultural
success [4], pathogenicity [5,6], and even human obesity
[7,8].
In the mid-1980s, researchers began to sequence ribosomal RNAs from environmental samples in order to characterize the types of microbes present in those samples (e.g., [9,10]). This general approach was revolutionized by the invention of the polymerase chain reaction (PCR), which made it relatively easy to clone and then
* Correspondence: jaeisen@ucdavis.edu
1 Department of Medical Microbiology and Immunology and the Department
of Evolution and Ecology, Genome Center, University of California Davis, One
Shields Avenue, Davis, CA, 95616, USA
† Contributed equally
Full list of author information is available at the end of the article
WATERS - Kepler Workflow for rRNA
genes for ribosomal RNA) in partic-
ubunit ribosomal RNA (ss-rRNA).
ed a large amount of previously
l diversity [1,11-13]. Researchers
all subunit rRNA gene not only
ith which it can be PCR amplified,
has variable and highly conserved
to be universally distributed among
nd it is useful for inferring phyloge-
4,15]. Since then, "cultivation-inde-
" have brought a revolution to the
by allowing scientists to study a
mount of diversity in many different
ments [16-18]. The general premise
Figure 1 Overview of WATERS. Schema of WATERS where white
boxes indicate "behind the scenes" analyses that are performed in WA-
Figure 1 box labels: Align; Check chimeras; Cluster; Build Tree; Assign Taxonomy; Tree w/ Taxonomy; Diversity statistics & graphs; Unifrac files; Cytoscape network; OTU table
Motivations
As outlined above, successfully processing microbial
sequence collections is far from trivial. Each step is com-
plex and usually requires significant bioinformatics
expertise and time investment prior to the biological
interpretation. In order to both increase efficiency and
ensure that all best-practice tools are easily usable, we
sought to create an "all-inclusive" method for performing
all of these bioinformatics steps together in one package.
To this end, we have built an automated, user-friendly,
workflow-based system called WATERS: a Workflow for
the Alignment, Taxonomy, and Ecology of Ribosomal
Sequences (Fig. 1). In addition to being automated and
simple to use, because WATERS is executed in the Kepler
scientific workflow system (Fig. 2) it also has the advan-
tage that it keeps track of the data lineage and provenance
of data products [23,24].
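The notion of tracking data lineage can be illustrated with a small sketch. It does not reproduce Kepler's provenance sub-system or any Kepler API; `ProvenanceLog`, `digest`, and the two toy steps are hypothetical names invented here to show the general idea of recording, for every step, a fingerprint of what went in and what came out.

```python
import hashlib
import json


def digest(obj):
    """Short, stable fingerprint of a JSON-serializable value."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class ProvenanceLog:
    """Runs workflow steps and records input/output fingerprints for each one."""

    def __init__(self):
        self.records = []

    def run(self, step_name, func, data):
        result = func(data)
        self.records.append(
            {"step": step_name, "input": digest(data), "output": digest(result)}
        )
        return result


log = ProvenanceLog()
reads = ["ACTG", "ACTG", "ACCG"]  # invented toy input
unique = log.run("dereplicate", lambda seqs: sorted(set(seqs)), reads)
counts = log.run("count", len, unique)

for record in log.records:
    print(record)
```

Because the output fingerprint of one step equals the input fingerprint of the next, the records can be chained to trace any result back to the data it came from.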
Automation
The primary motivation in building WATERS was to minimize the technical bioinformatics challenges that arise when performing DNA sequence clustering, phylogenetic tree, and statistical analyses by automating the 16S rDNA analysis workflow. We also hoped to exploit additional features that workflow-based approaches entail, such as optimized execution and data lineage tracking and browsing [23,25-27]. In the earlier days of 16S rDNA analysis, simply knowing which microbes were present and whether they were biologically novel was a noteworthy achievement. It was reasonable and expected, therefore, to invest a large amount of time and effort to get to that list of microbes. But now that current efforts are significantly more advanced and often require comparison of dozens of factors and variables with datasets of thousands of sequences, it is not practically feasible to process these large collections "by hand", and it is hugely inefficient to do so when automated methods can be successfully employed instead.
Broadening the user base
A second motivation and perspective is that by minimizing the technical difficulty of 16S rDNA analysis through the use of WATERS, we aim to make the analysis of these datasets more widely available and allow individuals with
Figure 2 Screenshot of WATERS in Kepler software. Key features: the library of actors un-collapsed and displayed on the left-hand side, the input
and output paths where the user declares the location of their input files and desired location for the results files. Each green box is an individual Kepler
actor that performs a single action on the data stream. The connectors (black arrows) direct and hook up the actors in a defined sequence. Double-
clicking on any actor or connector allows it to be manipulated and re-arranged.
default is 97% and 99%), and they are also generated for
every metadata variable comparison that the user
includes.
Data pruning
To assist in troubleshooting and quality con
WATERS returns to the user three fasta files of seque
Figure 3 Biologically similar results automatically produced by WATERS on published colonic microbiota samples. (A) Rarefaction curves similar to curves shown in Eckburg et al. Fig. 2; 70-72 indicate patient numbers, i.e., 3 different individuals. (B) Weighted Unifrac analysis based on phylogenetic tree and OTU data produced by WATERS, very similar to Eckburg et al. Fig. 3B. (C) Neighbor-joining phylogenetic tree (Quicktree) representing the sequences analyzed by WATERS, which is clearly similar to Fig. S1 in Eckburg et al.
Figure 3 panel labels recovered from the slide. Panels A and B: PCA plot ("PCA P vs P") with axes labeled "Percent variation explained" (the axis numbers did not survive extraction); points labeled A, B, C, D, E, F and S. Panel C: tree clade labels Bacteroidetes, Bacteroidales, Deltaproteobacteria, Actinobacteria, Verrucomicrobia, Epsilonproteobacteria, Firmicutes, Clostridia, Clostridiales, Gammaproteobacteria, Cyanobacteria, Alphaproteobacteria, Fusobacteria, Bacilli, Mollicutes.
Amber Hartman
27. Phylogenetic Copy # Correction
Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S Gene Copy
Number Information Improves Estimates of Microbial Diversity and Abundance. PLoS
Comput Biol 8(10): e1002743. doi:10.1371/journal.pcbi.1002743
Steven Kembel
Jessica Green
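The basic arithmetic behind copy-number correction can be sketched as follows. This is a simplified illustration, not the procedure from Kembel et al.: it assumes the 16S copy number of each taxon is already known, whereas the paper's title indicates that copy-number information is incorporated into the estimates. The taxon names, read counts, and copy numbers below are invented, and the function names are hypothetical.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Divide each taxon's 16S read count by its assumed 16S gene copy number."""
    return {taxon: reads / copy_numbers[taxon] for taxon, reads in read_counts.items()}


def relative_abundance(values):
    """Normalize a dict of non-negative values so they sum to 1."""
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}


# Invented example: equal read counts, but taxon_b carries four 16S copies per genome.
reads = {"taxon_a": 100, "taxon_b": 100}
copies = {"taxon_a": 1, "taxon_b": 4}

raw = relative_abundance(reads)
corrected = relative_abundance(correct_for_copy_number(reads, copies))
print(raw)
print(corrected)
```

In this toy case the uncorrected reads suggest a 50/50 community, while the corrected estimate is 80/20 in favor of the single-copy taxon.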
28. alignment used to build the profile, resulting in a multiple PD versus PID clustering, 2) to explore overlap between PhylOT
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001
Finding Metagenomic OTU
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard
KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity
and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:
10.1371/journal.pcbi.1001061
PhylOTU
Tom Sharpton
Katie Pollard
Jessica Green
29. Beta-Diversity
a broader range of Proteobacteria, but yielded similar results
(Fig. S1 and Tables S2 and S3).
Across all samples, we identified 4,931 quality Nitrosomonadales
sequences, which grouped into 176 OTUs (operational taxo-
nomic units) using an arbitrary 99% sequence similarity cutoff.
This cutoff retained a high amount of sequence diversity, but
minimized the chance of including diversity because of se-
quencing or PCR errors. Most (95%) of the sequences appear
closely related either to the marine Nitrosospira-like clade,
known to be abundant in estuarine sediments (e.g., ref. 19) or to
marine bacterium C-17, classified as Nitrosomonas (20) (Fig. S2).
Pairwise community similarity between the samples was calcu-
lated based on the presence or absence of each OTU using
a rarefied Sørensen’s index (4). Community similarity using this
somonadales community similarity. Geographic distance con-
tributed the largest partial regression coefficient (b = 0.40,
P < 0.0001), with sediment moisture, nitrate concentration, plant
cover, salinity, and air and water temperature contributing to
smaller, but significant, partial regression coefficients (b = 0.09–
0.17, P < 0.05) (Table 1). Because salt marsh bacteria may be
Fig. 1. The 13 marshes sampled (see Table S1 for details). Marshes com-
pared with one another within regions are circled. (Inset) The arrangement
of sampling points within marshes. Six points were sampled along a 100-m
transect, and a seventh point was sampled ∼1 km away. Two marshes in the
Northeast United States (outlined stars) were sampled more intensively,
along four 100-m transects in a grid pattern.
Fig. 2. Distance-decay curves for the Nitrosomonadales communities. The
dashed, blue line denotes the least-squares linear regression across all spatial
scales. The solid lines denote separate regressions within each of the three
spatial scales: within marshes, regional (across marshes within regions circled in
Fig. 1), and continental (across regions). The slopes of all lines (except the solid
light blue line) are significantly less than zero. The slopes of the solid red lines
are significantly different from the slope of the all scale (blue dashed) line.
ECOLOGY
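The presence/absence similarity mentioned in the excerpt above can be written out as a short sketch. It computes the plain Sørensen index, 2|A ∩ B| / (|A| + |B|), and deliberately leaves out the rarefaction step the paper applies; the OTU names and the function name `sorensen` are invented for illustration.

```python
def sorensen(sample_a, sample_b):
    """Sørensen similarity from the sets of OTUs present in two samples.

    No rarefaction is performed here (a simplification of the paper's index).
    """
    a, b = set(sample_a), set(sample_b)
    if not a and not b:
        raise ValueError("at least one sample must contain an OTU")
    return 2 * len(a & b) / (len(a) + len(b))


# Invented toy samples sharing two of their three OTUs.
marsh_1 = {"otu1", "otu2", "otu3"}
marsh_2 = {"otu2", "otu3", "otu4"}
print(sorensen(marsh_1, marsh_2))
```

Identical samples score 1 and samples with no shared OTUs score 0, so the index can be plotted directly against geographic distance.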
Drivers of bacterial β-diversity depend on spatial scale
Jennifer B. H. Martiny (a,1), Jonathan A. Eisen (b), Kevin Penn (c), Steven D. Allison (a,d), and M. Claire Horner-Devine (e)
(a) Department of Ecology and Evolutionary Biology, and (d) Department of Earth System Science, University of California, Irvine, CA 92697; (b) Department of Evolution and Ecology, University of California Davis Genome Center, Davis, CA 95616; (c) Center for Marine Biotechnology and Biomedicine, The Scripps Institution of Oceanography, University of California at San Diego, La Jolla, CA 92093; and (e) School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA 98195
Edited by Edward F. DeLong, Massachusetts Institute of Technology, Cambridge, MA, and approved March 31, 2011 (received for review November 1, 2010)
The factors driving β-diversity (variation in community composi-
tion) yield insights into the maintenance of biodiversity on the
planet. Here we tested whether the mechanisms that underlie
bacterial β-diversity vary over centimeters to continental spatial
scales by comparing the composition of ammonia-oxidizing bacte-
ria communities in salt marsh sediments. As observed in studies
of macroorganisms, the drivers of salt marsh bacterial β-diversity
depend on spatial scale. In contrast to macroorganism studies,
however, we found no evidence of evolutionary diversification
of ammonia-oxidizing bacteria taxa at the continental scale, de-
spite an overall relationship between geographic distance and
community similarity. Our data are consistent with the idea that
dispersal limitation at local scales can contribute to β-diversity,
even though the 16S rRNA genes of the relatively common taxa
are globally distributed. These results highlight the importance
of considering multiple spatial scales for understanding microbial
biogeography.
microbial composition | distance-decay | Nitrosomonadales | ecological drift
Biodiversity supports the ecosystem processes upon which so-
ciety depends (1). Understanding the mechanisms that gen-
erate and maintain biodiversity is thus key to predicting ecosystem
responses to future environmental changes. The decrease in
community similarity with geographic distance is a universal
biogeographic pattern observed in communities from all
spatial scale (12). Fifty years ago, Preston (13) noted that the
turnover rate (rate of change) of bird species composition across
space within a continent is lower than that across continents. He
attributed the high turnover rate across continents to evolu-
tionary diversification (i.e., speciation) between faunas as a result
of dispersal limitation and the lower turnover rates of bird spe-
cies within continents as a result of environmental variation.
Here we investigate whether the mechanisms underlying β-
diversity in bacteria also vary by spatial scale. We chose to focus
on the ammonia-oxidizing bacteria (AOB), which along with the
ammonia-oxidizing archaea (14), perform the rate-limiting step of
nitrification and thus play a key role in nitrogen dynamics. We
compared AOB community composition in 106 sediment samples
from 12 salt marshes on three continents. A partially nested
sampling design achieved a relatively balanced distribution of
pairwise distance classes over nine orders of magnitude, from
3 cm to 12,500 km (Fig. 1 and Table S1). We limited our sam-
pling to a monophyletic group of bacteria, the AOB within the
β-Proteobacteria, and one habitat, salt marshes primarily domi-
nated by cordgrass (Spartina spp.). This approach constrained
the pool of total diversity (richness) and kept the environmental
and plant variation relatively constant, increasing our ability to
identify if dispersal limitation influences AOB composition.
We then asked two questions: (i) Does bacterial β-diversity—
specifically, the slope of the distance-decay curve—vary over
community composition) yield insights into the maintenance of
biodiversity. These studies are still relatively rare for micro-
organisms, however, and thus our understanding of the mecha-
nisms underlying microbial diversity—most of the tree of life—
remains limited.
β-Diversity, and therefore distance-decay patterns, could be
driven solely by differences in environmental conditions across
space, a hypothesis summed up by microbiologists as, “everything
is everywhere, the environment selects” (10). Under this
model, a distance-decay curve is observed because environmen-
tal variables tend to be spatially autocorrelated, and organisms
with differing niche preferences are selected from the available
pool of taxa as the environment changes with distance.
Dispersal limitation can also give rise to β-diversity, as it per-
mits historical contingencies to influence present-day biogeo-
graphic patterns. For example, neutral niche models, in which an
organism’s abundance is not influenced by its environmental
preferences, predict a distance-decay curve (8, 11). On relatively
short time scales, stochastic births and deaths contribute to
a heterogeneous distribution of taxa (ecological drift). On longer
time scales, stochastic genetic processes allow for taxon di-
versification across the landscape (evolutionary drift). If dispersal
is limiting, then current environmental or biotic conditions will
not fully explain the distance-decay curve, and thus geographic
distance will be correlated with community similarity even after
controlling for other factors (2).
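A distance-decay curve of the kind discussed here can be summarized by the slope of a least-squares line through (distance, similarity) pairs. The sketch below is a generic illustration under stated assumptions: the distances and similarities are invented, regressing on log10 distance is a choice made for this example rather than something taken from the paper, and `least_squares_slope` is a hypothetical helper.

```python
import math


def least_squares_slope(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented pairwise comparisons: similarity falls as samples get farther apart.
distances_km = [0.001, 0.1, 10.0, 1000.0]
similarity = [0.8, 0.7, 0.6, 0.5]

log_distance = [math.log10(d) for d in distances_km]
slope, intercept = least_squares_slope(log_distance, similarity)
print(round(slope, 3), round(intercept, 3))
```

A slope significantly below zero is the distance-decay signal; separating dispersal limitation from environmental effects additionally requires controlling for environmental variables, which this sketch does not attempt.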
For macroorganisms, the relative contribution of environ-
mental factors or dispersal limitation to β-diversity depends on
vary by spatial scale? Because most bacteria
and hardy, we predicted that dispersal lim
primarily across continents, resulting in
microbial “provinces” (15). At the same tim
environmental factors would contribute
decay at all scales, resulting in the steepest sl
scale as reported in plant and animal comm
Results and Discussion
We characterized AOB community compo
Sanger sequencing of 16S rRNA gene reg
primer sets. Here we focus on the results f
sequences from the order Nitrosomonada
primers specific for AOB within the β-Prot
The second primer set (18) generated lo
Author contributions: J.B.H.M. and M.C.H.-D. designed rese
M.C.H.-D. performed research; J.B.H.M., S.D.A., and M.C.H.-D
and M.C.H.-D. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access opti
Data deposition: The sequences reported in this paper hav
Bank database (accession nos. HQ271472–HQ276885 and H
1
To whom correspondence should be addressed. E-mail: jm
This article contains supporting information online at www.
1073/pnas.1016308108/-/DCSupplemental.
7850–7854 | PNAS | May 10, 2011 | vol. 108 | no. 19 www.pnas.org/cgi/do
Our data are consistent with the idea that dispersal limitation at local scales can contribute to β-diversity, even though the 16S rRNA genes of the relatively common taxa are globally distributed.
Jen Hughes Martiny
M. Claire Horner-Devine
30. Drosophila microbiome
Both natural surveys and laboratory experiments indicate that host diet plays a major role in shaping the Drosophila bacterial microbiome.
Laboratory strains provide only a limited model of natural host–microbe interactions.
Jenna Lang Angus Chandler
33. Culture Independent “Metagenomics”
DNA DNA DNA
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACCCCAGCTCTGCCTCG
New2 AGGGGAGCTCTGCCTCG
New3 ACTCCAGCTATCGATCG
New4 ACTGCACCTATCGTTCG
RecA RecA RecA
http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7
Genome Biology 2008, 9:R151
sequences are not conserved at the nucleotide level [29]. As a
result, the nr database does not actually contain many more
protein marker sequences that can be used as references than
those available from complete genome sequences.
Comparison of phylogeny-based and similarity-based phylotyping
Although our phylogeny-based phylotyping is fully auto-
mated, it still requires many more steps than, and is slower
than, similarity-based phylotyping methods such as
MEGAN [30]. Is it worth the trouble? Similarity based phylo-
typing works by searching a query sequence against a refer-
ence database such as NCBI nr and deriving taxonomic
information from the best matches or 'hits'. When species
that are closely related to the query sequence exist in the ref-
erence database, similarity-based phylotyping can work well.
However, if the reference database is a biased sample or if it
contains no closely related species to the query, then the top
hits returned could be misleading [31]. Furthermore, similar-
ity-based methods require an arbitrary similarity cut-off
value to define the top hits. Because individual bacterial
genomes and proteins can evolve at very different rates, a uni-
versal cut-off that works under all conditions does not exist.
As a result, the final results can be very subjective.
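The sensitivity to an arbitrary cut-off can be seen in a toy best-hit classifier. This is not MEGAN's algorithm or any real tool's API; it is a hypothetical sketch that reuses the small character table from the slide as a stand-in reference database and uses percent identity over aligned positions in place of a database search. `best_hit` and `percent_identity` are names invented here.

```python
# Toy reference database: taxa and aligned characters from the slide's table.
REFERENCE_DB = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}


def percent_identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_hit(query, cutoff, database=REFERENCE_DB):
    """Return the top-scoring reference name, or None if it scores below `cutoff`."""
    name = max(database, key=lambda ref: percent_identity(query, database[ref]))
    return name if percent_identity(query, database[name]) >= cutoff else None


new2 = "AGGGGAGCTCTGCCTCG"  # the divergent "New2" row from the slide
print(best_hit(new2, cutoff=0.7))
print(best_hit(new2, cutoff=0.9))
```

The same query is assigned to A2 with a lenient cut-off and left unclassified with a strict one, even though nothing about the data changed. That is the subjectivity the paragraph above describes.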
In contrast, our tree-based bracketing algorithm places the
query sequence within the context of a phylogenetic tree and
only assigns it to a taxonomic level if that level has adequate
sampling (see Materials and methods [below] for details of
the algorithm). With the well sampled species Prochlorococ-
cus marinus, for example, our method can distinguish closely
related organisms and make taxonomic identifications at the
species level. Our reanalysis of the Sargasso Sea data placed
672 sequences (3.6% of the total) within a P. marinus clade.
On the other hand, for sparsely sampled clades such as
Aquifex, assignments will be made only at the phylum level.
Thus, our phylogeny-based analysis is less susceptible to data
sampling bias than a similarity based approach, and it makes
Figure 3. Major phylotypes identified in Sargasso Sea metagenomic data. The metagenomic data previously obtained from the Sargasso Sea were reanalyzed using AMPHORA and the 31 protein phylogenetic markers. The microbial diversity profiles obtained from individual markers are remarkably consistent. The breakdown of the phylotyping assignments by markers and major taxonomic groups is listed in Additional data file 5.
Figure 3 (bar chart). Y-axis: relative abundance, 0 to 0.8. X-axis groups: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified proteobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Acidobacteria, Thermotogae, Fusobacteria, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified bacteria. Legend (marker genes): dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf.
RpoB RpoB RpoB
Rpl4 Rpl4 Rpl4
rRNA rRNA rRNA
Hsp70 Hsp70 Hsp70
EFTu EFTu EFTu
Many other genes better than rRNA
34. Phylosift for Other Marker Genes
DNA DNA DNA
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACCCCAGCTCTGCCTCG
New2 AGGGGAGCTCTGCCTCG
Workflow diagram:
Input sequences: each input sequence is scanned against both workflows (rRNA workflow and protein workflow)
Search input against references: LAST (fast candidate search)
Profile HMMs used to align candidates to reference alignment: hmmalign (multiple alignment) or Infernal (multiple alignment), with a branch on <600 bp versus >600 bp; parallel option
pplacer (phylogenetic placement)
Taxonomic summaries
Sample analysis & comparison: Krona plots, number of reads placed for each marker gene, Edge PCA, tree visualization, Bayes factor tests
https://phylosift.wordpress.com
PeerJ 2:e243 https://dx.doi.org/10.7717/peerj.243
Aaron Darling
Holly Bik
35. Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
Phylogenetic Binning
Nancy Moran
Dongying Wu
42. Automated Accurate Genome Tree
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of
Bacterial and Archaeal Genomes Using Conserved
Genes: Supertrees and Supermatrices. PLoS ONE
8(4): e62510. doi:10.1371/journal.pone.0062510
Jenna Lang
43. Automated Protein Family Surveys
Figure 1 (workflow, panels A, B, C): Representative Genomes; Extract Protein Annotation; All v. All BLAST; Homology Clustering (MCL); SFams; Align & Build HMMs; HMMs; Screen for Homologs; New Genomes; Extract Protein Annotation
Tom Sharpton
Katie Pollard
http://www.biomedcentral.com/1471-2105/13/264
48. GEBA Cyanobacteria
Shih et al. 2013. PNAS 10.1073/pnas.1217107110
Tree figure labels: scale bar 0.3; subclades A, B1, B2, B3, C1, C2, C3, D, E, F, G; other labels Paulinella, Glaucophyte, Green, Red, Chromalveolates; panels A and B.
Fig. 2. Implications on plastid evolution. (A) Maximum-likelihood phylogenetic tree of plastids and cyanobacteria, grouped by subclades (Fig. 1). The red dot
Cheryl Kerfeld
53. Chlorobi
Phylogenetic tree figure (labels recovered from the slide):
BACTERIA: Firmicutes, Tenericutes, Fusobacteria, Chrysiogenetes, Proteobacteria, Fibrobacteres, TG3, Spirochaetes, WWE1 (Cloacamonetes), ZB3, Deinococcus-Thermus, OP1 (Acetothermia), Bacteroidetes, TM7, GN02 (Gracilibacteria), SR1, BH1, OD1 (Parcubacteria), OP11 (Microgenomates); three further labels are too garbled to recover
ARCHAEA: Euryarchaeota, Micrarchaea, DSEG (Aenigmarchaea), Nanohaloarchaea, Nanoarchaea, Cren MCG, Thaumarchaeota, Cren C2, Aigarchaeota, Cren pISA7, Cren Thermoprotei, Korarchaeota, pMC2A384 (Diapherotrites)
Inset panels:
archaeal toxins (Nanoarchaea)
lytic murein transglycosylase
stringent response (Diapherotrites, Nanoarchaea): ppGpp, SpoT, RelA, DksA; limiting amino acids; limiting phosphate, fatty acids, carbon, iron; expression of components for stress response
sigma factor (Diapherotrites, Nanoarchaea): RNA polymerase
oxidoreductase: e- donor, e- acceptor; reduction, oxidation
HGT from Eukaryotes (Nanoarchaea)
murein (peptido-glycan): tetra-peptide
archaeal type purine synthesis (Microgenomates): PurF, PurD, PurN, PurL/Q, PurM, PurK, PurE, PurB, PurP; PRPP, IMP; adenine, guanine
Growing AA chain
63. Acknowledgements
DOE JGI, Sloan, GBMF, NSF, DHS, DARPA
Aaron Darling, Lizzy Wilbanks, Jenna Lang, Russell Neches, Rob Knight, Jack Gilbert, Tanja Woyke, Rob Dunn, Katie Pollard, Jessica Green, Darlene Cavalier, Eddy Rubin, Wendy Brown, Dongying Wu, Phil Hugenholtz, DSMZ, Sundar, Srijak Bhatnagar, David Coil, Alex Alexiev, Hannah Holland-Moritz, Holly Bik, John Zhang, Holly Menninger, Guillaume Jospin, David Lang, Cassie Ettinger, Tim Harkins, Jennifer Gardy, Holly Ganz