A probabilistic parsimonious model for species tree reconstruction
1. A probabilistic parsimonious model for species tree reconstruction
● Leonardo de Oliveira Martins (leomrtns@uvigo.es)
● David Posada (dposada@uvigo.es)
with invaluable help from Klaus Schliep and Diego Mallo
2. What do we want
● To estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc.
● To account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all
● To allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon
● Fast computation ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
3. Outline
Model of gene family evolution
Parsimonious estimation of disagreement
* reconciliation
* distance between trees
Hierarchical Bayesian model
Examples
* comparing many trees
* simulation
* TreeFam data set
4. Model for the evolution of gene families
[Diagram: S at the top, linked to G1, G2, ..., Gn; each Gi is linked to its own Di]
6. Model for the evolution of gene families
[Diagram: S → G1 → D1]
Our assumption: P(G | S) depends on the distance between G and S
● We just need to consider the simplest explanation for the difference between the gene and species trees
● We may use several such simple explanations
ML supertrees (Rodrigo and Steel. 2008. SystBiol 57: 243):
● work with unrooted gene trees
● penalize gene trees very different from the species tree
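The equation on this slide did not survive extraction. As a hedged reading, the ML supertree model of the cited reference gives a gene tree a probability that decays exponentially with its distance to the species tree. The symbols λ (penalty), d (distance) and Z (normalization constant) below are assumptions chosen to match the λ parameters and Z(.) that appear on later slides, not a transcription of the original formula.

```latex
P(G \mid S) = \frac{\exp\{-\lambda\, d(G, S)\}}{Z(\lambda)},
\qquad
Z(\lambda) = \sum_{G'} \exp\{-\lambda\, d(G', S)\}
```

Under this form, a larger λ penalizes gene trees far from S more strongly, and λ = 0 makes all gene trees equally likely.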
9. Quantifying the disagreement
[Figure: gene tree, species tree, and their reconciliation]
● assuming deepcoal: 1 deepcoal
● assuming duplosses: 1 dup, 3 losses
● assuming HGT: 1 event
● Stochastic error/nonparametric
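To make the duplication count concrete, here is a minimal sketch of the classic LCA-mapping reconciliation, which may differ from what the authors' software does. Trees are written as nested tuples, gene tree leaves carry the name of the species they were sampled from, and the function names are hypothetical. Only duplications are counted; losses and deep coalescences are left out to keep the sketch short.

```python
from itertools import count


def index_species_tree(tree):
    """Give every species tree node an id and record its parent and depth."""
    parent, depth, leaf_node = {}, {}, {}
    ids = count()

    def visit(node, par, d):
        nid = next(ids)
        parent[nid], depth[nid] = par, d
        if isinstance(node, str):
            leaf_node[node] = nid
        else:
            for child in node:
                visit(child, nid, d + 1)

    visit(tree, None, 0)
    return parent, depth, leaf_node


def lca(a, b, parent, depth):
    """Least common ancestor of species tree nodes a and b."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def count_duplications(gene_tree, species_tree):
    """Number of gene tree nodes that map to the same species node as a child."""
    parent, depth, leaf_node = index_species_tree(species_tree)
    n_dups = 0

    def visit(node):
        nonlocal n_dups
        if isinstance(node, str):
            return leaf_node[node]
        child_maps = [visit(child) for child in node]
        here = child_maps[0]
        for m in child_maps[1:]:
            here = lca(here, m, parent, depth)
        if here in child_maps:  # the LCA mapping calls this node a duplication
            n_dups += 1
        return here

    visit(gene_tree)
    return n_dups


species = (("A", "B"), "C")
print(count_duplications((("A", "C"), "B"), species))  # 1
print(count_duplications((("A", "B"), "C"), species))  # 0
```

The first call uses a made-up three-species example, not the trees drawn on the slide.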
11. Quantifying the disagreement – other measures
mul-tree version: Chaudhary R, Burleigh JG, Fernández-Baca D (2013) Inferring Species Trees from
Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance. arXiv:1210.2665
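Assuming this slide refers to the Robinson-Foulds (RF) distance named in the cited title, the sketch below computes the standard RF distance between two single-copy trees: the number of bipartitions found in one tree but not the other. It does not implement the mul-tree version from the reference. Trees are nested tuples read as unrooted, and the function names are hypothetical.

```python
def splits(tree):
    """Non-trivial bipartitions of a nested-tuple tree, read as unrooted."""
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(c) for c in node))
        found.add(below)
        return below

    all_leaves = leaves_below(tree)
    ref = min(all_leaves)  # each split is stored as the side without this leaf
    result = set()
    for clade in found:
        side = all_leaves - clade if ref in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:  # drop trivial splits
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Size of the symmetric difference between the two sets of splits."""
    return len(splits(tree_a) ^ splits(tree_b))


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))  # 2
```

Because every split is stored relative to a fixed reference leaf, the result does not depend on where either tree is rooted.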
12. Quantifying the disagreement – other measures
de Oliveira Martins et al. (2008) Phylogenetic Detection of Recombination with a Bayesian Prior on the
Distance between Trees. PLoS ONE 3(7): e2651.
13. Quantifying the disagreement – other measures
see also: Whidden et al. (2013) Supertrees based on the subtree prune-and-regraft distance. PeerJ PrePrints
1:e18v1
14. Quantifying the disagreement – other measures
Hdist similar to: Nye TMW, Liò P, Gilks WR (2006) A novel algorithm and web-based tool for comparing two
alternative phylogenetic trees. Bioinformatics 22: 117-119
19. Now we have estimates for these
● assuming deepcoal (1 deepcoal): Gene tree parsimony
● assuming duplosses (1 dup, 3 losses): Gene tree parsimony
● assuming HGT (1 event): (approximate) dSPR
● Stochastic error/nonparametric: RF, Hdist
21. Considering several measures of disagreement:
● Thus we can incorporate e.g. duplications and losses while accounting for HGT and random errors
● Easy to include other distances in the future
● Problem: the normalization constant
Ref.: Bryant D, Steel M (2009) Computing the Distribution of a Tree Metric. TCBB: 420–426
● Solution: importance sampling estimate of Z(.)
E.g.: Rodrigue N, Kleinman CL, Philippe H, Lartillot N (2009) Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 26: 1663-1676.
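The importance sampling idea can be illustrated on a toy problem. The sketch assumes that the unnormalized probability of a gene tree is the exponential of minus a penalty-weighted sum of its distances to the species tree, and that the proposal is uniform over a small, fully enumerable space, so the estimate can be checked against the exact sum. The toy space, the penalty values and the function names are all hypothetical. A real analysis cannot enumerate tree space and would need a better proposal.

```python
import math
import random


def log_weight(distances, lambdas):
    """Unnormalized log-probability: minus the penalty-weighted sum of distances."""
    return -sum(lambdas[name] * d for name, d in distances.items())


def estimate_z(space, lambdas, n_samples, rng):
    """Importance sampling estimate of Z with a uniform proposal over `space`."""
    total = 0.0
    for _ in range(n_samples):
        distances = rng.choice(space)  # draw G with probability 1 / len(space)
        total += math.exp(log_weight(distances, lambdas))
    return len(space) * total / n_samples  # average of weight / proposal probability


# Hypothetical toy "tree space": each entry holds one tree's distances to S.
toy_space = [{"dup": g % 4, "loss": g % 7, "spr": g % 3} for g in range(84)]
lambdas = {"dup": 0.5, "loss": 0.2, "spr": 1.0}

exact = sum(math.exp(log_weight(d, lambdas)) for d in toy_space)
approx = estimate_z(toy_space, lambdas, 20000, random.Random(1))
print(f"exact Z = {exact:.3f}, importance sampling estimate = {approx:.3f}")
```

With all penalties set to zero every weight equals one, so the estimate returns the size of the space exactly. That makes a convenient sanity check.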
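25. Distribution of gene trees: probabilistic model
[Diagram: hierarchical model. For each gene family i = 1, ..., n, the gene tree Gi is linked to the species tree S, to the family-specific parameters λdup_i, λloss_i, λspr_i, ..., and to Di and Qi. The family-specific parameters are linked to the shared priors λdupprior, λlossprior, λsprprior.]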
29. Distribution of gene trees: probabilistic model
[Diagram: the hierarchical model with S, G1, ..., Gn, the family-specific λdup_i, λloss_i, λspr_i and the priors λdupprior, λlossprior, λsprprior; the Di and Qi nodes are replaced by a box labelled "Importance Sampling". Successive builds of the slide label parts of the diagram "Input" and "Output".]
● Importance sampling: so we can use complex, state-of-the-art software for phylogenetic inference
● We should not rely on single estimates of gene phylogenies
E.g.: Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. (2012) Genome-scale coestimation of species and gene trees. Genome research 23: 323-330.
32. Example: distances between gene families
● 567 single-copy gene trees for 23 species
Data from: Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327–331
● Analysis under a model where only RF, Hdist and dSPR are considered
● Not interested in the data set per se (unreliable)
● Use it just as a didactic tool to show how the model works
[Figure panels: RF, Hdist, SPR]
37. Analysis of simulated data sets
● Fully probabilistic simulation of gene trees by Diego Mallo and David Posada
● Birth and death of new loci, conditioned on a multispecies coalescent, followed by sequence evolution
Idea from: Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22: 755-765
● We use gene trees only, and simulate tree inference error
42. Single copy genes from Drosophila (TreeFam)
● 4591 informative, single-copy gene families
● (TreeFam database has 14250 informative gene families)
45. Single copy genes from Drosophila (TreeFam)
● 4591 informative, single-copy gene families
Estimated species tree:
● Root location uncertain
● Only one unrooted topology
47. Large gene families from Drosophila (TreeFam)
● 43 gene families with 102 to 295 tips
● best species tree: ~100%
48. To recap, our model can
● Estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc.
The larger, the better, especially for rooting the species tree
● Account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all
Do not assume gene trees are known: embrace ignorance!
● Allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon
Different gene families may be the product of distinct processes
● Be fast ← the improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
It's parallelized, and all distances can be calculated very fast.