This document presents FastMulRFS, a method for estimating species trees from multi-copy gene families that uses the phylogenetic signal in all gene trees while remaining robust to gene duplication and loss. FastMulRFS preprocesses each multi-labeled gene tree into a singly-labeled tree and then solves a Robinson-Foulds supertree problem within a constrained search space. On simulated datasets, FastMulRFS matched the accuracy of DupTree and ASTRAL-multi when gene tree estimation error was low and was more accurate when it was high. It was also faster than MulRF, which targets the same optimization problem, and it completed an analysis of the OneKP plant dataset on which MulRF ran out of memory. The authors plan to evaluate FastMulRFS under more model conditions and to compare it with additional species tree estimation methods.
1. FastMulRFS: Fast and accurate species tree estimation under generic gene duplication and loss models
Erin Molloy & Tandy Warnow, University of Illinois at Urbana-Champaign
Systematics, Biogeography and Evolution (SBE) Meeting 2020
Symposium: Methods in Phylogenetic Inference
Funding: Ira & Debra Cohen Graduate Fellowship in CS to EKM, U.S. NSF Grant #1535977 to TW
2. Motivation
In most studies, species trees are estimated from single-copy genes, so multi-copy genes are excluded
[e.g., Wickett et al. (2014) estimated a species tree from ~400 single-copy genes, excluding ~9,000 multi-copy genes]
Goal
Leverage phylogenetic information from multi-copy genes for species tree estimation
Species Tree Estimation Pipeline
1. Estimate a phylogeny for each gene family; the result is a multi-labeled gene tree (MUL-tree)
2. Run a method that computes a species tree from the MUL-trees, for example:
• DupTree [Wehe et al., 2008]
• ASTRAL-multi [Rabiee et al., 2019; Legried et al., 2020]
• MulRF [Chaudhary et al., 2013]
MOTIVATION & BACKGROUND
3. MulRF: Robinson-Foulds (RF) Supertree Problem for MUL-trees
Find the species tree that minimizes the total RF distance between the extended species tree and the MUL-trees
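To make the RF distance concrete, here is a minimal sketch for the ordinary case of two singly-labeled trees on the same leaf set. It is not the MulRF or FastMulRFS code: the nested-tuple tree encoding and the function names are assumptions made for illustration, and the MUL-tree version (which first extends the species tree with the gene copies) is not shown.

```python
def leaf_set(tree):
    """Collect the leaf labels of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions (internal branches) of a tree.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference leaf, so the same branch always gets the same representation.
    """
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if ref not in below else all_leaves - below
        # Keep only branches with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """RF distance: number of bipartitions found in exactly one of the trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical example trees on species A to E.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))
```

The two example trees share the branch separating D and E from the rest and differ in one branch each, so the symmetric difference contains two bipartitions.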
FastMulRFS
1. Preprocesses each MUL-tree, producing a tree that is singly-labeled and potentially unresolved
2. Applies FastRFS [Vachaspati & Warnow, 2017], which solves the RF supertree problem exactly within a constrained search space defined by the input, to the preprocessed trees
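The poster does not spell out the preprocessing rule, so the following sketch rests on an assumption: a branch of the MUL-tree is kept only if no species has copies on both sides of it, and all other branches are collapsed. Under that assumption, the surviving branches define a singly-labeled, possibly unresolved tree on the species. The tree encoding, the function names, and the example MUL-tree are all hypothetical.

```python
from collections import Counter


def species_counts(node):
    """Count how many leaf copies of each species sit below a node."""
    if isinstance(node, str):
        return Counter([node])
    total = Counter()
    for child in node:
        total += species_counts(child)
    return total


def disjoint_species_splits(mul_tree):
    """Return species bipartitions induced by branches with no shared species.

    A branch is kept only if every species seen below it has ALL of its
    copies below it, so the two sides of the branch share no species.
    Each kept bipartition is stored as the side without the reference species.
    """
    total = species_counts(mul_tree)
    species = frozenset(total)
    ref = min(species)
    kept = set()

    def visit(node):
        if isinstance(node, str):
            return Counter([node])
        below = Counter()
        for child in node:
            below += visit(child)
        if all(below[s] == total[s] for s in below):
            side = frozenset(below)
            if ref in side:
                side = species - side
            # Ignore branches that separate fewer than two species.
            if 1 < len(side) < len(species) - 1:
                kept.add(side)
        return below

    visit(mul_tree)
    return kept


# Hypothetical MUL-tree: leaves are species names, repeated once per gene copy.
mul_tree = (((("A", "B"), ("C", "C")), ("B", "A")), (("D", "D"), "E"))
print(disjoint_species_splits(mul_tree))
```

In this example only the branch separating {D, E} from {A, B, C} survives, so the preprocessed tree would be the unresolved tree (A, B, C, (D, E)).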
NEW METHOD — FASTMULRFS
[Figure: (a) Species Tree on A, B, C, D, E; (b) Extended Species Tree with leaves A1, A2, B1, B2, C1, C2, D1, D2, E; (c) MUL-tree with leaves A1, B1, C1, C2, B2, A2, D1, D2, E; (d) Preprocessed MUL-tree on A, B, C, D, E]
4. Running Time
• DupTree — 2.7 hrs
• ASTRAL-multi — >48 hrs
• MulRF — ran out of mem (>256 GB)
• FastMulRFS — 5.4 hrs
Species Tree Comparison
• FastMulRFS found 2 equally optimal species trees in the constrained search space, so their strict consensus has 76 internal branches
• 69 of these internal branches agree with the species tree estimated by running ASTRAL on ~400 single-copy genes [Wickett et al., 2014]
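The comparison above can be illustrated with a small sketch: the strict consensus keeps only the branches present in every optimal tree, and agreement with a reference tree is the number of consensus branches that the reference also contains. The six-taxon trees below are made up for illustration and are unrelated to the OneKP data; the helper names are assumptions.

```python
TAXA = frozenset("ABCDEF")


def split(side):
    """Represent a branch as an unordered pair of leaf sets (a bipartition)."""
    side = frozenset(side)
    return frozenset([side, TAXA - side])


def strict_consensus(trees):
    """Keep only the bipartitions that occur in every input tree."""
    return frozenset.intersection(*(frozenset(t) for t in trees))


# Two hypothetical, equally optimal trees, each given by its internal branches.
tree1 = {split("AB"), split("ABC"), split("EF")}
tree2 = {split("AB"), split("ABD"), split("EF")}

# Hypothetical reference tree to compare against.
reference = {split("AB"), split("ABC"), split("DE")}

consensus = strict_consensus([tree1, tree2])
shared = consensus & reference
print(len(consensus), "consensus branches;", len(shared), "agree with the reference")
```

Here the two optimal trees disagree on one branch, so the consensus has two internal branches, and one of those also appears in the reference tree.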
[Figure: box plots of species tree error (SpeciesTreeError, axis 0.0 to 0.3) and running time in minutes (RunningTime(m), axis 0 to 150) for DupTree, ASTRAL-multi, MulRF, and FastMulRFS, under No GTEE (N=10) and 52% Mean GTEE (N=10). Model: High GDL + low/mod ILS]
NOTES: Boxes left to right = names top to bottom; GDL = Gene Duplication & Loss;
ILS = Incomplete Lineage Sorting; GTEE = Gene Tree Estimation Error
L: Simulated Datasets (100 taxa & 500 genes) | R: OneKP Dataset (83 taxa & 9,237 genes)
5. Get FastMulRFS:
https://github.com/ekmolloy/fastmulrfs
Learn more:
https://doi.org/10.1093/bioinformatics/btaa444
In our study, FastMulRFS
• was as accurate as DupTree and ASTRAL-multi when GTEE was low
• was more accurate than DupTree and ASTRAL-multi when GTEE was high
• was faster than MulRF (different heuristic for same optimization problem) and had similar accuracy
• enabled analysis of OneKP Plant dataset, which was not possible with MulRF
Future Work
• Examine 7 branches in FastMulRFS tree that disagree with OneKP analysis by Wickett et al. (2014)
• Evaluate FastMulRFS on more model conditions
• Compare FastMulRFS to other methods (PHYLDOG [Boussau et al., 2012], A-Pro [Zhang et al., 2020])
CONCLUSIONS & FUTURE WORK