1. Phylogenetics Workshop
Part I : Introduction
De Landtsheer Sébastien, University of Luxemburg
Ahead of the BeNeLux Bioinformatics Conference 2011
2. Outline of the Workshop
Part I :
• General introduction
• Alignments
• Distance-based methods
Part II :
• Maximum likelihood trees
• Bayesian trees
Part III :
• Advanced Bayesian phylogenetics
• Hypothesis testing
3. Outline of Part I
• General introduction : what is
phylogenetics ?
• Basic DNA alignment algorithm
• Distance matrices
• Distance-based tree inference methods
4. Software featured in Part I
• Seaview (http://pbil.univ-lyon1.fr/software/seaview.html)
• BioEdit (http://www.mbio.ncsu.edu/bioedit/bioedit.html)
• MEGA (http://www.megasoftware.net/)
• FigTree (http://tree.bio.ed.ac.uk/software/figtree/)
5. What is Phylogenetics ?
• Classification of living species into
categories
• Study of characters → states
• Underlying assumption of evolution
(cladogram / dendrogram)
6. What is Phylogenetics
• Characters :
– Morphological
– Biochemical
– Genetic
• States :
– Continuous
– Discontinuous
7. Different types of Phylogenetic trees
• Phylogenetic tree : graphical representation of
our hypothesis about the evolution of a group of
organisms
• Can represent different quantities (time/genetic
distance) and be displayed in different ways
• There are several possible methods, and there
is no single method that is best
8. Phylogenetic trees jargon
• Root (if there is one)
• Node
• Internal branches
• Terminal branches
• Leaves, tips, or OTUs
11. Properties of Phylogenetic trees
• The real face of unrooted trees : they are undirected, so there are
multiple possibilities for rooting the tree
12. Properties of Phylogenetic trees
• Where to place the root ?
– Midpoint rooting : the root is placed halfway between the two most distantly
related taxa on the tree. It is intuitive, but more often than not it
is wrong
– Outgroup : using one distantly related taxon whose position outside the group is uncontroversial
• Marsupial for eutherian study
• Treeshrew for primate study
• SIV for HIV study
13. Properties of Phylogenetic trees
• How to root unrooted trees ?
1) Midpoint rooting
Assumes that the rates of evolution have stayed roughly constant
14. Properties of Phylogenetic trees
• How to root unrooted trees ?
2) Using an outgroup
Problem : it is difficult to find a proper outgroup
(unambiguously outside the group of interest, but still not too distant)
15. Properties of Phylogenetic trees
• Rooted trees tell a story (directed)
Most Recent Common Ancestor (MRCA)
17. Properties of Phylogenetic trees
• Many topologies are always possible :
Number of possible rooted trees for n sequences
= (2n-3)! / (2^{n-2} (n-2)!)
2 sequences: 1
3 sequences: 3
4 sequences: 15
5 sequences: 105
6 sequences: 945
7 sequences: 10395
8 sequences: 135135
9 sequences: 2027025
10 sequences: 34459425
51 sequences: ≈ 2.7 × 10^78 (52 sequences already give > 10^80, roughly the number of particles in the universe)
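The counts above can be checked with a few lines of Python. This is only a sketch: the function name `n_rooted_trees` is chosen here for illustration and does not come from the workshop material.

```python
from math import factorial


def n_rooted_trees(n: int) -> int:
    """Number of rooted, bifurcating, labelled trees for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least two sequences")
    # (2n-3)! / (2^(n-2) * (n-2)!) is always an integer, so // is exact.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (2, 3, 4, 5, 6, 10):
    print(n, n_rooted_trees(n))
```

Python integers have arbitrary precision, so the same function also works for 51 or 52 sequences without overflow.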
18. DNA alignments
• Aligning two sequences: the Needleman–Wunsch
algorithm
– Construct a similarity matrix
– Assign similarity scores based on a chosen scoring system
– Find the best GLOBAL alignment between the two sequences = the
alignment with the highest total score (with the simple scoring used below, the
maximum number of residues from one sequence that can be
aligned with the other one)
19. DNA alignments
A T G T A C C G T
0 0 0 0 0 0 0 0 0 0
T 0
G 0
A 0
C 0
T 0
C 0
G 0
T 0
20. DNA alignments
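• The score in one cell is the maximum of different
possibilities :
– The upper left cell plus the value of the similarity between the
two residues
– The upper cell plus the value of a gap (in the upper sequence)
– The left cell plus the value of a gap (in the left sequence)
– 0 is an extra option only in local alignment (Smith–Waterman) ; with the
scoring used in this example no score is negative, so it never changes the result
H_{i,j} = max { H_{i-1,j-1} + s(a_i, b_j), H_{i,j-1} + P_g(k), H_{i-1,j} + P_g(k) }
where s(a_i, b_j) is the similarity of the two residues and P_g(k) the penalty for a gap of length k
There is a penalty for gap opening and for gap extension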
21. DNA alignments
• For the example we will use the following scoring matrix :
– Identity : +1
– Mismatch : 0
– Gap : 0
• In real life ClustalW uses different scoring matrices
depending on the alphabet (amino acids or DNA) and can be set to use
word matches (k-tuples). All parameters are editable
22. DNA alignments
A T G T A C C G T
0 0 0 0 0 0 0 0 0 0
T 0 0
G 0
A 0
C 0
T 0
C 0
G 0
T 0
23. DNA alignments
A T G T A C C G T
0 0 0 0 0 0 0 0 0 0
T 0 0 1 1 1 1 1 1 1 1
G 0
A 0
C 0
T 0
C 0
G 0
T 0
24. DNA alignments
A T G T A C C G T
0 0 0 0 0 0 0 0 0 0
T 0 0 1 1 1 1 1 1 1 1
G 0 0 1
A 0 1 1
C 0 1 1
T 0 1 2
C 0 1 2
G 0 1 2
T 0 1 2
25. DNA alignments
A T G T A C C G T
0 0 0 0 0 0 0 0 0 0
T 0 0 1 1 1 1 1 1 1 1
G 0 0 1 2 2 2 2 2 2 2
A 0 1 1 2 2 3 3 3 3 3
C 0 1 1 2 2 3 4 4 4 4
T 0 1 2 2 3 3 4 4 4 5
C 0 1 2 2 3 3 4 5 5 5
G 0 1 2 3 3 3 4 5 6 6
T 0 1 2 3 4 4 4 5 6 7
26. DNA alignments
A T G T A C C G T
0 0 0 0 0 0 0 0 0 0
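T 0 0 1 1 1 1 1 1 1 1
G 0 0 1 2 2 2 2 2 2 2
A 0 1 1 2 2 3 3 3 3 3
C 0 1 1 2 2 3 4 4 4 4
T 0 1 2 2 3 3 4 4 4 5
C 0 1 2 2 3 3 4 5 5 5
G 0 1 2 3 3 3 4 5 6 6
T 0 1 2 3 4 4 4 5 6 7
[Traceback : starting from the bottom-right cell (score 7), follow the cells that produced each maximum back to the top-left corner]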
27. DNA alignments
• Final alignment :
A T G T A C - C G T
- T G - A C T C G T
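The matrix filling and the traceback can be sketched in Python as follows. This is a minimal illustration, not the ClustalW implementation: the function name and its parameters are chosen here, the gap cost is a single value per gap position (no separate opening and extension penalties), and the defaults are the example scoring (identity +1, everything else 0). Several alignments reach the same best score, so the alignment returned may differ from the one shown on the slide.

```python
def needleman_wunsch(top, left, match=1, mismatch=0, gap=0):
    """Global alignment of two strings; returns (score, aligned_top, aligned_left)."""
    rows, cols = len(left) + 1, len(top) + 1
    H = [[0] * cols for _ in range(rows)]
    # Borders: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        H[i][0] = i * gap
    for j in range(1, cols):
        H[0][j] = j * gap
    # Fill: best of diagonal (pair the residues), up (gap in top), left (gap in left).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if left[i - 1] == top[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
    # Traceback from the bottom-right cell to the top-left corner.
    out_top, out_left = [], []
    i, j = len(left), len(top)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if left[i - 1] == top[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:
                out_top.append(top[j - 1])
                out_left.append(left[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and H[i][j] == H[i - 1][j] + gap:
            out_top.append("-")
            out_left.append(left[i - 1])
            i -= 1
        else:
            out_top.append(top[j - 1])
            out_left.append("-")
            j -= 1
    return H[-1][-1], "".join(reversed(out_top)), "".join(reversed(out_left))


score, a, b = needleman_wunsch("ATGTACCGT", "TGACTCGT")
print(score)
print(a)
print(b)
```

With the example scoring the score is simply the number of identical aligned residues, here 7.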
28. DNA alignments
• More advanced alignment methods include :
– T-COFFEE builds a multiple alignment that is consistent with a library of
pairwise alignments computed from a variety of sources.
Computationally intensive (not good for big datasets)
– MUSCLE is an iterative refinement algorithm. Very fast
– MAFFT uses the fast Fourier transform to detect homologous
regions. Very fast
– Genetic algorithms (e.g. SAGA) generate a population of
alignments that evolves by selection and crossing-over.
Very slow, but custom scoring functions can be defined. They need to
be run several times (stochastic)
– Hidden Markov models (HMMs) used to be inaccurate.
They are better now but still slow and difficult to use
29. DNA alignments
• Good practice for alignments :
– Use a variety of algorithms
– Align at the nucleotide but also at the amino acid level
(TranslatorX or manually)
– Compare the different outputs
– Check manually :
• Consistency with the given ORF (frame-shifts)
• Sequencing errors
– The alignment can also be seen as a hypothesis,
therefore it needs to make sense from the biological
point of view : genes have to be HOMOLOGS (share
ancestry)
30. Building trees with distance methods
• The distance between 2 sequences can be calculated in
different ways:
– number of differences
– according to a substitution model
• The clustering can be achieved in different ways:
– UPGMA
– Neighbor-joining
– (Parsimony)
31. Building trees with distance methods
• Building a UPGMA tree with the number of differences :
1. Calculate the pairwise distance matrix
A B C D E F
A 0 1 3 6 7 10
B 1 0 3 6 7 10
C 3 3 0 5 6 9
D 6 6 5 0 1 7
E 7 7 6 1 0 8
F 10 10 9 7 8 0
32. Building trees with distance methods
• Building a UPGMA tree with the number of differences :
2. Group the 2 most closely related sequences
A B C D E F
A 0 1 3 6 7 10
B 1 0 3 6 7 10
C 3 3 0 5 6 9
D 6 6 5 0 1 7
E 7 7 6 1 0 8
F 10 10 9 7 8 0
[Tree so far : A and B joined, branch length 0.5 each]
33. Building trees with distance methods
• Building a UPGMA tree with the number of differences :
3. Recalculate the distance matrix and take the next smallest distance
A/B C D E F
A/B 0 3 6 7 10
C 3 0 5 6 9
D 6 5 0 1 7
E 7 6 1 0 8
F 10 9 7 8 0
[Tree so far : A and B joined (0.5 each) ; D and E joined (0.5 each)]
34. Building trees with distance methods
• Building a UPGMA tree with the number of differences :
3. Recalculate the distance matrix and take the next smallest distance
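A/B C D/E F
A/B 0 3 6.5 10
C 3 0 5.5 9
D/E 6.5 5.5 0 7.5
F 10 9 7.5 0
[Tree so far : A and B joined (0.5 each) ; D and E joined (0.5 each) ; C joined to A/B at height 1.5 (branch to C : 1.5, branch to the A/B node : 1)]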
35. Building trees with distance methods
• Building a UPGMA tree with the number of differences :
3. Recalculate the distance matrix and take the next smallest distance
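A/B/C D/E F
A/B/C 0 6 9.5
D/E 6 0 7.5
F 9.5 7.5 0
Note : 6 and 9.5 are plain averages of the two merged rows ((6.5+5.5)/2 and (10+9)/2). Strict UPGMA weights each cluster by its number of sequences, which gives 6.17 and 9.67 here
[Tree so far : as before, plus A/B/C joined to D/E at height 3 (branch to the A/B/C node : 1.5, branch to the D/E node : 2.5)]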
36. Building trees with distance methods
• Building a UPGMA tree with the number of differences :
3. Recalculate the distance matrix and take the next smallest distance
A/B/C/D/E F
A/B/C/D/E 0 8.5
F 8.5 0
[Final tree : F joined to A/B/C/D/E at height 4.25 (branch to F : 4.25, branch to the A/B/C/D/E node : 1.25)]
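The worked example can be reproduced with the sketch below. The function name `average_linkage` and the `size_weighted` switch are assumptions made for illustration. With `size_weighted=False` each new row is the plain average of the two merged rows, which gives the node heights of the worked example (0.5, 0.5, 1.5, 3, 4.25). With `size_weighted=True` each cluster counts in proportion to its number of sequences, which is the usual definition of UPGMA and moves the last two heights to about 3.08 and 4.4.

```python
from itertools import combinations

names = "ABCDEF"
rows = [
    [0, 1, 3, 6, 7, 10],
    [1, 0, 3, 6, 7, 10],
    [3, 3, 0, 5, 6, 9],
    [6, 6, 5, 0, 1, 7],
    [7, 7, 6, 1, 0, 8],
    [10, 10, 9, 7, 8, 0],
]
dist = {(names[i], names[j]): rows[i][j] for i in range(6) for j in range(i + 1, 6)}


def average_linkage(names, dist, size_weighted=True):
    """Agglomerative clustering; returns a list of (cluster_a, cluster_b, node_height)."""
    sizes = {name: 1 for name in names}  # cluster label -> number of sequences
    d = {frozenset(pair): float(value) for pair, value in dist.items()}
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters (first one found on ties).
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        na, nb = sizes.pop(a), sizes.pop(b)
        new = f"{a}/{b}"
        # Distance from the merged cluster to every remaining cluster.
        for c in sizes:
            dac, dbc = d[frozenset((a, c))], d[frozenset((b, c))]
            if size_weighted:
                d[frozenset((new, c))] = (na * dac + nb * dbc) / (na + nb)
            else:
                d[frozenset((new, c))] = (dac + dbc) / 2
        sizes[new] = na + nb
        merges.append((a, b, height))
    return merges


for a, b, h in average_linkage(names, dist, size_weighted=False):
    print(f"join {a} + {b} at height {h}")
```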
37. Building trees with distance methods
• Assumption of the UPGMA method : constant rate of evolution
across time and for all branches. This assumption is frequently
violated in real-life datasets and therefore the UPGMA can find a
wrong tree.
• How can we relax this assumption ? We calculate the total
divergence for each tip and compute a corrected distance matrix
• Starting from a star-like tree, we create branches to minimize the
length of the tree and agglomeratively join the closest neighbors
=> Neighbor-joining
38. Building trees with distance methods
• Building a Neighbor-Joining tree with the number of differences
[Figure : TRUE topology, where B has accumulated 4 times as many
mutations as A since their divergence. Terminal branch lengths :
A 1, B 4, C 2, D 3, E 2, F 4 ; internal branches of length 1]
39. Building trees with distance methods
• Building a Neighbor-Joining tree with the number of differences
[Figure : same true topology as on the previous slide]
A B C D E F
A 0 5 4 7 6 8
B 5 0 7 10 9 11
C 4 7 0 7 6 8
D 7 10 7 0 5 9
E 6 9 6 5 0 8
F 8 11 8 9 8 0
UPGMA would cluster A and C together (distance 4) because B is more
distant from A (distance 5)
40. Building trees with distance methods
• A global divergence is calculated by summing all distances, and a
new distance matrix is computed
A B C D E F
A 0 5 4 7 6 8
B 5 0 7 10 9 11
C 4 7 0 7 6 8
D 7 10 7 0 5 9
E 6 9 6 5 0 8
F 8 11 8 9 8 0
Div 30 42 32 38 34 44
A B C D E F
A 0 -13 -11.5 -10 -10 -10.5
B -13 0 -11.5 -10 -10 -10.5
C -11.5 -11.5 0 -10.5 -10.5 -11
D -10 -10 -10.5 0 -13 -11.5
E -10 -10 -10.5 -13 0 -11.5
F -10.5 -10.5 -11 -11.5 -11.5 0
Div(A) = Σ_i dist(A,i) = 5+4+7+6+8 = 30
Div(B) = Σ_i dist(B,i) = 5+7+10+9+11 = 42
Div(C) = Σ_i dist(C,i) = 32
Div(D) = Σ_i dist(D,i) = 38
Div(E) = Σ_i dist(E,i) = 34
Div(F) = Σ_i dist(F,i) = 44
M(i,j) = dist(i,j) - (Div(i)+Div(j)) / (N-2), with N = 6 sequences
M(A,B) = 5 - (30+42)/4 = -13
M(A,C) = 4 - (30+32)/4 = -11.5
etc…
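The divergences and the corrected matrix can be recomputed with the short sketch below. Variable and function names are chosen here for illustration. The branch-length formula at the end is the standard neighbor-joining one; it is not spelled out on the slides and is included as an assumption about how the first pair is attached to its new node.

```python
names = "ABCDEF"
D = [
    [0, 5, 4, 7, 6, 8],
    [5, 0, 7, 10, 9, 11],
    [4, 7, 0, 7, 6, 8],
    [7, 10, 7, 0, 5, 9],
    [6, 9, 6, 5, 0, 8],
    [8, 11, 8, 9, 8, 0],
]
n = len(names)

# Div(i): sum of the distances from sequence i to all other sequences.
div = [sum(row) for row in D]


def corrected(i, j):
    """M(i,j) = dist(i,j) - (Div(i) + Div(j)) / (N - 2)."""
    return D[i][j] - (div[i] + div[j]) / (n - 2)


M = [[0.0 if i == j else corrected(i, j) for j in range(n)] for i in range(n)]

# The pair with the smallest corrected distance is joined first
# (A/B and D/E tie at -13; min() keeps the first one found).
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
i, j = min(pairs, key=lambda p: M[p[0]][p[1]])

# Standard NJ branch lengths from the two joined tips to their new node.
len_i = D[i][j] / 2 + (div[i] - div[j]) / (2 * (n - 2))
len_j = D[i][j] - len_i

print(div)
print(names[i], names[j], M[i][j], len_i, len_j)
```

For this matrix the first pair is A and B, with branch lengths 1 and 4, which recovers the unequal rates of the true topology.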
41. Building trees with distance methods
• Starting with a star-like tree, the nodes are created sequentially
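[Figure : star-like tree with the six tips A to F, then the tree after the first step, where A and B are joined to a new node with branch lengths 1 and 4]
…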
42. Advantages and disadvantages of
the Neighbor-Joining method
• Fast method that will always produce a reasonable tree. It is deterministic : it always
produces the same tree if the same alignment is used
• Relaxes the most unrealistic assumption of UPGMA (constant rate of evolution)
• Disadvantage : long-branch attraction. Two taxa with similar convergent
properties (increased GC content or high evolutionary rates) will
tend to group together
43. How to test the reliability of trees ?
• One popular method : BOOTSTRAPPING
– Randomly generate new alignments from the original one, by drawing
positions (columns) with replacement
– The new alignments have the same length, but a slightly different
composition than the original one (i.e. some positions are represented
more than once and some positions are omitted)
– Tree reconstruction is applied to each of these new alignments
– The groups (clusters) in the original tree are then examined, to see how often they
occur in the bootstrapped trees. The more often a group appears, the higher
the bootstrap value supporting that node
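The resampling step can be sketched in Python as below. Only the column resampling is shown; rebuilding a tree from each replicate and counting the groups is left out. The function name `bootstrap_alignment` and the fixed random seed are choices made for this illustration, and the four short sequences are toy data.

```python
import random

alignment = {
    "A": "ATGCGAGTTTAGCAG",
    "B": "ATGCGAGCTTAACTG",
    "C": "ATACTAGCTTAGCTG",
    "D": "ATGCTATCTTAGGTG",
}


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn with replacement."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")
    # The same column indices are used for every sequence, so columns stay intact.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


rng = random.Random(2011)  # fixed seed so that the replicates are reproducible
for replicate in range(3):
    for name, seq in bootstrap_alignment(alignment, rng).items():
        print(replicate, name, seq)
```

Passing the random generator as an argument keeps the function deterministic for a given seed, which makes a bootstrap analysis repeatable.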
44. How to test the reliability of trees ?
• Bootstrapping example : 1) The Data
x y
1 0.969977
2 1.744463
3 3.073277
4 4.510589
5 5.471489
6 5.599175
7 7.03988
8 7.812655
9 8.913299
10 9.971481
11 9.98552
12 10.24078
13 10.59902
14 12.61131
15 12.63132
16 13.83974
17 16.03453
18 17.27271
19 19.25622
20 19.26901
[Plot : « Original Data », Y against X (both axes from 0 to 20), with the fitted line y = 0.9176x + 0.2072, R² = 0.9794]
45. How to test the reliability of trees ?
• Bootstrapping example : 2) Resampling
46. How to test the reliability of trees ?
• Bootstrapping example : 3) Analyse the Resamples
47. How to test the reliability of trees ?
• Bootstrapping example : 4) Assess the reliability of the original
estimates with the dispersion of the estimates of the resamples
[Plot : « Original Data + Bootstraps », Y against X (both axes from 0 to 20)]
48. How to test the reliability of trees ?
• BOOTSTRAPPING :
Taxon A : ATGCGAGTTTAGCAG
Taxon B : ATGCGAGCTTAACTG
Taxon C : ATACTAGCTTAGCTG
Taxon D : ATGCTATCTTAGGTG
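[Figure : four resampled alignments s1 to s4, each giving a tree on the taxa A, B, C, D, next to the original tree]
A+B : 4/4 = 100%
C+D : 3/4 = 75%
A+B+C+D : 4/4 = 100%
[Original tree annotated with these bootstrap values : 100 for A+B, 75 for C+D, 100 for A+B+C+D]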
49. Genetic distances
• A multitude of forces act on sequences (mutation, selection, drift)
and therefore two sequences coming from a common ancestor will
diverge with time
• The problem with counting the number of differences (p-distance) is
that it does not take into account multiple substitutions at the same
site
• Therefore we need to model the substitution process
=> time-homogeneous, stationary, continuous-time Markov process
53. Genetic distances
• How does the p-distance correlate with speciation time ?
When we look at the divergence of proteins in distantly related
organisms, we expect a linear relation (i.e. the more distant two
organisms are, the fewer identities they share)
=> correct in principle, but we always underestimate the genetic
distance if we only count the number of differences
54. Genetic distances
• How does the p-distance correlate with speciation time ?
[Plot : observed p difference (y axis, 0 to 0.8) against an x axis running from 0 to 3]
Non-linear relation because of multiple, parallel, and back-substitutions
55. Genetic distances
• How to model sequence evolution ? (Jukes and Cantor, 1969)
– All possible substitutions have the same probability
– All 4 nucleotides have the same frequency = 25%
– The chance for a particular substitution is a simple function of time
– The chance for a nucleotide to not change is therefore a decreasing
function of time
– Two random sequences (diverged for an infinite time) will still have 25%
identity (there are only 4 nucleotides)
56. Genetic distances
• The JC69 matrix :
to A C G T
from
A ¼+3/4*X ¼-1/4*X ¼-1/4*X ¼-1/4*X
C ¼-1/4*X ¼+3/4*X ¼-1/4*X ¼-1/4*X
G ¼-1/4*X ¼-1/4*X ¼+3/4*X ¼-1/4*X
T ¼-1/4*X ¼-1/4*X ¼-1/4*X ¼+3/4*X
X = e^{-μt}
Sums of columns = sums of rows : the rate of appearance of each nucleotide
is the same as its rate of disappearance (nucleotides are at equilibrium)
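The matrix can be built numerically to check its properties. In this sketch the function name `jc69_matrix` and the values of μ and t are arbitrary choices for illustration.

```python
import math

BASES = "ACGT"


def jc69_matrix(mu, t):
    """JC69 substitution probabilities after time t, as a nested dict P[from][to]."""
    x = math.exp(-mu * t)
    same = 0.25 + 0.75 * x   # probability of observing the same nucleotide
    other = 0.25 - 0.25 * x  # probability of each of the three other nucleotides
    return {a: {b: (same if a == b else other) for b in BASES} for a in BASES}


for t in (0.0, 0.5, 5.0):
    P = jc69_matrix(mu=1.0, t=t)
    row = " ".join(f"{P['A'][b]:.3f}" for b in BASES)
    print(f"t={t}: A -> A C G T = {row} (row sum {sum(P['A'].values()):.3f})")
```

Every row sums to 1, the matrix is the identity at t = 0, and all entries tend to 1/4 as t grows, which is the 25% identity expected between two random sequences.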
57. Genetic distances
• How to model sequence evolution ? (Jukes and Cantor, 1969)
– Example : we count 20 differences between two 100bp-long sequences
• d = -3/4 * ln( 1 - 4/3 * p )
• p = 0.2
• d = 0.233
• => about 23 substitutions are expected over the 100 sites, so about 3 mutations have occurred that we do not see, because
they occurred at a position where another mutation had already occurred
– Does this now efficiently model the substitution process ?
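The correction can be written as two small functions. This is a sketch: the function names are chosen here, and the sequences are assumed to be aligned, of equal length, and free of gaps or ambiguous characters.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def jc69_distance(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4/3 * p)."""
    if not 0 <= p < 0.75:
        raise ValueError("the JC69 correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p = 20 / 100  # 20 differences over 100 sites
d = jc69_distance(p)
print(f"p = {p:.3f}, d = {d:.3f}, hidden substitutions per 100 sites = {100 * (d - p):.1f}")
```

The correction is undefined for p ≥ 0.75, because 75% difference is what two unrelated sequences show under this model.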
58. Genetic distances
• How to model sequence evolution ? Some facts
– Transitions are more likely than transversions
Purines : A, G
Pyrimidines : C, T
Transitions : A ↔ G and C ↔ T ; transversions : any purine ↔ pyrimidine change
59. Genetic distances
• How to model sequence evolution ? Some facts
– Not all positions evolve at the same rate :
a substitution at the third codon position is
less likely to change the amino acid than a
substitution at the first or second position
60. Genetic distances
• How to model sequence evolution ? Some facts
– Not all positions evolve at the same rate :
some codons are under strong purifying
selection, while others are under
diversifying selection
=> they do not evolve at the same rate
61. Genetic distances
• How to model sequence evolution ?
– Better models have been designed that give each type of substitution
its own rate.
– Rate heterogeneity models take into account the differences between
positions. Some positions are allowed to evolve faster than others
– Genomes have their own nucleotide composition (GC-content)
62. Genetic distances
• Some models of nucleotide substitution
- JC69 : a=b=c=d=e=f
A=C=G=T=1/4
- K80 : b=e, a=c=d=f
A=C=G=T=1/4
- HKY85 : b=e, a=c=d=f
A ≠ C ≠ G ≠ T
- TN93 : b, e, a=c=d=f
A ≠ C ≠ G ≠ T
- GTR : a, b, c, d, e, f
A ≠ C ≠ G ≠ T
More models are possible
(12-parameters, codons) but
are generally not used
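As an illustration of a model that separates transitions from transversions, the sketch below counts both kinds of differences and applies the K80 (Kimura two-parameter) distance d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. The function names are chosen here, the two sequences are toy data, and the input is assumed to contain only A, C, G and T.

```python
import math

PURINES = {"A", "G"}


def ts_tv_proportions(seq_a, seq_b):
    """Proportions of transitions (P) and transversions (Q) between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    ts = tv = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        # Same chemical class (both purines or both pyrimidines): transition.
        if (x in PURINES) == (y in PURINES):
            ts += 1
        else:
            tv += 1
    return ts / len(seq_a), tv / len(seq_a)


def k80_distance(P, Q):
    """Kimura two-parameter distance from transition and transversion proportions."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


P, Q = ts_tv_proportions("ATGCGAGTTTAGCAG", "ATGCGAGCTTAACTG")
print(f"P = {P:.3f}, Q = {Q:.3f}, K80 distance = {k80_distance(P, Q):.3f}")
```

When transitions make up exactly one third of the differences (P = p/3, Q = 2p/3), the formula reduces to the JC69 distance.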
63. Genetic distances
• Site heterogeneity models
– Usually described as a Gamma distribution (discretized in 4 – 10
categories)
– An arbitrary proportion of invariant sites is sometimes added
[Plot : Gamma distributions with shape k = 1, 1.5, 3 and 5 ; x axis from 0 to 5, y axis from 0 to 1]
64. Genetic distances
• Which model to choose ?
– The simplest models make unrealistic assumptions
– Why don’t we always choose the most complex models ?
• Difficult to compute
• Parameter values difficult to estimate from the data
• Danger of overfitting
68. Genetic distances
• How to choose the appropriate model ?
– Likelihood ratio tests for nested models (more on that later)
– Many information criteria have been designed (more on that later also)
– Trial and error, depends on the dataset
• more data -> more complex models
• better data -> more complex models
• Literature search
69. File formats (flat files)
• FASTA (.fas, .fst, .fasta)
Most common sequence format, no header
>seq1
ATCGTGCATACGAGCT
>seq2
ATCGTGCATACGACGT
>seq3
ATCGTGCATACGAAGT
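Reading this format takes only a few lines of Python. The sketch below is a minimal parser written for illustration (the function name `parse_fasta` is chosen here); it accepts sequences wrapped over several lines and ignores blank lines.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


example = """>seq1
ATCGTGCATACGAGCT
>seq2
ATCGTGCATACGACGT
>seq3
ATCGTGCATACGAAGT
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```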
70. File formats (flat files)
• NEXUS (.nex)
Contains blocks with sequence and tree information
#NEXUS
Begin Data;
Dimensions ntax=3 nchar=16;
Format datatype=Nucleotide gap=-;
[insert comment here]
Matrix
seq1 ATCGTGCATACGAGCT
seq2 ATCGTGCATACGACGT
seq3 ATCGTGCATACGAAGT
;
End;
71. Practicals : Phylogenetics Part I
1. Download the file « PrimatesNuc_1.txt». Open it and identify its format. Rename it
with the correct extension.
2. Load the file in BioEdit and run a multiple alignment (select all sequences then click
« Accessory application -> ClustalW multiple alignment »). Save the resulting file
3. Load the original file in Seaview and check alignment options (Align -> Alignment
options). Select ClustalW2 and run a multiple alignment (Align all). Save the resulting
file. Then, reload the original data, change the option to Muscle and run the alignment
again. Save this file too
4. To generate a consistency-based alignment with T-COFFEE, access the web page
http://www.tcoffee.org/, submit the original data and save the resulting alignment
5. To generate an alignment with MAFFT, access the web page
http://mafft.cbrc.jp/alignment/server/index.html, submit the original data with the
default options, and save the resulting alignment
6. Now we can compare the alignments obtained by the different methods. Access the
web page http://bibiserv.techfak.uni-bielefeld.de/altavist/, select option 2 for
comparing two alignments and compare the different alignments you produced. Which
alignment is the most different ? Which are the most similar ? Can you guess why ?
Open the alignments in BioEdit and spot the differences.
72. Practicals : Phylogenetics Part I
7. Open MEGA. Import the MAFFT alignment. Open it as « analyse », consider it
« nucleotides », « coding sequence », with the standard genetic code. Press F4 to
open the alignment explorer. Try the different options in the « Statistics » menu.
8. In the « Models » menu, select « Find Best DNA/Protein Model (ML) ». Leave the
default options and run. Which model has the best likelihood ? Which model is the
most appropriate ?
9. Go to the « Distances » menu and select « Compute pairwise distances ». Now
select the proper options for this analysis (substitution model and site heterogeneity
model). Are chimps closer to humans or to gorillas ? (you might need to export the
data to Excel)
10. Go to the « Phylogeny » menu and select « Construct/Test UPGMA Tree ». Leave
the default options and compute. Does the human/chimp/gorilla clustering fit with
your knowledge ? Redo the analysis with appropriate options (substitution model
and site heterogeneity model). Does it get any better ?
11. Go to the « Phylogeny » menu and select « Construct/Test Neighbor-Joining Tree ».
Select the appropriate options and compute the tree. Do chimps cluster with
humans or gorillas ? Be ready to explain your answer
12. Try the same with the ClustalW alignment. Draw some conclusions for yourself
73. Practicals : Phylogenetics Part I
13. Which of these 4 unrooted trees does not have the same topology as the 3 other
ones ?