3. Molecular Evolution
At the molecular level, evolution is a process of
mutation with selection.
Molecular evolution is the study of changes in genes
and proteins across the different branches of the tree of
life; phylogeny is the inference of evolutionary
relationships.
Studies of molecular evolution began with the first
sequencing of proteins in the 1950s.
4. In 1953 Frederick Sanger and colleagues
determined the primary amino acid sequence of
insulin.
7. Biologists estimate that between about 5 and 100 million species of organisms
live on Earth today. Evidence from morphological, biochemical, and gene
sequence data suggests that all organisms on Earth are genetically related, and
their genealogical relationships can be shown by a vast phylogenetic tree, the
Tree of Life, a metaphor dating back over 100 years. At the leaves of this giant
tree are the species that are alive today. If we could trace their history back
down the branches of the Tree of Life, we would encounter their ancestors, which
lived thousands or millions of years ago.
A phylogenetic or evolutionary tree is a data structure
showing evolutionary relationships between biological species or other entities.
These relationships are called the phylogeny of the species and are based upon
differences or similarities in their genetic or physical characteristics.
Phylogeny could be considered the ‘proof’ of
evolution, and evolution is the central unifying principle of biology.
Phylogeny has several biological applications, such as biological classification.
8. Phylogeny also contributes to our understanding of life, of
biochemistry, and of evolution. Many biotechnological applications also benefit from
studies of phylogeny, and its applications in the field of medicine may directly impact
patients’ lives. Phylogeny and its methods, such as cladistics, are important areas of
science.
The main objective of the thesis is to present a detailed survey of the methods that
can be applied to construct a phylogenetic tree from given input sequences. We will also
discuss the effectiveness of these methods in practical use. The first and most critical
step in the procedure is to find a suitable sequence alignment, which provides the data
used for the tree construction. In Chapter 2, the most common sequence alignment methods,
which are based on dynamic programming, will be described. Chapter 3 focuses on the
different methods available in the literature for constructing a phylogenetic tree. The
most important and widely used methods will be described in detail, such as Neighbor
Joining and Maximum Parsimony, together with problems related to the latter (perfect
phylogeny, the Steiner tree problem in phylogeny).
A linear programming approach for solving the maximum parsimony problem will also be
discussed.
9. What is Phylogenetic analysis?
A phylogenetic tree is a tool used to visually
represent evolutionary relationships among a group
of organisms or among a family of related nucleic acid
or protein sequences.
The study of these phylogenetic trees is known as
phylogenetic analysis. It uses phylogenies as
analytical frameworks for a rigorous understanding
of the evolution of traits or conditions of
interest, and it involves inferring the branching order,
and ultimately the evolutionary relationships,
between “taxa” (entities such as genes,
populations, species, etc.).
10. Components of phylogenetic tree
Root: the common ancestor of all taxa.
Branch: reflects the relationship
between taxa according to descent and
ancestry.
Branch length: indicates the number
of changes that have occurred in the
branch.
Node: a taxonomic unit identifying
either an existing or an extinct species.
Clade: a subgroup of two or more taxa
or DNA/protein sequences that
includes both their common ancestor
and all of their descendants.
Distance scale: scale that represents
the number of differences between
organisms or sequences.
Topology: defines the branching
patterns of the tree.
11. Tree Nomenclature
Evolutionary relationships
between various organisms are
depicted in the form of a
tree.
A phylogenetic tree consists
of nodes, branches and leaves.
The leaves, also called terminal
nodes or operational
taxonomic units (OTUs), are
gene/protein sequences of
present-day organisms.
Internal nodes represent
hypothetical ancestors of
the internal nodes and/or
OTUs that descend from them.
Branches connect the OTUs
and the internal nodes.
12. Scaled & Unscaled Trees
In unscaled trees, the branch length is not proportional to
the number of changes that have occurred.
Such trees are called “CLADOGRAMS”.
Cladograms are useful when many OTUs are present or
when you want to indicate the time elapsed since the
divergence events.
13. Scaled & Unscaled Trees
• In scaled trees, the
branch lengths are
proportional to the
number of amino
acid/ nucleotide
changes.
• Such a tree is called
a “PHYLOGRAM”.
15. Clade
A subgroup of
two or more taxa
that have been
derived from a
common
ancestor, plus
that ancestor and all of its
descendants.
16. Tree Roots
The root of a phylogenetic tree represents the common ancestor of all
the sequences.
Many tree-generating algorithms do not produce rooted trees.
The alternative to a rooted tree is an unrooted tree. An unrooted tree does
not give a complete description of the evolutionary path taken by the various
sequences with respect to each other: the direction of time is not
determined.
17. Rooting of Trees
An unrooted tree can be rooted.
The principal way to root a tree is to specify
an outgroup. Outgroup: an OTU that is
the least related to the group of taxa that
you are studying.
A second way to place a root is
midpoint rooting. Here the longest path between
two OTUs is determined, and its midpoint is presumed
to be the most reasonable site for a root.
19. Enumerating Trees
Only one tree correctly represents the
evolutionary relationships between the
sequences, and the task is to find it.
Some phylogenetic methods work by
evaluating all possible trees.
20. Consider just 3 and 4 OTU’S.
For three OTU’s there is only one possible unrooted tree
while for 4 OTU’s there are 3 possible unrooted trees.
21. The number of possible unrooted trees (NU) for n OTUs (n ≥ 3):
$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
The number of rooted trees (NR) for n OTUs (n ≥ 2):
$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

Number of OTUs   Number of rooted trees   Number of unrooted trees
2                1                        1
3                3                        1
4                15                       3
5                105                      15
10               34,459,425               2,027,025
15               213,458,046,676,875      8 × 10^12
20               8 × 10^21                2 × 10^20
Here you can see that for 10 OTUs
there are over 2 million unrooted
trees and over 34 million rooted trees.
Hence, as the number of sequences
increases, heuristic methods become
essential to keep the
calculations tractable.
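The counts in the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming strictly bifurcating trees; the function names are illustrative and not from the original slides. It relies on the identity that (2n-5)!/(2^(n-3)(n-3)!) is the product of the odd numbers 1 × 3 × ... × (2n-5), and likewise for rooted trees up to (2n-3).

```python
def odd_product(k: int) -> int:
    """Return 1 * 3 * 5 * ... * k for odd k (returns 1 when k <= 1)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def count_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 3 OTUs."""
    if n < 3:
        raise ValueError("the unrooted formula applies to n >= 3")
    return odd_product(2 * n - 5)


def count_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 OTUs."""
    if n < 2:
        raise ValueError("the rooted formula applies to n >= 2")
    return odd_product(2 * n - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 15, 20):
        print(n, count_rooted_trees(n), count_unrooted_trees(n))
```

For 10 OTUs this gives 2,027,025 unrooted and 34,459,425 rooted trees, which is why exhaustive evaluation quickly becomes impractical.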
22. Species Tree vs. Gene or
Protein Tree
In a gene tree, an internal
node represents the
divergence of an ancestral
gene into two genes with
distinct sequences, while in
a species tree an internal
node represents a speciation
event.
Divergence of genes may
have occurred before, during,
or after the speciation
event.
23. Steps of Phylogenetics
Choosing a marker
Multiple Sequence Alignment of homologous
nucleic acid sequences
Tree building using phylogenetic methods
Tree evaluation
24. Step 1: Choosing a marker
Synonymous & non synonymous substitutions
Multiple substitutions
Non coding regions
Pseudogenes
Transitions & Transversions
25. Sequence Alignments
In bioinformatics, sequence alignment is a method for
arranging sequences of DNA, RNA, or protein to identify
regions of similarity that may be a consequence of
functional, structural, or evolutionary relationships
between them [27]. These sequences, DNA for example,
carry the information of living cells.
When a cell divides, its DNA molecules are duplicated,
and both daughter cells contain a copy
of the original cell’s DNA. However, this replication is not
perfect, because the stored information can change due to
random mutations. These mutations are the source of the
vast diversity of the species living on Earth.
26. Pairwise sequence alignment
Given two sequences as input, we can ask how many mutations are needed to
transform one into the other. There are several types of mutations, and some
occur more frequently than others. We assign weights to the different types of
mutations: frequent mutations get lower weights, rare mutations get greater
weights. The weight of a series of mutations is defined as the sum of the weights
of the individual mutations. The task is to find the minimum-weight series of
mutations that transforms one sequence into the other. An important question is
how to find such a minimum-weight series quickly. A brute-force algorithm
enumerates all possible series of mutations and then chooses the one of minimum
weight, but it is too slow, since the number of possible series of mutations
grows exponentially. Therefore, dynamic programming is used to find a globally
optimal solution.
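The minimum-weight series of mutations can be computed with a table filled row by row. The sketch below is one simple way to do it, assuming only two mutation types (substitution and single-residue insertion/deletion). The weights `sub_w` and `indel_w` are hypothetical defaults chosen for illustration, not values from the slides.

```python
def min_mutation_weight(a: str, b: str, sub_w: float = 1.0, indel_w: float = 2.0) -> float:
    """Minimum total weight of substitutions and indels turning a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # cost[i][j] = minimum weight to turn a[:i] into b[:j]
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * indel_w          # delete the first i residues of a
    for j in range(1, cols):
        cost[0][j] = j * indel_w          # insert the first j residues of b
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0.0 if same else sub_w),  # match or substitute
                cost[i - 1][j] + indel_w,                       # delete from a
                cost[i][j - 1] + indel_w,                       # insert into a
            )
    return cost[-1][-1]


if __name__ == "__main__":
    print(min_mutation_weight("AGGC", "AGCC"))  # one substitution
```

The table has (len(a) + 1) × (len(b) + 1) cells and each is filled once, so the running time grows with the product of the sequence lengths rather than exponentially.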
27. Dynamic programming
Dynamic programming is often used to solve
optimization problems, such as finding
the shortest path between two points. Instead of solving
the original problem directly, it defines
and solves a set of new, smaller subproblems, then
combines the solutions of these
subproblems to reach an overall solution.
It reduces the number of computations by never
solving a subproblem more than once:
it solves each subproblem only once and stores the
result. The next time the solution
of a subproblem is needed, it is simply looked up.
This method is especially useful
when the same subproblems recur so often that a naive
recursion would take exponential time.
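The "solve once, store, look up" idea can be illustrated with a small shortest-path example. This is a sketch only: the 3 × 3 grid of step costs is made up for illustration, movement is restricted to right and down, and Python's `functools.lru_cache` stands in for the table of stored results.

```python
from functools import lru_cache

# Hypothetical step costs; entry [r][c] is the cost of entering cell (r, c).
GRID = (
    (1, 3, 1),
    (1, 5, 1),
    (4, 2, 1),
)


@lru_cache(maxsize=None)
def cheapest_path(row: int, col: int) -> int:
    """Minimum cost of reaching (row, col) from (0, 0) moving only right or down."""
    if row == 0 and col == 0:
        return GRID[0][0]
    candidates = []
    if row > 0:
        candidates.append(cheapest_path(row - 1, col))   # arrive from above
    if col > 0:
        candidates.append(cheapest_path(row, col - 1))   # arrive from the left
    return GRID[row][col] + min(candidates)


if __name__ == "__main__":
    print(cheapest_path(2, 2))
    print(cheapest_path.cache_info())
```

Each of the nine cells is computed exactly once; repeated requests for the same cell are served from the cache.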
28. • From the resulting MSA, sequence homology can be
inferred and phylogenetic analysis can be conducted to
assess the sequences' shared evolutionary origins.
• Multiple sequence alignment is often used to assess
sequence conservation of protein domains, tertiary and
secondary structures, and even individual amino acids or
nucleotides.
The first step in MSA is to choose the right sequences.
Choose sequences that are homologous. The greater the
dissimilarity between sequences, the greater the chance
that they are not homologous.
Run a BLAST search to refine the selection. If a sequence is
not apparently homologous, it should be removed.
29. Sometimes sequences are
distantly related and yet homologous.
In this case, lower the gap creation/gap extension
penalties in order to accommodate such sequences in
the MSA.
Then perform the MSA using a program such as ClustalW.
Gaps in the alignment represent insertions and
deletions. Most of the phylogeny programs are not
equipped to evaluate insertions and deletions. Many
experts recommend that any column of MSA that
includes a gap in any position should be deleted.
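Removing every column that contains a gap is straightforward to sketch. The example assumes the alignment is given as a list of equal-length strings and that `-` is the gap character; both are assumptions made for illustration.

```python
def strip_gapped_columns(alignment: list[str], gap: str = "-") -> list[str]:
    """Return the alignment with every column containing a gap removed."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Keep only the column indices where no sequence has a gap.
    keep = [i for i in range(length) if all(row[i] != gap for row in alignment)]
    return ["".join(row[i] for i in keep) for row in alignment]


if __name__ == "__main__":
    print(strip_gapped_columns(["AC-GT", "ACTGT", "A-TGT"]))
```

Note that this discards information: in the small example two of the five columns are lost, so the approach is best suited to alignments with few gaps.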
30. MSAs require more sophisticated methodologies than
pairwise alignment because they are more
computationally complex to produce. Most multiple
sequence alignment programs use heuristic methods
rather than global optimization because identifying the
optimal alignment between more than a few sequences of
moderate length is prohibitively computationally
expensive.
Multiple Sequence Alignments use a number of
homologous sequences.
There are exhaustive and heuristic approaches used in
multiple sequence alignment.
31. Step 2: Multiple Sequence Alignment
It is a critical step.
A multiple sequence alignment (MSA) is a
sequence alignment of three or more biological
sequences, generally protein, DNA, or RNA. In
many cases, the input set of query sequences
is assumed to have an evolutionary
relationship by which the sequences share a lineage and
are descended from a common ancestor.
32. Exhaustive Algorithms
The exhaustive alignment method involves examining all possible
aligned positions simultaneously.
DP for pairwise sequence alignment (PSA) uses a 2D matrix to search
for an optimal alignment; to use DP for MSA, extra dimensions are
needed to take all possible ways of sequence matching into
consideration. This means establishing a multidimensional search
matrix.
For aligning N sequences, an N-dimensional matrix needs to be filled
with the alignment scores. Hence the amount of computational
time and memory required increases exponentially with the
number of sequences.
For this reason, full DP is limited to small datasets of fewer than 10 short
sequences.
DCA (Divide and Conquer Algorithm) is classed among the exhaustive
algorithms.
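A quick calculation shows how fast the search matrix grows. This sketch assumes the usual DP layout with one extra row or column per sequence for the empty prefix, so a sequence of length L contributes L + 1 positions along its dimension; the function name is illustrative.

```python
from math import prod


def dp_cells(lengths: list[int]) -> int:
    """Number of cells in the N-dimensional DP matrix for sequences of the given lengths."""
    return prod(length + 1 for length in lengths)


if __name__ == "__main__":
    for n in (2, 3, 5, 10):
        print(n, dp_cells([100] * n))
```

Two sequences of length 100 need about ten thousand cells; ten such sequences need 101^10, which is roughly 10^20 cells.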
33. Divide-and-conquer
approach
Many useful algorithms are recursive in structure: to solve
a given problem, they call themselves recursively one or
more times to deal with closely related subproblems.
These algorithms typically follow a divide-and-conquer
approach: they break the problem into several
subproblems that are similar to the original problem but
smaller in size, solve the subproblems recursively, and
then combine these solutions to create a solution to the
original problem.
34. Divide-and-conquer
approach (Cont.)
The divide-and-conquer paradigm
involves three steps at each level of
the recursion:
Divide the problem into a number
of subproblems.
Conquer the subproblems by
solving them recursively. If the
subproblem sizes are small enough,
however, just solve the subproblems
in a straightforward manner.
Combine the solutions to the
subproblems into the solution for
the original problem.
35. Heuristic Algorithms
Because the use of DP is not feasible for routine MSA,
faster heuristic algorithms have been developed.
Heuristic algorithms fall into three categories:
progressive alignment, iterative alignment, and
block-based alignment.
36. Progressive alignment construction
The most widely used approach to multiple sequence alignments uses
a heuristic search known as progressive technique (also known as the
hierarchical or tree method), that builds up a final MSA by combining
pairwise alignments beginning with the most similar pair and
progressing to the most distantly related.
All progressive alignment methods require two stages: a first stage in
which the relationships between the sequences are represented as a
tree, called a guide tree, and a second step in which the MSA is built
by adding the sequences sequentially to the growing MSA according
to the guide tree.
The initial guide tree is determined by an efficient clustering method
such as neighbor-joining or UPGMA, and may use distances based on
the number of identical two-letter subsequences (as in FASTA) rather
than on a dynamic programming alignment.
37. Progressive alignments are not guaranteed to be globally optimal. The
primary problem is that when errors are made at any stage
in growing the MSA, these errors are propagated
through to the final result. Performance is also particularly
bad when all of the sequences in the set are rather
distantly related.
Progressive alignment methods are efficient enough to
implement on a large scale for many (100s to 1000s)
sequences. Progressive alignment services are commonly
available on publicly accessible web servers so users need
not locally install the applications of interest. The most
popular progressive alignment method has been the
Clustal family, especially the weighted variant ClustalW.
41. ClustalW
www.ebi.ac.uk/clustalw/
Clustal is a widely used progressive multiple sequence alignment program,
available either as a stand-alone program or online.
The latest version is 2.0. There are two main variations:
ClustalW: command-line interface
ClustalX: graphical user interface; available for
Windows, Mac OS, and Unix/Linux.
The program is available from the Clustal homepage or the European
Bioinformatics Institute FTP server.
There are three main steps:
Perform pairwise alignments
Create a phylogenetic tree (or use a user-defined tree)
Use the phylogenetic tree to carry out a multiple alignment
44. Step 3: Tree Building
(Phylogenetic Methods)
Distance based:
UPGMA
Neighbour-Joining
Character based:
Maximum Parsimony
Maximum Likelihood
46. Distance Based Methods
Begins the construction of a tree by calculating
the distance between the molecular sequences.
Produces one phylogenetic tree
Computationally fast
47. Tree Building - Approach
The simplest approach
-- align pairs of sequences
-- count the number of differences.
HAMMING DISTANCE - the degree of
divergence (D), expressed as a percentage:
D = (n / N) × 100
where n = number of sites with differences
N = length of the alignment
48. Example of Hamming Distance
1. AGGCCATGAATTAAGAATAA
2. AGCCCATGGATAAAGAGTAA
3. AGGACATGAATTAAGAATAA
4. AAGCCAAGAATTACGAATAA
Here, the evolutionary distance is expressed as the
number of nucleotide differences per site for each
sequence pair.
49. Distance Matrix-Example
       1      2      3      4
1      -     0.2    0.05   0.15
2             -     0.25   0.35
3                    -     0.2
4                           -
Sequences 1 and 2 are 20 nucleotides in length and have
four differences, corresponding to an evolutionary
distance of 4/20 = 0.2.
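The pairwise distances can be checked by counting mismatches directly. This is a minimal sketch using the four example sequences; the function names are illustrative, and the distance is returned as a proportion (n/N) rather than a percentage.

```python
from itertools import combinations

SEQS = {
    1: "AGGCCATGAATTAAGAATAA",
    2: "AGCCCATGGATAAAGAGTAA",
    3: "AGGACATGAATTAAGAATAA",
    4: "AAGCCAAGAATTACGAATAA",
}


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_matrix(seqs: dict[int, str]) -> dict[tuple[int, int], float]:
    """Upper-triangle distance matrix keyed by (i, j) with i < j."""
    return {
        (i, j): p_distance(seqs[i], seqs[j])
        for i, j in combinations(sorted(seqs), 2)
    }


if __name__ == "__main__":
    for pair, d in distance_matrix(SEQS).items():
        print(pair, round(d, 2))
```

Sequences 1 and 2 differ at 4 of 20 sites, so `p_distance` returns 0.2 for that pair; multiplying by 100 gives the percentage form D.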
50. UPGMA
Unweighted – Pair – Group – Method –with
Arithmetic mean
Oldest Distance Method
Proposed by Michener & Sokal in 1957
Produces rooted trees.
It assumes that the tree is ultrametric,
meaning that the rate of substitution is
constant in all branches of the tree.
51. UPGMA - Working
Employs a sequential clustering algorithm
I. Identify the two OTU’s from among all the OTUs, that are
most similar to each other and then treat these as a new
single OTU.
II. Subsequently from among the new group of OTUs,
identify the pair with the highest similarity, and so on.
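The sequential clustering can be sketched in a few lines. This is a simplified illustration, not a full implementation: it returns only the branching order as nested tuples and omits branch lengths. The input format (a dict of pairwise distances keyed by label pairs) and the example distances, computed from the four sequences of the Hamming distance example, are choices made for this sketch.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster OTUs by UPGMA and return the tree as nested tuples.

    labels: list of OTU names.
    dist:   dict mapping (a, b) label pairs to distances (either order).
    """
    d = {}
    for (a, b), value in dist.items():
        d[(a, b)] = d[(b, a)] = value
    sizes = {label: 1 for label in labels}   # cluster -> number of OTUs inside it
    while len(sizes) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[pair])
        merged = (a, b)
        for other in sizes:
            if other != a and other != b:
                # Size-weighted mean = arithmetic mean over all OTU pairs.
                avg = (sizes[a] * d[(a, other)] + sizes[b] * d[(b, other)]) / (
                    sizes[a] + sizes[b]
                )
                d[(merged, other)] = d[(other, merged)] = avg
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Per-site distances computed from the four example sequences.
EXAMPLE = {(1, 2): 0.2, (1, 3): 0.05, (1, 4): 0.15,
           (2, 3): 0.25, (2, 4): 0.35, (3, 4): 0.2}

if __name__ == "__main__":
    print(upgma([1, 2, 3, 4], EXAMPLE))
```

With these distances, sequences 1 and 3 are joined first, then 4 is added, and 2 joins last.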
52. Pros and Cons
Advantage
Fast
Can handle many sequences
Disadvantage
Cannot be used when rates of substitutions are
unequal
Does not consider multiple substitutions.
55. Neighbor-Joining method
Developed by Saitou and Nei, 1987
Method for re-constructing phylogenetic trees and
computing the lengths of the branches of the tree.
Related to the cluster method
Does not require the data to be ultrametric.
Produces Unrooted Trees.
56. Steps Involved
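Starts with all species in a
simple star-shaped tree.
The pair of OTUs whose joining
gives the smallest total branch
length (not simply the pair with
the least distance) is joined
with a new internal node.
Distances from the new node to
the remaining OTUs are recalculated.
The process is repeated until
all other OTUs are
connected to internal
nodes and only 2 nodes are
left at the end.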
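In the usual formulation of neighbor-joining, the pair to join is chosen by the criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from OTU i to all others. The sketch below shows only this selection step, not the full algorithm. The four-taxon distances are a made-up example derived from a tree in which A and B are neighbors on branches of unequal length.

```python
from itertools import combinations


def nj_pair(labels, d):
    """Return the pair (i, j) that minimizes the neighbor-joining Q criterion."""
    n = len(labels)

    def dist(a, b):
        if a == b:
            return 0
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    # r[a] = sum of distances from a to every other OTU.
    r = {a: sum(dist(a, b) for b in labels) for a in labels}
    return min(
        combinations(labels, 2),
        key=lambda p: (n - 2) * dist(*p) - r[p[0]] - r[p[1]],
    )


# Hypothetical distances from the tree ((A:1, B:4):1, (C:1, D:4)).
EXAMPLE_D = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
             ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}

if __name__ == "__main__":
    print(nj_pair(["A", "B", "C", "D"], EXAMPLE_D))
    print(min(EXAMPLE_D, key=EXAMPLE_D.get))
```

In this example the smallest raw distance is between A and C, yet the Q criterion selects A and B, the true neighbors. This is what lets the method cope with lineages that evolve at different rates.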
61. Pros and Cons
Advantages
is fast and thus suited for large datasets and for
bootstrap analysis.
permits lineages with largely different branch lengths.
its suitability to handle large datasets has led to the fact
that the method is widely used by molecular
evolutionists.
Disadvantages
sequence information is reduced
gives only one possible tree
strongly dependent on the model of evolution used.
62. • Character-based methods evaluate candidate trees using
relationships inferred directly from the sequence
alignment.
• These approaches are distinct from distance-based
methods because they do not involve an intermediate
summary of the sequence data in the form of a distance
matrix, although they still depend on the alignment.
• There are two character-based methods:
– Maximum parsimony
– Maximum likelihood
63. Maximum Parsimony
(Fitch, 1977)
Parsimony – carefulness in the use of resources.
The basic underlying principle behind parsimony is given
by Occam’s Razor:
“Given a choice between – a hard and easy way of doing
things, nature will always pick the easiest way i.e. simple
is always preferred over complex.”
Parsimony assumes that the relationship that requires the
fewest number of mutations to explain the current state
of sequences being considered is the relationship that is
most likely to be correct.
64. Parsimony
The concept of parsimony is at the heart of all character-
based methods of phylogenetic reconstruction.
The 2 fundamental ideas of biological parsimony are:
mutations are exceedingly rare events ;
the more unlikely events a model invokes, the less
likely the model is to be correct.
As a result, the relationship that requires the fewest
number of mutations to explain the current state of the
sequences being considered, is the relationship that is
most likely to be correct.
65. Example
Multiple sequence alignment, for a parsimony approach, contains positions
that fall into two categories in terms of their information content : those that
have information (are informative) and those that do not (are uninformative).
Example:
seq 1 2 3 4 5 6
1 G G G G G G
2 G G G A G T
3 G G A T A G
4 G A T C A T
Position 1 is said to be invariant and is therefore uninformative, because all trees
invoke the same number of mutations (0);
Position 2 is uninformative because 1 mutation occurs in all three possible
trees;
Position 3 is uninformative because 2 mutations occur in all possible trees;
Position 4 is uninformative because it requires 3 mutations in all possible
trees.
Positions 5 and 6 are informative, because one of the trees invokes only one
mutation and the other 2 alternative trees both require 2 mutations.
In general, for a position to be informative regardless of how many sequences
are aligned, it has to have at least 2 different nucleotides, and each of these
nucleotides has to be present at least twice.
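The rule for informative sites is easy to apply mechanically. The sketch below runs it on the six-position example alignment; the function name is illustrative, and gap characters are not treated specially.

```python
from collections import Counter

ALIGNMENT = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]  # sequences 1 to 4


def is_informative(column: str) -> bool:
    """True if at least two different characters each occur at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_positions(alignment: list[str]) -> list[int]:
    """1-based positions of the parsimony-informative columns."""
    columns = ["".join(col) for col in zip(*alignment)]
    return [i + 1 for i, col in enumerate(columns) if is_informative(col)]


if __name__ == "__main__":
    print(informative_positions(ALIGNMENT))
```

For the example alignment only positions 5 and 6 pass the test, matching the discussion above.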
66. [Figure: the three possible unrooted trees relating sequences 1 to 4, drawn for each of alignment positions 1 to 6, with the inferred nucleotides at the internal nodes.]
67. The maximum parsimony algorithm searches for the
minimum number of genetic events (nucleotide
substitutions or amino acids changes) to infer the
most parsimonious tree from a set of sequences.
The best tree is one which needs fewest changes.
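For a fixed tree, the minimum number of changes at one alignment position can be counted with Fitch's set-based procedure. This is a sketch under simplifying assumptions: trees are written as nested tuples of sequence numbers, each four-taxon tree is rooted on its internal branch (which does not change the count), and all substitutions cost the same.

```python
ALIGNMENT = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]  # sequences 1 to 4

# The three possible unrooted trees for four sequences.
TOPOLOGIES = {
    "(1,2),(3,4)": ((1, 2), (3, 4)),
    "(1,3),(2,4)": ((1, 3), (2, 4)),
    "(1,4),(2,3)": ((1, 4), (2, 3)),
}


def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for a subtree."""
    if not isinstance(tree, tuple):          # a leaf: its observed state, no change
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def tree_lengths(alignment):
    """Total number of changes each topology needs over all columns."""
    totals = {name: 0 for name in TOPOLOGIES}
    for column in zip(*alignment):
        states = {i + 1: base for i, base in enumerate(column)}
        for name, tree in TOPOLOGIES.items():
            totals[name] += fitch(tree, states)[1]
    return totals


if __name__ == "__main__":
    print(tree_lengths(ALIGNMENT))
```

On the example alignment the trees grouping (1,2) and (1,3) each need 9 changes and the tree grouping (1,4) needs 10, so positions 5 and 6 alone cannot decide between the first two.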
68. Positive and negative points
Maximum Parsimony (positive points):
does not reduce sequence information to a single number
tries to provide information on the ancestral sequences
evaluates different trees
Maximum Parsimony (negative points):
is slow in comparison with distance methods
does not use all the sequence information (only
informative sites are used)
does not correct for multiple mutations (does not imply a
model of evolution)
does not provide information on the branch lengths
70. Bootstrapping
Bootstrap analysis:
is a statistical method for obtaining an estimate of
error.
is used to evaluate the reliability of a tree
is used to examine how often a particular cluster in a
tree appears when nucleotides or amino acids are
resampled.
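The resampling step can be sketched as drawing alignment columns with replacement. In the sketch below, `supports_clade` is a hypothetical callback supplied by the user: it should build a tree from a pseudo-replicate alignment with whatever method is being evaluated and report whether the cluster of interest is present. The default replicate count and seed are arbitrary.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[i] for i in picks) for row in alignment]


def clade_support(alignment, supports_clade, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which supports_clade(...) is True."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    hits = sum(
        1 for _ in range(replicates)
        if supports_clade(bootstrap_alignment(alignment, rng))
    )
    return hits / replicates


if __name__ == "__main__":
    aln = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]
    print(bootstrap_alignment(aln, random.Random(1)))
```

Each pseudo-replicate has the same number of sequences and columns as the original; only the column composition changes.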
71. Maximum likelihood
This approach is purely statistical.
Probabilities are considered for every individual nucleotide
substitution in a set of aligned sequences.
For example, since transitions are observed roughly 3 times as often as
transversions, at a site where two sequences have C and T and a third
has G, it can reasonably be argued that the sequences with C and T are
more likely to be closely related to each other than either is to the
sequence with G.
Calculation of probabilities is complicated by the fact that
the sequence of the common ancestor of the sequences
considered is unknown.
Furthermore, multiple substitutions may have occurred at
one or more sites, and all sites are not necessarily
independent or equivalent.
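The distinction between transitions and transversions can be made concrete by classifying the differences between two aligned DNA sequences. This sketch covers only that counting step, not a likelihood calculation; sites containing anything other than A, C, G or T are skipped, which is an assumption made here for simplicity.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(a: str, b: str) -> tuple[int, int]:
    """Return (transitions, transversions) between two aligned DNA sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y or x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
            continue                      # identical site or non-standard character
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1              # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1            # purine <-> pyrimidine
    return transitions, transversions


if __name__ == "__main__":
    print(count_ts_tv("AGGCCATGAATTAAGAATAA", "AGCCCATGGATAAAGAGTAA"))
```

A C-to-T difference counts as a transition and a C-to-G difference as a transversion, which is the basis of the argument about C, T and G above.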
72. Maximum likelihood (Contd.)
Notes:
1. This is the best justified method from a theoretical
viewpoint;
2. ML estimates the branch lengths of the final tree ;
3. ML methods are usually consistent ;
4. Sequence simulation experiments have shown that
this method works better than all others in most
cases.
Drawbacks : they need long computation time to
construct a tree.
73. Applications
1. Taxonomy
Can infer how to classify organisms
Reflects evolution
Used for classifications
Host and parasite
Can look at co-evolution of any two related organisms
Examine phylogeny of hosts and parasites
Look at how similar the two phylogenies are
Can also look at host switching
Other examples are plant and insect interactions,
humans and pathogens (such as HIV)
74. Advantages and disadvantages of character
based methods
Advantages
MP tries to provide information on the ancestral
sequences
ML tends to outperform alternative methods such
as parsimony or distance methods even with very
short sequences
Disadvantages
Slow in comparison with distance methods
MP does not use all the sequence information
ML result is dependent on the model of evolution
used
75. Conclusion
Molecular Phylogeny
Important tool to understand the evolution and
relationships between various species
Increase in molecular data & visual impact of tree
has made phylogenetics increasingly important
77. Applications
There is a wide array of applications of
phylogenetic analysis, which include:
Evolution studies
Medical research and epidemiology
Ecology
Criminal studies
Finding orthologs and paralogs
78. 1. Medicinal field - to make predictions
about poorly studied species
Phylogenies also allow us to make predictions about
the characteristics of living organisms that
we have not yet studied.
E.g., the Pacific yew produces taxol, a compound
that is helpful in treating certain kinds
of cancer.
However, it was expensive to extract
the compound from this tree.
Phylogenetic analysis pointed to the related
European yew, whose leaves contain a related
compound that can also be used to
produce taxol efficiently.
Pacific yew
European yew
79. 2. Evolution studies : Tracking SARS
back to its source
The previously unknown SARS virus
generated widespread panic in 2002 and
2003.
In 2003, it was suspected that it had
jumped to humans from cat-like mammals
called civets.
In 2005, researchers found SARS-like viruses
in Chinese horseshoe bats, and reconstruction
of the evolutionary history of the virus
led to the conclusion that bats were the
source.
SARS virus genetic material (RNA) was
collected from different sources and a tree
was generated.
A civet
A horseshoe bat
80. The tree showed that
civet and human SARS
viruses are very
similar to each other
and that both are
nested within a clade
of bat viruses - so the
ancestor of the civet
and human strains
seems to have been a
bat virus.
81. 3. In Criminal Studies : Evidence of HIV-1
transmission
A gastroenterologist was convicted of attempted
murder for injecting his former girlfriend with
blood or blood products obtained from an HIV
type 1 infected patient under his care.
Phylogenetic analyses of HIV-1 sequences were
admitted and used as evidence in this case,
representing the first use of phylogenetic analyses
in a criminal court case in the United States.
82. Phylogenetic analyses
of HIV-1 isolated from
the victim, the
patient, and a local
population sample of
HIV-1-positive
individuals showed
the victim's HIV-1
sequences to be most
closely related to and
nested within a
lineage comprised of
the patient's HIV-1
sequences
83. 4. Ecology - In Conservation Of Biodiversity
We are losing biodiversity at a rate that will halve
the number of species on Earth within the next 100
years.
It is not possible to save every species and its
habitat.
So ecosystems with the highest biodiversity should
be saved, and phylogenetics provides a useful
measure of this biodiversity.
85. References
Books
Krane, D. E. and Raymer, M. L. Fundamental Concepts of Bioinformatics.
Ewens, W. J. and Grant, G. R. Statistical Methods in Bioinformatics.
Research papers
Kim, J. Tutorial: phylogenetic tree estimation.
A phylogeny is a tree representation for the evolutionary history relating the species we are interested in.
This is an example of a 13-species phylogeny.
At each leaf of the tree is a species – we also call it a taxon in phylogenetics (plural form is taxa). They are all distinct.
Each internal node corresponds to a speciation event in the past.
When reconstructing the phylogeny we compare the characteristics of the taxa, such as their appearance, physiological features, or the composition of the genetic material.
1- Position 1 on the multiple alignment:
All 4 sequences have the same base « G » and the position is said to be invariant. Invariant sites are uninformative, because each of the three trees invokes exactly the same number of mutations (0).
2- Position 2 is uninformative from the parsimony perspective because one mutation occurs in all three of the possible trees.
3- Position 3 is uninformative because all trees require two mutations;
4- Position 4 is uninformative because all three trees require three mutations.
In contrast, positions 5 and 6 are both informative because for both of them, one of the three trees invokes only one mutation and the other two alternative trees both require two.
Informative sites are those that allow one of the trees (see next slide) to be distinguished from the other two on the basis of how many mutations they must invoke.
In general, for a position to be informative regardless of how many sequences are aligned, it has to have at least two different nucleotides, and each of these nucleotides has to be present at least twice.
All parsimony programs begin by applying this fairly simple rule to the data set being analysed. Notice that four of the six positions being considered in the alignment shown are simply discarded and not considered any further in a parsimony analysis. All of those sites would have contributed to the pairwise similarity scores used by a distance-based approach, and this difference alone can generate substantial differences in the conclusions reached by the two types of approaches.
Krane & Raymer, 2002.