SlideShare a Scribd company logo
1 of 86
Download to read offline
MULTIPLE SEQUENCE ALIGNMENT-
PHYLOGENETIC ANALYSIS BY-ADS
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
Molecular Evolution
At the molecular level, evolution is a process of
mutation with selection.
Molecular evolution is the study of changes in genes
and proteins throughout different branches of tree of
life and phylogeny is the inference of evolutionary
relationships.
Studies of molecular evolution began with the first
sequencing of proteins, beginning in the 1950s.
In 1953 Frederick Sanger and colleagues
determined the primary amino acid sequence of
insulin.
Sanger and
colleagues
sequenced
insulin protein
from five
species (cow,
sheep, pig,
horse and
whale) and
compared
them
Phylogeny
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website,
University of Arizona
According to biologists, there are about 5 to 100 million species of organisms
living on Earth today. Evidence from morphological, biochemical, and gene
sequence data suggests
that all organisms on Earth are genetically related, and the genealogical
relationships can be shown by a vast phylogenetic tree, the Tree of Life -a
metaphor dating back over 100 years. At the leaves of this giant tree, are the
species which are alive today. If we could
trace their history back down the branches of the Tree of Life, we would
encounter their ancestors, which lived thousands or millions years ago .
A phylogenetic or evolutional tree is a data structure
showing evolutionary relationships between biological species, or other entities.
These relationships are called the phylogeny of the species and are based upon
differences or similarities in their genetic or physical characteristics.
Phylogeny could be considered the ‘proof’ of
evolution, and evolution is the central unifying principle of biology.
It has several biological application such biological classification.
It can also be applied to human understanding of life, of
biochemistry, and of evolution. Many biotechnological applications also benefit from
studies of phylogeny, and its applications in the field of medicine may directly impact
patients’ lives. Phylogeny and its methods, such as cladistics, are important fields of
science .
The main objective of the thesis is to present a detailed survey about the methods that
can be applied to construct a phylogenetic tree from given input sequences. We will also
discuss the effectiveness of these methods in practical use. The first and most critical
step
in the procedure is to find an adaptable sequence alignment, which gives the data used
for
the tree construction. In Chapter 2, the most common sequence alignment methods will
be described via dynamic programming. Chapter 3 focuses on different methods
available
in the literature for constructing a phylogenetic tree. The most important and widely used
methods will be described in details, such as Neighbor Joining and Maximum Parsimony
and the related problems to it (Perfect phylogeny, the Steiner tree problem in
phylogeny).
A linear programming approach for solving the maximum parsimony problem will also be
discussed.
What is Phylogenetic analysis?
Phylogenetic Tree is a tool employed to visually
represent evolutionary relationships among a group
of organisms or among a family of related nucleic acid
or protein sequences.
Study of these phylogenetic trees is known as
phylogenetic analysis. Using phylogenies as
analytical frameworks for rigorous understanding
of the evolution of various traits or conditions of
interest, the inference of the branching orders,
and ultimately the evolutionary relationships,
between “taxa” (entities such as genes,
populations, species, etc.)
Components of phylogenetic tree
 Root: the common ancestor of all taxa.
 Branch: reflects the relationship
between taxa according to descent and
ancestry.
 Branch length: indicates the number
of changes that have occurred in the
branch.
 Node: is a taxonomic unit identifying
either an existing or an extinct species.
 Clade: a subgroup of two or more taxa
or DNA/Protein sequences that
includes both their common ancestor
and all of their descendents.
 Distance scale: scale that represents
the number of differences between
organisms or sequences.
 Topology: defines the branching
patterns of the tree.
Tree Nomenclature
Evolutionary relationships
between various organisms is
depicted in the form of a
tree.
A phylogenetic tree consists
of nodes, branches and leaves
The leaves or the terminal
nodes or operational
taxonomic units (OTU’s) are
gene/protein sequences of
present day organisms
Internal nodes represent
hypothetical ancestors of
succeeding internal nodes
or/and OTU’s
Branches connect the OTU’s
and the internal nodes
Scaled & Unscaled Trees
In unscaled trees the branch length is not proportional to
the number of changes occurring.
Such trees are called as “CLADOGRAMS”.
Cladograms are useful when many OTU’S are present or
when you want to know the time elapsed since
divergence of the events.
Scaled & Unscaled Trees
• In scaled trees, the
branch lengths are
proportional to the
number of amino
acid/ nucleotide
changes.
• Such a tree is called
as “PHYLOGRAM”.
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
Clade
The subGroup of
two or more taxa
that have been
derived from a
common
ancestor plus and
all of their
descendents.
Tree Roots
 The root of a phylogenetic tree represents the common ancestor of all
the sequences.
 Many of the tree generating algorithms do not produce rooted trees.
 The alternative to a rooted tree is a unrooted tree. A unrooted tree does
not give a complete description of the evolutionary path taken by various
sequences with respect to each other. The direction of time is not
determined
Rooting of Trees
An unrooted tree can be rooted.
The principal way to root a tree is to specify
an outgroup. Outgroup - An OTU that is
the least related to the group of taxa that
you are studying.
A second way to place a root is through
mid point rooting. Here the longest branch
is determined and that branch is presumed
to be the most reasonable site for a root.
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
Enumerating Trees
There is only one possible tree showing
evolutionary relationships between the various
sequences which we have to find out.
Some phylogenetic methods function by
evaluating all possible trees.
 Consider just 3 and 4 OTU’S.
 For three OTU’s there is only one possible unrooted tree
while for 4 OTU’s there are 3 possible unrooted trees.
The number of possible unrooted trees (NU) for n OTUs (n>3):
The number of rooted trees (NR):
Numbe
r of
OTU’s
Number of
Rooted Trees
Number
of
Unrooted
Trees2 1 1
3 3 1
4 15 3
5 105 15
10 34,489,707 2,027,025
15 213,458,046,67
6,875
8 X 1012
20 8 X 1021
2 X 1020
Here u can see that for 10 OTU’s
there are over 2 million unrooted
trees and 34 million rooted trees.
Hence as the number of sequences
increase heuristic methods are
absolutely essential to speed the
calculations.
Species Tree V/s Gene or
Protein Tree
In gene tree, an internal
node represents the
divergence of an ancestral
gene into two genes with
distinct sequences, while in
a species tree an internal
node represents a speciation
event.
Divergence of genes may
have occurred before the
speciation event during or
after speciation.
Steps of Phylogenetics
Choosing a marker
Multiple Sequence Alignment of homologus
nucleic acid sequences
Tree building using phylogenetic methods
Tree evaluation
Step 1: Choosing a marker
Synonymous & non synonymous substitutions
Multiple substitutions
Non coding regions
Pseudogenes
Transitions & Transversions
Sequence Alignments
In bioinformatics, sequence alignment is a method for
arranging the sequences of
DNA, RNA, or protein to identify regions of similarity that
may be a consequence of
functional, structural, or evolutionary relationships
between them [27]. These sequences,
for example DNA, contain information of living cells.
When a cell duplicates, its DNA
molecules double, and both daughter cells contain a copy
of the original cell’s DNA. However, this replication is not
perfect, because the stored information can change due to
random mutations. These mutations are the cause of the
vast diversity of the species living
on Earth
Pairwise sequence alignment
Given two sequences as an input, we can ask how many mutations are
needed to
transform one into the other. There are several types of mutations, and one
can occur
more frequently than the other. We assign weights to the different the types of
mutations:
frequent mutations get lower weights, rare mutations get greater weights. The
weight of a
series of mutations is defined as the sum of the weights of the individual
mutations. The
task is to find the minimum weight series of mutations, which transform one
sequence into
another. An important question is that how can we quickly find such a minimum
weight
series. A greedy algorithm finds all the possible series of mutations, and then
chooses the
minimum weight, but it is too slow, since the possible number of series of
mutations grows
exponentially. Therefore, dynamic programming is used to get a global solution
Dynamic programming
Dynamic programming is often used to solve
optimization problems, such as finding
the shortest path between two points. Instead of solving
the original problem, it defines
and solves a set of new, smaller subproblems, then
combines the solutions of the generated
subproblems to reach an overall solution.
3
Without solving a subproblem multiple times, it reduces
the number of computations:
it solves each subproblem only once, and stores the
results. The next time when a solution
of a subproblem is needed, it can be simply looked up.
This method is especially useful
when the number of subproblems grows exponentially
• From the resulting MSA, sequence homology can be
inferred and Phylogenetic analysis can be conducted to
assess the sequences' shared evolutionary origins.
• Multiple sequence alignment is often used to assess
sequence conservation of protein domains, tertiary and
secondary structures, and even individual amino acids or
nucleotides.
The first step in MSA is to choose the right sequences.
Choose sequences that are homologous. Greater the
dissimilarity between the sequences more are the chances
that related sequences are not homologous.
Perform BLAST to enhance your result. If the sequence is
not apparently homologous then it should be removed.
Sometimes it may also happen that the sequences are
distantly related and yet are homologous.
In this case, lower the gap creation /gap extension
penalty in order to accommodate such sequences in
the MSA.
Then, perform MSA using program like ClustalW.
Gaps in the alignment represent insertions and
deletions. Most of the phylogeny programs are not
equipped to evaluate insertions and deletions. Many
experts recommend that any column of MSA that
includes a gap in any position should be deleted.
MSAs require more sophisticated methodologies than
pairwise alignment because they are more
computationally complex to produce. Most multiple
sequence alignment programs use heuristic methods
rather than global optimization because identifying the
optimal alignment between more than a few sequences of
moderate length is prohibitively computationally
expensive.
Multiple Sequence Alignments use a number of
homologous sequences.
There are exhaustive and heuristic approaches used in
multiple sequence alignment.
Step 2: Multiple Sequence Alignment
It is a very critical step.
A multiple sequence alignment (MSA) is a
sequence alignment of three or more biological
sequences, generally protein, DNA, or RNA. In
many cases, the input set of query sequences
are assumed to have an evolutionary
relationship by which they share a lineage and
are descended from a common ancestor.
Exhaustive Algorithms
The exhaustive alignment method involves examining all possible
aligned positions simultaneously.
Similar to DP in PSA, which involves the use of 2D Matrix to search
for a optimal alignment, to use DP for MSA, extra dimensions are
needed to take all possible ways of sequence matching into
considerations. This means to establish a multidimensional search
matrix.
For aligning N sequences, N dimensional matrix is needed to be filled
up with the alignment scores. Hence the amount of computational
time and memory space required increases exponentially with the
sequence number.
For this reason full DP is limited to small datasets of less than 10 short
sequences.
DCA (Divide and Conquer Algorithm) comes under Exhaustive
algorithms.
Divide-and-conquer
approach
Many useful algorithms are recursive in structure: to solve
a given problem, they call themselves recursively one or
more times to deal with closely related subproblems.
These algorithms typically follow a divide-and-conquer
approach: they break the problem into several
subproblems that are similar to the original problem but
smaller in size, solve the subproblems recursively, and
then combine these solutions to create a solution to the
original problem.
Divide-and-conquer
approach (Cont.)
The divide-and-conquer paradigm
involves three steps at each level of
the recursion:
Divide the problem into a number
of subproblems.
Conquer the subproblems by
solving them recursively. If the
subproblem sizes are small enough,
however, just solve the subproblems
in a straightforward manner.
Heuristic Algorithms
Because the use of DP is not feasible for routine MSA,
faster and heuristic algorithm have been developed.
The Heuristic algorithms fall under three categories –
Progressive alignment type, Iterative Alignment type,
Block Based Alignment type.
Progressive alignment construction
The most widely used approach to multiple sequence alignments uses
a heuristic search known as progressive technique (also known as the
hierarchical or tree method), that builds up a final MSA by combining
pairwise alignments beginning with the most similar pair and
progressing to the most distantly related.
All progressive alignment methods require two stages: a first stage in
which the relationships between the sequences are represented as a
tree, called a guide tree, and a second step in which the MSA is built
by adding the sequences sequentially to the growing MSA according
to the guide tree.
The initial guide tree is determined by an efficient clustering method
such as neighbor-joining or UPGMA, and may use distances based on
the number of identical two letter sub-sequences (as in FASTA rather
than a dynamic programming alignment).
Progressive alignments cannot be globally optimal. The
primary problem is that when errors are made at any stage
in growing the MSA, these errors are then propagated
through to the final result. Performance is also particularly
bad when all of the sequences in the set are rather
distantly related.
Progressive alignment methods are efficient enough to
implement on a large scale for many (100s to 1000s)
sequences. Progressive alignment services are commonly
available on publicly accessible web servers so users need
not locally install the applications of interest. The most
popular progressive alignment method has been the
Clustal family, especially the weighted variant ClustalW.
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
CLUSTAL – w
www.ebi.ac.uk/clustalw/
Clustal is progressive MSA program available either as a stand alone or on
line program.
Clustal is a widely used multiple sequence alignment computer program.
The latest version is 2.0. There are two main variations:
 ClustalW: command line interface
 ClustalX: This version has a graphical user interface. It is available for
Windows, Mac OS, and Unix/Linux.
This program is available from the Clustal Homepage or European
Bioinformatics Institute ftp server.
There are three main steps:
 Do a pair wise alignment
 Create a phylogenetic tree (or use a user-defined tree)
 Use the phylogenetic tree to carry out a multiple alignment
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
Step 3: Tree Building
(Phylogenetic Methods)
Distance based:
UPGMA
Neighbour-Joining
Character based:
Maximum Parsimony
Maximum Likelihood
Phylogenetic Methods
Distance Based Methods
Begins the construction of a tree by calculating
the distance between the molecular sequences.
Produces one phylogenetic tree
Computationally fast
Tree Building - Approach
The simplest approach
-- align pairs of sequences
-- count the number of differences.
HAMMING DISTANCE - The degree of
divergence (D).
D=n/N * 100
Where n = number of sites with differences
N = Length of Alignment
Example of Hamming Distance
1. AGGCCATGAATTAAGAATAA
2. AGCCCATGGATAAAGAGTAA
3. AGGACATGAATTAAGAATAA
4. AAGCCAAGAATTACGAATAA
Here, the evolutionary distance is expressed as the
number of nucleotide differences for each sequence
pair.
Distance Matrix-Example
1 2 3 4
1 - 0.2 0.05 0.15
2 - 0.25 0.4
3 - 0.2
4 -
Sequences 1 and 2
are 20 nucleotides
in length and have
four differences,
corresponding to an
evolutionary
difference of 4/20 =
0.2.
UPGMA
Unweighted – Pair – Group – Method –with
Arithmetic mean
Oldest Distance Method
Proposed by Michener & Sokal in 1957
Produces rooted trees.
It assumes that the trees are ultrametric,
meaning that it assumes constant rate of
substitutions in all branches of the tree.
UPGMA - Working
Employs a sequential clustering algorithm
I. Identify the two OTU’s from among all the OTUs, that are
most similar to each other and then treat these as a new
single OTU.
II. Subsequently from among the new group of OTUs,
identify the pair with the highest similarity, and so on.
Pros and Cons
Advantage
Fast
Can handle many sequences
Disadvantage
Cannot be used when rates of substitutions are
unequal
Does not consider multiple substitutions.
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
Nucleotide Substitutions - Types
Neighbor-Joining method
 Developed by Saitou and Nei, 1987
 Method for re-constructing phylogenetic trees and
computing the lengths of the branches of the tree.
 Related to the cluster method
 Does not require the data to be ultrametric.
 Produces Unrooted Trees.
Steps Involved
 Starts with all species in a
simple star shaped tree.
 Least distance pair of
OTU’s are joined with a
node.
 Process is repeated until
all other OTU’s are
connected to the internal
node & only 2 OTU’s are
left at the end.
N-J Tree
The resulting tree will be the following:
The resulting tree will
be :
Final Tree
Pros and Cons
 Advantages
 is fast and thus suited for large datasets and for
bootstrap analysis.
 permits lineages with largely different branch lengths.
 its suitability to handle large datasets has led to the fact
that the method is widely used by molecular
evolutionists.
 Disadvantages
 sequence information is reduced
 gives only one possible tree
 strongly dependent on the model of evolution used.
• Character based methods analyze candidate trees on
relationships inferred directly from the sequence
alignment.
• These approaches are distinct from distance based
methods because they do not involve an intermediate
summary of the sequence data in the form of a distance
methods, although they do depend on them.
• There are two character based methods
– Maximum parsimony
– Maximum likelihood
Maximum Parsimony
(Fitch, 1977)
Parsimony – carefulness in the use of resources.
The basic underlying principle behind parsimony is given
by Occam’s Razor:
“Given a choice between – a hard and easy way of doing
things, nature will always pick the easiest way i.e. simple
is always preferred over complex.”
Parsimony assumes that the relationship that requires the
fewest number of mutations to explain the current state
of sequences being considered is the relationship that is
most likely to be correct.
ParsimonyThe concept of parsimony is at the heart of all character-
based methods of phylogenetic reconstruction.
The 2 fundamental ideas of biological parsimony are:
mutations are exceedingly rare events ;
the more unlikely events a model invokes, the less
likely the model is to be correct.
As a result, the relationship that requires the fewest
number of mutations to explain the current state of the
sequences being considered, is the relationship that is
most likely to be correct.
As a result, the relationship that requires the fewest
number of mutations to explain the current state of the
sequences being considered, is the relationship
that is most likely to be correct.
Example
Multiple sequence alignment, for a parsimony approach, contains positions
that fall into two categories in terms of their information content : those that
have information (are informative) and those that do not (are uninformative).
Example:
seq 1 2 3 4 5 6
1 G G G G G G
2 G G G A G T
3 G G A T A G
4 G A T C A T
Position 1 is said invariant and therefore uninformative, because all trees
invoke the same number of mutations (0);
Position 2 is uninformative because 1 mutation occurs in all three possible
trees;
Position 3, 2 mutations occur; Position 4 requires 3 mutations in all possible
trees.
Positions 5 and 6 are informative, because one of the trees invokes only one
mutation and the other 2 alternative trees both require 2 mutations.
In general, for a position to be informative regardless of how many sequences
are aligned, it has to have at least 2 different nucleotides, and each of these
nucleotides has to be present at least twice.
G G
1G
2G A4
G3
2
1G
3G A4
G2
G G
1G
4A G3
G2
G G
G A
1G
2G T4
A3
3
1G
3A T4
G2
G G
1G
4T A3
G2
G G
G T
1G
2A C4
T3
4
1G
3T C4
A2
G A
1G
4C T3
A2
G A
G A
1G
2G A4
A3
5
1G
3A A4
G2
G G
1G
4A A3
G2
G G
1 G G
1G
2G G4
G3
G G
1G
3G G4
G2
G G
1G
4G G3
G3
G G
1G
2T T4
G3
6
1G
3G T4
T2
G T
1G
4T T3
T2
G T
The maximum parsimony algorithm searches for the
minimum number of genetic events (nucleotide
substitutions or amino acids changes) to infer the
most parsimonious tree from a set of sequences.
The best tree is one which needs fewest changes.
Positive and negative pointsMaximum Parsimony (positive points):
does not reduce sequence information to a single number
tries to provide information on the ancestral sequences
evaluates different trees
Maximum Parsimony (negative points):
is slow in comparison with distance methods
does not use all the sequence information (only
informative sites are used)
does not correct for multiple mutations (does not imply a
model of evolution)
does not provide information on the branch lengths
More than one parsimonious tree?
By Bootstrapping
Bootstrapping
Bootstrap analysis:
is a statistical method for obtaining an estimate of
error.
is used to evaluate the reliability of a tree
is used to examine how often a particular cluster in a
tree appears when nucleotides or aminoacids are
resampled.
Maximum likelihood
This approach is a purely statistical based method.
Probabilities are considered for every individual nucleotide
substitutions in a set of sequence alignment.
Since transitions are observed roughly 3 times as often as
transversions; it can be reasonably argued that a greater
likelihood exists that the sequence with C and T are more
closely related to each other than they are to the sequence
with G.
Calculation of probabilities is complicated by the fact that
the sequence of the common ancestor to the sequences
considered being unknown.
Furthermore multiple substitutions may have occurred at
one or more sites and that all sites are not necessarily
independent or equivalent.
Contd…..Notes :
1. This is the best justified method from a theoretical
viewpoint;
2. ML estimates the branch lengths of the final tree ;
3. ML methods are usually consistent ;
4. Sequence simulation experiments have shown that
this method works better than all others in most
cases.
Drawbacks : they need long computation time to
construct a tree.
Applications
1. Taxonomy
Can infer how to classify organisms
Reflects evolution
Used for classifications
Host and parasite
Can look at co-evolution of any two related organisms
Examine phylogeny of hosts and parasites
Look at how similar the two phylogenies are
Can also look at host switching
Other examples are plant and insect interactions,
humans and pathogens (such as HIV)
Advantages and disadvantages of character
based methods
Advantages
MP tries to provide information on the ancestral
sequences
ML tends to outperform alternative methods such
as parsimony or distance methods even with very
short sequences
Disadvantages
Slow in comparison with distance methods
MP does not use all the sequence information
ML result is dependent on the model of evolution
used
Conclusion
Molecular Phylogeny
Important tool to understand the evolution and
relationships between various species
Increase in molecular data & visual impact of tree
has made phylogenetics increasingly important
Which method to choose?
Applications
There are wide array of applications of
phylogenetic analysis which include:
 Evolution studies
 Medical research and epidemology
 In ecology
 In criminal studies
 Finding the orthologs and paralogs
1. Medicinal field - to make predictions
about poorly studied species
 Phylogenies also allow us to predict about
the characteristics of living organisms that
we have not yet studied.
 Eg: Pacific Yew produces a compound
taxol i.e. helpful in treating certain kinds
of cancer.
 However it was expensive to extract
compound from this tree.
 From phylogenetic analysis leaves of the
European Yew contain a related
compound that can also be used to
efficiently produce taxol.
Pacific yew
European yew
2. Evolution studies : Tracking SARS
back to its source
 The previously unknown SARS virus
generated widespread panic in 2002 and
2003.
 In 2003,it was suspected that it had
jumped from some cat like mammals
called civets.
 In 2005,researchers found SARS like virus
in Chinese horseshoe bats and was
concluded to be the culprit by
reconstruction of evolutionary history of
virus.
 SARS virus genetic material RNA was
collected from different sources and a tree
was generated.
A civet
A horseshoe bat
The tree showed that
civet and human SARS
viruses are very
similar to each other
and that both are
nested within a clade
of bat viruses - so the
ancestor of the civet
and human strains
seems to have been a
bat virus.
3. In Criminal Studies : Evidence of HIV-1
transmission
 A gastroenterologist was convicted of attempted
murder by injecting his former girlfriend with
blood or blood-products obtained from an HIV
type 1 infected patient under his care.
 Phylogenetic analyses of HIV-1 sequences were
admitted and used as evidence in this case,
representing the first use of phylogenetic analyses
in a criminal court case in the United States
Phylogenetic analyses
of HIV-1 isolated from
the victim, the
patient, and a local
population sample of
HIV-1-positive
individuals showed
the victim's HIV-1
sequences to be most
closely related to and
nested within a
lineage comprised of
the patient's HIV-1
sequences
4. Ecology - In Conservation Of Biodiversity
 We are losing biodiversity at a rate that will halve
the number of species on Earth within the next 100
years.
 It is not possible to save every species and its
habitat.
 So ecosystems with the highest biodiversity should
be saved and phylogenetics provide a useful
measure of this biodiversity.
 Multiple Sequence Alignment-just glims of viewes on bioinformatics.
References
Books
Fundamental concepts of Bioinformatics
 Dan E. Krane Michael L. Raymer
Statistical methods in Bioinformatics
Warren J. Ewens and Gregory R. Grant
Research papers
Tutorial phylogenetic tree estimation
Junhyong Kim
Thank You

More Related Content

What's hot

What's hot (20)

Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Phylogenetic analysis
Phylogenetic analysis Phylogenetic analysis
Phylogenetic analysis
 
Phylogenetic analysis
Phylogenetic analysisPhylogenetic analysis
Phylogenetic analysis
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Sequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsSequence Alignment In Bioinformatics
Sequence Alignment In Bioinformatics
 
Global and Local Sequence Alignment
Global and Local Sequence AlignmentGlobal and Local Sequence Alignment
Global and Local Sequence Alignment
 
Phylogenetics
PhylogeneticsPhylogenetics
Phylogenetics
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Bls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBls 303 l1.phylogenetics
Bls 303 l1.phylogenetics
 
Gene prediction and expression
Gene prediction and expressionGene prediction and expression
Gene prediction and expression
 
UPGMA
UPGMAUPGMA
UPGMA
 
Comparitive genome mapping and model systems
Comparitive genome mapping and model systemsComparitive genome mapping and model systems
Comparitive genome mapping and model systems
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
Molecular phylogenetics
Molecular phylogeneticsMolecular phylogenetics
Molecular phylogenetics
 
Gene regulatory networks
Gene regulatory networksGene regulatory networks
Gene regulatory networks
 
Blast and fasta
Blast and fastaBlast and fasta
Blast and fasta
 
BLAST
BLASTBLAST
BLAST
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Gene mapping
Gene mappingGene mapping
Gene mapping
 

Similar to Multiple Sequence Alignment-just glims of viewes on bioinformatics.

Report on Phylogenetic tree
Report on Phylogenetic treeReport on Phylogenetic tree
Report on Phylogenetic treeSanzid Kawsar
 
Methods of illustrating evolutionary relationship
Methods of illustrating evolutionary relationshipMethods of illustrating evolutionary relationship
Methods of illustrating evolutionary relationshipEmaSushan
 
Phylogenetic tree and its construction and phylogeny of
Phylogenetic tree and its construction and phylogeny ofPhylogenetic tree and its construction and phylogeny of
Phylogenetic tree and its construction and phylogeny ofbhavnesthakur
 
phylogenetictreeanditsconstructionandphylogenyof-191208102256.pdf
phylogenetictreeanditsconstructionandphylogenyof-191208102256.pdfphylogenetictreeanditsconstructionandphylogenyof-191208102256.pdf
phylogenetictreeanditsconstructionandphylogenyof-191208102256.pdfalizain9604
 
Survey of softwares for phylogenetic analysis
Survey of softwares for phylogenetic analysisSurvey of softwares for phylogenetic analysis
Survey of softwares for phylogenetic analysisArindam Ghosh
 
Basics of constructing Phylogenetic tree.ppt
Basics of constructing Phylogenetic tree.pptBasics of constructing Phylogenetic tree.ppt
Basics of constructing Phylogenetic tree.pptSehrishSarfraz2
 
Bioinformatics presentation shabir .pptx
Bioinformatics presentation shabir .pptxBioinformatics presentation shabir .pptx
Bioinformatics presentation shabir .pptxshabirhassan4585
 
Lecture 02 (2 04-2021) phylogeny
Lecture 02 (2 04-2021) phylogenyLecture 02 (2 04-2021) phylogeny
Lecture 02 (2 04-2021) phylogenyKristen DeAngelis
 
Plang functional genome
Plang functional genomePlang functional genome
Plang functional genometcha163
 
Phylogenetic Tree evolution
Phylogenetic Tree evolutionPhylogenetic Tree evolution
Phylogenetic Tree evolutionMd Omama Jawaid
 
Bioinformatica 24-11-2011-t6-phylogenetics
Bioinformatica 24-11-2011-t6-phylogeneticsBioinformatica 24-11-2011-t6-phylogenetics
Bioinformatica 24-11-2011-t6-phylogeneticsProf. Wim Van Criekinge
 
Taxonomic data sources powerpoint presentation.pptx
Taxonomic data sources powerpoint presentation.pptxTaxonomic data sources powerpoint presentation.pptx
Taxonomic data sources powerpoint presentation.pptxMukhtarAhmadsofi
 
Phylogenetic tree and it's types
Phylogenetic tree and it's typesPhylogenetic tree and it's types
Phylogenetic tree and it's typesNizadSultana
 
PHYLOGENETIC ANALYSIS_CSS2.pptx
PHYLOGENETIC ANALYSIS_CSS2.pptxPHYLOGENETIC ANALYSIS_CSS2.pptx
PHYLOGENETIC ANALYSIS_CSS2.pptxSilpa87
 
Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...
Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...
Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...Jonathan Eisen
 

Similar to Multiple Sequence Alignment-just glims of viewes on bioinformatics. (20)

Report on Phylogenetic tree
Report on Phylogenetic treeReport on Phylogenetic tree
Report on Phylogenetic tree
 
Methods of illustrating evolutionary relationship
Methods of illustrating evolutionary relationshipMethods of illustrating evolutionary relationship
Methods of illustrating evolutionary relationship
 
Phylogenetic tree and its construction and phylogeny of
Phylogenetic tree and its construction and phylogeny ofPhylogenetic tree and its construction and phylogeny of
Phylogenetic tree and its construction and phylogeny of
 
phylogenetictreeanditsconstructionandphylogenyof-191208102256.pdf
phylogenetictreeanditsconstructionandphylogenyof-191208102256.pdfphylogenetictreeanditsconstructionandphylogenyof-191208102256.pdf
phylogenetictreeanditsconstructionandphylogenyof-191208102256.pdf
 
Survey of softwares for phylogenetic analysis
Survey of softwares for phylogenetic analysisSurvey of softwares for phylogenetic analysis
Survey of softwares for phylogenetic analysis
 
Basics of constructing Phylogenetic tree.ppt
Basics of constructing Phylogenetic tree.pptBasics of constructing Phylogenetic tree.ppt
Basics of constructing Phylogenetic tree.ppt
 
phylogenetics.pdf
phylogenetics.pdfphylogenetics.pdf
phylogenetics.pdf
 
Phylogeny-Abida.pptx
Phylogeny-Abida.pptxPhylogeny-Abida.pptx
Phylogeny-Abida.pptx
 
Bioinformatics presentation shabir .pptx
Bioinformatics presentation shabir .pptxBioinformatics presentation shabir .pptx
Bioinformatics presentation shabir .pptx
 
phylogenetic tree.pptx
phylogenetic tree.pptxphylogenetic tree.pptx
phylogenetic tree.pptx
 
Lecture 02 (2 04-2021) phylogeny
Lecture 02 (2 04-2021) phylogenyLecture 02 (2 04-2021) phylogeny
Lecture 02 (2 04-2021) phylogeny
 
Plang functional genome
Plang functional genomePlang functional genome
Plang functional genome
 
3035 e1 (2)
3035 e1 (2)3035 e1 (2)
3035 e1 (2)
 
Phylogenetic Tree evolution
Phylogenetic Tree evolutionPhylogenetic Tree evolution
Phylogenetic Tree evolution
 
Bioinformatica 24-11-2011-t6-phylogenetics
Bioinformatica 24-11-2011-t6-phylogeneticsBioinformatica 24-11-2011-t6-phylogenetics
Bioinformatica 24-11-2011-t6-phylogenetics
 
Taxonomic data sources powerpoint presentation.pptx
Taxonomic data sources powerpoint presentation.pptxTaxonomic data sources powerpoint presentation.pptx
Taxonomic data sources powerpoint presentation.pptx
 
Phylogenetic tree and it's types
Phylogenetic tree and it's typesPhylogenetic tree and it's types
Phylogenetic tree and it's types
 
PHYLOGENETIC ANALYSIS_CSS2.pptx
PHYLOGENETIC ANALYSIS_CSS2.pptxPHYLOGENETIC ANALYSIS_CSS2.pptx
PHYLOGENETIC ANALYSIS_CSS2.pptx
 
Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...
Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...
Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...
 
07_Phylogeny_2022.pdf
07_Phylogeny_2022.pdf07_Phylogeny_2022.pdf
07_Phylogeny_2022.pdf
 

Recently uploaded

MARSILEA notes in detail for II year Botany.ppt
MARSILEA  notes in detail for II year Botany.pptMARSILEA  notes in detail for II year Botany.ppt
MARSILEA notes in detail for II year Botany.pptaigil2
 
KeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceKeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceLayne Sadler
 
Application of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptxApplication of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptxRahulVishwakarma71547
 
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptxQ3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptxArdeniel
 
Substances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestSubstances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestAkashDTejwani
 
Role of herbs in hair care Amla and heena.pptx
Role of herbs in hair care  Amla and  heena.pptxRole of herbs in hair care  Amla and  heena.pptx
Role of herbs in hair care Amla and heena.pptxVaishnaviAware
 
World Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabWorld Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabkiyorndlab
 
Lehninger_Chapter 17_Fatty acid Oxid.ppt
Lehninger_Chapter 17_Fatty acid Oxid.pptLehninger_Chapter 17_Fatty acid Oxid.ppt
Lehninger_Chapter 17_Fatty acid Oxid.pptSachin Teotia
 
M.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery SystemsM.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery SystemsSumathi Arumugam
 
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...marwaahmad357
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearmarwaahmad357
 
Applied Biochemistry feedback_M Ahwad 2023.docx
Applied Biochemistry feedback_M Ahwad 2023.docxApplied Biochemistry feedback_M Ahwad 2023.docx
Applied Biochemistry feedback_M Ahwad 2023.docxmarwaahmad357
 
Pests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPRPests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPRPirithiRaju
 
biosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibioticsbiosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibioticsSafaFallah
 
Principles & Formulation of Hair Care Products
Principles & Formulation of Hair Care  ProductsPrinciples & Formulation of Hair Care  Products
Principles & Formulation of Hair Care Productspurwaborkar@gmail.com
 
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptxTHE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptxAkinrotimiOluwadunsi
 
Bureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptxBureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptxkastureyashashree
 
soft skills question paper set for bba ca
soft skills question paper set for bba casoft skills question paper set for bba ca
soft skills question paper set for bba caohsadfeeling
 
RCPE terms and cycles scenarios as of March 2024
RCPE terms and cycles scenarios as of March 2024RCPE terms and cycles scenarios as of March 2024
RCPE terms and cycles scenarios as of March 2024suelcarter1
 
IB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxIB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxUalikhanKalkhojayev1
 

Recently uploaded (20)

MARSILEA notes in detail for II year Botany.ppt
MARSILEA  notes in detail for II year Botany.pptMARSILEA  notes in detail for II year Botany.ppt
MARSILEA notes in detail for II year Botany.ppt
 
KeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceKeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data science
 
Application of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptxApplication of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptx
 
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptxQ3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
 
Substances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestSubstances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening Test
 
Role of herbs in hair care Amla and heena.pptx
Role of herbs in hair care  Amla and  heena.pptxRole of herbs in hair care  Amla and  heena.pptx
Role of herbs in hair care Amla and heena.pptx
 
World Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabWorld Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlab
 
Lehninger_Chapter 17_Fatty acid Oxid.ppt
Lehninger_Chapter 17_Fatty acid Oxid.pptLehninger_Chapter 17_Fatty acid Oxid.ppt
Lehninger_Chapter 17_Fatty acid Oxid.ppt
 
M.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery SystemsM.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery Systems
 
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final year
 
Applied Biochemistry feedback_M Ahwad 2023.docx
Applied Biochemistry feedback_M Ahwad 2023.docxApplied Biochemistry feedback_M Ahwad 2023.docx
Applied Biochemistry feedback_M Ahwad 2023.docx
 
Pests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPRPests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPR
 
biosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibioticsbiosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibiotics
 
Principles & Formulation of Hair Care Products
Principles & Formulation of Hair Care  ProductsPrinciples & Formulation of Hair Care  Products
Principles & Formulation of Hair Care Products
 
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptxTHE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
 
Bureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptxBureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptx
 
soft skills question paper set for bba ca
soft skills question paper set for bba casoft skills question paper set for bba ca
soft skills question paper set for bba ca
 
RCPE terms and cycles scenarios as of March 2024
RCPE terms and cycles scenarios as of March 2024RCPE terms and cycles scenarios as of March 2024
RCPE terms and cycles scenarios as of March 2024
 
IB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxIB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptx
 

Multiple Sequence Alignment-just glims of viewes on bioinformatics.

  • 3. Molecular Evolution At the molecular level, evolution is a process of mutation with selection. Molecular evolution is the study of changes in genes and proteins throughout different branches of tree of life and phylogeny is the inference of evolutionary relationships. Studies of molecular evolution began with the first sequencing of proteins, beginning in the 1950s.
  • 4. In 1953 Frederick Sanger and colleagues determined the primary amino acid sequence of insulin.
  • 5. Sanger and colleagues sequenced insulin protein from five species (cow, sheep, pig, horse and whale) and compared them
  • 6. Phylogeny Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona
  • 7. According to biologists, there are about 5 to 100 million species of organisms living on Earth today. Evidence from morphological, biochemical, and gene sequence data suggests that all organisms on Earth are genetically related, and the genealogical relationships can be shown by a vast phylogenetic tree, the Tree of Life -a metaphor dating back over 100 years. At the leaves of this giant tree, are the species which are alive today. If we could trace their history back down the branches of the Tree of Life, we would encounter their ancestors, which lived thousands or millions years ago . A phylogenetic or evolutional tree is a data structure showing evolutionary relationships between biological species, or other entities. These relationships are called the phylogeny of the species and are based upon differences or similarities in their genetic or physical characteristics. Phylogeny could be considered the ‘proof’ of evolution, and evolution is the central unifying principle of biology. It has several biological application such biological classification.
  • 8. It can also be applied to human understanding of life, of biochemistry, and of evolution. Many biotechnological applications also benefit from studies of phylogeny, and its applications in the field of medicine may directly impact patients’ lives. Phylogeny and its methods, such as cladistics, are important fields of science . The main objective of the thesis is to present a detailed survey about the methods that can be applied to construct a phylogenetic tree from given input sequences. We will also discuss the effectiveness of these methods in practical use. The first and most critical step in the procedure is to find an adaptable sequence alignment, which gives the data used for the tree construction. In Chapter 2, the most common sequence alignment methods will be described via dynamic programming. Chapter 3 focuses on different methods available in the literature for constructing a phylogenetic tree. The most important and widely used methods will be described in details, such as Neighbor Joining and Maximum Parsimony and the related problems to it (Perfect phylogeny, the Steiner tree problem in phylogeny). A linear programming approach for solving the maximum parsimony problem will also be discussed.
  • 9. What is Phylogenetic analysis? Phylogenetic Tree is a tool employed to visually represent evolutionary relationships among a group of organisms or among a family of related nucleic acid or protein sequences. Study of these phylogenetic trees is known as phylogenetic analysis. Using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest, the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)
  • 10. Components of phylogenetic tree  Root: the common ancestor of all taxa.  Branch: reflects the relationship between taxa according to descent and ancestry.  Branch length: indicates the number of changes that have occurred in the branch.  Node: is a taxonomic unit identifying either an existing or an extinct species.  Clade: a subgroup of two or more taxa or DNA/Protein sequences that includes both their common ancestor and all of their descendents.  Distance scale: scale that represents the number of differences between organisms or sequences.  Topology: defines the branching patterns of the tree.
  • 11. Tree Nomenclature Evolutionary relationships between various organisms is depicted in the form of a tree. A phylogenetic tree consists of nodes, branches and leaves The leaves or the terminal nodes or operational taxonomic units (OTU’s) are gene/protein sequences of present day organisms Internal nodes represent hypothetical ancestors of succeeding internal nodes or/and OTU’s Branches connect the OTU’s and the internal nodes
  • 12. Scaled & Unscaled Trees In unscaled trees the branch length is not proportional to the number of changes occurring. Such trees are called as “CLADOGRAMS”. Cladograms are useful when many OTU’S are present or when you want to know the time elapsed since divergence of the events.
  • 13. Scaled & Unscaled Trees • In scaled trees, the branch lengths are proportional to the number of amino acid/ nucleotide changes. • Such a tree is called as “PHYLOGRAM”.
  • 15. Clade The subGroup of two or more taxa that have been derived from a common ancestor plus and all of their descendents.
  • 16. Tree Roots  The root of a phylogenetic tree represents the common ancestor of all the sequences.  Many of the tree generating algorithms do not produce rooted trees.  The alternative to a rooted tree is a unrooted tree. A unrooted tree does not give a complete description of the evolutionary path taken by various sequences with respect to each other. The direction of time is not determined
  • 17. Rooting of Trees An unrooted tree can be rooted. The principal way to root a tree is to specify an outgroup. Outgroup - An OTU that is the least related to the group of taxa that you are studying. A second way to place a root is through mid point rooting. Here the longest branch is determined and that branch is presumed to be the most reasonable site for a root.
  • 19. Enumerating Trees There is only one possible tree showing evolutionary relationships between the various sequences which we have to find out. Some phylogenetic methods function by evaluating all possible trees.
  • 20.  Consider just 3 and 4 OTU’S.  For three OTU’s there is only one possible unrooted tree while for 4 OTU’s there are 3 possible unrooted trees.
  • 21. The number of possible unrooted trees (NU) for n OTUs (n>3): The number of rooted trees (NR): Numbe r of OTU’s Number of Rooted Trees Number of Unrooted Trees2 1 1 3 3 1 4 15 3 5 105 15 10 34,489,707 2,027,025 15 213,458,046,67 6,875 8 X 1012 20 8 X 1021 2 X 1020 Here u can see that for 10 OTU’s there are over 2 million unrooted trees and 34 million rooted trees. Hence as the number of sequences increase heuristic methods are absolutely essential to speed the calculations.
  • 22. Species Tree V/s Gene or Protein Tree In gene tree, an internal node represents the divergence of an ancestral gene into two genes with distinct sequences, while in a species tree an internal node represents a speciation event. Divergence of genes may have occurred before the speciation event during or after speciation.
  • 23. Steps of Phylogenetics Choosing a marker Multiple Sequence Alignment of homologus nucleic acid sequences Tree building using phylogenetic methods Tree evaluation
  • 24. Step 1: Choosing a marker Synonymous & non synonymous substitutions Multiple substitutions Non coding regions Pseudogenes Transitions & Transversions
  • 25. Sequence Alignments In bioinformatics, sequence alignment is a method for arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between them [27]. These sequences, for example DNA, contain information of living cells. When a cell duplicates, its DNA molecules double, and both daughter cells contain a copy of the original cell’s DNA. However, this replication is not perfect, because the stored information can change due to random mutations. These mutations are the cause of the vast diversity of the species living on Earth
  • 26. Pairwise sequence alignment Given two sequences as an input, we can ask how many mutations are needed to transform one into the other. There are several types of mutations, and one can occur more frequently than the other. We assign weights to the different the types of mutations: frequent mutations get lower weights, rare mutations get greater weights. The weight of a series of mutations is defined as the sum of the weights of the individual mutations. The task is to find the minimum weight series of mutations, which transform one sequence into another. An important question is that how can we quickly find such a minimum weight series. A greedy algorithm finds all the possible series of mutations, and then chooses the minimum weight, but it is too slow, since the possible number of series of mutations grows exponentially. Therefore, dynamic programming is used to get a global solution
  • 27. Dynamic programming Dynamic programming is often used to solve optimization problems, such as finding the shortest path between two points. Instead of solving the original problem, it defines and solves a set of new, smaller subproblems, then combines the solutions of the generated subproblems to reach an overall solution. 3 Without solving a subproblem multiple times, it reduces the number of computations: it solves each subproblem only once, and stores the results. The next time when a solution of a subproblem is needed, it can be simply looked up. This method is especially useful when the number of subproblems grows exponentially
  • 28. • From the resulting MSA, sequence homology can be inferred and Phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. • Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides. The first step in MSA is to choose the right sequences. Choose sequences that are homologous. Greater the dissimilarity between the sequences more are the chances that related sequences are not homologous. Perform BLAST to enhance your result. If the sequence is not apparently homologous then it should be removed.
  • 29. Sometimes it may also happen that the sequences are distantly related and yet are homologous. In this case, lower the gap creation /gap extension penalty in order to accommodate such sequences in the MSA. Then, perform MSA using program like ClustalW. Gaps in the alignment represent insertions and deletions. Most of the phylogeny programs are not equipped to evaluate insertions and deletions. Many experts recommend that any column of MSA that includes a gap in any position should be deleted.
  • 30. MSAs require more sophisticated methodologies than pairwise alignment because they are more computationally complex to produce. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive. Multiple Sequence Alignments use a number of homologous sequences. There are exhaustive and heuristic approaches used in multiple sequence alignment.
  • 31. Step 2: Multiple Sequence Alignment It is a very critical step. A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor.
  • 32. Exhaustive Algorithms The exhaustive alignment method involves examining all possible aligned positions simultaneously. Similar to DP in PSA, which involves the use of 2D Matrix to search for a optimal alignment, to use DP for MSA, extra dimensions are needed to take all possible ways of sequence matching into considerations. This means to establish a multidimensional search matrix. For aligning N sequences, N dimensional matrix is needed to be filled up with the alignment scores. Hence the amount of computational time and memory space required increases exponentially with the sequence number. For this reason full DP is limited to small datasets of less than 10 short sequences. DCA (Divide and Conquer Algorithm) comes under Exhaustive algorithms.
  • 33. Divide-and-conquer approach Many useful algorithms are recursive in structure: to solve a given problem, they call themselves recursively one or more times to deal with closely related subproblems. These algorithms typically follow a divide-and-conquer approach: they break the problem into several subproblems that are similar to the original problem but smaller in size, solve the subproblems recursively, and then combine these solutions to create a solution to the original problem.
  • 34. Divide-and-conquer approach (Cont.) The divide-and-conquer paradigm involves three steps at each level of the recursion: Divide the problem into a number of subproblems. Conquer the subproblems by solving them recursively. If the subproblem sizes are small enough, however, just solve the subproblems in a straightforward manner.
  • 35. Heuristic Algorithms Because the use of DP is not feasible for routine MSA, faster and heuristic algorithm have been developed. The Heuristic algorithms fall under three categories – Progressive alignment type, Iterative Alignment type, Block Based Alignment type.
  • 36. Progressive alignment construction The most widely used approach to multiple sequence alignments uses a heuristic search known as progressive technique (also known as the hierarchical or tree method), that builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a tree, called a guide tree, and a second step in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial guide tree is determined by an efficient clustering method such as neighbor-joining or UPGMA, and may use distances based on the number of identical two letter sub-sequences (as in FASTA rather than a dynamic programming alignment).
  • 37. Progressive alignments cannot be globally optimal. The primary problem is that when errors are made at any stage in growing the MSA, these errors are then propagated through to the final result. Performance is also particularly bad when all of the sequences in the set are rather distantly related. Progressive alignment methods are efficient enough to implement on a large scale for many (100s to 1000s) sequences. Progressive alignment services are commonly available on publicly accessible web servers so users need not locally install the applications of interest. The most popular progressive alignment method has been the Clustal family, especially the weighted variant ClustalW.
  • 41. CLUSTAL – w www.ebi.ac.uk/clustalw/ Clustal is progressive MSA program available either as a stand alone or on line program. Clustal is a widely used multiple sequence alignment computer program. The latest version is 2.0. There are two main variations:  ClustalW: command line interface  ClustalX: This version has a graphical user interface. It is available for Windows, Mac OS, and Unix/Linux. This program is available from the Clustal Homepage or European Bioinformatics Institute ftp server. There are three main steps:  Do a pair wise alignment  Create a phylogenetic tree (or use a user-defined tree)  Use the phylogenetic tree to carry out a multiple alignment
  • 44. Step 3: Tree Building (Phylogenetic Methods) Distance based: UPGMA Neighbour-Joining Character based: Maximum Parsimony Maximum Likelihood
  • 46. Distance Based Methods Begins the construction of a tree by calculating the distance between the molecular sequences. Produces one phylogenetic tree Computationally fast
  • 47. Tree Building - Approach The simplest approach -- align pairs of sequences -- count the number of differences. HAMMING DISTANCE - The degree of divergence (D). D=n/N * 100 Where n = number of sites with differences N = Length of Alignment
  • 48. Example of Hamming Distance 1. AGGCCATGAATTAAGAATAA 2. AGCCCATGGATAAAGAGTAA 3. AGGACATGAATTAAGAATAA 4. AAGCCAAGAATTACGAATAA Here, the evolutionary distance is expressed as the number of nucleotide differences for each sequence pair.
  • 49. Distance Matrix-Example 1 2 3 4 1 - 0.2 0.05 0.15 2 - 0.25 0.4 3 - 0.2 4 - Sequences 1 and 2 are 20 nucleotides in length and have four differences, corresponding to an evolutionary difference of 4/20 = 0.2.
  • 50. UPGMA Unweighted – Pair – Group – Method –with Arithmetic mean Oldest Distance Method Proposed by Michener & Sokal in 1957 Produces rooted trees. It assumes that the trees are ultrametric, meaning that it assumes constant rate of substitutions in all branches of the tree.
  • 51. UPGMA - Working Employs a sequential clustering algorithm I. Identify the two OTU’s from among all the OTUs, that are most similar to each other and then treat these as a new single OTU. II. Subsequently from among the new group of OTUs, identify the pair with the highest similarity, and so on.
  • 52. Pros and Cons Advantage Fast Can handle many sequences Disadvantage Cannot be used when rates of substitutions are unequal Does not consider multiple substitutions.
  • 55. Neighbor-Joining method  Developed by Saitou and Nei, 1987  Method for re-constructing phylogenetic trees and computing the lengths of the branches of the tree.  Related to the cluster method  Does not require the data to be ultrametric.  Produces Unrooted Trees.
  • 56. Steps Involved  Starts with all species in a simple star shaped tree.  Least distance pair of OTU’s are joined with a node.  Process is repeated until all other OTU’s are connected to the internal node & only 2 OTU’s are left at the end.
  • 58. The resulting tree will be the following:
  • 59. The resulting tree will be :
  • 61. Pros and Cons  Advantages  is fast and thus suited for large datasets and for bootstrap analysis.  permits lineages with largely different branch lengths.  its suitability to handle large datasets has led to the fact that the method is widely used by molecular evolutionists.  Disadvantages  sequence information is reduced  gives only one possible tree  strongly dependent on the model of evolution used.
  • 62. • Character based methods analyze candidate trees on relationships inferred directly from the sequence alignment. • These approaches are distinct from distance based methods because they do not involve an intermediate summary of the sequence data in the form of a distance methods, although they do depend on them. • There are two character based methods – Maximum parsimony – Maximum likelihood
  • 63. Maximum Parsimony (Fitch, 1977) Parsimony – carefulness in the use of resources. The basic underlying principle behind parsimony is given by Occam’s Razor: “Given a choice between – a hard and easy way of doing things, nature will always pick the easiest way i.e. simple is always preferred over complex.” Parsimony assumes that the relationship that requires the fewest number of mutations to explain the current state of sequences being considered is the relationship that is most likely to be correct.
  • 64. ParsimonyThe concept of parsimony is at the heart of all character- based methods of phylogenetic reconstruction. The 2 fundamental ideas of biological parsimony are: mutations are exceedingly rare events ; the more unlikely events a model invokes, the less likely the model is to be correct. As a result, the relationship that requires the fewest number of mutations to explain the current state of the sequences being considered, is the relationship that is most likely to be correct. As a result, the relationship that requires the fewest number of mutations to explain the current state of the sequences being considered, is the relationship that is most likely to be correct.
  • 65. Example Multiple sequence alignment, for a parsimony approach, contains positions that fall into two categories in terms of their information content : those that have information (are informative) and those that do not (are uninformative). Example: seq 1 2 3 4 5 6 1 G G G G G G 2 G G G A G T 3 G G A T A G 4 G A T C A T Position 1 is said invariant and therefore uninformative, because all trees invoke the same number of mutations (0); Position 2 is uninformative because 1 mutation occurs in all three possible trees; Position 3, 2 mutations occur; Position 4 requires 3 mutations in all possible trees. Positions 5 and 6 are informative, because one of the trees invokes only one mutation and the other 2 alternative trees both require 2 mutations. In general, for a position to be informative regardless of how many sequences are aligned, it has to have at least 2 different nucleotides, and each of these nucleotides has to be present at least twice.
  • 66. G G 1G 2G A4 G3 2 1G 3G A4 G2 G G 1G 4A G3 G2 G G G A 1G 2G T4 A3 3 1G 3A T4 G2 G G 1G 4T A3 G2 G G G T 1G 2A C4 T3 4 1G 3T C4 A2 G A 1G 4C T3 A2 G A G A 1G 2G A4 A3 5 1G 3A A4 G2 G G 1G 4A A3 G2 G G 1 G G 1G 2G G4 G3 G G 1G 3G G4 G2 G G 1G 4G G3 G3 G G 1G 2T T4 G3 6 1G 3G T4 T2 G T 1G 4T T3 T2 G T
  • 67. The maximum parsimony algorithm searches for the minimum number of genetic events (nucleotide substitutions or amino acids changes) to infer the most parsimonious tree from a set of sequences. The best tree is one which needs fewest changes.
  • 68. Positive and negative pointsMaximum Parsimony (positive points): does not reduce sequence information to a single number tries to provide information on the ancestral sequences evaluates different trees Maximum Parsimony (negative points): is slow in comparison with distance methods does not use all the sequence information (only informative sites are used) does not correct for multiple mutations (does not imply a model of evolution) does not provide information on the branch lengths
  • 69. More than one parsimonious tree? By Bootstrapping
  • 70. Bootstrapping Bootstrap analysis: is a statistical method for obtaining an estimate of error. is used to evaluate the reliability of a tree is used to examine how often a particular cluster in a tree appears when nucleotides or aminoacids are resampled.
  • 71. Maximum likelihood This approach is a purely statistical based method. Probabilities are considered for every individual nucleotide substitutions in a set of sequence alignment. Since transitions are observed roughly 3 times as often as transversions; it can be reasonably argued that a greater likelihood exists that the sequence with C and T are more closely related to each other than they are to the sequence with G. Calculation of probabilities is complicated by the fact that the sequence of the common ancestor to the sequences considered being unknown. Furthermore multiple substitutions may have occurred at one or more sites and that all sites are not necessarily independent or equivalent.
  • 72. Contd…..Notes : 1. This is the best justified method from a theoretical viewpoint; 2. ML estimates the branch lengths of the final tree ; 3. ML methods are usually consistent ; 4. Sequence simulation experiments have shown that this method works better than all others in most cases. Drawbacks : they need long computation time to construct a tree.
  • 73. Applications 1. Taxonomy Can infer how to classify organisms Reflects evolution Used for classifications Host and parasite Can look at co-evolution of any two related organisms Examine phylogeny of hosts and parasites Look at how similar the two phylogenies are Can also look at host switching Other examples are plant and insect interactions, humans and pathogens (such as HIV)
  • 74. Advantages and disadvantages of character based methods Advantages MP tries to provide information on the ancestral sequences ML tends to outperform alternative methods such as parsimony or distance methods even with very short sequences Disadvantages Slow in comparison with distance methods MP does not use all the sequence information ML result is dependent on the model of evolution used
  • 75. Conclusion Molecular Phylogeny Important tool to understand the evolution and relationships between various species Increase in molecular data & visual impact of tree has made phylogenetics increasingly important
  • 76. Which method to choose?
  • 77. Applications There are wide array of applications of phylogenetic analysis which include:  Evolution studies  Medical research and epidemology  In ecology  In criminal studies  Finding the orthologs and paralogs
  • 78. 1. Medicinal field - to make predictions about poorly studied species  Phylogenies also allow us to predict about the characteristics of living organisms that we have not yet studied.  Eg: Pacific Yew produces a compound taxol i.e. helpful in treating certain kinds of cancer.  However it was expensive to extract compound from this tree.  From phylogenetic analysis leaves of the European Yew contain a related compound that can also be used to efficiently produce taxol. Pacific yew European yew
  • 79. 2. Evolution studies : Tracking SARS back to its source  The previously unknown SARS virus generated widespread panic in 2002 and 2003.  In 2003,it was suspected that it had jumped from some cat like mammals called civets.  In 2005,researchers found SARS like virus in Chinese horseshoe bats and was concluded to be the culprit by reconstruction of evolutionary history of virus.  SARS virus genetic material RNA was collected from different sources and a tree was generated. A civet A horseshoe bat
  • 80. The tree showed that civet and human SARS viruses are very similar to each other and that both are nested within a clade of bat viruses - so the ancestor of the civet and human strains seems to have been a bat virus.
  • 81. 3. In Criminal Studies : Evidence of HIV-1 transmission  A gastroenterologist was convicted of attempted murder by injecting his former girlfriend with blood or blood-products obtained from an HIV type 1 infected patient under his care.  Phylogenetic analyses of HIV-1 sequences were admitted and used as evidence in this case, representing the first use of phylogenetic analyses in a criminal court case in the United States
  • 82. Phylogenetic analyses of HIV-1 isolated from the victim, the patient, and a local population sample of HIV-1-positive individuals showed the victim's HIV-1 sequences to be most closely related to and nested within a lineage comprised of the patient's HIV-1 sequences
  • 83. 4. Ecology - In Conservation Of Biodiversity  We are losing biodiversity at a rate that will halve the number of species on Earth within the next 100 years.  It is not possible to save every species and its habitat.  So ecosystems with the highest biodiversity should be saved and phylogenetics provide a useful measure of this biodiversity.
  • 85. References Books Fundamental concepts of Bioinformatics  Dan E. Krane Michael L. Raymer Statistical methods in Bioinformatics Warren J. Ewens and Gregory R. Grant Research papers Tutorial phylogenetic tree estimation Junhyong Kim

Editor's Notes

  1. A phylogeny is a tree representation for the evolutionary history relating the species we are interested in. This is an example of a 13-species phylogeny. At each leaf of the tree is a species – we also call it a taxon in phylogenetics (plural form is taxa). They are all distinct. Each internal node corresponds to a speciation event in the past. When reconstructing the phylogeny we compare the characteristics of the taxa, such as their appearance, physiological features, or the composition of the genetic material.
  2. 1- position 1 on the multiple alignment All 4 positions have the same base « G » and the positio is said invariant. Invariant sites are uninformative, because each of the three sequences invokes exactly the same number of mutations (0). 2- Position 2 is uninformative from the parsimony perspective because one mutation occurs in all three of the possible trees. 3- Position 3 is uninformative because all trees require two mutations; 4- is uninformative because all three trees require three mutations. In contrast: , positions 5 and 6 are both informative because for both of them, one of the three trees invokes only one mutation and the other two alternative trees both require two. Informative sites are those that allow one of the trees (see next slide) to be distinguished from the other two on the basis of how many mutations they must invoke. In general, for a position to be informative regardless of how many sequences are aligned, it has to have at least two different nucleotides, and each of these nucleotides has to be present at least twice. All parsimony programs begin by applu-ying this fairly simple rule to the data set being analysed. Notice that four of the six positions being considered in the alignment shown are simply discarded and not considered any further in a parsimony analysis. All of those sites would have contributed to the pairwise similarity scores used by a distance based approach, and this diference alone can generate substantial differences in the conclusions reached by both types of approaches. Krane & Rayner, 2002.