SlideShare une entreprise Scribd logo
1  sur  67
Télécharger pour lire hors ligne
Lecture 11
Phylogenetic trees
Principles of Computational
Biology
Teresa Przytycka, PhD
Phylogenetic (evolutionary) Tree
• showing the evolutionary relationships among
various biological species or other entities that
are believed to have a common ancestor.
• Each node is called a taxonomic unit.
• Internal nodes are generally called hypothetical
taxonomic units
• In a phylogenetic tree, each node with
descendants represents the most recent
common ancestor of the descendants, and the
• edge lengths (if present) correspond to time
estimates.
Methods to construct phylogentic
trees
• Parsimony
• Distance matrix based
• Maximum likelihood
Parsimony methods
The preferred evolutionary tree is
the one that requires
“the minimum net amount of evolution”
[Edwards and Cavalli-Sforza, 1963]
Assumption of character based
parsimony
• Each taxa is described by a set of characters
• Each character can be in one of finite number
of states
• In one step certain changes are allowed in
character states
• Goal: find evolutionary tree that explains the
states of the taxa with minimal number of
changes
Example
Taxon1 Yes Yes No
Taxon 2 YES Yes Yes
Taxon 3 Yes No No
Taxon 4 Yes No No
Taxon 5 Yes No Yes
Taxon 6 No No Yes
Ancestral
states
4 Changes
Version parsimony models:
• Character states
– Binary: states are 0 and 1 usually interpreted as presence or
absence of an attribute (eg. character is a gene and can be
present or absent in a genome)
– Multistate: Any number of states (Eg. Characters are
position in a multiple sequence alignment and states are
A,C,T,G.
• Type of changes:
– Characters are ordered (the changes have to happen in
particular order or not.
– The changes are reversible or not.
Variants of parsimony
• Fitch Parsimony unordered, multistate characters with
reversibility
• Wagner Parsimony ordered, multistate characters
with reversibility
• Dollo Parsimony ordered, binary characters with
reversibility but only one insertion allowed per
character characters that are relatively chard to gain but easy
to lose (like introns)
• Camin-Sokal Parsimony- no reversals, derived states
arise once only
• (binary) prefect phylogeny – binary and non-
reversible; each character changes at most once.
Prefect – No
(triangle gained
and the lost)
Dollo – Yes
Camin-Sokal –
No (for the same
reason as perfect)
3 Changes
Camin-Sokal Parsimony
Triangle inserted twice!
But this is
not prefect
and not
Dollo
Homoplasy
Having some states arise more than once is
called homoplasy.
Example – triangle in the tree on the
previous slide
Finding most parsimonious tree
• There are exponentially many trees with n
nodes
• Finding most parsimonious tree is NP-
complete (for most variants of parsimony
models)
• Exception: Perfect phylogeny if exists can
be found quickly. Problem – perfect
phylogeny is to restrictive in practice.
Perfect phylogeny
• Each change can happen only once and is
not reversible.
• Can be directed or not
Example: Consider binary characters. Each character
corresponds to a gene.
0-gene absent
1-gene present
It make sense to assume directed changes only form 0 to 1.
The root has to be all zeros
Perfect phylogeny
Example: characters = genes; 0 = absent ; 1 = present
Taxa: genomes (A,B,C,D,E)
A 0 0 0 1 1 0
B 1 1 0 0 0 0
C 0 0 0 1 1 1
D 1 0 1 0 0 0
E 0 0 0 1 0 0
genes
B D E A C
Perfect phylogeny tree
Goal: For a given character state matrix construct a tree topology
that provides perfect phylogeny.
1
1
1
1 1
1
Does there exist prefect
parsimony tree for our example
with geometrical shapes?
There is a simple test
Character Compatibility
• Two characters A, B are
compatible if there do
not exits four taxa
containing all four
combinations as in the
table
• Fact: there exits perfect
phylogeny if and only if
and only if all pairs of
characters are
compatible
T1 1 1
T2 1 0
T3 0 1
T4 0 0
A B
Are not compatible
Taxon1 Yes Yes No
Taxon 2 YES Yes Yes
Taxon 3 Yes No No
Taxon 4 Yes No No
Taxon 5 Yes No Yes
Taxon 6 No No Yes
?
One cannot add triangle
to the tree so that no
character changes it state
twice:
If we add it to on of the
left branches it will be
inserted twice if to the
right most – circle would
have to be deleted
(insertion and the
deletion of the circle)
Ordered characters and perfect
phylogeny
• Assume that we in the last common
ancestor all characters had state 0.
• This assumption makes sense for many
characters, for example genes.
• Then compatibility criterion is even
simpler: characters are compatible if and
only if there do not exist three taxa
containing combinations (1,0),(0,1),(1,1)
Example
Under assumption that states are directed
form 0 to 1: if i and j are two different genes
then the set of species containing i is either
disjoint with set if species containing j or
one of this sets contains the other.
A 0 0 0 1 1 0
B 1 1 0 0 0 0
C 0 0 0 1 1 1
D 1 0 1 0 0 0
E 0 0 0 1 0 0
• The above property is necessary and sufficient for prefect
phylogeny under 0 to 1 ordering
• Why works: associated with each character is a subtree.
These subtrees have to be nested.
Simple test for prefect phylogeny
• Fact: there exits perfect phylogeny if and only if and
only if all pairs of characters are compatible
• Special case: if we assume directed parsimony (0!1
only) then characters are compatible if and only if
there do not exist tree taxa containing combinations
(1,0),(0,1),(1,1)
• Observe the last one is equivalent to non-overlapping
criterion
• Optimal algorithm: Gusfield O(nm):
n = # taxa; m= #characters
Two version optimization problem:
Small parsimony: Tree is given and we want to find the
labeling that minimizes #changes – there are good
algorithms to do it.
Large parsimony: Find the tree that minimize number of
evolutionary changes. For most models NP complete
One approach to large parsimony requires:
- generating all possible trees
- finding optimal labeling of internal nodes for each tree.
Fact 1: #tree topologies grows exponentially with #nodes
Fact 2: There may be many possible labels leading to the
same score.
Clique method for large parsimony
• Consider the following
graph:
– nodes – characters;
– edge if two characters
are compatible
1 2 3 4 5 6
α	

 1 0 0 1 1 0
β	

 0 0 1 0 0 0
γ	

 1 1 0 0 0 0
δ	

 1 1 0 1 1 1
ε	

 0 0 1 1 1 0
ω	

 0 0 0 0 0 0
characters
2
1
3
4
5
6
3,5 INCOMPATIBLE
Max. compatible set
Clique method (Meacham 1981) -
• Find maximal compatible clique (NP-
complete problem)
• Each characters defines a partition of the
set into two subsets
α,β,γ,	

δ,ε,ω	

α,γ,	

δ	

β	

ε,ω	

1 2
β	

ε,ω	

γ,	

δ	

α,	

	

ε,ω	

γ,	

δ	

α,	

3
β
Small parsimony
• Assumptions: the tree is known
• Goal: find the optimal labeling of the tree
(optimal = minimizing cost under given
parsimony assumption)
“Small” parsimony
Infer nodes labels
Application of small parsimony
problem
• errors in data
• loss of function
• convergent evolution (a trait
developed independently by two
evolutionary pathways e.g. wings
in birds an bats)
• lateral gene transfer (transferring
genes across species not by
inheritance)
From paper: Are There Bugs in Our Genome:
Anderson, Doolittle, Nesbo, Science 292 (2001) 1848-51
Red – gene encoding
N-acetylneuraminate
lyase
Dynamic programming algorithm for small
parsimony problem
• Sankoff (1975) comes with the DP approach
(Fitch provided an earlier non DP algorithm)
• Assumptions
– one character with multiple states
- The cost of change from state v to w is δ(v,w)
(note that it is a generalization, so far we talk about
cost of any change equal to 1)
DP algorithm continued
st(v) = minimum parsimony cost for node v under
assumption that the character state is t.
st(v) = 0 if v is a leaf. Otherwise let u, w be children
of u
st(v) = min i {si(u)+ δ(i,t)}+ min j {sj(w)+ δ(j,t)}
t
δ(i,t) δ(j,t)
u w
Try all possible states in u and v
O(nk) cost where n=number of nodes
k = number of sates
Exercise
1 2
5
4
6
7
3
St(v) 1 2 3 4 5 6 7
t
Left and right characters are independent,
We will compute the left.
Branch lengths
• Numbers that indicate the
number of changes in each
branch
• Problem – there may by many
most parsimonious trees
• Method 1: Average over all
most parsimonious trees.
• Still a problem – the branch
lengths are frequently
underestimated
1
4
6
1
Character patterns and parsimony
• Assume 2 state characters (0/1) and four
taxa A,B,C,D
• The possible topologies are:
A
A A
B
B
B
C
C C
D D D
A B C D
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
Changes in each topology
0, 0, 0
1, 1, 1
1, 1, 1
1, 2, 2
Informative character
(helps to decide the tree topology)
Informative characters: xxyy, xyxy,xyyx
Inconsistency
• Let p, q character change probability
• Consider the three informative patters xxyy, xyxy,
xyyx
• The tree selected by the parsimony depends
which pattern has the highest fraction;
• If q(1-q) < p2 then the most frequent pattern is
xyxy leading to incorrect tree.
p
p
q
q
q
A
B
C
D
Distance based methods
• When two sequences are similar they are
likely to originate from the same ancestor
• Sequence similarity can approximate
evolutionary distances
GAATC GAGTT
GA(A/G)T(C/T)
Distance Method
• Assume that for any pair of species we
have an estimation of evolutionary
distance between them
– eg. alignment score
• Goal: construct a tree which best
approximates these distance
Tree from distance matrix
A B
C D
E
1 1
3
2 2
1
3
5
A B C D E
0
0
0
0
0
2 7 7 12
2 7 7 12
7 7 4 11
7 7 4 11
11
11
12
12
A
B
C
D
E
length of the path from A to D = 1+3+1+2=7
Consider weighted trees: w(e) = weight of edge e
Recall: In a tree there is a unique path between any two nodes.
Let e1,e2,…ek be the edges of the path connecting u and v then the
distance between u and v in the tree is:
d(u,v) = w(e1) + w(e2) + … + w(ek)
M T
Can one always represent a
distance matrix as a weighted
tree?
0 10 5 10
10 0 9 5
5 9 0 8
10 5 8 0
a c
b
3
2
7
a
b
c
d
a b c d
d
?
There is no way to add d
to the tree and preserve
the distances
Quadrangle inequality
• Matrix that satisfies quadrangle inequality (called
also the four point condition) for every four taxa is
called additive.
• Theorem: Distance matrix can be represented
precisely as a weighted tree if and only if it is
additive.
a c
b d
d(a,c) + d(b,d) = d(a,d) + d(b,c) >= d(a,b) + d(d,c)
Constructing the tree representing an additive
matrix (one of several methods)
1. Start form 2-leaf tree a,b where a,b are
any two elements
2. For i = 3 to n (iteratively add vertices)
1. Take any vertex z not yet in the tree
and consider 2 vertices x,y that are
in the tree and compute
d(z,c) = (d(z,x) + d(z,y) - d(x,y) )/2
d(x,c) = (d(x,z) + d(x,y) – d(y,z))/2
2. From step 1 we know position of c
and the length of brunch (c,z).
If c did not hit exactly a brunching
point add c and z
else take as y any node from sub-tree
that brunches at c and repeat steps
1,2.
x
y
x y
c
z
x
y
z
Example
0 10 5 9
10 0 9 5
5 9 0 9
9 5 9 0
u
u
v
x
y
u v x y
v
10
Adding x:
d(x,c) = (d(u,x) + d(v,x) – d(u,v))/2 = (5+9-10)/2= 2
d(u,c) = (d(u,x) + d(u,v) - d(x,v))/2 = (5+10-9)/2 = 3
u v
x
3
2
7
c
Adding y:
d(y,c’) = (d(u,y) + d(v,y) – d(u,w))/2 = (5+9-10)/2= 2
d(u,c’) = (d(u,y) + d(u,v) - d(y,v))/2 = (10+9-5)/2 = 7
u v
x
3
2
3
c c’
y
2
4
Real matrices are almost never
additive
• Finding a tree that minimizes the error
Optimizing the error is hard
• Heuristics:
– Unweighted Pair Group Method with
Arithmetic Mean (UPGMA)
– Neighborhood Joining (NJ)
Hierarchical Clustering
• Clustering problem: Group items (e.g. genes) with
similar properties (e.g. expression pattern, sequence
similarity) so that
– The clusters are homogenous (the items in each cluster are
highly similar, as measured by the given property –
sequence similarity, expression pattern)
– The clusters are well separated (the items in different
clusters are different)
• Hierarchical clustering Many clusters have natural
sub-clusters which are often easier to identify e.g. cuts
are sub-cluster of carnivore sub-cluster of mammals
Organize the elements into a tree rather than forming
explicit portioning
The basic algorithm
Input: distance array d; cluster to cluster distance function
Initialize:
1. Put every element in one-element cluster
2. Initialize a forest T of one-node trees (each tree
corresponds to one cluster)
while there is more than on cluster
1. Find two closest clusters C1 and C2 and merge them into
C
2. Compute distance from C to all other clusters
3. Add new vertex corresponding to C to the forest T and
make nodes corresponding to C1, C2 children of this
node.
4. Remove from d columns corresponding to C1,C2
5. Add to d column corresponding to C
A distance function
• dave(C1,C2) = 1/(|C1||C2|)Σ d(x,y)
x in C1 y in C2
Average over all distances
***
Example (on blackboard)
A B C D E
0
0
0
0
0
2 7 7 12
2 7 7 12
7 7 4 11
7 7 4 11
11
11
12
12
A
B
C
D
E
d
Unweighted Pair Group Method with
Arithmetic Mean (UPGMA)
• Idea:
– Combine hierarchical clustering with a method
to put weights on the edges
– Distance function used:
dave(C1,C2) = 1/(|C1||C2|)Σ d(x,y)
– We need to come up with a method of
computing brunch lengths
x in C1
y in C2
Ultrametric trees
• The distance from
any internal node C
to any of its leaves
is constant and
equal to h(C)
• For each node (v)
we keep variable h –
height of the node in
the tree. h(v) = 0 for
all leaves.
UPGMA algorithm
Initialization (as in hierarchical clustering); h(v) = 0
while there is more than on cluster
1. Find two closest clusters C1 and C2 and
merge them into C
2. Compute dave from C to all other clusters
3. Add new vertex corresponding to C to the
forest T and make nodes corresponding to C1,
C2 children of this node.
4. Remove from d columns corresponding to
C1,C2
5. Add to d column corresponding to C
6. h(C) = d(C1, C2) /2
7. Assign length h(C)-h(C1) to edge (C1,C)
8. Assign length h(C)-h(C2) to edge (C2,C)
Neighbor Joining
• Idea:
– Construct tree by
iteratively combing first
nodes that are neighbors
in the tree
• Trick: Figuring out a pair
of neighboring vertices
takes a trick – the closest
pair want always do:
• B and C are the closest
but are NOT neighbors.
B
A
C
D
0 5 7 10
5 0 4 7
7 4 0 5
10 7 5 0
A
B
C
D
2
4
4
1
Finding Neighbors
• Let u(C) = 1/(#clusters-2)Σ d(C,C’)
• Find a pair C1C2 that minimizes
f(C1,C2)= d(C1,C2)-(u(C1)+u(C2))
• Motivation: keep d(C1,C2) small while
(u(C1)+u(C2)) large
all clusters C’
Finding Neighbors
• Let u(C) = 1/(#clusters-2)Σ d(C,C’)
• Find a pair C1C2 that minimizes
f(C1,C2)= d(C1,C2)-(u(C1)+u(C2))
• For the data from example:
u(CA) = u(CD) = 1/2(5+7+10) = 11
u(CB) = u(CC) = 1/2(5+4+7) = 8
f(CA,CB) = 5-11 -8 = -14
f(CB,CC) = 4- 8 -8 = -12
all clusters C’
NJ algorithm
Initialization (as in hierarchical clustering); h(v) = 0
while there is more than on cluster
1. Find clusters C1 and C2 minimizing f(C1C2) and
merge then into C
2. Compute for all C*: d(C,C*) = (d(C1C)+ d(C2C))/2
3. Add new vertex corresponding to C to the forest T
and connect it to C1, C2
4. Remove from d columns corresponding to C1,C2
5. Add to d column corresponding to C
6. Assign length ½(d(C1C2)+u(C1)-u(C2) to edge C1C
7. Assign length ½(d(C1C2)+u(C2)-u(C1) to edge C2C
NJ tree is not rooted
The order of construction of internal nodes of NJ does not suggest an
ancestral relation:
1 2 3 4 5
Rooting a tree
• Choose one distant organism as an out-
group
Species of interests
out-group
root
Bootstraping
• Estimating confidence in the tree topology
•Are we sure if this is
correct?
•Is there enough
evidence that A is a
successor of B not the
other way around? A
B
Bootstrapping, continued
• Assume that the tree is build form multiple
sequence alignment
A
B
1 2 3 4 5 67 8 9 10 11
Columns
of the alignment
Select columns randomly
(with replacement)
13 2 1 10 6 5 4 5 0 11
Initial tree
A
B
New tree
Repeat, say 1000 times,
For each edge of initial tree calculate % times
it is present in the new tree
59%
Summary
• Assume you have multiple alignment of length N.
Let T be the NJ tree build from this alignment
• Repeat, say 1000 times the following process:
– Select randomly with replacement N columns of the
alignment to produce a randomized alignment
– Build the tree for this randomized alignment
• For each edge of T report % time it was present
in a tree build form randomized alignment . This
is called the bootstrap value.
• Trusted edges: 80% or better bootstrap.
Maximum Likelihood Method
• Given is a multiple sequence alignment
and probabilistic model of for substitutions
(like PAM model) find the tree which has
the highest probability of generating the
data.
• Simplifying assumptions:
– Positions involved independently
– After species diverged they evolve
independently.
Formally:
• Find the tree T such that assuming evolution
model M
Pr[Data| T,M] is maximized
• From the independence of symbols:
Pr[Data| T,M] = P i Pr[Di| T,M]
Where the product is taken over all characters i
and Di value of the character i is over all taxa
Computing Pr[Di| T,M]
Pr[Di| T,M]
= Σ x Σ y Σ z
p(x)p(x,y,t1)p(y,A,t2)p(y,B,t6)p(x,z,t3)p(z,C,t4)p(z,D,t5)
A B C D
Column Di
x
y z
consider all possible
assignments here
t1
t2
t3
t4
t5 time
p(x,y,t) = prob. of
mutation x to y in time t
(from the model)
t6
Discovering the tree of life
• “Tree of life” – evolutionary tree of all
organisms
• Construction: choose a gene universally present
in all organisms; good examples: small rRNA
subunit, mitochondrial sequences.
• Things to keep in mind while constructing tree of
life from sequence distances:
– Lateral (or horizontal) gene transfer
– Gene duplication: genome may contain similar genes
that may evolve using different pathways. Phylogeny
tree need to be derived based on orthologous genes.
Where we go with it…
• We now how to compute for given column
and given tree Pr[Di| T,M]
• Sum up over all columns to get
Pr[Data| T,M]
Now, explore the space of possible trees
Problem:
• Bad news: the space of all possible trees is HUGE
Various heuristic approaches are used.
Metropolis algorithm:Random Walk
in Energy Space
Goal design transition probabilities so that the
probability of arriving at state j is
P(j) = q(j) / Z
(typically, q(S) = e
–E(S)/kT
, where E – energy)
Z – partition function = sum over all states S of
terms q(s). Z cannot be computed analytically since the
space is to large
State i
(in our case tree)
State j
(in our case another tree)
Tree i can be obtained
from tree j by a “local” change
The network
of possible
trees, there is
an edge
between
“similar
trees”
temperature
const.
Monte Carlo, Metropolis
Algorithm
• At each state i choose uniformly at random
one of neighboring conformations j.
• Compute p(i,j) = min(1, q(i)/q(j) )
• With probability p(i,j) move to state j.
• Iterate
MrBayes
• Program developed by J.Huelsenbeck &
F.Ronquist.
• Assumption:
• q(i) = Pr[D| Ti,M]
Prior probabilities: All trees are equally
likely.
• Proportion of time a given tree is visited
approximates posterior probabilities.
Most Popular Phylogeny Software
• PAUP
• PHYLIP

Contenu connexe

Tendances

Forward and reverse genetics
Forward and reverse geneticsForward and reverse genetics
Forward and reverse geneticsSachin Ekatpure
 
Phylogenetic prediction - maximum parsimony method
Phylogenetic prediction - maximum parsimony methodPhylogenetic prediction - maximum parsimony method
Phylogenetic prediction - maximum parsimony methodAfnan Zuiter
 
Introduction to sequence alignment partii
Introduction to sequence alignment partiiIntroduction to sequence alignment partii
Introduction to sequence alignment partiiSumatiHajela
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-naveed ul mushtaq
 
Phylogenetic Tree, types and Applicantion
Phylogenetic Tree, types and Applicantion Phylogenetic Tree, types and Applicantion
Phylogenetic Tree, types and Applicantion Faisal Hussain
 
Chou fasman algorithm for protein structure prediction
Chou fasman algorithm for protein structure predictionChou fasman algorithm for protein structure prediction
Chou fasman algorithm for protein structure predictionRoshan Karunarathna
 
Secondary protein structure prediction
Secondary protein structure predictionSecondary protein structure prediction
Secondary protein structure predictionSiva Dharshini R
 
multiple sequence alignment
multiple sequence alignmentmultiple sequence alignment
multiple sequence alignmentharshita agarwal
 
Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)AnkitTiwari354
 
Functional genomics, and tools
Functional genomics, and toolsFunctional genomics, and tools
Functional genomics, and toolsKAUSHAL SAHU
 
Gene prediction and expression
Gene prediction and expressionGene prediction and expression
Gene prediction and expressionishi tandon
 
Aflp (amplified fragment length polymorphism), alu
Aflp (amplified fragment length polymorphism), aluAflp (amplified fragment length polymorphism), alu
Aflp (amplified fragment length polymorphism), aluJannat Iftikhar
 
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSTRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSHEETHUMOLKS
 

Tendances (20)

Forward and reverse genetics
Forward and reverse geneticsForward and reverse genetics
Forward and reverse genetics
 
Phylogenetic prediction - maximum parsimony method
Phylogenetic prediction - maximum parsimony methodPhylogenetic prediction - maximum parsimony method
Phylogenetic prediction - maximum parsimony method
 
Introduction to sequence alignment partii
Introduction to sequence alignment partiiIntroduction to sequence alignment partii
Introduction to sequence alignment partii
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 
Phylogenetic Tree, types and Applicantion
Phylogenetic Tree, types and Applicantion Phylogenetic Tree, types and Applicantion
Phylogenetic Tree, types and Applicantion
 
Chou fasman algorithm for protein structure prediction
Chou fasman algorithm for protein structure predictionChou fasman algorithm for protein structure prediction
Chou fasman algorithm for protein structure prediction
 
DNA Footprinting
DNA Footprinting DNA Footprinting
DNA Footprinting
 
Secondary protein structure prediction
Secondary protein structure predictionSecondary protein structure prediction
Secondary protein structure prediction
 
multiple sequence alignment
multiple sequence alignmentmultiple sequence alignment
multiple sequence alignment
 
Homology modeling
Homology modelingHomology modeling
Homology modeling
 
Upgma
UpgmaUpgma
Upgma
 
Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)
 
Kegg
KeggKegg
Kegg
 
Seq alignment
Seq alignment Seq alignment
Seq alignment
 
Functional genomics, and tools
Functional genomics, and toolsFunctional genomics, and tools
Functional genomics, and tools
 
Genomic databases
Genomic databasesGenomic databases
Genomic databases
 
Gene prediction and expression
Gene prediction and expressionGene prediction and expression
Gene prediction and expression
 
Data mining
Data miningData mining
Data mining
 
Aflp (amplified fragment length polymorphism), alu
Aflp (amplified fragment length polymorphism), aluAflp (amplified fragment length polymorphism), alu
Aflp (amplified fragment length polymorphism), alu
 
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSTRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
 

Similaire à maximum parsimony.pdf

Interpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasetsInterpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasetsJoe Parker
 
BTC 506 Phylogenetic Analysis.pptx
BTC 506 Phylogenetic Analysis.pptxBTC 506 Phylogenetic Analysis.pptx
BTC 506 Phylogenetic Analysis.pptxChijiokeNsofor
 
10 logic+programming+with+prolog
10 logic+programming+with+prolog10 logic+programming+with+prolog
10 logic+programming+with+prologbaran19901990
 
Logic programming (1)
Logic programming (1)Logic programming (1)
Logic programming (1)Nitesh Singh
 
EVE161: Microbial Phylogenomics - Class 4 - Phylogeny
EVE161: Microbial Phylogenomics - Class 4 - PhylogenyEVE161: Microbial Phylogenomics - Class 4 - Phylogeny
EVE161: Microbial Phylogenomics - Class 4 - PhylogenyJonathan Eisen
 
Data mining, transfer and learner corpora: Using data mining to discover evid...
Data mining, transfer and learner corpora: Using data mining to discover evid...Data mining, transfer and learner corpora: Using data mining to discover evid...
Data mining, transfer and learner corpora: Using data mining to discover evid...Steve Pepper
 
Lecture 02 (2 04-2021) phylogeny
Lecture 02 (2 04-2021) phylogenyLecture 02 (2 04-2021) phylogeny
Lecture 02 (2 04-2021) phylogenyKristen DeAngelis
 
Digital Image Processing.pptx
Digital Image Processing.pptxDigital Image Processing.pptx
Digital Image Processing.pptxMukhtiarKhan5
 
Bioinformatics presentation shabir .pptx
Bioinformatics presentation shabir .pptxBioinformatics presentation shabir .pptx
Bioinformatics presentation shabir .pptxshabirhassan4585
 
Bls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBruno Mmassy
 
Lecture10.pdf
Lecture10.pdfLecture10.pdf
Lecture10.pdftmmwj1
 
Microbial Phylogenomics (EVE161) Class 4
Microbial Phylogenomics (EVE161) Class 4Microbial Phylogenomics (EVE161) Class 4
Microbial Phylogenomics (EVE161) Class 4Jonathan Eisen
 
Suffix Tree and Suffix Array
Suffix Tree and Suffix ArraySuffix Tree and Suffix Array
Suffix Tree and Suffix ArrayHarshit Agarwal
 
Genetic algorithms
Genetic algorithmsGenetic algorithms
Genetic algorithmsguest9938738
 
local and global allignment
local and global allignmentlocal and global allignment
local and global allignmentSHOBI KHAKHI
 
Dirichlet processes and Applications
Dirichlet processes and ApplicationsDirichlet processes and Applications
Dirichlet processes and ApplicationsSaurav Jha
 

Similaire à maximum parsimony.pdf (20)

Interpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasetsInterpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasets
 
BTC 506 Phylogenetic Analysis.pptx
BTC 506 Phylogenetic Analysis.pptxBTC 506 Phylogenetic Analysis.pptx
BTC 506 Phylogenetic Analysis.pptx
 
10 logic+programming+with+prolog
10 logic+programming+with+prolog10 logic+programming+with+prolog
10 logic+programming+with+prolog
 
Logic programming (1)
Logic programming (1)Logic programming (1)
Logic programming (1)
 
Module_5_1.pptx
Module_5_1.pptxModule_5_1.pptx
Module_5_1.pptx
 
EVE161: Microbial Phylogenomics - Class 4 - Phylogeny
EVE161: Microbial Phylogenomics - Class 4 - PhylogenyEVE161: Microbial Phylogenomics - Class 4 - Phylogeny
EVE161: Microbial Phylogenomics - Class 4 - Phylogeny
 
Data mining, transfer and learner corpora: Using data mining to discover evid...
Data mining, transfer and learner corpora: Using data mining to discover evid...Data mining, transfer and learner corpora: Using data mining to discover evid...
Data mining, transfer and learner corpora: Using data mining to discover evid...
 
Lecture 02 (2 04-2021) phylogeny
Lecture 02 (2 04-2021) phylogenyLecture 02 (2 04-2021) phylogeny
Lecture 02 (2 04-2021) phylogeny
 
Digital Image Processing.pptx
Digital Image Processing.pptxDigital Image Processing.pptx
Digital Image Processing.pptx
 
Bioinformatics presentation shabir .pptx
Bioinformatics presentation shabir .pptxBioinformatics presentation shabir .pptx
Bioinformatics presentation shabir .pptx
 
Bls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBls 303 l1.phylogenetics
Bls 303 l1.phylogenetics
 
Lecture10.pdf
Lecture10.pdfLecture10.pdf
Lecture10.pdf
 
07_Phylogeny_2022.pdf
07_Phylogeny_2022.pdf07_Phylogeny_2022.pdf
07_Phylogeny_2022.pdf
 
Dot plots-1.ppt
Dot plots-1.pptDot plots-1.ppt
Dot plots-1.ppt
 
Microbial Phylogenomics (EVE161) Class 4
Microbial Phylogenomics (EVE161) Class 4Microbial Phylogenomics (EVE161) Class 4
Microbial Phylogenomics (EVE161) Class 4
 
Suffix Tree and Suffix Array
Suffix Tree and Suffix ArraySuffix Tree and Suffix Array
Suffix Tree and Suffix Array
 
Genetic algorithms
Genetic algorithmsGenetic algorithms
Genetic algorithms
 
local and global allignment
local and global allignmentlocal and global allignment
local and global allignment
 
Dirichlet processes and Applications
Dirichlet processes and ApplicationsDirichlet processes and Applications
Dirichlet processes and Applications
 
Trees ayaz
Trees ayazTrees ayaz
Trees ayaz
 

Plus de SrimathideviJ

Electro magnetic radiation principles.pdf
Electro magnetic radiation principles.pdfElectro magnetic radiation principles.pdf
Electro magnetic radiation principles.pdfSrimathideviJ
 
Lecture 4 Xray diffraction unit 5.pptx
Lecture 4 Xray diffraction unit 5.pptxLecture 4 Xray diffraction unit 5.pptx
Lecture 4 Xray diffraction unit 5.pptxSrimathideviJ
 
14-th-PPT-of-Foods-and-Industrial-MicrobiologyCourse-No.-DTM-321.pdf
14-th-PPT-of-Foods-and-Industrial-MicrobiologyCourse-No.-DTM-321.pdf14-th-PPT-of-Foods-and-Industrial-MicrobiologyCourse-No.-DTM-321.pdf
14-th-PPT-of-Foods-and-Industrial-MicrobiologyCourse-No.-DTM-321.pdfSrimathideviJ
 
Carcinogenesis ppt.ppt
Carcinogenesis ppt.pptCarcinogenesis ppt.ppt
Carcinogenesis ppt.pptSrimathideviJ
 
2.5 Co-Transport (1).pptx
2.5 Co-Transport (1).pptx2.5 Co-Transport (1).pptx
2.5 Co-Transport (1).pptxSrimathideviJ
 
strain improvement-ppt unit 2.pptx
strain improvement-ppt unit 2.pptxstrain improvement-ppt unit 2.pptx
strain improvement-ppt unit 2.pptxSrimathideviJ
 
ch 3 proteins structure ppt.ppt
ch 3 proteins structure ppt.pptch 3 proteins structure ppt.ppt
ch 3 proteins structure ppt.pptSrimathideviJ
 
Basics in cancer biology.pdf
Basics in cancer biology.pdfBasics in cancer biology.pdf
Basics in cancer biology.pdfSrimathideviJ
 
apoptosis lecture-powerpoint slides (1) (1).pdf
apoptosis lecture-powerpoint slides (1) (1).pdfapoptosis lecture-powerpoint slides (1) (1).pdf
apoptosis lecture-powerpoint slides (1) (1).pdfSrimathideviJ
 
radioisotopetechnique-161107054511.pdf
radioisotopetechnique-161107054511.pdfradioisotopetechnique-161107054511.pdf
radioisotopetechnique-161107054511.pdfSrimathideviJ
 
database retrival.pdf
database retrival.pdfdatabase retrival.pdf
database retrival.pdfSrimathideviJ
 

Plus de SrimathideviJ (13)

Electro magnetic radiation principles.pdf
Electro magnetic radiation principles.pdfElectro magnetic radiation principles.pdf
Electro magnetic radiation principles.pdf
 
chromatography.pptx
chromatography.pptxchromatography.pptx
chromatography.pptx
 
Lecture 4 Xray diffraction unit 5.pptx
Lecture 4 Xray diffraction unit 5.pptxLecture 4 Xray diffraction unit 5.pptx
Lecture 4 Xray diffraction unit 5.pptx
 
14-th-PPT-of-Foods-and-Industrial-MicrobiologyCourse-No.-DTM-321.pdf
14-th-PPT-of-Foods-and-Industrial-MicrobiologyCourse-No.-DTM-321.pdf14-th-PPT-of-Foods-and-Industrial-MicrobiologyCourse-No.-DTM-321.pdf
14-th-PPT-of-Foods-and-Industrial-MicrobiologyCourse-No.-DTM-321.pdf
 
Carcinogenesis ppt.ppt
Carcinogenesis ppt.pptCarcinogenesis ppt.ppt
Carcinogenesis ppt.ppt
 
2.5 Co-Transport (1).pptx
2.5 Co-Transport (1).pptx2.5 Co-Transport (1).pptx
2.5 Co-Transport (1).pptx
 
strain improvement-ppt unit 2.pptx
strain improvement-ppt unit 2.pptxstrain improvement-ppt unit 2.pptx
strain improvement-ppt unit 2.pptx
 
ch 3 proteins structure ppt.ppt
ch 3 proteins structure ppt.pptch 3 proteins structure ppt.ppt
ch 3 proteins structure ppt.ppt
 
Basics in cancer biology.pdf
Basics in cancer biology.pdfBasics in cancer biology.pdf
Basics in cancer biology.pdf
 
apoptosis lecture-powerpoint slides (1) (1).pdf
apoptosis lecture-powerpoint slides (1) (1).pdfapoptosis lecture-powerpoint slides (1) (1).pdf
apoptosis lecture-powerpoint slides (1) (1).pdf
 
radioisotopetechnique-161107054511.pdf
radioisotopetechnique-161107054511.pdfradioisotopetechnique-161107054511.pdf
radioisotopetechnique-161107054511.pdf
 
phylogenetics.pdf
phylogenetics.pdfphylogenetics.pdf
phylogenetics.pdf
 
database retrival.pdf
database retrival.pdfdatabase retrival.pdf
database retrival.pdf
 

Dernier

CS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfCS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfBalamuruganV28
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSneha Padhiar
 
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmDeepika Walanjkar
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfisabel213075
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfChristianCDAM
 
Robotics Group 10 (Control Schemes) cse.pdf
Robotics Group 10  (Control Schemes) cse.pdfRobotics Group 10  (Control Schemes) cse.pdf
Robotics Group 10 (Control Schemes) cse.pdfsahilsajad201
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosVictor Morales
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfManish Kumar
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 

Dernier (20)

CS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfCS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdf
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdf
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdf
 
Robotics Group 10 (Control Schemes) cse.pdf
Robotics Group 10  (Control Schemes) cse.pdfRobotics Group 10  (Control Schemes) cse.pdf
Robotics Group 10 (Control Schemes) cse.pdf
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitos
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Community
 

maximum parsimony.pdf

  • 1. Lecture 11 Phylogenetic trees Principles of Computational Biology Teresa Przytycka, PhD
  • 2. Phylogenetic (evolutionary) Tree • showing the evolutionary relationships among various biological species or other entities that are believed to have a common ancestor. • Each node is called a taxonomic unit. • Internal nodes are generally called hypothetical taxonomic units • In a phylogenetic tree, each node with descendants represents the most recent common ancestor of the descendants, and the • edge lengths (if present) correspond to time estimates.
  • 3. Methods to construct phylogentic trees • Parsimony • Distance matrix based • Maximum likelihood
  • 4. Parsimony methods The preferred evolutionary tree is the one that requires “the minimum net amount of evolution” [Edwards and Cavalli-Sforza, 1963]
  • 5. Assumption of character based parsimony • Each taxa is described by a set of characters • Each character can be in one of finite number of states • In one step certain changes are allowed in character states • Goal: find evolutionary tree that explains the states of the taxa with minimal number of changes
  • 6. Example Taxon1 Yes Yes No Taxon 2 YES Yes Yes Taxon 3 Yes No No Taxon 4 Yes No No Taxon 5 Yes No Yes Taxon 6 No No Yes
  • 8. Version parsimony models: • Character states – Binary: states are 0 and 1 usually interpreted as presence or absence of an attribute (eg. character is a gene and can be present or absent in a genome) – Multistate: Any number of states (Eg. Characters are position in a multiple sequence alignment and states are A,C,T,G. • Type of changes: – Characters are ordered (the changes have to happen in particular order or not. – The changes are reversible or not.
  • 9. Variants of parsimony • Fitch Parsimony unordered, multistate characters with reversibility • Wagner Parsimony ordered, multistate characters with reversibility • Dollo Parsimony ordered, binary characters with reversibility but only one insertion allowed per character characters that are relatively chard to gain but easy to lose (like introns) • Camin-Sokal Parsimony- no reversals, derived states arise once only • (binary) prefect phylogeny – binary and non- reversible; each character changes at most once.
  • 10. Prefect – No (triangle gained and the lost) Dollo – Yes Camin-Sokal – No (for the same reason as perfect)
  • 11. 3 Changes Camin-Sokal Parsimony Triangle inserted twice! But this is not prefect and not Dollo
  • 12. Homoplasy Having some states arise more than once is called homoplasy. Example – triangle in the tree on the previous slide
  • 13. Finding most parsimonious tree • There are exponentially many trees with n nodes • Finding most parsimonious tree is NP- complete (for most variants of parsimony models) • Exception: Perfect phylogeny if exists can be found quickly. Problem – perfect phylogeny is to restrictive in practice.
  • 14. Perfect phylogeny • Each change can happen only once and is not reversible. • Can be directed or not Example: Consider binary characters. Each character corresponds to a gene. 0-gene absent 1-gene present It make sense to assume directed changes only form 0 to 1. The root has to be all zeros
  • 15. Perfect phylogeny Example: characters = genes; 0 = absent ; 1 = present Taxa: genomes (A,B,C,D,E) A 0 0 0 1 1 0 B 1 1 0 0 0 0 C 0 0 0 1 1 1 D 1 0 1 0 0 0 E 0 0 0 1 0 0 genes B D E A C Perfect phylogeny tree Goal: For a given character state matrix construct a tree topology that provides perfect phylogeny. 1 1 1 1 1 1
  • 16. Does there exist prefect parsimony tree for our example with geometrical shapes? There is a simple test
  • 17. Character Compatibility • Two characters A, B are compatible if there do not exits four taxa containing all four combinations as in the table • Fact: there exits perfect phylogeny if and only if and only if all pairs of characters are compatible T1 1 1 T2 1 0 T3 0 1 T4 0 0 A B
  • 18. Are not compatible Taxon1 Yes Yes No Taxon 2 YES Yes Yes Taxon 3 Yes No No Taxon 4 Yes No No Taxon 5 Yes No Yes Taxon 6 No No Yes
  • 19. ? One cannot add triangle to the tree so that no character changes it state twice: If we add it to on of the left branches it will be inserted twice if to the right most – circle would have to be deleted (insertion and the deletion of the circle)
  • 20. Ordered characters and perfect phylogeny • Assume that we in the last common ancestor all characters had state 0. • This assumption makes sense for many characters, for example genes. • Then compatibility criterion is even simpler: characters are compatible if and only if there do not exist three taxa containing combinations (1,0),(0,1),(1,1)
  • 21. Example Under assumption that states are directed form 0 to 1: if i and j are two different genes then the set of species containing i is either disjoint with set if species containing j or one of this sets contains the other. A 0 0 0 1 1 0 B 1 1 0 0 0 0 C 0 0 0 1 1 1 D 1 0 1 0 0 0 E 0 0 0 1 0 0 • The above property is necessary and sufficient for prefect phylogeny under 0 to 1 ordering • Why works: associated with each character is a subtree. These subtrees have to be nested.
  • 22. Simple test for prefect phylogeny • Fact: there exits perfect phylogeny if and only if and only if all pairs of characters are compatible • Special case: if we assume directed parsimony (0!1 only) then characters are compatible if and only if there do not exist tree taxa containing combinations (1,0),(0,1),(1,1) • Observe the last one is equivalent to non-overlapping criterion • Optimal algorithm: Gusfield O(nm): n = # taxa; m= #characters
  • 23. Two version optimization problem: Small parsimony: Tree is given and we want to find the labeling that minimizes #changes – there are good algorithms to do it. Large parsimony: Find the tree that minimize number of evolutionary changes. For most models NP complete One approach to large parsimony requires: - generating all possible trees - finding optimal labeling of internal nodes for each tree. Fact 1: #tree topologies grows exponentially with #nodes Fact 2: There may be many possible labels leading to the same score.
  • 24. Clique method for large parsimony • Consider the following graph: – nodes – characters; – edge if two characters are compatible 1 2 3 4 5 6 α 1 0 0 1 1 0 β 0 0 1 0 0 0 γ 1 1 0 0 0 0 δ 1 1 0 1 1 1 ε 0 0 1 1 1 0 ω 0 0 0 0 0 0 characters 2 1 3 4 5 6 3,5 INCOMPATIBLE Max. compatible set
  • 25. Clique method (Meacham 1981) - • Find maximal compatible clique (NP- complete problem) • Each characters defines a partition of the set into two subsets α,β,γ, δ,ε,ω α,γ, δ β ε,ω 1 2 β ε,ω γ, δ α, ε,ω γ, δ α, 3 β
  • 26. Small parsimony • Assumptions: the tree is known • Goal: find the optimal labeling of the tree (optimal = minimizing cost under given parsimony assumption)
  • 28. Application of small parsimony problem • errors in data • loss of function • convergent evolution (a trait developed independently by two evolutionary pathways e.g. wings in birds an bats) • lateral gene transfer (transferring genes across species not by inheritance) From paper: Are There Bugs in Our Genome: Anderson, Doolittle, Nesbo, Science 292 (2001) 1848-51 Red – gene encoding N-acetylneuraminate lyase
  • 29. Dynamic programming algorithm for small parsimony problem • Sankoff (1975) comes with the DP approach (Fitch provided an earlier non DP algorithm) • Assumptions – one character with multiple states - The cost of change from state v to w is δ(v,w) (note that it is a generalization, so far we talk about cost of any change equal to 1)
  • 30. DP algorithm continued st(v) = minimum parsimony cost for node v under assumption that the character state is t. st(v) = 0 if v is a leaf. Otherwise let u, w be children of u st(v) = min i {si(u)+ δ(i,t)}+ min j {sj(w)+ δ(j,t)} t δ(i,t) δ(j,t) u w Try all possible states in u and v O(nk) cost where n=number of nodes k = number of sates
  • 31. Exercise 1 2 5 4 6 7 3 St(v) 1 2 3 4 5 6 7 t Left and right characters are independent, We will compute the left.
  • 32. Branch lengths • Numbers that indicate the number of changes in each branch • Problem – there may by many most parsimonious trees • Method 1: Average over all most parsimonious trees. • Still a problem – the branch lengths are frequently underestimated 1 4 6 1
  • 33. Character patterns and parsimony • Assume 2 state characters (0/1) and four taxa A,B,C,D • The possible topologies are: A A A B B B C C C D D D A B C D 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 Changes in each topology 0, 0, 0 1, 1, 1 1, 1, 1 1, 2, 2 Informative character (helps to decide the tree topology) Informative characters: xxyy, xyxy,xyyx
  • 34. Inconsistency • Let p, q character change probability • Consider the three informative patters xxyy, xyxy, xyyx • The tree selected by the parsimony depends which pattern has the highest fraction; • If q(1-q) < p2 then the most frequent pattern is xyxy leading to incorrect tree. p p q q q A B C D
  • 35. Distance based methods • When two sequences are similar they are likely to originate from the same ancestor • Sequence similarity can approximate evolutionary distances GAATC GAGTT GA(A/G)T(C/T)
  • 36. Distance Method • Assume that for any pair of species we have an estimation of evolutionary distance between them – eg. alignment score • Goal: construct a tree which best approximates these distance
  • 37. Tree from distance matrix A B C D E 1 1 3 2 2 1 3 5 A B C D E 0 0 0 0 0 2 7 7 12 2 7 7 12 7 7 4 11 7 7 4 11 11 11 12 12 A B C D E length of the path from A to D = 1+3+1+2=7 Consider weighted trees: w(e) = weight of edge e Recall: In a tree there is a unique path between any two nodes. Let e1,e2,…ek be the edges of the path connecting u and v then the distance between u and v in the tree is: d(u,v) = w(e1) + w(e2) + … + w(ek) M T
  • 38. Can one always represent a distance matrix as a weighted tree? 0 10 5 10 10 0 9 5 5 9 0 8 10 5 8 0 a c b 3 2 7 a b c d a b c d d ? There is no way to add d to the tree and preserve the distances
  • 39. Quadrangle inequality • Matrix that satisfies quadrangle inequality (called also the four point condition) for every four taxa is called additive. • Theorem: Distance matrix can be represented precisely as a weighted tree if and only if it is additive. a c b d d(a,c) + d(b,d) = d(a,d) + d(b,c) >= d(a,b) + d(d,c)
  • 40. Constructing the tree representing an additive matrix (one of several methods) 1. Start form 2-leaf tree a,b where a,b are any two elements 2. For i = 3 to n (iteratively add vertices) 1. Take any vertex z not yet in the tree and consider 2 vertices x,y that are in the tree and compute d(z,c) = (d(z,x) + d(z,y) - d(x,y) )/2 d(x,c) = (d(x,z) + d(x,y) – d(y,z))/2 2. From step 1 we know position of c and the length of brunch (c,z). If c did not hit exactly a brunching point add c and z else take as y any node from sub-tree that brunches at c and repeat steps 1,2. x y x y c z x y z
  • 41. Example 0 10 5 9 10 0 9 5 5 9 0 9 9 5 9 0 u u v x y u v x y v 10 Adding x: d(x,c) = (d(u,x) + d(v,x) – d(u,v))/2 = (5+9-10)/2= 2 d(u,c) = (d(u,x) + d(u,v) - d(x,v))/2 = (5+10-9)/2 = 3 u v x 3 2 7 c Adding y: d(y,c’) = (d(u,y) + d(v,y) – d(u,w))/2 = (5+9-10)/2= 2 d(u,c’) = (d(u,y) + d(u,v) - d(y,v))/2 = (10+9-5)/2 = 7 u v x 3 2 3 c c’ y 2 4
  • 42. Real matrices are almost never additive • Finding a tree that minimizes the error Optimizing the error is hard • Heuristics: – Unweighted Pair Group Method with Arithmetic Mean (UPGMA) – Neighborhood Joining (NJ)
  • 43. Hierarchical Clustering • Clustering problem: Group items (e.g. genes) with similar properties (e.g. expression pattern, sequence similarity) so that – The clusters are homogenous (the items in each cluster are highly similar, as measured by the given property – sequence similarity, expression pattern) – The clusters are well separated (the items in different clusters are different) • Hierarchical clustering Many clusters have natural sub-clusters which are often easier to identify e.g. cuts are sub-cluster of carnivore sub-cluster of mammals Organize the elements into a tree rather than forming explicit portioning
  • 44. The basic algorithm Input: distance array d; cluster to cluster distance function Initialize: 1. Put every element in one-element cluster 2. Initialize a forest T of one-node trees (each tree corresponds to one cluster) while there is more than on cluster 1. Find two closest clusters C1 and C2 and merge them into C 2. Compute distance from C to all other clusters 3. Add new vertex corresponding to C to the forest T and make nodes corresponding to C1, C2 children of this node. 4. Remove from d columns corresponding to C1,C2 5. Add to d column corresponding to C
  • 45. A distance function • dave(C1,C2) = 1/(|C1||C2|)Σ d(x,y) x in C1 y in C2 Average over all distances ***
  • 46. Example (on blackboard) A B C D E 0 0 0 0 0 2 7 7 12 2 7 7 12 7 7 4 11 7 7 4 11 11 11 12 12 A B C D E d
  • 47. Unweighted Pair Group Method with Arithmetic Mean (UPGMA) • Idea: – Combine hierarchical clustering with a method to put weights on the edges – Distance function used: dave(C1,C2) = 1/(|C1||C2|)Σ d(x,y) – We need to come up with a method of computing brunch lengths x in C1 y in C2
  • 48. Ultrametric trees • The distance from any internal node C to any of its leaves is constant and equal to h(C) • For each node (v) we keep variable h – height of the node in the tree. h(v) = 0 for all leaves.
  • 49. UPGMA algorithm Initialization (as in hierarchical clustering); h(v) = 0 while there is more than on cluster 1. Find two closest clusters C1 and C2 and merge them into C 2. Compute dave from C to all other clusters 3. Add new vertex corresponding to C to the forest T and make nodes corresponding to C1, C2 children of this node. 4. Remove from d columns corresponding to C1,C2 5. Add to d column corresponding to C 6. h(C) = d(C1, C2) /2 7. Assign length h(C)-h(C1) to edge (C1,C) 8. Assign length h(C)-h(C2) to edge (C2,C)
  • 50. Neighbor Joining • Idea: – Construct tree by iteratively combing first nodes that are neighbors in the tree • Trick: Figuring out a pair of neighboring vertices takes a trick – the closest pair want always do: • B and C are the closest but are NOT neighbors. B A C D 0 5 7 10 5 0 4 7 7 4 0 5 10 7 5 0 A B C D 2 4 4 1
  • 51. Finding Neighbors • Let u(C) = 1/(#clusters-2)Σ d(C,C’) • Find a pair C1C2 that minimizes f(C1,C2)= d(C1,C2)-(u(C1)+u(C2)) • Motivation: keep d(C1,C2) small while (u(C1)+u(C2)) large all clusters C’
  • 52. Finding Neighbors • Let u(C) = 1/(#clusters-2)Σ d(C,C’) • Find a pair C1C2 that minimizes f(C1,C2)= d(C1,C2)-(u(C1)+u(C2)) • For the data from example: u(CA) = u(CD) = 1/2(5+7+10) = 11 u(CB) = u(CC) = 1/2(5+4+7) = 8 f(CA,CB) = 5-11 -8 = -14 f(CB,CC) = 4- 8 -8 = -12 all clusters C’
  • 53. NJ algorithm Initialization (as in hierarchical clustering); h(v) = 0 while there is more than on cluster 1. Find clusters C1 and C2 minimizing f(C1C2) and merge then into C 2. Compute for all C*: d(C,C*) = (d(C1C)+ d(C2C))/2 3. Add new vertex corresponding to C to the forest T and connect it to C1, C2 4. Remove from d columns corresponding to C1,C2 5. Add to d column corresponding to C 6. Assign length ½(d(C1C2)+u(C1)-u(C2) to edge C1C 7. Assign length ½(d(C1C2)+u(C2)-u(C1) to edge C2C
  • 54. NJ tree is not rooted The order of construction of internal nodes of NJ does not suggest an ancestral relation: 1 2 3 4 5
  • 55. Rooting a tree • Choose one distant organism as an out- group Species of interests out-group root
  • 56. Bootstraping • Estimating confidence in the tree topology •Are we sure if this is correct? •Is there enough evidence that A is a successor of B not the other way around? A B
  • 57. Bootstrapping, continued • Assume that the tree is build form multiple sequence alignment A B 1 2 3 4 5 67 8 9 10 11 Columns of the alignment Select columns randomly (with replacement) 13 2 1 10 6 5 4 5 0 11 Initial tree A B New tree Repeat, say 1000 times, For each edge of initial tree calculate % times it is present in the new tree 59%
  • 58. Summary • Assume you have multiple alignment of length N. Let T be the NJ tree build from this alignment • Repeat, say 1000 times the following process: – Select randomly with replacement N columns of the alignment to produce a randomized alignment – Build the tree for this randomized alignment • For each edge of T report % time it was present in a tree build form randomized alignment . This is called the bootstrap value. • Trusted edges: 80% or better bootstrap.
  • 59. Maximum Likelihood Method • Given is a multiple sequence alignment and probabilistic model of for substitutions (like PAM model) find the tree which has the highest probability of generating the data. • Simplifying assumptions: – Positions involved independently – After species diverged they evolve independently.
  • 60. Formally: • Find the tree T such that assuming evolution model M Pr[Data| T,M] is maximized • From the independence of symbols: Pr[Data| T,M] = P i Pr[Di| T,M] Where the product is taken over all characters i and Di value of the character i is over all taxa
  • 61. Computing Pr[Di| T,M] Pr[Di| T,M] = Σ x Σ y Σ z p(x)p(x,y,t1)p(y,A,t2)p(y,B,t6)p(x,z,t3)p(z,C,t4)p(z,D,t5) A B C D Column Di x y z consider all possible assignments here t1 t2 t3 t4 t5 time p(x,y,t) = prob. of mutation x to y in time t (from the model) t6
  • 62. Discovering the tree of life • “Tree of life” – evolutionary tree of all organisms • Construction: choose a gene universally present in all organisms; good examples: small rRNA subunit, mitochondrial sequences. • Things to keep in mind while constructing tree of life from sequence distances: – Lateral (or horizontal) gene transfer – Gene duplication: genome may contain similar genes that may evolve using different pathways. Phylogeny tree need to be derived based on orthologous genes.
  • 63. Where we go with it… • We now how to compute for given column and given tree Pr[Di| T,M] • Sum up over all columns to get Pr[Data| T,M] Now, explore the space of possible trees Problem: • Bad news: the space of all possible trees is HUGE Various heuristic approaches are used.
  • 64. Metropolis algorithm:Random Walk in Energy Space Goal design transition probabilities so that the probability of arriving at state j is P(j) = q(j) / Z (typically, q(S) = e –E(S)/kT , where E – energy) Z – partition function = sum over all states S of terms q(s). Z cannot be computed analytically since the space is to large State i (in our case tree) State j (in our case another tree) Tree i can be obtained from tree j by a “local” change The network of possible trees, there is an edge between “similar trees” temperature const.
  • 65. Monte Carlo, Metropolis Algorithm • At each state i choose uniformly at random one of neighboring conformations j. • Compute p(i,j) = min(1, q(i)/q(j) ) • With probability p(i,j) move to state j. • Iterate
  • 66. MrBayes • Program developed by J.Huelsenbeck & F.Ronquist. • Assumption: • q(i) = Pr[D| Ti,M] Prior probabilities: All trees are equally likely. • Proportion of time a given tree is visited approximates posterior probabilities.
  • 67. Most Popular Phylogeny Software • PAUP • PHYLIP