Phylogenetic analysis in a nutshell

Reconstruction of phylogenetic trees
using maximum-likelihood methods
PhyML (in a nutshell)
Note: Slides are still under revision

Steps
• Collect homologous sequences.
• Multiple sequence alignment.
• Manually curate the multiple sequence alignment (MSA).
• Feed the MSA to a model-selection program that evaluates how well
different substitution models fit the sites of the MSA
(ProtTest for protein and jModelTest for DNA alignments).
• Select an appropriate substitution model.
• Feed the MSA, a starting tree (e.g., one obtained with the
neighbour-joining method), the substitution model and the
bootstrap settings to PhyML.
• Obtain the tree and cross-check bootstrap values, branch
lengths and general resolution.
• Remove rogue taxa and repeat the entire process until a
satisfactory tree is constructed.

Selection of sequences for phylogenetic tree
Purpose of the tree
1. Genealogy: evolution of a gene/gene family irrespective of
speciation (called a gene tree).
2. Phylogeny: evolution of a gene/gene family in the context of
speciation (called a species tree).
Homologues: Genes derived from a common ancestor.
Orthologues: Homologues that are separated from one another by
speciation (i.e., after speciation occurs, the same copy of the gene
evolves under the different constraints that are faced by the two
different species).
Paralogues: Homologues that are separated from each other by
gene/genome duplication (the duplication may predate or follow
speciation).

Selecting sequences
• Similar sequences with a considerably low E-value in BLAST can in
general be assumed to be homologous. The cut-offs below are rules of thumb.
• <40% amino acid similarity = the similarity is more likely to have arisen
by chance and is not necessarily due to homology.
• ~40% amino acid similarity = twilight zone for homology (may or
may not be homologous).
• ≥60% amino acid similarity = homology inferred
(~80% or higher similarity for DNA sequences).
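
The cut-offs above can be applied to a pair of already aligned sequences roughly as follows. This is only a sketch: the function names are made up for illustration, it measures percent identity (a stricter proxy for the "similarity" quoted on the slide, which would need a substitution matrix), and it treats the 40-60% band as the twilight zone, which is an assumption about how to read "~40%".

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def classify_similarity(percent: float) -> str:
    """Map a percentage onto the rule-of-thumb bands from the slide."""
    if percent >= 60.0:
        return "homology inferred"
    if percent >= 40.0:
        return "twilight zone"
    return "possibly chance similarity"


if __name__ == "__main__":
    pid = percent_identity("MKT-LV", "MKSALV")
    print(pid, classify_similarity(pid))
```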

• Perform a BLAST search with the new sequence.
• Note the hits obtained and their E-values.
• Follow the hits down the list with increasing E-values until the E-value
suddenly jumps by about 3 orders of magnitude or more. The E-value is the
number of hits with an equal or better score expected by chance in a
database of that size; for small values such as 1e-10 (1 × 10^-10) it
approximates the probability that the similarity arose by chance and not
through homology. A sudden jump from 1e-10 to 1e-5 in the BLAST hit list
may indicate that homology is limited to the sequences with the lower
E-values.
(Note: the E-value depends on the size of the sequence database. For the
same alignment score, a larger database gives a higher E-value.)
• Note the annotation or characterization of the encoded proteins as well
as the % similarity and sequence coverage.
• Also note the organisms from which the sequences are derived.
• Select sequences with considerable coverage and similarity for multiple
sequence alignment.
• The choice of sequences can be based on species of origin and their
relatedness, or on special activities and multiple domain structures,
depending on the basis on which the phylogeny is to be reconstructed.
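
The "follow the list until the E-value jumps" rule can be sketched in a few lines. The function name, the input format (a list of `(hit_id, evalue)` pairs already sorted by increasing E-value) and the default threshold of 3 orders of magnitude are assumptions made for illustration; the hit names in the example are invented.

```python
import math


def select_before_jump(hits, min_jump_orders=3.0):
    """Keep hits until the E-value jumps by >= min_jump_orders powers of ten.

    hits: list of (hit_id, evalue) pairs sorted by increasing E-value.
    """
    floor = 1e-300  # BLAST may report 0.0 for extremely small E-values
    kept = []
    prev_log = None
    for hit_id, evalue in hits:
        cur_log = math.log10(max(evalue, floor))
        if prev_log is not None and cur_log - prev_log >= min_jump_orders:
            break  # sudden jump: stop collecting hits here
        kept.append(hit_id)
        prev_log = cur_log
    return kept


if __name__ == "__main__":
    example = [("h1", 1e-12), ("h2", 1e-11), ("h3", 1e-10),
               ("h4", 1e-5), ("h5", 1e-4)]
    print(select_before_jump(example))
```

A hit reported as exactly 0.0 followed by an ordinary small E-value will also register as a jump with this simple rule, so the cut should still be checked by eye against coverage and annotation.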

MSA - Multiple Sequence Alignment
Different programs, e.g., CLUSTAL, DIALIGN, MUSCLE, MAFFT.
THEORETICALLY ANY SEQUENCE CAN BE ALIGNED TO ANY OTHER SEQUENCE.
WHETHER IT MAKES SENSE OR NOT IS A DIFFERENT ISSUE.
CLUSTAL (ClustalW2, ClustalX): ClustalW2 builds the MSA progressively. It first aligns
all pairs of sequences by dynamic programming, which finds the highest-scoring
alignment stepwise from cumulative scores for matches at each position and penalties
for mismatches and gaps. The pairwise scores are converted into a distance matrix,
from which a neighbour-joining guide tree is built; the sequences are then added to the
growing alignment in the order given by the guide tree, closest pairs first. (Profile
hidden Markov models are used by the later Clustal Omega, not by ClustalW2.) Following
the guide tree instead of optimising all sequences simultaneously greatly reduces the
time required for analysis.
DIALIGN: DIALIGN does not use gap penalties; it assembles the alignment from gap-free
local segment pairs and can thus be used for more accurate alignment of very divergent
sequences that suffer large alignment gaps.
MUSCLE: MUSCLE (Multiple Sequence Comparison by Log-Expectation) relies on
iterative refinement: after a progressive alignment it repeatedly splits the alignment
into two groups, realigns them and keeps the result when the score improves, producing
more accurate alignments in shorter time frames.
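
The dynamic-programming scoring mentioned for pairwise alignment can be sketched as below. This is a minimal Needleman-Wunsch score calculation, not the ClustalW implementation: the function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, and real aligners use substitution matrices and separate gap opening and extension penalties rather than a single linear gap cost.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = cur[j - 1] + gap   # b[j-1] aligned to a gap
            cur.append(max(diagonal, gap_in_b, gap_in_a))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATACA"))
```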

CLUSTAL (ClustalX):
• Feed the sequences in FASTA format (copy-paste into the applet or attach a
plain-text file {*.txt}).
E.g., >(name of the 1st sequence)
Agtgatagatag…………
>(name of the 2nd sequence)
Gatagatcgctgatcgctc…..
• Run with the default settings.
• Analyze the result:
Gaps are frequent: change the settings such that the gap
opening penalty is high, e.g., increase it from the default value
of 10 to 15, 20, 25, 30.
Gaps are long but less frequent: change the settings such that the
gap extension penalty is high, e.g., increase it from the default
value of 1 to 2, 3, 4, 5.
No gaps but many mismatches: relax the gap opening penalty (5,
6, 7) and/or the gap extension penalty (0.1, 0.2, 0.4, 0.5) such
that indels can be placed in the data set for a better match.
REPEAT THE MSA UNTIL THE ALIGNMENT IMPROVES.
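
To decide whether gaps are "frequent" or "long", it can help to put numbers on an alignment. The sketch below reads aligned sequences from FASTA-formatted text and reports the number of gap runs per sequence and the mean gap-run length. The function names are hypothetical, terminal gaps are counted like internal ones, and what counts as "too many" or "too long" is left to the reader.

```python
import re


def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def gap_profile(aligned_seqs):
    """Return (gap runs per sequence, mean gap-run length)."""
    seqs = list(aligned_seqs)
    if not seqs:
        raise ValueError("no sequences given")
    runs = [len(m.group()) for s in seqs for m in re.finditer(r"-+", s)]
    runs_per_seq = len(runs) / len(seqs)
    mean_length = sum(runs) / len(runs) if runs else 0.0
    return runs_per_seq, mean_length


if __name__ == "__main__":
    msa = parse_fasta(">s1\nAC--GT-A\n>s2\nACGTGTTA\n>s3\nA-CGGT-A\n")
    print(gap_profile(msa.values()))
```

Many short runs point towards raising the gap opening penalty; few long runs point towards raising the gap extension penalty.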

Manual curation of MSA
• Involves expert, by-eye curation, usually of the placement of alignment gaps
within the sequence alignment. This is best understood through case-by-case
study.
• Involves the removal of rogue taxa, i.e., sequences that do not fit in
the current MSA due to a disproportionate occurrence of mismatches and
gaps. They can usually be identified after the first tree is made, when the
bootstrap values and/or branch lengths of particular lineages are
questionable (appropriate software is available).
• The larger the sequence set, the higher the accuracy of the tree, but the more
time-consuming tree construction by maximum likelihood (ML) becomes.
• The more diverse the sequence set, the more error-prone the tree may be, since it
is an approximation. Hence closely related representative sequences
from each group in the data set need to be selected. E.g.,
when studying small-molecule methyltransferases, one may take a few
close relatives of O-, N- and C-methyltransferases for analysis, since these
share considerable homology.
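
A quick screen for sequences with a disproportionate number of gaps might look like the sketch below. It is only a first-pass heuristic and no substitute for dedicated rogue-taxon software or for inspecting the tree; the function name and both thresholds (`factor`, `min_fraction`) are arbitrary assumptions.

```python
from statistics import median


def flag_gappy_sequences(alignment: dict,
                         factor: float = 2.0,
                         min_fraction: float = 0.2) -> list:
    """Names of sequences whose gap fraction is unusually high.

    alignment: {name: aligned sequence}. A sequence is flagged when its gap
    fraction is at least min_fraction AND more than factor times the median.
    """
    fractions = {name: seq.count("-") / len(seq)
                 for name, seq in alignment.items() if seq}
    if not fractions:
        return []
    typical = median(fractions.values())
    return sorted(name for name, frac in fractions.items()
                  if frac >= min_fraction and frac > factor * typical)


if __name__ == "__main__":
    msa = {
        "a": "ACGTACGTAC",
        "b": "ACGTAC-TAC",
        "c": "A----C--AC",
        "d": "ACGTACGTAC",
    }
    print(flag_gappy_sequences(msa))
```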

Substitution model
• The curated MSA is given as input to programs like jModelTest (for DNA) and
ProtTest (for proteins), which fit candidate substitution models to the MSA and score
how well each one explains it. Based on these fits a ranked list of substitution models
is produced. E.g., the simplest model, Jukes-Cantor (JC), assumes that each DNA base is
substituted by any other base at an equal rate. Although this is unrealistic in general,
the selected sequences may happen to fit this assumption well enough, and then JC can be
used for analysis in PhyML. The Kimura model assumes that transitions (Ts) (purine to
purine or pyrimidine to pyrimidine changes) and transversions (Tv) (purine to pyrimidine
or vice versa) occur at different rates.
• There are 22 DNA substitution models considered, and each model can have variants
based on how rate variation among sites is treated (none, +I, +G or +I+G), making a
total of 22 × 4 = 88 substitution models for DNA.
• +I: refers to the proportion of invariable sites, i.e., the fraction of alignment sites
assumed never to change. Including this parameter keeps fully conserved sites from
biasing the substitution rates estimated for the variable sites.
• +G: refers to gamma-distributed rate heterogeneity among sites. Substitution rates are
allowed to vary from site to site following a gamma distribution whose shape parameter
is estimated from the data.
• The Ts/Tv ratio is not a separate suffix: it is a parameter of models that distinguish
transitions from transversions, such as the Kimura model.
e.g., an MSA can follow a JC model or JC+I or JC+G or JC+I+G.
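
The two models named above can be made concrete with a small calculation. The sketch counts transitions and transversions between two aligned DNA sequences and applies the standard Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. Function names are hypothetical; gaps and ambiguous bases are simply skipped, which is a simplifying assumption.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(a: str, b: str):
    """Count (transitions, transversions) between two aligned DNA sequences."""
    ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue  # identical, gapped or ambiguous column
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1   # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1   # purine<->pyrimidine
    return ts, tv


def jc69_distance(p: float) -> float:
    """Jukes-Cantor distance from the observed proportion of differences p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    print(count_ts_tv("AACC", "GTCT"))
    print(round(jc69_distance(0.1), 4))
```

If the Ts and Tv counts differ strongly, a model with a single substitution rate such as JC is unlikely to be the best fit, which is what the model-selection step tests formally.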

Substitution model
• The choice of substitution model depends on three statistical
criteria implemented in both jModelTest and ProtTest: the Akaike
Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the
Akaike Information Criterion corrected for small samples (AICc).
• The model with the lowest AIC, AICc or BIC score is usually selected as the
appropriate substitution model for phylogenetic estimation.
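
The three criteria are simple functions of the maximised log-likelihood (lnL), the number of free parameters (k) and the sample size (n). The sketch below uses the standard definitions AIC = 2k - 2 lnL, AICc = AIC + 2k(k+1)/(n-k-1) and BIC = k ln(n) - 2 lnL, and picks the model with the lowest score. Taking n as the alignment length is an assumption, and the log-likelihoods and parameter counts in the example are invented purely for illustration.

```python
import math


def information_criteria(log_likelihood: float, k: int, n: int) -> dict:
    """AIC, AICc and BIC for one fitted model (lower is better)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


def best_model(fits: dict, n: int, criterion: str = "BIC") -> str:
    """fits: {model name: (lnL, k)}. Returns the lowest-scoring model."""
    return min(fits,
               key=lambda m: information_criteria(fits[m][0], fits[m][1], n)[criterion])


if __name__ == "__main__":
    # Hypothetical fits, not real results
    fits = {"JC": (-5200.0, 9), "HKY+G": (-5100.0, 14), "GTR+I+G": (-5098.0, 19)}
    for crit in ("AIC", "AICc", "BIC"):
        print(crit, best_model(fits, n=1000, criterion=crit))
```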
Phylogeny
PhyML at present incorporates analysis using 32 substitution models for
DNA.
After supplying all the inputs, i.e., the MSA, the substitution model and the
+I/+G options, tree building can be carried out.
PhyML needs a starting tree for building a phylogenetic tree. A user-defined
tree can be supplied; if none is available, PhyML can be instructed to construct
a neighbour-joining starting tree on its own.
The tree can be improved by selecting a search option like SPR + NNI, so that
the tree topology is searched more thoroughly while branch lengths are optimised.
Finally, bootstrapping with 1000 pseudoreplicates is chosen to assess the support
for the branch topology.
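
The settings described above map onto a PhyML command line roughly as sketched below. The helper function and the file names are hypothetical; the flags (`-i`, `-d`, `-m`, `-b`, `-s`, `-v`, `-a`, `-c`, `-u`) follow the PhyML 3.x command-line interface as commonly documented, but they should be checked against `phyml --help` for the installed version. The sketch only builds the argument list and does not run PhyML.

```python
def build_phyml_command(alignment_path: str,
                        model: str = "GTR",
                        data_type: str = "nt",
                        bootstraps: int = 100,
                        invariable_sites: bool = True,
                        gamma: bool = True,
                        search: str = "BEST",
                        start_tree: str = None) -> list:
    """Assemble a PhyML argument list (assumed PhyML 3.x flags)."""
    cmd = ["phyml",
           "-i", alignment_path,   # alignment in PHYLIP format
           "-d", data_type,        # nt = DNA, aa = protein
           "-m", model,            # substitution model
           "-b", str(bootstraps),  # number of bootstrap pseudoreplicates
           "-s", search]           # NNI, SPR or BEST (best of NNI and SPR)
    if invariable_sites:
        cmd += ["-v", "e"]         # +I, proportion estimated from the data
    if gamma:
        cmd += ["-a", "e", "-c", "4"]  # +G, shape estimated, 4 rate categories
    else:
        cmd += ["-c", "1"]         # single rate category
    if start_tree:
        cmd += ["-u", start_tree]  # user-defined starting tree (Newick)
    return cmd


if __name__ == "__main__":
    # "msa.phy" is a placeholder file name
    print(" ".join(build_phyml_command("msa.phy", bootstraps=1000)))
    # To run it: subprocess.run(cmd, check=True) with PhyML installed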

Bootstrapping
Bootstrapping makes the program repeat the same
tree building with pseudoreplicates of the alignment,
each built by resampling the alignment columns (sites)
with replacement, and then calculates in how many
pseudoreplicates per hundred a branch is recovered
with the same topology.
A bootstrap value greater than 70% is in general considered significant.
The higher the number of pseudoreplicates, the more
precise the support estimates.
1000 bootstrap pseudoreplicates are preferable, but in
consideration of the time required, 100 pseudoreplicates
also suffice.
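
The resampling step itself is short enough to sketch. The code below draws one pseudoreplicate by sampling alignment columns with replacement and computes the support for a clade as the percentage of replicate trees that contain it. The function names are made up, and trees are represented simply as sets of clades (each clade a frozenset of taxon names), which is an assumption for illustration; building the tree for each replicate is left to PhyML.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (same length as input).

    alignment: list of equal-length aligned sequences.
    rng: a random.Random instance, so that results can be reproduced.
    """
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def support_percent(clade, replicate_clades):
    """Percentage of replicate trees that contain the given clade."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)


if __name__ == "__main__":
    msa = ["ACGT", "AGGT", "ACGA"]
    print(bootstrap_replicate(msa, random.Random(1)))
    trees = [{frozenset({"A", "B"})}, {frozenset({"A", "C"})}]
    print(support_percent(frozenset({"A", "B"}), trees))
```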

Re-construction
• Once the tree is generated, it is broadly inspected for
accuracy via the bootstrap value of each branch as well as for
disproportionate branch lengths.
• In case of faulty trees, corrections need to be made in both respects.
• If the MSA is curated properly, one might still need to remove rogue
taxa (taxa that are problematic for the tree topology or branch
lengths) using available software.
The entire process, starting from the search for the optimal substitution
model, may need to be repeated.
• If no rogue taxa can be identified, reducing the overall
sequence diversity can also be tried, so that only the more relevant
sequences are included in the MSA.
• The NJ starting tree option can also be changed to a user-defined tree option.
• Tree construction is repeated over a number of cycles until an
appropriate tree is generated.
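
Checking bootstrap values by eye becomes tedious for large trees, so the first pass can be automated. The sketch below pulls support values out of a Newick string and lists those below a threshold. It assumes that support values are written as internal node labels directly after a closing parenthesis, as in the invented example tree; the function names and the example are hypothetical, and the 70% default follows the rule of thumb given in the bootstrapping slide.

```python
import re

# A number directly after ")" and before ":", ",", ")" or ";" is read as support
SUPPORT_RE = re.compile(r"\)(\d+(?:\.\d+)?)(?=[:,);])")


def branch_supports(newick: str) -> list:
    """All internal-node support values found in a Newick string."""
    return [float(value) for value in SUPPORT_RE.findall(newick)]


def weak_branches(newick: str, threshold: float = 70.0) -> list:
    """Support values that fall below the threshold."""
    return [s for s in branch_supports(newick) if s < threshold]


if __name__ == "__main__":
    tree = "((A:0.1,B:0.2)95:0.05,(C:0.1,D:0.3)42:0.07,E:0.5);"
    print(branch_supports(tree))
    print(weak_branches(tree))
```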
