Reconstruction of phylogenetic trees
using maximum-likelihood methods
PhyML (in a nutshell)
Note: Slides are still under revision
Steps
• Collect homologous sequences.
• Multiple sequence alignment.
• Curate the multiple sequence alignment (MSA) manually.
• Feed the MSA to programs that compare how well different substitution
models fit the alignment
(ProtTest for protein and jModelTest for DNA alignments).
• Select an appropriate substitution model.
• Feed the MSA, a starting tree (e.g., one obtained with the
Neighbour-Joining method), the substitution model and the
bootstrap settings to PhyML.
• Obtain the tree and cross-check bootstrap values, branch
lengths and general resolution.
• Remove rogue taxa and redo the entire process until a
satisfactory tree is constructed.
Selection of sequences for phylogenetic tree
Purpose of the tree
1. Genealogy: evolution of a gene/gene family irrespective of
speciation (called a gene tree).
2. Phylogeny: evolution of a gene/gene family in the context of
speciation (called a species tree).
Homologues: Genes derived from a common ancestor.
Orthologues: Homologues that are separated from one another by
speciation (i.e., after speciation occurs, the same copy of the gene
evolves under the different constraints that are faced by the two
different species).
Paralogues: Homologues that are separated from each other by
gene/genome duplication, so that both copies can coexist in the
same genome.
Selecting sequences
•Similar sequences with a considerably low E-value in BLAST can in
general be assumed to be homologous.
•<40% amino acid similarity = higher chance that the similarity arose
by chance and not necessarily through homology.
•~40% amino acid similarity = twilight zone for homology (may
or may not be homologous).
•≥60% amino acid similarity = homology inferred
(~80% or higher similarity for DNA sequences).
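The thresholds above can be turned into a rough screening helper. The sketch below is only an illustration: it computes percent identity over gap-free aligned columns as a simple stand-in for the similarity score a BLAST report would give, and it treats the whole 40-60% range as twilight zone, which is an assumption because the slide only names ~40%. The function names are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def homology_call(pct: float) -> str:
    """Map a protein similarity percentage onto the rule-of-thumb classes."""
    if pct >= 60.0:
        return "homology inferred"
    if pct >= 40.0:
        # Assumption: everything between 40% and 60% is treated as twilight zone.
        return "twilight zone"
    return "possibly chance similarity"


pct = percent_identity("ACDE-G", "ACDFHG")
print(pct, homology_call(pct))
```

These cut-offs are rules of thumb; coverage and E-value should be checked alongside them.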
• Perform BLAST of the new sequence.
• Note the hits obtained and the e-value.
• Follow the hits down the list with increasing E-values until the
E-value suddenly jumps by about 3 orders of magnitude or more. The E-value
is the number of hits with an equal or better score expected by chance in a
database of that size. E.g., an E-value of 1e-10 means that about 1 × 10^-10
such hits are expected by chance, so the similarity is far more plausibly
due to homology than to coincidence. A sudden jump from 1e-10 to 1e-5
in the BLAST result list may indicate that
homology is limited to the sequences with the lower E-values.
(Note: the E-value depends on the size of the sequence database. For the
same alignment score, a larger database gives a larger E-value.)
• Note the annotation or characterization of the encoded proteins as well
as the % similarity and sequence coverage.
• Also note the organisms from which they are derived.
• Select sequences with considerable coverage and similarity for multiple
sequence alignment.
• The choice of sequences can be based on species of origin and their
relatedness, or on special activities and multiple-domain structures,
depending on the basis on which the phylogeny is to be reconstructed.
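The "follow the list until the E-value jumps" step can be sketched as a small function. This is a minimal sketch, assuming the E-values have already been extracted from a BLAST report and sorted in ascending order; the function name, the 3-orders-of-magnitude default and the floor used for E-values reported as 0.0 are illustrative assumptions.

```python
import math


def cutoff_by_evalue_jump(evalues, min_jump_orders=3.0):
    """Return how many leading hits to keep before the first large E-value jump.

    `evalues` must be sorted ascending (best hit first).
    """
    # Assumption: E-values reported as 0.0 are floored so log10 is defined.
    floor = 1e-180
    logs = [math.log10(max(e, floor)) for e in evalues]
    for i in range(1, len(logs)):
        if logs[i] - logs[i - 1] >= min_jump_orders:
            return i  # hits 0 .. i-1 lie before the jump
    return len(evalues)  # no jump found: keep everything


hits = [1e-12, 1e-11, 1e-10, 1e-5, 0.1]
keep = cutoff_by_evalue_jump(hits)
print(hits[:keep])
```

The cut-off is only a first filter; annotation, coverage and source organism still need to be checked by eye.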
MSA - Multiple Sequence Alignment
Different tools, e.g., CLUSTAL, DIALIGN, MUSCLE, MAFFT.
THEORETICALLY, ANY SEQUENCE CAN BE ALIGNED TO ANY OTHER SEQUENCE.
WHETHER IT MAKES SENSE OR NOT IS A DIFFERENT ISSUE.
CLUSTAL (CLUSTALW2, X): ClustalW2 builds the MSA by progressive alignment. It first
aligns every pair of sequences by dynamic programming, which finds the highest-scoring
alignment from cumulative scores for matches at each position and penalties for
mismatches and gaps. The pairwise scores give a distance matrix, from which a guide
tree is built, and the sequences are then added to the growing alignment in the order
given by the guide tree, most similar first. Aligning progressively, instead of
optimizing all sequences simultaneously, greatly reduces the time required for
analysis. (Profile hidden Markov models (HMMs) are used by the later Clustal Omega,
not by ClustalW2.)
DIALIGN: DIALIGN builds alignments from locally similar segments and does not use
gap penalties, and thus can be used for alignment of very divergent sequences that
contain large alignment gaps.
MUSCLE: MUSCLE (Multiple Sequence Comparison by Log-Expectation) relies on
iterative methods that repeatedly re-align parts of the alignment to refine
the growing MSA, producing more accurate alignments in shorter time
frames.
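To make the dynamic programming idea concrete, here is a minimal sketch of a global pairwise alignment score (Needleman-Wunsch style). It is not the implementation used by any of the tools above: it assumes a simple match/mismatch score and a linear gap penalty, whereas Clustal-type programs use substitution matrices and separate gap opening and extension penalties. All parameter values are illustrative.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column / row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap  # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap  # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

Each cell reuses the best scores of shorter prefixes, which is why the cumulative score can be found stepwise without enumerating every possible alignment.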
CLUSTAL (CLUSTALX):
•Feed the sequences in FASTA format (copy and paste into the applet or attach a
plain-text file {*.txt}).
E.g.,
>(name of the 1st sequence)
Agtgatagatag…………
>(name of the 2nd sequence)
Gatagatcgctgatcgctc…..
•Run with default settings.
•Analyze:
Gaps are frequent: raise the gap opening penalty, e.g.,
increase it from the default value
of 10 to 15, 20, 25, 30.
Gaps are long but less frequent: raise the
gap extension penalty, e.g., increase it from the default
value of 1 to 2, 3, 4, 5.
No gaps but many mismatches: relax the gap opening penalty (5,
6, 7) and/or the gap extension penalty (0.1, 0.2, 0.4, 0.5) so
that indels can be introduced into the data set for a better match.
(Default penalties differ between program versions and between DNA and
protein settings, so check the values shown in your installation.)
REPEAT THE ALIGNMENT UNTIL IT IMPROVES.
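Whether gaps are "frequent" or "long" can be judged more consistently with a quick count. The sketch below assumes the alignment has been exported in FASTA format with "-" as the gap character; it parses the records and reports, per sequence, the number of gap runs and their mean length. The function names are hypothetical.

```python
import re


def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


def gap_stats(aligned_seq):
    """Return (number of gap runs, mean gap run length) for one aligned sequence."""
    runs = [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]
    count = len(runs)
    return count, (sum(runs) / count if count else 0.0)


alignment = parse_fasta(">seq1\nAC--GT\nA-C\n>seq2\nACGTGTAAC\n")
for name, seq in alignment.items():
    print(name, gap_stats(seq))
```

Many short runs point towards raising the gap opening penalty; few long runs point towards the gap extension penalty.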
Manual curation of MSA
•Involves manually adjusting, in most cases, the placement of alignment gaps
within the sequence alignment. This is best learned case by
case.
•Involves the removal of rogue taxa, i.e., sequences that do not fit in
the current MSA because of a disproportionate occurrence of mismatches and
gaps. Usually these can be identified after the first tree is made, when the
bootstrap values and/or branch lengths of particular lineages are
questionable (appropriate software is available).
•The larger the sequence set, the higher the accuracy of the tree, but the more
time-consuming tree construction by maximum likelihood (ML) becomes.
•The more diverse the sequence set, the more error-prone the tree may be, since it
becomes more of an approximation. Hence closely similar representative
sequences from each group in the data set need to be selected. E.g.,
for small-molecule methyltransferases one may take a few
close relatives of O-, N- and C-methyltransferases for analysis, since these
share considerable homology.
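One simple way to shortlist sequences with a disproportionate number of mismatches and gaps is to compare each sequence with the column-wise majority. This is only a screening sketch, not one of the dedicated rogue-taxon programs mentioned above; the function name and the idea of using a majority consensus are assumptions made for illustration.

```python
from collections import Counter


def mismatch_fraction_to_consensus(alignment):
    """For each sequence, the fraction of columns differing from the majority character.

    `alignment` maps sequence names to equal-length aligned sequences; gaps
    ("-") count as characters, so gap-rich sequences also score high.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all sequences must have the same aligned length")
    consensus = [
        Counter(alignment[n][i] for n in names).most_common(1)[0][0]
        for i in range(length)
    ]
    return {
        n: sum(1 for i in range(length) if alignment[n][i] != consensus[i]) / length
        for n in names
    }


aln = {"a": "ACGTAC", "b": "ACGTAC", "c": "ACGTAC", "d": "T--TGG"}
fractions = mismatch_fraction_to_consensus(aln)
print([name for name, f in fractions.items() if f > 0.5])
```

A high fraction is a reason to inspect the sequence, not proof that it must be removed.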
Substitution model
•The curated MSA can be given as input to programs like jModelTest for DNA and
ProtTest for proteins, which evaluate the pattern of substitution in the MSA. Based on this
pattern, candidate substitution models are ranked by how well they fit. E.g., the
simplest model, Jukes-Cantor (JC), assumes that each DNA base is substituted by any
other base at an equal rate. Although this is biologically unrealistic, the
selected sequences may fit it well enough, and JC can then be
used for analysis in PhyML. The Kimura model assumes that transitions (Ts) (purine to purine and
pyrimidine to pyrimidine changes) and transversions (Tv) (purine to pyrimidine or vice
versa) occur at different rates.
•There are 22 DNA substitution models considered, and each can be used alone or
combined with +I, +G, or +I+G, thus making a total of
22*4=88 candidate models for DNA substitution.
•+I: refers to the proportion of invariable sites, i.e., the fraction of alignment
sites assumed unable to change at all.
Inclusion of this parameter keeps such sites from biasing the rate estimated
for the variable sites.
•+G: refers to a gamma distribution of substitution rates across sites (the gamma
distribution describes the shape of the spread between slowly and rapidly evolving sites).
•+Y: refers to accounting for the Ts/Tv ratio (the difference observed between
transition and transversion rates). In most programs this ratio is a parameter of
the substitution model itself (e.g., Kimura) rather than a separate suffix.
E.g., an MSA can best fit JC, JC+I, JC+G or JC+I+G.
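The difference between the Jukes-Cantor and Kimura assumptions can be seen in their distance corrections. The sketch below uses the standard JC69 and Kimura two-parameter (K80) distance formulas on a pair of aligned DNA sequences; it is an illustration of the models, not of how jModelTest or PhyML evaluate them (those programs fit models by likelihood on the whole alignment). Function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_proportions(a, b):
    """Proportions of transition (P) and transversion (Q) differences."""
    sites = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous characters
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return ts / sites, tv / sites


def jc69_distance(p):
    """JC69: all substitutions equally likely; p is the total difference proportion."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(P, Q):
    """K80: transitions (P) and transversions (Q) have separate rates."""
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


P, Q = ts_tv_proportions("AAAAAAAAAA", "GAAAAAAAAC")
print(jc69_distance(P + Q), k80_distance(P, Q))
```

Both formulas break down (logarithm of a non-positive number) when the sequences are too divergent, which is one reason very distant sequences are hard to place.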
Substitution model
•The choice of substitution model depends on three statistical
criteria implemented in both jModelTest and ProtTest: the Akaike
Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the Akaike
Information Criterion corrected for small samples (AICc).
•The model with the lowest AIC and BIC scores is usually selected as the
appropriate substitution model for phylogenetic estimation.
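The three criteria are simple functions of a model's log-likelihood, its number of free parameters k and the sample size n, and for all of them a lower value indicates a better trade-off between fit and complexity. The sketch below uses the standard formulas; taking n as the alignment length is a common convention and is assumed here, and the log-likelihoods and parameter counts in the example are made-up numbers for illustration only.

```python
import math


def information_criteria(log_likelihood, k, n):
    """Return AIC, AICc and BIC for a model with k free parameters and sample size n."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


def best_model(fits, n, criterion="BIC"):
    """Pick the model with the LOWEST score; fits maps name -> (lnL, k)."""
    return min(fits, key=lambda name: information_criteria(*fits[name], n)[criterion])


# Hypothetical fits: (log-likelihood, number of free parameters).
fits = {"JC": (-1050.0, 9), "HKY+G": (-1000.0, 14)}
print(best_model(fits, n=500, criterion="AIC"))
```

The extra parameters of a richer model are only worth it when the gain in likelihood outweighs the penalty term.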
Phylogeny
PhyML at present supports analysis using 32 substitution models for
DNA.
After supplying all the inputs, i.e., the MSA, the substitution model and the +I/
+G/+Y options, tree building can be carried out.
PhyML needs a starting tree for building a phylogenetic tree; a user-defined tree can be supplied.
If none is available, PhyML can be instructed to construct its own
Neighbour-Joining (BioNJ) starting tree.
The tree search can be improved by selecting options like SPR and NNI (or the best
of both), which rearrange the topology in search of a higher likelihood.
Finally, bootstrapping with 1000 pseudoreplicates is chosen to assess the support
for the branching topology.
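For command-line use, the choices described above map onto PhyML options. The sketch below only assembles an argument list and does not run anything. The flags shown (-i, -d, -m, -v, -a, -s, -b, -u) are given as I understand them from the PhyML 3.x command-line help, and the executable name, file names and defaults are assumptions; verify everything against `phyml --help` for your installed version.

```python
def phyml_command(alignment_path, model="GTR", bootstraps=1000,
                  search="BEST", start_tree=None, datatype="nt"):
    """Build (but do not run) a PhyML argument list.

    Flag meanings are assumptions based on the PhyML 3.x help text.
    """
    if search not in {"NNI", "SPR", "BEST"}:
        raise ValueError("search must be NNI, SPR or BEST")
    cmd = [
        "phyml",                # executable name may differ per installation
        "-i", alignment_path,   # alignment, typically in PHYLIP format
        "-d", datatype,         # "nt" for DNA, "aa" for protein
        "-m", model,            # substitution model chosen beforehand
        "-v", "e",              # +I: estimate the proportion of invariable sites
        "-a", "e",              # +G: estimate the gamma shape parameter
        "-s", search,           # topology search: NNI, SPR or best of both
        "-b", str(bootstraps),  # number of bootstrap pseudoreplicates
    ]
    if start_tree is not None:
        cmd += ["-u", start_tree]  # user-defined starting tree instead of BioNJ
    return cmd


print(" ".join(phyml_command("alignment.phy", bootstraps=100)))
```

Building the command as a list keeps each setting visible and makes it easy to rerun the analysis with a different model or starting tree.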
Bootstrapping
Bootstrapping has the program repeat the same
tree building on pseudoreplicates of the alignment,
each made by resampling alignment columns at random with replacement, and
then calculate how many times per hundred
pseudoreplicates a branch is recovered with the same
topology.
A bootstrap value greater than 70% is in general considered significant.
The higher the number of pseudoreplicates, the more
reliable the support estimates are.
1000 bootstrap pseudoreplicates are preferable, but
considering the time required, 100 pseudoreplicates
also suffice.
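The two halves of bootstrapping, making pseudoreplicates and counting how often a branch reappears, can be sketched separately. The code below assumes the standard procedure of resampling alignment columns with replacement, and it represents each branch as the set of taxa on one side of it; the tree building for each pseudoreplicate is deliberately left out, and the function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """One pseudoreplicate: sample columns with replacement, same length as the original."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def bootstrap_support(reference_clades, replicate_clades):
    """Percentage of replicate trees that contain each clade of the reference tree.

    Clades are frozensets of taxon names; replicate_clades holds one set of
    clades per pseudoreplicate tree.
    """
    total = len(replicate_clades)
    return {
        clade: 100.0 * sum(clade in rep for rep in replicate_clades) / total
        for clade in reference_clades
    }


rng = random.Random(0)  # fixed seed so the example is repeatable
replicate = bootstrap_alignment({"a": "ACGT", "b": "AGGT"}, rng)
print(replicate)
```

Because whole columns are resampled, every sequence in a pseudoreplicate keeps its alignment with the others; only the weighting of sites changes.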
Re-construction
•Once the tree is generated, it is broadly checked for
accuracy using the bootstrap value of each branch as well as any disproportionate
branch lengths.
•In case of faulty trees, corrections need to be made for both aspects.
•If the MSA is curated properly, then one might need to remove rogue
taxa (taxa that are problematic for the tree topology or branch
lengths) using available software.
The entire process, starting from the search for optimal substitution models,
may need to be repeated.
•If no rogue taxa can be identified, reducing the
sequence diversity could also be tried, so that only the more relevant sequences
are included in the MSA.
•The NJ starting tree option can also be changed to a user-defined tree option.
•The tree construction is repeated over a number of cycles until an
appropriate tree is generated.
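The check of bootstrap values and branch lengths can be made systematic with a small filter. This sketch assumes the branches have already been extracted from the tree into simple records (name, length, support); the 70% support threshold comes from the bootstrapping slide, while the rule "longer than five times the median branch length" is an arbitrary assumption chosen for illustration.

```python
from statistics import median


def flag_branches(branches, min_support=70.0, length_factor=5.0):
    """Return (name, reasons) for branches with low support or unusually long length.

    Each branch is a dict with "name", "length" and "support" (None for
    branches without a bootstrap value, such as terminal branches).
    """
    typical = median(b["length"] for b in branches)
    flagged = []
    for b in branches:
        reasons = []
        if b.get("support") is not None and b["support"] < min_support:
            reasons.append("low support")
        if typical > 0 and b["length"] > length_factor * typical:
            reasons.append("long branch")
        if reasons:
            flagged.append((b["name"], reasons))
    return flagged


branches = [
    {"name": "A", "length": 0.10, "support": 95},
    {"name": "B", "length": 0.12, "support": 55},
    {"name": "C", "length": 0.09, "support": None},
    {"name": "D", "length": 1.50, "support": 88},
]
print(flag_branches(branches))
```

Flagged branches are candidates for a closer look at the underlying sequences and alignment before the next cycle.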
