Sequence alignment

Presentation on
Sequence Alignment
Zeeshan Akram Hanjra
15211506-050
Bot-309 Genetics-I
Bs Botany 6th (A)
University of Gujrat

● Sequence alignment is the procedure of comparing two or
more sequences by searching for a series of individual
characters or character patterns that are in the same
order in the sequences.
● Two sequences are aligned by writing them across a
page in two rows.
● Identical or similar characters are placed in the same
column, and non-identical characters can either be
placed in the same column as a mismatch or opposite a
gap in the other sequence.
 What is sequence alignment?

Alignment
Alignment is the task of locating “equivalent”
regions of two or more sequences to maximize
their similarity.
● NIKESH NARAYANAN
● NIGESH NARAYAN--
(Mismatch: K aligned with G in the third column)
(Gaps: the two dashes at the end of the second row)
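
The columns of the example alignment above can be tallied mechanically. The following is a minimal Python sketch, not part of the original slides: the function name column_summary and the use of "-" as the gap symbol are assumptions made for illustration, and the space between the two words is dropped.

```python
def column_summary(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count matching, mismatching and gapped columns of an alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            counts["gap"] += 1       # a character sits opposite a gap
        elif a == b:
            counts["match"] += 1     # identical characters in one column
        else:
            counts["mismatch"] += 1  # different characters in one column
    return counts


print(column_summary("NIKESHNARAYANAN", "NIGESHNARAYAN--"))
# {'match': 12, 'mismatch': 1, 'gap': 2}
```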

● A way of arranging sequences of DNA, RNA or
protein to identify regions of similarity.
● Helps in inferring functional, structural or
evolutionary relationships between the sequences.
● Sequence alignment methods are used to find the
best-matching sequences.
● To determine the nucleotide sequence of DNA
(adenine, thymine, cytosine, guanine).
 Why?

An algorithm is a sequence of instructions that one
must perform in order to solve a well-formulated
problem.
First, you must identify exactly what the problem is.
A problem describes a class of computational tasks. A
problem instance is one particular input from that
class.
Algorithms

 An algorithm must stop after a finite number of steps.
 All steps must be precisely defined.
 Input to the algorithm must be specified.
 Output of the algorithm must be specified.
 It must be effective.
Features of Algorithm

● A genetic algorithm is a search technique used in
artificial intelligence and computing. It finds optimized
solutions to search problems using principles drawn from
natural selection and evolutionary biology.
Genetic Algorithm

 Global alignment
Attempts to align the entire sequence using as
many characters as possible, up to both ends of
each sequence.
Sequences that are quite similar and approximately
the same length are suitable candidates for global
alignment.
The Needleman-Wunsch algorithm is used to produce
global alignments between pairs of DNA or protein
sequences.
Types Of Alignment
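
A global alignment of the Needleman-Wunsch kind can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the scores (match +1, mismatch -1, gap -2 per gapped column) and the function name are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap            # a[:i] against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap            # b[:j] against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two characters
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

Both sequences are aligned end to end, so every character of each sequence appears in the result.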

Local alignment
● Stretches of sequence with the highest density of
matches are aligned.
● Generates one or more islands of matches or sub
alignments in the aligned sequences.
● Suitable for aligning sequences that are similar along
some of their length but dissimilar in others,
sequences that differ in length, or sequences that
share a conserved region or domain.
● The Smith-Waterman algorithm is used to produce local
alignments.
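
A local alignment of the Smith-Waterman kind differs from the global case in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell rather than the corner. The sketch below is illustrative only; the scores (match +2, mismatch -1, gap -2) and the function name are assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, local_a, local_b) for the best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new island
                              score[i - 1][j - 1] + sub,  # extend diagonally
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("TTTACGTAAA", "GGACGTCC"))  # (8, 'ACGT', 'ACGT')
```

Only the shared island ACGT is reported; the dissimilar flanks are left out of the alignment.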

Fig: Distinction between global and local alignment of two sequences

 Function or activity of a new gene/protein.
 Structure or shape of a new protein.
 Location or preferred location of a protein.
 Stability of a gene or protein.
 Origin of a gene or protein.
 Origin or phylogeny of an organelle.
 Origin or phylogeny of an organism.
Goals/Importance Of Alignment

● Parametric sequence alignment refers to computer
methods that are used to find a range of
possible alignments
● in response to varying the scoring system used
for matches, mismatches, and gaps.
● There is also an effort to choose scores such that
global and local types of sequence
alignment give consistent results.
Parametric sequence alignment
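
The idea of varying the scoring system can be illustrated by computing the optimal global score of one sequence pair under several gap penalties. This is a minimal sketch under assumed scores (match +1, mismatch -1); the function names global_score and sweep_gap_penalty are hypothetical.

```python
def global_score(a, b, match, mismatch, gap):
    """Optimal global alignment score, keeping only two rows of the table."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align two characters
                           prev[j] + gap,       # gap in b
                           cur[j - 1] + gap))   # gap in a
        prev = cur
    return prev[-1]


def sweep_gap_penalty(a, b, gaps, match=1, mismatch=-1):
    """Map each candidate gap penalty to the optimal score it produces."""
    return {g: global_score(a, b, match, mismatch, g) for g in gaps}


print(sweep_gap_penalty("ACGT", "AGT", [0, -1, -2]))
# {0: 3, -1: 2, -2: 1}
```

A fuller parametric analysis would also record which alignment is optimal in each region of parameter space; the sketch reports only the scores.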

● The quality of an alignment can be measured in
terms of the number of gaps introduced and the
number of mismatches remaining in the
alignment.
● We could score the alignment by counting how
many positions match identically.
● Many gaps may have to be introduced where the
sequences are not strictly identical.
Gaps

● In such cases, the possible positions of gaps in the
alignment become numerous and more
complex.
● If gaps can be inserted without cost, the algorithms
produce alignments containing very large proportions of
matching letters and large numbers of gaps.
Cont.

● Although this process achieves an optimum score
and is mathematically meaningful,
● the result of such a process would be
biologically meaningless, because insertion and
deletion of monomers is a relatively slow
evolutionary process.
Mismatches

● Dynamic programming algorithms use gap
penalties to maximize the biological meaning.
● A simple score contains a positive additive
contribution of 1 for every matching pair of
letters in the alignment.
● A gap penalty is subtracted for each gap that
has been introduced. There are different kinds of gap
penalty: a constant penalty, a
proportional penalty, and an affine penalty, which combines
a gap opening and a gap extension penalty.
Cont.

● The total alignment score is then a function of
the identity between aligned residues and the gap
penalties incurred.
Cont.
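
The three kinds of gap penalty and the resulting total score can be sketched as follows. This is an illustration under stated assumptions: +1 per matching pair and 0 per mismatch (as in the simple score described above), an affine convention of opening cost plus extension cost for each further gapped column, and hypothetical function names and default costs.

```python
def gap_penalty(length, kind="affine", open_cost=2.0, extend_cost=0.5):
    """Penalty for one gap spanning `length` consecutive columns."""
    if length <= 0:
        return 0.0
    if kind == "constant":        # same cost whatever the gap length
        return open_cost
    if kind == "proportional":    # cost grows linearly with length
        return extend_cost * length
    if kind == "affine":          # opening cost, then a cost per extra column
        return open_cost + extend_cost * (length - 1)
    raise ValueError(f"unknown gap penalty kind: {kind}")


def score_alignment(row_a, row_b, match=1.0, kind="affine",
                    open_cost=2.0, extend_cost=0.5):
    """Total score: matches counted, one penalty subtracted per gap run."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0.0
    run_len, run_row = 0, None    # current gap run and the row holding it
    for x, y in zip(row_a, row_b):
        gap_row = "a" if x == "-" else ("b" if y == "-" else None)
        if gap_row is not None and gap_row == run_row:
            run_len += 1          # the current gap is being extended
            continue
        total -= gap_penalty(run_len, kind, open_cost, extend_cost)
        if gap_row is None:
            run_len, run_row = 0, None
            if x == y:
                total += match
        else:
            run_len, run_row = 1, gap_row   # a new gap opens
    total -= gap_penalty(run_len, kind, open_cost, extend_cost)
    return total


print(score_alignment("NIKESHNARAYANAN", "NIGESHNARAYAN--"))  # 9.5
```

Here 12 matches contribute 12, and the single two-column gap costs 2.0 + 0.5 under the assumed affine penalty.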

● Distance measures treat sequences as points in a metric space.
● A distance function associates a numeric value with a pair
of sequences.
● The larger the distance, the smaller the similarity, and
vice versa.
● A distance satisfies the mathematical axioms of a metric.
● Distance and similarity can be used interchangeably to compare sequences.
Distance measure

● It can be measured in terms of the number of gaps
introduced and the number of mismatches
remaining.
● It is also known as edit distance.
● It is the minimum number of edit operations
required to change one string into the other.
● An edit operation is the insertion, deletion
or alteration (substitution) of a single character.
Levenshtein distance
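
The edit distance described above can be computed with a standard dynamic programming table. The sketch below is a minimal Python version that keeps only two rows of the table; the function name is an assumption.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))      # distance from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        cur = [i]                       # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or keep
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```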

● The distance between two sequences of equal
length is the number of positions with
mismatched characters.
● It is desirable to assign variable weights to
different edit operations, since certain changes
are more likely to occur naturally.
Hamming distance
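
Both the plain Hamming distance and a weighted variant can be sketched briefly. The weighted version is an illustration of the variable-weights idea for DNA only: the costs (1 for a transition A<->G or C<->T, 2 for a transversion) are assumed values, and the function names are hypothetical.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


# Purine-purine and pyrimidine-pyrimidine changes (transitions).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def weighted_hamming(s, t, transition_cost=1.0, transversion_cost=2.0):
    """Hamming-style distance for DNA with assumed, illustrative weights."""
    if len(s) != len(t):
        raise ValueError("weighted distance needs strings of equal length")
    total = 0.0
    for a, b in zip(s, t):
        if a == b:
            continue
        is_transition = frozenset((a, b)) in TRANSITIONS
        total += transition_cost if is_transition else transversion_cost
    return total


print(hamming("GATTACA", "GACTATA"))  # 2
print(weighted_hamming("AC", "CA"))   # 4.0
```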

● Strings are used to represent text in the Perl programming
language.
● They are usually surrounded by single or double quotation
marks.
● Given two character strings of equal length, the Hamming
distance measures the distance between them.
 Strings

● Amino acid substitutions tend to be conservative:
the replacement of one amino acid by
another with similar size or
physicochemical properties is more likely to occur
than its replacement by another amino acid
with very different properties.
● Algorithms use different distance measures to
compute and score alignments.
High scoring matches

● Similar sequences give a high score.
● A high score on its own has only mathematical significance.
● Dissimilar sequences give a low score.
● An algorithm for optimal alignment can seek either
to minimize a dissimilarity measure or to maximize
a scoring function.
High scoring, low scoring

● It generally involves full-length sequences, and a
comprehensive alignment requires that many
residues be placed at positions that are
not strictly identical.
● For a biologically meaningful comparison, the
positioning of gaps and the number of
mismatches have to be balanced.
Sequence comparison

● To achieve the optimum score, penalties are
introduced to minimize the number of gaps, and
extension penalties are added when a gap is
extended.
● The important task of sequence scoring is to
distinguish between high-scoring and low-scoring
alignments.
Optimum score

It is useful to discover
● Structural ,functional and evolutionary
information.
Sequences that are similar
● Probably have the same function.
● A regulatory role in the case of similar DNA molecules.
● A similar biochemical function and 3-D structure
in the case of proteins.
Uses of sequence alignment

It is important to obtain
● Best possible or optimal alignment.
If two sequences from two different organisms are
similar
● There may have been a common ancestor sequence.
● The sequences are then said to be homologous.
Uses

● Alignment indicates the changes that could have
occurred between two homologous sequences and a
common ancestor sequence.
● It helps to find the database sequences
that are potentially related to a particular
sequence.
Uses Cont.

Two scientists, Doolittle and Waterfield, made one of the
first discoveries of similar sequences.
● They found that the viral oncogene v-sis is
a modified form of the normal cellular gene
that encodes platelet-derived growth factor.
● Dynamic programming algorithms find the best
alignment.
● The process is slow, especially for long sequences.
Uses

Due to random mutations nucleotides may be
 Replaced
 Deleted
 Or inserted.
Loss of function of a protein is a disadvantage to the
organism.
A change will survive if it does not have a deleterious effect on the
protein.
Scoring mutations, Deletions and Substitution

• If the change is deleterious, the organism may not
survive and the gene will not be passed on.
• Most substitution mutations are well tolerated in
proteins.
A substitution that does not affect the properties of the protein
is called a conservative substitution.
• Protein-coding genes evolve much more slowly.
• When proteins evolve, the changes tend to
involve substitutions between amino acids with
similar properties.
Cont.

Protein sequences from the same evolutionary family
show
 substitutions between amino acids with similar
physicochemical properties.
 A substitution score matrix is used to show scores
for amino acid substitutions.
 While comparing proteins, we can increase
sensitivity to weak alignments by using a substitution
matrix.
Cont.

● In different species, the amino acid substitutions
that occur in functioning proteins are
compatible with the protein's structure and function.
● The substituted amino acids are usually chemically similar,
but other changes also occur.
● Knowing the types of changes that occur in proteins can assist in
predicting alignments.
● If protein sequences are similar, they are easily
aligned.
Amino acid substitution matrix

 The amino acid changes that occurred during evolution can
be predicted if ancestral relationships among a group of
proteins are assessed.
 Margaret Dayhoff pioneered this analysis.
 Symbol comparison tables are used for this purpose.
Mechanism:
 In the matrices, amino acids are listed across the top
and down the side.
 Each matrix position is filled with a score.
 The score reflects how often one amino acid is paired
with another in alignments of related proteins.
Cont.
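
Looking up scores in a substitution matrix can be sketched with a small symmetric table. The values below are illustrative numbers chosen for this sketch, not a published PAM or BLOSUM matrix, and the names TOY_SCORES, pair_score and ungapped_score are hypothetical.

```python
# Illustrative scores for three amino acids (L = Leu, I = Ile, D = Asp).
# Only one orientation of each pair is stored; lookups treat it as symmetric.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("L", "I"): 2,    # chemically similar pair: small positive score
    ("L", "D"): -4,   # dissimilar pair: negative score
    ("I", "D"): -3,
}


def pair_score(x, y, table=TOY_SCORES):
    """Score for aligning residue x with residue y."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]


def ungapped_score(a, b, table=TOY_SCORES):
    """Sum of pair scores over the columns of an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y, table) for x, y in zip(a, b))


print(ungapped_score("LID", "ILD"))  # 10
print(ungapped_score("LLD", "LDD"))  # 6
```

The first pair differs only by similar residues and scores higher than the second, which contains one dissimilar substitution.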

● The probability of changing amino acid A into
B is assumed to be the same as that of the reverse change.
● This is because the ancestor amino acid in the
phylogenetic tree is not known.
● The model predicts that over
evolutionary time amino acid frequencies will
not change.
● When calculating alignment scores, identical amino
acids should be given a higher value than substitutions.
Cont.

• Among substitutions, conservative substitutions
should be given a greater value than non-conservative
substitutions.
• Two popular matrices are
the Dayhoff mutation data (MD) matrix
and BLOSUM.
They have been devised to weight matches between
non-identical residues.
• The MD score is based on the concept of the point accepted
mutation (PAM).
Cont.

● A PAM matrix is a matrix where each column and row
represents one of the twenty standard amino acids.
● In bioinformatics, PAM matrices are regularly used as
substitution matrices to score sequence alignments for
proteins.
● Missense mutations that are accepted by natural selection
may be classed as point accepted mutations.
Percent Accepted mutation matrix

● The genetic instructions of every replicating cell in a
living organism are contained within its DNA.
● Throughout the cell's lifetime, this information is
transcribed and replicated by cellular mechanisms
to produce proteins or to provide instructions for
daughter cells during cell division.
● The possibility exists that the DNA may be altered during these
processes. This is known as a mutation.
● At the molecular level, there are regulatory systems
that correct most but not all of these changes to the
DNA before it is replicated.
 Biological background

● PAM matrices were introduced by Margaret Dayhoff
in 1978.
● The calculation of these matrices was based on
1572 observed mutations in the phylogenetic trees
of 71 families of closely related proteins.
● The proteins to be studied were selected on the
basis of having high similarity with their
predecessors.
● The protein alignments included were required to
display at least 85% identity.
Construction of PAM matrices

As a result, it is reasonable to assume that any aligned
mismatches were the result of a single mutation event,
rather than several at the same location.
● Each PAM matrix has twenty rows and twenty
columns, one for each of the twenty amino
acids translated by the genetic code.
● The value in each cell of a PAM matrix is related to
the probability of a row amino acid before the
mutation being aligned with a column amino acid
afterwards.
Cont.

● For each branch in the phylogenetic trees of the protein
families, the number of mismatches observed was
recorded, together with the two amino acids involved.
● These counts were used as entries below the main diagonal of
a matrix A, which is assumed to be symmetric.
● The mutability of an amino acid is the ratio of the number
of mutations to the number of times it occurs in an alignment.
● Cysteine and tryptophan were found to be the least mutable
amino acids.
● Cysteine's side chain contains sulfur which participates in
disulfide bonds.
 Collection of data from phylogenetic tree

● Relative mutabilities were evaluated by counting, in each
group of related sequences, the number of changes of
each amino acid and dividing this number by a factor
called the exposure to mutation of the amino acid.
● This factor is the product of the frequency of occurrence
of the amino acid in that group and the total number of
all amino acid changes that occurred in that group per
100 sites.
● By these scores, Asn, Ser, Asp and Glu were the most
mutable amino acids, and Cys and Trp were the least
mutable.
 Relative mutabilities
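
The relative mutability calculation can be sketched for a single group of sequences. This is a simplified illustration, assuming the exposure to mutation is the amino acid's frequency in the group multiplied by the group's total number of changes per 100 sites; the function name and the example counts are hypothetical and are not Dayhoff's data.

```python
def relative_mutability(changes, frequency, total_changes_per_100_sites):
    """Changes of each amino acid divided by its exposure to mutation.

    changes   -- observed number of changes per amino acid in the group
    frequency -- frequency of occurrence of each amino acid in the group
    """
    result = {}
    for aa, n_changes in changes.items():
        # Assumed definition of exposure for this sketch.
        exposure = frequency[aa] * total_changes_per_100_sites
        result[aa] = n_changes / exposure
    return result


# Hypothetical counts for two amino acids in one group.
example = relative_mutability(
    changes={"S": 40, "W": 2},
    frequency={"S": 0.08, "W": 0.01},
    total_changes_per_100_sites=5.0,
)
print(example)
```

With these made-up numbers serine comes out more mutable than tryptophan, which matches the direction of the ranking reported above.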

● The molecular clock hypothesis predicts that the rate
of amino acid substitution in a particular protein will
be approximately constant over time, though this rate
may vary between protein families.
● This suggests that the number of mutations per amino
acid in a protein increases approximately linearly with
time.
● Determining the time at which two proteins diverged
is an important task in phylogenetics.
 Determining the time of divergence in
phylogenetic trees

● Fossil records are often used to establish the
position of events on the timeline of the Earth's
evolutionary history, but the application of this
source is limited.
● However, suppose the rate at which the molecular clock of
a protein family ticks (that is, the rate at which the
number of mutations per amino acid increases) is
known.
● Then knowing this number of mutations would
allow the date of divergence to be found.
Cont.
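
Under a constant-rate clock, the date of divergence follows from a single division. The sketch below is a simplified illustration: it assumes the rate is given per site per year for one lineage, so the distance between two present-day sequences accumulates at twice that rate. The function name and the example numbers are hypothetical.

```python
def divergence_time(subs_per_site, rate_per_site_per_year, per_lineage=True):
    """Estimated years since two sequences shared a common ancestor.

    subs_per_site          -- observed substitutions per site between the pair
    rate_per_site_per_year -- clock rate, assumed constant over time
    per_lineage            -- True if the rate applies to a single lineage,
                              so both branches contribute to the distance
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    factor = 2.0 if per_lineage else 1.0
    return subs_per_site / (factor * rate_per_site_per_year)


# Hypothetical values: 0.2 substitutions per site, 1e-9 per site per year.
print(divergence_time(0.2, 1e-9))
```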

● PAM matrices are usually converted into
another form, called log odds matrices.
● The odds score represents the ratio of the
chance of an amino acid substitution under two
different hypotheses.
● One is that the change actually represents an
authentic evolutionary variation at that site.
● The other is that the change occurred by chance
sequence variation, with no biological significance.
Log odds matrices
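
A log odds score is the scaled logarithm of the ratio between the two probabilities. The sketch below is generic: the function name, the rounding to an integer, and the default scale and base (10 times log base 10) are assumptions for illustration, and the probabilities in the example are made up.

```python
import math


def log_odds_score(p_related, p_chance, scale=10.0, base=10.0):
    """Rounded, scaled logarithm of the odds ratio p_related / p_chance.

    p_related -- probability of the substitution if it reflects
                 authentic evolutionary variation
    p_chance  -- probability of seeing the same pair by chance
    """
    if p_related <= 0 or p_chance <= 0:
        raise ValueError("probabilities must be positive")
    return round(scale * math.log(p_related / p_chance, base))


print(log_odds_score(0.02, 0.01, scale=2.0, base=2.0))  # 2
print(log_odds_score(0.001, 0.01))                      # -10
```

A positive score means the pair is seen more often in related sequences than chance predicts; a negative score means it is seen less often.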

• PAM matrices are also used as a scoring matrix
when comparing DNA sequences or protein
sequences to judge the quality of the alignment.
• This form of scoring system is utilized by a wide
range of alignment software including BLAST.
Use in BLAST

● The BLOSUM substitution matrix is widely used for
scoring protein sequence alignments.
● The BLOSUM matrices are based on a different
type of sequence analysis and a much larger data set
than the PAM matrices.
What is BLOSUM?

● The matrix values are based on the observed
amino acid substitutions in a large set of more than
2000 conserved amino acid patterns, called blocks.
● These blocks have been found in a database of
protein sequences representing more than 500
families of related proteins and act as signatures
of these protein families.
What is BLOCK?

● The Prosite catalog provides lists of proteins that
are in the same family because they have similar
biochemical functions.
● For each family, a pattern of amino acids that is
characteristic of that function is provided.
● The Henikoffs examined each Prosite family for the
presence of ungapped amino acid patterns (blocks)
that could be used to identify members of that
family.
Cont.

● To locate these patterns, the sequences of each
protein family were searched for similar amino acid
patterns by the MOTIF program.
● These initial patterns were organized into larger
ungapped patterns (blocks) between 3 and 60 amino
acids long by the Henikoffs' PROTOMAT program.
● These blocks are present in all the sequences in each
family.
● They could be used to identify other members of the
family.
How to locate patterns?

● The blocks that characterize each family
provide a type of multiple sequence alignment for
that family.
● The amino acid changes in each column of the alignment
could be counted.
● The types of substitution were used to prepare a
scoring matrix, the BLOSUM matrix.
● The scores are given as log odds: the logarithm of the ratio
of the observed frequency of an amino acid pair to the
frequency expected by chance.
Cont.
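
The counting and log odds steps can be sketched on one tiny block. This is a simplified illustration of the general procedure, not the Henikoffs' program: it skips the grouping of similar sequences, uses a made-up block, and assumes scores in half-bit units (2 times log base 2). The function name is hypothetical, and pairs never observed in the block get no score.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Log odds scores from the columns of an ungapped block."""
    # 1. Count every pair of residues within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: c / total for pair, c in pair_counts.items()}
    # 2. Residue frequencies implied by the pair frequencies.
    freq = Counter()
    for (x, y), q in observed.items():
        if x == y:
            freq[x] += q
        else:
            freq[x] += q / 2
            freq[y] += q / 2
    # 3. Score = 2 * log2(observed / expected), rounded to an integer.
    scores = {}
    for (x, y), q in observed.items():
        expected = freq[x] * freq[y] if x == y else 2 * freq[x] * freq[y]
        scores[(x, y)] = round(2 * math.log2(q / expected))
    return scores


# A made-up block of four two-residue segments.
print(blosum_like_scores(["AS", "AS", "AS", "AT"]))
```

With so few sequences the numbers are not biologically meaningful; real matrices are built from thousands of blocks.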

● Amino acid changes are counted in the blocks.
● Closely related sequences were grouped together into one
sequence before scoring the amino acid substitutions
in the aligned block.
● Patterns that were 60% identical were grouped
together to make one substitution matrix, called
BLOSUM60.
● Those that were 80% alike gave BLOSUM80.
How to count amino acids?
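
Grouping the sequences of a block by percent identity can be sketched as follows. This is a minimal illustration: a sequence joins a group if it reaches the threshold against any member (an assumed single-linkage rule), and the function names and example segments are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length segments."""
    if len(a) != len(b) or not a:
        raise ValueError("segments must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_identity(seqs, threshold):
    """Group segments so that linked members share >= threshold % identity."""
    clusters = []
    for s in seqs:
        # Existing groups that contain at least one close relative of s.
        linked = [c for c in clusters
                  if any(percent_identity(s, t) >= threshold for t in c)]
        # Merge those groups with s; keep the others unchanged.
        clusters = [c for c in clusters if all(c is not l for l in linked)]
        clusters.append([s] + [t for c in linked for t in c])
    return clusters


print(cluster_by_identity(["AAAAA", "AAAAC", "CCCCC"], threshold=80))
# [['AAAAC', 'AAAAA'], ['CCCCC']]
```

Each resulting group would then be treated as a single sequence when substitutions are counted, so that many near-identical family members do not dominate the matrix.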

● Like PAM, BLOSUM is based on similar principles
of target frequencies of mutation.
● BLOSUM makes use of the BLOCKS database.
● Blocks contain local multiple alignments of
distantly related sequences.
● BLOSUM reflects an evolutionary model in its matrix
formation, although less explicitly than PAM.
Similarity between BLOSUM & PAM.
