2. Sequence
• A sequence in biology is the one-dimensional ordering of monomers
covalently linked within a biopolymer.
• Also referred to as the primary structure of a biological
macromolecule.
• In bioinformatics, the term refers to a DNA, RNA or protein sequence.
3. Sequence alignment
• Procedure of comparing two or more sequences by searching for a
series of individual characters or character patterns that are in the
same order in the sequences.
• Two sequences are aligned by writing them across a page in two rows.
• Identical or similar characters are placed in the same column, and
non-identical characters can either be placed in the same column as a
mismatch or opposite a gap in the other sequence.
• In an optimal alignment, non-identical characters and gaps are placed
to bring as many identical or similar characters as possible into
vertical register.
• Sequences that can be readily aligned in this manner are said to be
similar.
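The row-and-column layout described above can be sketched in a few lines of Python. The function name `describe_alignment`, the "-" gap symbol and the example rows are illustrative assumptions, not part of the original slides.

```python
def describe_alignment(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count match, mismatch and gap columns in two already-aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            counts["gap"] += 1        # character placed opposite a gap
        elif a == b:
            counts["match"] += 1      # identical characters in one column
        else:
            counts["mismatch"] += 1   # non-identical characters in one column
    return counts


# Hypothetical example: two rows written one above the other.
row_a = "GATTACA"
row_b = "GA-TGCA"
summary = describe_alignment(row_a, row_b)
```

Here each column falls into exactly one of the three cases listed in the bullets: match, mismatch, or gap.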
4. Two types of sequence alignment:
–Global alignment
–Local alignment
Fig.: Distinction between Global and Local alignment of two sequences
5. • Global alignment
– Attempts to align the entire sequence using as many characters as possible,
up to both ends of each sequence.
– Sequences that are quite similar and approximately the same length are
suitable candidates for global alignment.
– The Needleman-Wunsch algorithm is used to produce global alignments between
pairs of DNA or protein sequences.
6. • Local alignment
– Stretches of sequence with the highest density of matches are aligned.
– Generates one or more islands of matches, or subalignments, in the aligned
sequences.
– Suitable for aligning sequences that are similar along some of their lengths
but dissimilar in others, sequences that differ in length, or sequences that
share a conserved region or domain.
– Smith-Waterman algorithm is used to produce local alignments between pairs
of DNA or protein sequences.
7. Dynamic Programming
• Method for solving a complex problem by breaking it down into a
collection of simpler sub-problems, solving each of these sub-problems
just once and storing their solutions, ideally in a memory-based
data structure.
• The next time the same sub-problem occurs, instead of recomputing
its solution, one simply looks up the previously computed solution,
thereby saving computation time at the expense of a modest
expenditure in storage space.
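As a small illustration of the "solve once, store, look up" idea, the sketch below computes the length of the longest common subsequence of two strings with a dictionary as the memory-based store. The problem choice and the names `lcs_length`, `memo` and `solve` are assumptions made for illustration only.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (memoized)."""
    memo = {}  # (i, j) -> answer for the sub-problem a[i:], b[j:]

    def solve(i: int, j: int) -> int:
        if i == len(a) or j == len(b):
            return 0                      # one string exhausted
        if (i, j) not in memo:            # solve each sub-problem just once
            if a[i] == b[j]:
                memo[(i, j)] = 1 + solve(i + 1, j + 1)
            else:
                memo[(i, j)] = max(solve(i + 1, j), solve(i, j + 1))
        return memo[(i, j)]               # otherwise look up the stored value

    return solve(0, 0)
```

The dictionary holds at most one entry per pair of positions, which is the modest storage cost paid to avoid recomputing overlapping sub-problems. For very long inputs an iterative table would avoid Python's recursion limit.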
8. Three steps in dynamic programming:
• Initialization
• Matrix fill (scoring)
• Traceback (alignment)
9. • Initialization:
– Involves creating a matrix with M+1 columns and N+1 rows, where
M and N are the lengths of the two sequences to be aligned.
– The first row and the first column are initialized with scores
corresponding to gap penalties.
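A minimal sketch of this initialization step for a global alignment, assuming a linear gap penalty (each gap position costs the same). The function name `init_matrix` and the default penalty of 2 are illustrative assumptions.

```python
from typing import List


def init_matrix(seq_a: str, seq_b: str, gap_penalty: int = 2) -> List[List[int]]:
    """Build an (N+1) x (M+1) matrix with gap-penalty borders."""
    cols = len(seq_a) + 1   # M + 1 columns
    rows = len(seq_b) + 1   # N + 1 rows
    matrix = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        matrix[0][j] = -gap_penalty * j   # first row: j leading gaps
    for i in range(rows):
        matrix[i][0] = -gap_penalty * i   # first column: i leading gaps
    return matrix


grid = init_matrix("GATT", "GA")
```

For local alignment the first row and column are conventionally left at zero instead.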
11. • Matrix fill (scoring)
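– The score at each position is the maximum of three candidates: the
diagonal neighbour's score plus the match or mismatch score, the score
of the block above minus the gap penalty, or the score of the block to
the left minus the gap penalty.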
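The standard scoring rule for a single block in global alignment can be sketched as follows. The function name `cell_score` and the scoring values (match = 1, mismatch = -1, gap penalty = 2) are assumptions chosen for illustration; the original slide's exact formula is not reproduced here.

```python
def cell_score(diag: int, up: int, left: int, a: str, b: str,
               match: int = 1, mismatch: int = -1, gap_penalty: int = 2) -> int:
    """Score of one block from its three already-filled neighbours."""
    substitution = match if a == b else mismatch
    return max(
        diag + substitution,   # align a with b
        up - gap_penalty,      # gap in the sequence written across the top
        left - gap_penalty,    # gap in the sequence written down the side
    )
```

Filling the matrix row by row guarantees that the three neighbours of a block are already scored when the block itself is reached.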
13. • Traceback (alignment)
– Traceback starts from the last block (bottom-right corner) and continues
until the first block (top-left corner) of the matrix.
15. Needleman-Wunsch algorithm
• Based on dynamic programming.
• The optimal score at each position is calculated by adding the current
match score to previously scored positions and subtracting gap
penalties (if applicable).
• Each matrix position may have a positive or negative score or zero.
• The Needleman-Wunsch algorithm maximizes the number of
matches between the sequences along their entire length.
• Traceback starts at the last block and ends at the first block.
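Putting the three steps together, a compact sketch of global alignment might look like this. It assumes a linear gap penalty and simple match/mismatch scores (1, -1, gap penalty 2); these values and the function name `needleman_wunsch` are illustrative choices, and ties in the traceback are broken in favour of the diagonal move.

```python
from typing import Tuple


def needleman_wunsch(seq_a: str, seq_b: str, match: int = 1,
                     mismatch: int = -1, gap_penalty: int = 2) -> Tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1

    # Initialization: borders hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = -gap_penalty * j
    for i in range(1, rows):
        score[i][0] = -gap_penalty * i

    # Matrix fill: best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[j - 1] == seq_b[i - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] - gap_penalty,
                              score[i][j - 1] - gap_penalty)

    # Traceback: from the last block back to the first block.
    out_a, out_b = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[j - 1] == seq_b[i - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[j - 1])
                out_b.append(seq_b[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - gap_penalty:
            out_a.append("-")              # gap in seq_a
            out_b.append(seq_b[i - 1])
            i -= 1
        else:
            out_a.append(seq_a[j - 1])
            out_b.append("-")              # gap in seq_b
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `needleman_wunsch("GATTACA", "GATACA")` returns the total score followed by the two aligned rows, with "-" marking gaps.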
16. Smith-Waterman algorithm
• Based on dynamic programming, but modified to give high-scoring local matches.
• Differs slightly from the Needleman-Wunsch algorithm.
• The main differences are:
– The scoring system must include negative scores for mismatches, and
– When a DP scoring matrix value becomes negative it is set to zero, which has
the effect of terminating any alignment up to that point.
• Traceback starts at the block with the highest score and ends at the
first block containing zero.
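The two differences listed above (negative mismatch scores and clamping at zero) can be seen in a short sketch of local alignment. As before, the scoring values (match = 1, mismatch = -1, gap penalty = 2) and the function name `smith_waterman` are assumptions for illustration; only the single best-scoring local alignment is reported.

```python
from typing import Tuple


def smith_waterman(seq_a: str, seq_b: str, match: int = 1,
                   mismatch: int = -1, gap_penalty: int = 2) -> Tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    score = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)

    # Matrix fill: negative values are replaced by zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[j - 1] == seq_b[i - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] - gap_penalty,
                              score[i][j - 1] - gap_penalty)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: from the highest score until a block containing zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[j - 1] == seq_b[i - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(seq_a[j - 1])
            out_b.append(seq_b[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap_penalty:
            out_a.append("-")
            out_b.append(seq_b[i - 1])
            i -= 1
        else:
            out_a.append(seq_a[j - 1])
            out_b.append("-")
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With the hypothetical inputs `"CCCGATTACACCC"` and `"TTGATTACATT"`, the shared stretch `GATTACA` is reported and the dissimilar flanking regions are left out of the alignment.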