1. Multiple Alignment
Dr Avril Coghlan
alc@sanger.ac.uk
Note: this talk contains animations which can only be seen by
downloading and using ‘View Slide show’ in Powerpoint
2. Pairwise versus Multiple Alignment
• So far we have considered the alignment of two
sequences (‘pairwise alignment’)
Q K E S G P S S S Y C
| | | | |
V Q Q E S G L V R T T C
• Alignment can be performed between three or more
sequences (‘multiple alignment’)
Q K E S G P S S S Y C
| | | | |
V Q Q E S G L V R T T C
| | | | | | | |
V Q K E S L L V R S T C
3. Multiple alignment
• Multiple alignments are useful for comparing many
homologous sequences at once
Multiple alignment of part of Eyeless from different animals
• Multiple alignments can be global or local
The majority of widely used programs for making multiple alignments
(eg. CLUSTAL, T-COFFEE) create global multiple alignments (not
local multiple alignments)
If the sequences share one stretch of high sequence similarity, it might
make sense to make a multiple alignment of just that region of
similarity eg. for Eyeless
You can “cut out” the region of similarity from each sequence, & make a
multiple alignment of that region eg. using CLUSTAL
4. Real data: Eyeless proteins
Do you think it’s sensible to make a global multiple alignment of these sequences?
5. The alignment is not very reliable in regions of low similarity
For example, look at the alignment of fly Eyeless to the other proteins here
6. • Algorithms for aligning 2 sequences (eg. N-W, S-W) can be
extended to multiple sequences
For aligning 3 sequences using N-W, we fill in a table T that is a 3D cube,
using the recurrence relation (g is the gap penalty):
$$
T(i,j,k) = \max
\begin{cases}
T(i-1,j-1,k-1) + \sigma(S_1(i),S_2(j)) + \sigma(S_1(i),S_3(k)) + \sigma(S_2(j),S_3(k)) \\
T(i-1,j,k) + 2g \\
T(i,j-1,k) + 2g \\
T(i,j,k-1) + 2g \\
T(i-1,j,k-1) + \sigma(S_1(i),S_3(k)) + 2g \\
T(i,j-1,k-1) + \sigma(S_2(j),S_3(k)) + 2g \\
T(i-1,j-1,k) + \sigma(S_1(i),S_2(j)) + 2g
\end{cases}
$$
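The recurrence for three sequences can be sketched in Python as below. This is a minimal illustration, not an optimised implementation: the scoring function `sigma` (match = +1, mismatch = -1) and the gap penalty of -2 are assumed values chosen for the example, and only the best score is returned (no traceback).

```python
from itertools import combinations, product

def sigma(a, b, match=1, mismatch=-1):
    """Assumed toy substitution score: +1 for identical residues, -1 otherwise."""
    return match if a == b else mismatch

def column_score(column, gap=-2):
    """Sum-of-pairs score of one alignment column (three characters, '-' = gap)."""
    total = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue                      # a gap paired with a gap scores nothing
        total += gap if "-" in (x, y) else sigma(x, y)
    return total

def align3_score(s1, s2, s3, gap=-2):
    """Fill the 3D table T and return the best global alignment score."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    neg_inf = float("-inf")
    T = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    T[0][0][0] = 0
    # The 7 possible moves: each sequence either advances (1) or takes a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    column = (s1[i - 1] if di else "-",
                              s2[j - 1] if dj else "-",
                              s3[k - 1] if dk else "-")
                    best = max(best, T[pi][pj][pk] + column_score(column, gap))
                T[i][j][k] = best
    return T[n1][n2][n3]

print(align3_score("QKESG", "QQESG", "QKESL"))
```

Each of the 7 moves corresponds to one line of the recurrence: all three sequences advance (three σ terms), one sequence advances (two gap penalties), or two sequences advance (one σ term plus two gap penalties).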
7. • The run-time increases exponentially with the
number of sequences you want to align
Aligning 4 sequences of 100 amino acids takes ~3 days!
• Heuristic algorithms for multiple alignment are
generally used, as they are fast
eg. CLUSTAL, T-COFFEE
‘Heuristic’ means they’re not guaranteed to find the best solution (best
alignment here)
(While N-W & S-W are proven to find the best alignment)
• A popular heuristic algorithm is CLUSTAL, by Des
Higgins and Paul Sharp at Trinity College Dublin
(1988)
Uses a ‘progressive alignment’ approach ie. aligns the most similar 2
sequences first; adds the next most similar sequence to that
alignment; adds the next most similar sequence … etc.
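To see why the exact method becomes impractical, the sketch below counts the cells in the dynamic programming table and the number of neighbouring cells examined per cell (2^k - 1 for k sequences). The function name `dp_work` is made up for this example. It only counts table look-ups; actual run-time depends on the implementation and the hardware.

```python
def dp_work(seq_len, n_seqs):
    """Return (cells, predecessors per cell, total look-ups) for exact alignment."""
    cells = (seq_len + 1) ** n_seqs        # table has (L+1) entries per dimension
    predecessors = 2 ** n_seqs - 1         # every non-empty subset of sequences can advance
    return cells, predecessors, cells * predecessors

for k in (2, 3, 4, 5):
    cells, preds, total = dp_work(100, k)
    print(f"{k} sequences: {cells:,} cells x {preds} predecessors = {total:,} look-ups")
```

Each extra sequence of 100 residues multiplies the table size by about 100, which is why heuristic methods are used instead.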
8. CLUSTAL
• A popular heuristic algorithm is CLUSTAL, by Des
Higgins and Paul Sharp at TCD (1988)
Cited >37,000 times; D. Higgins is Ireland’s most cited scientist
• CLUSTAL makes a global multiple alignment using a
‘progressive alignment’ approach
• First computes all pairwise alignments and calculates
sequence similarity between pairs
• These similarities are used to build a rough ‘guide
tree’
[Figure: guide tree with leaves S1, S2, S3 and S4]
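The two steps above (pairwise similarities, then a rough guide tree) can be sketched as follows. This is an illustration under stated assumptions, not CLUSTAL's actual code: similarity is taken to be a simple N-W score (match +1, mismatch -1, gap -2), the tree is built by average-linkage clustering, and the four sequences are hypothetical.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (N-W) alignment score of two sequences, keeping one table row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def guide_tree(seqs):
    """seqs: dict of name -> sequence. Returns the tree as nested tuples of names."""
    sim = {frozenset((a, b)): nw_score(seqs[a], seqs[b])
           for a, b in combinations(seqs, 2)}

    def avg_sim(pair):
        # Average similarity between all members of two clusters.
        (_, m1), (_, m2) = pair
        return sum(sim[frozenset((x, y))] for x in m1 for y in m2) / (len(m1) * len(m2))

    clusters = [(name, [name]) for name in seqs]      # (subtree, member names)
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=avg_sim)   # most similar pair
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Hypothetical toy sequences: S1/S2 are near-identical, as are S3/S4.
toy = {"S1": "MKVLAT", "S2": "MKVLAS", "S3": "GHWPRC", "S4": "GHWPRD"}
print(guide_tree(toy))
```

The nested tuples give the order in which a progressive aligner would merge sequences: innermost pairs first, then the resulting groups.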
9. • Step 1: Then aligns the most similar pair of sequences
This gives us an alignment of 2 sequences (called a ‘profile’)
eg. alignment of sequences S1 and S2
• Step 2: Aligns the next closest pair of sequences (or pair of
profiles, or sequence and profile)
eg. alignment of sequences S3 and S4
• Step 3: Aligns the next closest pair of seqs/profiles
eg. alignment of profiles S1-S2 and S3-S4
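[Figure: progressive alignment following the guide tree]
Step 1: S1 (MQTIF) and S2 (LHIW) are aligned, giving the profile
MQTIF
LH-IW
Step 2: S3 (LQSW) and S4 (LSF) are aligned, giving the profile
LQSW
L-SF
Step 3: profiles S1-S2 and S3-S4 are aligned, giving
MQTIF
LH-IW
LQS-W
L-S-F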
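The merging of profiles can be sketched as an N-W alignment in which whole columns, rather than single residues, are matched. This is a simplified illustration: the column score (average over residue pairs, match +1, mismatch -1, gap -2) and the tie-breaking in the traceback are assumptions, so the output need not reproduce the exact gap placement in the figure.

```python
def column_pair_score(col_a, col_b, match=1, mismatch=-1, gap=-2):
    """Average score over all residue pairs between two profile columns."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x == "-" and y == "-":
                continue                               # gap against gap scores nothing
            total += gap if "-" in (x, y) else (match if x == y else mismatch)
    return total / (len(col_a) * len(col_b))

def align_profiles(p, q, gap=-2):
    """p, q: lists of equal-length gapped strings. Returns the merged profile."""
    cols_p, cols_q = list(zip(*p)), list(zip(*q))
    n, m = len(cols_p), len(cols_q)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = i * gap
    for j in range(1, m + 1):
        T[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(T[i - 1][j - 1] + column_pair_score(cols_p[i - 1], cols_q[j - 1], gap=gap),
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)
    # Traceback: existing columns are copied unchanged; new gaps are whole columns.
    gap_p, gap_q = ("-",) * len(p), ("-",) * len(q)
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + column_pair_score(cols_p[i - 1], cols_q[j - 1], gap=gap):
            merged.append(cols_p[i - 1] + cols_q[j - 1]); i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + gap:
            merged.append(cols_p[i - 1] + gap_q); i -= 1
        else:
            merged.append(gap_p + cols_q[j - 1]); j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]

profile_12 = align_profiles(["MQTIF"], ["LHIW"])      # step 1
profile_34 = align_profiles(["LQSW"], ["LSF"])        # step 2
for row in align_profiles(profile_12, profile_34):    # step 3
    print(row)
```

Because existing columns are only ever copied or padded with an all-gap column, gaps placed in steps 1 and 2 cannot be removed in step 3.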
10. • A property of this method is that gap creation is
irreversible: ‘once a gap, always a gap’
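[Figure: the same progressive alignment as on the previous slide. The gap introduced in LH-IW (step 1) and the gap introduced in L-SF (step 2) are both still present in the final alignment:]
MQTIF
LH-IW
LQS-W
L-S-F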
• This is a ‘heuristic algorithm’, ie. it is not guaranteed to
give the best alignment
However, it is very fast & works well in most cases
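‘Once a gap, always a gap’ can be checked on the example alignment from the slides: if we take two rows of the final alignment and drop the columns where both have a gap, we should get back the earlier pairwise alignment unchanged. The helper name `project_alignment` is made up for this sketch.

```python
def project_alignment(rows):
    """Keep only the columns in which at least one row has a residue."""
    kept = [col for col in zip(*rows) if any(c != "-" for c in col)]
    return ["".join(row) for row in zip(*kept)]

# Final alignment from the slide
final = {"S1": "MQTIF", "S2": "LH-IW", "S3": "LQS-W", "S4": "L-S-F"}

print(project_alignment([final["S1"], final["S2"]]))   # compare with step 1
print(project_alignment([final["S3"], final["S4"]]))   # compare with step 2
```

The S3/S4 rows gained an extra all-gap column in step 3, but the gap in S4 from step 2 is still there.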
11. Software for making alignments
• For multiple alignment (heuristic programs)
CLUSTAL http://www.ebi.ac.uk/Tools/msa/clustalw2/
T-COFFEE http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi
MUSCLE http://www.ebi.ac.uk/Tools/msa/muscle/
MAFFT http://mafft.cbrc.jp/alignment/software/
12. Further Reading
• Chapter 3 in Introduction to Computational Genomics by Cristianini & Hahn
• Chapter 6 in Computational Genome Analysis by Deonier et al
• Practical on multiple alignment in R in the Little Book of R for
Bioinformatics:
https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter5.html
Editor's notes
Mouse sequence from: http://www.treefam.org/cgi-bin/TFseq.pl?id=ENSMUST00000111083.1 Chicken from: http://www.treefam.org/cgi-bin/TFseq.pl?id=ENSGALT00000019805.3 Seasquirt from: http://www.treefam.org/cgi-bin/TFseq.pl?id=ENSCINT00000013350.2 Human Eyeless (PAX6) from: http://www.treefam.org/cgi-bin/TFseq.pl?id=ENST00000379111.1 D. Melanogaster Eyeless from: http://www.treefam.org/cgi-bin/TFseq.pl?id=FBtr0100396.5 Aligned using clustalw. Viewed in Jalview. Saved as humanflyothers_clustal.png
Image from www.cs.iastate.edu/~cs544/.../Multiple_Sequence_Alignment.ppt slide 12 For recurrence relation, see page 189 in Jones & Pevzner ‘An introduction to bioinformatics algorithms’