Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
BLAST algorithm for sequence comparison
1. Basic Local Alignment Search Tool
(BLAST) by Stephen F. Altschul et al
J. Mol. Bio 1990
W.O.K.A.S Wijesinghe, S. D. L Gunawardena, A. A. M Athukorala
5. • Function of a newly sequenced gene or a protein
can be predicted by discovering it’s homology to
a known gene or a protein.
• A major task of Bioinformatics is to find
homologous sequence in a database of
sequences
• Databases of DNA and amino acid sequences
continue to grow in size.
• There are number of software tools based on
many algorithms.
Introduction
7. Needleman-Wunsch Algorithm
• Dynamic programming algorithm
• Assign scores to insertions ,deletions and
replacements.
• Compute an alignment of two sequences with least
mutations.
• Measurement of Similarity
• Because of computational requirements impractical
for searching a large database without a
supercomputer
Introduction
8. Smith Waterman
• Rapid Heuristic Algorithm
• Allows large databases to be searched on common computers
FastP Algorithm
• David J. Lipman and William R. Pearson in 1985
• First find locally similar regions between 2 sequences based on
identities but not gaps.
• Then rescores those regions using a similarity matrix.
• Quite popular
Introduction
10. BLAST(Basic Local Alignment Search Tool)
Algorithm
• A new approach to rapid sequence comparison.
• It directly approximate alignments that optimize a
measure of local similarity called the Maximal
Segment Pair(MSP) score.
• Recent mathematical results on MSP scores
allows performance analysis of this method
• Generated alignment has a statistical significance.
• Simple & a robust algorithm
Introduction
11. Can be applied to various contexts
• DNA sequences
• Protein sequences
• Motif searches
• Gene identification searches
• Analysis of multiple similarity regions in long DNA
sequences
Characteristic features
• Flexibility & Tractability.
• Very much faster than existing sequence
comparison tools of comparable sensitivity.
Introduction
This research paper
describes all the methods
& implementations of
BLAST algorithm.
14. TWO TYPES OF SIMILARITY MEASURES
Methods
Global
• Optimize the overall alignment of two
sequences.
• Includes large stretches of low similarity
Local
• Seek only relatively conserved subsequences
• Single comparison may yield several distinct
subsequence alignments
15. • Local similarity measure are preferred for
database searches where cDNAs can be
compared with partially sequence genes.
• Many similarity measures including the
one they describe begins with a matrix of
similarity score for all possible pairs of
residues
B
L
A
S
T
Methods
16. Scoring Alignments
• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)
• Identities & conservative replacements >>> Positive Scores
• Unlikely Replacements >> Negative scores
• Amino acid sequences : “PAM” matrix
BLOSUM
• DNA sequences : match = +5
mismatch = -4
• Sequence Segment – Contiguous stretch of residues of any length
• Similarity score for segments is the sum of similarity values for
each pair of aligned residues
Methods
17. Maximal Segment Pair (MSP)
• Given these rules they have defined a Maximal Segment Pair
(MSP) which is the highest scoring pair of identical length
segments chosen from 2 sequences.
• The similarity score of an MSP is called the MSP score which is
calculated by BLAST.
• With long sequences the search for the MSP score becomes
computationally demanding.
• Therefore BLAST searches for locally maximal segment pairs –
Score cannot be improved by either extending or shortening
segments.
Methods
19. • Goal is to report those database sequences that have MSP
score above some cutoff score S.
• Statistically the highest MSP score S can be estimated at
which “chance similarities” are likely to appear.
• BLAST minimizes time spent on database sequences whose
similarity with the query has little chance of exceeding this
score.
Methods
20. • Let a word pair be a segment pair with a fixed length w.
• Main strategy: seek only segment pairs (one from database,
one query) that contain a word pair with score at least T.
• Such hit will be extended until it exceeds the cutoff score S
and those hits will be the final output of BLAST.
• Lower T => Fewer false negatives
• Lower T => More pairs to analyze
Methods
22. • BLAST finds locally maximal
segment pairs that exceeds a
particular cutoff.
• Detailed annotation of Three
Algorithmic steps.
• Compile a list of high-scoring
words.
• Scanning the DB for hits.
• Extending hits that meet certain s
coring criteria (Extend only word
pairs with a score of at least T to
determine if it has a segment pair
of score at least S).
METHODS
B
L
A
S
T
23. COMPILING OF HIGH SCORING WORDS
• Obtain the list of words in the target sequence (k-mers), that
give a score of T or higher when aligned with the query
sequence.
• Assume that the query sequence is P Q G E F G
• We have the following four 3-mers (recall, the number
of k-mers is always N-k+1):
P Q G
Q G E
G E F
E F G
METHODS
Each of these 3-mers are then scored
against each and every one of the k-mers in
each of the target sequence. For long
sequences, this could well include all 8000
possible k-mers.
24. • So, of all pairwise scorings for P Q G (using the BLOSUM-
62 matrix), we can find the following high scoring ones:
• P Q G (of course, this is a perfect match) score of
7+5+6 = 18
• P E G score of 7+2+6 = 15
• P Q A score of 7+5+0 = 12
METHODS
25. SCANNING THE DB FOR HITS
• Scan the database for hits with the compiled list of words
obtained in previous step.
• How efficiently search a long sequence for multiple
occurrences of short sequences.
• BLAST has two approaches
• Indexing approach
• Finite state machine
METHODS
26. INDEXING APPROACH – EXAMPLE 1
• Build a lookup table of size |Σ|w for all w-length words in DB.
METHODS
27. INDEXING APPROACH – EXAMPLE 2
• Let w=3. For amino acids, the number of words is 203.
• Map a word to an integer between 1 and 203.
• Thus a word has an index into an array.
• Each index points to a list of matches of the word in the
query sequence.
• As we scan the database, each database word immediately
leads to the hits in the query sequence.
METHODS
28. EXTENDING HITS
• Once the hits are located both in the query and the target
sequence, extend the hits to form high scoring segment
pairs.
• Find the highest scoring segment (the maximal segment
pair) or those whose score exceeds (another user set)
threshold S.
• When manage to find a hit (a match between a “word” and a
database entry), extend the hit in either direction.
• Keep track of the score (use a scoring matrix).
METHODS
29. EXTENDING HITS – EXAMPLE 1
• Extend each seed on either side until the aggregate
alignment score falls below a threshold.
• Un-gapped: Extend by only either matches or mismatches.
• Gapped: Extend by matches, mismatches or a limited number of
insertion/deletion gaps.
METHODS
33. • Finally, evaluate the statistical
significance of the alignments / scores
that exceed the threshold.
• BLAST statistical significance of MSP
scores can be evaluated by following
factors.
• Performance of BLAST with random
sequences
• Performance of BLAST with
homologous sequences
• Performance comparing long DNA
sequences
RESULTS
B
L
A
S
T
35. • When two random sequences of length m
and n compared, the probability of finding
at least one HSP “by chance” is:
• Hence, the probability of finding exactly x
HSPs with a score ≥ S is given by:
RESULTS
E
eXPXPXP
1)0(1)1(1)1(
!
)(
x
E
exXP
x
E
B
L
A
S
T
36. • Where E can be defined by according to the Karlin-Altschul
equation.
• E(HSPs with score) ≥ S, also called the E-value:
RESULTS
37. • The probability of finding c or more
distinct segment pairs, all with a score of
at least S, is given by the formula:
• Utilizing this formula can be detected two
sequences that share distinct regions of
similarity as significantly related.
RESULTS
1
0 !
1)(1)(
c
i
i
E
i
E
ecXPcXP
B
L
A
S
T
38. The choice of word length
& threshold parameters
Necessary & Sufficient
Adjustments to be Done in Terms
of Word Length & Threshold Value
RESULTS
39. Time required to execute BLAST
• To Compile List of Words.
• To Scan the Database for Hits (MSP>T).
• To Extend All Hits to Seek Segment Pairs with
Scores Exceeding the Cutoff (HSP>S).
• All these three steps depend on W & T
Can we make the process more optimal?
• Decrease the time spent on step 3, by
increasing the W. But there are complementary
problems created by larger W.
• For Proteins – 20W possible word, Therefor
when W increases number of words generated
by query grows exponentially. (But number of
words increases linearly with the length of the
query)
• It increases time spent on step 1 & also the
amount of memory required.
Results
B
L
A
S
T
40. Optimal T and W values
• For protein sequences W=4, T=17.
• For DNA sequences W=11.
How W and T affect the performance of
BLAST
Results
B
L
A
S
T
T Execution Time Accuracy
Speed
W Execution Time Accuracy
Speed
Computational
Complexity ?
41. Performance of BLAST
with homologous
sequences
Things to be Noted When Query
Sequence BLAST Against Set of
Homologous Sequences
RESULTS
42. What is homologous?
• Related by common ancestor.
Researchers Example 1
• Search for wooly monkey sequence.
• When W=4 & T=17.
• Found 178 MSPs with scores (50-80).
• Random model suggest that BLAST should miss
24 of MSPs
• But actual miss 43.
• Therefor error = 44.2%
Researchers Example 2
• Search for mouse sequences.
• Same W & T as previous.
• Found 33 MSPs with scores (45-65).
• Random model suggest that BLAST should miss 8
of MSPs
• But actual miss 2.
RESULTS
B
L
A
S
T
43. • Failure to detect significant similarity does
only shows our inability to detect homology, it
does not prove that the sequences are not
homologous.
• The overall performance of BLAST depends on
the distribution of MSP scores.
Strengths of BLAST
• Great utility is for identify high scoring MSPs
quickly.
• Takes lower amount of time for the alignment
process.
Further improvements can be done
• Novel approaches like Position Specific Iterated
BLAST (PSI BLAST)
RESULTS
B
L
A
S
T
44. Comparison of two long
DNA sequences
Does Adjustments of W Make the
Process Faster?
RESULTS
45. Main Classes of Locally Similar Regions
• Genes.
• Long interspersed Repeats.
• Anticipated Weaker Similarities.
Example (Human Gene VS Rabbit Gene)
Step -1
• Match Score = 5, Mismatch Score = -4 & W=12.
• 93 Alignments scoring over 200, 57 Alignments
scoring over 350 with 1301 highest score.
Step -2
• W=8
• Only additional 32 alignments are found score
over 200.
• Use of Smaller W does not provides new
essential information always.
Results
B
L
A
S
T
48. Underlying Concept of BLAST
• Simple & Robust.
• Can be implemented in many ways.
• Can be used in variety of contexts.
Researchers Implementation
• Used a shared memory version of BLAST.
• Why shared memory?
• Loads compressed DNA file into memory
once & allow subsequent steps to skip that
step.
• BLAST approach permits construction of extremely
fast programs for database searching, Which
provides additional advantage on mathematical
advantage
To Whom This Tool Would Help
• Molecular Biologists
• Doctors & etc…
Conclusion
B
L
A
S
T
50. Q&A
Q&A
• What are the disadvantages of Blast with compared to other
heuristic algorithms? – by Pubudu
• What is the criteria of selecting threshold S and T values
using in implementation part of this research paper? – by
Parinda
• What are the chances that the Maximal Segment Pair score
for two unrelated sequences would be greater than or
equal to S value? – by Parinda
• Among PAM and BLOSUM matrices what is the most
suitable matrix when scoring segments in amino acid
sequences ? Why? – by Shashika