SlideShare une entreprise Scribd logo
1  sur  51
Basic Local Alignment Search Tool
(BLAST) by Stephen F. Altschul et al
J. Mol. Bio 1990
W.O.K.A.S Wijesinghe, S. D. L Gunawardena, A. A. M Athukorala
CONTENTS
ConclusionResults
MethodsIntroduction
»
»
»
»
»
Questions &
Answers
»
INTRODUCTION
WHY COMPARE ONE
SEQUENCE TO ANOTHER?
Introduction
• Function of a newly sequenced gene or a protein
can be predicted by discovering it’s homology to
a known gene or a protein.
• A major task of Bioinformatics is to find
homologous sequence in a database of
sequences
• Databases of DNA and amino acid sequences
continue to grow in size.
• There are number of software tools based on
many algorithms.
Introduction
EVOLUTION OF
SIMILARITY SEARCH
ALGORITHMS
Introduction
Needleman-Wunsch Algorithm
• Dynamic programming algorithm
• Assign scores to insertions ,deletions and
replacements.
• Compute an alignment of two sequences with least
mutations.
• Measurement of Similarity
• Because of computational requirements impractical
for searching a large database without a
supercomputer
Introduction
Smith Waterman
• Rapid Heuristic Algorithm
• Allows large databases to be searched on common computers
FastP Algorithm
• David J. Lipman and William R. Pearson in 1985
• First find locally similar regions between 2 sequences based on
identities but not gaps.
• Then rescores those regions using a similarity matrix.
• Quite popular
Introduction
NEED FOR A FAST
ALIGNMENT METHOD
Introduction
BLAST(Basic Local Alignment Search Tool)
Algorithm
• A new approach to rapid sequence comparison.
• It directly approximate alignments that optimize a
measure of local similarity called the Maximal
Segment Pair(MSP) score.
• Recent mathematical results on MSP scores
allows performance analysis of this method
• Generated alignment has a statistical significance.
• Simple & a robust algorithm
Introduction
Can be applied to various contexts
• DNA sequences
• Protein sequences
• Motif searches
• Gene identification searches
• Analysis of multiple similarity regions in long DNA
sequences
Characteristic features
• Flexibility & Tractability.
• Very much faster than existing sequence
comparison tools of comparable sensitivity.
Introduction
This research paper
describes all the methods
& implementations of
BLAST algorithm.
METHODS
THE MSP MEASURE
Discussion on Maximal Segment
Pairs
METHODS
TWO TYPES OF SIMILARITY MEASURES
Methods
Global
• Optimize the overall alignment of two
sequences.
• Includes large stretches of low similarity
Local
• Seek only relatively conserved subsequences
• Single comparison may yield several distinct
subsequence alignments
• Local similarity measure are preferred for
database searches where cDNAs can be
compared with partially sequence genes.
• Many similarity measures including the
one they describe begins with a matrix of
similarity score for all possible pairs of
residues
B
L
A
S
T
Methods
Scoring Alignments
• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)
• Identities & conservative replacements >>> Positive Scores
• Unlikely Replacements >> Negative scores
• Amino acid sequences : “PAM” matrix
BLOSUM
• DNA sequences : match = +5
mismatch = -4
• Sequence Segment – Contiguous stretch of residues of any length
• Similarity score for segments is the sum of similarity values for
each pair of aligned residues
Methods
Maximal Segment Pair (MSP)
• Given these rules they have defined a Maximal Segment Pair
(MSP) which is the highest scoring pair of identical length
segments chosen from 2 sequences.
• The similarity score of an MSP is called the MSP score which is
calculated by BLAST.
• With long sequences the search for the MSP score becomes
computationally demanding.
• Therefore BLAST searches for locally maximal segment pairs –
Score cannot be improved by either extending or shortening
segments.
Methods
RAPID APPROXIMATION
OF MSP SCORES
METHODS
• Goal is to report those database sequences that have MSP
score above some cutoff score S.
• Statistically the highest MSP score S can be estimated at
which “chance similarities” are likely to appear.
• BLAST minimizes time spent on database sequences whose
similarity with the query has little chance of exceeding this
score.
Methods
• Let a word pair be a segment pair with a fixed length w.
• Main strategy: seek only segment pairs (one from database,
one query) that contain a word pair with score at least T.
• Such hit will be extended until it exceeds the cutoff score S
and those hits will be the final output of BLAST.
• Lower T => Fewer false negatives
• Lower T => More pairs to analyze
Methods
IMPLEMENTATION
Key Steps of BLAST Algorithm
METHODS
• BLAST finds locally maximal
segment pairs that exceeds a
particular cutoff.
• Detailed annotation of Three
Algorithmic steps.
• Compile a list of high-scoring
words.
• Scanning the DB for hits.
• Extending hits that meet certain s
coring criteria (Extend only word
pairs with a score of at least T to
determine if it has a segment pair
of score at least S).
METHODS
B
L
A
S
T
COMPILING OF HIGH SCORING WORDS
• Obtain the list of words in the target sequence (k-mers), that
give a score of T or higher when aligned with the query
sequence.
• Assume that the query sequence is P Q G E F G
• We have the following four 3-mers (recall, the number
of k-mers is always N-k+1):
P Q G
Q G E
G E F
E F G
METHODS
Each of these 3-mers are then scored
against each and every one of the k-mers in
each of the target sequence. For long
sequences, this could well include all 8000
possible k-mers.
• So, of all pairwise scorings for P Q G (using the BLOSUM-
62 matrix), we can find the following high scoring ones:
• P Q G (of course, this is a perfect match) score of
7+5+6 = 18
• P E G score of 7+2+6 = 15
• P Q A score of 7+5+0 = 12
METHODS
SCANNING THE DB FOR HITS
• Scan the database for hits with the compiled list of words
obtained in previous step.
• How efficiently search a long sequence for multiple
occurrences of short sequences.
• BLAST has two approaches
• Indexing approach
• Finite state machine
METHODS
INDEXING APPROACH – EXAMPLE 1
• Build a lookup table of size |Σ|w for all w-length words in DB.
METHODS
INDEXING APPROACH – EXAMPLE 2
• Let w=3. For amino acids, the number of words is 203.
• Map a word to an integer between 1 and 203.
• Thus a word has an index into an array.
• Each index points to a list of matches of the word in the
query sequence.
• As we scan the database, each database word immediately
leads to the hits in the query sequence.
METHODS
EXTENDING HITS
• Once the hits are located both in the query and the target
sequence, extend the hits to form high scoring segment
pairs.
• Find the highest scoring segment (the maximal segment
pair) or those whose score exceeds (another user set)
threshold S.
• When manage to find a hit (a match between a “word” and a
database entry), extend the hit in either direction.
• Keep track of the score (use a scoring matrix).
METHODS
EXTENDING HITS – EXAMPLE 1
• Extend each seed on either side until the aggregate
alignment score falls below a threshold.
• Un-gapped: Extend by only either matches or mismatches.
• Gapped: Extend by matches, mismatches or a limited number of
insertion/deletion gaps.
METHODS
EXTENDING HITS – EXAMPLE 2
METHODS
RESULTS
Evaluating Statistical
Significance
RESULTS
• Finally, evaluate the statistical
significance of the alignments / scores
that exceed the threshold.
• BLAST statistical significance of MSP
scores can be evaluated by following
factors.
• Performance of BLAST with random
sequences
• Performance of BLAST with
homologous sequences
• Performance comparing long DNA
sequences
RESULTS
B
L
A
S
T
Performance of BLAST
with random sequences
RESULTS
• When two random sequences of length m
and n compared, the probability of finding
at least one HSP “by chance” is:
• Hence, the probability of finding exactly x
HSPs with a score ≥ S is given by:
RESULTS
E
eXPXPXP 
 1)0(1)1(1)1(
!
)(
x
E
exXP
x
E

B
L
A
S
T
• Where E can be defined by according to the Karlin-Altschul
equation.
• E(HSPs with score) ≥ S, also called the E-value:
RESULTS
• The probability of finding c or more
distinct segment pairs, all with a score of
at least S, is given by the formula:
• Utilizing this formula can be detected two
sequences that share distinct regions of
similarity as significantly related.
RESULTS





1
0 !
1)(1)(
c
i
i
E
i
E
ecXPcXP
B
L
A
S
T
The choice of word length
& threshold parameters
Necessary & Sufficient
Adjustments to be Done in Terms
of Word Length & Threshold Value
RESULTS
Time required to execute BLAST
• To Compile List of Words.
• To Scan the Database for Hits (MSP>T).
• To Extend All Hits to Seek Segment Pairs with
Scores Exceeding the Cutoff (HSP>S).
• All these three steps depend on W & T
Can we make the process more optimal?
• Decrease the time spent on step 3, by
increasing the W. But there are complementary
problems created by larger W.
• For Proteins – 20W possible word, Therefor
when W increases number of words generated
by query grows exponentially. (But number of
words increases linearly with the length of the
query)
• It increases time spent on step 1 & also the
amount of memory required.
Results
B
L
A
S
T
Optimal T and W values
• For protein sequences W=4, T=17.
• For DNA sequences W=11.
How W and T affect the performance of
BLAST
Results
B
L
A
S
T
T Execution Time Accuracy
Speed
W Execution Time Accuracy
Speed
Computational
Complexity ?
Performance of BLAST
with homologous
sequences
Things to be Noted When Query
Sequence BLAST Against Set of
Homologous Sequences
RESULTS
What is homologous?
• Related by common ancestor.
Researchers Example 1
• Search for wooly monkey sequence.
• When W=4 & T=17.
• Found 178 MSPs with scores (50-80).
• Random model suggest that BLAST should miss
24 of MSPs
• But actual miss 43.
• Therefor error = 44.2%
Researchers Example 2
• Search for mouse sequences.
• Same W & T as previous.
• Found 33 MSPs with scores (45-65).
• Random model suggest that BLAST should miss 8
of MSPs
• But actual miss 2.
RESULTS
B
L
A
S
T
• Failure to detect significant similarity does
only shows our inability to detect homology, it
does not prove that the sequences are not
homologous.
• The overall performance of BLAST depends on
the distribution of MSP scores.
Strengths of BLAST
• Great utility is for identify high scoring MSPs
quickly.
• Takes lower amount of time for the alignment
process.
Further improvements can be done
• Novel approaches like Position Specific Iterated
BLAST (PSI BLAST)
RESULTS
B
L
A
S
T
Comparison of two long
DNA sequences
Does Adjustments of W Make the
Process Faster?
RESULTS
Main Classes of Locally Similar Regions
• Genes.
• Long interspersed Repeats.
• Anticipated Weaker Similarities.
Example (Human Gene VS Rabbit Gene)
Step -1
• Match Score = 5, Mismatch Score = -4 & W=12.
• 93 Alignments scoring over 200, 57 Alignments
scoring over 350 with 1301 highest score.
Step -2
• W=8
• Only additional 32 alignments are found score
over 200.
• Use of Smaller W does not provides new
essential information always.
Results
B
L
A
S
T
Results
B
L
A
S
T
The Time & Sensitivity of BLAST
on DNA Sequences as a Function
of W
CONCLUSION
Underlying Concept of BLAST
• Simple & Robust.
• Can be implemented in many ways.
• Can be used in variety of contexts.
Researchers Implementation
• Used a shared memory version of BLAST.
• Why shared memory?
• Loads compressed DNA file into memory
once & allow subsequent steps to skip that
step.
• BLAST approach permits construction of extremely
fast programs for database searching, Which
provides additional advantage on mathematical
advantage
To Whom This Tool Would Help
• Molecular Biologists
• Doctors & etc…
Conclusion
B
L
A
S
T
QUESTIONS &
ANSWERS
Q&A
Q&A
• What are the disadvantages of Blast with compared to other
heuristic algorithms? – by Pubudu
• What is the criteria of selecting threshold S and T values
using in implementation part of this research paper? – by
Parinda
• What are the chances that the Maximal Segment Pair score
for two unrelated sequences would be greater than or
equal to S value? – by Parinda
• Among PAM and BLOSUM matrices what is the most
suitable matrix when scoring segments in amino acid
sequences ? Why? – by Shashika
THANKS FOR
YOUR TIME

Contenu connexe

Tendances (20)

Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 
BLAST
BLASTBLAST
BLAST
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Msa
MsaMsa
Msa
 
Fasta
FastaFasta
Fasta
 
Blast bioinformatics
Blast bioinformaticsBlast bioinformatics
Blast bioinformatics
 
BITS: Basics of Sequence similarity
BITS: Basics of Sequence similarityBITS: Basics of Sequence similarity
BITS: Basics of Sequence similarity
 
Blast
BlastBlast
Blast
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
BLAST AND FASTA.pptx
BLAST AND FASTA.pptxBLAST AND FASTA.pptx
BLAST AND FASTA.pptx
 
Fasta
FastaFasta
Fasta
 
Publicly available tools and open resources in Bioinformatics
Publicly available  tools and open resources in BioinformaticsPublicly available  tools and open resources in Bioinformatics
Publicly available tools and open resources in Bioinformatics
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
Blast 2013 1
Blast 2013 1Blast 2013 1
Blast 2013 1
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)
 
Phylogenetic data analysis
Phylogenetic data analysisPhylogenetic data analysis
Phylogenetic data analysis
 
BLAST
BLASTBLAST
BLAST
 

En vedette

BLAST(Basic Local Alignment Tool)
BLAST(Basic Local Alignment Tool)BLAST(Basic Local Alignment Tool)
BLAST(Basic Local Alignment Tool)Sobia
 
BLAST [Basic Alignment Local Search Tool]
BLAST [Basic Alignment Local Search Tool]BLAST [Basic Alignment Local Search Tool]
BLAST [Basic Alignment Local Search Tool]BiotechOnline
 
Introduction to sequence alignment
Introduction to sequence alignmentIntroduction to sequence alignment
Introduction to sequence alignmentKubuldinho
 
Sequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsSequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsNikesh Narayanan
 
Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)Pritom Chaki
 
Tips for effective use of BLAST and other NCBI tools
Tips for effective use of BLAST and other NCBI toolsTips for effective use of BLAST and other NCBI tools
Tips for effective use of BLAST and other NCBI toolsIntegrated DNA Technologies
 
Sequence Alignment,Blast, Fasta, MSA
Sequence Alignment,Blast, Fasta, MSASequence Alignment,Blast, Fasta, MSA
Sequence Alignment,Blast, Fasta, MSASucheta Tripathy
 
Enabling faster analysis of vaccine adverse event reports with ontology support
Enabling faster analysis of vaccine adverse event reports with ontology supportEnabling faster analysis of vaccine adverse event reports with ontology support
Enabling faster analysis of vaccine adverse event reports with ontology supportMelanie Courtot
 
Protein structure
Protein structureProtein structure
Protein structurePooja Pawar
 
Bioinformatics - Discovering the Bio Logic Of Nature
Bioinformatics - Discovering the Bio Logic Of NatureBioinformatics - Discovering the Bio Logic Of Nature
Bioinformatics - Discovering the Bio Logic Of NatureRobert Cormia
 
Bioinfromatics - local alignment
Bioinfromatics - local alignmentBioinfromatics - local alignment
Bioinfromatics - local alignmentVivek Chandramohan
 

En vedette (20)

blast bioinformatics
blast bioinformaticsblast bioinformatics
blast bioinformatics
 
BLAST(Basic Local Alignment Tool)
BLAST(Basic Local Alignment Tool)BLAST(Basic Local Alignment Tool)
BLAST(Basic Local Alignment Tool)
 
BLAST [Basic Alignment Local Search Tool]
BLAST [Basic Alignment Local Search Tool]BLAST [Basic Alignment Local Search Tool]
BLAST [Basic Alignment Local Search Tool]
 
Blast fasta 4
Blast fasta 4Blast fasta 4
Blast fasta 4
 
BLAST and sequence alignment
BLAST and sequence alignmentBLAST and sequence alignment
BLAST and sequence alignment
 
BLAST
BLASTBLAST
BLAST
 
Fasta
FastaFasta
Fasta
 
Fasta
FastaFasta
Fasta
 
Fasta
FastaFasta
Fasta
 
Introduction to sequence alignment
Introduction to sequence alignmentIntroduction to sequence alignment
Introduction to sequence alignment
 
Sequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsSequence Alignment In Bioinformatics
Sequence Alignment In Bioinformatics
 
Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)
 
Tips for effective use of BLAST and other NCBI tools
Tips for effective use of BLAST and other NCBI toolsTips for effective use of BLAST and other NCBI tools
Tips for effective use of BLAST and other NCBI tools
 
Sequence Alignment,Blast, Fasta, MSA
Sequence Alignment,Blast, Fasta, MSASequence Alignment,Blast, Fasta, MSA
Sequence Alignment,Blast, Fasta, MSA
 
Enabling faster analysis of vaccine adverse event reports with ontology support
Enabling faster analysis of vaccine adverse event reports with ontology supportEnabling faster analysis of vaccine adverse event reports with ontology support
Enabling faster analysis of vaccine adverse event reports with ontology support
 
BLAST
BLASTBLAST
BLAST
 
K8 bs pa islam
K8 bs pa   islamK8 bs pa   islam
K8 bs pa islam
 
Protein structure
Protein structureProtein structure
Protein structure
 
Bioinformatics - Discovering the Bio Logic Of Nature
Bioinformatics - Discovering the Bio Logic Of NatureBioinformatics - Discovering the Bio Logic Of Nature
Bioinformatics - Discovering the Bio Logic Of Nature
 
Bioinfromatics - local alignment
Bioinfromatics - local alignmentBioinfromatics - local alignment
Bioinfromatics - local alignment
 

Similaire à BLAST algorithm for sequence comparison

2016 bioinformatics i_database_searching_wimvancriekinge
2016 bioinformatics i_database_searching_wimvancriekinge2016 bioinformatics i_database_searching_wimvancriekinge
2016 bioinformatics i_database_searching_wimvancriekingeProf. Wim Van Criekinge
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadfalizain9604
 
2015 bioinformatics database_searching_wimvancriekinge
2015 bioinformatics database_searching_wimvancriekinge2015 bioinformatics database_searching_wimvancriekinge
2015 bioinformatics database_searching_wimvancriekingeProf. Wim Van Criekinge
 
Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Prof. Wim Van Criekinge
 
Bioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekingeBioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekingeProf. Wim Van Criekinge
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptxShree Shree
 
презентация за варшава
презентация за варшавапрезентация за варшава
презентация за варшаваValeriya Simeonova
 
sequence alignment
sequence alignmentsequence alignment
sequence alignmentammar kareem
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastRai University
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastRai University
 
BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234alizain9604
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitizationVenkata Reddy Konasani
 
Database similarity searching blast and fasta
Database similarity searching blast and fastaDatabase similarity searching blast and fasta
Database similarity searching blast and fastaSwathi764350
 
CDAC 2018 Merico optimal scoring
CDAC 2018 Merico optimal scoringCDAC 2018 Merico optimal scoring
CDAC 2018 Merico optimal scoringMarco Antoniotti
 
Parallel DNA Sequence Alignment
Parallel DNA Sequence AlignmentParallel DNA Sequence Alignment
Parallel DNA Sequence AlignmentGiuliana Carullo
 
Sequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfSequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfsriaisvariyasundar
 

Similaire à BLAST algorithm for sequence comparison (20)

2016 bioinformatics i_database_searching_wimvancriekinge
2016 bioinformatics i_database_searching_wimvancriekinge2016 bioinformatics i_database_searching_wimvancriekinge
2016 bioinformatics i_database_searching_wimvancriekinge
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
2015 bioinformatics database_searching_wimvancriekinge
2015 bioinformatics database_searching_wimvancriekinge2015 bioinformatics database_searching_wimvancriekinge
2015 bioinformatics database_searching_wimvancriekinge
 
Database Searching
Database SearchingDatabase Searching
Database Searching
 
Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014
 
Bioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekingeBioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekinge
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
 
презентация за варшава
презентация за варшавапрезентация за варшава
презентация за варшава
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blast
 
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blastB.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blast
 
Biological sequences analysis
Biological sequences analysisBiological sequences analysis
Biological sequences analysis
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 
BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234BLAST AND FASTA.pptx12345789999987544321234
BLAST AND FASTA.pptx12345789999987544321234
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 
Database similarity searching blast and fasta
Database similarity searching blast and fastaDatabase similarity searching blast and fasta
Database similarity searching blast and fasta
 
CDAC 2018 Merico optimal scoring
CDAC 2018 Merico optimal scoringCDAC 2018 Merico optimal scoring
CDAC 2018 Merico optimal scoring
 
_BLAST.ppt
_BLAST.ppt_BLAST.ppt
_BLAST.ppt
 
Parallel DNA Sequence Alignment
Parallel DNA Sequence AlignmentParallel DNA Sequence Alignment
Parallel DNA Sequence Alignment
 
Sequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfSequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdf
 

Dernier

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 

Dernier (20)

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 

BLAST algorithm for sequence comparison

  • 1. Basic Local Alignment Search Tool (BLAST) by Stephen F. Altschul et al J. Mol. Bio 1990 W.O.K.A.S Wijesinghe, S. D. L Gunawardena, A. A. M Athukorala
  • 4. WHY COMPARE ONE SEQUENCE TO ANOTHER? Introduction
  • 5. • Function of a newly sequenced gene or a protein can be predicted by discovering it’s homology to a known gene or a protein. • A major task of Bioinformatics is to find homologous sequence in a database of sequences • Databases of DNA and amino acid sequences continue to grow in size. • There are number of software tools based on many algorithms. Introduction
  • 7. Needleman-Wunsch Algorithm • Dynamic programming algorithm • Assign scores to insertions ,deletions and replacements. • Compute an alignment of two sequences with least mutations. • Measurement of Similarity • Because of computational requirements impractical for searching a large database without a supercomputer Introduction
  • 8. Smith Waterman • Rapid Heuristic Algorithm • Allows large databases to be searched on common computers FastP Algorithm • David J. Lipman and William R. Pearson in 1985 • First find locally similar regions between 2 sequences based on identities but not gaps. • Then rescores those regions using a similarity matrix. • Quite popular Introduction
  • 9. NEED FOR A FAST ALIGNMENT METHOD Introduction
  • 10. BLAST(Basic Local Alignment Search Tool) Algorithm • A new approach to rapid sequence comparison. • It directly approximate alignments that optimize a measure of local similarity called the Maximal Segment Pair(MSP) score. • Recent mathematical results on MSP scores allows performance analysis of this method • Generated alignment has a statistical significance. • Simple & a robust algorithm Introduction
  • 11. Can be applied to various contexts • DNA sequences • Protein sequences • Motif searches • Gene identification searches • Analysis of multiple similarity regions in long DNA sequences Characteristic features • Flexibility & Tractability. • Very much faster than existing sequence comparison tools of comparable sensitivity. Introduction This research paper describes all the methods & implementations of BLAST algorithm.
  • 13. THE MSP MEASURE Discussion on Maximal Segment Pairs METHODS
  • 14. TWO TYPES OF SIMILARITY MEASURES Methods Global • Optimize the overall alignment of two sequences. • Includes large stretches of low similarity Local • Seek only relatively conserved subsequences • Single comparison may yield several distinct subsequence alignments
  • 15. • Local similarity measure are preferred for database searches where cDNAs can be compared with partially sequence genes. • Many similarity measures including the one they describe begins with a matrix of similarity score for all possible pairs of residues B L A S T Methods
  • 16. Scoring Alignments • Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein) • Identities & conservative replacements >>> Positive Scores • Unlikely Replacements >> Negative scores • Amino acid sequences : “PAM” matrix BLOSUM • DNA sequences : match = +5 mismatch = -4 • Sequence Segment – Contiguous stretch of residues of any length • Similarity score for segments is the sum of similarity values for each pair of aligned residues Methods
  • 17. Maximal Segment Pair (MSP) • Given these rules they have defined a Maximal Segment Pair (MSP) which is the highest scoring pair of identical length segments chosen from 2 sequences. • The similarity score of an MSP is called the MSP score which is calculated by BLAST. • With long sequences the search for the MSP score becomes computationally demanding. • Therefore BLAST searches for locally maximal segment pairs – Score cannot be improved by either extending or shortening segments. Methods
  • 18. RAPID APPROXIMATION OF MSP SCORES METHODS
  • 19. • Goal is to report those database sequences that have MSP score above some cutoff score S. • Statistically the highest MSP score S can be estimated at which “chance similarities” are likely to appear. • BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this score. Methods
  • 20. • Let a word pair be a segment pair with a fixed length w. • Main strategy: seek only segment pairs (one from database, one query) that contain a word pair with score at least T. • Such hit will be extended until it exceeds the cutoff score S and those hits will be the final output of BLAST. • Lower T => Fewer false negatives • Lower T => More pairs to analyze Methods
  • 21. IMPLEMENTATION Key Steps of BLAST Algorithm METHODS
  • 22. • BLAST finds locally maximal segment pairs that exceeds a particular cutoff. • Detailed annotation of Three Algorithmic steps. • Compile a list of high-scoring words. • Scanning the DB for hits. • Extending hits that meet certain s coring criteria (Extend only word pairs with a score of at least T to determine if it has a segment pair of score at least S). METHODS B L A S T
  • 23. COMPILING OF HIGH SCORING WORDS • Obtain the list of words in the target sequence (k-mers), that give a score of T or higher when aligned with the query sequence. • Assume that the query sequence is P Q G E F G • We have the following four 3-mers (recall, the number of k-mers is always N-k+1): P Q G Q G E G E F E F G METHODS Each of these 3-mers are then scored against each and every one of the k-mers in each of the target sequence. For long sequences, this could well include all 8000 possible k-mers.
  • 24. • So, of all pairwise scorings for P Q G (using the BLOSUM- 62 matrix), we can find the following high scoring ones: • P Q G (of course, this is a perfect match) score of 7+5+6 = 18 • P E G score of 7+2+6 = 15 • P Q A score of 7+5+0 = 12 METHODS
  • 25. SCANNING THE DB FOR HITS • Scan the database for hits with the compiled list of words obtained in previous step. • How efficiently search a long sequence for multiple occurrences of short sequences. • BLAST has two approaches • Indexing approach • Finite state machine METHODS
  • 26. INDEXING APPROACH – EXAMPLE 1 • Build a lookup table of size |Σ|w for all w-length words in DB. METHODS
  • 27. INDEXING APPROACH – EXAMPLE 2 • Let w=3. For amino acids, the number of words is 203. • Map a word to an integer between 1 and 203. • Thus a word has an index into an array. • Each index points to a list of matches of the word in the query sequence. • As we scan the database, each database word immediately leads to the hits in the query sequence. METHODS
  • 28. EXTENDING HITS • Once the hits are located both in the query and the target sequence, extend the hits to form high scoring segment pairs. • Find the highest scoring segment (the maximal segment pair) or those whose score exceeds (another user set) threshold S. • When manage to find a hit (a match between a “word” and a database entry), extend the hit in either direction. • Keep track of the score (use a scoring matrix). METHODS
  • 29. EXTENDING HITS – EXAMPLE 1 • Extend each seed on either side until the aggregate alignment score falls below a threshold. • Un-gapped: Extend by only either matches or mismatches. • Gapped: Extend by matches, mismatches or a limited number of insertion/deletion gaps. METHODS
  • 30. EXTENDING HITS – EXAMPLE 2 METHODS
  • 33. • Finally, evaluate the statistical significance of the alignments / scores that exceed the threshold. • BLAST statistical significance of MSP scores can be evaluated by following factors. • Performance of BLAST with random sequences • Performance of BLAST with homologous sequences • Performance comparing long DNA sequences RESULTS B L A S T
  • 34. Performance of BLAST with random sequences RESULTS
  • 35. • When two random sequences of length m and n compared, the probability of finding at least one HSP “by chance” is: • Hence, the probability of finding exactly x HSPs with a score ≥ S is given by: RESULTS E eXPXPXP   1)0(1)1(1)1( ! )( x E exXP x E  B L A S T
  • 36. • Where E can be defined by according to the Karlin-Altschul equation. • E(HSPs with score) ≥ S, also called the E-value: RESULTS
  • 37. • The probability of finding c or more distinct segment pairs, all with a score of at least S, is given by the formula: • Utilizing this formula can be detected two sequences that share distinct regions of similarity as significantly related. RESULTS      1 0 ! 1)(1)( c i i E i E ecXPcXP B L A S T
  • 38. The choice of word length & threshold parameters Necessary & Sufficient Adjustments to be Done in Terms of Word Length & Threshold Value RESULTS
  • 39. Time required to execute BLAST • To Compile List of Words. • To Scan the Database for Hits (MSP>T). • To Extend All Hits to Seek Segment Pairs with Scores Exceeding the Cutoff (HSP>S). • All these three steps depend on W & T Can we make the process more optimal? • Decrease the time spent on step 3, by increasing the W. But there are complementary problems created by larger W. • For Proteins – 20W possible word, Therefor when W increases number of words generated by query grows exponentially. (But number of words increases linearly with the length of the query) • It increases time spent on step 1 & also the amount of memory required. Results B L A S T
  • 40. Optimal T and W values • For protein sequences W=4, T=17. • For DNA sequences W=11. How W and T affect the performance of BLAST Results B L A S T T Execution Time Accuracy Speed W Execution Time Accuracy Speed Computational Complexity ?
  • 41. Performance of BLAST with homologous sequences Things to be Noted When Query Sequence BLAST Against Set of Homologous Sequences RESULTS
  • 42. What is homologous? • Related by common ancestor. Researchers Example 1 • Search for wooly monkey sequence. • When W=4 & T=17. • Found 178 MSPs with scores (50-80). • Random model suggest that BLAST should miss 24 of MSPs • But actual miss 43. • Therefor error = 44.2% Researchers Example 2 • Search for mouse sequences. • Same W & T as previous. • Found 33 MSPs with scores (45-65). • Random model suggest that BLAST should miss 8 of MSPs • But actual miss 2. RESULTS B L A S T
  • 43. • Failure to detect significant similarity does only shows our inability to detect homology, it does not prove that the sequences are not homologous. • The overall performance of BLAST depends on the distribution of MSP scores. Strengths of BLAST • Great utility is for identify high scoring MSPs quickly. • Takes lower amount of time for the alignment process. Further improvements can be done • Novel approaches like Position Specific Iterated BLAST (PSI BLAST) RESULTS B L A S T
  • 44. Comparison of two long DNA sequences Does Adjustments of W Make the Process Faster? RESULTS
  • 45. Main Classes of Locally Similar Regions • Genes. • Long interspersed Repeats. • Anticipated Weaker Similarities. Example (Human Gene VS Rabbit Gene) Step -1 • Match Score = 5, Mismatch Score = -4 & W=12. • 93 Alignments scoring over 200, 57 Alignments scoring over 350 with 1301 highest score. Step -2 • W=8 • Only additional 32 alignments are found score over 200. • Use of Smaller W does not provides new essential information always. Results B L A S T
  • 46. Results B L A S T The Time & Sensitivity of BLAST on DNA Sequences as a Function of W
  • 48. Underlying Concept of BLAST • Simple & Robust. • Can be implemented in many ways. • Can be used in variety of contexts. Researchers Implementation • Used a shared memory version of BLAST. • Why shared memory? • Loads compressed DNA file into memory once & allow subsequent steps to skip that step. • BLAST approach permits construction of extremely fast programs for database searching, Which provides additional advantage on mathematical advantage To Whom This Tool Would Help • Molecular Biologists • Doctors & etc… Conclusion B L A S T
  • 50. Q&A Q&A • What are the disadvantages of Blast with compared to other heuristic algorithms? – by Pubudu • What is the criteria of selecting threshold S and T values using in implementation part of this research paper? – by Parinda • What are the chances that the Maximal Segment Pair score for two unrelated sequences would be greater than or equal to S value? – by Parinda • Among PAM and BLOSUM matrices what is the most suitable matrix when scoring segments in amino acid sequences ? Why? – by Shashika

Notes de l'éditeur

  1. Lower amount of time for search – Therefor better when there is massive databases to be queried
  2. Lower E values better – That means very low probability alignment occurs due to luck.