SlideShare une entreprise Scribd logo
1  sur  29
Know before analyze
blast result
BLAST
a sequence comparison algorithm used to search databases for optimal local alignments
to a query.
search is done for a word of length "W" that scores at least "T" when compared to the
query using a substitution matrix.
generate an alignment with a score exceeding the threshold of "S". The "T" parameter
dictates the speed and sensitivity of the search.
query
The input sequence (or
other type of search
term) to which all of the
entries in a database are
to be compared.
Subject
The output sequences
that BLAST found
after searching the
datbase
Algorithm
A fixed procedure embodied in a computer program.
Alignment
The process or result of matching up the nucleotide or amino acid residues of two or more
biological sequences to achieve maximal levels of identity and, in the case of amino acid
sequences the degree of similarity and the possibility of homology.
Bit score
The bit score, S', is derived from the raw alignment score, S, taking the statistical properties
of the scoring system into account. Because bit scores are normalized with respect to the
scoring system, they can be used to compare alignment scores from different searches.
E value
The Expectation value or Expect value represents the number of different alignments with
scores equivalent to or better than S that is expected to occur in a database search by chance.
The lower the E value, the more significant the score and the alignment.
gap
A space introduced into an alignment to compensate for insertions and deletions in one
sequence relative to another.
H
H is the relative entropy of the target and background residue frequencies. (Karlin and
Altschul, 1990).
A measure of the average information (in bits) available per position that distinguishes an
alignment from chance.
At high values of H short alignments can be distinguished by chance, whereas at lower H
values a longer alignment may be necessary (Altschul, 1991).
similarity
The extent to which nucleotide or protein sequences are related.
Similarity between two sequences can be expressed as percent sequence identity and/or
percent positive substitutions.
identity
The extent to which two (nucleotide or amino acid) sequences have the same residues at
the same positions in an alignment, often expressed as a percentage.
K
a natural scale for search space size
used in converting a raw score (S) to a bit score (S').
lambda
a natural scale for scoring system.
used in converting a raw score (S) to a bit score (S').
p value
The probability of a chance alignment occurring with a particular score or a better score
in a database search.
The most highly significant P values will be those close to 0.
P values and E values are different ways of representing the significance of the
What is the Expect (E) value?
The Expect value (E) is a parameter that describes the number of hits one can "expect" to
see by chance when searching a database of a particular size.
It decreases exponentially as the Score (S) of the match increases.
Essentially, the E value describes the random background noise.
For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a
database of the current size one might expect to see 1 match with a similar score simply
by chance.
The lower the E-value, or the closer it is to zero, the more "significant" the match is.
You can change the Expect value threshold on most BLAST search pages.
When the Expect value is increased from the default value of 10, a larger list with more
low-scoring hits can be reported.
HSP
A High-scoring Segment Pair (HSP) is a local alignment with no gaps that
achieves one of the highest alignment scores in a given search.
low complexity region
A region of biased composition in nucleic acid and protein sequences.
These include homopolymeric runs, short-period repeats, and subtler over
representation of one or a few residues.
The SEG program is used to mask or filter low complexity regions in amino acid
queries.
The DUST program is used to mask or filter such regions in nucleic acid queries.
masking
Also known as filtering.
The removal of repeated or low complexity regions from a sequence in order to
improve the sensitivity of sequence similarity searches performed with that
sequence
What is "low-complexity" sequence?
Regions with low-complexity sequence have an unusual composition that can create
problems in sequence similarity searching.
For amino acid queries this compositional bias is determined by the SEG program (Wootton
and Federhen, 1996).
For nucleotide queries it is determined by the DustMasker program (Morgulis, et al.,2006).
Low-complexity sequence can often be recognized by visual inspection. For example, the
protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the
nucleotide sequence AAATAAAAAAAATAAAAAAT.
Filters are used to remove low-complexity sequence because it can cause artifactual hits.
In BLAST searches performed without a filter, high scoring hits may be reported only
because of the presence of a low-complexity region.
Most often, it is inappropriate to consider this type of match as the result of shared
homology.
Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences
that are not truly related.
local alignment
The alignment of a high-scoring region of two nucleic acid or protein sequences.
optimal alignment
An alignment of two sequences with the highest possible score.
raw score
The score of an alignment,
S, calculated as the sum of
substitution and gap
scores.
Substitution scores are given by a look-up table ( PAM, BLOSUM).
Gap scores are typically calculated as the sum of G, the gap opening penalty and L, the gap
extension penalty.
For a gap of length n, the gap cost would be G+Ln.
AACGTTTCCAGTCCAAATAGCTAGGC
***--*** *-***-**-******
AACCGTTC TACAATTACCTAGGC* = Positive match
- = Mismatch
Gap= gap between the sequences
Score for each positive hit = 1, total no of positive hits=18
Score for each mismatch = 2, total number of mismatches= 5
Penalty score for each gap= 2 (you can set the penalty score of gap), total
numbers of gaps =3
score = Number of positive hits × score of each positive hit-
{(Number of mismatches × Score of each mismatches) + (Number of
gaps × score for each gap penalty)}
substitution
The presence of a non-identical amino acid at a given position in an alignment.
If the aligned residues have similar physico-chemical properties or have a positive score in
the governing scoring matrix the substitution is said to be conservative.
substitution scoring matrix
A scoring matrix containing values proportional to the probability that amino acid i
mutates into amino acid j for all pairs of amino acids.
Such matrices are constructed by assembling a large and diverse sample of verified
pairwise alignments of protein sequences.
If the sample is large enough, the resulting matrices should reflect the true probabilities of
mutations occurring through a period of evolution.
The BLOSUM matrices are examples of substitution scoring matrices.
PAM
Percent Accepted Mutation (PAM) is unit introduced by Margaret Dayhoff and colleagues to
quantify the amount of evolutionary change in a protein sequence.
1.0 PAM unit is the amount of evolution that will change, on average, 1% of amino acids in a
protein sequence.
A PAM(x) substitution matrix is a look-up table in which scores for each amino acid
substitution have been calculated based on the frequency of that substitution in closely related
proteins that have experienced a certain amount (x) of evolutionary divergence.
unitary matrix
Also known as identity matrix. This is a scoring system in which only identical characters
receive a positive score
BLOSUM
A Blocks Substitution Matrix is a substitution scoring matrix in which scores for each position
are derived from observations of the frequencies of substitutions in blocks of local alignments
in related proteins.
Each matrix is tailored to a particular
evolutionary distance.
In the BLOSUM62 matrix, for example, the
alignment from which scores were derived was
created using sequences sharing no more than
62% identity.
Sequences more identical than 62% are
represented by a single sequence in the alignment
so as to avoid over-weighting closely related
family members. (Henikoff and Henikoff, 1992)
profile
A table that lists the frequencies of each amino acid in each position of protein sequence
alignment. Frequencies are calculated from multiple alignments of sequences containing a
domain of interest.
PSSM
A Position-Specific Scoring Matrix (PSSM) is a profile that gives the log-odds score for
finding a particular matching amino acid in a target sequence.
How blast work
Using a heuristic method, BLAST finds similar sequences, not by comparing either
sequence in its entirety, but rather by locating short matches between the two sequences
This process of finding initial words is called seeding.
While attempting to find similarity in sequences, sets of common letters, known as
words, are very important.
For example, suppose that the sequence contains the following stretch of letters, GLKFA.
If a BLASTp was being conducted under default conditions, the word size would be 3
letters.
In this case, using the given stretch of letters, the searched words would be GLK, LKF,
KFA.
An overview of the BLASTP algorithm
(a protein to protein search)
Remove low-complexity region or sequence repeats in the query sequence.
"Low-complexity region" means a region of a sequence composed of few kinds of elements.
These regions might give high scores that confuse the program to find the actual significant
sequences in the database, so they should be filtered out.
The regions will be marked with an X (protein sequences) or N (nucleic acid sequences) and
then be ignored by the BLAST program.
To filter out the low-complexity regions, the SEG program is used for protein sequences and
the program DUST is used for DNA sequences.
On the other hand, the program XNU is used to mask off the tandem repeats in protein
sequences.
Make a k-letter word list of the query sequence.
Take k=3 for example, we list the words of length 3 in the query protein sequence (k is
usually 11 for a DNA sequence) "sequentially", until the last letter of the query sequence is
included. The method is illustrated in figure 1.
List the possible matching words.
BLAST only cares about the high-scoring words.
The scores are created by comparing the word illustrated previously with all the 3-letter
words.
By using the scoring matrix (substitution matrix) to score the comparison of each residue
pair, there are 20^3 possible match scores for a 3-letter word.
For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12,
respectively.
For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3.
After that, a neighborhood word score threshold T is used to reduce the number of possible
matching words.
The words whose scores are greater than the threshold T will remain in the possible
matching words list, while those with lower scores will be discarded.
For example, PEG is kept, but PQA is abandoned when T is 13.
Organize the remaining high-scoring words into an efficient search tree.
This allows the program to rapidly compare the high-scoring words to the database
sequences.
Repeat steps for each k-letter word in the query sequence.
Scan the database sequences for exact matches with the remaining high-scoring words.
The BLAST program scans the database sequences for the remaining high-scoring word,
such as PEG, of each position.
If an exact match is found, this match is used to seed a possible un-gapped alignment
between the query and database sequences.
Extend the exact matches to high-scoring segment pair (HSP).
The original version of BLAST stretches a longer alignment between the query and the
database sequence in the left and right directions, from the position where the exact match
occurred.
The extension does not stop until the accumulated total score of the HSP begins to decrease.
A simplified example is presented in figure 2.
List all of the HSPs in the database whose score is high enough to be considered.
We list the HSPs whose scores are greater than the empirically determined cutoff score S.
By examining the distribution of the alignment scores modeled by comparing random
sequences, a cutoff score S can be determined such that its value is large enough to
guarantee the significance of the remaining HSPs.
Evaluate the significance of the HSP score.
BLAST next assesses the statistical significance of each HSP score by exploiting the
Gumbel extreme value distribution (EVD).
In accordance with the Gumbel EVD, the probability p of observing a score S equal to or
greater than x is given by the equation
where
The statistical parameters λ and κ are estimated by fitting the distribution of
the un-gapped local alignment scores, of the query sequence and a lot of
shuffled versions (Global or local shuffling) of a database sequence, to the
Gumbel extreme value distribution.
Note that λ and κ depend upon the substitution matrix, gap penalties, and
sequence composition (the letter frequencies). and are the effective lengths of
the query and database sequences, respectively.
Make two or more HSP regions into a longer alignment.
Sometimes, we find two or more HSP regions in one database sequence that can be made
into a longer alignment.
This provides additional evidence of the relation between the query and database sequence.
There are two methods, the Poisson method and the sum-of-scores method, to compare the
significance of the newly combined HSP regions.
Suppose that there are two combined HSP regions with the pairs of scores (65, 40) and (52,
45), respectively.
The Poisson method gives more significance to the set with the maximal lower score
(45>40).
However, the sum-of-scores method prefers the first set, because 65+40 (105) is greater than
52+45(97). The original BLAST uses the Poisson method; gapped BLAST and the WU-
BLAST use the sum-of scores method.
Show the gapped Smith-Waterman local alignments of the query and each of the
matched database sequences.
The original BLAST only generates un-gapped alignments including the initially found
HSPs individually, even when there is more than one HSP found in one database sequence.
Report every match whose expect score is lower than a threshold parameter E.

Contenu connexe

Tendances

Sequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsSequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsNikesh Narayanan
 
Presentation for blast algorithm bio-informatice
Presentation for blast algorithm bio-informaticePresentation for blast algorithm bio-informatice
Presentation for blast algorithm bio-informaticezahid6
 
Data Base in Bioinformatics.ppt
Data Base in Bioinformatics.pptData Base in Bioinformatics.ppt
Data Base in Bioinformatics.pptBangaluru
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignmentRamya S
 
Introduction to NCBI
Introduction to NCBIIntroduction to NCBI
Introduction to NCBIgeetikaJethra
 
LECTURE NOTES ON BIOINFORMATICS
LECTURE NOTES ON BIOINFORMATICSLECTURE NOTES ON BIOINFORMATICS
LECTURE NOTES ON BIOINFORMATICSMSCW Mysore
 
Orthologs,Paralogs & Xenologs
 Orthologs,Paralogs & Xenologs  Orthologs,Paralogs & Xenologs
Orthologs,Paralogs & Xenologs OsamaZafar16
 
Lectut btn-202-ppt-l23. labeling techniques for nucleic acids
Lectut btn-202-ppt-l23. labeling techniques for nucleic acidsLectut btn-202-ppt-l23. labeling techniques for nucleic acids
Lectut btn-202-ppt-l23. labeling techniques for nucleic acidsRishabh Jain
 
Gene prediction methods vijay
Gene prediction methods  vijayGene prediction methods  vijay
Gene prediction methods vijayVijay Hemmadi
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databasesPranavathiyani G
 

Tendances (20)

BLAST
BLASTBLAST
BLAST
 
Sequence Alignment In Bioinformatics
Sequence Alignment In BioinformaticsSequence Alignment In Bioinformatics
Sequence Alignment In Bioinformatics
 
Presentation for blast algorithm bio-informatice
Presentation for blast algorithm bio-informaticePresentation for blast algorithm bio-informatice
Presentation for blast algorithm bio-informatice
 
Data Base in Bioinformatics.ppt
Data Base in Bioinformatics.pptData Base in Bioinformatics.ppt
Data Base in Bioinformatics.ppt
 
Protein databases
Protein databasesProtein databases
Protein databases
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
PIR- Protein Information Resource
 
Protein database
Protein databaseProtein database
Protein database
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Bioinformatics introduction
Bioinformatics introductionBioinformatics introduction
Bioinformatics introduction
 
Introduction to NCBI
Introduction to NCBIIntroduction to NCBI
Introduction to NCBI
 
LECTURE NOTES ON BIOINFORMATICS
LECTURE NOTES ON BIOINFORMATICSLECTURE NOTES ON BIOINFORMATICS
LECTURE NOTES ON BIOINFORMATICS
 
Orthologs,Paralogs & Xenologs
 Orthologs,Paralogs & Xenologs  Orthologs,Paralogs & Xenologs
Orthologs,Paralogs & Xenologs
 
Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment   Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment
 
FASTA
FASTAFASTA
FASTA
 
Lectut btn-202-ppt-l23. labeling techniques for nucleic acids
Lectut btn-202-ppt-l23. labeling techniques for nucleic acidsLectut btn-202-ppt-l23. labeling techniques for nucleic acids
Lectut btn-202-ppt-l23. labeling techniques for nucleic acids
 
Fasta
FastaFasta
Fasta
 
Gene prediction methods vijay
Gene prediction methods  vijayGene prediction methods  vijay
Gene prediction methods vijay
 
Sequence file formats
Sequence file formatsSequence file formats
Sequence file formats
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databases
 

En vedette

Enfermedad del parkinson
Enfermedad del parkinsonEnfermedad del parkinson
Enfermedad del parkinsonvivita1070
 
Presentation1 110814223135-phpapp01
Presentation1 110814223135-phpapp01Presentation1 110814223135-phpapp01
Presentation1 110814223135-phpapp01Neha Pathania
 
2009 2010 nigerian library experience office power point presentation
2009    2010 nigerian library experience office power point presentation2009    2010 nigerian library experience office power point presentation
2009 2010 nigerian library experience office power point presentationChristchurch Girls' high School
 
Logo moselectro concepts, 1-t version
Logo moselectro   concepts, 1-t versionLogo moselectro   concepts, 1-t version
Logo moselectro concepts, 1-t versionsandro2001
 
Dublin - Troop 356 JWH
Dublin - Troop 356 JWHDublin - Troop 356 JWH
Dublin - Troop 356 JWHMommasun
 
Back to School
Back to SchoolBack to School
Back to Schoolmrdavispe
 
Public Management 2012
Public Management 2012Public Management 2012
Public Management 2012Der Konijnen
 
Artisteer2 user manual
Artisteer2 user manualArtisteer2 user manual
Artisteer2 user manualagungcubung
 
jQuery UI Tabs で効率よくタブ機能を実現しよう! 14.05.23 HTML5 jQueryビギナーズ
jQuery UI Tabs で効率よくタブ機能を実現しよう! 14.05.23 HTML5 jQueryビギナーズjQuery UI Tabs で効率よくタブ機能を実現しよう! 14.05.23 HTML5 jQueryビギナーズ
jQuery UI Tabs で効率よくタブ機能を実現しよう! 14.05.23 HTML5 jQueryビギナーズYoshinori Kobayashi
 
Alfonso pinilla ej tema 4
Alfonso pinilla ej tema 4Alfonso pinilla ej tema 4
Alfonso pinilla ej tema 4Slavah
 
Every8d出版品應用
Every8d出版品應用Every8d出版品應用
Every8d出版品應用EVERY8D 許
 
De Dieu Butler App for dummies
De Dieu Butler App for dummiesDe Dieu Butler App for dummies
De Dieu Butler App for dummiesTom Hes
 
One’s own self pr
One’s own self prOne’s own self pr
One’s own self prJinwoo Jeong
 
สอนแต่งภาพ
สอนแต่งภาพสอนแต่งภาพ
สอนแต่งภาพMild Mutita
 

En vedette (20)

Blast
BlastBlast
Blast
 
BLAST
BLASTBLAST
BLAST
 
Enfermedad del parkinson
Enfermedad del parkinsonEnfermedad del parkinson
Enfermedad del parkinson
 
Presentation1 110814223135-phpapp01
Presentation1 110814223135-phpapp01Presentation1 110814223135-phpapp01
Presentation1 110814223135-phpapp01
 
Segundo periodo 7 1
Segundo periodo 7 1Segundo periodo 7 1
Segundo periodo 7 1
 
2009 2010 nigerian library experience office power point presentation
2009    2010 nigerian library experience office power point presentation2009    2010 nigerian library experience office power point presentation
2009 2010 nigerian library experience office power point presentation
 
Logo moselectro concepts, 1-t version
Logo moselectro   concepts, 1-t versionLogo moselectro   concepts, 1-t version
Logo moselectro concepts, 1-t version
 
Higher marketing
Higher marketingHigher marketing
Higher marketing
 
Dublin - Troop 356 JWH
Dublin - Troop 356 JWHDublin - Troop 356 JWH
Dublin - Troop 356 JWH
 
Back to School
Back to SchoolBack to School
Back to School
 
Public Management 2012
Public Management 2012Public Management 2012
Public Management 2012
 
Moixonets
MoixonetsMoixonets
Moixonets
 
Artisteer2 user manual
Artisteer2 user manualArtisteer2 user manual
Artisteer2 user manual
 
jQuery UI Tabs で効率よくタブ機能を実現しよう! 14.05.23 HTML5 jQueryビギナーズ
jQuery UI Tabs で効率よくタブ機能を実現しよう! 14.05.23 HTML5 jQueryビギナーズjQuery UI Tabs で効率よくタブ機能を実現しよう! 14.05.23 HTML5 jQueryビギナーズ
jQuery UI Tabs で効率よくタブ機能を実現しよう! 14.05.23 HTML5 jQueryビギナーズ
 
Alfonso pinilla ej tema 4
Alfonso pinilla ej tema 4Alfonso pinilla ej tema 4
Alfonso pinilla ej tema 4
 
Every8d出版品應用
Every8d出版品應用Every8d出版品應用
Every8d出版品應用
 
De Dieu Butler App for dummies
De Dieu Butler App for dummiesDe Dieu Butler App for dummies
De Dieu Butler App for dummies
 
Komunikasyon
KomunikasyonKomunikasyon
Komunikasyon
 
One’s own self pr
One’s own self prOne’s own self pr
One’s own self pr
 
สอนแต่งภาพ
สอนแต่งภาพสอนแต่งภาพ
สอนแต่งภาพ
 

Similaire à How the blast work

4. sequence alignment.pptx
4. sequence alignment.pptx4. sequence alignment.pptx
4. sequence alignment.pptxArupKhakhlari1
 
Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)AnkitTiwari354
 
Blast and fasta
Blast and fastaBlast and fasta
Blast and fastaALLIENU
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentRai University
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentRai University
 
Lecture 5.pptx
Lecture 5.pptxLecture 5.pptx
Lecture 5.pptxericndunek
 
BLAST AND FASTA.pptx
BLAST AND FASTA.pptxBLAST AND FASTA.pptx
BLAST AND FASTA.pptxPiyushBehgal1
 
BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)Ariful Islam Sagar
 
Introduction to sequence alignment partii
Introduction to sequence alignment partiiIntroduction to sequence alignment partii
Introduction to sequence alignment partiiSumatiHajela
 
multiple sequence alignment
multiple sequence alignmentmultiple sequence alignment
multiple sequence alignmentharshita agarwal
 
Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)Safa Khalid
 

Similaire à How the blast work (20)

Sequence database
Sequence databaseSequence database
Sequence database
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
BLAST
BLASTBLAST
BLAST
 
Sequence alignment belgaum
Sequence alignment belgaumSequence alignment belgaum
Sequence alignment belgaum
 
4. sequence alignment.pptx
4. sequence alignment.pptx4. sequence alignment.pptx
4. sequence alignment.pptx
 
Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)
 
Seq alignment
Seq alignment Seq alignment
Seq alignment
 
Blast and fasta
Blast and fastaBlast and fasta
Blast and fasta
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignment
 
B.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignmentB.sc biochem i bobi u 3.1 sequence alignment
B.sc biochem i bobi u 3.1 sequence alignment
 
Blast 2013 1
Blast 2013 1Blast 2013 1
Blast 2013 1
 
Blast
BlastBlast
Blast
 
Blast
BlastBlast
Blast
 
Lecture 5.pptx
Lecture 5.pptxLecture 5.pptx
Lecture 5.pptx
 
BLAST AND FASTA.pptx
BLAST AND FASTA.pptxBLAST AND FASTA.pptx
BLAST AND FASTA.pptx
 
BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)BLAST (Basic local alignment search Tool)
BLAST (Basic local alignment search Tool)
 
Introduction to sequence alignment partii
Introduction to sequence alignment partiiIntroduction to sequence alignment partii
Introduction to sequence alignment partii
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
multiple sequence alignment
multiple sequence alignmentmultiple sequence alignment
multiple sequence alignment
 
Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)
 

Plus de Atai Rabby

Identification of the positively selected genes governing host-pathogen arm r...
Identification of the positively selected genes governing host-pathogen arm r...Identification of the positively selected genes governing host-pathogen arm r...
Identification of the positively selected genes governing host-pathogen arm r...Atai Rabby
 
D4476, a cell-permeant inhibitor of CK1, potentiates the action of Bromodeoxy...
D4476, a cell-permeant inhibitor of CK1, potentiates the action of Bromodeoxy...D4476, a cell-permeant inhibitor of CK1, potentiates the action of Bromodeoxy...
D4476, a cell-permeant inhibitor of CK1, potentiates the action of Bromodeoxy...Atai Rabby
 
Identifying Antibiotics posing potential Health Risk: Microbial Resistance Sc...
Identifying Antibiotics posing potential Health Risk: Microbial Resistance Sc...Identifying Antibiotics posing potential Health Risk: Microbial Resistance Sc...
Identifying Antibiotics posing potential Health Risk: Microbial Resistance Sc...Atai Rabby
 
Quantative Structure-Activity Relationships (QSAR)
Quantative Structure-Activity Relationships (QSAR)Quantative Structure-Activity Relationships (QSAR)
Quantative Structure-Activity Relationships (QSAR)Atai Rabby
 
Antiviral drug
Antiviral drugAntiviral drug
Antiviral drugAtai Rabby
 
Actin, Myosin, and Cell Movement
Actin, Myosin, and Cell MovementActin, Myosin, and Cell Movement
Actin, Myosin, and Cell MovementAtai Rabby
 
Thyroid hormone
Thyroid hormoneThyroid hormone
Thyroid hormoneAtai Rabby
 
Parathyroid hormone
Parathyroid hormoneParathyroid hormone
Parathyroid hormoneAtai Rabby
 
Gonadal hormone
Gonadal hormoneGonadal hormone
Gonadal hormoneAtai Rabby
 
Adrenal hormone
Adrenal hormoneAdrenal hormone
Adrenal hormoneAtai Rabby
 
Bioinformatics practical note
Bioinformatics practical noteBioinformatics practical note
Bioinformatics practical noteAtai Rabby
 
Rice dna extraction miniprep protocol
Rice dna extraction miniprep protocolRice dna extraction miniprep protocol
Rice dna extraction miniprep protocolAtai Rabby
 
Restriction mapping of bacterial dna
Restriction mapping of bacterial dnaRestriction mapping of bacterial dna
Restriction mapping of bacterial dnaAtai Rabby
 
Mesurement of cretinine kinase from blood of a cardiac patient
Mesurement of cretinine kinase from blood of a cardiac patientMesurement of cretinine kinase from blood of a cardiac patient
Mesurement of cretinine kinase from blood of a cardiac patientAtai Rabby
 
Fasta file extensions & meaning
Fasta file extensions & meaningFasta file extensions & meaning
Fasta file extensions & meaningAtai Rabby
 
The biochemistry of memory
The biochemistry of memoryThe biochemistry of memory
The biochemistry of memoryAtai Rabby
 
Transmission of nerve impulses
Transmission of nerve impulsesTransmission of nerve impulses
Transmission of nerve impulsesAtai Rabby
 

Plus de Atai Rabby (20)

Immunoassay
ImmunoassayImmunoassay
Immunoassay
 
Identification of the positively selected genes governing host-pathogen arm r...
Identification of the positively selected genes governing host-pathogen arm r...Identification of the positively selected genes governing host-pathogen arm r...
Identification of the positively selected genes governing host-pathogen arm r...
 
D4476, a cell-permeant inhibitor of CK1, potentiates the action of Bromodeoxy...
D4476, a cell-permeant inhibitor of CK1, potentiates the action of Bromodeoxy...D4476, a cell-permeant inhibitor of CK1, potentiates the action of Bromodeoxy...
D4476, a cell-permeant inhibitor of CK1, potentiates the action of Bromodeoxy...
 
Identifying Antibiotics posing potential Health Risk: Microbial Resistance Sc...
Identifying Antibiotics posing potential Health Risk: Microbial Resistance Sc...Identifying Antibiotics posing potential Health Risk: Microbial Resistance Sc...
Identifying Antibiotics posing potential Health Risk: Microbial Resistance Sc...
 
Quantative Structure-Activity Relationships (QSAR)
Quantative Structure-Activity Relationships (QSAR)Quantative Structure-Activity Relationships (QSAR)
Quantative Structure-Activity Relationships (QSAR)
 
Antiviral drug
Antiviral drugAntiviral drug
Antiviral drug
 
Real-Time PCR
Real-Time PCRReal-Time PCR
Real-Time PCR
 
Actin, Myosin, and Cell Movement
Actin, Myosin, and Cell MovementActin, Myosin, and Cell Movement
Actin, Myosin, and Cell Movement
 
Thyroid hormone
Thyroid hormoneThyroid hormone
Thyroid hormone
 
Parathyroid hormone
Parathyroid hormoneParathyroid hormone
Parathyroid hormone
 
Gonadal hormone
Gonadal hormoneGonadal hormone
Gonadal hormone
 
Gi hormone
Gi hormoneGi hormone
Gi hormone
 
Adrenal hormone
Adrenal hormoneAdrenal hormone
Adrenal hormone
 
Bioinformatics practical note
Bioinformatics practical noteBioinformatics practical note
Bioinformatics practical note
 
Rice dna extraction miniprep protocol
Rice dna extraction miniprep protocolRice dna extraction miniprep protocol
Rice dna extraction miniprep protocol
 
Restriction mapping of bacterial dna
Restriction mapping of bacterial dnaRestriction mapping of bacterial dna
Restriction mapping of bacterial dna
 
Mesurement of cretinine kinase from blood of a cardiac patient
Mesurement of cretinine kinase from blood of a cardiac patientMesurement of cretinine kinase from blood of a cardiac patient
Mesurement of cretinine kinase from blood of a cardiac patient
 
Fasta file extensions & meaning
Fasta file extensions & meaningFasta file extensions & meaning
Fasta file extensions & meaning
 
The biochemistry of memory
The biochemistry of memoryThe biochemistry of memory
The biochemistry of memory
 
Transmission of nerve impulses
Transmission of nerve impulsesTransmission of nerve impulses
Transmission of nerve impulses
 

Dernier

How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 

Dernier (20)

How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 

How the blast work

  • 2. BLAST a sequence comparison algorithm used to search databases for optimal local alignments to a query. search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search. query The input sequence (or other type of search term) to which all of the entries in a database are to be compared. Subject The output sequences that BLAST found after searching the datbase
  • 3. Algorithm A fixed procedure embodied in a computer program. Alignment The process or result of matching up the nucleotide or amino acid residues of two or more biological sequences to achieve maximal levels of identity and, in the case of amino acid sequences the degree of similarity and the possibility of homology. Bit score The bit score, S', is derived from the raw alignment score, S, taking the statistical properties of the scoring system into account. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches. E value The Expectation value or Expect value represents the number of different alignments with scores equivalent to or better than S that is expected to occur in a database search by chance. The lower the E value, the more significant the score and the alignment.
  • 4. gap A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. H H is the relative entropy of the target and background residue frequencies. (Karlin and Altschul, 1990). A measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H short alignments can be distinguished by chance, whereas at lower H values a longer alignment may be necessary (Altschul, 1991). similarity The extent to which nucleotide or protein sequences are related. Similarity between two sequences can be expressed as percent sequence identity and/or percent positive substitutions.
  • 5.
  • 6. identity The extent to which two (nucleotide or amino acid) sequences have the same residues at the same positions in an alignment, often expressed as a percentage. K a natural scale for search space size used in converting a raw score (S) to a bit score (S'). lambda a natural scale for scoring system. used in converting a raw score (S) to a bit score (S'). p value The probability of a chance alignment occurring with a particular score or a better score in a database search. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the
  • 7. What is the Expect (E) value? The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. The lower the E-value, or the closer it is to zero, the more "significant" the match is. You can change the Expect value threshold on most BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
  • 8.
  • 9. HSP A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the highest alignment scores in a given search. low complexity region A region of biased composition in nucleic acid and protein sequences. These include homopolymeric runs, short-period repeats, and subtler over representation of one or a few residues. The SEG program is used to mask or filter low complexity regions in amino acid queries. The DUST program is used to mask or filter such regions in nucleic acid queries. masking Also known as filtering. The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence
  • 10. What is "low-complexity" sequence? Regions with low-complexity sequence have an unusual composition that can create problems in sequence similarity searching. For amino acid queries this compositional bias is determined by the SEG program (Wootton and Federhen, 1996). For nucleotide queries it is determined by the DustMasker program (Morgulis, et al.,2006). Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits. In BLAST searches performed without a filter, high scoring hits may be reported only because of the presence of a low-complexity region. Most often, it is inappropriate to consider this type of match as the result of shared homology. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
  • 11. local alignment The alignment of a high-scoring region of two nucleic acid or protein sequences. optimal alignment An alignment of two sequences with the highest possible score. raw score The score of an alignment, S, calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table ( PAM, BLOSUM). Gap scores are typically calculated as the sum of G, the gap opening penalty and L, the gap extension penalty. For a gap of length n, the gap cost would be G+Ln.
  • 12. AACGTTTCCAGTCCAAATAGCTAGGC ***--*** *-***-**-****** AACCGTTC TACAATTACCTAGGC* = Positive match - = Mismatch Gap= gap between the sequences Score for each positive hit = 1, total no of positive hits=18 Score for each mismatch = 2, total number of mismatches= 5 Penalty score for each gap= 2 (you can set the penalty score of gap), total numbers of gaps =3 score = Number of positive hits × score of each positive hit- {(Number of mismatches × Score of each mismatches) + (Number of gaps × score for each gap penalty)}
  • 13. substitution The presence of a non-identical amino acid at a given position in an alignment. If the aligned residues have similar physico-chemical properties or have a positive score in the governing scoring matrix the substitution is said to be conservative. substitution scoring matrix A scoring matrix containing values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of protein sequences. If the sample is large enough, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution. The BLOSUM matrices are examples of substitution scoring matrices.
  • 14. PAM Percent Accepted Mutation (PAM) is unit introduced by Margaret Dayhoff and colleagues to quantify the amount of evolutionary change in a protein sequence. 1.0 PAM unit is the amount of evolution that will change, on average, 1% of amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence. unitary matrix Also known as identity matrix. This is a scoring system in which only identical characters receive a positive score
  • 15. BLOSUM A Blocks Substitution Matrix is a substitution scoring matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff, 1992)
  • 16. profile A table that lists the frequencies of each amino acid in each position of protein sequence alignment. Frequencies are calculated from multiple alignments of sequences containing a domain of interest. PSSM A Position-Specific Scoring Matrix (PSSM) is a profile that gives the log-odds score for finding a particular matching amino acid in a target sequence.
  • 17.
  • 19. Using a heuristic method, BLAST finds similar sequences, not by comparing either sequence in its entirety, but rather by locating short matches between the two sequences This process of finding initial words is called seeding. While attempting to find similarity in sequences, sets of common letters, known as words, are very important. For example, suppose that the sequence contains the following stretch of letters, GLKFA. If a BLASTp was being conducted under default conditions, the word size would be 3 letters. In this case, using the given stretch of letters, the searched words would be GLK, LKF, KFA.
  • 20. An overview of the BLASTP algorithm (a protein to protein search)
  • 21. Remove low-complexity region or sequence repeats in the query sequence. "Low-complexity region" means a region of a sequence composed of few kinds of elements. These regions might give high scores that confuse the program to find the actual significant sequences in the database, so they should be filtered out. The regions will be marked with an X (protein sequences) or N (nucleic acid sequences) and then be ignored by the BLAST program. To filter out the low-complexity regions, the SEG program is used for protein sequences and the program DUST is used for DNA sequences. On the other hand, the program XNU is used to mask off the tandem repeats in protein sequences.
  • 22. Make a k-letter word list of the query sequence. Take k=3 for example, we list the words of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) "sequentially", until the last letter of the query sequence is included. The method is illustrated in figure 1.
  • 23. List the possible matching words. BLAST only cares about the high-scoring words. The scores are created by comparing the word illustrated previously with all the 3-letter words. By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.
  • 24. Organize the remaining high-scoring words into an efficient search tree. This allows the program to rapidly compare the high-scoring words to the database sequences. Repeat steps for each k-letter word in the query sequence. Scan the database sequences for exact matches with the remaining high-scoring words. The BLAST program scans the database sequences for the remaining high-scoring word, such as PEG, of each position. If an exact match is found, this match is used to seed a possible un-gapped alignment between the query and database sequences.
  • 25. Extend the exact matches to high-scoring segment pair (HSP). The original version of BLAST stretches a longer alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease. A simplified example is presented in figure 2.
  • 26. List all of the HSPs in the database whose score is high enough to be considered. We list the HSPs whose scores are greater than the empirically determined cutoff score S. By examining the distribution of the alignment scores modeled by comparing random sequences, a cutoff score S can be determined such that its value is large enough to guarantee the significance of the remaining HSPs. Evaluate the significance of the HSP score. BLAST next assesses the statistical significance of each HSP score by exploiting the Gumbel extreme value distribution (EVD). In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation where
  • 27. The statistical parameters λ and κ are estimated by fitting the distribution of the un-gapped local alignment scores, of the query sequence and a lot of shuffled versions (Global or local shuffling) of a database sequence, to the Gumbel extreme value distribution. Note that λ and κ depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies). and are the effective lengths of the query and database sequences, respectively.
  • 28. Make two or more HSP regions into a longer alignment. Sometimes, we find two or more HSP regions in one database sequence that can be made into a longer alignment. This provides additional evidence of the relation between the query and database sequence. There are two methods, the Poisson method and the sum-of-scores method, to compare the significance of the newly combined HSP regions. Suppose that there are two combined HSP regions with the pairs of scores (65, 40) and (52, 45), respectively. The Poisson method gives more significance to the set with the maximal lower score (45>40). However, the sum-of-scores method prefers the first set, because 65+40 (105) is greater than 52+45(97). The original BLAST uses the Poisson method; gapped BLAST and the WU- BLAST use the sum-of scores method.
  • 29. Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences. The original BLAST only generates un-gapped alignments including the initially found HSPs individually, even when there is more than one HSP found in one database sequence. Report every match whose expect score is lower than a threshold parameter E.