FBW
23-10-2012

Wim Van Criekinge
BPC 2013

• 10th anniversary edition of the
bioinformatics programming
challenge
DataBase Searching

Dynamic Programming
Reloaded
Database Searching
Fasta
Blast
Statistics
Practical Guide
Extensions
PSI-Blast
PHI-Blast
Local Blast
BLAT
Needleman-Wunsch-edu.pl

The Score Matrix
Seq1 (j)        *    C    K    H    V    F    C    R
Seq2 (i)   *    0   -1   -2   -3   -4   -5   -6   -7
   1       C   -1    1a   0   -1   -2   -3   -4   -5
   2       K   -2    0c   2b   1    0   -1   -2   -3
   3       K   -3   -1    1    1    0   -1   -2   -3
   4       C   -4   -2    0    0    0   -1    0   -1
   5       F   -5   -3   -1   -1   -1    1    0   -1
   6       C   -6   -4   -2   -2   -2    0    2    1
   7       K   -7   -5   -3   -3   -3   -1    1    1
   8       C   -8   -6   -4   -4   -4   -2    0    0
   9       V   -9   -7   -5   -5   -3   -3   -1   -1

A: matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH
   MATCH if (substr(seq1,j-1,1) eq substr(seq2,i-1,1)), MISMATCH otherwise
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP
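The fill rules A, B and C can be sketched in a few lines of Python. This is a minimal sketch, not the course script Needleman-Wunsch-edu.pl: the function name `nw_matrix` is made up for illustration, and the values MATCH = +1, MISMATCH = -1, GAP = -1 are an assumption, chosen because they are consistent with the numbers in the score matrix for Seq1 = CKHVFCR and Seq2 = CKKCFCKCV.

```python
def nw_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch score matrix; rows follow seq2 (i), columns seq1 (j)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First row and first column: leading gaps only
    for j in range(1, cols):
        matrix[0][j] = j * gap
    for i in range(1, rows):
        matrix[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[j - 1] == seq2[i - 1]
            diagonal = matrix[i - 1][j - 1] + (match if same else mismatch)  # rule A
            up_score = matrix[i - 1][j] + gap                               # rule B
            left_score = matrix[i][j - 1] + gap                             # rule C
            matrix[i][j] = max(diagonal, up_score, left_score)
    return matrix


score_matrix = nw_matrix("CKHVFCR", "CKKCFCKCV")
for row in score_matrix:
    print(" ".join(f"{value:3d}" for value in row))
```

The bottom-right cell holds the score of the best global alignment; a traceback through the matrix (not shown here) would recover the alignment itself.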
Multiple Alignment Method

• The most practical and widely used
method in multiple sequence alignment
is the hierarchical extension of
pairwise alignment methods.
• The principle is that a multiple alignment
is achieved by successive application
of pairwise methods.
– First do all pairwise alignments (not just one
sequence with all others)
– Then combine pairwise alignments to generate
overall alignment
Database Searching

• Consider the task of searching
SWISS-PROT with a query
sequence:
– say our query sequence is 362
amino acids long
– SWISS-PROT release 38
contains 29,085,265 amino acids
– finding local alignments via
dynamic programming would
entail O(10^10) matrix operations
(362 × 29,085,265 ≈ 1.05 × 10^10)

• Given size of databases, more
efficient methods needed
Heuristic approaches to DP for database searching

FASTA (Pearson 1995)
• Uses heuristics to avoid calculating the full dynamic programming matrix
• Speeds up searches by an order of magnitude compared to full Smith-Waterman
• The statistical side of FASTA is still stronger than BLAST

BLAST (Altschul 1990, 1997)
• Uses rapid word lookup methods to completely skip most of the database entries
• Extremely fast
– One order of magnitude faster than FASTA
– Two orders of magnitude faster than Smith-Waterman
• Almost as sensitive as FASTA
FASTA

“Hit and extend” heuristic
• Problem: Too many calculations
“wasted” by comparing regions
that have nothing in common
• Initial insight: Regions that are
similar between two sequences
are likely to share short
stretches that are identical
• Basic method: Look for similar
regions only near short
stretches that match exactly
FASTA-Stages

1. Find k-tups in the two sequences (k=1,2 for
proteins, 4-6 for DNA sequences)
2. Score and select top 10 scoring “local diagonals”
3. Rescan top 10 regions, score with PAM250
(proteins) or DNA scoring matrix. Trim off the
ends of the regions to achieve highest scores.
4. Try to join regions with gapped alignments. Join
if similarity score is one standard deviation above
average expected score
5. After finding the best initial region, FASTA
performs a global alignment of a 32 residue wide
region centered on the best initial region, and
uses the score as the optimized score.
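Stages 1 and 2 can be illustrated with a short Python sketch. It only counts exact k-tup matches per diagonal and ranks the diagonals by that count; the real FASTA scoring of diagonals is more refined, and the function names and the example sequences here are made up for illustration.

```python
from collections import defaultdict


def ktup_diagonals(query, target, k):
    """Count exact k-tup matches per diagonal (diagonal = target_pos - query_pos)."""
    positions = defaultdict(list)
    for q in range(len(query) - k + 1):
        positions[query[q:q + k]].append(q)   # lookup table of query k-tups
    counts = defaultdict(int)
    for t in range(len(target) - k + 1):
        for q in positions.get(target[t:t + k], []):
            counts[t - q] += 1                # same offset means same diagonal
    return dict(counts)


def top_diagonals(query, target, k, n=10):
    """The n diagonals with the most k-tup matches, best first."""
    counts = ktup_diagonals(query, target, k)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical example: the target contains the query shifted by two positions
print(top_diagonals("ACDEFG", "XXACDEFG", k=2))
```

Building the lookup table once for the query is what lets the scan over the target skip every region that shares no k-tup with it.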
FastA

• Sensitivity: the ability of a
program to identify weak but
biologically significant sequence
similarity.
• Selectivity: the ability of a
program to discriminate between
true matches and matches
occurring by chance alone.
– A decrease in selectivity results in
more false positives being reported.
FastA (http://www.ebi.ac.uk/fasta33/)

• Gap opening penalty: -12 (proteins) and -16 (DNA) by default for fasta
• Gap extension penalty: -2 (proteins) and -4 (DNA) by default for fasta
• Max number of scores and alignments is 100
• Blosum50 is the default matrix. Use a lower PAM or a higher Blosum matrix to detect close sequences, and a higher PAM or a lower Blosum matrix to detect distant sequences
• The larger the word length, the less sensitive but faster the search will be
FastA Output
• Initn, init1, opt and z-score are calculated during the run
• E(): expectation value, i.e. how many hits are expected to be found by chance with such a score while comparing this query to this database
• E() does not represent the % similarity
• Database code, hyperlinked to the SRS database at EBI
• Accession number
• Description
• Length
FastA is a family of programs

FastA, TFastA, FastX, FastY
Query: DNA or protein
Database: DNA or protein
FASTA problems

FASTA can miss significant similarity
since
– For proteins, similar sequences do
not have to share identical residues
• Asp-Lys-Val is quite similar to
Glu-Arg-Ile, yet it is missed even with
a ktuple size of 1 since no amino acid
matches
• Gly-Asp-Gly-Lys-Gly is quite similar
to Gly-Glu-Gly-Arg-Gly but there is
no match with a ktuple size of 2
FASTA problems

FASTA can miss significant
similarity since
– For nucleic acids, due to codon
“wobble”, DNA sequences may
look like XXyXXyXXy where X’s
are conserved and y’s are not
• GGuUCuACgAAg and
GGcUCcACaAAa both code for
the same peptide sequence (Gly-Ser-Thr-Lys) but they don’t match with
a ktuple size of 3 or higher
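The examples from both "FASTA problems" slides can be checked with a small Python sketch that lists the k-tuples two sequences share. The function names are made up for illustration, and the peptides are written in one-letter code (DKV, ERI, GDGKG, GEGRG).

```python
def ktups(sequence, k):
    """Set of all length-k substrings of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_ktups(seq_a, seq_b, k):
    """k-tuples that occur in both sequences, at any position."""
    return ktups(seq_a, k) & ktups(seq_b, k)


# Protein examples: no shared k-tuple, so FASTA has nothing to seed from
print(shared_ktups("DKV", "ERI", 1))
print(shared_ktups("GDGKG", "GEGRG", 2))

# Codon wobble example: nothing shared at k = 3, several seeds at k = 2
print(shared_ktups("GGUUCUACGAAG", "GGCUCCACAAAA", 3))
print(sorted(shared_ktups("GGUUCUACGAAG", "GGCUCCACAAAA", 2)))
```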
BLAST - Basic Local Alignment
Search Tool
What does BLAST do?

• Search a large target set of sequences...

• …for hits to a query sequence...
• …and return the alignments and scores from those
hits...
• Do it fast.
Show me those sequences that deserve a second look.
Blast programs were designed for fast database
searching, with minimal sacrifice of sensitivity to
distantly related sequences.
The big red button

Do My Job

It is dangerous to hide too much of the
underlying complexity from the scientists.
Overview

• Approach: find segment pairs
by first finding word pairs that
score above a threshold, i.e.,
find word pairs of fixed length
w with a score of at least T
• Key concept “Neighborhood”:
Seems similar to FASTA, but
we are searching for words
which score above T rather than
that match exactly
• Calculate neighborhood (T) for
substrings of query (size W)
Overview

Compile a list of words which give a score
above T when paired with the query sequence.
– Example using PAM-120 for query sequence ACDE
(w=4, T=17):
A C D E
A C D E = +3 +9 +5 +5 = 22

• try all possibilities:
A A A A = +3 -3 0 0 = 0 no good
A A A C = +3 -3 0 -7 = -7 no good

• ...too slow, try directed change
Overview

A C D E
A C D E = +3 +9 +5 +5 = 22
• change 1st pos. to all acceptable substitutions
g C D E = +1 +9 +5 +5 = 20 ok
n C D E = +0 +9 +5 +5 = 19 ok
I C D E = -1 +9 +5 +5 = 18 ok
k C D E = -2 +9 +5 +5 = 17 ok

• change 2nd pos.: can't - all alternatives negative
and the other three positions only add up to 13
• change 3rd pos. in combination with first position
gCnE = 1 9 2 5 = 17 ok
• continue - use recursion

• For "best" values of w and T there are typically
about 50 words in the list for every residue in the
query sequence
Neighborhood.pl
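# Calculate neighborhood
# Assumes these are defined earlier in the script: %S (scoring matrix as a
# hash of hashes), @A (alphabet), @W (the three letters of the query word)
# and $T (neighborhood score threshold).
my %NH;
for (my $i = 0; $i < @A; $i++) {
    my $s1 = $S{$W[0]}{$A[$i]};
    for (my $j = 0; $j < @A; $j++) {
        my $s2 = $S{$W[1]}{$A[$j]};
        for (my $k = 0; $k < @A; $k++) {
            my $s3 = $S{$W[2]}{$A[$k]};
            my $score = $s1 + $s2 + $s3;
            my $word = "$A[$i]$A[$j]$A[$k]";
            next if $word =~ /[BZX*]/;   # skip ambiguity codes and stop
            $NH{$word} = $score if $score >= $T;
        }
    }
}
# Output neighborhood, highest score first, ties in alphabetical order
foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) {
    print "$word $NH{$word}\n";
}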
BLOSUM62 RGD 11

RGD 17
KGD 14
QGD 13
RGE 13
EGD 12
HGD 12
NGD 12
RGN 12
AGD 11
MGD 11
RAD 11
RGQ 11
RGS 11
RND 11
RSD 11
SGD 11
TGD 11

PAM200 RGD 13

RGD 18
RGE 17
RGN 16
KGD 15
RGQ 15
KGE 14
HGD 13
KGN 13
RAD 13
RGA 13
RGG 13
RGH 13
RGK 13
RGS 13
RGT 13
RSD 13
WGD 13
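The same exhaustive enumeration can be written for words of any length. The Python sketch below is an illustration only: `neighborhood` and `toy_score` are hypothetical names, and the toy scoring function (identity +4, same charge group +1, anything else -2, over a four-letter alphabet) is invented so that the result is easy to check by hand. It is not BLOSUM62 or a PAM matrix.

```python
from itertools import product


def neighborhood(word, alphabet, score, threshold):
    """All words of len(word) that score >= threshold against word, best first."""
    found = {}
    for letters in product(alphabet, repeat=len(word)):
        total = sum(score(q, c) for q, c in zip(word, letters))
        if total >= threshold:
            found["".join(letters)] = total
    # Same ordering as the Perl output: by score, then alphabetically
    return sorted(found.items(), key=lambda item: (-item[1], item[0]))


def toy_score(a, b):
    """Hypothetical scores, not a real matrix: identity 4, same group 1, else -2."""
    groups = [set("RK"), set("DE")]
    if a == b:
        return 4
    if any(a in group and b in group for group in groups):
        return 1
    return -2


print(neighborhood("RD", "RKDE", toy_score, threshold=5))
```

Enumerating every candidate costs |alphabet|^w score sums per query word, which is why the slides suggest a directed, recursive search once w grows.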
[Figure: score plotted against length of extension for an indexed word hit (*); the extension is trimmed back to the maximum score, with cutoff S]
*Two non-overlapping HSP’s on a diagonal within distance A
The BLAST algorithm

• Break the search sequence into words
– W = 3 for proteins, W = 12 for DNA
MCGPFILGTYC
MCG, CGP, GPF, PFI, FIL,
ILG, LGT, GTY, TYC

• Include in the search all words that score
above a certain value (T) for any search word
MCG: MCT, MCN, …
CGP: MGP, CTP, …
…

This list can be computed in linear time
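Breaking the search sequence into overlapping words is a one-line sliding window. A minimal Python sketch, with `query_words` as a made-up name:

```python
def query_words(sequence, w):
    """All overlapping words of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


words = query_words("MCGPFILGTYC", 3)
print(words)
```

A sequence of length L yields L - w + 1 words, so the 11-residue example gives the 9 words listed on the slide.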
The Blast Algorithm (2)

• Search for the words in the database
– Word locations can be precomputed and indexed
– Searching for a short string in a long string

• HSP (High Scoring Pair) = A match between
a query word and the database
• Find a “hit”: Two non-overlapping HSP’s on a
diagonal within distance A
• Extend the hit until the score falls below a
threshold value, S
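The lookup and two-hit steps can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it uses exact word matches only (no neighborhood words), skips the extension step, and the function names, the example subject sequence and the value 40 for the distance A are all assumptions.

```python
from collections import defaultdict


def word_hits(query, subject, w):
    """Exact word hits as (query_pos, subject_pos) pairs, using an index of the subject."""
    index = defaultdict(list)
    for s in range(len(subject) - w + 1):
        index[subject[s:s + w]].append(s)
    return [(q, s)
            for q in range(len(query) - w + 1)
            for s in index.get(query[q:q + w], [])]


def two_hit_pairs(hits, w, max_distance):
    """Pairs of non-overlapping hits on one diagonal, at most max_distance apart."""
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)
    pairs = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                gap = starts[b] - starts[a]
                if w <= gap <= max_distance:   # gap >= w means no overlap
                    pairs.append((diagonal, starts[a], starts[b]))
    return pairs


hits = word_hits("MCGPFILGTYC", "AAMCGXXILGAA", w=3)
print(hits)
print(two_hit_pairs(hits, w=3, max_distance=40))
```

Each returned triple is (diagonal, start of first hit, start of second hit) in query coordinates; these are the seeds that would then be extended.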
BLAST parameters

• Lowering the neighborhood word threshold (T)
allows more distantly related sequences to be found,
at the expense of increased noise in the results set.
• Choosing a value for w
– small w: many matches to expand
– big w: many words to be generated
– w=4 is a good compromise

• Lowering the segment extension cutoff (S) returns
longer extensions for each hit.
• Changing the minimum E-value changes the
threshold for reporting a hit.
Critical parameters: T,W and scoring matrix

• The proper value of T depends on both the
values in the scoring matrix and the balance
between speed and sensitivity
• Higher values of T progressively remove
more word hits and reduce the search space.
• Word size (W) of 1 will produce more hits
than a word size of 10. In general, if T is
scaled uniformly with W, smaller word
sizes increase sensitivity and decrease
speed.
• The interplay between W, T and the scoring
matrix is critical and choosing them wisely
is the most effective way of controlling the
speed and sensitivity of blast
Database Searching

• How can we find a particular short sequence
in a database of sequences (or one HUGE
sequence)?
• Problem is identical to local sequence
alignment, but on a much larger scale.
• We must also have some idea of the
significance of a database hit.
– Databases always return some kind of hit, how
much attention should be paid to the result?

• How can we determine how “unusual” a
particular alignment score is?
Significance
Sentence 1:
“These algorithms are trying to find the best way to match up
two sequences”
Sentence 2:
“This does not mean that they will find anything profound”
ALIGNMENT:

THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
:: :.. . .. ...: : ::::.. :: . : ...
THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND-----
12 exact matches
14 conservative substitutions

Is this a good alignment?
Overview

• A key to the utility of BLAST is
the ability to calculate expected
probabilities of occurrence of
Maximum Segment Pairs
(MSPs) given w and T
• This allows BLAST to rank
matching sequences in order of
“significance” and to cut off
listings at a user-specified
probability
Mathematical Basis of BLAST

• Model matches as a sequence of coin tosses
• Let p be the probability of a “head”
– For a “fair” coin, p = 0.5

• (Erdős-Rényi) If there are n throws, then the
expected length R of the longest run of heads is
$R = \log_{1/p}(n)$.
• Example: Suppose n = 20 for a “fair” coin
$R = \log_2(20) \approx 4.32$

• Trick is how to model DNA (or amino acid)
sequence alignments as coin tosses.
Mathematical Basis of BLAST

• To model random sequence alignments, replace a
match with a “head” and mismatch with a “tail”.

AATCAT

HTHHHT

ATTCAG

• For DNA, the probability of a “head” is 1/4
– What is it for amino acid sequences?
Mathematical Basis of BLAST

• So, for one particular alignment, the Erdős-Rényi
property can be applied
• What about for all possible alignments?
– Consider that sequences are being shifted back and
forth, dot matrix plot

• The expected length of the longest match is
$R = \log_{1/p}(mn)$
where m and n are the lengths of the two sequences.
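Both formulas are easy to evaluate numerically. A small Python sketch (the function names and the example lengths of 100 are made up for illustration):

```python
import math


def expected_longest_run(n, p):
    """Erdős-Rényi: expected longest run of heads in n tosses, R = log_{1/p}(n)."""
    return math.log(n) / math.log(1 / p)


def expected_longest_match(m, n, p):
    """Same idea over all alignments of sequences of length m and n: R = log_{1/p}(mn)."""
    return math.log(m * n) / math.log(1 / p)


print(round(expected_longest_run(20, 0.5), 2))           # the fair-coin example
print(round(expected_longest_match(100, 100, 0.25), 2))  # two DNA sequences, p = 1/4
```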
Analytical derivation

Erdős-Rényi
…
…
…
Karlin-Altschul
Karlin-Altschul Statistics

$E = k m n e^{-\lambda S}$
This equation states that the number of alignments
expected by chance (E) during the sequence
database search is a function of the size of the
search space (m*n), the normalized score (λS)
and a minor constant (k, mostly 0.1)

The E-value grows linearly with the product of target and
query sizes. Doubling the target set size and doubling
the query length have the same effect on the E-value
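A minimal Python sketch of the Karlin-Altschul expectation makes the scaling behaviour concrete. The default k = 0.1 follows the slide; the default λ = 0.3 and the raw score of 50 are placeholders chosen for illustration, not values from any particular scoring system. The sizes 362 and 29,085,265 are the query and SWISS-PROT sizes quoted earlier in the deck.

```python
import math


def expected_hits(score, m, n, k=0.1, lam=0.3):
    """E = k * m * n * exp(-lambda * S); lam=0.3 is an illustrative placeholder."""
    return k * m * n * math.exp(-lam * score)


base = expected_hits(50, m=362, n=29_085_265)
double_query = expected_hits(50, m=2 * 362, n=29_085_265)
double_target = expected_hits(50, m=362, n=2 * 29_085_265)
print(base, double_query, double_target)
```

Doubling either m or n doubles E, while each extra unit of score multiplies E by exp(-λ).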
Analytical derivation

Erdős-Rényi: $R = \log_{1/p}(mn)$
…
…
…
Karlin-Altschul: $E = k m n e^{-\lambda S}$
Scoring alignments

• Score: S (~R)
– $S = \sum_i M(q_i, t_i) - \sum \text{gap penalties}$
• Any alignment has a score
• Any two sequences have (at least) one
optimal alignment
• For a particular scoring matrix and its
associated gap initiation and extension costs
one must calculate λ and k
• Unfortunately (for gapped alignments), you
can’t do this analytically and the values must
be estimated empirically
– The procedure involves aligning random
sequences (Monte Carlo approach) with a specific
scoring scheme and observing the alignment
properties (scores, target frequencies and
lengths)
Significance

“Monte Carlo” Approach:
• Compares the result to randomized
results, similar to results generated by a
roulette wheel at Monte Carlo
• Typical procedure for alignments
– Randomize sequence A
– Align to sequence B
– Repeat many times (hundreds)
– Keep track of the optimal score

• Histogram of scores …
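The procedure above can be sketched in Python. This is not the course script Needleman-wunsch-Monte-Carlo.pl: the function names are made up, the scoring values (match +1, mismatch -1, gap -1) are assumptions, and a fixed seed is used only so the run is repeatable.

```python
import random


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping only one row of the matrix at a time."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def monte_carlo_scores(seq_a, seq_b, trials=200, seed=0):
    """Shuffle seq_a repeatedly and record the optimal score against seq_b."""
    rng = random.Random(seed)
    letters = list(seq_a)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(nw_score("".join(letters), seq_b))
    return scores


def histogram(scores):
    """Map each score to the number of times it was observed."""
    counts = {}
    for score in scores:
        counts[score] = counts.get(score, 0) + 1
    return counts


scores = monte_carlo_scores("CKHVFCR", "CKKCFCKCV")
for score, count in sorted(histogram(scores).items()):
    print(f"{score:4d} {'*' * count}")
```

Shuffling preserves the composition of sequence A, so the histogram shows which scores composition alone can produce; the real score is then judged against that background.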
Assessing significance requires a distribution

• I have a pumpkin of diameter 1m. Is that unusual?

[Figure: histogram of pumpkin diameters; x-axis Diameter (m), y-axis Frequency]
Significance
Normal Distribution does NOT Fit Alignment Scores !!

• In seeking optimal Alignments between two
sequences, one desires those that have the highest
score - i.e. one is seeking a distribution of maxima
• In seeking optimal Matches between an Input
Sequence and Sequence Entries in a Database, one
again desires the matches that have the highest
score, and these are obtained via examination of the
distribution of such scores for the entries in the
database - this is again a distribution of maxima.
“A Normal Distribution is a distribution of Sums of
independent variables rather than a sum of their
Maxima.“
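Comparing distributions

Gaussian:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Extreme Value:
$f(x) = \frac{1}{\beta} \, e^{-\frac{x-\mu}{\beta}} \, e^{-e^{-\frac{x-\mu}{\beta}}}$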
Alignment scores follow extreme value distributions
Alignment of unrelated/random sequences results in scores
following an extreme value distribution

$P(x \ge S) = 1 - \exp(-k m n e^{-\lambda S})$
m, n: sequence lengths.
k, λ: free parameters.

$P = 1 - e^{-E}$
$E = -\ln(1 - P)$

This can be shown analytically for ungapped alignments and has
been found empirically to also hold for gapped alignments under
commonly used conditions.
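The two conversions between E and P can be written directly in Python. In this sketch the function names are made up, and `math.expm1` and `math.log1p` are used instead of the plain formulas only to keep precision when E or P is very small.

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """E = -ln(1 - P)."""
    return -math.log1p(-p_value)


for e in (1e-5, 0.1, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small values E and P are nearly identical; they only diverge once E approaches 1, where P saturates towards 1.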
Alignment scores follow extreme value distributions
Alignment algorithms will always produce
alignments, regardless of whether they are meaningful or not
=> it is important to have a way of selecting significant alignments
from a large set of database hits.
Solution: fit distribution of scores from database search to
extreme value distribution; determine p-value of hit from this
fitted distribution.
Example: scores fitted to
extreme value distribution.
99.9% of this distribution is
located below score=112

=> hit with score = 112 has a
p-value of 0.1%
Significance
BLAST uses precomputed extreme
value distributions to calculate E-values from alignment scores
For this reason BLAST only allows
certain combinations of substitution
matrices and gap penalties
This also means that the fit is based on
a different data set than the one you
are working on
A word of caution: BLAST tends to overestimate the significance of its
matches
E-values from BLAST are fine for identifying sure hits
One should be careful using BLAST’s E-values to judge if a marginal hit
can be trusted (e.g., you may want to use E-values of 10^-4 to 10^-5).
Determining P-values

• If we can estimate the two parameters of the extreme
value distribution (location μ and scale β), then we can
determine, for a given match score x, the
probability that a random match with score x
or greater would have occurred in the
database.
• For sequence matches, a scoring system and
database can be parameterized by two
parameters, k and λ, related to μ and β.
– It would be nice if we could compare hit
significance without regard to the scoring system
used!
Bit Scores

• The expected number of hits with score ≥ S
is:
$E = k m n e^{-\lambda S}$
– Where m and n are the sequence lengths

• Normalize the raw score using:
$S' = \frac{\lambda S - \ln k}{\ln 2}$

• Obtains a “bit score” S’, with a standard set of
units.
• The new E-value is: $E = m n 2^{-S'}$
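A short Python sketch confirms that the bit-score form gives the same E-value as the raw-score form. The function names are made up, and the numbers (S = 50, λ = 0.3, k = 0.1, m = 362, n = 29,085,265) are placeholders chosen for illustration.

```python
import math


def e_from_raw(raw_score, m, n, lam, k):
    """E = k * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln k) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2 ** (-bits)


bits = bit_score(50, lam=0.3, k=0.1)
print(bits)
print(e_from_raw(50, 362, 29_085_265, lam=0.3, k=0.1))
print(e_from_bits(bits, 362, 29_085_265))
```

Because λ and k are folded into S', bit scores from searches with different scoring systems can be compared directly.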
[Histogram: output of Needleman-wunsch-Monte-Carlo.pl, with alignment scores from -74 to -49 on one axis and asterisk bars for their frequencies]

Needleman-wunsch-Monte-Carlo.pl

(Average around -64 !)
FastA Output

• The distribution-of-scores graph shows the
frequency of observed scores
• The expected curve (asterisks) follows
the extreme value distribution
– the theoretical curve should be
similar to the observed results
• Deviations indicate that the fitting
parameters are wrong
– gap penalties too weak
– compositional biases
FastA Output
< 20 222 0 :*
22 30 0 :*
24 18 1 :*
26 18 15 :*
28 46 159 :*
30 207 963 :*
32 1016 3724 := *
34 4596 10099 :==== *
36 9835 20741 :========= *
38 23408 34278 :==================== *
40 41534 47814 :=================================== *
42 53471 58447 :============================================ *
44 73080 64473 :====================================================*=======
46 70283 65667 :=====================================================*====
48 64918 62869 :===================================================*==
50 65930 57368 :===============================================*=======
52 47425 50436 :======================================= *
54 36788 43081 :=============================== *
56 33156 35986 :============================ *
58 26422 29544 :====================== *
60 21578 23932 :================== *
62 19321 19187 :===============*
64 15988 15259 :============*=
66 14293 12060 :=========*==
68 11679 9486 :=======*==
70 10135 7434 :======*==
FastA Output

72 8957 5809 :====*===
74 7728 4529 :===*===
76 6176 3525 :==*===
78 5363 2740 :==*==
80 4434 2128 :=*==
82 3823 1628 :=*==
84 3231 1289 :=*=
86 2474 998 :*==
88 2197 772 :*=
90 1716 597 :*=
92 1430 462 :*= :===============*========================
94 1250 358 :*= :============*===========================
96 954 277 :* :=========*=======================
98 756 214 :* :=======*===================
100 678 166 :* :=====*==================
102 580 128 :* :====*===============
104 476 99 :* :===*=============
106 367 77 :* :==*==========
108 309 59 :* :==*========
110 287 46 :* :=*========
112 206 36 :* :=*======
114 161 28 :* :*=====
116 144 21 :* :*====
118 127 16 :* :*====
>120 886 13 :* :*==============================

Related
FastA Output

• A summary of the statistics and of the
program parameters follows the histogram.
– An important number in this summary is the
Kolmogorov-Smirnov statistic, which indicates
how well the actual data fit the theoretical
statistical distribution. The lower this value, the
better the fit, and the more reliable the statistical
estimates.
– In general, a Kolmogorov-Smirnov statistic under
0.1 indicates a good fit with the theoretical model.
If the statistic is higher than 0.2, the statistics may
not be valid, and it is recommended to repeat the
search, using more stringent (more negative)
values for the gap penalty parameters.
Statistics summary

• Optimal local alignment scores for pairs of random
amino acid sequences of the same length follow an
extreme-value distribution. For any score S, the
probability of observing a score >= S is given by the
Karlin-Altschul statistic $P(\text{score} \ge S) = 1 - \exp(-k m n e^{-\lambda S})$
• k and λ are parameters related to the position
of the maximum and the width of the distribution.
• Note the long tail at the right. This means that a
score several standard deviations above the mean
has higher probability of arising by chance (that is, it
is less significant) than if the scores followed a
normal distribution.
P-values

• Many programs report P = the probability that the
alignment is no better than random. The relationship
between Z and P depends on the distribution of the
scores from the control population, which do NOT
follow the normal distribution
– P <= 10^-100 (exact match)
– P in range 10^-100 to 10^-50 (sequences nearly identical, e.g.
alleles or SNPs)
– P in range 10^-50 to 10^-10 (closely related
sequences, homology certain)
– P in range 10^-5 to 10^-1 (usually distant relatives)
– P > 10^-1 (match probably insignificant)
E

• For database searches, most programs report E-values. The
E-value of an alignment is the expected number of sequences
that give the same Z-score or better if the database is probed
with a random sequence. E is found by multiplying the value
of P by the size of the database probed. Note that E but not P
depends on the size of the database. Values of P are
between 0 and 1. Values of E are between 0 and the number
of sequences in the database searched:
– E <= 0.02: sequences probably homologous
– E between 0.02 and 1: homology cannot be ruled out
– E > 1: you would have to expect a match this good just by chance
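The rule of thumb above translates into a few lines of Python. The function names and the example numbers are made up for illustration; the thresholds are the ones on the slide.

```python
def e_value(p_value, database_size):
    """E = P multiplied by the number of sequences in the database probed."""
    return p_value * database_size


def interpret_e(e):
    """Classify an E-value using the thresholds from the slide."""
    if e <= 0.02:
        return "sequences probably homologous"
    if e <= 1:
        return "homology cannot be ruled out"
    return "a match this good is expected just by chance"


# The same P-value means something different in a larger database
for size in (1_000, 100_000, 10_000_000):
    e = e_value(1e-6, size)
    print(size, e, interpret_e(e))
```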
Blast

BLAST is actually a family of programs:
• BLASTN - Nucleotide query searching a
nucleotide database.
• BLASTP - Protein query searching a
protein database.
• BLASTX - Translated nucleotide query
sequence (6 frames) searching a protein
database.
• TBLASTN - Protein query searching a
translated nucleotide (6 frames) database.
• TBLASTX - Translated nucleotide query (6
frames) searching a translated nucleotide
(6 frames) database.
Tips

• Be aware of what options you
have selected when using
BLAST, or FASTA
implementations.
• Treat BLAST searches as
scientific experiments
• So you should try your searches
with the filters on and off to see
whether it makes any difference
to the output
Tips: Low-complexity and Gapped Blast Algorithm

• The common, Web-based ones often have
default settings that will affect the outcome
of your searches. By default all NCBI BLAST
implementations filter out biased sequence
composition from your query sequence (e.g.
signal peptide and transmembrane
sequences - beware!).

• The SEG program has been implemented
as part of the blast routine in order to mask
low-complexity regions
• Low-complexity regions are denoted by
strings of Xs in the query sequence
Tips

• The sequence databases contain a
wealth of information. They also
contain a lot of errors. Contaminants
…
• Annotation errors, frameshifts that
may result in erroneous conceptual
translations.
• Hypothetical proteins ?

• In the words of Fox Mulder, "Trust
no one."
Tips

• Once you get a match to things
in the databases, check whether
the match is to the entire
protein, or to a domain. Don't
immediately assume that a
match means that your protein
carries out the same function
(see above). Compare your
protein and the match protein(s)
along their entire lengths before
making this assumption.
Tips

• Domain matches can also cause problems
by hiding other informative matches. For
instance if your protein contains a common
domain you'll get significant matches to
every homologous sequence in the
database. BLAST only reports back a
limited number of matches, ordered by P
value.
• If this list consists only of matches to the
same domain, cut this bit out of your query
sequence and do the BLAST search again
with the edited sequence (e.g. NHR).
Tips

• Do controls wherever possible. In
particular when you use a particular
search software for the first time.
• Suitable positive controls would be protein
sequences known to have distant
homologues in the databases to check
how good the software is at detecting such
matches.
• Negative controls can be employed to
make sure the compositional bias of the
sequence isn't giving you false positives.
Shuffle your query sequence and see what
difference this makes to the matches that
are returned. A real match should be lost
upon shuffling of your sequence.
Tips

• Perform Controls
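#!/usr/bin/perl -w
# Shuffle the residues of a single FASTA record read from STDIN or a file
use strict;
my ($def, @seq) = <>;                   # first line is the definition line
print $def;
chomp @seq;
@seq = split(//, join("", @seq));       # one element per residue
my $count = 0;
while (@seq) {
    my $index = rand(@seq);             # splice truncates this to an integer
    my $base = splice(@seq, $index, 1); # draw one residue without replacement
    print $base;
    print "\n" if ++$count % 60 == 0;   # wrap output at 60 columns
}
print "\n" unless $count % 60 == 0;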
Tips

• Read the footer first
• View results graphically
• Parse Blasts with Bioperl
FastA vs. Blast

• BLAST's major advantage is its speed.
– 2-3 minutes for BLAST versus several hours
for a sensitive FastA search of the whole of
GenBank.

• When both programs use their default
setting, BLAST is usually more sensitive
than FastA for detecting protein sequence
similarity.
– Since it doesn't require a perfect sequence
match in the first stage of the search.
FastA vs. Blast
Weakness of BLAST:
– The long word size it uses in the initial stage of DNA
sequence similarity searches was chosen for speed, and not
sensitivity.
– For a thorough DNA similarity search, FastA is the
program of choice, especially when run with a lowered
KTup value.
– FastA is also better suited to the specialised task of
detecting genomic DNA regions using a cDNA query
sequence, because it allows the use of a gap extension
penalty of 0. BLAST, which only creates ungapped
alignments, will usually detect only the longest exon, or fail
altogether.

• In general, a BLAST search using the default
parameters should be the first step in a database
similarity search strategy. In many cases, this is all
that may be required to yield all the information
needed, in a very short time.
PSI-Blast

1. Old (ungapped) BLAST
2. New BLAST (allows gaps)
3. Profile -> PSI-Blast (Position-Specific Iterated)

Strategy:
• Multiple alignment of the hits
• Calculates a position-specific score matrix (PSSM)
• Searches with this matrix
In many cases it is much more sensitive to weak but
biologically relevant sequence similarities
PSI-Blast

• Patterns of conservation from the alignment of
related sequences can aid the recognition of
distant similarities.
– These patterns have been variously called motifs,
profiles, position-specific score matrices, and
Hidden Markov Models.
For each position in the derived pattern, every
amino acid is assigned a score.
(1) Highly conserved residue at a position: that
residue is assigned a high positive score, and
others are assigned high negative scores.
(2) Weakly conserved positions: all residues receive
scores near zero.
(3) Position-specific scores can also be assigned to
potential insertions and deletions.
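Points (1) and (2) can be illustrated with a toy position-specific score matrix in Python. This is a deliberately simplified sketch and not how PSI-BLAST builds its profiles: it assumes a uniform background, a flat pseudocount and an ungapped alignment, and the function names, the five-letter alphabet and the four aligned sequences are made up for illustration.

```python
import math


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Log-odds score per position and residue, against a uniform background."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*alignment):        # one tuple of residues per position
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            frequency = (column.count(residue) + pseudocount) / total
            scores[residue] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Sum of the position-specific scores for an ungapped segment."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


pssm = build_pssm(["RGD", "RGD", "RGE", "KGD"], alphabet="DEGKR")
print(round(pssm[1]["G"], 2), round(pssm[1]["D"], 2))
print(round(score_with_pssm(pssm, "RGD"), 2), round(score_with_pssm(pssm, "DDD"), 2))
```

The fully conserved G column gives G a clearly positive score and every other residue a negative one, which is the behaviour described in point (1).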
Pattern

• a set of alternative
sequences, using
“regular expressions”
• Prosite
(http://www.expasy.org/prosite/)
PSSM (Position-Specific Scoring Matrix)
PSI-Blast

• The power of profile methods can be
further enhanced through iteration of
the search procedure.
– After a profile is run against a database,
new similar sequences can be detected. A
new multiple alignment, which includes
these sequences, can be constructed, a
new profile abstracted, and a new
database search performed.
– The procedure can be iterated as often as
desired or until convergence, when no new
statistically significant sequences are
detected.
PSI-Blast
(1) PSI-BLAST takes as an input a single protein sequence
and compares it to a protein database, using the gapped
BLAST program.
(2) The program constructs a multiple alignment, and then a
profile, from any significant local alignments found.
The original query sequence serves as a template for the multiple
alignment and profile, whose lengths are identical to that of the
query. Different numbers of sequences can be aligned in different
template positions.

(3) The profile is compared to the protein database, again
seeking local alignments using the BLAST algorithm.
(4) PSI-BLAST estimates the statistical significance of the local
alignments found.
Because profile substitution scores are constructed to a fixed
scale, and gap scores remain independent of position, the
statistical theory and parameters for gapped BLAST alignments
remain applicable to profile alignments.

(5) Finally, PSI-BLAST iterates, by returning to step (2), a
specified number of times or until convergence.
PSI-BLAST

PSSM

PSSM
From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
PSI-BLAST pitfalls

• Avoid sequences that are too close: overfitting!
• False homologs can be included! Therefore check
the matches carefully: include or exclude
sequences based on biological knowledge.
• The E-value reflects the significance of the
match to the previous training set, not to the
original sequence!
• Choose your query sequence carefully.
• Try the reverse experiment to confirm.
Reduce overfitting risk by Cobbler

• A single sequence is selected
from a set of blocks and enriched
by replacing the conserved
regions delineated by the blocks
by consensus residues derived
from the blocks.
• Embedding consensus residues
improves performance

• S. Henikoff and J.G. Henikoff;
Protein Science (1997) 6:698-705.
PHI-Blast
(Pattern-Hit Initiated BLAST)
From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
Installing Blast Locally

• 2 flavors: NCBI/WuBlast
• Executables:
– ftp://ftp.ncbi.nih.gov/blast/executables/

• Database:
– ftp://ftp.ncbi.nih.gov/blast/db/

• Formatdb
– formatdb -i ecoli.nt -p F
– formatdb -i ecoli.protein -p T

• For options: blastall
– blastall -p blastp -i query -d database -o output
Main database: BLAT

• BLAT: BLAST-Like Alignment Tool
• Aligns the input sequence to the
Human Genome
• Connected to several databases, like:
– mRNAs
– ESTs
– RepeatMasker
– RefSeq
– GenScan
– TwinScan
– UniGene
– CpG Islands
BLAT Human Genome Browser
BLAT method

• Align sequence with BLAT, get alignment
info
• Per BLAT hit, pick up additional info from
connected databases:
– mRNAs
– ESTs
– RepeatMasker
– CpG Islands
– RefSeq Genes
Weblems

W5.1: Submit the amino acid sequence of papaya
papain to a BLAST (gapped and ungapped) and to a
PSI-BLAST search. What are the main differences in
the results?
W5.2: Is there a relationship between Klebsiella
aerogenes urease, Pseudomonas diminuta
phosphotriesterase and mouse adenosine deaminase?
Also use DALI, ClustalW and T-coffee.
W5.3: Yeast two-hybrid typically yields DNA
sequences. How would you find the corresponding
protein?
W5.4: When and why would you use tblastn?
W5.5: How would you search a database if you want to
restrict the search space to those entries having a
secretion signal consisting of 4 consecutive (N-terminal) basic residues?

Bioinformatics t5-database searching-v2013_wim_vancriekinge

  • 1.
  • 3.
  • 4. BPC 2013 • 10 anniversary edition of the bioinformatics programming challenge
  • 5. DataBase Searching Dynamic Programming Reloaded Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast BLAT
  • 6. Needleman-Wunsch-edu.pl The Score Matrix ---------------Seq1(j)1 2 3 4 5 6 7 8 9 Seq2 * C K H V F C R (i) * 0 -1 -2 -3 -4 -5 -6 -7 1 C -1 1 a 0 -1 -2 -3 -4 -5 2 K -2 0c 2b 1 0 -1 -2 -3 3 K -3 -1 1 1 0 -1 -2 -3 A: 4 C -4 -2 matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH 0 0 0 -1 0 -1 if 5 F -5 -3 -1(substr(seq1,j-1,1) eq substr(seq2,i-1,1) -1 -1 1 0 -1 6 C -6 -4 up_score = matrix(i-1,j) + GAP 2 -2 -2 -2 0 1 B: 7 K -7 -5 -3 -3 -3 -1 1 1 8 C -8 -6 left_score =-4 -4 -4 0 C: matrix(i,j-1) +-2 GAP 0 9 V -9 -7 -5 -5 -3 -3 -1 -1
  • 7. Multiple Alignment Method • The most practical and widely used method in multiple sequence alignment is the hierarchical extensions of pairwise alignment methods. • The principal is that multiple alignments is achieved by successive application of pairwise methods. – First do all pairwise alignments (not just one sequence with all others) – Then combine pairwise alignments to generate overall alignment
  • 8. Database Searching • Consider the task of searching SWISS-PROT against a query sequence: – say our query sequence is 362 amino- acids long – SWISS-PROT release 38 contains 29,085,265 amino acids – finding local alignments via dynamic programming would entail O(1010)matrix operations • Given size of databases, more efficient methods needed
  • 9. Heuristic approaches to DP for database searching FASTA (Pearson 1995) BLAST (Altschul 1990, 1997) Uses heuristics to avoid calculating the full dynamic programming matrix Uses rapid word lookup methods to completely skip most of the database entries Speed up searches by an order of magnitude compared to full SmithWaterman The statistical side of FASTA is still stronger than BLAST Extremely fast One order of magnitude faster than FASTA Two orders of magnitude faster than SmithWaterman Almost as sensitive as FASTA
  • 10. FASTA « Hit and extend heuristic» • Problem: Too many calculations “wasted” by comparing regions that have nothing in common • Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical • Basic method: Look for similar regions only near short stretches that match exactly
  • 11. FASTA-Stages 1. 2. 3. 4. 5. Find k-tups in the two sequences (k=1,2 for proteins, 4-6 for DNA sequences) Score and select top 10 scoring “local diagonals” Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores. Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.
  • 12.
  • 13.
  • 14. FastA • Sensitivity: the ability of a program to identify weak but biologically significant sequence similarity. • Selectivity: the ability of a program to discriminate between true matches and matches occurring by chance alone. – A decrease in selectivity results in more false positives being reported.
  • 15. FastA (http://www.ebi.ac.uk/fasta33/) Gap opening penalty -12, -16 by default for fasta with proteins and DNA, respectively Gap extension penalty -2, -4 by default for fasta with proteins and DNA, respectively Max number of scores and alignments is 100 Blosum50 default. Lower PAM higher blosum to detect close sequences Higher PAM and lower blosum to detect distant sequences The larger the word-length the less sensitive, but faster the search will be
  • 16. FastA Output Initn, init1, opt, zscore calculated during run E score expectation value, how many hits are expected to be found by chance with such a score while comparing this query to this database. Database code hyperlinked to the SRS database at EBI Accession number Description Length E() does not represent the % similarity
  • 17. FastA is a family of programs FastA, TFastA, FastX, FastY Query: DNAProtein Database:DNA Protein
  • 18. FASTA problems FASTA can miss significant similarity since – For proteins, similar sequences do not have to share identical residues • Asp-Lys-Val is quite similar to • Glu-Arg-Ile yet it is missed even with ktuple size of 1 since no amino acid matches • Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly but there is no match with ktuple size of 2
  • 19. FASTA problems FASTA can miss significant similarity since – For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy where X’s are conserved and y’s are not • GGuUCuACgAAg and GGcUCcACaAAa both code for the same peptide sequence (Gly-Ser-Thr-Lys) but they don’t match with a ktuple size of 3 or higher
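The examples on these two slides can be checked directly. The helper below is a small sketch (the function name is made up here) that lists the k-tuples two sequences have in common; the peptides are written in one-letter code.

```python
def shared_ktuples(a, b, k):
    """Return the set of length-k words that occur in both sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return words_a & words_b


if __name__ == "__main__":
    # Asp-Lys-Val vs Glu-Arg-Ile: nothing shared even at k = 1
    print(shared_ktuples("DKV", "ERI", 1))
    # Gly-Asp-Gly-Lys-Gly vs Gly-Glu-Gly-Arg-Gly: nothing shared at k = 2
    print(shared_ktuples("GDGKG", "GEGRG", 2))
    # Two codings of Gly-Ser-Thr-Lys: nothing shared at k = 3
    print(shared_ktuples("GGUUCUACGAAG", "GGCUCCACAAAA", 3))
```

All three calls return an empty set, so an exact k-tup lookup gives FASTA nothing to seed from in these cases.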
  • 21. BLAST - Basic Local Alignment Search Tool
  • 22. What does BLAST do? • Search a large target set of sequences... • …for hits to a query sequence... • …and return the alignments and scores from those hits... • Do it fast. Show me those sequences that deserve a second look. Blast programs were designed for fast database searching, with minimal sacrifice of sensitivity to distantly related sequences.
  • 23. The big red button Do My Job It is dangerous to hide too much of the underlying complexity from the scientists.
  • 24. Overview • Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T • Key concept “Neighborhood”: seems similar to FASTA, but we are searching for words which score above T rather than words that match exactly • Calculate the neighborhood (T) for substrings of the query (size W)
  • 25. Overview Compile a list of words which give a score above T when paired with the query sequence. – Example using PAM-120 for query sequence ACDE (w=4, T=17):
      ACDE vs ACDE = +3 +9 +5 +5 = 22
    • try all possibilities:
      AAAA vs ACDE = +3 -3 0 0 = 0 (no good)
      AAAC vs ACDE = +3 -3 0 -7 = -7 (no good)
    • ...too slow, try directed change
  • 26. Overview
      ACDE = +3 +9 +5 +5 = 22
    • change 1st pos. to all acceptable substitutions:
      gCDE = +1 +9 +5 +5 = 20 ok
      nCDE = +0 +9 +5 +5 = 19 ok
      iCDE = -1 +9 +5 +5 = 18 ok
      kCDE = -2 +9 +5 +5 = 17 ok
    • change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13
    • change 3rd pos. in combination with first position: gCnE = +1 +9 +2 +5 = 17 ok
    • continue - use recursion
    • For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence
  • 27. Neighborhood.pl
    # Calculate neighborhood of the 3-letter word in @W: every word over the
    # alphabet @A that scores at least $T against it under scoring matrix %S
    my %NH;
    for (my $i = 0; $i < @A; $i++) {
        my $s1 = $S{$W[0]}{$A[$i]};
        for (my $j = 0; $j < @A; $j++) {
            my $s2 = $S{$W[1]}{$A[$j]};
            for (my $k = 0; $k < @A; $k++) {
                my $s3 = $S{$W[2]}{$A[$k]};
                my $score = $s1 + $s2 + $s3;
                my $word = "$A[$i]$A[$j]$A[$k]";
                next if $word =~ /[BZX*]/;   # skip ambiguity codes and stop
                $NH{$word} = $score if $score >= $T;
            }
        }
    }
    # Output neighborhood, highest score first, ties in alphabetical order
    foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) {
        print "$word $NH{$word}\n";
    }
  • 28. Neighborhood of the word RGD under two scoring matrices
    BLOSUM62, RGD, T = 11: RGD 17, KGD 14, QGD 13, RGE 13, EGD 12, HGD 12, NGD 12, RGN 12, AGD 11, MGD 11, RAD 11, RGQ 11, RGS 11, RND 11, RSD 11, SGD 11, TGD 11
    PAM200, RGD, T = 13: RGD 18, RGE 17, RGN 16, KGD 15, RGQ 15, KGE 14, HGD 13, KGN 13, RAD 13, RGA 13, RGG 13, RGH 13, RGK 13, RGS 13, RGT 13, RSD 13, WGD 13
  • 30. [Figure: indexed word hits*; score plotted against length of extension, trimmed to max S] *Two non-overlapping HSP’s on a diagonal within distance A
  • 32. The BLAST algorithm • Break the search sequence into words – W = 3 for proteins, W = 12 for DNA – Example: MCGPFILGTYC gives MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC • Include in the search all words that score above a certain value (T) for any search word – MCG: MCT, MCN, … – CGP: MGP, CTP, … • This list can be computed in linear time
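Both steps on this slide can be sketched briefly in Python. The word list reproduces the MCGPFILGTYC example. The neighborhood step uses a hypothetical toy scoring function (+5 for identity, -1 otherwise) and a four-letter alphabet purely for illustration; a real search would use a PAM or BLOSUM matrix over all 20 amino acids.

```python
from itertools import product


def query_words(sequence, w):
    """All overlapping words of length w in the query."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def toy_score(a, b):
    # Assumption for this sketch only: stand-in for a real substitution matrix.
    return 5 if a == b else -1


def neighborhood(word, alphabet, threshold, score=toy_score):
    """Words over the alphabet scoring at least the threshold against word."""
    neighbors = {}
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(q, c) for q, c in zip(word, candidate))
        if total >= threshold:
            neighbors["".join(candidate)] = total
    return neighbors


if __name__ == "__main__":
    print(query_words("MCGPFILGTYC", 3))
    print(sorted(neighborhood("ACD", "ACDE", threshold=9).items()))
```

The exhaustive enumeration is the "try all possibilities" approach from the earlier Overview slide; it is fine for w = 3 but grows as the alphabet size to the power w.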
  • 33. The Blast Algorithm (2) • Search for the words in the database – Word locations can be precomputed and indexed – Searching for a short string in a long string • HSP (High Scoring Pair) = A match between a query word and the database • Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A • Extend the hit until the score falls below a threshold value, S
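The two-hit criterion can be sketched as follows. This is a simplified illustration, assuming exact word matches only (no neighborhood expansion) and reporting every qualifying pair on a diagonal; the function names and test sequences are invented for the sketch, and the extension step is left out.

```python
from collections import defaultdict
from itertools import combinations


def word_hits(query, subject, w):
    """Exact word hits as (query position, subject position) pairs."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]


def two_hit_pairs(hits, w, max_distance):
    """Pairs of non-overlapping hits on one diagonal within max_distance (A)."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)

    pairs = []
    for diagonal, starts in by_diagonal.items():
        for first, second in combinations(sorted(starts), 2):
            # Non-overlapping means the starts are at least w apart.
            if w <= second - first <= max_distance:
                pairs.append((diagonal, first, second))
    return sorted(pairs)


if __name__ == "__main__":
    hits = word_hits("MCGAAAPFI", "MCGTTTPFI", w=3)
    print(two_hit_pairs(hits, w=3, max_distance=40))
```

Requiring two hits before extending cuts down the number of extensions, which is where most of the run time would otherwise go.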
  • 35. BLAST parameters • Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set. • Choosing a value for w – small w: many matches to expand – big w: many words to be generated – w=4 is a good compromise • Lowering the segment extension cutoff (S) returns longer extensions for each hit. • Changing the minimum E-value changes the threshold for reporting a hit.
  • 36. Critical parameters: T, W and scoring matrix • The proper value of T depends on both the values in the scoring matrix and the balance between speed and sensitivity • Higher values of T progressively remove more word hits and reduce the search space. • A word size (W) of 1 will produce more hits than a word size of 10. In general, if T is scaled uniformly with W, smaller word sizes increase sensitivity and decrease speed. • The interplay between W, T and the scoring matrix is critical, and choosing them wisely is the most effective way of controlling the speed and sensitivity of BLAST
  • 38. Database Searching • How can we find a particular short sequence in a database of sequences (or one HUGE sequence)? • Problem is identical to local sequence alignment, but on a much larger scale. • We must also have some idea of the significance of a database hit. – Databases always return some kind of hit, how much attention should be paid to the result? • How can we determine how “unusual” a particular alignment score is?
  • 39. Significance Sentence 1: “These algorithms are trying to find the best way to match up two sequences” Sentence 2: “This does not mean that they will find anything profound” ALIGNMENT:
      THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
      :: :.. . .. ...: : ::::.. :: . : ...
      THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND-----
    12 exact matches, 14 conservative substitutions. Is this a good alignment?
  • 40. Overview • A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T • This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability
  • 41. Mathematical Basis of BLAST • Model matches as a sequence of coin tosses • Let p be the probability of a “head” – For a “fair” coin, p = 0.5 • (Erdős-Rényi) If there are n throws, then the expected length R of the longest run of heads is $R = \log_{1/p}(n)$. • Example: Suppose n = 20 for a “fair” coin: $R = \log_2(20) = 4.32$ • Trick is how to model DNA (or amino acid) sequence alignments as coin tosses.
  • 42. Mathematical Basis of BLAST • To model random sequence alignments, replace a match with a “head” and mismatch with a “tail”. AATCAT HTHHHT ATTCAG • For DNA, the probability of a “head” is 1/4 – What is it for amino acid sequences?
  • 43. Mathematical Basis of BLAST • So, for one particular alignment, the Erdős-Rényi property can be applied • What about for all possible alignments? – Consider that sequences are being shifted back and forth, dot matrix plot • The expected length of the longest match is $R = \log_{1/p}(mn)$ where m and n are the lengths of the two sequences.
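A quick numerical check of the two formulas, as a minimal sketch (the function names are chosen here, not taken from the slides):

```python
import math


def expected_longest_run(n, p):
    """Erdős-Rényi estimate R = log_{1/p}(n) for n tosses with P(head) = p."""
    return math.log(n) / math.log(1 / p)


def expected_longest_match(m, n, p):
    """Same estimate over all shifts of two sequences: R = log_{1/p}(m * n)."""
    return math.log(m * n) / math.log(1 / p)


if __name__ == "__main__":
    # Fair coin, 20 throws: the slide's example
    print(round(expected_longest_run(20, 0.5), 2))
    # Two DNA sequences of length 100 with match probability 1/4
    print(round(expected_longest_match(100, 100, 0.25), 2))
```

Because the length of the longest match grows only logarithmically with the search space mn, a chance match much longer than this estimate becomes rapidly less likely.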
  • 45. Karlin-Altschul Statistics $E = k\,m\,n\,e^{-\lambda S}$ This equation states that the number of alignments expected by chance (E) during the sequence database search is a function of the size of the search space (m*n), the normalized score (λS) and a minor constant (k, mostly 0.1). The E-value grows linearly with the product of target and query sizes. Doubling the target set size and doubling the query length have the same effect on the E-value.
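The scaling behaviour described on this slide is easy to verify numerically. In the sketch below, k = 0.1 follows the slide, while λ = 0.3 and the sequence sizes are arbitrary values assumed for illustration only.

```python
import math


def expected_hits(m, n, score, lam, k):
    """Karlin-Altschul expectation E = k * m * n * exp(-lam * score)."""
    return k * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    lam, k = 0.3, 0.1            # lam is an assumed value; k as on the slide
    m, n, score = 362, 29_085_265, 60
    base = expected_hits(m, n, score, lam, k)
    print(base)
    print(expected_hits(2 * m, n, score, lam, k) / base)  # doubling the query
    print(expected_hits(m, 2 * n, score, lam, k) / base)  # doubling the target
```

The query length and database size are the figures quoted earlier in the deck for the SWISS-PROT example. E is linear in m and n but exponential in S, so a few extra score points outweigh a doubling of the database.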
  • 47. Scoring alignments • Score: S (~R) – $S = \sum_i M(q_i, t_i) - \text{gaps}$ • Any alignment has a score • Any two sequences have (at least one) optimal alignment
  • 48. • For a particular scoring matrix and its associated gap initiation and extension costs one must calculate λ and k • Unfortunately (for gapped alignments), you can’t do this analytically and the values must be estimated empirically – The procedure involves aligning random sequences (Monte Carlo approach) with a specific scoring scheme and observing the alignment properties (scores, target frequencies and lengths)
  • 49. Significance “Monte Carlo” Approach: • Compares result to randomized result, similarly to results generated by a roulette wheel at Monte Carlo • Typical procedure for alignments – Randomize sequence A – Align to sequence B – Repeat many times (hundreds) – Keep track of optimal score • Histogram of scores …
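The procedure above can be sketched in Python as follows. The scoring scheme (match +2, mismatch -1, linear gap -2), the number of trials and the example sequence are assumptions made for this illustration; a real analysis would use a substitution matrix and affine gap penalties.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (score only, linear gaps)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


def shuffled_scores(a, b, trials=200, seed=0):
    """Scores of b against shuffled copies of a (composition preserved)."""
    rng = random.Random(seed)
    letters = list(a)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(local_alignment_score("".join(letters), b))
    return scores


def empirical_p_value(observed, scores):
    """Fraction of randomized scores at least as high as the observed one."""
    return (1 + sum(s >= observed for s in scores)) / (1 + len(scores))


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical example sequence
    observed = local_alignment_score(seq, seq)
    print(observed, empirical_p_value(observed, shuffled_scores(seq, seq)))
```

Shuffling keeps the residue composition of sequence A, so the randomized scores reflect what composition alone can produce. The add-one correction in the p-value avoids reporting exactly zero from a finite number of trials.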
  • 50. Assessing significance requires a distribution • I have a pumpkin of diameter 1 m. Is that unusual? [Figure: histogram of frequency versus diameter (m)]
  • 53. Significance Normal Distribution does NOT Fit Alignment Scores !! • In seeking optimal Alignments between two sequences, one desires those that have the highest score - i.e. one is seeking a distribution of maxima • In seeking optimal Matches between an Input Sequence and Sequence Entries in a Database, one again desires the matches that have the highest score, and these are obtained via examination of the distribution of such scores for the entries in the database - this is again a distribution of maxima. “A Normal Distribution is a distribution of Sums of independent variables rather than a sum of their Maxima.“
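  • 54. Comparing distributions Gaussian: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ Extreme value: $f(x) = \frac{1}{\sigma}\, e^{-\frac{x-\mu}{\sigma}}\, e^{-e^{-\frac{x-\mu}{\sigma}}}$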
  • 55. Alignment scores follow extreme value distributions Alignment of unrelated/random sequences results in scores following an extreme value distribution: $P(x \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$, where m, n are the sequence lengths and k, λ are free parameters. With $E = k\,m\,n\,e^{-\lambda S}$ this reads $P = 1 - e^{-E}$, or equivalently $E = -\ln(1-P)$. This can be shown analytically for ungapped alignments and has been found empirically to also hold for gapped alignments under commonly used conditions.
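The conversion between P and E is a one-liner in each direction. A small sketch, with function names chosen here for illustration:

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E), computed with expm1 for accuracy at small E."""
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """E = -ln(1 - P), computed with log1p for accuracy at small P."""
    return -math.log1p(-p_value)


if __name__ == "__main__":
    for e in (0.001, 0.1, 1.0, 10.0):
        print(e, p_from_e(e))
```

For small values the two are almost identical (P ≈ E), so the distinction only matters for marginal hits, where E approaches or exceeds 1 while P stays below 1.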
  • 56. Alignment scores follow extreme value distributions Alignment algorithms will always produce alignments, regardless of whether it is meaningful or not => important to have way of selecting significant alignments from large set of database hits. Solution: fit distribution of scores from database search to extreme value distribution; determine p-value of hit from this fitted distribution. Example: scores fitted to extreme value distribution. 99.9% of this distribution is located below score=112 => hit with score = 112 has a p-value of 0.1%
  • 57. Significance • BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores • For this reason BLAST only allows certain combinations of substitution matrices and gap penalties • This also means that the fit is based on a different data set than the one you are working on • A word of caution: BLAST tends to overestimate the significance of its matches • E-values from BLAST are fine for identifying sure hits • One should be careful using BLAST’s E-values to judge if a marginal hit can be trusted (e.g., you may want to use E-values of $10^{-4}$ to $10^{-5}$).
  • 58. Determining P-values • If we can estimate μ and σ, then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database. • For sequence matches, a scoring system and database can be parameterized by two parameters, k and λ, related to μ and σ. – It would be nice if we could compare hit significance without regard to the scoring system used!
  • 59. Bit Scores • The expected number of hits with score ≥ S is: $E = k\,m\,n\,e^{-\lambda S}$ – where m and n are the sequence lengths • Normalize the raw score using: $S' = \frac{\lambda S - \ln k}{\ln 2}$ • This gives a “bit score” S', with a standard set of units. • The new E-value is: $E = m\,n\,2^{-S'}$
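That the two E-value expressions agree follows from $2^{-S'} = e^{-(\lambda S - \ln k)} = k\,e^{-\lambda S}$. The sketch below checks this numerically; the values of λ and k are illustrative assumptions, not parameters of any particular scoring system.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lam * S - ln k) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_from_raw(m, n, raw_score, lam, k):
    """E = k * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_from_bits(m, n, bits):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.27, 0.04      # assumed values, for illustration only
    m, n, raw = 362, 29_085_265, 120
    bits = bit_score(raw, lam, k)
    print(bits, e_from_raw(m, n, raw, lam, k), e_from_bits(m, n, bits))
```

Since λ and k are folded into S', bit scores from searches with different scoring systems can be compared directly.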
  • 61. FastA Output • The distribution of scores graph of frequency of observed scores • expected curve (asterisks) according to the extreme value distribution –the theoretic curve should be similar to the observed results • deviations indicate that the fitting parameters are wrong –too weak gap penalties –compositional biases
  • 62. FastA Output
     score observed expected
      < 20      222        0 :*
        22       30        0 :*
        24       18        1 :*
        26       18       15 :*
        28       46      159 :*
        30      207      963 :*
        32     1016     3724 := *
        34     4596    10099 :==== *
        36     9835    20741 :========= *
        38    23408    34278 :==================== *
        40    41534    47814 :=================================== *
        42    53471    58447 :============================================ *
        44    73080    64473 :====================================================*=======
        46    70283    65667 :=====================================================*====
        48    64918    62869 :===================================================*==
        50    65930    57368 :===============================================*=======
        52    47425    50436 :======================================= *
        54    36788    43081 :=============================== *
        56    33156    35986 :============================ *
        58    26422    29544 :====================== *
        60    21578    23932 :================== *
        62    19321    19187 :===============*
        64    15988    15259 :============*=
        66    14293    12060 :=========*==
        68    11679     9486 :=======*==
        70    10135     7434 :======*==
  • 63. FastA Output
     score observed expected
        72     8957     5809 :====*===
        74     7728     4529 :===*===
        76     6176     3525 :==*===
        78     5363     2740 :==*==
        80     4434     2128 :=*==
        82     3823     1628 :=*==
        84     3231     1289 :=*=
        86     2474      998 :*==
        88     2197      772 :*=
        90     1716      597 :*=
        92     1430      462 :*=  :===============*========================
        94     1250      358 :*=  :============*===========================
        96      954      277 :*   :=========*=======================
        98      756      214 :*   :=======*===================
       100      678      166 :*   :=====*==================
       102      580      128 :*   :====*===============
       104      476       99 :*   :===*=============
       106      367       77 :*   :==*==========
       108      309       59 :*   :==*========
       110      287       46 :*   :=*========
       112      206       36 :*   :=*======
       114      161       28 :*   :*=====
       116      144       21 :*   :*====
       118      127       16 :*   :*====
      >120      886       13 :*   :*============================== Related
  • 64. FastA Output • A summary of the statistics and of the program parameters follows the histogram. – An important number in this summary is the Kolmogorov-Smirnov statistic, which indicates how well the actual data fit the theoretical statistical distribution. The lower this value, the better the fit, and the more reliable the statistical estimates. – In general, a Kolmogorov-Smirnov statistic under 0.1 indicates a good fit with the theoretical model. If the statistic is higher than 0.2, the statistics may not be valid, and it is recommended to repeat the search, using more stringent (more negative) values for the gap penalty parameters.
  • 65. Statistics summary • Optimal local alignment scores for pairs of random amino acid sequences of the same length follow an extreme-value distribution. For any score S, the probability of observing a score >= S is given by the Karlin-Altschul statistic $P(\text{score} \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$ • k and λ are parameters related to the position of the maximum and the width of the distribution. • Note the long tail at the right. This means that a score several standard deviations above the mean has a higher probability of arising by chance (that is, it is less significant) than if the scores followed a normal distribution.
  • 66. P-values • Many programs report P = the probability that the alignment is no better than random. The relationship between Z and P depends on the distribution of the scores from the control population, which do NOT follow the normal distribution – P <= $10^{-100}$ (exact match) – P in range $10^{-100}$ to $10^{-50}$ (sequences nearly identical, e.g. alleles or SNPs) – P in range $10^{-50}$ to $10^{-10}$ (closely related sequences, homology certain) – P in range $10^{-5}$ to $10^{-1}$ (usually distant relatives) – P > $10^{-1}$ (match probably insignificant)
  • 67. E • For database searches, most programs report E-values. The E-value of an alignment is the expected number of sequences that give the same Z-score or better if the database is probed with a random sequence. E is found by multiplying the value of P by the size of the database probed. Note that E but not P depends on the size of the database. Values of P are between 0 and 1. Values of E are between 0 and the number of sequences in the database searched: – E<=0.02 sequences probably homologous – E between 0.02 and 1 homology cannot be ruled out – E>1 you would have to expect this good a match by just chance
  • 69. Blast BLAST is actually a family of programs: • BLASTN - Nucleotide query searching a nucleotide database. • BLASTP - Protein query searching a protein database. • BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database. • TBLASTN - Protein query searching a translated nucleotide (6 frames) database. • TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.
  • 85. Tips • Be aware of what options you have selected when using BLAST, or FASTA implementations. • Treat BLAST searches as scientific experiments • So you should try your searches with the filters on and off to see whether it makes any difference to the output
  • 86. Tips: Low-complexity and Gapped Blast Algorithm • The common, Web-based ones often have default settings that will affect the outcome of your searches. By default all NCBI BLAST implementations filter out biased sequence composition from your query sequence (e.g. signal peptide and transmembrane sequences - beware!). • The SEG program has been implemented as part of the blast routine in order to mask low-complexity regions • Low-complexity regions are denoted by strings of Xs in the query sequence
  • 87. Tips • The sequence databases contain a wealth of information. They also contain a lot of errors. Contaminants … • Annotation errors, frameshifts that may result in erroneous conceptual translations. • Hypothetical proteins ? • In the words of Fox Mulder, "Trust no one."
  • 88. Tips • Once you get a match to things in the databases, check whether the match is to the entire protein, or to a domain. Don't immediately assume that a match means that your protein carries out the same function (see above). Compare your protein and the match protein(s) along their entire lengths before making this assumption.
  • 89. Tips • Domain matches can also cause problems by hiding other informative matches. For instance if your protein contains a common domain you'll get significant matches to every homologous sequence in the database. BLAST only reports back a limited number of matches, ordered by P value. • If this list consists only of matches to the same domain, cut this bit out of your query sequence and do the BLAST search again with the edited sequence (e.g. NHR).
  • 90. Tips • Do controls wherever possible. In particular when you use a particular search software for the first time. • Suitable positive controls would be protein sequences known to have distant homologues in the databases to check how good the software is at detecting such matches. • Negative controls can be employed to make sure the compositional bias of the sequence isn't giving you false positives. Shuffle your query sequence and see what difference this makes to the matches that are returned. A real match should be lost upon shuffling of your sequence.
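  • 91. Tips • Perform Controls
    #!/usr/bin/perl -w
    use strict;
    # Shuffle a FASTA sequence: keep the definition line, permute the residues
    my ($def, @seq) = <>;
    print $def;
    chomp @seq;
    @seq = split(//, join("", @seq));
    my $count = 0;
    while (@seq) {
        my $index = rand(@seq);              # random position in what is left
        my $base = splice(@seq, $index, 1);  # remove and return that residue
        print $base;
        print "\n" if ++$count % 60 == 0;    # wrap output at 60 columns
    }
    print "\n" unless $count % 60 == 0;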
  • 92. Tips • Read the footer first • View results graphically • Parse Blasts with Bioperl
  • 93. FastA vs. Blast • BLAST's major advantage is its speed. – 2-3 minutes for BLAST versus several hours for a sensitive FastA search of the whole of GenBank. • When both programs use their default setting, BLAST is usually more sensitive than FastA for detecting protein sequence similarity. – Since it doesn't require a perfect sequence match in the first stage of the search.
  • 94. FastA vs. Blast Weakness of BLAST: – The long word size it uses in the initial stage of DNA sequence similarity searches was chosen for speed, and not sensitivity. – For a thorough DNA similarity search, FastA is the program of choice, especially when run with a lowered KTup value. – FastA is also better suited to the specialised task of detecting genomic DNA regions using a cDNA query sequence, because it allows the use of a gap extension penalty of 0. BLAST, which only creates ungapped alignments, will usually detect only the longest exon, or fail altogether. • In general, a BLAST search using the default parameters should be the first step in a database similarity search strategy. In many cases, this is all that may be required to yield all the information needed, in a very short time.
  • 96. PSI-Blast 1. Old (ungapped) BLAST 2. New BLAST (allows gaps) 3. Profile -> PSI-Blast (Position Specific Iterated) • Strategy: – Multiple alignment of the hits – Calculates a position-specific score matrix (PSSM) – Searches with this matrix • In many cases it is much more sensitive to weak but biologically relevant sequence similarities
  • 97. PSI-Blast • Patterns of conservation from the alignment of related sequences can aid the recognition of distant similarities. – These patterns have been variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models. For each position in the derived pattern, every amino acid is assigned a score. (1) Highly conserved residue at a position: that residue is assigned a high positive score, and others are assigned high negative scores. (2) Weakly conserved positions: all residues receive scores near zero. (3) Position-specific scores can also be assigned to potential insertions and deletions.
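How conservation turns into position-specific scores can be sketched with a log-odds calculation. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes an ungapped alignment, a fixed pseudocount, no sequence weighting, and a toy four-letter alphabet with uniform background frequencies.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet, background, pseudocount=1.0):
    """Log-odds (base 2) score per residue for every alignment column."""
    pssm = []
    total = len(aligned) + pseudocount * len(alphabet)
    for column_index in range(len(aligned[0])):
        counts = Counter(seq[column_index] for seq in aligned)
        column = {}
        for residue in alphabet:
            frequency = (counts[residue] + pseudocount) / total
            column[residue] = math.log2(frequency / background[residue])
        pssm.append(column)
    return pssm


def score_with_pssm(pssm, segment):
    """Sum of position-specific scores for a segment of matching length."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


if __name__ == "__main__":
    alphabet = "ACDE"                      # toy alphabet for the sketch
    background = {r: 0.25 for r in alphabet}
    aligned = ["ACDA", "ACEA", "ACDA", "ACDC"]
    pssm = build_pssm(aligned, alphabet, background)
    for column in pssm:
        print({r: round(s, 2) for r, s in column.items()})
    print(score_with_pssm(pssm, "ACDA"), score_with_pssm(pssm, "EEEE"))
```

The fully conserved first column gives A a positive score and every other residue a negative one, while the mixed third column gives D and E scores closer to zero, matching points (1) and (2) on the slide.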
  • 98. Pattern • a set of alternative sequences, using “regular expressions” • Prosite (http://www.expasy.org/prosite/)
  • 99. PSSM (Position Specific Scoring Matrix)
  • 102. PSI-Blast • The power of profile methods can be further enhanced through iteration of the search procedure. – After a profile is run against a database, new similar sequences can be detected. A new multiple alignment, which includes these sequences, can be constructed, a new profile abstracted, and a new database search performed. – The procedure can be iterated as often as desired or until convergence, when no new statistically significant sequences are detected.
  • 103. PSI-Blast (1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program. (2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions. (3) The profile is compared to the protein database, again seeking local alignments using the BLAST algorithm. (4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments. (5) Finally, PSI-BLAST iterates, by returning to step (2), a specified number of times or until convergence.
  • 109. PSI-BLAST pitfalls • Avoid sequences that are too close: overfit! • Can include false homologues! Therefore check the matches carefully: include or exclude sequences based on biological knowledge. • The E-value reflects the significance of the match to the previous training set, not to the original sequence! • Choose your query sequence carefully. • Try the reverse experiment to confirm.
  • 110. Reduce overfitting risk by Cobbler • A single sequence is selected from a set of blocks and enriched by replacing the conserved regions delineated by the blocks by consensus residues derived from the blocks. • Embedding consensus residues improves performance • S. Henikoff and J.G. Henikoff; Protein Science (1997) 6:698-705.
  • 118. Installing Blast Locally • 2 flavors: NCBI/WuBlast • Executables: – ftp://ftp.ncbi.nih.gov/blast/executables/ • Database: – ftp://ftp.ncbi.nih.gov/blast/db/ • Formatdb – formatdb -i ecoli.nt -p F – formatdb -i ecoli.protein -p T • For options: blastall – blastall -p blastp -i query -d database -o output
  • 120. Main database: BLAT • BLAT: BLAST-Like Alignment Tool • Aligns the input sequence to the Human Genome • Connected to several databases, like: – mRNAs – ESTs – RepeatMasker – RefSeq – GenScan – TwinScan – UniGene – CpG Islands
  • 121. BLAT Human Genome Browser
  • 122. BLAT method • Align sequence with BLAT, get alignment info • Per BLAT hit, pick up additional info from connected databases: – mRNAs – ESTs – RepeatMasker – CpG Islands – RefSeq Genes
  • 124. Weblems W5.1: Submit the amino acid sequence of papaya papain to a BLAST (gapped and ungapped) and to a PSI-BLAST search. What are the main differences in the results? W5.2: Is there a relationship between Klebsiella aerogenes urease, Pseudomonas diminuta phosphotriesterase and mouse adenosine deaminase? Also use DALI, ClustalW and T-coffee. W5.3: Yeast two-hybrid typically yields DNA sequences. How would you find the corresponding protein? W5.4: When and why would you use tblastn? W5.5: How would you search a database if you want to restrict the search space to those entries having a secretion signal consisting of 4 consecutive (N-terminal) basic residues?