FBW
07-11-2017
Wim Van Criekinge
Google Calendar
DataBase Searching
Dynamic Programming
Reloaded
Mapping Short Reads
Bowtie / BWA
Database Searching
Fasta
Blast
Statistics
Practical Guide
Extensions
PSI-Blast
PHI-Blast
Local Blast
BLAT
Needleman-Wunsch-Complete.py
The Score Matrix
----------------
Seq1 (j)       0   1   2   3   4   5   6   7
               *   C   K   H   V   F   C   R
Seq2 (i)
 0   *         0  -1  -2  -3  -4  -5  -6  -7
 1   C        -1   1   0  -1  -2  -3  -4  -5
 2   K        -2   0   2   1   0  -1  -2  -3
 3   K        -3  -1   1   1   0  -1  -2  -3
 4   C        -4  -2   0   0   0  -1   0  -1
 5   F        -5  -3  -1  -1  -1   1   0  -1
 6   C        -6  -4  -2  -2  -2   0   2   1
 7   K        -7  -5  -3  -3  -3  -1   1   1
 8   C        -8  -6  -4  -4  -4  -2   0   0
 9   V        -9  -7  -5  -5  -3  -3  -1  -1
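Each cell takes the maximum of three candidate scores (A = diagonal, B = up, C = left):
A: matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH
   MATCH if (substr(seq1,j-1,1) eq substr(seq2,i-1,1)), otherwise MISMATCH
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP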
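The fill rule above can be sketched in a few lines of Python. This is a minimal illustration, not the course's Needleman-Wunsch-Complete.py: the function name nw_matrix and the scoring values MATCH = 1, MISMATCH = -1, GAP = -1 are assumptions, chosen because they reproduce the score matrix shown on the slide.

```python
def nw_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill a Needleman-Wunsch score matrix (rows follow seq2, columns follow seq1)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First row and first column: alignments that consist only of gaps.
    for j in range(1, cols):
        matrix[0][j] = j * gap
    for i in range(1, rows):
        matrix[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[j - 1] == seq2[i - 1]
            diag = matrix[i - 1][j - 1] + (match if same else mismatch)  # A
            up = matrix[i - 1][j] + gap                                  # B
            left = matrix[i][j - 1] + gap                                # C
            matrix[i][j] = max(diag, up, left)
    return matrix


for row in nw_matrix("CKHVFCR", "CKKCFCKCV"):
    print(" ".join(f"{value:3d}" for value in row))
```

The bottom-right cell holds the score of the best global alignment; recovering the alignment itself would need a traceback, which this sketch leaves out.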
Extensions to basic dynamic programming method
use gap penalties
– constant gap penalty for gap > 1
– gap penalty proportional to gap size
• one penalty for starting a gap (gap
opening penalty)
• different (lower) penalty for adding to a
gap (gap extension penalty)
use blosum62
• instead of MATCH and MISMATCH
Dynamic Programming: Needleman-Wunsch-Complete.py
•The most practical and widely used
method in multiple sequence alignment
is the hierarchical extension of pairwise
alignment methods.
•The principle is that a multiple alignment
is achieved by successive application of
pairwise methods.
• First do all pairwise alignments (not just one
sequence with all others)
• Then combine pairwise alignments to generate
overall alignment
Multiple Alignment Method
• Multiple Sequence Alignment:
–ClustalW, MSA
• Short Read Sequence Alignment:
–BWA, Bowtie
• Database Search:
–BLAST, FASTA, HMMER
• Genomic Analysis:
–BLAT
Sequence Alignment Tools
Read Length is Not As Important For Resequencing
[Figure: % of paired k-mers with a uniquely assignable location (0% to 100%) versus length of k-mer reads (8 to 20 bp), plotted for E. coli and human. Credit: Jay Shendure]
Short Read Alignment Software
Bowtie: memory-efficient short read aligner. It
aligns short DNA sequences (reads) to the human
genome at a rate of over 25 million 35-bp reads per
hour.
Burrows-Wheeler Aligner (BWA): an aligner that
implements two algorithms: bwa-short and BWA-SW.
The former works for query sequences shorter
than 200 bp and the latter for longer sequences
up to around 100 kbp.
Mapping Reads Back
• Hash Table (Lookup table)
– FAST, but requires perfect matches. [O(m n + N)]
• Array Scanning
– Can handle mismatches, but not gaps. [O(m N)]
• Dynamic Programming (Smith Waterman)
– Indels
– Mathematically optimal solution
– Slow (most programs use Hash Mapping as a prefilter) [O(mnN)]
• Burrows-Wheeler Transform (BW Transform)
– FAST. [O(m + N)] (without mismatch/gap)
– Memory efficient.
– But for gaps/mismatches, it lacks sensitivity
Why Burrows-Wheeler?
• BWT very compact:
– Approximately ½ byte per base
– As large as the original text, plus a few
“extras”
– Can fit onto a standard computer with 2GB of
memory
• Linear-time search algorithm
– proportional to length of query for exact
matches
Burrows-Wheeler Transform (BWT)
T = acaacg$
Burrows-Wheeler Matrix (BWM): all rotations of T, sorted
$acaacg
aacg$ac
acaacg$
acg$aca
caacg$a
cg$acaa
g$acaac
BWT(T) = gc$aaac (the last column of the BWM)
Burrows-Wheeler Transform
T = a b a a b a $
Sort all rotations of T to obtain the Burrows-Wheeler Matrix:
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
BWT(T) = last column = a b b a $ a a
Burrows M,Wheeler DJ: A block sorting lossless data compression algorithm.
Digital Equipment Corporation, Palo Alto, CA 1994, Technical Report 124; 1994
Reversible permutation of the characters of a string, used originally for compression
How is it reversible? How is it useful for compression? How is it an index?
Burrows-Wheeler Transform
def rotations(t):
    """ Return list of rotations of input string t """
    tt = t * 2  # doubling t turns every rotation into a plain slice
    return [tt[i:i+len(t)] for i in range(len(t))]

def bwm(t):
    """ Return lexicographically sorted list of t's rotations """
    return sorted(rotations(t))

def bwtViaBwm(t):
    """ Given T, returns BWT(T) by way of the BWM """
    return ''.join(row[-1] for row in bwm(t))
Make list of all rotations
Sort them
Take last column
>>> bwtViaBwm("Tomorrow_and_tomorrow_and_tomorrow$")
'w$wwdd__nnoooaattTmmmrrrrrrooo__ooo'
>>> bwtViaBwm("It_was_the_best_of_times_it_was_the_worst_of_times$")
's$esttssfftteww_hhmmbootttt_ii__woeeaaressIi_______'
>>> bwtViaBwm('in_the_jingle_jangle_morning_Ill_come_following_you$')
'u_gleeeengj_mlhl_nnnnt$nwj__lggIolo_iiiiarfcmylo_oo_'
Python example: BWT_v2.py
Key observation – T ranking
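T with T-ranks: a1 c1 a2 a3 c2 g1 $1
BWM rows, showing the T-rank of the first and the last character:
$1 acaac g1
a2 acg$a c1
a1 caacg $1
a3 cg$ac a2
c1 aacg$ a1
c2 g$aca a3
g1 $acaa c2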
“last first (LF) mapping”
The i-th occurrence of character X in
the last column corresponds to
the same text character as the i-th
occurrence of X in the first column.
Burrows-Wheeler Transform: LF Mapping
BWM with T-ranking:
F                  L
$  a0 b0 a1 a2 b1 a3
a3 $  a0 b0 a1 a2 b1
a1 a2 b1 a3 $  a0 b0
a2 b1 a3 $  a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $  a0 b0 a1 a2
b0 a1 a2 b1 a3 $  a0
LF Mapping: The ith occurrence of a character c in L and the ith occurrence of c
in F correspond to the same occurrence in T
However we rank occurrences of c, ranks appear in the same order in F and L
Why does the LF Mapping hold?
Burrows-Wheeler Transform: LF Mapping
BWM with B-ranking:
F                  L
$  a3 b1 a1 a2 b0 a0
a0 $  a3 b1 a1 a2 b0
a1 a2 b0 a0 $  a3 b1
a2 b0 a0 $  a3 b1 a1
a3 b1 a1 a2 b0 a0 $
b0 a0 $  a3 b1 a1 a2
b1 a1 a2 b0 a0 $  a3
Ranks in F are ascending within each block of equal characters.
F now has very simple structure: a $, a block of as with ascending ranks, a
block of bs with ascending ranks
Burrows-Wheeler Transform
F L
$ a0
a0 b0
a1 b1
a2 a1
a3 $
b0 a2
b1 a3   <- row 6
Which BWM row begins with b1?
Skip row starting with $ (1 row)
Skip rows starting with a (4 rows)
Skip row starting with b0 (1 row)
Answer: row 6
Burrows-Wheeler Transform
Say T has 300 As, 400 Cs, 250 Gs and 700 Ts and $ < A < C < G < T
Which BWM row (0-based) begins with G100? (Ranks are B-ranks.)
Skip row starting with $ (1 row)
Skip rows starting with A (300 rows)
Skip rows starting with C (400 rows)
Skip first 100 rows starting with G (100 rows)
Answer: row 1 + 300 + 400 + 100 = row 801
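The row arithmetic above only needs the character counts of T. A small sketch, assuming a hypothetical helper named first_column_row and relying on the fact that Python's default character order puts $ before the letters:

```python
def first_column_row(counts, char, rank):
    """0-based BWM row whose first character is `char` with B-rank `rank`.

    counts maps each character of T (including '$') to its number of occurrences.
    """
    if not 0 <= rank < counts[char]:
        raise ValueError("rank out of range for this character")
    row = 0
    for c in sorted(counts):      # '$' < 'A' < 'C' < 'G' < 'T' in ASCII
        if c == char:
            return row + rank     # skip `rank` earlier rows of the same character
        row += counts[c]          # skip the whole block of a smaller character
    raise KeyError(char)


print(first_column_row({"$": 1, "A": 300, "C": 400, "G": 250, "T": 700}, "G", 100))
print(first_column_row({"$": 1, "a": 4, "b": 2}, "b", 1))
```

Because F is just the sorted characters of T, no part of the matrix has to be stored to answer this question.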
Burrows-Wheeler Transform: reversing
Reverse BWT(T) starting at right-hand-side of T and moving left
F  L
$  a0
a0 b0
a1 b1
a2 a1
a3 $
b0 a2
b1 a3
Start in the first row. F must have $. L contains the
character just prior to $: a0
a0: LF Mapping says this is the same occurrence of a
as the first a in F. Jump to the row beginning with a0. L
contains the character just prior to a0: b0.
Repeat for b0, get a2
Repeat for a2, get a1
Repeat for a1, get b1
Repeat for b1, get a3
Repeat for a3, get $, done
Reverse of chars we visited = a3 b1 a1 a2 b0 a0 $ = T
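The walk described above can be written as a short Python sketch. The helper names (rank_bwt, first_col, reverse_bwt) are illustrative choices rather than the contents of BWT_v2.py, and the code assumes the text ends in a single $ that sorts before every other character.

```python
def rank_bwt(bw):
    """Return the B-rank of every character in bw and the total count per character."""
    totals = {}
    ranks = []
    for c in bw:
        ranks.append(totals.get(c, 0))
        totals[c] = totals.get(c, 0) + 1
    return ranks, totals


def first_col(totals):
    """Map each character to the first row of its block in column F."""
    first = {}
    row = 0
    for c in sorted(totals):
        first[c] = row
        row += totals[c]
    return first


def reverse_bwt(bw):
    """Rebuild T from BWT(T) by repeated LF mapping, right to left."""
    ranks, totals = rank_bwt(bw)
    first = first_col(totals)
    row = 0                  # row 0 is the rotation that starts with '$'
    t = "$"
    while bw[row] != "$":
        c = bw[row]
        t = c + t            # L holds the character preceding the one in F
        row = first[c] + ranks[row]   # jump to the row where this occurrence is in F
    return t


print(reverse_bwt("abba$aa"))
```

Only the last column and the character counts are used, which is why the BWT can serve as a compact stand-in for the whole matrix.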
Burrows-Wheeler Transform: reversing
Another way to visualize reversing BWT(T):
[Figure: the F and L columns drawn side by side six times, with the LF-mapping jumps traced from row to row as T is rebuilt from right to left]
T: a3 b1 a1 a2 b0 a0 $
BWT is useful for compression:
Sorts characters by right-context, making a more compressible string
It’s reversible:
Repeated applications of LF Mapping, recreating T from right to left
FM - Index
Burrows-Wheeler Transform
Steps in using BWA
Download and install BWA on Linux/Mac.
Export the path or use the exact path.
bunzip2 bwa-0.5.9.tar.bz2
tar xvf bwa-0.5.9.tar
cd bwa-0.5.9
make
Download the reference genome using wget.
Create the index for the reference genome
(assuming the reference sequences are in wg.fa).
Only needs to be performed once for each
genome. The bwtsw algorithm used here suits large
genomes; use -a is for small genomes.
bwa index -p hg19bwaidx -a bwtsw wg.fa
• Mapping short reads to the reference genome.
• 1. Align sequences using multiple threads (e.g. 4 CPUs).
Assume the short reads are in the s_3_sequence.txt.gz file.
bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
• 2. Create alignment in the SAM format (a generic format for
storing large nucleotide sequence alignments):
bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
Mapping long reads can be done using the
bwasw command:
bwa bwasw hg19bwaidx 454seqs.txt > 454seqs.sam
Sequence Alignment/Map Format
Sequence Reads + Reference Sequence -> Alignment Software -> SAM File -> Resequencing, RNA Seq, SNPs
Reads: Illumina reads.
Reference: whole genome, contig, chromosome.
Alignment software: BWA, Bowtie
Most of the analysis happens when considering the SAM files.
SAM format
“A tab-delimited text format consisting of a header section,
which is optional, and an alignment section”
https://samtools.github.io/hts-specs/SAMv1.pdf
Example of CIGAR
RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Reference: C C A T A C T G A A C T G A C T A A C
Read: A C T A G A A T G G C T
In the SAM file you will have the following fields:
• POS:5
• CIGAR: 3M1I3M1D5M
The POS indicates that the read aligns starting at position 5 on the reference. The
CIGAR says that the first 3 bases in the read sequence align with the reference.
The next base in the read does not exist in the reference. Then 3 bases align with
the reference. The next reference base does not exist in the read sequence, then 5
more bases align with the reference. Note that at position 14, the base in the read
is different than the reference, but it still counts as an M since it aligns to that
position.
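The bookkeeping in this example can be checked with a short Python sketch. The function names are hypothetical; the sets of operations that consume reference or read bases follow the SAM specification linked above.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")    # operations that advance along the reference
READ_CONSUMING = set("MIS=X")   # operations that advance along the read


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def cigar_spans(pos, cigar):
    """Return (first reference position, last reference position, read length)."""
    ops = parse_cigar(cigar)
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    read_len = sum(n for n, op in ops if op in READ_CONSUMING)
    return pos, pos + ref_len - 1, read_len


print(parse_cigar("3M1I3M1D5M"))
print(cigar_spans(5, "3M1I3M1D5M"))
```

For the read above this gives reference positions 5 to 16 and a read length of 12, which matches the 12 bases in the Read line.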
Harvesting Information from SAM
• Query name, QNAME (SAM)/read_name (BAM).
• FLAG provides the following information:
– are there multiple fragments?
– are all fragments properly aligned?
– is this fragment unmapped?
– is the next fragment unmapped?
– is this query the reverse strand?
– is the next fragment the reverse strand?
– is this the last fragment?
– is this a secondary alignment?
– did this read fail quality controls?
– is this read a PCR or optical duplicate?
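Each of these questions is one bit of the FLAG integer. A minimal decoding sketch follows; the bit values are those of the SAM specification (which also defines a "first fragment" bit, 0x40, and a supplementary bit, 0x800), while the short labels are hypothetical names chosen for this example.

```python
SAM_FLAG_BITS = [
    (0x1, "paired"),            # are there multiple fragments?
    (0x2, "proper_pair"),       # are all fragments properly aligned?
    (0x4, "unmapped"),          # is this fragment unmapped?
    (0x8, "mate_unmapped"),     # is the next fragment unmapped?
    (0x10, "reverse"),          # is this query on the reverse strand?
    (0x20, "mate_reverse"),     # is the next fragment on the reverse strand?
    (0x40, "first_fragment"),
    (0x80, "last_fragment"),    # is this the last fragment?
    (0x100, "secondary"),       # is this a secondary alignment?
    (0x200, "qc_fail"),         # did this read fail quality controls?
    (0x400, "duplicate"),       # is this read a PCR or optical duplicate?
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


print(decode_flag(99))
print(decode_flag(4))
```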
BAM
• BAM is a compressed version of the SAM file format.
• BAM is compressed in the BGZF format. All multi-byte numbers
in BAM are little-endian, regardless of the machine endianness.
As an example, suppose we have the hexadecimal number
12345678. In little-endian order its four bytes are stored as
78 56 34 12, least significant byte first.
• There are multiple programs that convert BAM files to SAM files
and vice versa (eg samtools)
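The byte order can be inspected directly with Python's standard struct module. This is only an illustration of little-endian packing for a 32-bit unsigned integer, not a BAM reader.

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)   # "<" = little-endian, "I" = unsigned 32-bit
big = struct.pack(">I", value)      # ">" = big-endian, shown for comparison

print(little.hex())   # byte order as stored in BAM
print(big.hex())
print(hex(struct.unpack("<I", little)[0]))
```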
• Consider the task of searching SWISSPROT with a
query sequence:
• say our query sequence is 362 amino acids long
• SWISSPROT release 38 contains 29,085,265 amino acids
• finding local alignments via dynamic programming would
entail O(10^10) matrix operations (362 x 29,085,265 ≈ 1.05 x 10^10)
• Given the size of databases, more efficient methods
are needed
Database Searching
FASTA (Pearson 1995)
Uses heuristics to avoid
calculating the full dynamic
programming matrix
Speed up searches by an
order of magnitude
compared to full Smith-
Waterman
The statistical side of FASTA is
still stronger than BLAST
BLAST (Altschul 1990, 1997)
Uses rapid word lookup
methods to completely skip
most of the database
entries
Extremely fast
One order of magnitude
faster than FASTA
Two orders of magnitude
faster than Smith-
Waterman
Almost as sensitive as FASTA
Heuristic approaches to DP for database searching
« Hit and extend heuristic»
• Problem: Too many calculations “wasted” by
comparing regions that have nothing in common
• Initial insight: Regions that are similar between two
sequences are likely to share short stretches that are
identical
• Basic method: Look for similar regions only near short
stretches that match exactly
FASTA
FASTA-Stages
1. Find k-tups in the two sequences (k=1,2 for proteins,
4-6 for DNA sequences)
2. Score and select top 10 scoring “local diagonals”
3. Rescan top 10 regions, score with PAM250 (proteins)
or DNA scoring matrix. Trim off the ends of the regions
to achieve highest scores.
4. Try to join regions with gapped alignments. Join if
similarity score is one standard deviation above
average expected score
5. After finding the best initial region, FASTA performs a
global alignment of a 32 residue wide region centered
on the best initial region, and uses the score as the
optimized score.
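Stages 1 and 2 can be illustrated with a small Python sketch: index the k-tups of one sequence, look up the k-tups of the other, and count hits per diagonal. The function names are hypothetical, and ranking diagonals by raw hit count is a simplification of FASTA's actual diagonal scoring.

```python
from collections import Counter, defaultdict


def ktup_index(seq, k):
    """Map every k-tup of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=2):
    """Count shared k-tups per diagonal (diagonal = query position - target position)."""
    index = ktup_index(query, k)
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals


def top_diagonals(query, target, k=2, n=10):
    """Return the n diagonals with the most k-tup hits."""
    return diagonal_hits(query, target, k).most_common(n)


print(top_diagonals("ACDEFG", "XXACDEFG", k=2))
```

Hits that share a diagonal correspond to an ungapped stretch of the alignment, which is why the later stages only rescan the best few diagonals.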
• Sensitivity: the ability of a program to identify weak but
biologically significant sequence similarity.
• Selectivity: the ability of a program to discriminate
between true matches and matches occurring by chance
alone.
• A decrease in selectivity results in more false positives being
reported.
FastA
FastA (http://www.ebi.ac.uk/fasta33/)
Blosum50
default.
Lower PAM
higher blosum
to detect close
sequences
Higher PAM and
lower blosum
to detect distant
sequences
Gap opening penalty
-12, -16 by default
for fasta with
proteins and DNA,
respectively
Gap extension
penalty -2, -4 by
default for fasta
with proteins and
DNA, respectively
The larger the
word-length the
less sensitive, but
faster the search
will be
Max number of
scores and
alignments is 100
FastA Output
Database
code
hyperlinked
to the SRS
database at
EBI
Accession
number
Description Length
Initn, init1, opt, z-
score calculated
during run
E score -
expectation
value, how
many hits are
expected to be
found by
chance with
such a score
while
comparing
this query to
this database.
E() does not
represent the
% similarity
Query: DNA Protein
Database:DNA Protein
FastA is a family of programs
FastA, TFastA, FastX, FastY
FASTA can miss significant similarity since
• For proteins, similar sequences do not have to share identical
residues
•Asp-Lys-Val is quite similar to Glu-Arg-Ile, yet it is missed even
with ktuple size of 1 since no amino acid matches
•Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly,
but there is no match with ktuple size of 2
FASTA problems
FASTA can miss significant similarity since
• For nucleic acids, due to codon “wobble”, DNA sequences
may look like XXyXXyXXy where X’s are conserved and y’s are
not
•GGuUCuACgAAg and GGcUCcACaAAa
both code for the same peptide sequence (Gly-Ser-Thr-Lys) but they
don’t match with ktuple size of 3 or higher
FASTA problems
BLAST - Basic Local Alignment
Search Tool
What does BLAST do?
• Search a large target set of sequences...
• …for hits to a query sequence...
• …and return the alignments and scores from those hits...
• Do it fast.
Show me those sequences that deserve a second look. Blast
programs were designed for fast database searching, with
minimal sacrifice of sensitivity to distantly related sequences.
The big red button
Do My Job
It is dangerous to hide too much of the
underlying complexity from the scientists.
• Approach: find segment pairs by first finding word
pairs that score above a threshold, i.e., find word pairs
of fixed length w with a score of at least T
• Key concept “Neighborhood”: Seems similar to FASTA,
but we are searching for words which score above T
rather than words that match exactly
• Calculate the neighborhood (T) for substrings of the query (size
W)
Overview
Compile a list of words which give a score above T when paired with the query
sequence.
• Example using PAM-120 for query sequence ACDE (w=4, T=17):
A C D E
A C D E = +3 +9 +5 +5 = 22
• try all possibilities:
A A A A = +3 -3 0 0 = 0 no good
A A A C = +3 -3 0 -7 = -7 no good
• ...too slow, try directed change
Overview
A C D E
A C D E = +3 +9 +5 +5 = 22
• change 1st pos. to all acceptable substitutions
g C D E = +1 +9 +5 +5 = 20 ok
n C D E = +0 +9 +5 +5 = 19 ok
i C D E = -1 +9 +5 +5 = 18 ok
k C D E = -2 +9 +5 +5 = 17 ok
• change 2nd pos.: can't - all alternatives negative and the other three positions only add up
to 13
• change 3rd pos. in combination with first position
g C n E = +1 +9 +2 +5 = 17 ok
• continue - use recursion
• For "best" values of w and T there are typically
about 50 words in the list for every residue in the
query sequence
Overview
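The "directed change" idea can be sketched as a pruned recursion: extend a candidate word one position at a time and abandon it as soon as even the best possible remaining positions cannot reach T. The score table below is a toy one that contains only the PAM-120 values quoted on the slides; every other pair falls back to an assumed default of -4, so the resulting word list is illustrative and not the true PAM-120 neighborhood. All names here are hypothetical and this is not the course's Blast_Neighbourhood.py.

```python
# (query residue, candidate residue) -> score, values quoted on the slides.
SLIDE_SCORES = {
    ("A", "A"): 3, ("C", "C"): 9, ("D", "D"): 5, ("E", "E"): 5,
    ("A", "G"): 1, ("A", "N"): 0, ("A", "I"): -1, ("A", "K"): -2,
    ("D", "N"): 2,
}
DEFAULT = -4            # assumption for every pair not listed above
ALPHABET = "ACDEGIKN"   # reduced alphabet, enough for this example


def score(q, c):
    return SLIDE_SCORES.get((q, c), DEFAULT)


def neighborhood(word, threshold):
    """All candidate words scoring >= threshold against `word`, best first."""
    # best_rest[i] = highest score still obtainable from positions i..end
    best_rest = [0] * (len(word) + 1)
    for i in range(len(word) - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(score(word[i], a) for a in ALPHABET)

    hits = []

    def extend(prefix, total):
        i = len(prefix)
        if i == len(word):
            hits.append((prefix, total))
            return
        for a in ALPHABET:
            s = total + score(word[i], a)
            if s + best_rest[i + 1] >= threshold:   # prune hopeless branches
                extend(prefix + a, s)

    extend("", 0)
    return sorted(hits, key=lambda h: (-h[1], h[0]))


for candidate, s in neighborhood("ACDE", 17):
    print(candidate, s)
```

With a full substitution matrix and the 20-letter alphabet the same recursion applies unchanged; only the score lookup would differ.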
Blast_Neighbourhood.py
BLOSUM62, query word RGD, threshold T = 11
RGD 17
KGD 14
QGD 13
RGE 13
EGD 12
HGD 12
NGD 12
RGN 12
AGD 11
MGD 11
RAD 11
RGQ 11
RGS 11
RND 11
RSD 11
SGD 11
TGD 11
PAM200, query word RGD, threshold T = 13
RGD 18
RGE 17
RGN 16
KGD 15
RGQ 15
KGE 14
HGD 13
KGN 13
RAD 13
RGA 13
RGG 13
RGH 13
RGK 13
RGS 13
RGT 13
RSD 13
WGD 13
[Figure: score versus length of extension; the extension is trimmed back to the point of maximum score. Word hits are found via the index. * = two non-overlapping HSP’s on a diagonal within distance A]
The BLAST algorithm
• Break the search sequence into words
• W = 3 for proteins, W = 12 for DNA
• Include in the search all words that score above a certain value (T) for
any search word
MCGPFILGTYC
MCG
CGP
MCG, CGP, GPF, PFI, FIL,
ILG, LGT, GTY, TYC
MCG CGP
MCT MGP …
MCN CTP
… …
This list can be
computed in linear
time
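Breaking the query into words and looking them up in a precomputed index can be sketched as follows. The function names are hypothetical, and only exact word matches are looked up here; the neighborhood expansion with threshold T is left out to keep the example short.

```python
from collections import defaultdict


def query_words(seq, w=3):
    """All overlapping words of length w, in query order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def word_index(database, w=3):
    """Precomputed lookup table: word -> start positions in the database sequence."""
    index = defaultdict(list)
    for j in range(len(database) - w + 1):
        index[database[j:j + w]].append(j)
    return index


def word_hits(query, database, w=3):
    """(query position, database position) for every exact word match."""
    index = word_index(database, w)
    return [(i, j)
            for i, word in enumerate(query_words(query, w))
            for j in index.get(word, ())]


print(query_words("MCGPFILGTYC"))
print(word_hits("MCGP", "AAMCGPAA"))
```

Both the word list and the index are built in a single pass, which is the linear-time property the slide refers to.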
The Blast Algorithm (2)
• Search for the words in the database
• Word locations can be precomputed and indexed
• Searching for a short string in a long string
• HSP (High Scoring Pair) = A match between a query word and the
database
• Find a “hit”: Two non-overlapping HSP’s on a diagonal within
distance A
• Extend the hit until the score falls below a threshold value, S
                                   Homologous sequences   Non-homologous sequences
Sequences reported as related      True positives         False positives
Sequences reported as unrelated    False negatives        True negatives
Sensitivity:
ability to find
true positives
Specificity:
ability to minimize
false positives
BLAST parameters
• Lowering the neighborhood word threshold (T) allows
more distantly related sequences to be found, at the
expense of increased noise in the results set.
• Choosing a value for w
• small w: many matches to expand
• big w: many words to be generated
• w=4 is a good compromise
• Lowering the segment extension cutoff (S) returns
longer extensions for each hit.
• Changing the minimum E-value changes the threshold
for reporting a hit.
Critical parameters: T,W and scoring matrix
•The proper value of T depends on both the
values in the scoring matrix and the balance
between speed and sensitivity
•Higher values of T progressively remove more
word hits and reduce the search space.
•Word size (W) of 1 will produce more hits
than a word size of 10. In general, if T is scaled
uniformly with W, smaller word sizes increase
sensitivity and decrease speed.
•The interplay between W, T and the scoring
matrix is critical, and choosing them wisely is
the most effective way of controlling the
speed and sensitivity of BLAST
Database Searching
• How can we find a particular short sequence in a database of
sequences (or one HUGE sequence)?
• Problem is identical to local sequence alignment, but on a much
larger scale.
• We must also have some idea of the significance of a database hit.
• Databases always return some kind of hit, how much attention should be
paid to the result?
• How can we determine how “unusual” a particular alignment score
is?
Sentence 1:
“These algorithms are trying to find the best way to match up two sequences”
Sentence 2:
“This does not mean that they will find anything profound”
ALIGNMENT:
THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
:: :.. . .. ...: : ::::.. :: . : ...
THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------
12 exact matches
14 conservative substitutions
Is this a good alignment?
Significance
• A key to the utility of BLAST is the ability to calculate
expected probabilities of occurrence of Maximum
Segment Pairs (MSPs) given w and T
• This allows BLAST to rank matching sequences in order
of “significance” and to cut off listings at a user-
specified probability
Overview
Mathematical Basis of BLAST
•Model matches as a sequence of coin tosses
•Let p be the probability of a “head”
• For a “fair” coin, p = 0.5
•(Erdös-Rényi) If there are n throws, then the expected
length R of the longest run of heads is
$R = \log_{1/p}(n)$.
•Example: Suppose n = 20 for a “fair” coin
$R = \log_2(20) \approx 4.32$
•Trick is how to model DNA (or amino acid) sequence
alignments as coin tosses.
Mathematical Basis of BLAST
•To model random sequence alignments, replace a
match with a “head” and mismatch with a “tail”.
•For DNA, the probability of a “head” is 1/4
• What is it for amino acid sequences?
AATCAT
ATTCAG
HTHHHT
Mathematical Basis of BLAST
• So, for one particular alignment, the Erdös-Rényi
property can be applied
• What about for all possible alignments?
• Consider that sequences are being shifted back and forth, dot
matrix plot
• The expected length of the longest match is
$R = \log_{1/p}(mn)$
where m and n are the lengths of the two sequences.
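Both forms of the Erdös-Rényi estimate are one-line logarithms. A small sketch with hypothetical function names, using p = 0.5 for the fair coin and p = 1/4 for DNA as on the slides:

```python
import math


def expected_longest_run(n, p):
    """Expected longest run of heads in n tosses: log base 1/p of n."""
    return math.log(n) / math.log(1 / p)


def expected_longest_match(m, n, p):
    """Expected longest exact match between sequences of length m and n: log base 1/p of m*n."""
    return math.log(m * n) / math.log(1 / p)


print(round(expected_longest_run(20, 0.5), 2))          # fair coin, 20 throws
print(round(expected_longest_match(100, 100, 0.25), 2)) # two DNA sequences of 100 bp
```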
Analytical derivation
Erdös-Rényi
…
…
…
Karlin-Altschul
Karlin-Altschul Statistics
$E = k\,m\,n\,e^{-\lambda S}$
This equation states that the number of alignments expected by chance (E)
during the sequence database search is a function of the size of the search
space (m*n), the normalized score (λS) and a minor constant (k, mostly 0.1)
E-Value grows linearly with the product of target
and query sizes. Doubling target set size and
doubling query length have the same effect on e-
value
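The linear dependence on search-space size is easy to check numerically. In this sketch k = 0.1 follows the slide, while the value lam = 0.3 and the function name expected_hits are placeholders chosen only for illustration; real values of k and λ depend on the scoring matrix and gap costs.

```python
import math


def expected_hits(m, n, score, k=0.1, lam=0.3):
    """Karlin-Altschul expectation E = k * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * score)


base = expected_hits(m=100, n=1_000_000, score=60)
double_query = expected_hits(m=200, n=1_000_000, score=60)
double_target = expected_hits(m=100, n=2_000_000, score=60)

print(base, double_query, double_target)
```

Doubling either m or n doubles E, whereas raising the score S lowers E exponentially.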
Analytical derivation
Erdös-Rényi: $R = \log_{1/p}(mn)$
…
Karlin-Altschul: $E = k\,m\,n\,e^{-\lambda S}$
Scoring alignments
•Score: S (~R)
•$S = \sum_i M(q_i, t_i) - \sum \text{gap penalties}$
•Any alignment has a score
•Any two sequences have (at least) one optimal
alignment
• For a particular scoring matrix and its associated gap initiation and
extension costs one must calculate λ and k
• Unfortunately (for gapped alignments), you can’t do this analytically
and the values must be estimated empirically
• The procedure involves aligning random sequences (Monte Carlo approach)
with a specific scoring scheme and observing the alignment properties
(scores, target frequencies and lengths)
“Monte Carlo” Approach:
•Compares result to randomized result, similarly to
results generated by a roulette wheel at Monte
Carlo
•Typical procedure for alignments
• Randomize sequence A
• Align to sequence B
• Repeat many times (hundreds)
• Keep track of optimal score
• Histogram of scores …
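The procedure above can be sketched with a small local aligner and a shuffling loop. The scoring values (match 2, mismatch -1, linear gap -2), the function names, and the example sequence are all assumptions made for this illustration; a real analysis would use a substitution matrix and affine gap costs.

```python
import random


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score, using two rows of memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffled_scores(seq_a, seq_b, trials=200, seed=0):
    """Scores of randomized copies of seq_a aligned to seq_b (the null distribution)."""
    rng = random.Random(seed)
    letters = list(seq_a)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(sw_score("".join(letters), seq_b))
    return scores


def empirical_p(observed, null_scores):
    """Fraction of null scores >= observed, with a +1 correction to avoid p = 0."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


seq = "MKVLAAGIVGLLLAQWERT"      # made-up sequence for the demonstration
observed = sw_score(seq, seq)
null = shuffled_scores(seq, seq, trials=100)
print(observed, max(null), empirical_p(observed, null))
```

A histogram of the null scores is what gets compared with the fitted extreme value distribution.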
Significance
Assessing significance requires a distribution
•I have a pumpkin of diameter 1m. Is that unusual?
[Figure: histogram of frequency versus diameter (m)]
• In seeking optimal Alignments between two sequences,
one desires those that have the highest score - i.e. one is
seeking a distribution of maxima
• In seeking optimal Matches between an Input Sequence
and Sequence Entries in a Database, one again desires the
matches that have the highest score, and these are
obtained via examination of the distribution of such scores
for the entries in the database - this is again a distribution
of maxima.
“A Normal Distribution is a distribution of Sums of
independent variables rather than a sum of their Maxima.“
Normal Distribution does NOT Fit Alignment Scores !!
Significance
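Comparing distributions
Gaussian: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Extreme value: $f(x) = \frac{1}{\beta}\, e^{-\frac{x-\mu}{\beta}}\, e^{-e^{-\frac{x-\mu}{\beta}}}$
$P(x \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$
m, n: sequence lengths.
k, λ: free parameters.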
This can be shown analytically for ungapped alignments and has
been found empirically to also hold for gapped alignments under
commonly used conditions.
Alignment of unrelated/random sequences result in scores
following an extreme value distribution
Alignment scores follow extreme value distributions
Relation between the E-value and the P-value:
$P = 1 - e^{-E}$
$E = -\ln(1 - P)$
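The two conversions can be written with math.expm1 and math.log1p, which stay accurate for the very small values that matter in practice. The function names are hypothetical.

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """E = -ln(1 - P)."""
    return -math.log1p(-p_value)


for e in (1e-6, 0.01, 1.0, 5.0):
    print(e, p_from_e(e))
```

For small values E and P are nearly identical; they only diverge once E approaches 1, where P levels off below 1.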
Alignment algorithms will always produce alignments,
regardless of whether it is meaningful or not
=> important to have way of selecting significant alignments
from large set of database hits.
Solution: fit distribution of scores from database search to
extreme value distribution; determine p-value of hit from this
fitted distribution.
Example: scores fitted to
extreme value distribution.
99.9% of this distribution is
located below score=112
=> hit with score = 112 has a
p-value of 0.1%
Alignment scores follow extreme value distributions
BLAST uses precomputed extreme
value distributions to calculate E-
values from alignment scores
For this reason BLAST only allows
certain combinations of substitution
matrices and gap penalties
This also means that the fit is based on
a different data set than the one you
are working on
A word of caution: BLAST tends to overestimate the significance of its
matches
E-values from BLAST are fine for identifying sure hits
One should be careful using BLAST’s E-values to judge if a marginal hit
can be trusted (e.g., you may want to use E-values of 10^-4 to 10^-5).
Significance
• The distribution of scores: a graph of the
frequency of observed scores
• expected curve (asterisks) according to
the extreme value distribution
• the theoretic curve should be similar to the observed
results
• deviations indicate that the fitting
parameters are wrong
• too weak gap penalties
• compositional biases
FastA Output
< 20 222 0 :*
22 30 0 :*
24 18 1 :*
26 18 15 :*
28 46 159 :*
30 207 963 :*
32 1016 3724 := *
34 4596 10099 :==== *
36 9835 20741 :========= *
38 23408 34278 :==================== *
40 41534 47814 :=================================== *
42 53471 58447 :============================================ *
44 73080 64473 :====================================================*=======
46 70283 65667 :=====================================================*====
48 64918 62869 :===================================================*==
50 65930 57368 :===============================================*=======
52 47425 50436 :======================================= *
54 36788 43081 :=============================== *
56 33156 35986 :============================ *
58 26422 29544 :====================== *
60 21578 23932 :================== *
62 19321 19187 :===============*
64 15988 15259 :============*=
66 14293 12060 :=========*==
68 11679 9486 :=======*==
70 10135 7434 :======*==
72 8957 5809 :====*===
74 7728 4529 :===*===
76 6176 3525 :==*===
78 5363 2740 :==*==
80 4434 2128 :=*==
82 3823 1628 :=*==
84 3231 1289 :=*=
86 2474 998 :*==
88 2197 772 :*=
90 1716 597 :*=
92 1430 462 :*= :===============*========================
94 1250 358 :*= :============*===========================
96 954 277 :* :=========*=======================
98 756 214 :* :=======*===================
100 678 166 :* :=====*==================
102 580 128 :* :====*===============
104 476 99 :* :===*=============
106 367 77 :* :==*==========
108 309 59 :* :==*========
110 287 46 :* :=*========
112 206 36 :* :=*======
114 161 28 :* :*=====
116 144 21 :* :*====
118 127 16 :* :*====
>120 886 13 :* :*==============================
Related
FastA Output
Complete version !
• A summary of the statistics and of the program
parameters follows the histogram.
• An important number in this summary is the
Kolmogorov-Smirnov statistic, which indicates how well
the actual data fit the theoretical statistical distribution.
The lower this value, the better the fit, and the more
reliable the statistical estimates.
• In general, a Kolmogorov-Smirnov statistic under 0.1
indicates a good fit with the theoretical model. If the
statistic is higher than 0.2, the statistics may not be
valid, and it is recommended to repeat the search, using
more stringent (more negative) values for the gap
penalty parameters.
FastA Output
Statistics summary
• Optimal local alignment scores for pairs of random
amino acid sequences of the same length follow an
extreme-value distribution. For any score S, the
probability of observing a score >= S is given by the
Karlin-Altschul statistic:
$P(\text{score} \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$
• k and λ are parameters related to the position of
the maximum and the width of the distribution.
• Note the long tail at the right. This means that a score
several standard deviations above the mean has a higher
probability of arising by chance (that is, it is less
significant) than if the scores followed a normal
distribution.
P-values
• Many programs report P = the probability that the
alignment is no better than random. The relationship
between Z and P depends on the distribution of the
scores from the control population, which do NOT
follow the normal distribution
• P <= 10^-100 (exact match)
• P in range 10^-100 to 10^-50 (sequences nearly identical, e.g. alleles
or SNPs)
• P in range 10^-50 to 10^-10 (closely related sequences, homology
certain)
• P in range 10^-5 to 10^-1 (usually distant relatives)
• P > 10^-1 (match probably insignificant)
E
• For database searches, most programs report E-values. The
E-value of an alignment is the expected number of sequences
that give the same Z-score or better if the database is
probed with a random sequence. E is found by multiplying
the value of P by the size of the database probed. Note that
E but not P depends on the size of the database. Values of P
are between 0 and 1. Values of E are between 0 and the
number of sequences in the database searched:
• E<=0.02 sequences probably homologous
• E between 0.02 and 1 homology cannot be ruled out
• E>1 you would have to expect this good a match by just chance
BLAST is actually a family of programs:
• BLASTN - Nucleotide query searching a
nucleotide database.
• BLASTP - Protein query searching a protein
database.
• BLASTX - Translated nucleotide query sequence
(6 frames) searching a protein database.
• TBLASTN - Protein query searching a translated
nucleotide (6 frames) database.
• TBLASTX - Translated nucleotide query (6
frames) searching a translated nucleotide (6
frames) database.
Blast
• Be aware of what options you have selected when using
BLAST, or FASTA implementations.
• Treat BLAST searches as scientific experiments
• So you should try your searches with the filters on and
off to see whether it makes any difference to the output
Tips
Tips: Low-complexity and Gapped Blast Algorithm
• The common, Web-based ones often have
default settings that will affect the outcome of
your searches. By default all NCBI BLAST
implementations filter out biased sequence
composition from your query sequence (e.g.
signal peptide and transmembrane
sequences - beware!).
• The SEG program has been implemented as
part of the blast routine in order to mask low-
complexity regions
• Low-complexity regions are denoted by
strings of Xs in the query sequence
•The sequence databases contain a
wealth of information. They also contain
a lot of errors. Contaminants …
•Annotation errors, frameshifts that may
result in erroneous conceptual
translations.
•Hypothetical proteins ?
•In the words of Fox Mulder, "Trust no
one."
Tips
• Once you get a match to things in the databases, check
whether the match is to the entire protein, or to a
domain. Don't immediately assume that a match means
that your protein carries out the same function (see
above). Compare your protein and the match protein(s)
along their entire lengths before making this
assumption.
Tips
• Domain matches can also cause problems by
hiding other informative matches. For instance if
your protein contains a common domain you'll
get significant matches to every homologous
sequence in the database. BLAST only reports
back a limited number of matches, ordered by P
value.
• If this list consists only of matches to the same
domain, cut this bit out of your query sequence
and do the BLAST search again with the edited
sequence (e.g. NHR).
Tips
• Do controls wherever possible. In particular
when you use a particular search software for
the first time.
• Suitable positive controls would be protein
sequences known to have distant homologues
in the databases to check how good the
software is at detecting such matches.
• Negative controls can be employed to make
sure the compositional bias of the sequence
isn't giving you false positives. Shuffle your
query sequence and see what difference this
makes to the matches that are returned. A real
match should be lost upon shuffling of your
sequence.
Tips
Tips: Blast-shuffle.py
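The script shown on these slides is not reproduced in this transcript. As a minimal sketch of the negative control described in the tip on controls (not the author's Blast-shuffle.py), the query can be shuffled so that its composition is kept but its order is destroyed. The function name and the example sequence are assumptions.

```python
import random


def shuffle_sequence(sequence, seed=None):
    """Return a random permutation of the sequence (same composition)."""
    rng = random.Random(seed)  # a fixed seed makes the control reproducible
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


# Hypothetical protein query.
query = "MKVLAAGIVGLLLAQPSFAEEKKCHFCRKVVW"
shuffled = shuffle_sequence(query, seed=42)
print(">query_shuffled")
print(shuffled)
```

The shuffled sequence is then submitted with the same database and parameters as the real query. Hits that survive shuffling point to compositional bias rather than homology.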
•BLAST's major advantage is its speed.
• 2-3 minutes for BLAST versus several hours for a
sensitive FastA search of the whole of GenBank.
•When both programs use their default
settings, BLAST is usually more sensitive
than FastA for detecting protein sequence
similarity.
• Since it doesn't require a perfect sequence
match in the first stage of the search.
FastA vs. Blast
Weakness of BLAST:
• The long word size it uses in the initial stage of DNA
sequence similarity searches was chosen for speed, and not
sensitivity.
• For a thorough DNA similarity search, FastA is the program of
choice, especially when run with a lowered KTup value.
• FastA is also better suited to the specialised task of detecting
genomic DNA regions using a cDNA query sequence, because
it allows the use of a gap extension penalty of 0. The original BLAST,
which only creates ungapped alignments, will usually detect
only the longest exon, or fail altogether.
• In general, a BLAST search using the default
parameters should be the first step in a database
similarity search strategy. In many cases, this is all
that may be required to yield all the information
needed, in a very short time.
FastA vs. Blast
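To illustrate why the initial word size matters for sensitivity, the sketch below counts the words that two DNA sequences share at a long and at a short word length. It is an illustration only, not how BLAST or FastA are implemented; `shared_words` and both sequences are made up for this note.

```python
def shared_words(query, subject, word_size):
    """Count the distinct words of length word_size found in both sequences."""
    def words(seq):
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return len(words(query) & words(subject))


# Hypothetical pair: the subject differs from the query at every 8th base.
query = "ACGTTGCAAGGCTTAGCCATGGATCCGTAACG"
subject = "ACGTTGCTAGGCTTACCCATGGAACCGTAACA"

for word_size in (11, 6):
    print(word_size, shared_words(query, subject, word_size))
```

A search that needs an exact word match to start an alignment can only extend from shared words, so a diverged pair may offer seeds at a short word length (or a lowered KTup) and few or none at a long one.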
1. Old (ungapped) BLAST
2. New BLAST (allows gaps)
3. Profile -> PSI-BLAST (Position-Specific Iterated)
• Strategy: multiple alignment of the hits
• Calculates a position-specific score matrix
• Searches with this matrix
• In many cases it is much more sensitive to weak but
biologically relevant sequence similarities
• PSSM !!!
PSI-Blast
• Patterns of conservation from the alignment of
related sequences can aid the recognition of distant
similarities.
• These patterns have been variously called motifs, profiles,
position-specific score matrices, and Hidden Markov
Models.
For each position in the derived pattern, every amino acid
is assigned a score.
(1) Highly conserved residue at a position: that residue is
assigned a high positive score, and others are assigned
high negative scores.
(2) Weakly conserved positions: all residues receive scores
near zero.
(3) Position-specific scores can also be assigned to
potential insertions and deletions.
PSI-Blast
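The scoring rules (1) and (2) can be sketched as a log-odds matrix built from an ungapped multiple alignment. This is a simplified illustration, assuming a uniform background and a flat pseudocount; it does not model rule (3), and PSI-BLAST's own weighting scheme is more elaborate. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes equal-length sequences and a uniform background of 1/20.
    Gap characters ('-') are skipped when counting.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical alignment: column 1 is fully conserved, column 3 is not.
alignment = ["CKHV", "CKRV", "CRHI", "CKHV"]
pssm = build_pssm(alignment)
print(round(pssm[0]["C"], 2), round(pssm[0]["A"], 2))
print(round(score_segment(pssm, "CKHV"), 2))
```

The conserved cysteine gets a positive score and every other residue at that position a negative one, while a variable column gives scores closer to zero.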
Pattern
•a set of alternative
sequences, using “regular
expressions”
•Prosite
(http://www.expasy.org/prosite/)
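A pattern written in PROSITE syntax can be turned into a regular expression with a few substitutions. The sketch below covers the common elements only (x, [..], {..}, repeat counts and the < and > anchors); the function name, the example pattern and the test sequence are assumptions made for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")           # patterns may end with a period
    regex = regex.replace("-", "")        # '-' only separates elements
    regex = regex.replace("{", "[^").replace("}", "]")  # {ED} = not E or D
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 times
    regex = regex.replace("x", ".")       # x = any residue
    regex = regex.replace("<", "^").replace(">", "$")   # terminal anchors
    return regex


# Illustrative zinc-finger-like pattern in PROSITE syntax.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)
print(regex)

# Made-up sequence that contains one instance of the pattern.
sequence = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
match = re.search(regex, sequence)
print(match.start() if match else "no match")
```

The order of the substitutions matters: the exclusion braces are converted before the repeat counts introduce new braces.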
PSSM (Position-Specific Scoring Matrix)
•The power of profile methods can be
further enhanced through iteration of the
search procedure.
• After a profile is run against a database, new
similar sequences can be detected. A new
multiple alignment, which includes these
sequences, can be constructed, a new profile
abstracted, and a new database search
performed.
• The procedure can be iterated as often as
desired or until convergence, when no new
statistically significant sequences are detected.
PSI-Blast
(1) PSI-BLAST takes as an input a single protein sequence and
compares it to a protein database, using the gapped BLAST
program.
(2) The program constructs a multiple alignment, and then a profile,
from any significant local alignments found.
The original query sequence serves as a template for the multiple
alignment and profile, whose lengths are identical to that of the query.
Different numbers of sequences can be aligned in different template
positions.
(3) The profile is compared to the protein database, again seeking
local alignments using the BLAST algorithm.
(4) PSI-BLAST estimates the statistical significance of the local
alignments found.
Because profile substitution scores are constructed to a fixed scale, and
gap scores remain independent of position, the statistical theory and
parameters for gapped BLAST alignments remain applicable to profile
alignments.
(5) Finally, PSI-BLAST iterates, by returning to step (2), a specified
number of times or until convergence.
PSI-Blast
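Steps (1) to (5) can be sketched as a toy loop. This is not PSI-BLAST: it assumes ungapped matching, a uniform background, a raw score cutoff in place of real significance statistics, and standard amino acids only. All names, sequences and the threshold are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(segments, pseudocount=1.0):
    """Log-odds profile from equal-length ungapped segments (uniform background)."""
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(len(segments[0])):
        counts = Counter(seg[col] for seg in segments)
        profile.append({aa: math.log2((counts[aa] + pseudocount) / total * len(AMINO_ACIDS))
                        for aa in AMINO_ACIDS})
    return profile


def best_hit(profile, sequence):
    """Best-scoring ungapped window of profile length: (score, segment)."""
    width = len(profile)
    best = (float("-inf"), "")
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        total = sum(col.get(aa, 0.0) for col, aa in zip(profile, segment))
        if total > best[0]:
            best = (total, segment)
    return best


def iterative_search(query, database, threshold, max_iterations=5):
    """Search, rebuild the profile from the hits, repeat until no new hits."""
    hits = {}                              # sequence name -> matched segment
    profile = build_profile([query])       # first round: the query alone
    for iteration in range(1, max_iterations + 1):
        new_hits = {}
        for name, sequence in database.items():
            score, segment = best_hit(profile, sequence)
            if score >= threshold:
                new_hits[name] = segment
        if set(new_hits) == set(hits):     # convergence: same set of hits
            return hits, iteration
        hits = new_hits
        # The query stays in the profile as the template; length is unchanged.
        profile = build_profile([query] + list(hits.values()))
    return hits, max_iterations


query = "ACDEFGHI"
database = {
    "close1": "MMACDEFGKLMM",
    "close2": "TTACDEFGKITT",
    "distant": "PPWWDEFGKLPP",
    "unrelated": "MNPQRSTVWYMN",
}
hits, iterations = iterative_search(query, database, threshold=5.0)
print(sorted(hits), iterations)
```

In this toy database the sequence named `distant` scores below the cutoff against the query alone and is only picked up once the two closer sequences have been folded into the profile, which is the effect the iteration is meant to achieve.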
From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
PSI-BLAST
PSSM
PSI-BLAST pitfalls
•Avoid sequences that are too closely related: overfitting!
•Can include false homologues! Therefore check the
matches carefully: include or exclude sequences based
on biological knowledge.
•The E-value reflects the significance of the match to
the previous training set, not to the original sequence!
•Choose your query sequence carefully.
•Try the reverse experiment to confirm.
• A single sequence is selected from a set of blocks and
enriched by replacing the conserved regions delineated
by the blocks by consensus residues derived from the
blocks.
• Embedding consensus residues improves performance
• S. Henikoff and J.G. Henikoff; Protein Science (1997)
6:698-705.
Reduce overfitting risk by Cobbler
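The embedding idea can be sketched as follows, assuming a simple majority vote per block column as the consensus. The published method may derive its consensus residues differently; the function names, the block and its position are hypothetical.

```python
from collections import Counter


def consensus(block):
    """Most common residue in each column of an ungapped block."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*block))


def embed_consensus(sequence, blocks):
    """Overwrite each block region of the sequence with the block consensus.

    blocks is a list of (start, segments); start is the 0-based position
    of the block in the sequence.
    """
    residues = list(sequence)
    for start, segments in blocks:
        cons = consensus(segments)
        residues[start:start + len(cons)] = cons
    return "".join(residues)


# Hypothetical sequence with one conserved block at positions 3-5.
sequence = "MMMAADMMM"
block = ["AAD", "ACE", "GCE"]
print(embed_consensus(sequence, [(3, block)]))
```

The enriched sequence keeps the regions outside the blocks unchanged, so it can still be used as an ordinary single-sequence query.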
PHI-Blast Local Blast
(Pattern-Hit Initiated BLAST)
From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
Installing Blast Locally
• 2 flavors: NCBI/WuBlast
• Executables:
• ftp://ftp.ncbi.nih.gov/blast/executables/
• Database:
• ftp://ftp.ncbi.nih.gov/blast/db/
• Formatdb
• formatdb -i ecoli.nt -p F
• formatdb -i ecoli.protein -p T
• For options: blastall -
• blastall -p blastp -i query -d database -o output
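The same commands can be driven from Python. The sketch below only builds the command lines shown on the slide and runs them if the executables are found; the helper names and file names are placeholders. Note that formatdb and blastall belong to the legacy NCBI toolkit; current BLAST+ releases use makeblastdb and program-specific commands such as blastp, with different option names.

```python
import shutil
import subprocess


def formatdb_command(fasta_path, is_protein):
    """Legacy formatdb call: -p T for protein, -p F for nucleotide."""
    return ["formatdb", "-i", fasta_path, "-p", "T" if is_protein else "F"]


def blastall_command(program, query_path, database, output_path):
    """Legacy blastall call with the four options used on the slide."""
    return ["blastall", "-p", program, "-i", query_path,
            "-d", database, "-o", output_path]


def run_if_available(command):
    """Run the command only if its executable is on PATH; report otherwise."""
    if shutil.which(command[0]) is None:
        print("not installed, would run:", " ".join(command))
        return False
    subprocess.run(command, check=True)
    return True


# Placeholder file names; nothing is executed here, the commands are printed.
print(" ".join(formatdb_command("ecoli.protein", is_protein=True)))
print(" ".join(blastall_command("blastp", "query", "ecoli.protein", "output")))
```

Passing the arguments as a list avoids shell quoting problems with file names.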
Main database: BLAT
• BLAT: BLAST-Like Alignment Tool
• Aligns the input sequence to the Human Genome
• Connected to several databases, like:
• mRNAs, ESTs, RepeatMasker, RefSeq
• GenScan, TwinScan, UniGene, CpG Islands
-BLAT (compared with existing tools)
-more accurate
-500 times faster in mRNA/DNA alignment
-50 times faster in protein/protein alignment
-BLAT’s steps
1.using nonoverlapping k-mers to create an index
2.using the index to find homologous regions
3.aligning these regions separately
4.stitching these aligned regions into larger alignments
5.revisiting small internal exons possibly missed in the first
stage and adjusting large gap boundaries that have canonical
splice sites where feasible
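Steps 1 and 2 can be sketched as below, assuming the index is built over the genome and the query is scanned with overlapping k-mers. This is a toy illustration, not BLAT's implementation; the function names and sequences are hypothetical, and alignment and stitching (steps 3 to 5) are left out.

```python
from collections import defaultdict


def build_index(genome, k):
    """Index the nonoverlapping k-mers of the genome: k-mer -> start positions."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):   # step k: no overlap
        index[genome[start:start + k]].append(start)
    return index


def find_seeds(query, index, k):
    """Look up every overlapping k-mer of the query in the index.

    Returns (query_position, genome_position) pairs; seeds that share the
    same difference genome_position - query_position lie on one diagonal
    and mark a candidate homologous region.
    """
    seeds = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            seeds.append((q, g))
    return seeds


# Made-up sequences: the query overlaps the genome around position 2.
genome = "ACGTACGGTTCA"
query = "GGACGGTT"
index = build_index(genome, 4)
print(find_seeds(query, index, 4))
```

Indexing only nonoverlapping k-mers keeps the index roughly k times smaller than an index of every position, while scanning the query with overlapping k-mers still lets a matching region hit an indexed word.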
Weblems
W5.1: Submit the amino acid sequence of papaya
papain to a BLAST (gapped and ungapped) and to a
PSI-BLAST search. What are the main differences in the
results?
W5.2: Is there a relationship between Klebsiella
aerogenes urease, Pseudomonas diminuta
phosphotriesterase and mouse adenosine deaminase
? Also use DALI, ClustalW and T-coffee.
W5.3: Yeast two-hybrid typically yields DNA sequences.
How would you find the corresponding protein ?
W5.4: When and why would you use tblastn ?
W5.5: How would you search a database if you want to
restrict the search space to those entries having a
secretion signal consisting of 4 consecutive (N-
terminal) basic residues ?

PROCESS RECORDING FORMAT.docx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

T5 2017 database_searching_v_upload

  • 4. DataBase Searching. Outline: Dynamic Programming Reloaded, Mapping Short Reads, Bowtie / BWA, Database Searching, Fasta, Blast, Statistics, Practical Guide, Extensions, PSI-Blast, PHI-Blast, Local Blast, BLAT
  • 5. Needleman-Wunsch-Complete.py
      The Score Matrix
      ----------------
      Seq1 (j):           1   2   3   4   5   6   7
      Seq2 (i)        *   C   K   H   V   F   C   R
        0   *         0  -1  -2  -3  -4  -5  -6  -7
        1   C        -1   1   0  -1  -2  -3  -4  -5
        2   K        -2   0   2   1   0  -1  -2  -3
        3   K        -3  -1   1   1   0  -1  -2  -3
        4   C        -4  -2   0   0   0  -1   0  -1
        5   F        -5  -3  -1  -1  -1   1   0  -1
        6   C        -6  -4  -2  -2  -2   0   2   1
        7   K        -7  -5  -3  -3  -3  -1   1   1
        8   C        -8  -6  -4  -4  -4  -2   0   0
        9   V        -9  -7  -5  -5  -3  -3  -1  -1
      Each cell matrix(i,j) takes the best of three candidate scores (arrows a, b, c in the figure):
      A: diagonal: matrix(i-1,j-1) + (MIS)MATCH, with MATCH if substr(seq1,j-1,1) eq substr(seq2,i-1,1)
      B: up_score = matrix(i-1,j) + GAP
      C: left_score = matrix(i,j-1) + GAP
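The recurrence on this slide can be sketched in a few lines of Python. This is not the deck's Needleman-Wunsch-Complete.py (which is not reproduced in the transcript); the function name and the scores MATCH = 1, MISMATCH = -1, GAP = -1 are assumptions, chosen because they reproduce the score matrix shown above.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the global-alignment score matrix; rows follow seq2 (i), columns seq1 (j)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs one gap per residue.
    for j in range(1, cols):
        matrix[0][j] = j * gap
    for i in range(1, rows):
        matrix[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[j - 1] == seq2[i - 1]
            diag = matrix[i - 1][j - 1] + (match if same else mismatch)  # A
            up = matrix[i - 1][j] + gap                                  # B
            left = matrix[i][j - 1] + gap                                # C
            matrix[i][j] = max(diag, up, left)
    return matrix


matrix = needleman_wunsch("CKHVFCR", "CKKCFCKCV")
for row in matrix:
    print(" ".join(f"{value:3d}" for value in row))
print("global alignment score:", matrix[-1][-1])
```

The bottom-right cell holds the score of the best global alignment; recovering the alignment itself would additionally require a traceback, which this sketch leaves out.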
  • 6. Extensions to basic dynamic programming method use gap penalties – constant gap penalty for gap > 1 – gap penalty proportional to gap size • one penalty for starting a gap (gap opening penalty) • different (lower) penalty for adding to a gap (gap extension penalty) use blosum62 • instead of MATCH and MISMATCH Dynamic Programming: Needleman-Wunsch-Complete.py
  • 10. Multiple Alignment Method • The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods. • The principle is that a multiple alignment is achieved by successive application of pairwise methods. • First do all pairwise alignments (not just one sequence with all others) • Then combine the pairwise alignments to generate the overall alignment
  • 11. • Multiple Sequence Alignment: –ClustalW, MSA • Short Read Sequence Alignment: –BWA, Bowtie • Database Search: –BLAST, FASTA, HMMER • Genomic Analysis: –BLAT Sequence Alignment Tools
  • 12. Read Length is Not As Important For Resequencing [Figure: % of paired k-mers with a uniquely assignable location (0% to 100%) versus length of k-mer reads (8 to 20 bp), for E. coli and human. Source: Jay Shendure]
  • 13. Short Read Alignment Software. Bowtie: memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Burrows-Wheeler Aligner (BWA): an aligner that implements two algorithms: bwa-short and BWA-SW. The former works for query sequences shorter than 200 bp and the latter for longer sequences up to around 100 kbp.
  • 14. Mapping Reads Back • Hash Table (Lookup table) – FAST, but requires perfect matches. [O(m n + N)] • Array Scanning – Can handle mismatches, but not gaps. [O(m N)] • Dynamic Programming (Smith Waterman) – Indels – Mathematically optimal solution – Slow (most programs use Hash Mapping as a prefilter) [O(mnN)] • Burrows-Wheeler Transform (BW Transform) – FAST. [O(m + N)] (without mismatch/gap) – Memory efficient. – But for gaps/mismatches, it lacks sensitivity
  • 15. Why Burrows-Wheeler? • BWT very compact: – Approximately ½ byte per base – As large as the original text, plus a few “extras” – Can fit onto a standard computer with 2GB of memory • Linear-time search algorithm – proportional to length of query for exact matches
  • 17. Burrows-Wheeler Transform
      Reversible permutation of the characters of a string, used originally for compression.
      T = a b a a b a $
      Burrows-Wheeler Matrix (all rotations of T, sorted):
        $ a b a a b a
        a $ a b a a b
        a a b a $ a b
        a b a $ a b a
        a b a a b a $
        b a $ a b a a
        b a a b a $ a
      BWT(T) = last column = a b b a $ a a
      How is it reversible? How is it useful for compression? How is it an index?
      Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994
  • 18. Burrows-Wheeler Transform
      Python example: BWT_v2.py (make a list of all rotations, sort them, take the last column; written here for Python 3, so range replaces xrange)

      def rotations(t):
          """ Return list of rotations of input string t """
          tt = t * 2
          return [tt[i:i+len(t)] for i in range(len(t))]

      def bwm(t):
          """ Return lexicographically sorted list of t's rotations """
          return sorted(rotations(t))

      def bwtViaBwm(t):
          """ Given T, returns BWT(T) by way of the BWM """
          return ''.join(map(lambda x: x[-1], bwm(t)))

      >>> bwtViaBwm("Tomorrow_and_tomorrow_and_tomorrow$")
      'w$wwdd__nnoooaattTmmmrrrrrrooo__ooo'
      >>> bwtViaBwm("It_was_the_best_of_times_it_was_the_worst_of_times$")
      's$esttssfftteww_hhmmbootttt_ii__woeeaaressIi_______'
      >>> bwtViaBwm('in_the_jingle_jangle_morning_Ill_come_following_you$')
      'u_gleeeengj_mlhl_nnnnt$nwj__lggIolo_iiiiarfcmylo_oo_'
  • 19. Key observation: T-ranking
      T = a1 c1 a2 a3 c2 g1 $1 (each character is ranked by its order of occurrence in T)
      Rank of first char   Rotation   Rank of last char
              1            $acaacg            1
              2            aacg$ac            1
              1            acaacg$            1
              3            acg$aca            2
              1            caacg$a            1
              2            cg$acaa            3
              1            g$acaac            2
      “Last first (LF) mapping”: the i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column.
  • 20. Burrows-Wheeler Transform: LF Mapping
      BWM with T-ranking (F = first column, L = last column):
        $  a0 b0 a1 a2 b1 a3
        a3 $  a0 b0 a1 a2 b1
        a1 a2 b1 a3 $  a0 b0
        a2 b1 a3 $  a0 b0 a1
        a0 b0 a1 a2 b1 a3 $
        b1 a3 $  a0 b0 a1 a2
        b0 a1 a2 b1 a3 $  a0
      LF Mapping: the i-th occurrence of a character c in L and the i-th occurrence of c in F correspond to the same occurrence in T. However we rank occurrences of c, ranks appear in the same order in F and L.
  • 21. Why does the LF Mapping hold ?
  • 22. Burrows-Wheeler Transform: LF Mapping
      BWM with B-ranking (F = first column, L = last column):
        $  a3 b1 a1 a2 b0 a0
        a0 $  a3 b1 a1 a2 b0
        a1 a2 b0 a0 $  a3 b1
        a2 b0 a0 $  a3 b1 a1
        a3 b1 a1 a2 b0 a0 $
        b0 a0 $  a3 b1 a1 a2
        b1 a1 a2 b0 a0 $  a3
      F: $ a0 a1 a2 a3 b0 b1 (ascending rank)
      L: a0 b0 b1 a1 $ a2 a3
      F now has a very simple structure: a $, a block of as with ascending ranks, a block of bs with ascending ranks.
  • 23. Burrows-Wheeler Transform
      Row   F    L
       0    $    a0
       1    a0   b0
       2    a1   b1
       3    a2   a1
       4    a3   $
       5    b0   a2
       6    b1   a3
      Which BWM row begins with b1? Skip row starting with $ (1 row). Skip rows starting with a (4 rows). Skip row starting with b0 (1 row). Answer: row 6
  • 24. Burrows-Wheeler Transform. Say T has 300 As, 400 Cs, 250 Gs and 700 Ts, and $ < A < C < G < T. Which BWM row (0-based) begins with G100? (Ranks are B-ranks.) Skip row starting with $ (1 row). Skip rows starting with A (300 rows). Skip rows starting with C (400 rows). Skip first 100 rows starting with G (100 rows). Answer: row 1 + 300 + 400 + 100 = row 801
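The row arithmetic above only needs the character counts of T. A minimal sketch, assuming B-ranks start at 0 and that Python's character ordering matches the stated $ < A < C < G < T (the function name is illustrative, not from the slides):

```python
def first_column_row(counts, char, b_rank):
    """0-based BWM row whose first character is `char` with B-rank `b_rank`."""
    # Skip every row that starts with a lexicographically smaller character...
    skipped = sum(n for c, n in counts.items() if c < char)
    # ...then skip the rows starting with lower-ranked copies of `char` itself.
    return skipped + b_rank


counts = {"$": 1, "A": 300, "C": 400, "G": 250, "T": 700}
print(first_column_row(counts, "G", 100))
```

The same function answers the smaller question on the previous slide: with counts {"$": 1, "a": 4, "b": 2}, the row starting with b1 is 1 + 4 + 1 = 6.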
  • 25. Burrows-Wheeler Transform: reversing. Reverse BWT(T) starting at the right-hand side of T and moving left. L: a0 b0 b1 a1 $ a2 a3. F: $ a0 a1 a2 a3 b0 b1. • Start in the first row. F must have $. L contains the character just prior to $: a0. • a0: LF Mapping says this is the same occurrence of a as the first a in F. Jump to the row beginning with a0. L contains the character just prior to a0: b0. • Repeat for b0, get a2 • Repeat for a2, get a1 • Repeat for a1, get b1 • Repeat for b1, get a3 • Repeat for a3, get $, done. Reverse of chars we visited = a3 b1 a1 a2 b0 a0 $ = T
  • 26. Burrows-Wheeler Transform: reversing. Another way to visualize reversing BWT(T): [Figure: the F and L columns (F: $ a0 a1 a2 a3 b0 b1; L: a0 b0 b1 a1 $ a2 a3) repeated side by side, with the LF-mapping path traced from one copy to the next.] T: a3 b1 a1 a2 b0 a0 $
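The reversal walk described on the two slides above can be sketched as follows. The helper names are illustrative assumptions; the logic is the LF mapping with B-ranks: the row to jump to is the start of the character's block in F plus the rank of that character in L.

```python
def rank_bwt(bw):
    """B-rank of every character in the BWT string, plus total counts per character."""
    totals = {}
    ranks = []
    for c in bw:
        ranks.append(totals.get(c, 0))
        totals[c] = totals.get(c, 0) + 1
    return ranks, totals


def first_column(totals):
    """Map each character to the first row of its block in the sorted first column F."""
    first = {}
    row = 0
    for c in sorted(totals):
        first[c] = row
        row += totals[c]
    return first


def reverse_bwt(bw):
    """Rebuild T from BWT(T) by repeated LF mapping, right to left."""
    ranks, totals = rank_bwt(bw)
    first = first_column(totals)
    row = 0          # row 0 starts with $, so L[0] is the character just before $
    text = "$"
    while bw[row] != "$":
        c = bw[row]
        text = c + text
        row = first[c] + ranks[row]   # LF mapping: same occurrence of c, now in F
    return text


print(reverse_bwt("abba$aa"))
```

For the running example the walk visits a0, b0, a2, a1, b1, a3 and then reaches $, which is the order given on the slide.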
  • 27. Burrows-Wheeler Transform • BWT is useful for compression: sorts characters by right-context, making a more compressible string • It’s reversible: repeated applications of LF Mapping, recreating T from right to left • FM Index
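To connect the BWT to its use as an index, here is a hedged sketch of exact matching by backward search, the query operation usually associated with the FM index. It is a teaching version only: the function names are assumptions, and the occurrence counts are computed by naive scanning, whereas a practical index would store precomputed checkpoints.

```python
def bwt(t):
    """BWT by sorting all rotations (fine for short teaching examples)."""
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def count_occurrences(bw, pattern):
    """Count exact matches of `pattern` in T, given only bw = BWT(T)."""
    # first[c] = index of the first BWM row that starts with c
    first = {}
    total = 0
    for c in sorted(set(bw)):
        first[c] = total
        total += bw.count(c)

    def occ(c, i):
        # Number of copies of c in bw[:i]; naive scan instead of checkpoints.
        return bw[:i].count(c)

    lo, hi = 0, len(bw)              # current range of rows, half-open
    for c in reversed(pattern):      # extend the match one character to the left
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


bw = bwt("abaaba$")
print(bw, count_occurrences(bw, "aba"))
```

Each pattern character narrows the row range once, which is why exact-match search time is proportional to the query length rather than to the text length.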
  • 28. Steps in using BWA
      Download and install BWA on Linux/Mac. Export the path or use the exact path.
        bunzip2 bwa-0.5.9.tar.bz2
        tar xvf bwa-0.5.9.tar
        cd bwa-0.5.9
        make
      Download the reference genome using wget.
  • 29. Create the index for the reference genome (assuming the reference sequences are in wg.fa). This only needs to be performed once for each genome.
        bwa index -p hg19bwaidx -a bwtsw wg.fa
      Use -a is (instead of -a bwtsw) for small genomes.
      Mapping short reads to the reference genome:
      1. Align sequences using multiple threads (e.g. 4 CPUs). Assume the short reads are in the s_3_sequence.txt.gz file.
        bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
  • 30. 2. Create the alignment in the SAM format (a generic format for storing large nucleotide sequence alignments):
        bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
      Mapping long reads can be done using the bwasw command:
        bwa bwasw hg19bwaidx 454seqs.txt > 454seqs.sam
  • 31. Sequence Alignment/Map Format. Workflow: sequence reads + reference sequence → alignment software (BWA, Bowtie) → SAM file → resequencing, RNA-Seq, SNPs. Reads: Illumina reads. Reference: whole genome, contig, chromosome. Most of the analysis happens when considering the SAM files.
  • 32. SAM format: “A tab-delimited text format consisting of a header section, which is optional, and an alignment section” https://samtools.github.io/hts-specs/SAMv1.pdf
  • 33. Example of CIGAR (a "-" marks a gap in that row)
      RefPos:      1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
      Reference:   C  C  A  T  A  C  T  -  G  A  A  C  T  G  A  C  T  A  A  C
      Read:                    A  C  T  A  G  A  A  -  T  G  G  C  T
      In the SAM file you will have the following fields:
      • POS: 5
      • CIGAR: 3M1I3M1D5M
      The POS indicates that the read aligns starting at position 5 on the reference. The CIGAR says that the first 3 bases in the read sequence align with the reference. The next base in the read does not exist in the reference. Then 3 bases align with the reference. The next reference base does not exist in the read sequence, then 5 more bases align with the reference. Note that at position 14, the base in the read is different from the reference, but it still counts as an M since it aligns to that position.
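A short sketch of how POS and CIGAR together place a read on the reference. It handles only the M, I and D operations used in this example (other SAM operations are rejected rather than interpreted), and the function names are illustrative assumptions rather than part of any library.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def render_alignment(ref, pos, read, cigar):
    """Return gapped (reference, read) rows; `pos` is 1-based as in SAM."""
    r, q = pos - 1, 0
    ref_row, read_row = [], []
    for length, op in parse_cigar(cigar):
        if op == "M":      # aligned column (match or mismatch): consumes both
            ref_row.append(ref[r:r + length])
            read_row.append(read[q:q + length])
            r += length
            q += length
        elif op == "I":    # insertion to the reference: consumes the read only
            ref_row.append("-" * length)
            read_row.append(read[q:q + length])
            q += length
        elif op == "D":    # deletion from the reference: consumes the reference only
            ref_row.append(ref[r:r + length])
            read_row.append("-" * length)
            r += length
        else:
            raise ValueError(f"operation {op!r} is not handled in this sketch")
    return "".join(ref_row), "".join(read_row)


ref_row, read_row = render_alignment(
    "CCATACTGAACTGACTAAC", 5, "ACTAGAATGGCT", "3M1I3M1D5M"
)
print(ref_row)
print(read_row)
```

The M and D operations consume 3 + 3 + 1 + 5 = 12 reference bases, so the read covers reference positions 5 to 16.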
  • 34. Harvesting Information from SAM • Query name, QNAME (SAM)/read_name (BAM). • FLAG provides the following information: – are there multiple fragments? – are all fragments properly aligned? – is this fragment unmapped? – is the next fragment unmapped? – is this query the reverse strand? – is the next fragment the reverse strand? – is this the last fragment? – is this a secondary alignment? – did this read fail quality controls? – is this read a PCR or optical duplicate?
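The FLAG field packs these yes/no answers into one integer, one bit per question. The sketch below decodes it using the bit values from the SAM specification; the short labels are my own, and the 0x40 "first fragment" bit is included even though the slide does not list it.

```python
# (bit, label) pairs; bit values follow the SAM specification, labels are illustrative.
FLAG_BITS = [
    (0x1, "paired"),            # multiple fragments
    (0x2, "proper_pair"),       # all fragments properly aligned
    (0x4, "unmapped"),          # this fragment unmapped
    (0x8, "mate_unmapped"),     # next fragment unmapped
    (0x10, "reverse"),          # this query on the reverse strand
    (0x20, "mate_reverse"),     # next fragment on the reverse strand
    (0x40, "first_in_pair"),    # first fragment
    (0x80, "last_in_pair"),     # last fragment
    (0x100, "secondary"),       # secondary alignment
    (0x200, "qc_fail"),         # failed quality controls
    (0x400, "duplicate"),       # PCR or optical duplicate
]


def decode_flag(flag):
    """List the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


print(decode_flag(99))
print(decode_flag(4))
```

A FLAG of 99 (1 + 2 + 32 + 64) therefore describes a properly paired first fragment whose mate maps to the reverse strand.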
  • 35. BAM • BAM is a compressed version of the SAM file format. • BAM is compressed in the BGZF format. All multi-byte numbers in BAM are little-endian, regardless of the machine endianness. As an example, suppose we have the hexadecimal number 12345678. • There are multiple programs that convert BAM files to SAM files and vice versa (eg samtools)
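The little-endian remark can be made concrete with the slide's own example value, hexadecimal 12345678, using Python's standard struct module. The choice of a 32-bit unsigned integer ("I") is an assumption made for illustration.

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)   # little-endian: least significant byte first
big = struct.pack(">I", value)      # big-endian, shown for comparison

print(little.hex())   # byte order as it would appear in a BAM file
print(big.hex())

# Reading the bytes back with the same byte order recovers the number.
restored = struct.unpack("<I", little)[0]
print(hex(restored))
```

Always specifying "<" when reading BAM integers gives the same result on any machine, which is the point of fixing the byte order in the format.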
  • 37. Database Searching • Consider the task of searching SWISSPROT with a query sequence: • say our query sequence is 362 amino acids long • SWISSPROT release 38 contains 29,085,265 amino acids • finding local alignments via dynamic programming would entail O(10^10) matrix operations (362 × 29,085,265 ≈ 1.05 × 10^10) • Given the size of the databases, more efficient methods are needed
  • 38. Heuristic approaches to DP for database searching
      FASTA (Pearson 1995)
        – Uses heuristics to avoid calculating the full dynamic programming matrix
        – Speeds up searches by an order of magnitude compared to full Smith-Waterman
        – The statistical side of FASTA is still stronger than BLAST
      BLAST (Altschul 1990, 1997)
        – Uses rapid word lookup methods to completely skip most of the database entries
        – Extremely fast: one order of magnitude faster than FASTA, two orders of magnitude faster than Smith-Waterman
        – Almost as sensitive as FASTA
  • 39. « Hit and extend heuristic» • Problem: Too many calculations “wasted” by comparing regions that have nothing in common • Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical • Basic method: Look for similar regions only near short stretches that match exactly FASTA
  • 40. FASTA-Stages 1. Find k-tups in the two sequences (k=1,2 for proteins, 4-6 for DNA sequences) 2. Score and select top 10 scoring “local diagonals” 3. Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores. 4. Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score 5. After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.
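Stages 1 and 2 can be sketched as a lookup table of k-tuples followed by a count of hits per diagonal. This is an illustration of the idea only, not FASTA's actual scoring of diagonals: the function name is an assumption, and a diagonal is identified here simply by the offset between the target and query positions.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, target, k):
    """Count shared k-tuples per diagonal (offset = target position - query position)."""
    # Stage 1: index every k-tuple of the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    # Stage 2: every identical k-tuple in the target is a hit on one diagonal.
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


# The two RNA sequences from the codon "wobble" example later in the deck.
seq_a = "GGUUCUACGAAG"
seq_b = "GGCUCCACAAAA"
print(ktup_diagonals(seq_a, seq_b, 2).most_common(3))
print(ktup_diagonals(seq_a, seq_b, 3))
```

With k = 2 the main diagonal collects four hits, while with k = 3 no k-tuple is shared at all, which is the sensitivity problem the later "FASTA problems" slides describe.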
  • 43. • Sensitivity: the ability of a program to identify weak but biologically significant sequence similarity. • Selectivity: the ability of a program to discriminate between true matches and matches occurring by chance alone. • A decrease in selectivity results in more false positives being reported. FastA
  • 44. FastA (http://www.ebi.ac.uk/fasta33/) • Blosum50 is the default matrix. Use a lower PAM or higher BLOSUM matrix to detect close sequences, and a higher PAM or lower BLOSUM matrix to detect distant sequences • Gap opening penalty: -12 (proteins) and -16 (DNA) by default • Gap extension penalty: -2 (proteins) and -4 (DNA) by default • The larger the word length, the less sensitive but the faster the search will be • Max number of scores and alignments is 100
  • 45. FastA Output • Database code hyperlinked to the SRS database at EBI • Accession number • Description • Length • Initn, init1, opt and z-score calculated during the run • E score: expectation value, i.e. how many hits are expected to be found by chance with such a score while comparing this query to this database. E() does not represent the % similarity
  • 46. Query: DNA Protein Database:DNA Protein FastA is a family of programs FastA, TFastA, FastX, FastY
  • 47. FASTA problems. FASTA can miss significant similarity since • For proteins, similar sequences do not have to share identical residues • Asp-Lys-Val is quite similar to Glu-Arg-Ile, yet it is missed even with a ktuple size of 1 since no amino acid matches • Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly, but there is no match with a ktuple size of 2
  • 48. FASTA problems. FASTA can miss significant similarity since • For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy where the X’s are conserved and the y’s are not • GGuUCuACgAAg and GGcUCcACaAAA both code for the same peptide sequence (Gly-Ser-Thr-Lys) but they don’t match with a ktuple size of 3 or higher
  • 50. BLAST - Basic Local Alignment Search Tool
  • 51. What does BLAST do? • Search a large target set of sequences... • …for hits to a query sequence... • …and return the alignments and scores from those hits... • Do it fast. Show me those sequences that deserve a second look. Blast programs were designed for fast database searching, with minimal sacrifice of sensitivity to distant related sequences.
  • 52. The big red button Do My Job It is dangerous to hide too much of the underlying complexity from the scientists.
  • 53. Overview • Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T • Key concept “Neighborhood”: this seems similar to FASTA, but we are searching for words which score above T rather than words that match exactly • Calculate the neighborhood (T) for substrings of the query (size W)
  • 54. Overview
      Compile a list of words which give a score above T when paired with the query sequence.
      • Example using PAM-120 for query sequence ACDE (w=4, T=17):
          A C D E
          A C D E = +3 +9 +5 +5 = 22
      • try all possibilities:
          A A A A = +3 -3 0 0 = 0 no good
          A A A C = +3 -3 0 -7 = -7 no good
      • ...too slow, try directed change
  • 55. Overview
          A C D E
          A C D E = +3 +9 +5 +5 = 22
      • change 1st pos. to all acceptable substitutions
          g C D E = +1 +9 +5 +5 = 20 ok
          n C D E = +0 +9 +5 +5 = 19 ok
          I C D E = -1 +9 +5 +5 = 18 ok
          k C D E = -2 +9 +5 +5 = 17 ok
      • change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13
      • change 3rd pos. in combination with first position
          g C n E = +1 +9 +2 +5 = 17 ok
      • continue - use recursion
      • For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence
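The neighborhood idea can be illustrated with a brute-force sketch that uses only the PAM-120 entries quoted on these two slides. Everything else is an assumption: pairs not quoted on the slides get a placeholder penalty (UNKNOWN), the alphabet is restricted to the residues that appear in the example, and the enumeration is the "try all possibilities" approach rather than the directed, recursive search the slide recommends. The output is therefore only a subset of the true PAM-120 neighborhood.

```python
from itertools import product

# (query residue, word residue) scores quoted on the slides; nothing else is known here.
QUOTED_SCORES = {
    ("A", "A"): 3, ("C", "C"): 9, ("D", "D"): 5, ("E", "E"): 5,
    ("A", "G"): 1, ("A", "N"): 0, ("A", "I"): -1, ("A", "K"): -2,
    ("D", "N"): 2,
}
UNKNOWN = -10  # assumption: stand-in penalty for every pair not quoted above


def pair_score(query_residue, word_residue):
    return QUOTED_SCORES.get((query_residue, word_residue), UNKNOWN)


def neighborhood(query, threshold, alphabet):
    """All words over `alphabet` scoring at least `threshold` against `query`."""
    words = {}
    for candidate in product(alphabet, repeat=len(query)):
        score = sum(pair_score(q, w) for q, w in zip(query, candidate))
        if score >= threshold:
            words["".join(candidate)] = score
    return words


hits = neighborhood("ACDE", 17, "ACDEGIKN")
for word, score in sorted(hits.items(), key=lambda item: -item[1]):
    print(word, score)
```

With a full substitution matrix and the 20-letter alphabet the same loop would test 160,000 candidates per query word, which is why the slides prune the search by changing one position at a time.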
  • 64. Neighborhood words for the query word RGD, with their scores
      BLOSUM62 (RGD, threshold 11): RGD 17, KGD 14, QGD 13, RGE 13, EGD 12, HGD 12, NGD 12, RGN 12, AGD 11, MGD 11, RAD 11, RGQ 11, RGS 11, RND 11, RSD 11, SGD 11, TGD 11
      PAM200 (RGD, threshold 13): RGD 18, RGE 17, RGN 16, KGD 15, RGQ 15, KGE 14, HGD 13, KGN 13, RAD 13, RGA 13, RGG 13, RGH 13, RGK 13, RGS 13, RGT 13, RSD 13, WGD 13
  • 66. [Figure: score versus length of extension, with the threshold S marked; the extension is trimmed back to the maximum score.] * Two non-overlapping HSP’s on a diagonal within distance A
  • 68. The BLAST algorithm
      • Break the search sequence into words
        • W = 3 for proteins, W = 12 for DNA
            MCGPFILGTYC → MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC
      • Include in the search all words that score above a certain value (T) for any search word
            MCG: MCG, MCT, MCN, …
            CGP: CGP, MGP, CTP, …
      • This list can be computed in linear time
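The first step, breaking the query into overlapping words of length W, is a one-line sliding window. The function name is an illustrative assumption.

```python
def query_words(sequence, w):
    """All overlapping words of length w, in order of their start position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


words = query_words("MCGPFILGTYC", 3)
print(words)
```

A query of length L yields L - W + 1 words, so the word list (and the neighborhood built from it) grows linearly with the query length.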
  • 69. The Blast Algorithm (2) • Search for the words in the database • Word locations can be precomputed and indexed • Searching for a short string in a long string • HSP (High Scoring Pair) = A match between a query word and the database • Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A • Extend the hit until the score falls below a threshold value, S
  • 71.
                                          homologous sequences   non-homologous sequences
      Sequences reported as related       True positives         False positives
      Sequences reported as unrelated     False negatives        True negatives
      Sensitivity: ability to find true positives
      Specificity: ability to minimize false positives
  • 72. BLAST parameters • Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set. • Choosing a value for w • small w: many matches to expand • big w: many words to be generated • w=4 is a good compromise • Lowering the segment extension cutoff (S) returns longer extensions for each hit. • Changing the minimum E-value changes the threshold for reporting a hit.
  • 73. Critical parameters: T, W and scoring matrix • The proper value of T depends on both the values in the scoring matrix and the balance between speed and sensitivity • Higher values of T progressively remove more word hits and reduce the search space. • A word size (W) of 1 will produce more hits than a word size of 10. In general, if T is scaled uniformly with W, smaller word sizes increase sensitivity and decrease speed. • The interplay between W, T and the scoring matrix is critical, and choosing them wisely is the most effective way of controlling the speed and sensitivity of BLAST
  • 75. Database Searching • How can we find a particular short sequence in a database of sequences (or one HUGE sequence)? • Problem is identical to local sequence alignment, but on a much larger scale. • We must also have some idea of the significance of a database hit. • Databases always return some kind of hit, how much attention should be paid to the result? • How can we determine how “unusual” a particular alignment score is?
  • 76. Significance
      Sentence 1: “These algorithms are trying to find the best way to match up two sequences”
      Sentence 2: “This does not mean that they will find anything profound”
      ALIGNMENT (column spacing of the match line was lost in extraction):
        THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
        :: :.. . .. ...: : ::::.. :: . : ...
        THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------
      12 exact matches, 14 conservative substitutions. Is this a good alignment?
  • 77. • A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T • This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user- specified probability Overview
  • 78. Mathematical Basis of BLAST • Model matches as a sequence of coin tosses • Let p be the probability of a “head” • For a “fair” coin, p = 0.5 • (Erdös-Rényi) If there are n throws, then the expected length R of the longest run of heads is $R = \log_{1/p}(n)$. • Example: suppose n = 20 for a “fair” coin: $R = \log_2(20) = 4.32$ • The trick is how to model DNA (or amino acid) sequence alignments as coin tosses.
  • 79. Mathematical Basis of BLAST • To model random sequence alignments, replace a match with a “head” and a mismatch with a “tail”. • For DNA, the probability of a “head” is 1/4 • What is it for amino acid sequences?
        AATCAT
        ATTCAG
        HTHHHT
  • 80. Mathematical Basis of BLAST • So, for one particular alignment, the Erdös-Rényi property can be applied • What about all possible alignments? • Consider that the sequences are being shifted back and forth, as in a dot matrix plot • The expected length of the longest match is $R = \log_{1/p}(mn)$, where m and n are the lengths of the two sequences.
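The coin-toss model is easy to check numerically. The sketch below evaluates the expected longest run, log base 1/p of n, for the fair-coin example and for two DNA sequences of length 6 (p = 1/4, n replaced by m × n), and measures the longest run of heads in the HTHHHT string from the previous slide. Function names are illustrative.

```python
import math


def expected_longest_run(n, p):
    """Erdös-Rényi estimate: log base 1/p of n."""
    return math.log(n) / math.log(1 / p)


def longest_run(tosses, symbol="H"):
    """Length of the longest uninterrupted run of `symbol`."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == symbol else 0
        best = max(best, current)
    return best


print(round(expected_longest_run(20, 0.5), 2))      # fair coin, 20 throws
print(round(expected_longest_run(6 * 6, 0.25), 2))  # two DNA sequences of length 6
print(longest_run("HTHHHT"))
```

Because the estimate grows only logarithmically, doubling the sequence lengths adds a constant to the expected longest chance match instead of doubling it.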
  • 82. Karlin-Altschul Statistics: $E = k\,m\,n\,e^{-\lambda S}$. This equation states that the number of alignments expected by chance (E) during the sequence database search is a function of the size of the search space (m*n), the normalized score (λS) and a minor constant (k, mostly 0.1). The E-value grows linearly with the product of target and query sizes. Doubling the target set size and doubling the query length have the same effect on the E-value.
  • 84. Scoring alignments • Score: S (~R) • $S = \sum_i M(q_i, t_i) - \sum \text{gaps}$ • Any alignment has a score • Any two sequences have (at least one) optimal alignment
  • 85. • For a particular scoring matrix and its associated gap initiation and extension costs, one must calculate λ and k • Unfortunately (for gapped alignments), you can’t do this analytically and the values must be estimated empirically • The procedure involves aligning random sequences (Monte Carlo approach) with a specific scoring scheme and observing the alignment properties (scores, target frequencies and lengths)
  • 87. Significance. “Monte Carlo” approach: • Compares the result to randomized results, similar to results generated by a roulette wheel at Monte Carlo • Typical procedure for alignments: • Randomize sequence A • Align to sequence B • Repeat many times (hundreds) • Keep track of the optimal score • Histogram of scores …
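The procedure above can be sketched with a small Smith-Waterman scorer and a shuffle loop. All specifics are assumptions made for illustration: the scoring scheme (match 1, mismatch -1, linear gap -2), the two example strings, the number of shuffles and the fixed random seed.

```python
import random
from collections import Counter


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def shuffled_scores(seq_a, seq_b, n_shuffles=200, seed=0):
    """Scores of seq_b against randomized versions of seq_a (same composition)."""
    rng = random.Random(seed)
    letters = list(seq_a)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)                      # randomize sequence A
        scores.append(local_score("".join(letters), seq_b))  # align to sequence B
    return scores


seq_a = "THESEALGORITHMSARETRYING"
seq_b = "THISDOESNOTMEANTHATTHEYWILL"
observed = local_score(seq_a, seq_b)
null_scores = shuffled_scores(seq_a, seq_b)
histogram = Counter(null_scores)
empirical_p = sum(score >= observed for score in null_scores) / len(null_scores)

print("observed score:", observed)
for score in sorted(histogram):
    print(f"{score:3d} {'=' * histogram[score]}")
print("fraction of shuffles scoring at least as high:", empirical_p)
```

Shuffling keeps the composition of sequence A but destroys its order, so the histogram approximates the scores expected by chance for sequences of this length and composition.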
  • 88. Assessing significance requires a distribution • I have a pumpkin of diameter 1 m. Is that unusual? [Figure: histogram of frequency versus diameter (m)]
  • 91. • In seeking optimal Alignments between two sequences, one desires those that have the highest score - i.e. one is seeking a distribution of maxima • In seeking optimal Matches between an Input Sequence and Sequence Entries in a Database, one again desires the matches that have the highest score, and these are obtained via examination of the distribution of such scores for the entries in the database - this is again a distribution of maxima. “A Normal Distribution is a distribution of Sums of independent variables rather than a sum of their Maxima.“ Normal Distribution does NOT Fit Alignment Scores !! Significance
  • 92. Comparing distributions
      Extreme value: $f(x) = \frac{1}{\beta}\, e^{-\frac{x-\mu}{\beta}}\, e^{-e^{-\frac{x-\mu}{\beta}}}$
      Gaussian: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  • 93. P(xS) = 1-exp(-kmne-S) m, n: sequence lengths. k, : free parameters. This can be shown analytically for ungapped alignments and has been found empirically to also hold for gapped alignments under commonly used conditions. Alignment of unrelated/random sequences result in scores following an extreme value distribution Alignment scores follow extreme value distributions E x P = 1 –e-E E=-ln(1-P)
  • 94. Alignment algorithms will always produce alignments, regardless of whether it is meaningful or not => important to have way of selecting significant alignments from large set of database hits. Solution: fit distribution of scores from database search to extreme value distribution; determine p-value of hit from this fitted distribution. Example: scores fitted to extreme value distribution. 99.9% of this distribution is located below score=112 => hit with score = 112 has a p-value of 0.1% Alignment scores follow extreme value distributions
  • 95. Significance • BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores • For this reason BLAST only allows certain combinations of substitution matrices and gap penalties • This also means that the fit is based on a different data set than the one you are working on • A word of caution: BLAST tends to overestimate the significance of its matches • E-values from BLAST are fine for identifying sure hits • One should be careful using BLAST’s E-values to judge if a marginal hit can be trusted (e.g., you may want to use E-values of 10^-4 to 10^-5).
  • 96. • The distribution of scores graph of frequency of observed scores • expected curve (asterisks) according to the extreme value distribution • the theoretic curve should be similar to the observed results • deviations indicate that the fitting parameters are wrong • too weak gap penalties • compositional biases FastA Output
  • 97. FastA Output (columns: score bin, observed count, expected count; "=" bars show observed, "*" marks expected)
      < 20    222      0 :*
        22     30      0 :*
        24     18      1 :*
        26     18     15 :*
        28     46    159 :*
        30    207    963 :*
        32   1016   3724 := *
        34   4596  10099 :==== *
        36   9835  20741 :========= *
        38  23408  34278 :==================== *
        40  41534  47814 :=================================== *
        42  53471  58447 :============================================ *
        44  73080  64473 :====================================================*=======
        46  70283  65667 :=====================================================*====
        48  64918  62869 :===================================================*==
        50  65930  57368 :===============================================*=======
        52  47425  50436 :======================================= *
        54  36788  43081 :=============================== *
        56  33156  35986 :============================ *
        58  26422  29544 :====================== *
        60  21578  23932 :================== *
        62  19321  19187 :===============*
        64  15988  15259 :============*=
        66  14293  12060 :=========*==
        68  11679   9486 :=======*==
        70  10135   7434 :======*==
  • 98. FastA Output (continued; the second bar on each row is the enlarged inset of the histogram tail)
        72   8957   5809 :====*===
        74   7728   4529 :===*===
        76   6176   3525 :==*===
        78   5363   2740 :==*==
        80   4434   2128 :=*==
        82   3823   1628 :=*==
        84   3231   1289 :=*=
        86   2474    998 :*==
        88   2197    772 :*=
        90   1716    597 :*=
        92   1430    462 :*=  :===============*========================
        94   1250    358 :*=  :============*===========================
        96    954    277 :*   :=========*=======================
        98    756    214 :*   :=======*===================
       100    678    166 :*   :=====*==================
       102    580    128 :*   :====*===============
       104    476     99 :*   :===*=============
       106    367     77 :*   :==*==========
       108    309     59 :*   :==*========
       110    287     46 :*   :=*========
       112    206     36 :*   :=*======
       114    161     28 :*   :*=====
       116    144     21 :*   :*====
       118    127     16 :*   :*====
      >120    886     13 :*   :*==============================   (Related)
  • 100. • A summary of the statistics and of the program parameters follows the histogram. • An important number in this summary is the Kolmogorov-Smirnov statistic, which indicates how well the actual data fit the theoretical statistical distribution. The lower this value, the better the fit, and the more reliable the statistical estimates. • In general, a Kolmogorov-Smirnov statistic under 0.1 indicates a good fit with the theoretical model. If the statistic is higher than 0.2, the statistics may not be valid, and it is recommended to repeat the search, using more stringent (more negative) values for the gap penalty parameters. FastA Output
  • 101. Statistics summary • Optimal local alignment scores for pairs of random amino acid sequences of the same length follow an extreme-value distribution. For any score S, the probability of observing a score >= S is given by the Karlin-Altschul statistic: $P(\text{score} \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$ • k and λ are parameters related to the position of the maximum and the width of the distribution • Note the long tail at the right. This means that a score several standard deviations above the mean has a higher probability of arising by chance (that is, it is less significant) than if the scores followed a normal distribution.
  • 102. P-values • Many programs report P = the probability that the alignment is no better than random. The relationship between Z and P depends on the distribution of the scores from the control population, which do NOT follow the normal distribution • P <= 10^-100 (exact match) • P in range 10^-100 to 10^-50 (sequences nearly identical, e.g. alleles or SNPs) • P in range 10^-50 to 10^-10 (closely related sequences, homology certain) • P in range 10^-5 to 10^-1 (usually distant relatives) • P > 10^-1 (match probably insignificant)
  • 104. E • For database searches, most programs report E-values. The E-value of an alignment is the expected number of sequences that give the same Z-score or better if the database is probed with a random sequence. E is found by multiplying the value of P by the size of the database probed. Note that E, but not P, depends on the size of the database. Values of P are between 0 and 1. Values of E are between 0 and the number of sequences in the database searched: • E <= 0.02: sequences probably homologous • E between 0.02 and 1: homology cannot be ruled out • E > 1: you would have to expect this good a match just by chance
  • 105.
  • 106.
  • 107. DataBase Searching Dynamic Programming Reloaded Mapping short Read Bowtie / BWA Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast BLAT
  • 108. BLAST is actually a family of programs: • BLASTN - Nucleotide query searching a nucleotide database. • BLASTP - Protein query searching a protein database. • BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database. • TBLASTN - Protein query searching a translated nucleotide (6 frames) database. • TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database. Blast
  • 109. Blast
  • 110. Blast
  • 111. Blast
  • 112. Blast
  • 113. Blast
  • 114. Blast
  • 115. Blast
  • 116.
  • 117.
  • 118.
  • 119.
  • 120.
  • 121.
  • 122.
  • 123.
  • 124. • Be aware of what options you have selected when using BLAST, or FASTA implementations. • Treat BLAST searches as scientific experiments • So you should try your searches with the filters on and off to see whether it makes any difference to the output Tips
  • 125. Tips: Low-complexity and Gapped Blast Algorithm • The common, Web-based BLAST implementations often have default settings that will affect the outcome of your searches. By default all NCBI BLAST implementations filter out biased sequence composition from your query sequence (e.g. signal peptide and transmembrane sequences - beware!). • The SEG program has been implemented as part of the BLAST routine in order to mask low-complexity regions. • Low-complexity regions are denoted by strings of Xs in the query sequence.
  • 126. •The sequence databases contain a wealth of information. They also contain a lot of errors. Contaminants … •Annotation errors, frameshifts that may result in erroneous conceptual translations. •Hypothetical proteins ? •In the words of Fox Mulder, "Trust no one." Tips
  • 127. • Once you get a match to things in the databases, check whether the match is to the entire protein, or to a domain. Don't immediately assume that a match means that your protein carries out the same function (see above). Compare your protein and the match protein(s) along their entire lengths before making this assumption. Tips
  • 128. • Domain matches can also cause problems by hiding other informative matches. For instance if your protein contains a common domain you'll get significant matches to every homologous sequence in the database. BLAST only reports back a limited number of matches, ordered by P value. • If this list consists only of matches to the same domain, cut this bit out of your query sequence and do the BLAST search again with the edited sequence (e.g. NHR). Tips
  • 129. • Do controls wherever possible, in particular when you use a search program for the first time. • Suitable positive controls would be protein sequences known to have distant homologues in the databases, to check how good the software is at detecting such matches. • Negative controls can be employed to make sure the compositional bias of the sequence isn't giving you false positives. Shuffle your query sequence and see what difference this makes to the matches that are returned. A real match should be lost upon shuffling of your sequence. Tips
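A shuffled negative control is easy to generate. The sketch below keeps the amino acid composition of the query but destroys the residue order; the function name and the example sequence are made up for illustration.

```python
import random

def shuffled_control(sequence, seed=None):
    """Return the same residues in random order (composition is preserved)."""
    rng = random.Random(seed)  # a seed makes the control reproducible
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

query = "MKKLLPTAAAGLLLLAAQPAMAKKKCKHVFCRCKKCFCKCV"  # hypothetical query
control = shuffled_control(query, seed=1)
print(control)
```

Submitting `control` with the same program and parameters as the real query shows which hits are driven by composition alone: those hits survive shuffling, whereas genuine matches should disappear.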
  • 136. • BLAST's major advantage is its speed. • 2-3 minutes for BLAST versus several hours for a sensitive FastA search of the whole of GenBank. • When both programs use their default settings, BLAST is usually more sensitive than FastA for detecting protein sequence similarity, • since it doesn't require a perfect sequence match in the first stage of the search. FastA vs. Blast
  • 137. Weakness of BLAST: • The long word size it uses in the initial stage of DNA sequence similarity searches was chosen for speed, and not sensitivity. • For a thorough DNA similarity search, FastA is the program of choice, especially when run with a lowered KTup value. • FastA is also better suited to the specialised task of detecting genomic DNA regions using a cDNA query sequence, because it allows the use of a gap extension penalty of 0. BLAST, which only creates ungapped alignments, will usually detect only the longest exon, or fail altogether. • In general, a BLAST search using the default parameters should be the first step in a database similarity search strategy. In many cases, this is all that may be required to yield all the information needed, in a very short time. FastA vs. Blast
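The effect of word size on sensitivity can be illustrated with a toy example: two DNA sequences that differ at every 8th position share no exact word of length 11, but still share several words of length 6. The sequences and helper names below are invented for the illustration and this is not how either program is implemented internally.

```python
def shared_words(query, subject, k):
    """Count query positions whose k-letter word occurs exactly in the subject."""
    subject_words = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return sum(
        1
        for i in range(len(query) - k + 1)
        if query[i:i + k] in subject_words
    )

def mutate_every(seq, step, base="T"):
    """Replace every step-th letter by base to mimic a diverged sequence."""
    return "".join(base if (i + 1) % step == 0 else c for i, c in enumerate(seq))

subject = "ACGGACCAGA" * 4          # toy sequence containing no T
query = mutate_every(subject, 8)    # one substitution every 8 letters

hits_long = shared_words(query, subject, 11)
hits_short = shared_words(query, subject, 6)
print(hits_long, hits_short)
```

With the long word size no seed is found at all, so the similarity would never reach the alignment stage, while the short word size still yields seeds. This is the trade-off between speed and sensitivity described above.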
  • 138. DataBase Searching Dynamic Programming Reloaded Mapping short Read Bowtie / BWA Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast BLAT
  • 139. 1. Old (ungapped) BLAST 2. New BLAST (allows gaps) 3. Profile -> PSI Blast - Position Specific Iterated • Strategy: multiple alignment of the hits; calculates a position-specific score matrix; searches with this matrix • In many cases is much more sensitive to weak but biologically relevant sequence similarities • PSSM !!! PSI-Blast
  • 140. • Patterns of conservation from the alignment of related sequences can aid the recognition of distant similarities. • These patterns have been variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models. For each position in the derived pattern, every amino acid is assigned a score. (1) Highly conserved residue at a position: that residue is assigned a high positive score, and others are assigned high negative scores. (2) Weakly conserved positions: all residues receive scores near zero. (3) Position-specific scores can also be assigned to potential insertions and deletions. PSI-Blast
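The scoring idea above can be sketched as a small log-odds matrix built from an ungapped alignment. The assumptions are loud ones: a uniform background frequency for all 20 amino acids, a simple pseudocount, no sequence weighting and no gap columns. PSI-BLAST's real profile construction is more elaborate, and the toy alignment is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

alignment = ["CKHV", "CKHI", "CRHV", "CKHL"]  # toy ungapped alignment
pssm = build_pssm(alignment)
print(round(pssm[0]["C"], 2), round(pssm[0]["A"], 2))
print(score(pssm, "CKHV"), score(pssm, "WWWW"))
```

In the fully conserved first column, C receives a clearly positive score and all other residues a negative one, while in the variable last column the scores are spread more evenly.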
  • 141. Pattern • a set of alternative sequences, using “regular expressions” • Prosite (http://www.expasy.org/prosite/)
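Patterns written in PROSITE syntax map closely onto ordinary regular expressions. The converter below is a simplified sketch covering the common elements (`x` for any residue, `[..]` for alternatives, `{..}` for exclusions, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends); the function name is hypothetical and the two patterns are used purely as syntax examples.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")          # patterns may end with a period
    regex = regex.replace("-", "")       # '-' only separates positions
    regex = regex.replace("x", ".")      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # (2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")   # sequence ends
    return regex

glyco = prosite_to_regex("N-{P}-[ST]-{P}")
finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(glyco)
print(finger)
print(bool(re.search(glyco, "ANGSA")), bool(re.search(glyco, "ANPSA")))
```

A pattern either matches or it does not, so unlike a PSSM it gives no graded score for near misses.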
  • 142. PSSM (Position Specific Scoring Matrix)
  • 143. PSSM (Position Specific Scoring Matrix)
  • 144. PSSM (Position Specific Scoring Matrix)
  • 145. •The power of profile methods can be further enhanced through iteration of the search procedure. • After a profile is run against a database, new similar sequences can be detected. A new multiple alignment, which includes these sequences, can be constructed, a new profile abstracted, and a new database search performed. • The procedure can be iterated as often as desired or until convergence, when no new statistically significant sequences are detected. PSI-Blast
  • 146. (1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program. (2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions. (3) The profile is compared to the protein database, again seeking local alignments using the BLAST algorithm. (4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments. (5) Finally, PSI-BLAST iterates, by returning to step (2), a specified number of times or until convergence. PSI-Blast
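The control flow of steps (1) to (5) can be sketched independently of any real search engine. In the sketch below `search` and `build_profile` are hypothetical callables supplied by the caller, and the toy "database" is just a table of which sequence can detect which other sequence; it only illustrates iteration until convergence, not PSI-BLAST itself.

```python
def iterate_search(query, search, build_profile, max_iterations=5):
    """Repeat search -> rebuild profile until no new hits appear."""
    included = set()
    profile = build_profile(query, [])   # first pass: the query on its own
    for iteration in range(1, max_iterations + 1):
        hits = set(search(profile))
        if hits <= included:             # nothing new: converged
            return included, iteration
        included |= hits
        profile = build_profile(query, sorted(included))
    return included, max_iterations

# Toy stand-ins (assumptions for illustration only).
DETECTS = {"q": {"a"}, "a": {"b"}, "b": set(), "c": set()}

def toy_build_profile(query, members):
    return frozenset([query, *members])

def toy_search(profile):
    found = set()
    for member in profile:
        found |= DETECTS[member]
    return found

family, rounds = iterate_search("q", toy_search, toy_build_profile)
print(sorted(family), rounds)
```

In the toy run, sequence `b` is only found in the second round, once `a` has been added to the profile. This is the gain in sensitivity from iteration, and also the reason a single false positive can pull in further unrelated sequences.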
  • 152. PSI-BLAST pitfalls • Avoid sequences that are too close: overfit! • Can include false homologues! Therefore check the matches carefully: include or exclude sequences based on biological knowledge. • The E-value reflects the significance of the match to the previous training set, not to the original sequence! • Choose your query sequence carefully. • Try the reverse experiment to confirm.
  • 153. • A single sequence is selected from a set of blocks and enriched by replacing the conserved regions delineated by the blocks by consensus residues derived from the blocks. • Embedding consensus residues improves performance • S. Henikoff and J.G. Henikoff; Protein Science (1997) 6:698-705. Reduce overfitting risk by Cobbler
  • 154. DataBase Searching Dynamic Programming Reloaded Mapping short Read Bowtie / BWA Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast BLAT
  • 156. PHI-Blast Local Blast From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
  • 160. DataBase Searching Dynamic Programming Reloaded Mapping short Read Bowtie / BWA Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast BLAT
  • 161. Installing Blast Locally • 2 flavors: NCBI/WuBlast • Executables: • ftp://ftp.ncbi.nih.gov/blast/executables/ • Database: • ftp://ftp.ncbi.nih.gov/blast/db/ • Formatdb • formatdb -i ecoli.nt -p F • formatdb -i ecoli.protein -p T • For options: blastall - • blastall -p blastp -i query -d database -o output
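The commands on this slide can also be driven from a script. The sketch below only assembles the argument lists using the flags shown on the slide, and runs them only if the executables are actually installed; the helper names and the file names `query.fasta` and `hits.txt` are hypothetical. Note that `formatdb` and `blastall` belong to the legacy NCBI toolkit, so a current installation may not provide them.

```python
import shutil
import subprocess

def formatdb_command(fasta_file, is_protein):
    """formatdb -i <file> -p T|F, as shown on the slide."""
    return ["formatdb", "-i", fasta_file, "-p", "T" if is_protein else "F"]

def blastall_command(program, query, database, output):
    """blastall -p <program> -i <query> -d <database> -o <output>."""
    return ["blastall", "-p", program, "-i", query, "-d", database, "-o", output]

def run_if_available(command):
    """Run the command only when its executable is on the PATH."""
    if shutil.which(command[0]) is None:
        print("not installed:", command[0])
        return None
    return subprocess.run(command, check=True)

index_cmd = formatdb_command("ecoli.protein", is_protein=True)
search_cmd = blastall_command("blastp", "query.fasta", "ecoli.protein", "hits.txt")
print(" ".join(index_cmd))
print(" ".join(search_cmd))
```

Calling `run_if_available(index_cmd)` followed by `run_if_available(search_cmd)` would format the database and then search it, provided the legacy tools are present.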
  • 162. DataBase Searching Dynamic Programming Reloaded Mapping short Read Bowtie / BWA Database Searching Fasta Blast Statistics Practical Guide Extentions PSI-Blast PHI-Blast Local Blast BLAT
  • 163. Main database: BLAT • BLAT: BLAST-Like Alignment Tool • Aligns the input sequence to the Human Genome • Connected to several databases, like: • mRNAs - GenScan • ESTs - TwinScan • RepeatMasker - UniGene • RefSeq - CpG Islands
  • 164. - BLAT (compared with existing tools) - more accurate - 500 times faster in mRNA/DNA alignment - 50 times faster in protein/protein alignment - BLAT's steps 1. uses non-overlapping k-mers to create an index 2. uses the index to find homologous regions 3. aligns these regions separately 4. stitches these aligned regions into a larger alignment 5. revisits small internal exons possibly missed in the first stage and adjusts large gap boundaries that have canonical splice sites where feasible
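Steps 1 and 2 above can be sketched as follows: the genome is cut into non-overlapping k-mers to build the index, and every overlapping k-mer of the query is then looked up in it. The sequences, the value of k and the function names are assumptions made for this toy example; the later alignment and stitching steps are not shown.

```python
from collections import defaultdict

def build_index(genome, k):
    """Index the genome by its non-overlapping k-mers (step 1)."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index

def find_seeds(query, index, k):
    """Look up every overlapping query k-mer in the index (step 2)."""
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for g_pos in index.get(query[q_pos:q_pos + k], ()):
            seeds.append((q_pos, g_pos))
    return seeds

genome = "ACGTTGCAAGGCTTAACCGGATCG"   # toy genome
query = genome[9:18]                  # a fragment taken from the genome
index = build_index(genome, 4)
seeds = find_seeds(query, index, 4)
print(seeds)
```

Indexing only non-overlapping k-mers keeps the index small, and scanning all overlapping k-mers of the query ensures that a matching region long enough to span a full indexed k-mer is still found. The difference `g_pos - q_pos` of a seed gives the offset at which the query should be aligned.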
  • 165.
  • 166. Weblems W5.1: Submit the amino acid sequence of papaya papain to a BLAST (gapped and ungapped) and to a PSI-BLAST search. What are the main differences in the results? W5.2: Is there a relationship between Klebsiella aerogenes urease, Pseudomonas diminuta phosphotriesterase and mouse adenosine deaminase? Also use DALI, ClustalW and T-coffee. W5.3: Yeast two-hybrid typically yields DNA sequences. How would you find the corresponding protein? W5.4: When and why would you use tblastn? W5.5: How would you search a database if you want to restrict the search space to those entries having a secretion signal consisting of 4 consecutive (N-terminal) basic residues?