2. News & Views reminder (20% of your course grade): due
March 26, reviewed April 2 (5/20), revisions April 28
(15/20)
Meredith et al. (2014) Evidence for a single loss of
mineralized teeth in the common avian ancestor. Science
Nunez et al. (2015) Integrase-mediated spacer acquisition
during CRISPR-Cas adaptive immunity. Nature
Paul Gardner Homology Search
3. Homology search
In a huge collection of biological sequences, how can you locate
similar sequences?
By using heuristic, very fast sequence alignment methods.
5. BLAST
Identify all 'hits' (matching words) at least W long
Find any hits on the same diagonal of an alignment matrix
Trigger a full alignment in that region
Basic idea: identify near-identical sub-sequences first → align any
hits in full
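The seed-and-diagonal idea above can be sketched in a few lines of Python. This is a minimal illustration, not the BLAST implementation: it uses exact word matches only, the function names and the toy sequences are invented for this example, and the final "full alignment" step is left out.

```python
from collections import defaultdict


def find_word_hits(query, subject, w):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    # Index every length-w word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    # Look up each query word in that index.
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


def group_by_diagonal(hits):
    """Group hits by diagonal (subject_pos - query_pos) of the alignment matrix."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    return dict(diagonals)


def candidate_diagonals(hits, min_hits=2):
    """Diagonals with at least min_hits hits: regions worth aligning in full."""
    grouped = group_by_diagonal(hits)
    return sorted(d for d, members in grouped.items() if len(members) >= min_hits)


# Toy sequences, invented for illustration.
hits = find_word_hits("ACGTACGTTT", "GGACGTACGTCC", w=4)
print(candidate_diagonals(hits))
```

Only the diagonals returned by `candidate_diagonals` would be passed on to a full (and much slower) alignment, which is where the speed-up comes from.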
6. What does that E-value (Expect) mean?
>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome
Length=4537948
Features in this part of subject sequence:
cold-shock DNA-binding domain protein
Score = 57.2 bits (62), Expect = 2e-05
Identities = 78/106 (74%), Gaps = 6/106 (6%)
Strand=Plus/Plus
Query  1       CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
               || |||||||| ||||||||| ||||||    |  | | || |||| ||||     ||||
Sbjct  828507  CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC
Query  57      TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
                | | || |||||| ||| ||||||||||| ||||||  ||| |||
Sbjct  828567  ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
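The "Identities" and "Gaps" lines of the report can be recomputed from the aligned rows. The sketch below copies the Query and Sbjct rows from the hit above; the function name is made up for this example.

```python
# Aligned rows copied from the BLAST hit shown above.
query_rows = [
    "CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC",
    "TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG",
]
subject_rows = [
    "CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC",
    "ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG",
]


def alignment_stats(query_aln, subject_aln):
    """Return (identities, gaps, alignment_length) for two aligned strings."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned rows must have the same length")
    pairs = list(zip(query_aln, subject_aln))
    identities = sum(1 for a, b in pairs if a == b and a != "-")
    gaps = sum(1 for a, b in pairs if a == "-" or b == "-")
    return identities, gaps, len(pairs)


identities, gaps, length = alignment_stats("".join(query_rows), "".join(subject_rows))
print(f"Identities = {identities}/{length} ({round(100 * identities / length)}%)")
print(f"Gaps = {gaps}/{length} ({round(100 * gaps / length)}%)")
```

The counts agree with the report: 78 identical columns and 6 gap columns out of 106.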
7. How can we evaluate the significance of a score?
Note that a bit-score of 57.2 by itself is not that useful.
It depends on the sequence & database size & composition.
To counter this we can compute an Expect-value (E-value).
This is the number of hits expected by chance with at least the
observed score, for the given query and database sizes.
P-values can also be used.
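E-values and P-values are closely related. Assuming the number of chance hits follows a Poisson distribution (the usual assumption in BLAST statistics), the probability of at least one chance hit is P = 1 - exp(-E). A minimal sketch, with a function name chosen for this example:

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but keeps precision when E is tiny.
    return -math.expm1(-e_value)


for e in (2e-05, 0.1, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {p_value_from_e_value(e):.6g}")
```

For small values (such as the E = 2e-05 hit above) E and P are nearly identical; they only diverge once E approaches 1.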
[Figure: "Separating true from false hits". Number of matches (0 to 10,000) plotted against score in bits (0 to 700) for random sequences/negative controls and true homologs/positive controls. A score threshold splits the negatives into true negatives and false positives, and the positives into false negatives and true positives.]
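The four outcomes in the figure can be counted directly once a threshold is chosen. The sketch below is an illustration only: the scores are invented, not taken from the figure, and the function names are hypothetical.

```python
def confusion_counts(positive_scores, negative_scores, threshold):
    """Count outcomes when every score >= threshold is reported as a hit."""
    tp = sum(1 for s in positive_scores if s >= threshold)
    fp = sum(1 for s in negative_scores if s >= threshold)
    return {
        "TP": tp,
        "FN": len(positive_scores) - tp,
        "FP": fp,
        "TN": len(negative_scores) - fp,
    }


def sensitivity_and_specificity(counts):
    """Fraction of true homologs found, and fraction of negatives rejected."""
    sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
    specificity = counts["TN"] / (counts["TN"] + counts["FP"])
    return sensitivity, specificity


# Invented bit scores for illustration only.
true_homologs = [35, 45, 80, 120, 300]
negative_controls = [10, 20, 25, 30, 40]

counts = confusion_counts(true_homologs, negative_controls, threshold=40)
print(counts)
print(sensitivity_and_specificity(counts))
```

Raising the threshold trades false positives for false negatives; lowering it does the opposite.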
8. How can we evaluate the significance of a score?
[Figure repeated: "Separating true from false hits". Number of matches against score (bits) for negative and positive controls, with a threshold separating true/false positives and negatives.]
$E = \kappa M N \, 2^{-\lambda x}$
E: E-value
x: score
M & N: query & database size
κ & λ: fitting parameters
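The formula is easy to explore numerically. The sketch below follows the slide's form of the equation; the parameter values are placeholders chosen for illustration, not fitted values from any real scoring system.

```python
def expect_value(score, query_len, db_len, kappa, lam):
    """E = kappa * M * N * 2**(-lam * x), in the form given on the slide."""
    return kappa * query_len * db_len * 2.0 ** (-lam * score)


# Placeholder parameters (kappa = lam = 1), for illustration only.
base = expect_value(score=10, query_len=100, db_len=1000, kappa=1.0, lam=1.0)
bigger_db = expect_value(score=10, query_len=100, db_len=2000, kappa=1.0, lam=1.0)
one_more_bit = expect_value(score=11, query_len=100, db_len=1000, kappa=1.0, lam=1.0)

print(base)          # 97.65625
print(bigger_db)     # twice as many chance hits in a database twice the size
print(one_more_bit)  # half as many chance hits for one extra unit of score
```

Two consequences follow directly from the formula: E grows linearly with the search space M x N, and it falls exponentially with the score. This is why the same score can be significant in a small database and unremarkable in a large one.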
9. BLAST is not the only, or the best, tool for the job!
10. Profile-based homology search
Krogh, A. et al. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J Mol
Biol.
Image provided by Eric Nawrocki.
12. Profile HMMs are slightly more complicated
A tree-weighting scheme takes care of unbalanced
alignments
Dirichlet-mixture priors are used to incorporate information
about amino-acid biochemistry
Effective sequence number is used to down-weight priors
when many sequences are available
Transition probabilities to Insert & Delete states are estimated
from the alignment
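To make the role of priors concrete, here is a deliberately simplified sketch of estimating match-state emission probabilities from alignment columns. It swaps in a uniform (Laplace) pseudocount for the Dirichlet-mixture priors, and it omits tree weighting, effective sequence number and transition probabilities. The function name and the toy alignment are invented for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_emissions(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Per-column emission probabilities with a uniform pseudocount.

    alignment: list of equal-length aligned strings; '-' marks a gap.
    The uniform pseudocount is a stand-in for Dirichlet-mixture priors.
    """
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(n_cols):
        # Count observed residues in this column, ignoring gaps.
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile


# Toy RNA alignment with a four-letter alphabet, for illustration only.
toy = ["ACG", "ACG", "AUG", "A-G"]
profile = column_emissions(toy, alphabet="ACGU")
print(profile[0])  # column 1 is dominated by A
print(profile[1])  # column 2 is split between C and U
```

With few sequences the pseudocount dominates; with many sequences the observed counts take over, which is the same effect the effective sequence number is used to control.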
13. Why not just use BLAST?
ACCURACY!
Every benchmark of homology search tools has shown that
profile methods are more accurate than single-sequence
methods.
Eddy (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.
14. Why not just use BLAST?
SPEED! To search a single query vs a database of all proteins:
BLAST: searches 42 million UniProt sequences
HMMER: searches 15,000 Pfam profiles
The search space is ∼3,000x smaller for profiles
Save Planet Earth, use HMMER3
Eddy (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.
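The "∼3,000x" figure is just the ratio of the two target counts quoted above. Note that this compares numbers of database entries only; it says nothing about the cost of each individual comparison.

```python
uniprot_sequences = 42_000_000  # BLAST targets, as quoted above
pfam_profiles = 15_000          # HMMER targets, as quoted above

ratio = uniprot_sequences / pfam_profiles
print(f"{ratio:,.0f}x fewer targets when searching profiles")
```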
15. Pfam
What is a Pfam-A Entry?
[Diagram: the files of a Pfam-A entry (SEED, HMM, OUTPUT, ALIGN, DESC) and the HMMER programs that connect them (hmmbuild, hmmsearch, hmmalign).]
Slide borrowed from Rob Finn.
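One plausible reading of the diagram, sketched as HMMER command lines assembled in Python. The order of the steps and the mapping of files to programs are assumptions made for this example, and `seqdb.fasta` and `hits.fasta` are hypothetical file names. The commands are only built and printed here, not executed, so HMMER does not need to be installed.

```python
import shlex


def pfam_style_commands(seed="SEED", hmm="HMM", seqdb="seqdb.fasta",
                        hits="hits.fasta", output="OUTPUT", align="ALIGN"):
    """Build (but do not run) one HMMER command per step of the assumed pipeline."""
    return [
        # Build a profile HMM from the curated seed alignment.
        ["hmmbuild", hmm, seed],
        # Search the profile against a sequence database; -o saves the report.
        ["hmmsearch", "-o", output, hmm, seqdb],
        # Align the matching sequences back to the profile; -o saves the alignment.
        ["hmmalign", "-o", align, hmm, hits],
    ]


for command in pfam_style_commands():
    print(shlex.join(command))
```

To run a step for real, each list could be passed to `subprocess.run(command, check=True)` once the input files exist.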
16. But, what about RNA?
[Figure: three RNA secondary-structure diagrams, each drawn 5' to 3' with consensus nucleotides (IUPAC codes) and a sequence conservation scale from 0 to 1.]
18. Benchmark
Freyhult, Bollback & Gardner (2007) Exploring genomic dark matter: A critical assessment of the performance of
homology search methods on noncoding RNA. Genome Research.
20. Relevant reading
Reviews:
Eddy SR (2004) What is a hidden Markov model? Nature
Biotechnology.
Methods:
Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a
new generation of protein database search programs. Nucleic
Acids Research.
Eddy SR (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.