Dr Avril Coghlan discusses the BLAST algorithm for comparing biological sequences and searching databases of DNA and protein sequences. BLAST is a fast heuristic method for sequence alignment and database searching. It works by first finding short words that are common between the query sequence and database sequences, and then extending the alignment around these words. BLAST is able to quickly search very large databases and find significant matches by calculating E-values, which estimate the statistical significance of matches. BLAST allows researchers to determine if a new sequence is similar to any known sequences and predict potential functions.
BLAST
1. BLAST
Dr Avril Coghlan
alc@sanger.ac.uk
Note: this talk contains animations which can only be seen by
downloading and using ‘View Slide show’ in Powerpoint
2. • Sequence alignment has many uses
Sequence assembly – genome sequences are assembled by using
sequence alignment methods to find overlaps between many short
pieces of DNA
Gene finding – alignment of whole genome sequences from two or more
species can aid in discovery of previously unknown genes
Sequence divergence – the amount of sequence similarity between
sequences (which can be calculated from a sequence alignment) tells us
how closely they are related
Database searching – we use fast sequence alignment methods (eg.
BLAST) to determine whether a protein/DNA sequence is similar to any
known sequence
Prediction of function – if we know the function of a sequence, we can
predict the function of similar sequences identified by database searching
(eg. for fruitfly eyeless gene)
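As a rough illustration of the 'sequence divergence' use above, here is a minimal Python sketch that computes percent identity from two already-aligned sequences. The function name and the toy aligned fragments are made up for illustration, and skipping gap columns is only one of several conventions for counting identity.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy aligned fragments (hypothetical, for illustration only).
print(percent_identity("ACDE-FG", "ACDQWFG"))
```

The lower the percent identity, the more the two sequences have diverged.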
3. BLAST
• The number of DNA and protein sequences in public
databases is very large
NCBI Protein database has ~38,500,000 protein sequences
• Searching a database involves aligning the query sequence to each
sequence in the database, to find significant local alignments
[Figure: query sequence A (eg. the predicted protein from a candidate gene (ORF), VIVALASVEGAS) is aligned to each database sequence B in turn (TARQDEFGGA, VIVADAVIS, IRYDDEQAKM, KQIRALQPSTQRE, GHQIALMPLKMVQRR, ASTILHGGQWLC, etc.)]
4. BLAST
• Needleman-Wunsch & Smith-Waterman are too slow
for searching databases
• Fast ‘heuristic’ methods are used eg. BLAST
N.B. ‘heuristic’ means they’re not guaranteed to find the best solution
(best alignment here), but they work okay
• BLAST was developed by Stephen Altschul &
colleagues at NCBI in 1990
NCBI = National Center for Biotechnology Information (USA)
BLAST = ‘Basic Local Alignment Search Tool’
• The most used bioinformatics program
Altschul’s 1997 paper on BLAST has been cited >26,000 times!
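To give a feel for why Smith-Waterman is too slow for database searching, here is a minimal Python sketch of the Smith-Waterman local alignment score. It fills one table cell for every pair of residues, so the work grows with (query length) x (database sequence length). The match/mismatch/gap scores and the average database sequence length of 350 are assumptions chosen for illustration; real protein searches use a substitution matrix such as BLOSUM62.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, using a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def dp_cells(query_length, n_sequences, average_length):
    """Number of table cells filled when aligning one query to every database sequence."""
    return query_length * n_sequences * average_length


query = "ADSKLWLLFKSLMNDKPFKKADFF"
database = ["HIRTHIQLEQEWDSALIAAIQLE", "PDADSTESKLAKAIQLFVCTTILCYT"]
for subject in database:
    print(subject, smith_waterman_score(query, subject))

# 898-residue query against ~38,500,000 sequences; the average length 350 is hypothetical.
print(dp_cells(898, 38_500_000, 350))
```

With these assumed numbers a single query needs on the order of 10^13 table cells, which is the cost a heuristic such as BLAST tries to avoid.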
5. There are two main steps in BLAST
1 It makes a list of words of length k (eg. k = 3 amino
acids) in the query sequence
It then looks for database sequences that share these words
Database sequences that share many words with the query are used for
the final alignments (step 2 )
Query sequence: ADSKLWLLFKSLMNDKPFKKADFF
3-letter words (amino acids): ADS, DSK, SKL, ...
Database sequence 1: HIRTHIQLEQEWDSALIAAIQLE (doesn't share words)
Database sequence 2: PDADSTESKLAKAIQLFVCTTILCYT (shares words ADS and SKL)
etc.
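Step 1 can be sketched in Python as follows, using the query and the two database sequences from this slide. This is a simplified illustration that only looks for exactly shared words; the function names are made up here, and the real BLAST program also considers similar (not only identical) words.

```python
def words(seq, k=3):
    """All overlapping words of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_words(query, subject, k=3):
    """Words of length k found in both sequences, in query order, without repeats."""
    subject_words = set(words(subject, k))
    found = []
    for word in words(query, k):
        if word in subject_words and word not in found:
            found.append(word)
    return found


query = "ADSKLWLLFKSLMNDKPFKKADFF"
database = {
    "Database sequence 1": "HIRTHIQLEQEWDSALIAAIQLE",
    "Database sequence 2": "PDADSTESKLAKAIQLFVCTTILCYT",
}

print(words(query)[:3])  # ['ADS', 'DSK', 'SKL']
for name, subject in database.items():
    print(name, shared_words(query, subject))
# Database sequence 1 []
# Database sequence 2 ['ADS', 'SKL']
```

Only database sequences that share words with the query would go forward to step 2.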
6. 2 For a database sequence that shares many words
with the query, it makes an alignment
A local alignment of the query & the database sequence
The alignment contains the initial region with shared words
However, the alignment may extend beyond that initial region
• BLAST finds islands of similarity between sequences
Given two sequences A and B, BLAST makes local alignments of pairs of
subsequences of A and B
[Figure: sequences A and B, with three local alignments between them: alignment 1, alignment 2, alignment 3]
• BLAST reports local alignments between the query
sequence A and a database sequence B
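The idea that the alignment starts at a shared word and may extend beyond it can be sketched like this in Python. This is a simplified, ungapped extension with an assumed +1/-1 scoring scheme and an assumed drop-off limit (`x_drop`); it is not the exact procedure used by the BLAST program, and the sequences are toy examples.

```python
def extend_seed(query, subject, q_start, s_start, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend a shared word in both directions without gaps.

    Extension in each direction stops when the running score falls more than
    x_drop below the best score seen so far in that direction.
    Returns (score, aligned query segment, aligned subject segment).
    """
    def score(a, b):
        return match if a == b else mismatch

    def extend(step, q_pos, s_pos):
        best, running, best_len, length = 0, 0, 0, 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            q_pos += step
            s_pos += step
        return best, best_len

    seed_score = sum(score(query[q_start + i], subject[s_start + i]) for i in range(k))
    right_score, right_len = extend(+1, q_start + k, s_start + k)
    left_score, left_len = extend(-1, q_start - 1, s_start - 1)

    q_begin, s_begin = q_start - left_len, s_start - left_len
    total_len = left_len + k + right_len
    return (seed_score + left_score + right_score,
            query[q_begin:q_begin + total_len],
            subject[s_begin:s_begin + total_len])


# Toy sequences (hypothetical) sharing the word AYI at position 4 in both.
print(extend_seed("AAKTAYIAKQRCC", "GGKTAYIAKQRDD", 4, 4))
# (9, 'KTAYIAKQR', 'KTAYIAKQR')
```

The reported region is an island of similarity: it is longer than the initial word but stops where the two sequences stop matching.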
7. • You can use BLAST to search many sequence
databases (eg. NCBI or UniProt) via websites
• Compares a DNA/protein query sequence to a
sequence database and calculates the statistical
significance (E-value) of matches
• Website for searching GenBank and other NCBI
sequence databases:
http://www.ncbi.nlm.nih.gov/BLAST
Can be used to search the NCBI Nucleotide database (DNA
sequences), as well as the NCBI Protein database
• Four commonly used types of BLAST search:
BLASTP: searches a protein database with a protein query
BLASTN: searches DNA/RNA database with DNA/RNA query
BLASTX: searches a protein database with DNA/RNA query
TBLASTN: searches DNA/RNA database with protein query
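The choice between these four searches depends only on the type of the query and the type of the database, which can be written as a small lookup table. This Python sketch is just a restatement of the list above; the function name and the "protein"/"nucleotide" labels are chosen here for illustration.

```python
# (query type, database type) -> BLAST search type, as listed on this slide.
BLAST_SEARCH_TYPES = {
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "nucleotide"): "BLASTN",
    ("nucleotide", "protein"): "BLASTX",
    ("protein", "nucleotide"): "TBLASTN",
}


def choose_blast(query_type, database_type):
    """Return the BLAST search type for a given query type and database type."""
    try:
        return BLAST_SEARCH_TYPES[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'") from None


# A predicted protein searched against a DNA database:
print(choose_blast("protein", "nucleotide"))  # TBLASTN
```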
8. FASTA format
• Many programs for sequence analysis/alignment (eg.
CLUSTAL) expect the input sequences to be in FASTA
format
Each sequence is preceded by a header line that starts with “>”
followed by the sequence identifier
>fruitfly
MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGR
PLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLA
AQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLG
TRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENS
NGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDS
PNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPR
LNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVL
SAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSP
WV
>human
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
>mouse
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
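A minimal Python sketch of reading FASTA format is shown below. It assumes the layout described above (a header line starting with ">" followed by one or more sequence lines) and takes the first word of the header as the sequence identifier; the example text uses only the first few residues of the sequences above, and the function name is made up here.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("header line has no sequence identifier")
            name = header[0]
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header line")
        else:
            # A sequence may be wrapped over several lines; collect them all.
            records[name].append(line)
    return {identifier: "".join(parts) for identifier, parts in records.items()}


example = """>fruitfly
MFTLQPTPTA
IGTVVPPWSA
>human
MQNSHSGVNQ
"""
print(parse_fasta(example))
# {'fruitfly': 'MFTLQPTPTAIGTVVPPWSA', 'human': 'MQNSHSGVNQ'}
```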
9. • You can use BLAST to search many sequence
databases (eg. NCBI or UniProt) via websites
eg., we can use the fruitfly Eyeless protein sequence as a BLAST query
sequence to search the UniProt database:
MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGV
NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQE
NVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYE
KLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPP
NDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLA
GKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGID
SSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSF
NHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAAS
SASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV
(Fruitfly Eyeless, 898 amino acids long)
We go to www.uniprot.org and click on ‘Blast’ at the top:
10. • You will get a list of BLAST hits (database sequences
with good alignments to your query, ie. to fruitfly
Eyeless here):
11. • Each BLAST hit may have several local alignments to
the query sequence
eg. the fruitfly Eyeless has human Eyeless as a BLAST hit, and
several local alignments are reported for this pair:
12. • BLAST assesses the statistical significance of
high-scoring database matches
• For each alignment between the query and a
database protein, it calculates an E-value
• E-value: the number of database matches of a
certain alignment score expected by chance, in a
database of the size searched
• The lower the E-value, the more significant the
alignment score for the sequence match
E=1 means that we expect 1 match of that alignment score just by
chance, in a database of the size searched
E=10^-5 means that we expect to see 10^-5 matches of that alignment score
just by chance, in a database of that size
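The slides do not give a formula, but in standard BLAST statistics the E-value for an alignment with bit score S' is commonly written as E = m x n x 2^(-S'), where m is the query length and n is the total length of the database. The Python sketch below uses that relation with made-up numbers; the real program applies further corrections (such as effective lengths), so treat this only as an illustration of how E depends on score and database size.

```python
def expected_hits(bit_score, query_length, database_length):
    """E-value from a bit score, using the relation E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers: a 100-residue query and a database of 1,000,000 residues.
for bit_score in (20, 30, 40, 50):
    print(bit_score, expected_hits(bit_score, 100, 1_000_000))

# The same bit score in a database 10 times larger gives a 10 times larger E-value.
print(expected_hits(30, 100, 10_000_000) / expected_hits(30, 100, 1_000_000))
```

This shows why the E-value is always stated for 'a database of the size searched': each extra bit of score halves E, while a larger database increases it in proportion.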
13. • Significant BLAST hits are possibly homologues
• We use the E-value to judge if the database
sequence is a homologue of the query
If E ≤ 10^-5, we are confident that the hit is a homologue
If E is between 10^-5 and 10, we are not sure if the hit is a homologue
If E is > 10, we are doubtful that the hit is a homologue
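These rules of thumb can be written as a small Python function. The thresholds are the ones given on this slide; the function name and the wording of the returned labels are chosen here for illustration.

```python
def judge_hit(e_value):
    """Apply the slide's rule-of-thumb E-value thresholds to one BLAST hit."""
    if e_value <= 1e-5:
        return "confident that the hit is a homologue"
    if e_value <= 10:
        return "not sure if the hit is a homologue"
    return "doubtful that the hit is a homologue"


for e_value in (1e-30, 0.5, 189):
    print(e_value, judge_hit(e_value))
# 1e-30 confident that the hit is a homologue
# 0.5 not sure if the hit is a homologue
# 189 doubtful that the hit is a homologue
```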
eg. searching UniProt using fruitfly Eyeless as our query:
14. eg. searching the NCBI Protein Database using fruitfly Eyeless as our
query:
............
BLAST matches with high E-values
may not be homologues (although it
is often hard to tell if they are or not!)
15. Problem
• Here’s the output of a BLAST search using the
predicted protein for a gene prediction from
Staphylococcus aureus:
(i) What does an E value of 189 mean?
(ii) Based on the BLAST output, do you think the gene prediction is
likely to correspond to a real gene? If so, can you suggest the
biological function of that gene?
16. Answer
• Here’s the output of a BLAST search using the predicted protein for a
gene prediction from Staphylococcus aureus:
(i) What does an E value of 189 mean? An E-value of 189 means that we
expect to see 189 BLAST hits with an alignment score at least as high as
that of the top BLAST hit (ie. 28.9) just by chance, when we search a
database of the size searched
(ii) Based on the BLAST output, do you think the gene prediction is likely
to correspond to a real gene? If so, can you suggest the biological function
of that gene? An E-value of 189 is high, so we can’t be confident the top
BLAST hit is a homologue of our query. We shouldn’t predict the
function of our query sequence based on such a weak BLAST hit
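One way to see how weak this hit is: under the usual Poisson assumption of BLAST statistics (not stated on the slides), the probability of seeing at least one match with that score by chance is P = 1 - e^(-E). The Python sketch below evaluates this for E = 189 and, for comparison, for E = 10^-5; the function name is made up here.

```python
import math


def p_value_from_e(e_value):
    """P(at least one chance match) = 1 - exp(-E), assuming Poisson-distributed chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


print(p_value_from_e(189))   # essentially 1: a chance match this good is almost certain
print(p_value_from_e(1e-5))  # about 1e-5: for small E, P is approximately equal to E
```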
17. Further Reading
• Chapter 3 in Cristianini & Hahn, Introduction to Computational Genomics
• Chapter 6 in Deonier et al., Computational Genome Analysis
Editor's notes
The figure of ~38,500,000 protein sequences is from searching NCBI Protein for 1:10000000000000000000000[SLEN] on 18-Feb-2011, which returned 38535878 matching protein sequences. Image credit (filing cabinet): http://etc.usf.edu/clipart/13000/13089/file_cabinet_13089_lg.gif