BLAST

                    Dr Avril Coghlan
                   alc@sanger.ac.uk

Note: this talk contains animations which can only be seen by
downloading it and using ‘View Slide show’ in PowerPoint
• Sequence alignment has many uses
  Sequence assembly – genome sequences are assembled by using
  sequence alignment methods to find overlaps between many short
  pieces of DNA
  Gene finding – alignment of whole genome sequences from two or more
  species can aid in discovery of previously unknown genes
  Sequence divergence – the amount of sequence similarity between
  sequences (which can be calculated from a sequence alignment) tells us
  how closely they are related
  Database searching – we use fast sequence alignment methods (eg.
  BLAST) to determine whether a protein/DNA sequence is similar to any
  known sequence
  Prediction of function – if we know the function of a sequence, we can
  predict the function of similar sequences identified by database searching
  (eg. for fruitfly eyeless gene)
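
As a small illustration of the "sequence divergence" point, sequence similarity can be calculated from a pairwise alignment as percent identity. This is only a sketch: the function name is hypothetical, gaps are assumed to be written as "-", and the denominator is taken to be the full alignment length (other definitions are also in use).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns in which both rows hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if it holds the same letter in both
    # rows and that letter is not a gap character.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)


# 3 identical columns (A, C, G) out of 5 alignment columns
print(percent_identity("AC-GT", "ACTGA"))
```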
BLAST
  • The number of DNA and protein sequences in public
    databases is very large
    The NCBI Protein database had ~38,500,000 protein sequences (on 18 Feb 2011)
  • Searching a database involves aligning the query sequence to each
    sequence in the database, to find significant local alignments


  eg. query sequence A is the predicted protein from a candidate gene (ORF):
      VIVALASVEGAS
  Align A to each database sequence B:
      TARQDEFGGA
      VIVADAVIS
      IRYDDEQAKM
      KQIRALQPSTQRE
      GHQIALMPLKMVQRR
      ASTILHGGQWLC
      etc.
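
The exhaustive search described above (align the query A to every database sequence B) can be sketched as follows. This is an illustration only, not how BLAST works: it computes a Smith-Waterman local alignment score for every query/database pair, and the scoring values (match +2, mismatch -1, gap -2) and function names are assumptions chosen for the example rather than a real substitution matrix.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query: str, database: list[str]) -> list[tuple[int, str]]:
    """Score the query against every database sequence, best score first."""
    return sorted(((smith_waterman_score(query, s), s) for s in database),
                  reverse=True)


database = ["TARQDEFGGA", "VIVADAVIS", "IRYDDEQAKM",
            "KQIRALQPSTQRE", "GHQIALMPLKMVQRR", "ASTILHGGQWLC"]
for score, sequence in search("VIVALASVEGAS", database):
    print(score, sequence)
```

Each pairwise comparison fills a table with (length of A) x (length of B) cells, and this is repeated for every database sequence, which is why the approach becomes slow for databases with millions of sequences.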
BLAST
• Needleman-Wunsch & Smith-Waterman are too slow
  for searching databases
• Fast ‘heuristic’ methods are used eg. BLAST
  N.B. ‘heuristic’ means they’re not guaranteed to find the best solution
  (the best alignment here), but they usually work well in practice
• BLAST was developed by Stephen Altschul &
  colleagues at NCBI in 1990
  NCBI = National Center for Biotechnology Information (USA)
  BLAST = ‘Basic Local Alignment Search Tool’
• The most used bioinformatics program
  Altschul’s 1997 paper on BLAST has been cited >26,000 times!
There are two main steps in BLAST
1 It makes a list of words of length k (eg. k = 3 amino
  acids) in the query sequence
  It then looks for database sequences that share these words
  Database sequences that share many words with the query are used for
  the final alignments (step 2)


  Query sequence         ADSKLWLLFKSLMNDKPFKKADFF
  3-amino-acid words     ADS, DSK, SKL, ...
  Database sequence 1    HIRTHIQLEQEWDSALIAAIQLE       (doesn’t share words)
  Database sequence 2    PDADSTESKLAKAIQLFVCTTILCYT    (shares words ADS and SKL)
  etc.
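
Step 1 can be sketched in a few lines, using the example sequences above. This is a simplified illustration under the assumption that only exact word matches are counted, as on this slide; the function names are hypothetical.

```python
def words(sequence: str, k: int = 3) -> set[str]:
    """All overlapping words of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_words(query: str, subject: str, k: int = 3) -> set[str]:
    """Words of length k that occur in both the query and the database sequence."""
    return words(query, k) & words(subject, k)


query = "ADSKLWLLFKSLMNDKPFKKADFF"
database = {
    "sequence 1": "HIRTHIQLEQEWDSALIAAIQLE",
    "sequence 2": "PDADSTESKLAKAIQLFVCTTILCYT",
}
for name, subject in database.items():
    print(name, sorted(shared_words(query, subject)))
# sequence 1 []
# sequence 2 ['ADS', 'SKL']
```

Only database sequence 2 shares words with the query, so only it would be passed on to step 2.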
2 For a database sequence that shares many words
  with the query, it makes an alignment
  A local alignment of the query & the database sequence
  The alignment contains the initial region with shared words
  However, the alignment may extend beyond that initial region
• BLAST finds islands of similarity between sequences
  Given two sequences A and B, BLAST makes local alignments of pairs of
  subsequences of A and B
  [Figure: three separate local alignments (alignment 1, alignment 2,
  alignment 3) between subsequences of A and B]
• BLAST reports local alignments between the query
  sequence A and a database sequence B
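
The idea in step 2, that the alignment starts from a region of shared words and may extend beyond it, can be sketched as an extension without gaps. This is a simplified illustration, not BLAST's actual procedure: the scoring (+1 for a match, -1 for a mismatch), the drop-off limit `x_drop` and all names are assumptions made for the example.

```python
def extend_seed(query: str, subject: str, q_start: int, s_start: int,
                k: int = 3, match: int = 1, mismatch: int = -1,
                x_drop: int = 2) -> tuple[int, str, str]:
    """Extend an exact k-letter seed match in both directions, without gaps."""

    def extend(qi: int, si: int, step: int) -> tuple[int, int]:
        # Walk away from the seed; remember the best score seen and how many
        # letters it covers; give up once the score falls too far below it.
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            si += step
        return best, best_len

    right_score, right_len = extend(q_start + k, s_start + k, +1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)
    score = k * match + left_score + right_score
    length = left_len + k + right_len
    q_lo, s_lo = q_start - left_len, s_start - left_len
    return score, query[q_lo:q_lo + length], subject[s_lo:s_lo + length]


# The shared word "AYI" starts at position 3 in both (made-up) sequences.
print(extend_seed("MKTAYIAKQR", "GGTAYIAKQW", 3, 3))
# (7, 'TAYIAKQ', 'TAYIAKQ')
```

The reported region is longer than the 3-letter seed, which is the "island of similarity" around the shared word.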
• You can use BLAST to search many sequence
  databases (eg. NCBI or UniProt) via websites
• BLAST compares a DNA/protein query sequence to a
  sequence database and calculates the statistical
  significance of matches (reported as an E-value)
• Website for searching GenBank and other NCBI
  sequence databases:
  http://www.ncbi.nlm.nih.gov/BLAST
  Can be used to search the NCBI Nucleotide database (DNA
  sequences), as well as the NCBI Protein database
• Four commonly used types of BLAST search:
  BLASTP: searches a protein database with a protein query
  BLASTN: searches DNA/RNA database with DNA/RNA query
  BLASTX: searches a protein database with DNA/RNA query
  TBLASTN: searches DNA/RNA database with protein query
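
The choice of search type depends only on whether the query and the database are protein or DNA/RNA, so the four cases above can be written as a lookup table. This is a sketch with hypothetical names; it is not part of any BLAST software.

```python
# (query type, database type) -> BLAST program, as listed on the slide
BLAST_PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "nucleotide"): "BLASTN",
    ("nucleotide", "protein"): "BLASTX",
    ("protein", "nucleotide"): "TBLASTN",
}


def choose_blast_program(query_type: str, database_type: str) -> str:
    """Return the BLAST search type for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            "types must be 'protein' or 'nucleotide', got "
            f"{query_type!r} and {database_type!r}"
        ) from None


# A protein query (eg. fruitfly Eyeless) against a protein database
print(choose_blast_program("protein", "protein"))  # BLASTP
```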
FASTA format
• Many programs for sequence analysis/alignment (eg.
  CLUSTAL) expect the input sequences to be in FASTA
  format
  Each sequence is preceded by a header line that starts with “>”
  followed by the sequence identifier
  >fruitfly
  MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGR
  PLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLA
  AQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLG
  TRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENS
  NGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDS
  PNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPR
  LNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVL
  SAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSP
  WV
  >human
  MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
  TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
  LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
  MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
  >mouse
  MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
  TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
  LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
  MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
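
To make the format concrete, here is a minimal sketch of reading FASTA-formatted text into a dictionary. The function name is hypothetical and the short sequences in the example are truncated stand-ins; in practice an existing library would normally be used instead.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each sequence identifier to its sequence (wrapped lines are joined)."""
    records: dict[str, list[str]] = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The identifier is the first word after ">"
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            records[identifier] = []
        elif identifier is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[identifier].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


example = """>human
MQNSHSGV
NQLGGVFV
>mouse
MQNSHSGV
"""
print(parse_fasta(example))
# {'human': 'MQNSHSGVNQLGGVFV', 'mouse': 'MQNSHSGV'}
```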
• You can use BLAST to search many sequence
  databases (eg. NCBI or UniProt) via websites
  eg. we can use the fruitfly Eyeless protein sequence (898 amino acids long)
  as a BLAST query sequence to search the UniProt database:

  MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGV
  NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQE
  NVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYE
  KLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPP
  NDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLA
  GKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGID
  SSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSF
  NHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAAS
  SASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV



  We go to www.uniprot.org and click on ‘Blast’ at the top:
• You will get a list of BLAST hits (database sequences
  with good alignments to your query, ie. to fruitfly
  Eyeless here):
• Each BLAST hit may have several local alignments to
  the query sequence
  eg. the fruitfly Eyeless has human Eyeless as a BLAST hit, and
  several local alignments are reported for this pair:
• BLAST assesses the statistical significance of high-scoring
  database matches
• For each alignment between the query and a
  database protein, it calculates an E-value
• E-value: the number of database matches of a
  certain alignment score expected by chance, in a
  database of the size searched
• The lower the E-value, the more significant the
  alignment score for the sequence match
  E = 1 means that we expect 1 match of that alignment score just by
  chance, in a database of the size searched
  E = 10^-5 means that we expect to see 10^-5 matches of that alignment score
  just by chance, in a database of that size
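
An E-value can be converted into a P-value (the probability of seeing at least one chance match with that score). The sketch below assumes the Poisson model commonly used for BLAST statistics, which is not stated on the slides: P = 1 - e^(-E). The function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """P(at least one chance match) under a Poisson model: 1 - exp(-E)."""
    if e_value < 0:
        raise ValueError("an E-value cannot be negative")
    # expm1 keeps precision for very small E-values, where P is almost equal to E
    return -math.expm1(-e_value)


for e in (1e-5, 1.0, 189.0):
    print(e, p_value_from_e_value(e))
```

For small E-values the two numbers are practically identical, while for large E-values the P-value approaches 1.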
• Significant BLAST hits are possibly homologues
• We use the E-value to judge if the database
  sequence is a homologue of the query
  If E ≤ 10^-5, we are confident that the hit is a homologue
  If E is between 10^-5 and 10, we are not sure if the hit is a homologue
  If E > 10, we are doubtful that the hit is a homologue
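
These rules of thumb can be written as a small function. The thresholds are the ones given above; the function name and the wording of the returned labels are hypothetical.

```python
def interpret_e_value(e_value: float) -> str:
    """Apply the rule-of-thumb E-value thresholds from the slide."""
    if e_value <= 1e-5:
        return "confident that the hit is a homologue"
    if e_value <= 10:
        return "not sure if the hit is a homologue"
    return "doubtful that the hit is a homologue"


for e in (1e-50, 0.5, 189):
    print(e, "->", interpret_e_value(e))
```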
  eg. searching UniProt using fruitfly Eyeless as our query:
eg. searching the NCBI Protein Database using fruitfly Eyeless as our
  query:
  [Screenshot of the BLAST results list]
  BLAST matches with high E-values may not be homologues (although it
  is often hard to tell if they are or not!)
Problem
• Here’s the output of a BLAST search using the
  predicted protein for a gene prediction from
  Staphylococcus aureus:




  (i) What does an E value of 189 mean?
  (ii) Based on the BLAST output, do you think the gene prediction is
  likely to correspond to a real gene? If so, can you suggest the
  biological function of that gene?
Answer
•   Here’s the output of a BLAST search using the predicted protein for a
    gene prediction from Staphylococcus aureus:




    (i) What does an E value of 189 mean?
    An E-value of 189 means that we expect to see 189 BLAST hits with an
    alignment score at least as high as that of the top BLAST hit (ie. 28.9)
    just by chance, in a database of the size searched
    (ii) Based on the BLAST output, do you think the gene prediction is likely
    to correspond to a real gene? If so, can you suggest the biological function
    of that gene?
    An E-value of 189 is high, so we can’t be confident that the top BLAST hit
    is a homologue of our query. We shouldn’t predict the function of our query
    sequence based on such a weak BLAST hit
Further Reading
•   Chapter 3 in Cristianini & Hahn, Introduction to Computational Genomics
•   Chapter 6 in Deonier et al., Computational Genome Analysis

Editor's notes

  1. The figure of ~38,500,000 protein sequences is from searching NCBI Protein for 1:10000000000000000000000[SLEN] on 18-Feb-2011, which returned 38535878 matching protein sequences. Image credit (filing cabinet): http://etc.usf.edu/clipart/13000/13089/file_cabinet_13089_lg.gif
  2. Image credit (Stephen Altschul): http://www.iscb.org/cms_addon/conferences/ismb2002/images/stephen.jpg