23. Base composition aids genome analysis. GC skew, (G-C)/(G+C), identifies the origin of replication and the leading and lagging strands. Figure labels: genes coded by location and function; %G+C; genes shared with E. coli; genes unique to S. typhi.
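The GC skew formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used to draw the slide's figure: the function names, the window size, and the toy sequence are all assumptions. In many bacterial genomes the cumulative skew tends to reach its minimum near the origin of replication and its maximum near the terminus, which is why a running total is included.

```python
def gc_skew(seq, window=1000, step=None):
    """Return (window_start, skew) pairs, where skew = (G - C) / (G + C).

    `window` and `step` are illustrative defaults, not values from the slides.
    """
    seq = seq.upper()
    step = step or window
    out = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Guard against windows that contain no G or C at all.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        out.append((start, skew))
    return out


def cumulative_skew(skews):
    """Running sum of the per-window skew values."""
    total = 0.0
    cumulative = []
    for _, value in skews:
        total += value
        cumulative.append(total)
    return cumulative


# Toy sequence (made up): a G-rich half followed by a C-rich half.
toy = "GGGC" * 2 + "CCCG" * 2
toy_skews = gc_skew(toy, window=4)
toy_cumulative = cumulative_skew(toy_skews)
```

On the toy sequence the skew flips sign halfway along, which is the kind of switch one would look for between leading and lagging strands in a real genome.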
27. The problem of conflicting ORFs. Figure labels: non-coding ORFs; CDSs. Note that an ORF can extend upstream of the start codon.
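The note that an ORF can extend upstream of the start codon can be illustrated with a small sketch. It assumes an ORF is defined as a stop-free stretch in one reading frame, closed by a stop codon, and that the candidate CDS begins at the first ATG inside it. The function name, the coordinates convention (0-based, end-exclusive, stop codon included), and the toy sequence are all hypothetical; only one forward frame is scanned.

```python
STOPS = {"TAA", "TAG", "TGA"}


def orfs_in_frame(seq, frame, min_codons=1):
    """List (orf_start, first_atg, end) for stop-terminated ORFs in one forward frame.

    orf_start is where the stop-free stretch begins, first_atg is the position
    of the first ATG inside it (None if there is none), and end is the position
    just past the closing stop codon.
    """
    seq = seq.upper()
    results = []
    orf_start = frame
    first_atg = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOPS:
            if (i - orf_start) // 3 >= min_codons:
                results.append((orf_start, first_atg, i + 3))
            # The next stop-free stretch begins right after this stop codon.
            orf_start = i + 3
            first_atg = None
        elif codon == "ATG" and first_atg is None:
            first_atg = i
    return results


# Toy sequence (made up): two codons precede the ATG in the same frame.
toy_orfs = orfs_in_frame("CCCGGGATGAAATTTTAGCC", frame=0)
```

Here the stop-free stretch starts at position 0 but the first ATG is at position 6, so the ORF is longer than the candidate CDS.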
28. The Problem of Frameshift Errors

Actual sequence (positions 1 to 78) with its three forward-frame translations (• marks a stop codon):
ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
Frame 1: M S T A K L V K S K A T N L L Y T R N D V S D S E K
Frame 2: • V P L N • L N Q K R P I C F I P A T M S P T A R K
Frame 3: E Y R • I S • I K S D Q S A L Y P Q R C L R Q R E K

Frameshifted sequence after single base error (one extra A in the run of As at positions 27 to 30):
ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
Frame 1: M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
Frame 2: • V P L N • L N Q K A T N L L Y T R N D V S D S E K
Frame 3: E Y R • I S • I K K R P I C F I P A T M S P T A R K

Downstream of the inserted base, frame 1 reads what was frame 3 in the actual sequence, frame 2 reads the original frame 1, and frame 3 reads the original frame 2.
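The effect of the single-base error can be reproduced with a short translation sketch. The two sequences are copied from the slide; the helper name `translate` and the compact codon-table construction (standard genetic code, bases in TCAG order) are choices made for this illustration, and stops are written as `*` rather than the slide's bullet.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

ACTUAL = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACC"
          "CGCAACGATGTCTCCGACAGCGAGAAA")
SHIFTED = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATAC"
           "CCGCAACGATGTCTCCGACAGCGAGAA")


def translate(seq, frame=0):
    """Translate complete codons of `seq` starting at offset `frame` (0, 1 or 2)."""
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(frame, len(seq) - 2, 3))


for name, seq in (("actual", ACTUAL), ("shifted", SHIFTED)):
    for frame in range(3):
        print(name, "frame", frame + 1, translate(seq, frame))
```

Comparing the printed lines shows the first ten residues of frame 1 unchanged, after which the shifted frame 1 follows the actual sequence's frame 3.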
29. CDS Prediction: Graphical Plots. GC content by reading frame. Amino-acid composition by reading frame, compared to the average for globular proteins.
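One way to compute a "GC content by reading frame" plot is to measure, in a sliding window, the GC fraction at every third base for each of the three phases. The slide does not say which method produced its plots, so the following is only a sketch of that one formulation; the function name, window size, and step are assumptions.

```python
def gc_by_phase(seq, window=120, step=60):
    """Return (window_start, (gc0, gc1, gc2)) for each sliding window.

    gcK is the GC fraction over bases whose absolute position p in `seq`
    satisfies p % 3 == K, so the three values stay comparable across windows.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        fractions = []
        for phase in range(3):
            # Offset inside the chunk that corresponds to this absolute phase.
            bases = chunk[(phase - start) % 3::3]
            gc = sum(b in "GC" for b in bases)
            fractions.append(gc / len(bases) if bases else 0.0)
        rows.append((start, tuple(fractions)))
    return rows


# Toy sequence (made up): G only ever appears at phase 2.
toy_rows = gc_by_phase("ATGATGATG", window=4, step=4)
```

Plotting the three values against window start gives three curves; a sustained separation between them in one region is the kind of signal such plots are read for.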
41. Bit scores: high is good. E-values: low is good. See http://www.ncbi.nlm.nih.gov/BLAST/tutorial/
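The inverse relationship between the two numbers follows from the standard conversion E = m × n × 2^(-S'), where S' is the bit score, m the query length, and n the database length. A minimal sketch, assuming raw lengths are used (BLAST itself applies effective-length corrections, which are ignored here); the function name and the example lengths are made up:

```python
def evalue_from_bits(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score` bits.

    Uses E = m * n * 2**(-S') with uncorrected lengths (an approximation).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against a database of 1e9 residues.
for bits in (30, 50, 100):
    print(bits, evalue_from_bits(bits, 300, 1e9))
```

Each additional bit halves the E-value, so for a fixed query and database a higher bit score always means a lower E-value.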
47. The Annotation Catastrophe. Figure labels: signal peptide, protease, coiled coil domain; proteins A, B, C. Protein A is annotated as "a protease". Homology between the proteins lies in one domain, but the functional assignment for the whole of protein A comes from another domain. When that assignment is carried across in error, proteins B and C get misannotated as proteases.
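One simple guard against this kind of error is to transfer a functional label only when the aligned region covers the domain that actually carries the function. The sketch below is an illustration of that idea, not a method from the slides; the function names, the 0.8 coverage threshold, and all coordinates are hypothetical.

```python
def overlap_fraction(region, domain):
    """Fraction of `domain` covered by `region`; both are 1-based inclusive (start, end)."""
    start = max(region[0], domain[0])
    end = min(region[1], domain[1])
    covered = max(0, end - start + 1)
    return covered / (domain[1] - domain[0] + 1)


def can_transfer(aligned_region, functional_domain, min_cover=0.8):
    """True if the alignment covers enough of the function-bearing domain."""
    return overlap_fraction(aligned_region, functional_domain) >= min_cover


# Hypothetical coordinates on protein A.
PROTEASE_DOMAIN = (40, 300)      # the domain that justifies "a protease"
hit_other_domain = (320, 450)    # homology confined to a different domain
hit_protease = (35, 310)         # homology spanning the protease domain

print(can_transfer(hit_other_domain, PROTEASE_DOMAIN))
print(can_transfer(hit_protease, PROTEASE_DOMAIN))
```

A hit confined to the other domain fails the check, so the protease label would not be copied to that protein.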