CODON TABLE:
Codon Full_Name 3_Letter 1_Letter
TTT Phenylalanine Phe F
TTC Phenylalanine Phe F
TTA Leucine Leu L
TTG Leucine Leu L
TCT Serine Ser S
TCC Serine Ser S
TCA Serine Ser S
TCG Serine Ser S
TAT Tyrosine Tyr Y
TAC Tyrosine Tyr Y
TAA Termination (ochre) Ter *
TAG Termination (amber) Ter *
TGT Cysteine Cys C
TGC Cysteine Cys C
TGA Termination (opal or umber) Ter *
TGG Tryptophan Trp W
CTT Leucine Leu L
CTC Leucine Leu L
CTA Leucine Leu L
CTG Leucine Leu L
CCT Proline Pro P
CCC Proline Pro P
CCA Proline Pro P
CCG Proline Pro P
CAT Histidine His H
CAC Histidine His H
CAA Glutamine Gln Q
CAG Glutamine Gln Q
CGT Arginine Arg R
CGC Arginine Arg R
CGA Arginine Arg R
CGG Arginine Arg R
ATT Isoleucine Ile I
ATC Isoleucine Ile I
ATA Isoleucine Ile I
ATG Methionine Met M
ACT Threonine Thr T
ACC Threonine Thr T
ACA Threonine Thr T
ACG Threonine Thr T
AAT Asparagine Asn N
AAC Asparagine Asn N
AAA Lysine Lys K
AAG Lysine Lys K
AGT Serine Ser S
AGC Serine Ser S
AGA Arginine Arg R
AGG Arginine Arg R
GTT Valine Val V
GTC Valine Val V
GTA Valine Val V
GTG Valine Val V
GCT Alanine Ala A
GCC Alanine Ala A
GCA Alanine Ala A
GCG Alanine Ala A
GAT Aspartate Asp D
GAC Aspartate Asp D
GAA Glutamate Glu E
GAG Glutamate Glu E
GGT Glycine Gly G
GGC Glycine Gly G
GGA Glycine Gly G
GGG Glycine Gly G
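A table like the one above can be loaded into a dictionary that maps each codon to its one-letter code. The following is a minimal sketch, assuming the file is tab-delimited (as this assignment states for codon_table.txt), that the codon is the first field and the one-letter code the last, and that the header row starts with "Codon". The function name `parse_codon_table` and the sample lines are illustrative only.

```python
def parse_codon_table(lines):
    """Map each codon to its one-letter amino acid code.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes tab-separated fields with the codon first and the
    one-letter code last, and a header row whose first field is 'Codon'.
    """
    codon_to_aa = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines, title lines, and the header row.
        if len(fields) < 4 or fields[0] == "Codon":
            continue
        codon_to_aa[fields[0].upper()] = fields[-1].strip()
    return codon_to_aa


# Illustrative sample lines, not the full table.
sample = [
    "Codon\tFull_Name\t3_Letter\t1_Letter\n",
    "TTT\tPhenylalanine\tPhe\tF\n",
    "TAA\tTermination (ochre)\tTer\t*\n",
]
table = parse_codon_table(sample)
```

Splitting on tabs rather than on whitespace matters here, because names such as "Termination (ochre)" contain spaces.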
sfa.gff:
sfa.fasta:

Fasta format
The most commonly used biological sequence format is known as FASTA. It can be used for both nucleotide and amino acid sequences. From the NCBI website: A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length.

Examples of FASTA format (protein or nucleotide):
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFH
>gi|142864|gb|M10040.1|BACDNAE B. subtilis dnaE gene encoding DNA primase
GTACGACGGAGTGTTATAAGATGGGAAATCGGATACCAGATGAAATTGTGGATCAGGTGCAAAAGTCGGC
AGATATCGTTGAAGTCATAGGTGATTATGTTCAATTAAAGAAGCAAGGCCGAAACTACTTTGGACTCTGT
CCTTTTCATGGAGAAAGCACACCTTCGTTTTCCGTATCGCCCGACAAACAGATTTTTCATTGCTTTGGCT
GCGGAGCGGGCGGCAATGTTTTCTCTTTTTTAAGGCAGATGGAA
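
The defline/sequence structure described above can be read with a short loop. This is a minimal sketch rather than a reference parser: the function name `parse_fasta` is hypothetical, and using the first word after ">" as the record identifier is an assumption (the whole defline could be kept instead).

```python
def parse_fasta(lines):
    """Return {record_id: sequence} from FASTA-formatted lines.

    The record id is taken to be the first whitespace-delimited word
    after '>' (an assumption; the full defline could be kept instead).
    """
    sequences = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new defline closes the previous record, if any.
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            words = line[1:].split()
            current_id = words[0] if words else ""
            chunks = []
        elif current_id is not None:
            # Sequence data may span several lines; collect and join later.
            chunks.append(line.upper())
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences


# Toy records, not taken from the sample file.
example = [
    ">seq1 first toy record\n",
    "ACGT\n",
    "ttga\n",
    ">seq2\n",
    "GGCC\n",
]
records = parse_fasta(example)
```

Collecting the sequence lines in a list and joining once per record avoids repeated string concatenation on long sequences.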

GFF format
Sequences are only useful if we annotate them, i.e. indicate what features they encode, where those features are, and so on. A relatively new but widely used annotation format is GFF, the General Feature Format. It is a terse format used primarily by automated parsing systems. Each field in the file is tab-separated and represents a specific aspect of the sequence feature being annotated.
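
Because each field is tab-separated, a feature line can be split into fields and paired with column names. The sketch below assumes the first non-blank line is a header naming the columns, which this assignment states is true of the modified sample file but is not part of standard GFF. The function name `parse_annotation_table` and the column names in the toy data (`name`, `start`, `end`, `annotation`) are hypothetical; the real header may differ.

```python
def parse_annotation_table(lines):
    """Return one dict per feature, keyed by the header's column names."""
    rows = []
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if header is None:
            # First non-blank line: column names (assumed, see lead-in).
            header = [name.strip() for name in fields]
            continue
        rows.append(dict(zip(header, fields)))
    return rows


# Toy data with hypothetical column names.
toy_gff = [
    "name\tstart\tend\tannotation\n",
    "geneA\t1\t9\ttoy protein\n",
    "geneB\t10\t30\tputative hemagglutination protein\n",
]
features = parse_annotation_table(toy_gff)
```

All values come back as strings, so coordinates need converting with `int()` before they are used for slicing.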

Codon usage
Most amino acids are encoded by more than one codon. Codon usage bias can reflect evolutionary forces as well as the overall GC content of a genome. It has many consequences for protein expression and is very important in modern synthetic biology.

The task
Your task is to write a Python program that reads a GFF file and its associated FASTA sequence file, extracts the relevant information, and calculates the codon usage for all the annotated genes, as well as answering the other questions below. You are also provided with a codon table file. Your answers should be written to an output file. As ever, your solution should be generally applicable to any input files in this format.

Your program will need to take four command line arguments: the GFF filename, the FASTA filename, the codon table filename and an output filename. You must use this order of arguments in your code. You are provided with a sample GFF file and associated FASTA DNA sequence file, along with a codon table file.

NB: The format of the GFF file provided has been modified for simplified parsing. We have removed some fields, modified some and added a header line to indicate the content of each field.

NB: For our purposes, a CDS is a gene, and all features in the file are annotated as CDS (i.e. each line in the file represents a gene), so your code does not need to check whether a line represents a gene.

Questions
1. If there is a hemagglutinin encoded on this sequence (check for 'hemagglutination' in the annotations), what is the name of the gene?
2. How many genes are annotated in the GFF file?
3. What is the length (number of nucleotides in the annotated gene) and translated sequence of each gene?
4. Calculate the codon usage for all the genes and report the following:
   - A single codon usage table for the entire set of genes, not a codon usage table for each gene. See the specification of output formatting below.

Output
Your output file should be in the following format:

##1 Your answer (if you don't find any haemagglutinin report 'None')
##Q2: Your answer
##Q3
Gene_name_1: length nt
Translated sequence
Gene_name_2: length nt
Translated sequence
Gene_name_3: length nt
etc...
Codon Usage Table

1. Your program must take command line arguments: all input filenames and an output filename. Your program should check that the correct number of arguments has been provided and exit gracefully (use quit() or sys.exit()) with a usage message if the script was called incorrectly.
2. Use a main() function and conditional execution (if __name__ == "__main__") as we've discussed in the section on modules.
3. Print your answers to no more than 2 decimal places.
4. Functions must return the result of the computation and not write the result. Writing must be implemented in main(), not in your calculation function(s).
5. Use meaningful function names and variable names.
6. Put your student number on a comment line at the top of your code, under the shebang line, and also as the first line of the output file.
7. Your solution will need to import the sys module and can import textwrap for limiting line length in output. The latter is not required (but if you want to do this, it makes for much nicer output!). YOU SHOULD NOT IMPORT ANY OTHER MODULES.
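
The argument check and the main() structure described in the list above might be laid out as follows. This is only a sketch under stated assumptions: the script name `codon_usage.py`, the output name `out.txt`, the usage wording and the helper name `get_filenames` are all hypothetical, and the calculation steps are left out.

```python
import sys

# Hypothetical script name and wording; adapt to your own file.
USAGE = "Usage: codon_usage.py <gff> <fasta> <codon_table> <output>"


def get_filenames(argv):
    """Return (gff, fasta, codon_table, output), or None if the count is wrong."""
    if len(argv) != 5:  # script name plus four filenames
        return None
    return tuple(argv[1:])


def main(argv):
    filenames = get_filenames(argv)
    if filenames is None:
        # sys.exit() with a string prints it to stderr and exits with status 1.
        sys.exit(USAGE)
    gff_name, fasta_name, codon_table_name, output_name = filenames
    # Calculation functions would be called here and return their results;
    # only main() would open output_name and write to it.
    return filenames


# A script would end with the usual guard:
#     if __name__ == "__main__":
#         main(sys.argv)
# It is left as a comment so the demonstration below can call main() directly.
demo = main(["codon_usage.py", "sfa.gff", "sfa.fasta", "codon_table.txt", "out.txt"])
```

Passing `argv` into main() as a parameter, instead of reading `sys.argv` inside it, makes the argument check easy to exercise without running the script from a shell.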

1. Coordinates are according to biological counting, i.e. 1-based counting. The start and end coordinates should be included in the gene.
2. For this assignment, dictionaries are your best friend; don't be afraid to make liberal use of them.
3. Translation:
   a. Use the standard genetic code. Use the standard single-letter amino acid code in your translations (e.g. Phenylalanine is represented by F). For stop codons use the asterisk symbol (*). You should use the tab-delimited file codon_table.txt, which is available on Brightspace in this exercise.
4. Your codon usage table output should have one line per codon consisting of 3 tab-separated fields as shown below. The first is the codon; the second should be an integer, since it is a count; the third should be a float less than or equal to 1, since it is a proportion. The order of codons is unimportant. For example, if there are 5 occurrences of Phenylalanine (F), 4 of which are TTT and 1 of which is TTC, the relevant lines would look like this:

   Codon Usage Table
   TTT	4	0.8
   TTC	1	0.2

   i.e. there are 2 Phenylalanine (F) codons, TTT and TTC. The proportion field shows the number of TTT codons as a proportion of the total number of F codons, and similarly for TTC. There should be 1 line like this for each possible codon. As a sanity check, the values for the codons which code for a given amino acid should sum to 1 (as shown above for Phe, which has 2 codons: 0.8 + 0.2 == 1).
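
The 1-based slicing, translation and per-amino-acid proportions described in the notes above can be sketched on toy data. This is an illustration under several assumptions, not a full solution: all function names are hypothetical, the codon table is cut down to three entries, genes are assumed to lie on the forward strand, and every codon is assumed to appear in the table (no ambiguous bases such as N).

```python
def extract_gene(sequence, start, end):
    """Return the gene at 1-based, inclusive coordinates start..end."""
    return sequence[start - 1:end]


def split_codons(gene):
    """Split a gene into complete codons, ignoring any trailing 1-2 bases."""
    usable = len(gene) - len(gene) % 3
    return [gene[i:i + 3] for i in range(0, usable, 3)]


def translate(gene, codon_to_aa):
    """Translate a gene into one-letter amino acid codes ('*' for stop)."""
    return "".join(codon_to_aa[codon] for codon in split_codons(gene))


def codon_usage(genes, codon_to_aa):
    """Return {codon: (count, proportion)} pooled over all genes.

    The proportion is the codon's count divided by the total count of all
    codons for the same amino acid; it is 0.0 if that amino acid never occurs.
    """
    counts = {codon: 0 for codon in codon_to_aa}
    for gene in genes:
        for codon in split_codons(gene):
            counts[codon] += 1
    totals = {}
    for codon, count in counts.items():
        aa = codon_to_aa[codon]
        totals[aa] = totals.get(aa, 0) + count
    usage = {}
    for codon, count in counts.items():
        total = totals[codon_to_aa[codon]]
        usage[codon] = (count, count / total if total else 0.0)
    return usage


# Toy data: a three-entry codon table and a made-up sequence.
toy_table = {"TTT": "F", "TTC": "F", "TAA": "*"}
toy_sequence = "GG" + "TTT" * 4 + "TTC" + "TAA" + "CC"
gene = extract_gene(toy_sequence, 3, 20)
protein = translate(gene, toy_table)
usage = codon_usage([gene], toy_table)
for codon in sorted(usage):
    count, proportion = usage[codon]
    print(codon + "\t" + str(count) + "\t" + str(round(proportion, 2)))
```

Starting the counts dictionary from every codon in the table means codons that never occur still get a line with a count of 0, which matches the requirement of one line per possible codon. The functions return their results and leave printing to the caller, in keeping with the rule that calculation functions must not write output.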