Practical 7: DNA, RNA and the Flow of Genetic Information
HBC1019 Biochemistry 1 Trimester 1, 2010/2011
Faculty of Information Science & Technology
LAB REPORT
HBC 1019 - Biochemistry I
Practical 7
DNA, RNA and the Flow of Genetic
Information
Name : Osama Barayan
ID : 1091105869
Introduction
Biological databases are often referred to as sequence or structure libraries that contain
a huge amount of information about the sequence and structure of nucleic acids (DNA,
RNA) and proteins. This practical introduces some of the relevant databases.
They are very useful and are becoming important resources for the study of biochemistry
and bioinformatics at all levels.
1. Finding databases
a. What are the major online databases that contain DNA and protein
sequences?
1. http://www.ncbi.nlm.nih.gov/
2. http://www.cellbiol.com/
3. http://www.biochemweb.org/
4. http://nar.oxfordjournals.org/
b. Which databases contain entire genomes?
Many sites on the internet host entire genomes, for example
http://www.ncbi.nlm.nih.gov/
c. Define the following terms; once you have defined them, please provide
the link(s) as well.
i. BLAST
Basic Local Alignment Search Tool, or BLAST, is an algorithm for
comparing primary biological sequence information, such as the amino-acid sequences of
different proteins or the nucleotides of DNA sequences.
ii. Taxonomy
The science of classifying living things, grouped by similarity: species are
grouped into genera, genera into families, families into orders, orders into classes,
classes into phyla, and phyla with similar characteristics into kingdoms near the top
level of the classification.
iii. Gene ontology
The Gene Ontology, or GO, is a major bioinformatics initiative to unify the
representation of gene and gene product attributes across all species.
iv. Phylogenetic tree
A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing
the inferred evolutionary relationships among various biological species or other
entities based upon similarities and differences in their physical and/or genetic
characteristics.
v. Multiple sequence alignment
A multiple sequence alignment (MSA) is a sequence alignment of three or more
biological sequences, generally protein, DNA, or RNA.
2. Analyzing a DNA sequence
You will learn how to analyze a given DNA sequence by identifying an open reading
frame, determining the protein that it expresses, and finding the bacterial source of
that protein.
This is the DNA sequence:
TACGCAATGCGTATCATTCTGCTGGGCGCTCCGGGCGCAGGTAAAGGTACTCAGGCTCAATTCATC
ATGGAGAAATACGGCATTCCGCAAATCTCTACTGGTGACATGTTGCGCGCCGCTGTAAAAGCAGGT
TCTGAGTTAGGTCTGAAAGCAAAAGAAATTATGGATGCGGGCAAGTTGGTGACTGATGAGTTAGTT
ATCGCATTACTCAAAGAACGTATCACACAGGAAGATTGCCGCGATGGTTTTCTGTTAGACGGGTTC
CCGCGTACCATTCCTCAGGCAGATGCCATGAAAAAGAAGCCGGTATCAGTTGATTATGTGCTGGAG
TTTGATGTTCCAGACGAGCTGATTGTTGAGCGCATTGTCGGCCGTCGGGTACATGCTGCTTCAGGC
CGTGTTTATCACGTTAAATTCAACCCACCTAAAGTTGAAGATAAAGATGATGTTACCGGTGAAGAG
CTGACTATTCGTAAAGATGATCAGGAAGCGACTGTCCGTAAGCGTCTTATCGAATATCATCAACAA
ACTGCACCATTGGTTTCTTACTATCATAAAGAAGCGGATGCAGGTAATACGCAATATTTTAAACTG
GACGGAACCCGTAATGTAGCAGAAGTCAGTGCTGAACTGGCGACTATTCTCGGTTAATTCTGGATG
GCCTTATAGCTAAGGCGGTTTAAGGCCGCCTTAGCTATTTCAAGTAAGAAGGGCGTAGTACCTACA
AAAGGAGATTTGGCATGATGCAAAGCAAACCCGGCGTATTAATGGTTAATTTGGGGACACCAGATG
CTCCAACGTCGAAAGCTATCAAGCGTTATTTAGCTGAGTTTTTGAGTGACCGCCGGGTAGTTGATA
CTTCCCCATTGCTATGGTGGCCATTGCTGCATGGTGTTATTTTACCGCTTCGGTCACCACGTGTAG
CAAAACTTTATCAATCCGTTTGGATGGAAGAGGGCTCTCCTTTATTGGTTTATAGCCGCCGCCAGC
AGAAAGCACTGGCAGCAAGAATGCCTGATATTCCTGTAGAATTAGGCATGAGCTATGGTTCAC
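a. What is an Open Reading Frame (ORF) and what is a reading frame?
An ORF is any region of DNA or RNA in which a protein could be encoded:
a string of nucleotides in which one of the three reading frames
has no stop codons. A reading frame is one of the three ways of dividing
a nucleotide sequence into consecutive, non-overlapping triplets (codons),
depending on the position at which reading starts.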
b. Try to find an ORF from the segment of DNA above by finding the first
start codon and the first in frame stop codon.
In bacteria, an open reading frame on a piece of mRNA almost
always begins with AUG, which corresponds to ATG in the DNA segment
that codes for the mRNA. According to the standard genetic code, there are
three stop codons on mRNA: UAA, UAG, and UGA, which correspond to
TAA, TAG, and TGA in the parent DNA segment. Here are the rules for
finding an open reading frame in this piece of bacterial DNA:
i. It must start with ATG. In this exercise, the first ATG is the start codon.
ii. It must end with TAA, TAG, or TGA.
iii. It must be at least 300 nucleotides long (coding for 100 amino acids).
iv. The ATG start codon and the stop codon must be in frame. This means that the total
number of bases in the sequence from the start to the stop codon must be evenly
divisible by 3.
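The four rules above can be sketched in Python as follows. This is only a minimal illustration, not the method used by any particular tool; the function name find_first_orf and its min_length parameter are hypothetical names introduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def find_first_orf(dna, min_length=300):
    """Apply rules i-iv: start at the first ATG, read in frame to the first stop.

    Returns (start, end, orf) with 0-based start and exclusive end
    (the stop codon is included), or None if no qualifying ORF exists.
    """
    dna = "".join(dna.split()).upper()  # drop whitespace and line breaks
    start = dna.find("ATG")             # rule i: the first ATG is the start codon
    if start == -1:
        return None
    # Step three bases at a time so the stop codon is in frame (rule iv).
    for i in range(start + 3, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:  # rule ii: first in-frame stop codon
            orf = dna[start:i + 3]
            # rule iii: the ORF must be at least min_length nucleotides long
            return (start, i + 3, orf) if len(orf) >= min_length else None
    return None  # ran off the end of the sequence without meeting a stop codon
```

Calling find_first_orf on the sequence above (pasted as one string) would return the position of the first ATG, the position just past the first in-frame stop codon, and the ORF itself. Because the loop advances in steps of three, the length of the returned ORF is always divisible by 3.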
c. Copy the entire sequence again and go to the Translate tool on the ExPASy
server (http://www.expasy.org/tools/dnal.htm). Paste the sequence in the
box and select “Verbose (“Met”, “Stop”, spaces between residues)” as the
Output format and click on “Translate Sequence”.
What are the results of translation? Identify the reading frame that
contains a protein (more than 100 continuous amino acids with no
interruptions by a stop codon) and its name.
Y A Met R I I L L G A P G A G K G T Q A Q F I Met E K Y G I P Q I S T G D Met L R A A V
K A G S E L G L K A K E I Met D A G K L V T D E L V I A L L K E R I T Q E D C R D G F L
L D G F P R T I P Q A D A Met K K K P V S V D Y V L E F D V P D E L I V E R I V G R R
V H A A S G R V Y H V K F N P P K V E D K D D V T G E E L T I R K D D Q E A T V R K
R L I E Y H Q Q T A P L V S Y Y H K E A D A G N T Q Y F K L D G T R N V A E V S A E L
A T I L G Stop F W Met A L Stop L R R F K A A L A I S S K K G V V P T K G D L A
Now change the Output format on the previous page to “Compact (“M”, “-”,
no spaces)”. Go to the same reading frame as before and copy the protein
sequence (in one-letter abbreviations) starting with “M” for the start codon.
Paste the sequence in your answer.
MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKAGSELGLKAKEIMDAGKL
VTDELVIALLKERITQEDCRDGFLLDGFPRTIPQADAMKKKPVSVDYVLEFDVPDELIVE
RIVGRRVHAASGRVYHVKFNPPKVEDKDDVTGEELTIRKDDQEATVRKRLIEYHQQTAPL
VSYYHKEADAGNTQYFKLDGTRNVAEVSAELATILG
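The translation step performed by the online tool can be approximated with a short Python sketch. It assumes the standard genetic code and mimics the “Compact” output by writing “-” for a stop codon; the function names translate, reverse_complement and six_frames are hypothetical names chosen for this example, not part of any tool.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1 or 2); '-' marks a stop codon."""
    dna = "".join(dna.split()).upper()
    residues = []
    for i in range(frame, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with unknown bases
        residues.append("-" if aa == "*" else aa)
    return "".join(residues)


def reverse_complement(dna):
    """Return the reverse complement, used for the three reverse frames."""
    dna = "".join(dna.split()).upper()
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Translate all six reading frames: three forward and three reverse."""
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames[f"forward frame {f + 1}"] = translate(dna, f)
        frames[f"reverse frame {f + 1}"] = translate(rc, f)
    return frames
```

For example, translate("TACGCAATGCGTATCATTCTGCTGGGC"), the first 27 bases of the sequence above, gives "YAMRIILLG", which matches the start of the verbose output (Y A Met R I I L L G).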
d. Now you will identify the protein and the bacterial source. Go to the NCBI
BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/).
What are the different types of BLAST program and what are their
functions?
Nucleotide blast (blastn) : Search a nucleotide database using a nucleotide query
blastx : Search a protein database using a translated nucleotide query
Protein blast (blastp) : Search a protein database using a protein query
tblastn : Search a translated nucleotide database using a protein query
tblastx : Search a translated nucleotide database using a translated nucleotide
query
You will do a simple BLAST search using your protein sequence, but you
can do much more with BLAST. You are encouraged to try the tutorials
on the BLAST site
(http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html).
On the BLAST page, select “Protein-protein BLAST.” Enter your protein
sequence in the “Search” box. Use the default values for the rest of the page
and click on the “BLAST!” button. You will be taken to the “formatting
BLAST” page. Click on the “Format!” button. You may have to wait for
the results. Your protein should be the first one listed in the BLAST output.
3. Sequence homology
You will use BLAST to look for sequences that are homologous to the protein that you
identified in problem 2.
a. Define homolog, ortholog and paralog.
A homolog is a gene or protein that shares a common evolutionary
ancestor with another gene or protein.
An ortholog is either of two or more homologous gene sequences found in
different species, separated by a speciation event.
A paralog is either of a pair of genes within the same genome that derive
from the same ancestral gene by gene duplication.
b. Go to the NCBI BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/) and
choose “Protein-protein BLAST.” Paste your protein sequence into the
“Search” box. Before clicking on the “BLAST!” button, narrow the search
by kingdom. As you look down the BLAST page, you'll see an Options
section under “choose search set” (followed by an empty box) or “select
from:” key in “Eukaryota.” Now click on the “BLAST!” button. Click on
the “Format!” button on the next page. Can you find a homologous
sequence from yeast? YES (Hint: Use your browser's Find tool to search
for the term “Saccharomyces.”) Note the Score and E value given at the
right of the entry. Can you find a homologous sequence from humans?
(Hint: Search for the term “Homo.”) Note its Score and E value.
Yes. The human hits were:
Cytidylate kinase: Score 90.5, query coverage 98%, E value 4e-18.
Cytidine monophosphate: Score 90.1, query coverage 98%, E value 5e-18.
UMP-CMP kinase isoform a: Score 89.7, query coverage 98%, E value 6e-18.
Most biochemists consider 25% identity the cutoff for sequence homology,
meaning that if two proteins are less than 25% identical in sequence, more
evidence is needed to determine whether they are homologs. Click on the
Score values for the yeast and human proteins to see each sequence aligned
with your query sequence and to see the percent sequence identity. Are the
yeast and human sequences homologous to your query sequence? yes
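The 25% identity rule of thumb can be illustrated with a small Python sketch. It assumes the two sequences are already aligned (same length, with “-” marking gaps) and counts identities over the full alignment length, as BLAST does in its “Identities” line. The function names and the example strings in the usage note are hypothetical.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    identical = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"  # a gap is never counted as an identity
    )
    return 100.0 * identical / len(aligned_query)


def passes_identity_cutoff(aligned_query, aligned_subject, cutoff=25.0):
    """True if identity reaches the cutoff; below it, more evidence is needed."""
    return percent_identity(aligned_query, aligned_subject) >= cutoff
```

For instance, percent_identity("MK-L", "MKAL") counts three identical columns out of four. A result below the cutoff does not prove that two proteins are unrelated; it only means that sequence identity alone is not enough to call them homologs.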
c. What do Score and E-value stand for? Use the BLAST online tutorial
(http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html) to
discover the meaning. What is the difference between an identity and a
conservative substitution? From the result of BLAST you gained, provide an
example from the comparison of your sequence and a homologous sequence.
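Score = a measure of the similarity of the query to the sequence shown; a
higher score means a better alignment.
E-value (Expect value) = the number of alignments with a score at least this
high that would be expected by chance in a database of this size; the lower
the E-value, the more significant the match.
An identity is an aligned position where both sequences have the same
residue. A conservative substitution is an aligned position where the residues
differ but have similar properties, so the substitution matrix gives the pair
a positive score.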
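The link between Score and E-value can be sketched with the standard Karlin-Altschul relationship for bit scores, E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. The function name expect_value and the lengths used in the usage note are assumptions for illustration; BLAST itself applies corrected (effective) lengths, so its reported values will differ.

```python
def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'): chance hits expected with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def halvings_between(bit_score_a, bit_score_b):
    """Each additional bit of score halves the E-value; return the fold change."""
    return 2.0 ** (bit_score_b - bit_score_a)
```

With a hypothetical 214-residue query and a database of one million residues, expect_value(90.5, 214, 1_000_000) is far smaller than expect_value(30, 214, 1_000_000), which shows why a high score goes together with a low E-value.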
BLAST uses a substitution matrix to assign values in the alignment process,
based on the analysis of amino acid substitutions in a wide variety of
protein sequences. Make sure you understand the meaning of the term
“substitution matrix.” What is the default substitution matrix on the
BLAST page? BLOSUM62.
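How a substitution matrix turns an alignment into a score, and how it separates identities from conservative substitutions, can be sketched as follows. The dictionary below is a small hand-picked excerpt that is assumed to match BLOSUM62 and should be checked against the full matrix; the function names and the four-residue example are hypothetical and are not taken from the BLAST results.

```python
# Excerpt only (assumed BLOSUM62 values); the real matrix covers all 20 residues.
BLOSUM62_EXCERPT = {
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4, ("G", "G"): 6,
    ("K", "K"): 5, ("R", "R"): 5, ("D", "D"): 6, ("E", "E"): 5,
    ("I", "V"): 3, ("I", "L"): 2, ("K", "R"): 2, ("D", "E"): 2,
    ("D", "L"): -4, ("G", "L"): -4,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError if the pair is not in the excerpt


def classify_column(a, b, matrix=BLOSUM62_EXCERPT):
    """Label one aligned pair as identity, conservative or non-conservative."""
    if a == b:
        return "identity"
    return "conservative" if pair_score(a, b, matrix) > 0 else "non-conservative"


def alignment_score(query, subject, matrix=BLOSUM62_EXCERPT):
    """Sum the pair scores of an ungapped alignment of equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(q, s, matrix) for q, s in zip(query, subject))
```

Aligning the made-up fragments "IKDL" and "IRDG" gives two identities (I/I, D/D), one conservative substitution (K/R) and one non-conservative substitution (L/G). Gap penalties are left out of this sketch, so it covers only ungapped alignments.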
What other matrices are available?
PAM1, PAM250, PAM30, PAM70, BLOSUM45, BLOSUM80
What is the source of the names for these substitution matrices?
PAM = Point Accepted Mutation. This matrix is built by observing
differences between closely related proteins.
BLOSUM = BLOcks SUbstitution Matrix. This matrix scores the small
changes in sequences that can happen during evolution. It is built
from multiple alignments of evolutionarily divergent proteins.
Repeat the BLAST search in Problem 3(b) using a different substitution matrix.
(Look for algorithm parameters.) Do you find different answers? Yes