Gene prediction and expression

Submitted by-
Ishi tandon
CT-IV

Gene:
• A sequence of nucleotides coding for a protein.
Central Dogma:
• Proposed in 1958 by Francis Crick.
• He postulated that not all possible transfers of information are viable.
• He published a paper on it in 1970.
CODONS:
• Discovered by Sydney Brenner and Francis Crick in 1961.
• Each codon is a triplet of nucleotides that codes for one amino acid in a protein.

[Figure: DNA → RNA → PROTEIN → PHENOTYPE, with a branch to cDNA; the arrows are numbered as follows]
1. TRANSCRIPTION
2. TRANSLATION
3. GENE EXPRESSION
4. REVERSE TRANSCRIPTION

DEFINITION
• Gene prediction is a prerequisite for detailed functional annotation of genes and genomes.
• It can detect the location of ORFs (Open Reading Frames) and the structures of introns and exons.
• Its goal is to describe all genes computationally with near 100% accuracy.
• It can reduce the amount of experimental verification work required.

TYPES
• Ab initio: relies on gene signals (intron splice sites, transcription factor binding sites, ribosomal binding sites, polyadenylation sites), triplet codon structure and gene content.
• Homology: relies on significant matches of the query sequence with sequences of known genes.
• Probabilistic models such as Markov models or Hidden Markov Models (HMMs).

[Figure: gene structure and sequence signals. Genomic DNA carries exons and introns, a start codon, a stop codon and splice sites (donor site GT, acceptor site AG). Transcription gives pre-mRNA (Cap- ... -Poly(A)), splicing gives mRNA (Cap- ... -Poly(A)), and translation gives protein.]
Exons are usually shorter than introns.

Prokaryotic gene prediction
• Gene prediction is easier in microbial genomes.
• They have smaller genomes, high gene density and very few repetitive sequences, and more sequenced genomes are available.
• The start codon is usually ATG.
• A ribosomal binding site (Shine-Dalgarno sequence) lies upstream of the start codon.

Open reading frames
• An ORF is a sequence delimited by an in-frame start codon and stop codon, which in turn defines a putative amino acid sequence.
• In a single reading frame, a genome of length n comprises about n/3 codons.
• Stop codons break the genome into segments between consecutive stop codons.
• The sub-segments of these that begin at a start codon (ATG) are ORFs.
• DNA is translated in all six possible frames, three forward and three reverse.
[Figure: an open reading frame within a genomic sequence, running from ATG to TGA]

CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
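The six-frame ORF scan described above can be sketched as follows. This is a minimal illustration, not the method of any particular gene finder: the function names, the `min_codons` parameter and the half-open coordinate convention are assumptions made for this example, and only ATG is treated as a start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """Scan all six reading frames for ATG ... stop open reading frames.

    Returns tuples (strand, frame, start, end, orf_sequence). Coordinates
    are 0-based, half-open, and relative to the strand being read.
    min_codons is the minimum number of codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


print(find_orfs("CCATGAAATGACC"))
# [('+', 2, 2, 11, 'ATGAAATGA')]
```

Real sequences yield many short ORFs by chance, so a larger `min_codons` threshold is normally applied before treating an ORF as a candidate gene.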

Probabilistic models
• A statistical description of a gene.
• Markov Models & Hidden Markov Models.
• Used to distinguish oligonucleotide distributions in coding regions from those in non-coding regions.
• In a Markov model of order k, the probability of a nucleotide depends on the k preceding nucleotides in the DNA sequence.
• Types of order: zero, first, second and so on.
• The higher the order, the more accurately a gene can be predicted.
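A k-th order Markov model of this kind can be sketched in a few lines. The example below is illustrative only: the function names, the add-one pseudocount and the toy training sequences are assumptions, and real gene finders train on large sets of annotated coding and non-coding sequence.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, k, pseudocount=1.0):
    """Estimate P(base | k preceding bases) from training sequences.

    Returns a function prob(context, base). The pseudocount keeps
    unseen transitions from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, k):
    """Log-probability of seq under the model, skipping the first k bases."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))


def log_odds(seq, coding, noncoding, k):
    """Positive values mean seq looks more like the coding training set."""
    return log_likelihood(seq, coding, k) - log_likelihood(seq, noncoding, k)


# Toy first-order models; the training strings are made up for illustration.
coding = train_markov(["ACACACACAC"], k=1)
noncoding = train_markov(["AAAAAAAAAA"], k=1)
print(log_odds("ACACAC", coding, noncoding, k=1) > 0)  # True
```

Raising k lets the model capture longer-range patterns, at the cost of needing more training data: an order-k model has 4^k contexts to estimate.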

Gene content and length distribution of prokaryotic genes
TYPICAL: Range from 100 to 500 amino acids, with a nucleotide distribution typical of the organism.
ATYPICAL: Shorter or longer, with different nucleotide statistics. These genes tend to escape detection when the typical gene model is used.

Gene finding programs in prokaryotes
• The programs are based on HMMs/IMMs (interpolated Markov models).
▪ GeneMark.hmm (microbial genomes)
▪ Glimmer (UNIX program from TIGR). Computation involves two steps, viz. model building and gene prediction.
▪ FGENESB (bacterial sequences). It uses the Viterbi algorithm and linear discriminant analysis (LDA).
▪ RBSfinder: searches for the ribosomal binding site or Shine-Dalgarno sequence to predict the translation initiation site.
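The Viterbi algorithm mentioned above finds the most probable sequence of hidden states in an HMM. The sketch below uses a made-up two-state model (coding and noncoding) with invented probabilities; it is not the model used by FGENESB or any other program, and it assumes a non-empty sequence and no zero probabilities.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (log-space Viterbi)."""
    # best[t][s]: log-probability of the best path ending in state s at t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy parameters: "coding" is GC-rich, "noncoding" is AT-rich.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

print(viterbi("AAAAAGCGCGC", STATES, START, TRANS, EMIT))
# five 'noncoding' labels followed by six 'coding' labels
```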

Sensitivity: Ability to include correct predictions. It is the fraction of known genes correctly predicted.
Specificity: Ability to exclude incorrect predictions. It is the fraction of predicted genes that correspond to true genes.
▪ Both are proportions of true signals: sensitivity is taken over all known genes, specificity over all predicted genes.
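With TP as correctly predicted genes, FN as known genes that were missed and FP as predictions that match no known gene, the two definitions above become Sn = TP / (TP + FN) and Sp = TP / (TP + FP). A small sketch follows; representing genes as exact (start, end) tuples is an assumption for illustration, and the coordinates are made up.

```python
def sensitivity_specificity(true_genes, predicted_genes):
    """Compute (sensitivity, specificity) as defined for gene prediction.

    Genes are compared by exact identity, e.g. (start, end) tuples.
    Sensitivity = TP / (TP + FN); specificity = TP / (TP + FP).
    """
    true_set, pred_set = set(true_genes), set(predicted_genes)
    tp = len(true_set & pred_set)   # known genes that were predicted
    fn = len(true_set - pred_set)   # known genes that were missed
    fp = len(pred_set - true_set)   # predictions matching no known gene
    sn = tp / (tp + fn) if true_set else 0.0
    sp = tp / (tp + fp) if pred_set else 0.0
    return sn, sp


known = [(100, 400), (600, 900), (1200, 1500), (2000, 2600)]
predicted = [(100, 400), (600, 900), (1700, 1800)]
sn, sp = sensitivity_specificity(known, predicted)
print(sn)  # 0.5
```

Note that specificity in this gene-prediction sense is the quantity called precision in other fields.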

Eukaryotic gene prediction
• Genomes are much larger than those of prokaryotes (10 Mbp to 670 Gbp).
• Low gene density.
• The space between genes is very large and rich in repetitive sequences & transposable elements.
• Genes are split by intervening noncoding sequences (introns), which are removed so that the coding sequences (exons) can be joined.

• Splice junctions follow the GT-AG rule.
• An intron has the consensus motif GTAAGT at its 5’ splice junction and NCAG at its 3’ end.
• Genes have a high density of CG dinucleotides near the transcription start site. This region is called a CpG island. It helps to identify the transcription initiation site of a eukaryotic gene.
• The transcript undergoes post-transcriptional modifications to become mature mRNA, viz. capping, splicing and polyadenylation.
[Figure: exon 1 and exon 2 separated by an intron that starts with GT at the donor site and ends with AG at the acceptor site]
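One common way to flag a candidate CpG island is to compare the observed number of CG dinucleotides with the number expected from the C and G counts. The sketch below assumes commonly quoted thresholds (length of at least 200 bp, GC content above 50%, observed/expected ratio above 0.6); these thresholds and the function names are not taken from the slides.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")                   # observed CG dinucleotides
    gc_content = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0      # expected CG count if C and G were independent
    obs_exp = cpg / expected if expected else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC-content and obs/exp thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


print(cpg_stats("CG" * 100))              # (1.0, 2.0)
print(looks_like_cpg_island("AT" * 100))  # False
```

In practice the test is applied to a sliding window along the genome rather than to one whole sequence.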

o CAPPING: Occurs at the 5’ end of the transcript. It involves methylation at the initial residue of the RNA.
o SPLICING: The process of removing introns and joining exons. It involves a large RNA-protein complex called the spliceosome.
o POLYADENYLATION: Addition of a stretch of As (~250) at the 3’ end of the RNA. The process is accomplished by poly-A polymerase.

Gene finding programs in eukaryotes
• Three categories of algorithms
▪ Ab initio-based:
It identifies exons and joins them in the correct order. It relies on two kinds of signal:
a) Gene signals: small patterns within the genomic DNA, including putative splice sites, start and stop sites of transcription or translation, branch points, transcription factor binding sites and recognizable consensus sequences.
b) Gene content: statistical properties of a region of genomic DNA, including nucleotide and amino acid distribution, synonymous codon usage and hexamer frequencies.
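Hexamer frequencies, one of the gene content measures listed above, can be tabulated with a sliding window. This is a minimal sketch; the function name and the toy sequence are assumptions, and a real gene finder would compare such tables built from known coding and non-coding regions.

```python
from collections import Counter


def kmer_frequencies(seq, k=6):
    """Relative frequency of every overlapping k-mer in seq (hexamers by default)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


freqs = kmer_frequencies("ATGATGATG")
print(freqs["ATGATG"])  # 0.5
```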

▪ Neural network-based algorithms
-Composed of a network of mathematical variables.
-Multiple layers: input, output and hidden layers.
-GRAIL (splice junctions, start and stop codons, poly-A sites, promoters and CpG islands). It scans the query sequence with windows of variable length and scores them.
▪ Discriminant analysis
-Linear Discriminant Analysis (LDA) plots coding signals against all possible 3’ splice site positions in a 2D graph and separates them with a diagonal line.
-Quadratic Discriminant Analysis (QDA) uses a quadratic function, which gives a curved line.
-FGENES (LDA)

-FGENESH [Find Genes] (HMMs)
-FGENESH_C (similarity-based)
-FGENESH+ (combination of ab initio and similarity-based)
-MZEF [Michael Zhang’s Exon Finder] (QDA)
▪ HMMs
-GENSCAN (fifth-order HMMs); combines hexamer frequencies with coding signals; predictions with a probability score P > 0.5 are considered reliable
-HMMgene (conditional maximum likelihood); combines ab initio and homology-based algorithms
▪ Homology-based:
Exon structures and sequences of related species are highly conserved.
The query is compared with homologous sequences derived from cDNA or Expressed Sequence Tags (ESTs).
-GenomeScan (combines GENSCAN prediction results with BLASTX similarity searches)
-EST2Genome (intron-exon boundaries); compares an EST sequence with a genomic DNA sequence
-SGP-1 [Syntenic Gene Prediction] (similar to EST2Genome)
-TwinScan (gene-finding server; similar to GenomeScan)

▪ Consensus-based:
Combines the results of multiple programs based on consensus.
Improves specificity by correcting false positives and the problem of overprediction.
The trade-off is lowered sensitivity and missed predictions.
-GeneComber (combines HMMgene and GENSCAN prediction results)
-DIGIT (combines FGENESH, GENSCAN and HMMgene)
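The consensus idea can be illustrated with a simple per-nucleotide vote: a position is kept only if enough programs place it inside an exon. This is a generic sketch, not the scheme used by GeneComber or DIGIT; the function names, the half-open (start, end) exon coordinates and the example predictions are all assumptions.

```python
from collections import Counter


def consensus_positions(predictions, min_votes):
    """Positions covered by exons from at least min_votes programs.

    predictions: one list of (start, end) exons per program, half-open.
    """
    votes = Counter()
    for exons in predictions:
        covered = set()
        for start, end in exons:
            covered.update(range(start, end))
        votes.update(covered)  # each program votes at most once per position
    return sorted(pos for pos, n in votes.items() if n >= min_votes)


def to_intervals(positions):
    """Merge a sorted list of positions into half-open (start, end) intervals."""
    intervals = []
    for pos in positions:
        if intervals and pos == intervals[-1][1]:
            intervals[-1][1] = pos + 1
        else:
            intervals.append([pos, pos + 1])
    return [tuple(iv) for iv in intervals]


programs = [[(10, 20), (50, 60)], [(12, 22)], [(10, 18), (50, 55)]]
print(to_intervals(consensus_positions(programs, min_votes=2)))
# [(10, 20), (50, 55)]
```

Raising `min_votes` removes more false positives but also discards true exons that only one program found, which is the sensitivity trade-off noted above.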

GENE EXPRESSION
Two steps are required
1. Transcription
The synthesis of mRNA, using the gene on the DNA molecule as a template.
This happens in the nucleus of eukaryotes.
2. Translation
The synthesis of a polypeptide chain using the genetic code on the mRNA molecule as its guide.

Types of RNA
Messenger RNA (mRNA): <5%
Ribosomal RNA (rRNA): up to 80%
Transfer RNA (tRNA): about 15%
In eukaryotes: small nuclear ribonucleoproteins (snRNPs), which make up spliceosomes
Structural characteristics of RNA molecules
Single polynucleotide strand which may be looped or coiled (not a double helix)
Sugar: ribose (not deoxyribose)
Bases used: Adenine, Guanine, Cytosine and Uracil (not Thymine)

Transcription: The synthesis of a strand of mRNA (and other RNAs)
Uses the enzyme RNA polymerase
Proceeds in the same direction as replication (5’ to 3’)
Forms a complementary strand of mRNA
It begins at a promoter site, which signals that the beginning of the gene is near (about 20 to 30 nucleotides away)
After the end of the gene is reached, a terminator sequence tells RNA polymerase to stop transcribing
NB Terminator sequence ≠ terminator codon
[Figure: RNA polymerase]
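The complementary base pairing during transcription can be sketched as a simple mapping from the template strand to mRNA. The function name is an assumption, and the template is assumed to be written 3’ to 5’ so that the mRNA reads 5’ to 3’ without reversal; promoter and terminator recognition are not modelled.

```python
# Template DNA base -> complementary RNA base (uracil replaces thymine).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Build the mRNA (5' to 3') complementary to a template strand given 3' to 5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


print(transcribe("TACGGCATT"))  # AUGCCGUAA
```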

Editing the mRNA
In prokaryotes, transcribed mRNA goes straight to the ribosomes in the cytoplasm
In eukaryotes, freshly transcribed mRNA in the nucleus may be, for example, about 5000 nucleotides long
When the same mRNA is used for translation at the ribosome, it is only about 1000 nucleotides long
The mRNA has been edited
The parts which are kept for gene expression are called EXONS (exons = expressed)
The parts which are edited out (by spliceosomes) are called INTRONS
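The editing step can be sketched as cutting out introns at known coordinates and joining the remaining exons. The function name, the half-open intron coordinates and the toy sequence are assumptions; the check reflects the GT-AG rule, which reads GU...AG in RNA.

```python
def splice(pre_mrna, introns):
    """Remove introns from a pre-mRNA and join the exons.

    introns: list of non-overlapping (start, end) positions, 0-based, half-open.
    Each intron must begin with GU and end with AG.
    """
    exons = []
    pos = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if not (intron.startswith("GU") and intron.endswith("AG")):
            raise ValueError(f"intron at {start}-{end} does not follow the GU-AG rule")
        exons.append(pre_mrna[pos:start])  # exon before this intron
        pos = end
    exons.append(pre_mrna[pos:])           # final exon
    return "".join(exons)


pre_mrna = "AUGGCC" + "GUAAGUCCCAG" + "UUUUAA"  # exon 1 + intron + exon 2
print(splice(pre_mrna, [(6, 17)]))  # AUGGCCUUUUAA
```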

Translation
▪ Location: the ribosomes in the cytoplasm, which provide the environment for translation
▪ The genetic code is brought by the mRNA molecule.
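Reading the genetic code on the mRNA can be sketched with the standard codon table. The function name is an assumption; the sketch starts at the first AUG, stops at the first in-frame stop codon, and expects only A, C, G and U in the input.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG; * marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGUUUGGCUGA"))  # MFG
```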

An important discovery
▪ Retroviruses (e.g. HIV) carry RNA as their genetic information
▪ When they invade their host cell they convert their RNA into a DNA copy using reverse transcriptase
▪ Thus the central dogma is modified: DNA ↔ RNA → Protein
▪ This has helped to explain an important paradox in the evolution of life.
[Figure: reverse transcriptase]

The paradox of DNA
▪ DNA is a very stable molecule
▪ It is a good medium for storing genetic material, but…
▪ DNA can do nothing for itself
▪ It requires enzymes for replication
▪ It requires enzymes for gene expression
▪ The information in DNA is required to synthesise enzymes (proteins), but enzymes are required to make DNA function
▪ Which came first in the origin of life, DNA or enzymes?

RIBOZYMES: Both genetic and catalytic
▪ Certain forms of RNA have catalytic properties: RIBOZYMES
▪ Ribosomes and spliceosomes are ribozymes
▪ RNA could have been the first genetic information, synthesizing proteins…
▪ …and at the same time a biocatalyst
▪ Reverse transcriptase provides the possibility of producing DNA copies from RNA.

