This document discusses various methods for predicting protein function from sequence and structure. It begins by explaining the importance of predicting protein function for applications like disease diagnosis and drug discovery. It then outlines different types of data that can be used for functional prediction, including sequence, structure, expression profiles, and interactions. Both sequence-based methods like homology searches and domain identification as well as structure-based approaches are covered. Specific tools discussed include BLAST, Pfam, SCOP, CATH, and ProFunc. The document emphasizes that functional prediction is challenging given proteins can have multiple functions and homology does not always imply similar function. It also notes limitations of simple homology searches.
2. Points to remember
• Proteins are single, unbranched chains of amino acid monomers.
• There are 20 different standard amino acids.
• There are four levels of protein structure: primary, secondary, tertiary and quaternary.
• A protein's amino acid sequence determines its three-dimensional structure (conformation).
3. Protein Functional Classes
Why do we care about protein function?
• To diagnose the causes of disease.
• To discover new drugs.
• To understand the mechanism of action of processes in the system.
4. Data used for prediction of protein function
• Amino acid sequences
• Protein structure
• Genome sequences
• Phylogenetic data
• Microarray expression data
• Protein interaction networks and protein complexes
• Biomedical literature
5. The concept of protein function is highly context-sensitive and not very well defined. In fact, it typically acts as an umbrella term for all types of activities that a protein is involved in, be they cellular, molecular or physiological.
6. Characterization of protein function
Molecular function, cellular function and phenotypic function are hierarchically related.
Predicting function: from genes to genomes. Bork et al. 1998.
7. The Gene Ontology (GO) classification scheme categorizes protein function into cellular component, molecular function and biological process.
In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between pairs of concepts. It can be used to model a domain and support reasoning about entities.
Read more at http://www.answers.com/topic/ontology-computer-science
http://www.nature.com/ng/journal/v25/n1/full/ng0500_25.html
http://www.geneontology.org/
8. GO Format
• Wide coverage
• Standardized format
• Hierarchical structure
• Disjoint categories
• Multiple functions
• Dynamic nature
(Figure adapted from Ashburner et al. 2000)
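The hierarchical structure can be illustrated with a small sketch: each term points to its parent terms, and a protein annotated with a specific term is implicitly annotated with every broader ancestor. The term names below are GO molecular function names, but the edges are a simplified toy fragment assumed for illustration, and `ancestors` and `propagate` are hypothetical helper names.

```python
# Toy fragment of a GO-like hierarchy: child term -> list of parent terms.
# The edges are a simplified illustration, not a copy of the real ontology.
PARENTS = {
    "adenylate cyclase activity": ["cyclase activity", "lyase activity"],
    "cyclase activity": ["catalytic activity"],
    "lyase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "transporter activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand direct annotations so each term also implies all its ancestors."""
    expanded = set(annotations)
    for term in annotations:
        expanded |= ancestors(term, parents)
    return expanded
```

Because a term may have more than one parent, the structure is a directed acyclic graph rather than a strict tree, which is why the traversal keeps a `seen` set.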
9. Molecular function
• Molecular function describes activities, such as
catalytic or binding activities, at the molecular level
• GO molecular function terms represent activities rather
than the entities that perform the actions, and do not
specify where or when, or in what context, the action
takes place
• Examples of broad functional terms are catalytic
activity or transporter activity; an example of a
narrower term is adenylate cyclase activity
10. Biological process
• A biological process is a series of events
accomplished by one or more ordered assemblies
of molecular functions
• An example of a broad GO biological process
term is signal transduction; examples of more
specific terms are pyrimidine metabolism or
alpha-glucoside transport.
• It can be difficult to distinguish between a
biological process and a molecular function.
11. Cellular component
• A cellular component is just that, a component of
a cell that is part of some larger object
• It may be an anatomical structure (for example,
the rough endoplasmic reticulum or the nucleus)
or a gene product group (for example, the
ribosome, the proteasome or a protein dimer)
• The cellular component categories are probably
the best defined categories since they correspond
to actual entities
12. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene
lists using DAVID Bioinformatics Resources. Nature Protoc. 2009;4(1):44-57.
13. DAVID (Gene Ontology Enrichment)
YouTube videos
http://www.youtube.com/watch?v=xIu9mm6b7N0
http://www.youtube.com/watch?v=zedjRViji2c
Try analyzing the microarray probe list given below with DAVID.
31741_at 31734_at 32696_at 37559_at 41400_at 35985_at
39304_g_at 41438_at 35067_at 32919_at 35429_at 36674_at
967_g_at 36669_at 39242_at 39573_at 39407_at 33346_r_at
40319_at 2043_s_at 1788_s_at 36651_at 41788_i_at 35595_at
36285_at 39586_at 35160_at 39424_at 36865_at 2004_at
36728_at 37218_at 40347_at 36226_r_at 33012_at 37906_at
32872_at
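The idea behind GO enrichment can be sketched with a plain hypergeometric test: given how many background genes carry a GO term, how surprising is the number seen in the submitted list? This is a generic illustration under assumed inputs, not DAVID's own scoring (DAVID reports a modified Fisher exact statistic), and the function name is hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes,
    when K of the N background genes carry the annotation."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total
```

For example, `hypergeom_enrichment_p(10, 37, 300, 20000)` would estimate how likely it is to find at least 10 annotated genes in a 37-gene list by chance; all four numbers here are invented for illustration.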
15. Basic Set of Protein Annotations
• Protein name
- descriptive common name for the protein
- e.g. "kinase"
• Gene symbol
- mnemonic abbreviation for the gene
- e.g. "recA"
• EC number
- Enzyme Commission classification of the reaction catalyzed (applies to enzymes only)
• Role
- what the protein is doing in the cell and why
- e.g. "involved in glycolysis"
• Supporting evidence
- accession numbers of BER and HMM matches
- whatever information you used to make the annotation
• Unique identifier
- e.g. locus IDs
16. Sequence Similarity Evidence
• Pairwise alignment - two proteins' amino acid sequences aligned next to
each other so that the maximum number of amino acids match
• Multiple alignment - three or more amino acid sequences aligned to each other
so that the maximum number of amino acids match in each column
• Protein families - clusters of proteins that all share sequence similarity and
presumably similar function
• Motifs - short regions of amino acid sequence shared by many proteins. A
motif can be found in a number of different proteins, where it carries out
similar functions.
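A pairwise alignment that maximizes matches can be computed by dynamic programming. The sketch below is a minimal global alignment scorer (Needleman-Wunsch style) that returns only the best score; the match, mismatch and gap values are assumptions chosen for illustration, whereas real tools use substitution matrices such as BLOSUM62 and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # residue of a aligned to a gap
            left = curr[j - 1] + gap  # residue of b aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept in memory, which is enough when the score is needed but the alignment itself is not.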
17. Important terms to understand
• Homologs – two sequences that have evolved from the same
common ancestor; they may or may not share the same function.
• Orthologs – a type of homolog where the two sequences are
in different species and arose from a common ancestor.
Speciation has created the two copies of the sequence.
• Paralogs – a type of homolog where the two sequences
have arisen due to a gene duplication within one
species. They initially have the same function, but as time
goes by one copy is free to evolve new functions, while
the other copy maintains the original function.
• Xenologs – a type of homolog where the two gene
sequences have arisen due to horizontal transfer (that is,
transfer between organisms by means other than reproduction).
19. Sequence similarity, sequence homology, and
functional homology
• Sequence similarity means that the sequences
are similar – no more, no less
• Sequence homology implies that the proteins are
encoded by genes that share a common ancestry.
• Functional homology means that two proteins
from two organisms have the same function.
• Sequence similarity or sequence homology does
not guarantee functional homology
20. Existing sequence-based function prediction methods
Homology-based approaches
• BLAST
• FASTA
• SSEARCH
• PSI-BLAST - iterates searches using a sequence profile computed from the multiple
sequence alignment obtained from the search in the previous round.
Subsequence-based approaches
• Motifs and domains
http://molbiol-tools.ca/Motifs.htm
Feature-based approaches
• Properties such as normalized van der Waals volume, polarity, charge and surface tension
are averaged over all the residues in the sequence to obtain the feature-value vector for
the protein, which is then used to train a classifier
• SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi)
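The averaging step of a feature-based approach can be sketched as follows. This is not SVMProt's actual descriptor set: charge is taken from the list above, while Kyte-Doolittle hydropathy is swapped in as a stand-in for the other properties, and `feature_vector` is a hypothetical name.

```python
# Kyte-Doolittle hydropathy values, used here as a stand-in property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charge at neutral pH (histidine treated as neutral).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def feature_vector(sequence):
    """Average each per-residue property over the whole sequence."""
    residues = [r for r in sequence.upper() if r in HYDROPATHY]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    n = len(residues)
    mean_hydropathy = sum(HYDROPATHY[r] for r in residues) / n
    mean_charge = sum(CHARGE.get(r, 0.0) for r in residues) / n
    return [mean_hydropathy, mean_charge]
```

Vectors of this kind, computed for proteins of known function, would form the training input for a classifier such as a support vector machine.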
21. Drawbacks of BLAST and FASTA
• They typically provide functional annotation for only
about half of the genes in a genome, because homologous
sequences are not found at accepted
significance thresholds for the rest.
• Automated methods of annotation transfer
between similar sequences contribute to error
propagation.
22. Enhanced Sequence based methods
• PFP – Kihara lab (http://kiharalab.org/web/pfp.php)
• The PFP algorithm uses PSI-BLAST (version 2.2.6) to predict probable GO function annotations in
three categories (molecular function, biological process, and cellular component) with statistical
significance scores (P-value).
• For each sequence retrieved by PSI-BLAST, the associated GO terms are scored according to
a) the frequency of association with similar sequences
b) the degree of similarity those sequences share with the query
• The score of a GO term is
$$ s(f_a) = \sum_{i=1}^{N} \sum_{j=1}^{N_{func}(i)} \left( -\log_{10}\left(E\_value(i)\right) + b \right) P(f_a \mid f_j) $$
where $s(f_a)$ is the final score assigned to the GO term $f_a$, $N$ is the number of similar sequences retrieved by PSI-BLAST, $N_{func}(i)$ is the number of GO terms annotating sequence $i$, $E\_value(i)$ is the E-value given to sequence $i$, $f_j$ is a GO term annotating sequence $i$, and $b$ is a constant, $2 = \log_{10} 100$, which keeps the score positive for E-values below 100. $P(f_a \mid f_j)$ is the association score for $f_a$ given $f_j$, obtained from the function association matrix (FAM).
• The association score is
$$ P(f_a \mid f_j) = \frac{c(f_a, f_j) + \epsilon}{c(f_j) + l \cdot \epsilon} $$
where $c(f_a, f_j)$ is the number of times $f_a$ and $f_j$ are assigned simultaneously to a sequence in UniProt, $c(f_j)$ is the total number of times $f_j$ appears in UniProt, $l$ is the size of one dimension of the FAM (i.e. the total number of unique GO terms), and $\epsilon$ is the pseudocount.
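The scoring scheme defined above can be sketched in a few lines. This is an illustrative reading of those definitions, not the PFP source code: the input layout, the function names, the E-value floor and the treatment of a term paired with itself are all assumptions.

```python
from math import log10


def association(fa, fj, pair_counts, term_counts, n_terms, eps=1e-6):
    """P(fa | fj) = (c(fa, fj) + eps) / (c(fj) + n_terms * eps)."""
    # Assumption: a term always co-occurs with itself, so c(fj, fj) = c(fj).
    if fa == fj:
        co_count = term_counts.get(fj, 0)
    else:
        co_count = pair_counts.get((fa, fj), 0)
    return (co_count + eps) / (term_counts.get(fj, 0) + n_terms * eps)


def pfp_scores(hits, pair_counts, term_counts, n_terms, b=2.0, eps=1e-6):
    """Score GO terms from search hits given as (e_value, [GO terms]) pairs."""
    candidates = set(term_counts)
    for _, terms in hits:
        candidates.update(terms)
    scores = {}
    for e_value, terms in hits:
        # Assumption: floor the E-value so that a reported 0.0 stays finite.
        weight = -log10(max(e_value, 1e-180)) + b
        for fj in terms:
            for fa in candidates:
                p = association(fa, fj, pair_counts, term_counts, n_terms, eps)
                scores[fa] = scores.get(fa, 0.0) + weight * p
    return scores
```

Here `pair_counts` maps `(fa, fj)` tuples to co-annotation counts and `term_counts` maps each term to its total count. In this sketch both tables are supplied by the caller rather than derived from UniProt.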
23. When Homology searches fail
• Sometimes no orthologs or even paralogs can be
identified by sequence similarity searches, or they are
all of unknown function.
• No functional information can thus be transferred
based on simple sequence homology
• By instead analyzing the various parts that make up the
complete protein, it is nonetheless often possible to
predict the protein function
24. Protein domains
• Many eukaryotic proteins consist of multiple
globular domains that can fold independently
• These domains have been mixed and matched
through evolution
• Each type of domain contributes towards the
molecular function of the complete protein
• Numerous resources are able to identify such
domains from sequence alone using HMMs
29. Which domain resource should I use?
• SMART is focused on signal transduction domains
• Pfam is very actively developed and thus tends to
have the most up-to-date domain collection
• InterPro is useful for genome annotation since
the domains are annotated with GO terms
• CDD is conveniently integrated with the NCBI
BLAST web interface
30. Function prediction from post-translational modifications
• Proteins with similar function may not be related in sequence
• Still, they must perform their function in the context of the same cellular machinery
• Similarities in features such as PTMs and physical/chemical properties can therefore be expected for proteins with similar function
31. The concept of ProtFun
http://www.cbs.dtu.dk/services/ProtFun/
34. ProtFun data sets
• Labeling of training and test data
– Cellular role categories: human SwissProt sequences
were categorized using EUCLID
– Enzyme categories: top-level enzyme classifications
were extracted from human SwissProt description lines
– Gene Ontology terms were transferred from InterPro
• The sequences were divided into training and test
sets without significant sequence similarity
• Binary predictors were trained for each category
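The two data-preparation ideas above, a split with no similar sequences shared between the sets and one yes/no label set per category, can be sketched as follows. The sketch assumes sequences have already been grouped into similarity clusters by some external tool; the function names and input layout are hypothetical and do not describe how ProtFun was actually built.

```python
def split_by_cluster(cluster_of, test_fraction=0.2):
    """Split protein ids into train/test so no cluster spans both sets.

    cluster_of maps protein id -> similarity cluster id (computed elsewhere).
    """
    clusters = {}
    for protein, cluster in cluster_of.items():
        clusters.setdefault(cluster, []).append(protein)
    target = test_fraction * len(cluster_of)
    train, test = [], []
    # Visit clusters in sorted order so the split is reproducible.
    for cluster in sorted(clusters):
        members = sorted(clusters[cluster])
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return train, test


def binary_labels(categories_of, category):
    """One-vs-rest labels: 1 if the protein belongs to `category`, else 0."""
    return {p: int(category in cats) for p, cats in categories_of.items()}
```

Assigning whole clusters to one side is what prevents a predictor from being tested on close relatives of its training sequences.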
35. Structure based methods
Three standard databases dominate the structure data landscape:
PDB - structure data from NMR and X-ray crystallography
SCOP - organizes the available structures in a hierarchy (Family, Superfamily and Fold) so as to elicit the evolutionary relationships between them
CATH - classifies structures by Class, Architecture, Topology and Homologous superfamily
36. Structure based methods
[Figure: protein folds, super-secondary structures and biological function. Adapted from Martin 1998, "Protein folds and functions".]
37. Approaches for deriving functional information from 3D structure
Adapted from "From structure to function", Thornton et al., 2000, Nature.