The document discusses several techniques for automated protein classification including ProtoMap and PRODISTIN. ProtoMap clusters proteins based on sequence similarity scores, while PRODISTIN clusters based on protein-protein interaction data. Both techniques have limitations as biological networks, such as protein and domain interaction networks, have been found to have scale-free topologies. Any improved protein classification algorithm needs to account for this scale-free network structure.
3. Why do we need automated
classification?
Sequencing a genome is only the first
step.
Between 35% and 50% of the proteins in
sequenced genomes have no assigned
function.
Direct observation of function is costly,
time consuming, and difficult.
4. Protein Domains
The tertiary structure of many proteins is built from
several domains.
Often each domain has a separate function to
perform for the protein, such as:
• binding a small ligand (e.g., a peptide in the
molecule shown here)
• spanning the plasma membrane
(transmembrane proteins)
• containing the catalytic site (enzymes)
• DNA-binding (in transcription factors)
• providing a surface to bind specifically to another
protein
In some (but not all) cases, each domain in a
protein is encoded by a separate exon in the gene
encoding that protein.
5. Inference through sequence
similarity
ProtoMap: Automatic Classification
of Protein Sequences, a Hierarchy
of Protein Families, and Local
Maps of the Protein Space (1999)
7. Observations
Sometimes you don’t know where the
domains are.
It is generally accepted that two
sequences with over 30% identity are
likely to have the same fold.
Homologous proteins have similar
functions.
Homology is a transitive relationship.
8. Departures
Authors do not attempt to define protein
domains or motifs.
Not dependent on predefined groups or
classifications.
Chart the space of all proteins in
SWISSPROT, as opposed to individual
families
Produce global organization of sequences.
9. Algorithm Overview
We construct a weighted graph where
the nodes are protein sequences and
the edges are similarity scores.
Cluster the network considering only
those edges above some threshold.
Decrease similarity threshold and
repeat.
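The loop above can be sketched in a few lines. This is a simplified illustration, not ProtoMap's actual merging rule: it treats every connected component at a given cutoff as one cluster, and it works on E-values, so "above the similarity threshold" becomes "E-value at or below the cutoff". The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def cluster_at_threshold(nodes, edges, max_e):
    """Connected components using only edges whose E-value is <= max_e."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, e_value in edges:
        if e_value <= max_e:          # keep only sufficiently strong similarities
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


def hierarchy(nodes, edges, thresholds):
    """Re-cluster at each successively more permissive E-value cutoff."""
    return [(t, cluster_at_threshold(nodes, edges, t)) for t in thresholds]


# Toy example (made-up E-values): A-B is very strong, B-C moderate, C-D weak.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 1e-120), ("B", "C", 1e-50), ("C", "D", 1e-2)]
levels = hierarchy(nodes, edges, [1e-100, 1e-40, 1e-1])
```

Each level contains the clusters of the previous one, merged further, which is what gives the hierarchy of families.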
10. Measuring Sequence Similarity
The expectation value (E-value) is used: the
number of matches scoring at least this well
that are expected to occur by chance.
A lower value implies stronger similarity;
E falls exponentially as the score rises.
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
$$E = \frac{N}{2^{S'}}$$
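A small sketch of the two formulas, assuming S is the raw alignment score, λ and K are the statistical parameters of the scoring system, and N is the size of the search space (the slides do not define the symbols, so these readings are assumptions). The numeric values are illustrative only.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, lam, k, search_space):
    """Expectation value E = N / 2**S'."""
    return search_space / 2 ** bit_score(raw_score, lam, k)


# Illustrative parameters, not taken from the slides.
example = e_value(raw_score=100, lam=0.267, k=0.041, search_space=1e9)
```

Substituting S' into E gives the equivalent form E = K N e^(−λS), which makes the exponential dependence on the raw score explicit.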
12. Finding Homologies
Very difficult to distinguish a clear
threshold between homology and
chance similarity.
Authors chose E = 0.1, 0.1, and 0.001 for
Smith-Waterman (SW), FASTA, and BLAST, respectively.
Spent a lot of time empirically
determining these thresholds.
13. Clustering
Clustering is done
iteratively.
Start with a threshold
of $E < 10^{-100}$
Cluster and increase
threshold by a factor
of $10^5$
Sublinear threshold
prevents the collapse
of sequence space
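The threshold schedule can be written out directly. A minimal sketch: the starting cutoff and the factor come from the slide, while the final cutoff (10^0 here) is an assumption, since the slide does not say where the iteration stops.

```python
def threshold_schedule(start_exp=-100, stop_exp=0, step_exp=5):
    """E-value cutoffs 10**start_exp, 10**(start_exp + step_exp), ...,
    never exceeding 10**stop_exp (stop_exp is an assumed end point)."""
    return [10.0 ** e for e in range(start_exp, stop_exp + 1, step_exp)]


schedule = threshold_schedule()
```

Working with exponents rather than repeatedly multiplying by 10^5 avoids accumulating floating-point error across the levels.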
14. ProtoMap: Results
Produces well-defined groups which
correlate strongly to protein families in
PROSITE and Pfam.
18. Inference through protein
interaction networks
Functional Classification of
Proteins for the Prediction of
Cellular Function from a Protein-
Protein Interaction Network
(2003)
19. PRODISTIN
• Very similar to ProtoMap,
only the data used to
produce the graph is a list
of binary protein-protein
interactions instead of
sequence similarity scores
• Sequence similarity not a
dominating factor in
PRODISTIN clusters
23. Problems with PRODISTIN
• Paucity of
protein-protein
interaction data
(average # of
connections =
2.6)
• Either very
robust or very
indiscriminate
24. Problems: Multidomain and
Nonlocal Proteins
• protein kinases
• hydrolases
• ubiquitin…
PRODISTIN: Present problems in clustering by
biochemical function
ProtoMap: Can create undesired connection among
unrelated groups
25. Scale-Free Networks
• Node connection probability follows
a power law distribution
• Maximum degree of separation
grows as O(lg n)
• Highly robust under noise, except
at hubs and superhubs.
$$P(\text{linking to node } i) \sim \frac{k_i}{\sum_j k_j}$$
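The linking rule above can be simulated with a simple growth model. This is a sketch under assumptions: each new node adds exactly one edge, and the network starts from a single connected pair. The function name and parameters are hypothetical.

```python
import random


def preferential_attachment(n_nodes, seed=0):
    """Grow a network where each new node links to node i
    with probability k_i / sum_j k_j."""
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}
    edges = [(0, 1)]
    # Every node appears in this list once per incident edge, so a uniform
    # draw from it selects node i with probability k_i / sum_j k_j.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        degree[new] = 1
        degree[target] += 1
        endpoints.extend([new, target])
    return degree, edges


degree, edges = preferential_attachment(1000)
```

Sorting `degree.values()` for a run like this typically shows a few very high-degree hubs and a large majority of nodes with a single link.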
28. Metabolic Networks
• The E. coli metabolic network is scale-free.
• Actually, the metabolic networks of all organisms in
all three domains of life appear to be scale-free (43
examined)
• The network diameter of all 43 metabolic networks is
the same, irrespective of the number of proteins
involved.
• Is this counter-intuitive? Yes.
http://biocomplexity.indiana.edu/research/bionet/
29. Protein Domain Networks
• Protein Domains – Nature’s take on writing
modular code
• Reconciles apparent paradox of a fixed network
diameter across species – despite vast differences in
complexity (some human proteins have 130
domains)
• Occurrence of specific protein domains in
multidomain proteins is scale-free.
http://mbe.oupjournals.org/cgi/content/full/18/9/1694
30. Protein Domain Graphs
• Prosite domains have a distribution following the
power-law function $f(x) = a(b + x)^{-c}$, with c = 0.89.
There are few highly connected domains and many
rarely connected ones.
• ProDom and Pfam domains follow the power
function $P(k) \approx k^{-\gamma}$
γ = 2.5 for ProDom
γ = 1.7 for Pfam
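One rough way to obtain an exponent like those above is a straight-line fit on log-log axes. The sketch below is an illustration only: the slides do not say how the exponents were fitted, least squares on log-log data is a crude estimator, and the input here is synthetic.

```python
import math


def estimate_gamma(degrees, frequencies):
    """Estimate gamma in P(k) ~ k**(-gamma) as the negated slope of a
    least-squares line through (log k, log P(k))."""
    xs = [math.log(k) for k in degrees]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic data that follows an exact power law with gamma = 2.5.
ks = list(range(1, 51))
gamma = estimate_gamma(ks, [k ** -2.5 for k in ks])
```

On real degree counts the tail is noisy, so the fitted value depends on binning and on which range of k is included.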
32. Conclusions
• The accuracy of both ProtoMap and PRODISTIN is
limited because they make the tacit assumption of a
random network topology.
• Protein-Protein interaction networks have scale-
free topology, foiling PRODISTIN
• Protein Domain networks have scale-free topology,
foiling ProtoMap
• Any protein classification algorithm that performs
better than ProtoMap is probably going to have to
address this issue.