1. Combining large-scale evolutionary analyses
with multiple biological data sources to predict
human protein function
David Jones
UCL Depts. of Computer Science and Structural and Molecular Biology
2. Background
In Uniprot, 30% of human proteins still have no functional annotations at all, and only 0.5% have completely specific ones for all aspects.
[Figure: annotation coverage by GO aspect (MF, BP, CC), with a 30% segment labelled.]
3. Main approaches for function annotation
• Annotation transfers by homology
e.g. BLAST, HMMER
Only applicable to a subset of the data
Has reached a plateau in terms of novel function
annotation but provides highest quality information
• Model-classifier based using sequence features
Limited to common and broad functions for which there
are many examples
4. FFPRED - Function Prediction Pipeline
[Diagram: a novel amino acid sequence is described by sequence characteristics (structure, disorder, aa, transmem, motifs, localisation); these are passed to an SVM classifier, which outputs a posterior probability estimate for each GO term.]
7. Going further – computing gene function
from multiple data sources
• FFPRED is a currently available server for human
(and vertebrate) proteins
• It works well but is limited to predicting only the
functional classes that it was trained to recognize
• Extending the library requires time-consuming
training of new SVM models
• It also cannot be applied to rare functional classes
due to limited training sets
8. Desirable features of a new approach
• Able to annotate all sequences
• Able to predict rare functions
• Able to offer something more than simple
homology-based approaches
• Amenable to easy and quick updating
9. FunctionSpace Data Sources for H. sapiens
• Sequence similarity
• Signal peptides and other local features
• Predicted secondary structure
• Transmembrane segments
• Predicted disordered regions
• Domain architecture patterns
• Gene fusion information
• Gene co-expression
• Protein-protein interactions
For each sequence 49,231 features were derived
10. Aim
To estimate the functional similarity (a.k.a. semantic distance)
between two human proteins from their sequence features
plus available high throughput data.
[Diagram: Protein A and Protein B are compared to give a functional similarity score.]
11. Large-scale (domain-based) evolutionary
features
• Patterns of domain occurrence can provide
valuable functional clues
• “Deeper” homology detection allows greater
coverage
• We make use of our in-house fold/domain
recognition method and several public domain
libraries
12. pDomTHREADER Domain Coverage
[Figure: domain coverage by residues and by sequences, comparing Gene3d with threading (percentages shown: 35.7%, 81.6%, 64.8%, 59.4%), alongside a bar chart of CATH domain annotation counts (axis 0 to 7,000,000) for public domain data versus threading.]
37.56% increase in domain annotations across 5.5M sequences
~1.7 million novel domain assignments over public domain data
13. Computational Practicalities
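[Diagram: 5.5M query sequences are distributed across Legion nodes; each is searched with PSIBLAST against a 2Gb sequence database (5.5M seqs) to find matches and generate alignments (1 min to 3 hours per sequence); results are then stored and post-processed.]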
“Embarrassingly parallel” application: one sequence = one job.
Ideal capacity filling task for a modern supercomputer like Legion.
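The "one sequence = one job" pattern can be illustrated with a minimal sketch. The function `run_search` is a hypothetical stand-in for a single search job (it only returns the sequence length), and the thread pool stands in for a cluster scheduler; neither is taken from the actual pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def run_search(record):
    """Hypothetical stand-in for one search job on one sequence."""
    seq_id, sequence = record
    # A real job would launch the search here and store its alignments.
    return seq_id, len(sequence)


def run_all(records, workers=4):
    """Run one independent job per sequence and collect results by id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_search, records))


# Toy query set (made-up sequences, for illustration only).
queries = [("seqA", "MKTAYIAKQR"), ("seqB", "MSDNE"), ("seqC", "MAAAGLV")]
results = run_all(queries)
```

Because no job depends on any other, the work scales simply by adding workers, which is what makes it a good capacity-filling task.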
14. Gene Fusion Events can Predict Protein-
Protein Interactions from Sequence Data
[Diagram: proteins H1 (3.90.850.10, fumarylacetoacetase) and H2 (3.60.15.10, beta lactamase) occur fused as a single bi-functional enzyme (3.90.850.10 + 3.60.15.10) in Mycobacterium tuberculosis, Mycobacterium paratuberculosis and Mycobacterium avium. Function labels: hydrolase activity; hydrolysis of C-N bonds; hydrolysis of C-C bonds.]
15. A Novel Gene Fusion Discovered using CATH
domain fusion analysis
[Diagram: phosphoglyceromutase (3.40.120.10, Alpha-D-Glucose-1,6-Bisphosphate) and DNA repair protein RAD50 (3.40.50.300, P-loop nucleotide triphosphate hydrolases) are found fused (3.40.120.10 + 3.40.50.300) in Saccharopolyspora erythraea and Syntrophomonas wolfei. Also shown: 3.40.120.10 and a transcription coupling repair factor. Function labels: oxidative stress; D-glucose metabolism; DNA repair.]
16. Novel Gene Fusion Discovery
[Diagram: fused domain architectures. Saccharopolyspora erythraea: 3.40.120.10, 3.40.50.300, 3.40.50.300. Syntrophomonas wolfei: 3.40.50.300, 3.40.120.10.]
Novel annotations
• Rice PGM1 gene annotated as GO:0006950 response to
stress
• PGM3 has relationship with DNA repair sequence
Kanazawa K, Ashida H (1991) Relationship between oxidative stress
and hepatic phosphoglucomutase activity in rats. Int J Tissue React 13: 225
17. Domain based features
[Diagram: domain architectures and domain complexes are each scored, giving two feature sets of 7960 and 11210 features.]
18. Fusion scoring
Each domain is a feature; the score has 2 components:
1. Prediction quality (logistic transform of feature)
2. Promiscuity weight, related to the number of times the sequence occurs as part of a fused product: $w_i = \log(\mathrm{fus}_i)$
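The two score components can be sketched as follows. The slide does not give the parameters of the logistic transform, the base of the logarithm, or how the two components are combined, so this sketch assumes a standard logistic function and a natural logarithm, and simply returns both values side by side. All function names are hypothetical.

```python
import math


def prediction_quality(raw_score):
    """Standard logistic transform (assumed form) of a raw feature score."""
    return 1.0 / (1.0 + math.exp(-raw_score))


def promiscuity_weight(fusion_count):
    """w_i = log(fus_i); natural log assumed, fusion_count must be > 0."""
    return math.log(fusion_count)


def fusion_components(raw_score, fusion_count):
    """Return both components; how they are combined is not specified."""
    return prediction_quality(raw_score), promiscuity_weight(fusion_count)


quality, weight = fusion_components(0.0, 1)
```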
19. Integration of “External” Features:
Microarray Expression Data
[Figure: normalised microarray datasets. Probe signal (log2, axis 0 to 14) for Gene A and Gene B plotted across 16 experiments (conditions); co-expression is measured by Pearson correlation (R).]
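Pearson correlation between the expression profiles of two genes can be computed as in the sketch below. The profiles used here are made-up toy values, not data from the talk, and the function name is illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Toy log2 probe signals across four conditions (illustrative values only).
gene_a = [6.0, 8.0, 10.0, 12.0]
gene_b = [3.0, 4.0, 5.0, 6.0]
r_same = pearson_r(gene_a, gene_b)
r_opposite = pearson_r(gene_a, list(reversed(gene_b)))
```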
20. Biclustering Microarray Expression Data
[Figure: two example biclusters: zinc binding sequences (global correlation 0.42) and a set of transcription factors (global correlation 0.48).]
23912 features generated from biclustering of 2346 publicly available microarrays (81 experiments) using BIMAX algorithm
21. FunctionSpace: Two-stage Integration of Data
[Diagram: feature vectors for Protein A and Protein B are passed to a set of SVMs (labelled SVMsw, SVMloc, SVMss, SVMtm, SVMdis, SVMdpc, SVMfsc, SVMgfc, SVMdpp, SVMgfp, SVMge, SVMppi), whose outputs are integrated into a functional similarity score.]
22. A 3-D Projection of Annotated Human Proteins
• 49,231 dimensions first
reduced to 11 dimensions by
SVM regression with 11
different groups of features
• Each protein is here
represented as a point in this
derived 11-D feature space
projected into 3-D
• Colouring is according to
functional similarity which
shows that proteins with similar
functions (warmer colours)
cluster strongly in this space
• 75% of nearest neighbour pairs
share common GO terms
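The nearest-neighbour statistic in the last bullet can be sketched as follows. This assumes Euclidean distance in the reduced feature space, which the slide does not state; the points are 2-D toy data rather than the 11-D space, and the term labels are placeholders rather than real GO identifiers.

```python
import math


def nearest_neighbour(index, points):
    """Index of the closest other point (Euclidean distance assumed)."""
    best, best_distance = None, math.inf
    for j, point in enumerate(points):
        if j == index:
            continue
        distance = math.dist(points[index], point)
        if distance < best_distance:
            best, best_distance = j, distance
    return best


def shared_term_fraction(points, term_sets):
    """Fraction of proteins sharing a term with their nearest neighbour."""
    hits = 0
    for i in range(len(points)):
        j = nearest_neighbour(i, points)
        if term_sets[i] & term_sets[j]:
            hits += 1
    return hits / len(points)


# Toy data: two close pairs, only the first pair shares a term.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
term_sets = [{"term_a"}, {"term_a", "term_b"}, {"term_c"}, {"term_d"}]
fraction = shared_term_fraction(points, term_sets)
```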
24. Function Annotation Results for 20674
Unannotated IPI Human Sequences
Each sequence is classed “Easy”, “Medium” or “Hard” depending on
degree of homology to functionally annotated proteins in UNIPROT.
25. Preliminary Results
In 2009 FunctionSpace produced GO term predictions for 19678 IPI
uncharacterized human sequences. 2746 have been annotated since.
Measure                   MF      BP
% Exact matches           16%     9%
Mean semantic distance    -1.3    -1.7
[Figure: MF and BP panels, each with an axis running from less specific to more specific.]
26. Initial considerations for CAFA
• 50,000 sequences
• 11 eukaryotic & 7 prokaryotic species
• High specificity annotations needed
• Partial descriptive text already in Swiss-Prot/Uniprot for some
entries
• FFPRED/FunctionSpace would not be enough
• Need to incorporate textual information from databases
and comprehensive homology (orthology)-derived labels
• Need to get all this working in a few months!
27. Best Laid Plans for CAFA
• Plan A
– Build separate annotation pipelines for missing data
– Calibrate each pipeline according to precision values derived from
benchmark on 500 highly annotated Swiss-Prot entries
– Combine pipeline annotations using high-level classifier (SVM or Naive
Bayes)
• Plan B
– No time to build high-level classifier!
– Combine annotation sources using heuristic graphical approach
• Hope for the best!
(and expect the worst...)
28. GO term prediction from Swiss-Prot
text-mining
• For targets which already had
descriptive text, keywords or
comments in Swiss-Prot, GO terms
were assigned using a naive Bayes
text-classification approach
• Single words and groups of 2 and 3
words were counted
• Words occurring in different Swiss-Prot
record types were distinguished in the
analysis, and some simple pre-parsing
of feature (FT) records was carried out
in addition.
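A naive Bayes text classifier over 1-, 2- and 3-word groups can be sketched as below. This is a generic multinomial naive Bayes with add-one smoothing, not the author's implementation; the distinction between Swiss-Prot record types and the FT pre-parsing are omitted, and the training texts and labels are toy examples standing in for GO terms.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=3):
    """Return all runs of 1 to max_n consecutive words, lower-cased."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


class NaiveBayes:
    """Multinomial naive Bayes over word n-grams with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, text, label):
        features = ngrams(text)
        self.label_counts[label] += 1
        self.feature_counts[label].update(features)
        self.vocabulary.update(features)

    def log_score(self, text, label):
        total = sum(self.feature_counts[label].values())
        prior = self.label_counts[label] / sum(self.label_counts.values())
        score = math.log(prior)
        for feature in ngrams(text):
            if feature not in self.vocabulary:
                continue  # n-grams never seen in training carry no evidence
            count = self.feature_counts[label][feature]
            score += math.log((count + 1) / (total + len(self.vocabulary)))
        return score

    def predict(self, text):
        return max(self.label_counts, key=lambda lab: self.log_score(text, lab))


model = NaiveBayes()
model.train("ATP binding kinase domain", "kinase activity")
model.train("DNA binding zinc finger", "DNA binding")
predicted = model.predict("putative kinase domain")
```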
29. Homology-based annotation sources
• PSI-BLAST searches against Uniprot
– Low E-value threshold to ensure close homologues used for
annotation transfer
– Alignment length threshold to avoid domain problem
• Transfer of annotations from orthologues
– EggNOG 2.0
– More reliable GO term transfer than for PSI-BLAST but lower
coverage
• Profile-profile searches against Swiss-Prot
– Low reliability transfer from very distant homologues
– Improves coverage where needed (at expense of specificity)
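The two PSI-BLAST filters (E-value and alignment length) can be sketched as a simple hit filter. The threshold values, the coverage definition and the `Hit` record are all assumptions made for illustration; the slide gives no numbers and does not describe the data structures used.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one database hit."""
    subject_id: str
    evalue: float
    aligned_length: int
    subject_length: int
    go_terms: frozenset


def transferable_terms(query_length, hits, max_evalue=1e-6, min_coverage=0.8):
    """Collect GO terms from hits that are both close and near full-length.

    Coverage is measured against the longer of the two sequences, so a hit
    that matches only one domain of a multi-domain protein is rejected.
    Both thresholds are illustrative defaults, not values from the talk.
    """
    terms = set()
    for hit in hits:
        coverage = hit.aligned_length / max(query_length, hit.subject_length)
        if hit.evalue <= max_evalue and coverage >= min_coverage:
            terms |= hit.go_terms
    return terms


# Toy hits with placeholder term labels.
hits = [
    Hit("close_full", 1e-40, 290, 300, frozenset({"term_x"})),
    Hit("close_partial", 1e-30, 90, 700, frozenset({"term_y"})),
    Hit("distant", 0.5, 295, 300, frozenset({"term_z"})),
]
accepted = transferable_terms(300, hits)
```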
30. Heuristic back-propagation of precision
estimates
• Back-propagation of precision estimates, repeated for each annotation source
• Used to define a consensus for each node
$P' = 1 - (1 - P)(1 - Q)$
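The combination rule P' = 1 - (1 - P)(1 - Q) can be sketched together with one plausible way of applying it over the GO graph. The propagation scheme here (each predicted term contributes its precision once to itself and to each of its ancestors) is an assumption; the slide only gives the combination formula. Term names and the `parents` mapping are hypothetical toy data.

```python
def combine(p, q):
    """Combine two precision estimates: P' = 1 - (1 - P)(1 - Q)."""
    return 1.0 - (1.0 - p) * (1.0 - q)


def ancestors(term, parents):
    """All terms reachable by following parent links from term."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def back_propagate(precisions, parents):
    """Push each term's precision up to its ancestors (assumed scheme)."""
    scores = {}
    for term, precision in precisions.items():
        for node in {term} | ancestors(term, parents):
            scores[node] = combine(scores.get(node, 0.0), precision)
    return scores


# Toy graph: two leaves share a parent, which has a single root above it.
parents = {"leaf_a": ["mid"], "leaf_b": ["mid"], "mid": ["root"]}
scores = back_propagate({"leaf_a": 0.5, "leaf_b": 0.5}, parents)
```

Two independent sources of 0.5 precision supporting the same parent node give 0.75, so agreement between specific predictions raises confidence in their shared, more general term.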
31. Final steps
• After back-propagation, all referenced GO terms
are ranked according to final confidence scores
• To reduce conflicting annotations, pairs of terms
with zero observed co-occurrence frequency in
GOA are subjected to pairwise tournament
selection.
• Results submitted to server using the
mouse-window-cut-paste-click-submit
algorithm
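The conflict-reduction step can be sketched as a greedy filter. The slide does not say how the tournament is run, so this assumes that in each conflicting pair the higher-scoring term wins, and that terms are considered in rank order against the terms already kept. The co-occurrence table and term names are toy placeholders, not GOA data.

```python
def resolve_conflicts(scores, cooccurrence):
    """Keep terms in rank order, dropping any that never co-occur with a
    higher-ranked term that has already been kept (assumed scheme).

    cooccurrence maps frozenset({term1, term2}) to an observed count;
    a missing pair is treated as zero observed co-occurrence.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = []
    for term in ranked:
        compatible = all(
            cooccurrence.get(frozenset((term, other)), 0) > 0
            for other in kept
        )
        if compatible:
            kept.append(term)
    return kept


# Toy example: "b" never co-occurs with the higher-scoring "a".
scores = {"a": 0.9, "b": 0.8, "c": 0.4}
cooccurrence = {frozenset(("a", "c")): 12}
final_terms = resolve_conflicts(scores, cooccurrence)
```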
32. CASP vs CAFA from a Predictor’s Point of
View
• Number of targets
– Manual vs automated approaches
• Difficulty of targets
– A major limit in driving CASP forwards
• Assessment
– Hard to pre-judge impact of decisions made during
prediction season
• Tools for the community
– Standards and methods in CASP have been very useful
• Getting the word out to the wider community