Seminar of February 9, 2012 for the ICOS group in the University of Nottingham.
Abstract: The Protein Structure Prediction (PSP) problem is to determine the three-dimensional structure of a protein using only the information contained in its amino acid sequence. PSP is one of the most important open problems in structural bioinformatics, because the 3D structure determines the protein's function, and knowing it would be of enormous help in designing new drugs for diseases such as cancer or Alzheimer's. Two of the most widely used data structures for representing protein structures are contact maps and distance maps. Contact maps represent binary proximities (contact or non-contact) between each pair of amino acids of a protein; distance maps represent the distances between these amino acid pairs. However, contact and distance maps are very difficult to predict. In fact, the accuracy achieved by protein contact map predictors at Top L/5 in the last Critical Assessment of Techniques for Protein Structure Prediction competition (CASP9) was only up to approximately 22%, and clearly must be improved. In this seminar, the author will present an approach to predicting protein structures based on a nearest neighbors scheme. In this approach, protein fragments are assembled according to their physico-chemical similarities, using information extracted from known protein structures. The method produces a distance map, which provides more information about the structure of a protein than a contact map and which can be converted into a contact map using different thresholds. The prediction procedure starts with a feature selection over the 544 amino acid physico-chemical properties of the AAindex repository, resulting in different property sets that were used for the predictions. The author will show some recent results obtained with this approach and, finally, will outline some of his current research and future work.
Transcript:
Protein Distance Map Prediction using Nearest Neighbors
1. Gualberto Asencio Cortés
Supervisor: Jesús S. Aguilar Ruíz
Bioinformatics Group
School of Engineering
Pablo de Olavide University, Seville, Spain
Host: Jaume Bacardit
Protein Distance Map Prediction based on a Nearest Neighbors Approach
Current state of research
February 9, 2012
2. 1. Motivation
2. Our proposal
3. Recent results
4. Conclusions and future work
Protein Distance Map Prediction based on a Nearest Neighbors Approach
Gualberto Asencio Cortés
2 / 33
3. 1. Motivation
2. Our proposal
3. Recent results
4. Conclusions and future work
4. Proteins and amino acids
5. Protein structure representations
[Figure: 3D model, distance map and contact map (threshold = 8 Å) of protein 1M3Y (4 chains of 413 amino acids)]
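To make the distance map representation concrete, here is a minimal sketch of how such a map could be computed from 3D coordinates. The function name `distance_map` and the toy coordinates are illustrative assumptions, not taken from the slides; a real pipeline would read one representative atom per residue (for example the beta carbon) from a PDB file.

```python
import math


def distance_map(coords):
    """Return the symmetric matrix of Euclidean distances (in angstroms)
    between every pair of residue coordinates."""
    n = len(coords)
    dmap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
            dmap[i][j] = d
            dmap[j][i] = d  # a distance map is symmetric
    return dmap


# Hypothetical toy chain: four residues on a straight line, 3.8 angstroms apart.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_map = distance_map(toy_coords)
for row in toy_map:
    print(["%.1f" % d for d in row])
```

The diagonal is zero and the matrix is symmetric, so only the upper triangle needs to be stored or predicted.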
6. Protein Structure Prediction (PSP)
[Diagram labels: Protein structures, Training, New sequence]
7. Why is PSP important?
• Knowing protein functions
• Drug design for diseases such as cancer or Alzheimer's
• Protein docking and virtual screening
• Protein engineering
• …
8. Why another contact/distance map predictor?
• Currently a hot topic in bioinformatics journals
• Precision of contact prediction in the last CASP (CASP9) was only up to 22%, and clearly must be improved
▫ CASP competition: CASP10 in 2012!
http://predictioncenter.org/
9. Why distance maps?
• Why a threshold? Why 8 angstroms?
• Distance maps store more information
• Conversion to contact maps is very easy
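The conversion from a distance map to a contact map can be sketched in a few lines. This is an illustration only: the function name `to_contact_map`, the toy distances, and the choice of "less than or equal to" at the threshold are assumptions.

```python
def to_contact_map(dmap, threshold):
    """Mark a pair as a contact (1) when its distance is at most `threshold`
    angstroms, otherwise as a non-contact (0)."""
    return [[1 if d <= threshold else 0 for d in row] for row in dmap]


# Hypothetical 4-residue distance map (angstroms).
toy_dmap = [
    [0.0, 3.8, 7.3, 10.9],
    [3.8, 0.0, 3.8, 7.3],
    [7.3, 3.8, 0.0, 3.8],
    [10.9, 7.3, 3.8, 0.0],
]

# The same distance map yields different contact maps at different thresholds.
for cutoff in (4.0, 8.0):
    print(cutoff, to_contact_map(toy_dmap, cutoff))
```

Because the threshold is applied after prediction, one predicted distance map can be evaluated at 4 Å, 8 Å or any other cut-off without retraining.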
10. 1. Motivation
2. Our proposal
3. Recent results
4. Conclusions and future work
11. PDMpred: Prediction process
[Diagram: Training set of protein structures → distance maps (training data) → all training fragments → training profiles, each with its distance d, grouped by amino acid pair (A-A, A-R, ..., V-V)]
s_i ∈ {A, R, N, D, B, C, Q, E, Z, G, H, I, L, K, M, F, P, S, T, W, Y, V}
12. PDMpred: Prediction process
[Diagram: Training set of protein structures → distance maps (training data) → all training fragments → training profiles, each with its distance d, grouped by amino acid pair (A-A, A-R, ..., V-V)]
Profile 1: P_1, P_2, ..., P_m, D (a later build of the slide also includes the length: L, P_1, P_2, ..., P_m, D)
$$P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j)$$
Profile 2: B_1, E_1, ..., B_m, E_m, D
$$B_i = P_i(s_b) + \sum_{j=1,\, j \neq b}^{L} \frac{P_i(s_j)}{L\,|b-j|}, \quad \forall i \in \{1..m\}$$
$$E_i = P_i(s_e) + \sum_{j=1,\, j \neq e}^{L} \frac{P_i(s_j)}{L\,|e-j|}, \quad \forall i \in \{1..m\}$$
Note that prediction vectors represent fragments of different lengths, but these lengths are not included in them. The physico-chemical properties included in the prediction vectors are explained in the next subsection. From the point of view of data mining, B_i and E_i are the attributes of training instances and [...]
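The two profile definitions can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names, the toy property table and the 1-based positions b and e are assumptions, and the denominator of the B_i and E_i sums is read here as L·|b−j|, which is one possible reading of the extracted slide text.

```python
def interior_average(seq, b, e, prop):
    """Profile 1 feature: mean of one property over residues b+1 .. e-1
    (b and e are 1-based positions of the fragment ends)."""
    inner = seq[b:e - 1]  # 0-based slice covering 1-based positions b+1 .. e-1
    if not inner:
        raise ValueError("fragment needs at least one residue between b and e")
    return sum(prop[aa] for aa in inner) / len(inner)


def end_feature(seq, pos, prop):
    """Profile 2 feature (B_i or E_i) for the fragment end at 1-based `pos`:
    the property of that residue plus distance-weighted contributions of
    every other residue in the sequence."""
    length = len(seq)
    total = prop[seq[pos - 1]]
    for j in range(1, length + 1):
        if j != pos:
            # ASSUMPTION: denominator read as L * |pos - j|.
            total += prop[seq[j - 1]] / (length * abs(pos - j))
    return total


def profile2(seq, b, e, props):
    """Build the vector B_1, E_1, ..., B_m, E_m for a list of m property tables."""
    vector = []
    for prop in props:
        vector.append(end_feature(seq, b, prop))
        vector.append(end_feature(seq, e, prop))
    return vector


# Hypothetical property values for a three-residue toy sequence.
toy_property = {"A": 1.0, "R": 2.0, "V": 3.0}
print(interior_average("ARV", 1, 3, toy_property))
print(profile2("ARV", 1, 3, [toy_property]))
```

In a training profile the resulting vector would be paired with the known distance D between residues b and e; in a test profile D is the unknown to predict.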
13. PDMpred: Prediction process
[Diagram: training data: training set of protein structures → distance maps → all training fragments → training profiles with distance d, grouped by amino acid pair (A-A, A-R, ..., V-V); test data: test set of protein sequences → distance maps with unknown entries (?) → all test fragments → test profiles (A-A, ..., V-V)]
14. PDMpred: Prediction process
[Diagram: training data: training set of protein structures → distance maps → all training fragments → training profiles with distance d, grouped by amino acid pair (A-A, A-R, ..., V-V); test data: test set of protein sequences → all test fragments → test profiles (A-A, ..., V-V); the unknown entries (?) of the test distance maps are filled with the average of the distances of the nearest training profiles]
Neighbor search for each test prediction vector:
test:      t_1 ... t_n   ?
training:  a_1 ... a_n   D_a
           b_1 ... b_n   D_b
           ...
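The neighbor search and averaging step can be sketched as a small k-nearest-neighbors regression. The Euclidean metric, the function names and the toy vectors below are assumptions for illustration; the slides only state that the distances of the nearest training profiles are averaged.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two prediction vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_distance(test_vector, training_profiles, k=1):
    """Average the known distances D of the k training profiles whose
    prediction vectors are closest to `test_vector`.

    `training_profiles` is a list of (vector, distance) pairs."""
    ranked = sorted(training_profiles, key=lambda p: euclidean(test_vector, p[0]))
    nearest = ranked[:k]
    return sum(d for _, d in nearest) / len(nearest)


# Hypothetical training profiles: (prediction vector, distance in angstroms).
toy_training = [
    ([0.0, 0.0], 5.0),
    ([1.0, 1.0], 7.0),
    ([10.0, 10.0], 20.0),
]
print(predict_distance([0.2, 0.1], toy_training, k=1))
print(predict_distance([0.2, 0.1], toy_training, k=2))
```

The diagram suggests that profiles are grouped by amino acid pair, so in practice the search for a test profile of type A-R would presumably run only over training profiles of the same type.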
15. PDMpred: Evaluation measures
[...] second was a measure of recall, which has been used in other protein prediction methods [14]. Finally, we have obtained measures of accuracy, specificity and Matthews Correlation Coefficient, which may often provide a much more balanced evaluation of the prediction than, for instance, the percentages [15]. The following formulas (2-6) define these five measures.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (4)$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (5)$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (6)$$
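The five measures can be computed directly from the four confusion counts. A minimal sketch, in which the function name and the example counts are made up for illustration and zero denominators are not handled:

```python
import math


def evaluate(tp, fp, fn, tn):
    """Compute precision, recall, accuracy, specificity and MCC from the
    counts of true/false positives and negatives (contacts are positives)."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical counts: contacts are rare, so true negatives dominate.
toy_scores = evaluate(tp=40, fp=10, fn=20, tn=930)
for name, value in toy_scores.items():
    print(name, round(value, 3))
```

Because non-contacts vastly outnumber contacts, accuracy and specificity stay high even for weak predictors, which is why precision, recall and MCC are the more informative figures in the results that follow.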
16. 1. Motivation
2. Our proposal
3. Recent results
4. Conclusions and future work
17. Recent results
① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according
to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181.
http://www.upo.es/eps/asencio/asppred
② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based
approach for viral protein structure prediction. In: 9th European Conference
on Evolutionary Computation, Machine Learning and Data Mining in
Bioinformatics (EvoBio 2011), Torino, Italy. Lecture Notes in Computer Science
6623, p. 69-76, Springer 2011, ISBN 978-3-642-20388-6.
③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012)
Prediction of mitochondrial matrix protein structures based on feature
selection and fragment assembly. In: 10th European Conference on
Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics
(EvoBio 2012), Málaga, Spain (accepted).
18. Protein Distance Map Prediction based on a Nearest Neighbors Approach
① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according
to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181.
BUNA790101 alpha-NH chemical shifts (Bundi-Wuthrich, 1979)
BUNA790103 Spin-spin coupling constants 3JHalpha-NH (Bundi-Wuthrich, 1979)
CHAM820102 Free energy of solution in water, kcal/mole (Charton-Charton, 1982)
FAUJ880111 Positive charge (Fauchere et al., 1988)
FAUJ880112 Negative charge (Fauchere et al., 1988)
GARJ730101 Partition coefficient (Garel et al., 1973)
JOND750102 pK (-COOH) (Jones, 1975)
KARP850103 Flexibility parameter for two rigid neighbors (Karplus-Schulz, 1985)
KHAG800101 The Kerr-constant increments (Khanarian-Moore, 1980)
MAXF760103 Normalized frequency of zeta R (Maxfield-Scheraga, 1976)
PRAM820101 Intercept in regression analysis (Prabhakaran-Ponnuswamy, 1982)
QIAN880139 Weights for coil at the window position of 6 (Qian-Sejnowski, 1988)
RICJ880101 Relative preference value at N" (Richardson-Richardson, 1988)
RICJ880104 Relative preference value at N1 (Richardson-Richardson, 1988)
RICJ880114 Relative preference value at C1 (Richardson-Richardson, 1988)
RICJ880117 Relative preference value at C" (Richardson-Richardson, 1988)
SUEM840102 Zimm-Bragg parameter sigma x 1.0E4 (Sueki et al., 1984)
TANS770102 Normalized frequency of isolated helix (Tanaka-Scheraga, 1977)
TANS770108 Normalized frequency of zeta R (Tanaka-Scheraga, 1977)
VASM830101 Relative population of conformational state A (Vasquez et al., 1983)
VELV850101 Electron-ion interaction potential (Veljkovic et al., 1985)
WERD780102 Free energy change of epsilon(i) to epsilon(ex) (Wertz-Scheraga, 1978)
WERD780103 Free energy change of alpha(Ri)to alpha(Rh)(Wertz-Scheraga, 1978)
YUTK870103 Activation Gibbs energy of unfolding, pH7.0 (Yutani et al., 1987)
AURR980120 Normalized positional residue frequency at helix termini C4' (Aurora-Rose,
NADH010107 Hydropathy scale based on self-information values in the two-state model
MONM990201 Averaged turn propensities in a transmembrane helix (Monne et al., 1999)
MITS020101 Amphiphilicity index (Mitaku et al., 2002)
WILM950104 Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/2-PrOH/MeCN/H2O
DIGM050101 Hydrostatic pressure asymmetry index, PAI (Di Giulio, 2005)
Profile
Profile 1: L, P_1, P_2, ..., P_m, D
$$P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j)$$
Profile 2: B_1, E_1, ..., B_m, E_m, D
$$B_i = P_i(s_b) + \sum_{j=1,\, j \neq b}^{L} \frac{P_i(s_j)}{L\,|b-j|}, \quad \forall i \in \{1..m\}$$
$$E_i = P_i(s_e) + \sum_{j=1,\, j \neq e}^{L} \frac{P_i(s_j)}{L\,|e-j|}, \quad \forall i \in \{1..m\}$$
Note that prediction vectors represent fragments of different lengths, but these lengths are not included in them. The physico-chemical properties included in the prediction vectors are explained in the next subsection. From the point of view of data mining, B_i and E_i are the attributes of training instances and [...] the class to predict.
2.3 Physico-chemical feature selection
To the aim of using the smallest and most effective set of [...]
Feature selection (FS): 30 properties from 544 of AAindex ( http://www.genome.jp/aaindex/ )
FS Algorithm: Relief evaluation algorithm + Ranker search algorithm
19. Protein Distance Map Prediction based on a Nearest Neighbors Approach
① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according
to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181.
Datasets and configuration
10-fold cross validation
Beta-carbon distances
No minimum sequence separation
Distance threshold (cut-off) of 8 angstroms
Five datasets:
1. 20 random proteins from PDB, identity ≤ 30%
2. 118 proteins from CullPDB, identity ≤ 10%
3. 170 proteins from PDBselect, identity ≤ 25%
4. 221 proteins from CullPDB, identity ≤ 5%
5. 5130 proteins from PDBselect, identity ≤ 25%
PDB (Protein Data Bank): http://www.rcsb.org
CullPDB: http://dunbrack.fccc.edu/PISCES.php
PDBselect: http://bioinfo.tg.fh-giessen.de/pdbselect/
http://www.upo.es/eps/asencio/asppred
20. Protein Distance Map Prediction based on a Nearest Neighbors Approach
① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according
to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181.
Results
Dataset  Recall       Precision    Accuracy     Specificity
...
3        0.48 ± 0.04  0.43 ± 0.05  0.99 ± 0.01  0.99 ± 0.01
4        0.40 ± 0.05  0.41 ± 0.05  0.99 ± 0.01  0.99 ± 0.01
5        0.14 ± 0.08  0.14 ± 0.08  0.99 ± 0.05  0.99 ± 0.05
Table 2: Efficiency of our method at a distance threshold of 4 Å (µ ± σ values).
Dataset  Recall       Precision    Accuracy     Specificity
1        0.39 ± 0.06  0.41 ± 0.08  0.97 ± 0.03  0.98 ± 0.01
2        0.39 ± 0.07  0.40 ± 0.07  0.95 ± 0.01  0.97 ± 0.02
3        0.38 ± 0.02  0.38 ± 0.02  0.95 ± 0.02  0.97 ± 0.01
4        0.40 ± 0.03  0.41 ± 0.03  0.95 ± 0.01  0.97 ± 0.01
5        0.51 ± 0.11  0.51 ± 0.11  0.92 ± 0.06  0.95 ± 0.07
Table 3: Efficiency of our method at a distance threshold of 8 Å (µ ± σ values).
For a threshold of 8 Å, recall and precision are basically the same in experiments 1 to 4. We found
that our predictor does not need many training proteins; it seems to find good similar fragments
even with small training sets. In experiment 5 we achieved better recall and precision than in the
other experiments; however, the standard deviations are higher. This may be due to the great number
of different types of proteins (structural classes or number of domains, for instance) in the whole PDBselect.
We included detailed information about each protein in the five experiments as supplemental
material at http://www.upo.es/eps/asencio/asppred. We indicate for each protein [...]
Journal of Integrative Bioinformatics 2011 http://journal.imbio.de/
Dataset  K   Recall       Precision    Accuracy     Specificity
4        1   0.40 ± 0.03  0.41 ± 0.03  0.95 ± 0.01  0.97 ± 0.01
         3   0.40 ± 0.04  0.72 ± 0.01  0.96 ± 0.01  0.97 ± 0.00
         5   0.39 ± 0.04  0.81 ± 0.01  0.97 ± 0.00  0.98 ± 0.00
         7   0.39 ± 0.04  0.84 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
         9   0.38 ± 0.04  0.86 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
         11  0.38 ± 0.04  0.87 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
         13  0.37 ± 0.04  0.88 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
         15  0.37 ± 0.04  0.88 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
Table 4: Study of the number (K) of nearest training profiles.
21. Protein Distance Map Prediction based on a Nearest Neighbors Approach
② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based
approach for viral protein structure prediction (EvoBio 2011)
Residue accessible surface area in folded protein (Chothia, 1976)
Average relative fractional occurrence in ER(i-1) (Rackovsky-Scheraga, 1982)
RF value in high salt chromatography (Weber-Lacey, 1978)
Profile
Profile 1: L, P_1, P_2, ..., P_m, D
$$P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j)$$
Profile 2: B_1, E_1, ..., B_m, E_m, D
$$B_i = P_i(s_b) + \sum_{j=1,\, j \neq b}^{L} \frac{P_i(s_j)}{L\,|b-j|}, \quad \forall i \in \{1..m\}$$
$$E_i = P_i(s_e) + \sum_{j=1,\, j \neq e}^{L} \frac{P_i(s_j)}{L\,|e-j|}, \quad \forall i \in \{1..m\}$$
Note that prediction vectors represent fragments of different lengths, but these lengths are not included in them. The physico-chemical properties included in the prediction vectors are explained in the next subsection. From the point of view of data mining, B_i and E_i are the attributes of training instances and [...] the class to predict.
2.3 Physico-chemical feature selection
To the aim of using the smallest and most effective set of [...]
Roberto Ruiz Sánchez, José Cristóbal Riquelme Santos, Jesús S. Aguilar-Ruiz. Incremental wrapper-based gene
selection from microarray data for cancer classification. Pattern Recognition 39(12): 2383-2392 (2006)
Feature selection (FS): 3 properties from 544 of AAindex
FS Algorithm: BIRS algorithm + CFS search algorithm
22. Protein Distance Map Prediction based on a Nearest Neighbors Approach
Datasets and configuration
Leave-one-out cross-validation (previously 10-fold CV)
Minimum sequence separation of 7 amino acids (previously not used)
Distance threshold (cut-off) of 8 angstroms
Beta-carbon distances
Training/test protein dataset:
Viral capsid proteins, from PDB, identity ≤ 30%, 63 proteins with maximum
length of 1284 amino acids.
② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based
approach for viral protein structure prediction (EvoBio 2011)
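The minimum sequence separation setting restricts which residue pairs are evaluated. A small sketch of that filter, assuming that a separation of 7 means |i − j| ≥ 7 (whether the bound is inclusive is not stated in the slides) and using a hypothetical function name:

```python
def pairs_with_min_separation(length, min_separation=7):
    """List the residue pairs (i, j), with 0-based i < j, that are at least
    `min_separation` positions apart along a sequence of `length` residues."""
    return [
        (i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
    ]


# In a 9-residue toy sequence only three pairs survive the filter.
print(pairs_with_min_separation(9))
```

Excluding near-diagonal pairs removes the contacts that are trivially implied by chain connectivity, so the reported precision and recall reflect harder, longer-range predictions than the earlier setting without a minimum separation.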
23. Protein Distance Map Prediction based on a Nearest Neighbors Approach
Results
② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based
approach for viral protein structure prediction (EvoBio 2011)
24. Protein Distance Map Prediction based on a Nearest Neighbors Approach
③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012)
Prediction of mitochondrial matrix protein structures based on feature
selection and fragment assembly. (EvoBio 2012) (accepted)
Profile
Table 1. The 16 physico-chemical properties of amino acids considered from AAindex
CHOC760104 Proportion of residues 100% buried
LEVM760104 Side chain torsion angle phi(AAAR)
MEIH800103 Average side chain orientation angle
PALJ810107 Normalized frequency of alpha-helix in all-alpha class
QIAN880112 Weights for alpha-helix at the window position of 5
WOLS870101 Principal property value z1
ONEK900101 Delta G values for the peptides extrapolated to 0 M urea
BLAM930101 Alpha helix propensity of position 44 in T4 lysozyme
PARS000101 p-Values of mesophilic proteins based on the distributions of B values
NADH010102 Hydropathy scale based on self-information values in the two-state
model (9% accessibility)
SUYM030101 Linker propensity index
WOLR790101 Hydrophobicity index
JACR890101 Weights from the IFH scale
MIYS990103 Optimized relative partition energies - method B
MIYS990104 Optimized relative partition energies - method C
MIYS990105 Optimized relative partition energies - method D
$$P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j) \qquad (2)$$
Profile 2: B_1, E_1, ..., B_m, E_m, D
$$B_i = P_i(s_b) + \sum_{j=1,\, j \neq b}^{L} \frac{P_i(s_j)}{L\,|b-j|}, \quad \forall i \in \{1..m\} \qquad (3)$$
$$E_i = P_i(s_e) + \sum_{j=1,\, j \neq e}^{L} \frac{P_i(s_j)}{L\,|e-j|}, \quad \forall i \in \{1..m\} \qquad (4)$$
Note that prediction vectors represent fragments of different lengths, but these lengths are not included in them. The physico-chemical properties included in the prediction vectors are explained in the next subsection. From the point of view of data mining, B_i and E_i are the attributes of [...] the class to predict.
2.3 Physico-chemical feature selection
To the aim of using the smallest and most effective [...] properties, we performed a feature selection from t[...] physico-chemical properties of amino acids. This rep[...] 544 amino acid properties.
We used BARS to perform the feature selection [...]
Roberto Ruiz, José C. Riquelme, and Jesús S. Aguilar-Ruiz. Best agglomerative
ranked subset for feature selection. Journal of Machine Learning Research,
Proceedings Track, 4:148-162, 2008.
Feature selection (FS): 16 properties from 544 of AAindex
FS Algorithm: BARS algorithm + CFS search algorithm
25. Protein Distance Map Prediction based on a Nearest Neighbors Approach
Datasets and configuration
Leave-one-out cross-validation
Minimum sequence separation of 7 amino acids
Distance threshold (cut-off) of 8 angstroms
Beta-carbon distances
Training/test protein dataset:
Mitochondrial matrix proteins, from PDB, identity ≤ 30%, 74 proteins with a
maximum length of 1094 amino acids.
③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012)
Prediction of mitochondrial matrix protein structures based on feature
selection and fragment assembly. (EvoBio 2012) (accepted)
26. Protein Distance Map Prediction based on a Nearest Neighbors Approach
Results
Table 3. Efficiency of our method predicting mitochondrial matrix proteins
Protein set          Recall  Precision  Accuracy  Specificity  MCC
All proteins (74)    0.80    0.79       0.97      0.97         0.82
L ≤ 300 (20)         0.77    0.76       0.98      0.98         0.75
300 < L ≤ 450 (27)   0.84    0.83       0.99      0.99         0.83
L > 450 (27)         0.77    0.76       0.95      0.95         0.82
[...] a cross validation, cut-off of 8 angstroms and minimum sequence separation of
7 amino acids, achieved a precision value of 0.11 for proteins of more than 300
amino acids.
(a) 1TG6 (277 amino acids) (b) 3BLX (349 amino acids) (c) Color scale
Fig. 2. Predicted distance maps for the mitochondrial matrix proteins 1TG6 (a) and
3BLX (b) with their color scale (c).
③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012)
Prediction of mitochondrial matrix protein structures based on feature
selection and fragment assembly. (EvoBio 2012) (accepted)
27. Protein Distance Map Prediction based on a Nearest Neighbors Approach
Results
③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012)
Prediction of mitochondrial matrix protein structures based on feature
selection and fragment assembly. (EvoBio 2012) (accepted)
Table 4. Comparison at 8 Å with RBFNN on the same benchmark
                     RBFNN                  PDMpred
PDB code (length)    Np     Nd     Ap       Np     Nd     Ap
1TTF (94)            376    1421   26.46    1307   1421   91.96
1E88 (160)           1006   3352   30.01    3075   3352   91.73
1NAR (290)           3346   10524  31.79    1797   10524  17.07
1BTJ B (337)         3796   14283  26.58    14026  14283  98.20
1J7E (458)           6589   25026  26.33    23407  25026  93.53
Average                            27.67                  78.49
Np: predicted numbers; Nd: desired numbers; Ap: prediction recall (%).
[...] is the count of the contacts predicted by the algorithm, and the desired number Nd
is the total number of contacts. The contact threshold was set at 8 Å.
In Table 4 we show the results of this experimentation. As we can see in [...]
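The recall column Ap of Table 4 is the ratio of predicted to desired contacts expressed as a percentage. A short sketch of that arithmetic on two rows of the table; the function name is illustrative, and the two averages are the values reported in the table, not recomputed here:

```python
def recall_percent(n_predicted, n_desired):
    """Ap = 100 * Np / Nd, the percentage of desired contacts that were predicted."""
    return 100.0 * n_predicted / n_desired


# (Np, Nd) for PDMpred, copied from two rows of Table 4.
pdmpred_rows = {"1E88": (3075, 3352), "1BTJ B": (14026, 14283)}
for code, (n_p, n_d) in pdmpred_rows.items():
    print(code, recall_percent(n_p, n_d))

# Difference between the two reported average recalls, in percentage points.
reported_average_rbfnn = 27.67
reported_average_pdmpred = 78.49
print(round(reported_average_pdmpred - reported_average_rbfnn, 2))
```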
28. 1. Motivation
2. Our proposal
3. Recent results
4. Conclusions and future work
29. Conclusions
• A new protein distance map predictor has been developed using a nearest
neighbors-based approach and feature selection over physico-chemical
properties.
• We predict distance maps, which provide more information than
contact maps and are very easy to convert into contact maps.
• We achieved up to 0.80 recall and 0.79 precision with a minimum
sequence separation of 7 amino acids on some non-homologous protein sets.
• Our results are a large improvement in recall (50.82 percentage points
better) over the results of a previous study (Zhang et al. 2005).
A nearest neighbour-based approach for viral protein structure prediction
Gualberto Asencio Cortés, Jesús S. Aguilar-Ruíz and Alfonso E. Márquez Chamorro
30. Current research
Five new measures implemented:
Recursive Convex Hull of amino acids (RCH)
Solvent Accessibility (SA)
Secondary Structure (SS) from PSI-PRED
Coordination Number (CN)
Position-Specific Scoring Matrix (PSSM) from PSI-BLAST
Using this protein set in the current experiments (from ICOS, named INF2010):
3262 proteins from PDB-REPRDB, identity ≤ 30%. Using 90% of the set for
training and 10% for test.
31. Current research
Distance map post-processing
New feasibility measures
Based on the geometry of the predicted distance maps
Using triangle inequalities
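One way a triangle-inequality feasibility check on a predicted distance map could look is sketched below. This is an assumption-laden illustration, not the feasibility measure under development: the function name, the tolerance and the toy maps are hypothetical.

```python
def triangle_violations(dmap, tolerance=1e-9):
    """Return the triples (i, j, k) for which the predicted distance d(i, j)
    exceeds d(i, k) + d(k, j), which no set of 3D points can satisfy."""
    n = len(dmap)
    violations = []
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                if dmap[i][j] > dmap[i][k] + dmap[k][j] + tolerance:
                    violations.append((i, j, k))
    return violations


# Hypothetical predicted maps: the first is infeasible because 10 > 3 + 4.
infeasible_map = [[0, 3, 10], [3, 0, 4], [10, 4, 0]]
feasible_map = [[0, 3, 6], [3, 0, 4], [6, 4, 0]]
print(triangle_violations(infeasible_map))
print(triangle_violations(feasible_map))
```

The number of violating triples could serve as a simple feasibility score, and the offending entries are natural candidates for correction during post-processing.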
32. Future work
Perform feature selections over RCH, SA, SS, CN and PSSM with different
statistics and window sizes.
Build a Top L/x ranking (as in CASP) of predicted contacts (from predicted
distances) using the standard deviation of the distances of the nearest neighbors
(profiles).
Use 24 amino acids as minimum sequence separation and CASP9 target
domains in the free modelling category.
Divide training profiles into bags according to the evolutionary information (PSSM) of
amino acid pairs. Then predict distances using only the appropriate bag according
to the test fragment.
Use protein domains for training instead of whole sequences.
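The planned Top L/x ranking could be sketched as follows. Everything here is an assumption made for illustration: the function name, the data layout, the use of the population standard deviation, and the idea that a smaller spread among neighbor distances means a more confident contact.

```python
import statistics


def top_l_over_x(neighbor_distances, length, x=5, threshold=8.0):
    """Rank predicted contacts by the spread of their neighbors' distances.

    `neighbor_distances` maps a residue pair (i, j) to the distances of its
    nearest training profiles. A pair is a predicted contact when the mean
    distance is at most `threshold`; contacts with a smaller standard
    deviation are ranked first, and the best length // x are returned."""
    candidates = []
    for pair, distances in neighbor_distances.items():
        if statistics.mean(distances) <= threshold:
            candidates.append((statistics.pstdev(distances), pair))
    candidates.sort()
    return [pair for _, pair in candidates[: max(1, length // x)]]


# Hypothetical neighbor distances (angstroms) for four residue pairs.
toy_neighbors = {
    (0, 8): [6.0, 6.2, 5.8],
    (1, 9): [7.0, 4.0, 9.0],
    (2, 10): [12.0, 12.5, 11.5],
    (3, 11): [7.5, 7.4, 7.6],
}
print(top_l_over_x(toy_neighbors, length=10, x=5))
```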
33. Thank you
Acknowledgements:
Bioinformatics group in Pablo de Olavide University
ICOS group in the University of Nottingham
Contact:
guaasecor@upo.es
Editor's Notes
- Proteins are macromolecules composed of one or more chains of amino acids. There are 20 natural amino acids.
PSP is very useful for determining protein functions, because a protein's function is determined by its structure.
It is
CASP is a very important biennial competition among protein structure predictors.
Our goal is to compete in CASP, using the same evaluation scheme, and to try to exceed this 22%.