Gualberto Asencio Cortés
Supervisor: Jesús S. Aguilar Ruíz
Bioinformatics Group
School of Engineering
Pablo de Olavide University, Seville, Spain
Host: Jaume Bacardit
Protein Distance Map Prediction based on a Nearest Neighbors Approach
Current state of research
February 9, 2012
1. Motivation
2. Our proposal
3. Recent results
4. Conclusions and future work
1. Motivation
Proteins and amino acids
Protein structure representations
Three representations of the same protein, 1M3Y (4 chains of 413 amino acids): 3D model, distance map, and contact map (threshold = 8 Å).
Protein Structure Prediction (PSP)
[Diagram: a predictor is trained on known protein structures and then applied to a new sequence.]
Why is PSP important?
• Knowing protein functions
• Drug design for diseases such as cancer or Alzheimer's
• Protein docking and virtual screening
• Protein engineering
• …
Why another contact/distance map predictor?
• Currently a hot topic in bioinformatics journals
• The best contact-prediction precision in the last CASP (CASP9) was only about 22%, which clearly must be improved
▫ CASP competition: http://predictioncenter.org/ (CASP10 takes place in 2012)
Why distance maps?
• Why a threshold? Why 8 angstroms?
• Distance maps store more information
• Conversion to contact maps is very easy
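The conversion mentioned in the last bullet can be sketched in a few lines. This is a minimal illustration, not the PDMpred code: the function names, the toy coordinates and the choice of "less than or equal to" at the threshold are all assumptions made for the example.

```python
import math


def distance_map(coords):
    """Return the symmetric matrix of pairwise Euclidean distances (angstroms)."""
    n = len(coords)
    dmap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dmap[i][j] = dmap[j][i] = math.dist(coords[i], coords[j])
    return dmap


def contact_map(dmap, threshold=8.0):
    """Binarize a distance map: 1 where the distance is within the threshold."""
    return [[1 if d <= threshold else 0 for d in row] for row in dmap]


# Toy coordinates for three residues, invented for illustration only.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
dmap = distance_map(coords)
cmap = contact_map(dmap)  # residues 0 and 1 are in contact; residue 2 is not
```

Going the other way (from a contact map back to distances) is not possible, which is the sense in which a distance map stores more information.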
2. Our proposal
PDMpred: Prediction process
[Diagram: a training set of protein structures is converted into distance maps. All training fragments (amino acid pairs such as A-A, A-R and V-V, each with its distance d) are extracted from the training data and encoded as training profiles.]
Each residue of a sequence takes its value from the amino acid alphabet:
\[ s_i \in \{A, R, N, D, B, C, Q, E, Z, G, H, I, L, K, M, F, P, S, T, W, Y, V\} \]
1 Introduction
The Protein Structure Prediction (PSP) problem consists in determining the three-dimensional model of a protein using only the information contained in its amino acid sequence. The PSP problem is one of the most important open problems in computational biology [53], because the 3D structure of a protein determines its function. It follows that knowing the 3D structure of a protein would be of enormous help in designing new drugs for diseases such as cancer or Alzheimer's. Experimental methods for determining protein structures exist, e.g., X-ray crystallography and nuclear magnetic resonance.
Profile 1 (Fig. 1): P_1 P_2 ... P_m D (one variant on the slides adds a leading attribute L: L P_1 P_2 ... P_m D)

\[ P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j) \]

Profile 2 (Fig. 2): B_1 E_1 ... B_m E_m D

\[ B_i = P_i(s_b) + \sum_{\substack{j=1 \\ j \neq b}}^{L} \frac{P_i(s_j)}{L\,|b-j|}, \quad \forall i \in \{1..m\} \]

\[ E_i = P_i(s_e) + \sum_{\substack{j=1 \\ j \neq e}}^{L} \frac{P_i(s_j)}{L\,|e-j|}, \quad \forall i \in \{1..m\} \]

Note that prediction vectors represent fragments of different lengths, but these lengths are not included in them. The physico-chemical properties included in the prediction vectors are explained in the next subsection. From the point of view of data mining, B_i and E_i are the attributes of the training instances and D is the class to predict.
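As a rough illustration of the P_i, B_i and E_i formulas, the sketch below computes them for a toy sequence. It reads b and e as the 1-based positions of the two ends of a fragment and L as the sequence length; that reading, the function names and the property values are assumptions made for the example, not taken from the PDMpred implementation.

```python
def interior_mean(seq, b, e, prop):
    """P_i: mean property value over positions b+1 .. e-1 (1-based)."""
    inner = seq[b:e - 1]  # 0-based slice covering 1-based positions b+1 .. e-1
    if not inner:
        raise ValueError("fragment has no interior residues")
    return sum(prop[s] for s in inner) / len(inner)


def end_weighted(seq, pos, prop):
    """B_i or E_i: value at `pos` plus every other residue weighted by 1 / (L * |pos - j|)."""
    L = len(seq)
    total = prop[seq[pos - 1]]
    for j in range(1, L + 1):
        if j != pos:
            total += prop[seq[j - 1]] / (L * abs(pos - j))
    return total


def profile2(seq, b, e, props, distance):
    """Profile 2 layout: B_1 E_1 ... B_m E_m D."""
    row = []
    for prop in props:
        row.append(end_weighted(seq, b, prop))
        row.append(end_weighted(seq, e, prop))
    row.append(distance)
    return row


# Hypothetical property scale and distance, invented for illustration only.
toy_prop = {"A": 1.0, "R": 2.0, "N": 3.0, "D": 4.0}
seq = "ARNDA"
p = interior_mean(seq, 1, 5, toy_prop)
row = profile2(seq, 1, 5, [toy_prop], distance=9.5)
```

With m properties the Profile 2 row has 2m attributes plus the distance, regardless of how long the fragment is.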
[Diagram, prediction step: the test set of protein sequences is split into all test fragments (amino acid pairs such as A-A and V-V), whose distances are unknown (?), and each fragment is encoded as a test profile. For each test prediction vector (t1 ... tn, ?), a neighbor search is run over the training profiles (a1 ... an, Da; b1 ... bn, Db; ...). The diagram labels the step that fills in the predicted distance maps as "average".]
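The neighbor search can be sketched as a plain k-nearest-neighbors regression. The slides do not state which metric PDMpred uses, so the Euclidean distance between prediction vectors, the function name and the toy training profiles below are assumptions.

```python
import math


def predict_distance(test_vec, training, k=1):
    """Average the distances D of the k training profiles nearest to test_vec.

    `training` is a list of (attribute_vector, D) pairs.
    """
    ranked = sorted(training, key=lambda item: math.dist(test_vec, item[0]))
    nearest = ranked[:k]
    return sum(d for _, d in nearest) / len(nearest)


# Hypothetical training profiles: (attributes, known distance in angstroms).
training = [([0.0, 0.0], 4.0), ([1.0, 1.0], 6.0), ([5.0, 5.0], 20.0)]
test_vec = [0.9, 0.8]

d1 = predict_distance(test_vec, training, k=1)  # only the closest profile
d2 = predict_distance(test_vec, training, k=2)  # mean of the two closest
```

A linear scan like this is enough to show the idea; a large training set would normally call for an index structure such as a k-d tree.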
PDMpred: Evaluation measures
... The second was a measure of recall, which has been used in other protein prediction methods [14]. Finally, we obtained measures of accuracy, specificity and the Matthews Correlation Coefficient (MCC), which may often provide a much more balanced evaluation of the prediction than, for instance, the percentages [15]. Formulas (2) to (6) define these five measures.

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad (2) \]

\[ \text{Recall} = \frac{TP}{TP + FN} \qquad (3) \]

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (4) \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \qquad (5) \]

\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (6) \]
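The five measures translate directly into code. The sketch below assumes that contacts are the positive class and that no denominator is zero; the function name and the confusion-matrix counts are invented for illustration.

```python
import math


def evaluation_measures(tp, fp, fn, tn):
    """Return precision, recall, accuracy, specificity and MCC from confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical counts for one predicted contact map.
scores = evaluation_measures(tp=40, fp=10, fn=10, tn=940)
```

Because non-contacts vastly outnumber contacts, accuracy and specificity stay high in this toy case even though a fifth of the contacts are missed, which is why precision, recall and MCC are reported alongside them.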
3. Recent results
Recent results
① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according
to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181.
http://www.upo.es/eps/asencio/asppred
② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based
approach for viral protein structure prediction. In: 9th European Conference
on Evolutionary Computation, Machine Learning and Data Mining in
Bioinformatics (EvoBio 2011), Torino, Italy. Lecture Notes in Computer Science
6623, p. 69-76, Springer 2011, ISBN 978-3-642-20388-6.
③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012)
Prediction of mitochondrial matrix protein structures based on feature
selection and fragment assembly. In: 10th European Conference on
Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics
(EvoBio 2012), Málaga, Spain (accepted).
① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according
to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181.
BUNA790101 alpha-NH chemical shifts (Bundi-Wuthrich, 1979)
BUNA790103 Spin-spin coupling constants 3JHalpha-NH (Bundi-Wuthrich, 1979)
CHAM820102 Free energy of solution in water, kcal/mole (Charton-Charton, 1982)
FAUJ880111 Positive charge (Fauchere et al., 1988)
FAUJ880112 Negative charge (Fauchere et al., 1988)
GARJ730101 Partition coefficient (Garel et al., 1973)
JOND750102 pK (-COOH) (Jones, 1975)
KARP850103 Flexibility parameter for two rigid neighbors (Karplus-Schulz, 1985)
KHAG800101 The Kerr-constant increments (Khanarian-Moore, 1980)
MAXF760103 Normalized frequency of zeta R (Maxfield-Scheraga, 1976)
PRAM820101 Intercept in regression analysis (Prabhakaran-Ponnuswamy, 1982)
QIAN880139 Weights for coil at the window position of 6 (Qian-Sejnowski, 1988)
RICJ880101 Relative preference value at N" (Richardson-Richardson, 1988)
RICJ880104 Relative preference value at N1 (Richardson-Richardson, 1988)
RICJ880114 Relative preference value at C1 (Richardson-Richardson, 1988)
RICJ880117 Relative preference value at C" (Richardson-Richardson, 1988)
SUEM840102 Zimm-Bragg parameter sigma x 1.0E4 (Sueki et al., 1984)
TANS770102 Normalized frequency of isolated helix (Tanaka-Scheraga, 1977)
TANS770108 Normalized frequency of zeta R (Tanaka-Scheraga, 1977)
VASM830101 Relative population of conformational state A (Vasquez et al., 1983)
VELV850101 Electron-ion interaction potential (Veljkovic et al., 1985)
WERD780102 Free energy change of epsilon(i) to epsilon(ex) (Wertz-Scheraga, 1978)
WERD780103 Free energy change of alpha(Ri)to alpha(Rh)(Wertz-Scheraga, 1978)
YUTK870103 Activation Gibbs energy of unfolding, pH7.0 (Yutani et al., 1987)
AURR980120 Normalized positional residue frequency at helix termini C4' (Aurora-Rose,
NADH010107 Hydropathy scale based on self-information values in the two-state model
MONM990201 Averaged turn propensities in a transmembrane helix (Monne et al., 1999)
MITS020101 Amphiphilicity index (Mitaku et al., 2002)
WILM950104 Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/2-PrOH/MeCN/H2O
DIGM050101 Hydrostatic pressure asymmetry index, PAI (Di Giulio, 2005)
Feature selection (FS): 30 properties selected from the 544 in AAindex (http://www.genome.jp/aaindex/)
FS algorithm: Relief evaluation algorithm + Ranker search algorithm
Datasets and configuration
• 10-fold cross validation
• Beta-carbon distances
• No minimum sequence separation
• Distance threshold (cut-off) of 8 angstroms
• Five datasets:
1. 20 random proteins from PDB, identity ≤ 30%
2. 118 proteins from CullPDB, identity ≤ 10%
3. 170 proteins from PDBselect, identity ≤ 25%
4. 221 proteins from CullPDB, identity ≤ 5%
5. 5130 proteins from PDBselect, identity ≤ 25%
• PDB (Protein Data Bank): http://www.rcsb.org
• CullPDB: http://dunbrack.fccc.edu/PISCES.php
• PDBselect: http://bioinfo.tg.fh-giessen.de/pdbselect/
http://www.upo.es/eps/asencio/asppred
Results

Dataset  Recall       Precision    Accuracy     Specificity
3        0.48 ± 0.04  0.43 ± 0.05  0.99 ± 0.01  0.99 ± 0.01
4        0.40 ± 0.05  0.41 ± 0.05  0.99 ± 0.01  0.99 ± 0.01
5        0.14 ± 0.08  0.14 ± 0.08  0.99 ± 0.05  0.99 ± 0.05
Table 2: Efficiency of our method at a distance threshold of 4 Å (µ ± σ values). The rows for datasets 1 and 2 are cut off on the slide.

Dataset  Recall       Precision    Accuracy     Specificity
1        0.39 ± 0.06  0.41 ± 0.08  0.97 ± 0.03  0.98 ± 0.01
2        0.39 ± 0.07  0.40 ± 0.07  0.95 ± 0.01  0.97 ± 0.02
3        0.38 ± 0.02  0.38 ± 0.02  0.95 ± 0.02  0.97 ± 0.01
4        0.40 ± 0.03  0.41 ± 0.03  0.95 ± 0.01  0.97 ± 0.01
5        0.51 ± 0.11  0.51 ± 0.11  0.92 ± 0.06  0.95 ± 0.07
Table 3: Efficiency of our method at a distance threshold of 8 Å (µ ± σ values).

At a threshold of 8 Å, recall and precision are essentially the same in experiments 1 to 4. We found that our predictor does not need many training proteins: it seems to find sufficiently similar fragments even in small training sets. In experiment 5 we achieved better recall and precision than in the other experiments; however, the standard deviations are higher. This may be due to the large number of different types of proteins (structural classes or number of domains, for instance) in the whole of PDBselect.
We included detailed information about each protein in the five experiments as supplemental material at http://www.upo.es/eps/asencio/asppred. We indicate for each protein ...
Journal of Integrative Bioinformatics 2011 http://journal.imbio.de/

Dataset  K   Recall       Precision    Accuracy     Specificity
4        1   0.40 ± 0.03  0.41 ± 0.03  0.95 ± 0.01  0.97 ± 0.01
         3   0.40 ± 0.04  0.72 ± 0.01  0.96 ± 0.01  0.97 ± 0.00
         5   0.39 ± 0.04  0.81 ± 0.01  0.97 ± 0.00  0.98 ± 0.00
         7   0.39 ± 0.04  0.84 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
         9   0.38 ± 0.04  0.86 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
         11  0.38 ± 0.04  0.87 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
         13  0.37 ± 0.04  0.88 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
         15  0.37 ± 0.04  0.88 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
Table 4: Study of the number (K) of nearest training profiles.
② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based
approach for viral protein structure prediction (EvoBio 2011)
Residue accessible surface area in folded protein (Chothia, 1976)
Average relative fractional occurrence in ER(i-1) (Rackovsky-Scheraga, 1982)
RF value in high salt chromatography (Weber-Lacey, 1978)
Roberto Ruiz Sánchez, José Cristóbal Riquelme Santos, Jesús S. Aguilar-Ruiz. Incremental wrapper-based gene
selection from microarray data for cancer classification. Pattern Recognition 39(12): 2383-2392 (2006)
Feature selection (FS): 3 properties selected from the 544 in AAindex
FS algorithm: BIRS algorithm + CFS search algorithm
Datasets and configuration
• Leave-one-out cross validation (previously 10-fold CV)
• Minimum sequence separation of 7 amino acids (previously not used)
• Distance threshold (cut-off) of 8 angstroms
• Beta-carbon distances
• Training/test protein dataset: viral capsid proteins from PDB, identity ≤ 30%, 63 proteins with a maximum length of 1284 amino acids.
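A minimum sequence separation restricts which residue pairs are evaluated. The sketch below assumes that a separation of 7 means |i - j| >= 7; whether the original work used ">=" or ">" is not stated on the slide, and the function name is hypothetical.

```python
def evaluated_pairs(n_residues, min_separation=7):
    """Yield 0-based residue pairs (i, j), i < j, at least `min_separation` apart in sequence."""
    for i in range(n_residues):
        for j in range(i + min_separation, n_residues):
            yield (i, j)


# For a toy chain of 10 residues only the six longest-range pairs remain.
pairs = list(evaluated_pairs(10))
```

Dropping the short-range pairs removes the contacts that follow trivially from chain connectivity, so the reported measures reflect only the harder, longer-range predictions.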
③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012)
Prediction of mitochondrial matrix protein structures based on feature
selection and fragment assembly. (EvoBio 2012) (accepted)
Table 1. The 16 physico-chemical properties of amino acids considered from AAindex
CHOC760104 Proportion of residues 100% buried
LEVM760104 Side chain torsion angle phi(AAAR)
MEIH800103 Average side chain orientation angle
PALJ810107 Normalized frequency of alpha-helix in all-alpha class
QIAN880112 Weights for alpha-helix at the window position of 5
WOLS870101 Principal property value z1
ONEK900101 Delta G values for the peptides extrapolated to 0 M urea
BLAM930101 Alpha helix propensity of position 44 in T4 lysozyme
PARS000101 p-Values of mesophilic proteins based on the distributions of B values
NADH010102 Hydropathy scale based on self-information values in the two-state model (9% accessibility)
SUYM030101 Linker propensity index
WOLR790101 Hydrophobicity index
JACR890101 Weights from the IFH scale
MIYS990103 Optimized relative partition energies - method B
MIYS990104 Optimized relative partition energies - method C
MIYS990105 Optimized relative partition energies - method D
2.3 Physico-chemical feature selection
With the aim of using the smallest and most effective set of properties, we performed a feature selection over the physico-chemical properties of amino acids in AAindex, which contains 544 amino acid properties. We used BARS to perform the feature selection.
Roberto Ruiz, José C. Riquelme, and Jesús S. Aguilar-Ruiz. Best agglomerative
ranked subset for feature selection. Journal of Machine Learning Research –
Proceedings Track, 4:148–162, 2008.
Feature selection (FS): 16 properties selected from the 544 in AAindex
FS algorithm: BARS algorithm + CFS search algorithm
Datasets and configuration
• Leave-one-out cross validation
• Minimum sequence separation of 7 amino acids
• Distance threshold (cut-off) of 8 angstroms
• Beta-carbon distances
• Training/test protein dataset: mitochondrial matrix proteins from PDB, identity ≤ 30%, 74 proteins with a maximum length of 1094 amino acids.
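Leave-one-out cross validation over a protein set can be sketched as follows. The assumption here is that each protein is held out in turn and the remaining ones supply the training profiles; the generator name and the protein identifiers are placeholders.

```python
def leave_one_out(proteins):
    """Yield (training_proteins, held_out_protein) for every protein in turn."""
    for idx, held_out in enumerate(proteins):
        training = proteins[:idx] + proteins[idx + 1:]
        yield training, held_out


# Placeholder identifiers, not real PDB codes from the dataset.
proteins = ["P1", "P2", "P3"]
splits = list(leave_one_out(proteins))
```

With 74 proteins this gives 74 train/test rounds, each training on the other 73.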
Results
Table 3. Efficiency of our method predicting mitochondrial matrix proteins
Protein set          Recall  Precision  Accuracy  Specificity  MCC
All proteins (74)    0.80    0.79       0.97      0.97         0.82
L ≤ 300 (20)         0.77    0.76       0.98      0.98         0.75
300 < L ≤ 450 (27)   0.84    0.83       0.99      0.99         0.83
L > 450 (27)         0.77    0.76       0.95      0.95         0.82

... a cross validation, cut-off of 8 angstroms and minimum sequence separation of 7 amino acids, achieved a precision value of 0.11 for proteins of more than 300 amino acids.

Fig. 2. Predicted distance maps for the mitochondrial matrix proteins 1TG6 (a, 277 amino acids) and 3BLX (b, 349 amino acids) with their color scale (c).
Table 4. Comparison at 8 Å with RBFNN on the same benchmark
                     RBFNN                  PDMpred
PDB code (length)    Np     Nd     Ap       Np     Nd     Ap
1TTF (94)            376    1421   26.46    1307   1421   91.96
1E88 (160)           1006   3352   30.01    3075   3352   91.73
1NAR (290)           3346   10524  31.79    1797   10524  17.07
1BTJ B (337)         3796   14283  26.58    14026  14283  98.20
1J7E (458)           6589   25026  26.33    23407  25026  93.53
Average                            27.67                  78.49
Np: predicted numbers; Nd: desired numbers; Ap: prediction recall (%).

The predicted number Np is the count of the contacts predicted by the algorithm, and the desired number Nd is the total number of contacts. The contact threshold was set at 8 Å.
Table 4 shows the results of this experiment.
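The prediction recall Ap appears to be Np expressed as a percentage of Nd. The sketch below checks that reading against two rows of the table; the function name is hypothetical and the formula is inferred from the column legend rather than stated in the source.

```python
def prediction_recall(n_predicted, n_desired):
    """Ap: predicted contacts as a percentage of the desired contacts."""
    return 100.0 * n_predicted / n_desired


rbfnn_1ttf = round(prediction_recall(376, 1421), 2)       # RBFNN row for 1TTF
pdmpred_1btj = round(prediction_recall(14026, 14283), 2)  # PDMpred row for 1BTJ B

# Gap between the two reported averages, in percentage points.
average_gap = round(78.49 - 27.67, 2)
```

The gap between the reported averages, 50.82 percentage points, matches the improvement quoted in the conclusions.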
4. Conclusions and future work
Conclusions
• A new protein distance map predictor has been developed using a nearest
neighbors-based approach and feature selection over physico-chemical
properties.
• We predict distance maps, which provide more information than
contact maps and are very easy to convert into contact maps.
• We achieved up to 0.80 recall and 0.79 precision with a minimum
sequence separation of 7 amino acids on some non-homologous protein sets.
• Our results are a large improvement in recall (50.82 percentage points better)
over the results of a previous study (Zhang et al. 2005).
Current research
• Five new measures implemented:
▫ Recursive Convex Hull of amino acids (RCH)
▫ Solvent Accessibility (SA)
▫ Secondary Structure (SS) from PSI-PRED
▫ Coordination Number (CN)
▫ Position-Specific Scoring Matrix (PSSM) from PSI-BLAST
• Protein set used in the current experiments (from ICOS, named INF2010):
▫ 3262 proteins from PDB-REPRDB, identity ≤ 30%, using 90% of the set for training and 10% for testing.
Current research
• Distance map post-processing
• New feasibility measures
▫ Based on the geometry of the predicted distance maps
▫ Using triangle inequalities
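One way a triangle-inequality feasibility check on a predicted distance map could look is sketched below. The slides do not describe the actual measure, so the function name, the tolerance and the toy maps are assumptions: any three points in space must satisfy d(i, j) <= d(i, k) + d(k, j), and the sketch simply lists the triples that break this.

```python
def triangle_violations(dmap, tol=1e-9):
    """Return triples (i, j, k) where d(i, j) exceeds d(i, k) + d(k, j)."""
    n = len(dmap)
    violations = []
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k != i and k != j and dmap[i][j] > dmap[i][k] + dmap[k][j] + tol:
                    violations.append((i, j, k))
    return violations


# Toy predicted maps (angstroms): the first cannot come from real 3D points.
infeasible = [[0, 3, 10], [3, 0, 4], [10, 4, 0]]
feasible = [[0, 3, 5], [3, 0, 4], [5, 4, 0]]
bad = triangle_violations(infeasible)
ok = triangle_violations(feasible)
```

The number of violating triples could serve as a simple feasibility score, and the triples themselves point at the entries a post-processing step would need to adjust.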
Future work
• Perform feature selection over RCH, SA, SS, CN and PSSM with different
statistics and window sizes.
• Build a Top L/x ranking (as in CASP) of predicted contacts (from predicted
distances) using the standard deviation of the distances of the nearest neighbors
(profiles).
• Use 24 amino acids as the minimum sequence separation and CASP9 target
domains in the free modelling category.
• Divide training profiles into bags according to the evolutionary information (PSSM) of
amino acid pairs. Then predict distances using only the appropriate bag according
to the test fragment.
• Use protein domains for training instead of whole sequences.
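The Top L/x ranking is listed as future work, so the following is only one possible reading of it: keep the pairs whose predicted distance falls within the contact threshold, rank them by the standard deviation of their neighbors' distances (lower spread first), and keep the first L/x. The function name, the data layout and the numbers are all hypothetical.

```python
def top_l_over_x(predictions, seq_len, x=5, threshold=8.0):
    """Return the L/x predicted contacts with the smallest neighbor-distance spread.

    `predictions` maps a residue pair to (predicted_distance, neighbor_std).
    """
    contacts = [
        (std, pair)
        for pair, (dist, std) in predictions.items()
        if dist <= threshold
    ]
    contacts.sort()  # smallest standard deviation first
    return [pair for _, pair in contacts[: seq_len // x]]


# Invented predictions for a toy sequence of length 20.
predictions = {
    (0, 10): (6.0, 0.5),
    (2, 15): (7.5, 0.1),
    (3, 19): (12.0, 0.05),  # not a contact at 8 angstroms, so never ranked
    (1, 12): (5.0, 1.2),
}
ranked = top_l_over_x(predictions, seq_len=20, x=10)
```

Here L/x = 2, so the two contacts with the most consistent neighbors are returned.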
Thank you
Acknowledgements:
Bioinformatics group in Pablo de Olavide University
ICOS group in the University of Nottingham
Contact:
guaasecor@upo.es
33 / 33

More Related Content

Similar to Protein Distance Map Prediction using Nearest Neighbors

Trabajo de ingles (5)
Trabajo de ingles (5)Trabajo de ingles (5)
Trabajo de ingles (5)sasmaripo
 
DEseq, voom and vst
DEseq, voom and vstDEseq, voom and vst
DEseq, voom and vstQiang Kou
 
An automatic test data generation for data flow
An automatic test data generation for data flowAn automatic test data generation for data flow
An automatic test data generation for data flowWafaQKhan
 
Sandhya Prabhakaran - A Bayesian Approach To Model Overlapping Objects Availa...
Sandhya Prabhakaran - A Bayesian Approach To Model Overlapping Objects Availa...Sandhya Prabhakaran - A Bayesian Approach To Model Overlapping Objects Availa...
Sandhya Prabhakaran - A Bayesian Approach To Model Overlapping Objects Availa...MLconf
 
Automated Alphabet Reduction Method with Evolutionary Algorithms for Protein ...
Automated Alphabet Reduction Method with Evolutionary Algorithms for Protein ...Automated Alphabet Reduction Method with Evolutionary Algorithms for Protein ...
Automated Alphabet Reduction Method with Evolutionary Algorithms for Protein ...jaumebp
 
Reconstruction and Clustering with Graph optimization and Priors on Gene netw...
Reconstruction and Clustering with Graph optimization and Priors on Gene netw...Reconstruction and Clustering with Graph optimization and Priors on Gene netw...
Reconstruction and Clustering with Graph optimization and Priors on Gene netw...Laurent Duval
 
Supervised-Learning Link Recommendation in the DBLP co-authoring network
Supervised-Learning Link Recommendation in the DBLP co-authoring networkSupervised-Learning Link Recommendation in the DBLP co-authoring network
Supervised-Learning Link Recommendation in the DBLP co-authoring networkUniversidade de São Paulo
 
Thesis seminar
Thesis seminarThesis seminar
Thesis seminargvesom
 
Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...eSAT Journals
 

Protein Distance Map Prediction using Nearest Neighbors

  • 1. Gualberto Asencio Cortés. Supervisor: Jesús S. Aguilar Ruíz. Bioinformatics Group, School of Engineering, Pablo de Olavide University, Seville, Spain. Host: Jaume Bacardit. Protein Distance Map Prediction based on a Nearest Neighbors Approach: Current state of research. February 9, 2012
  • 2. 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
  • 3. 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
  • 4. Proteins and amino acids
  • 5. Protein structure representations: 3D model, distance map, contact map (threshold = 8 Å). Example: 1M3Y (4 chains of 413 amino acids)
  • 6. Protein Structure Prediction (PSP) [Figure: protein structures, training, new sequence]
  • 7. Why is PSP important?
       • Knowing protein functions
       • Drug design for diseases such as cancer or Alzheimer's
       • Protein docking and virtual screening
       • Protein engineering
       • …
  • 8. Why another contact/distance map predictor?
       • Currently a hot topic in bioinformatics journals
       • Current results reach at most 22% precision for contact prediction in the latest CASP (CASP9), and clearly must be improved
         ▫ CASP competition: CASP10 in 2012! http://predictioncenter.org/
  • 9. Why distance maps?
       • Why a threshold? Why 8 angstroms?
       • Distance maps store more information
       • Conversion to contact maps is very easy
  • 10. 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
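
The claim that conversion to contact maps is very easy can be illustrated with a minimal sketch. It assumes a distance map stored as a square list of lists in angstroms and treats two residues as in contact when their distance is at most the threshold; the function name `to_contact_map` and the toy values are hypothetical, and whether the comparison is strict is an assumption.

```python
def to_contact_map(distance_map, threshold=8.0):
    """Binarize a distance map: 1 if the pair is within the threshold, else 0.

    Assumption: a contact is defined as distance <= threshold (angstroms).
    """
    return [[1 if d <= threshold else 0 for d in row] for row in distance_map]


# Toy 3-residue distance map (hypothetical values, in angstroms).
toy_distances = [
    [0.0, 3.8, 9.5],
    [3.8, 0.0, 7.9],
    [9.5, 7.9, 0.0],
]
toy_contacts = to_contact_map(toy_distances)
```

The reverse conversion is not possible, which is one way to read the slide's point that distance maps store more information.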
  • 11. PDMpred: Prediction process [Figure: training set of protein structures, training data, test data, distance maps (d), all training fragments, training profiles (A-A, A-R, V-V)]
       Amino acid alphabet: $s_i \in \{A, R, N, D, B, C, Q, E, Z, G, H, I, L, K, M, F, P, S, T, W, Y, V\}$
       Paper excerpt shown on the slide (1 Introduction): The Protein Structure Prediction (PSP) problem consists in determining the three-dimensional model of a protein, using only information contained in the amino acid sequence of the protein. The PSP problem is one of the most important open problems in computational biology [53]. This is because the 3D structure determines the protein function. It follows that knowing the 3D structure of a protein would be of enormous help for designing new drugs for diseases such as cancer or Alzheimer's. Although there exist experimental methods for determining protein structures, e.g., X-ray crystallography and nuclear magnetic resonance …
  • 12. PDMpred: Prediction process [Figure: training set of protein structures, training data, test data, distance maps (d), all training fragments, training profiles (A-A, A-R, V-V)]
       Profile 1 (Fig. 1): P1 P2 ... Pm D (one variant on the slide also lists L first: L P1 P2 ... Pm D), where
         $P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j)$
       Profile 2 (Fig. 2): B1 E1 ... Bm Em D, where
         $B_i = P_i(s_b) + \sum_{j=1,\, j \neq b}^{L} \frac{P_i(s_j)}{L^{|b-j|}}, \quad \forall i \in \{1..m\}$
         $E_i = P_i(s_e) + \sum_{j=1,\, j \neq e}^{L} \frac{P_i(s_j)}{L^{|e-j|}}, \quad \forall i \in \{1..m\}$
       Note that prediction vectors represent fragments of different lengths, but these lengths are not included in them. The physico-chemical properties included in the prediction vectors are explained in the next subsection. From the point of view of data mining, $B_i$ and $E_i$ are the attributes of training instances and D is the class to predict.
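
The two profile definitions on this slide can be sketched as follows. This is only an illustration under stated assumptions: b and e are taken to be the 0-based positions of the first and last residue of a fragment, L the sequence length, and each property a dictionary mapping an amino acid letter to a number. The function names and the toy property values are hypothetical and are not taken from AAindex.

```python
def profile1(props, seq, b, e):
    """P_i: mean of property i over the residues strictly between b and e."""
    inner = seq[b + 1:e]  # e - b - 1 residues
    if not inner:
        # Assumption: adjacent residues have no inner part; return zeros.
        return [0.0 for _ in props]
    return [sum(p[aa] for aa in inner) / len(inner) for p in props]


def profile2(props, seq, b, e):
    """B_i and E_i: property at each fragment end plus contributions of all
    other residues, damped by L ** |position difference|."""
    L = len(seq)
    vector = []
    for p in props:
        B = p[seq[b]] + sum(p[seq[j]] / L ** abs(b - j) for j in range(L) if j != b)
        E = p[seq[e]] + sum(p[seq[j]] / L ** abs(e - j) for j in range(L) if j != e)
        vector += [B, E]
    return vector


# Toy example: one hypothetical property over a 3-residue sequence.
toy_property = {"A": 1.0, "R": 2.0, "V": 3.0}
p1 = profile1([toy_property], "ARV", 0, 2)
p2 = profile2([toy_property], "ARV", 0, 2)
```

In a training instance the known distance D would be appended to either vector as the value to predict; the sketch leaves it out because it only covers the attribute part.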
  • 13. PDMpred: Prediction process [Figure: the training set of protein structures yields training data (distance maps d, all training fragments, training profiles A-A, A-R, V-V); the test set of protein sequences yields test data (all test fragments, test profiles A-A, V-V) whose distance maps are still unknown (?)]
  • 14. PDMpred: Prediction process [Figure: same diagram as the previous slide, with a neighbor search for each test prediction vector: a test vector t1 . . . tn with unknown distance (?) is compared against the training vectors a1 . . . an with distance Da, b1 . . . bn with distance Db, and so on; the diagram labels two "average" steps]
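
The neighbor search for each test prediction vector can be sketched as below. The slide does not state the distance function or how neighbors are combined, so Euclidean distance and a plain average of the k nearest training distances are assumptions, and the name `predict_distance` and the toy data are hypothetical.

```python
import math


def predict_distance(test_vector, training, k=1):
    """Predict a residue-pair distance from the k nearest training vectors.

    training: list of (vector, known_distance) pairs.
    Assumptions: Euclidean distance between vectors; prediction is the mean
    of the known distances of the k nearest neighbors.
    """
    ranked = sorted(training, key=lambda item: math.dist(test_vector, item[0]))
    nearest = ranked[:k]
    return sum(distance for _, distance in nearest) / len(nearest)


# Toy training set (hypothetical vectors and distances in angstroms).
toy_training = [
    ([0.0, 0.0], 4.0),
    ([1.0, 1.0], 10.0),
    ([5.0, 5.0], 20.0),
]
one_nn = predict_distance([0.1, 0.1], toy_training, k=1)
two_nn = predict_distance([0.1, 0.1], toy_training, k=2)
```

A linear scan like this is quadratic in the number of fragments; an index structure would be the usual choice for large training sets, but the slides do not say what the authors use.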
  • 15. PDMpred: Evaluation measures
       … second was a measure of recall, which has been used in other protein prediction methods [14]. Finally, we have obtained measures of accuracy, specificity and Matthews Correlation Coefficient, which may often provide a much more balanced evaluation of the prediction than, for instance, the percentages [15]. The following formulas (2, 3, 4, 5, 6) define these five measures.
         $Precision = \frac{TP}{TP + FP}$ (2)
         $Recall = \frac{TP}{TP + FN}$ (3)
         $Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$ (4)
         $Specificity = \frac{TN}{TN + FP}$ (5)
         $MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ (6)
  • 16. 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
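
The five measures can be computed directly from the confusion counts of a predicted contact map. The sketch below assumes the counts are already available and returns 0.0 when a denominator is zero, which is a convention chosen here rather than something stated on the slide; the function name and the example counts are hypothetical.

```python
import math


def evaluation_measures(tp, fp, fn, tn):
    """Precision, recall, accuracy, specificity and MCC from confusion counts."""
    def ratio(numerator, denominator):
        # Assumption: report 0.0 instead of failing on an empty denominator.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, for illustration only.
measures = evaluation_measures(tp=40, fp=10, fn=20, tn=30)
```

Because non-contacts vastly outnumber contacts in a real map, accuracy and specificity tend to sit close to 1 regardless of quality, which is why precision, recall and MCC are the more informative columns.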
  • 17. Recent results
       ① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181. http://www.upo.es/eps/asencio/asppred
       ② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based approach for viral protein structure prediction. In: 9th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EvoBio 2011), Torino, Italy. Lecture Notes in Computer Science 6623, p. 69-76, Springer 2011, ISBN 978-3-642-20388-6.
       ③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012) Prediction of mitochondrial matrix protein structures based on feature selection and fragment assembly. In: 10th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EvoBio 2012), Málaga, Spain (accepted).
  • 18. ① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181.
       Feature selection (FS): 30 properties from 544 of AAindex ( http://www.genome.jp/aaindex/ )
       FS Algorithm: Relief evaluation algorithm + Ranker search algorithm
       Selected properties:
         BUNA790101 alpha-NH chemical shifts (Bundi-Wuthrich, 1979)
         BUNA790103 Spin-spin coupling constants 3JHalpha-NH (Bundi-Wuthrich, 1979)
         CHAM820102 Free energy of solution in water, kcal/mole (Charton-Charton, 1982)
         FAUJ880111 Positive charge (Fauchere et al., 1988)
         FAUJ880112 Negative charge (Fauchere et al., 1988)
         GARJ730101 Partition coefficient (Garel et al., 1973)
         JOND750102 pK (-COOH) (Jones, 1975)
         KARP850103 Flexibility parameter for two rigid neighbors (Karplus-Schulz, 1985)
         KHAG800101 The Kerr-constant increments (Khanarian-Moore, 1980)
         MAXF760103 Normalized frequency of zeta R (Maxfield-Scheraga, 1976)
         PRAM820101 Intercept in regression analysis (Prabhakaran-Ponnuswamy, 1982)
         QIAN880139 Weights for coil at the window position of 6 (Qian-Sejnowski, 1988)
         RICJ880101 Relative preference value at N" (Richardson-Richardson, 1988)
         RICJ880104 Relative preference value at N1 (Richardson-Richardson, 1988)
         RICJ880114 Relative preference value at C1 (Richardson-Richardson, 1988)
         RICJ880117 Relative preference value at C" (Richardson-Richardson, 1988)
         SUEM840102 Zimm-Bragg parameter sigma x 1.0E4 (Sueki et al., 1984)
         TANS770102 Normalized frequency of isolated helix (Tanaka-Scheraga, 1977)
         TANS770108 Normalized frequency of zeta R (Tanaka-Scheraga, 1977)
         VASM830101 Relative population of conformational state A (Vasquez et al., 1983)
         VELV850101 Electron-ion interaction potential (Veljkovic et al., 1985)
         WERD780102 Free energy change of epsilon(i) to epsilon(ex) (Wertz-Scheraga, 1978)
         WERD780103 Free energy change of alpha(Ri) to alpha(Rh) (Wertz-Scheraga, 1978)
         YUTK870103 Activation Gibbs energy of unfolding, pH7.0 (Yutani et al., 1987)
         AURR980120 Normalized positional residue frequency at helix termini C4' (Aurora-Rose, …
         NADH010107 Hydropathy scale based on self-information values in the two-state model
         MONM990201 Averaged turn propensities in a transmembrane helix (Monne et al., 1999)
         MITS020101 Amphiphilicity index (Mitaku et al., 2002)
         WILM950104 Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/2-PrOH/MeCN/H2O
         DIGM050101 Hydrostatic pressure asymmetry index, PAI (Di Giulio, 2005)
       Profile:
         Profile 1 (Fig. 1): P1 P2 ... Pm D, with $P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j)$
         Profile 2 (Fig. 2): B1 E1 ... Bm Em D, with
           $B_i = P_i(s_b) + \sum_{j=1,\, j \neq b}^{L} \frac{P_i(s_j)}{L^{|b-j|}}, \quad \forall i \in \{1..m\}$
           $E_i = P_i(s_e) + \sum_{j=1,\, j \neq e}^{L} \frac{P_i(s_j)}{L^{|e-j|}}, \quad \forall i \in \{1..m\}$
       Note that prediction vectors represent fragments of different lengths, but these lengths are not included in them. The physico-chemical properties included in the prediction vectors are explained in the next subsection. From the point of view of data mining, $B_i$ and $E_i$ are the attributes of training instances and D is the class to predict.
       2.3 Physico-chemical feature selection: With the aim of using the smallest and most effective set of …
  • 19. ① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181. http://www.upo.es/eps/asencio/asppred
       Datasets and configuration
       • 10-fold cross validation
       • Beta-carbon distances
       • No minimum sequence separation
       • Distance threshold (cut-off) of 8 angstroms
       • Five datasets:
         1. 20 random proteins from PDB, identity ≤ 30%
         2. 118 proteins from CullPDB, identity ≤ 10%
         3. 170 proteins from PDBselect, identity ≤ 25%
         4. 221 proteins from CullPDB, identity ≤ 5%
         5. 5130 proteins from PDBselect, identity ≤ 25%
       • PDB (Protein Data Bank): http://www.rcsb.org
       • CullPDB: http://dunbrack.fccc.edu/PISCES.php
       • PDBselect: http://bioinfo.tg.fh-giessen.de/pdbselect/
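
The configuration above measures distances between beta carbons. A minimal sketch of how the true distance map of a training protein might be built is shown below; it assumes the beta-carbon coordinates have already been extracted from a structure file into (x, y, z) tuples in angstroms, and the function name and coordinates are hypothetical.

```python
import math


def distance_map(cb_coords):
    """Pairwise Euclidean distances between beta-carbon coordinates.

    cb_coords: list of (x, y, z) tuples in angstroms, one per residue.
    Assumption: coordinate extraction (and the choice of atom for residues
    without a beta carbon, such as glycine) happens before this step.
    """
    n = len(cb_coords)
    return [[math.dist(cb_coords[i], cb_coords[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical coordinates for a 3-residue fragment.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
toy_map = distance_map(toy_coords)
```

The result is symmetric with a zero diagonal, so only the upper triangle needs to be stored or predicted.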
  • 20. ① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181 (Journal of Integrative Bioinformatics 2011, http://journal.imbio.de/). Results

       Table 2: Efficiency of our method at 4 Å of distance threshold (µ ± σ values). Only the rows for datasets 3 to 5 are visible on the slide.
         Dataset  Recall       Precision    Accuracy     Specificity
         3        0.48 ± 0.04  0.43 ± 0.05  0.99 ± 0.01  0.99 ± 0.01
         4        0.40 ± 0.05  0.41 ± 0.05  0.99 ± 0.01  0.99 ± 0.01
         5        0.14 ± 0.08  0.14 ± 0.08  0.99 ± 0.05  0.99 ± 0.05

       Table 3: Efficiency of our method at 8 Å of distance threshold (µ ± σ values).
         Dataset  Recall       Precision    Accuracy     Specificity
         1        0.39 ± 0.06  0.41 ± 0.08  0.97 ± 0.03  0.98 ± 0.01
         2        0.39 ± 0.07  0.40 ± 0.07  0.95 ± 0.01  0.97 ± 0.02
         3        0.38 ± 0.02  0.38 ± 0.02  0.95 ± 0.02  0.97 ± 0.01
         4        0.40 ± 0.03  0.41 ± 0.03  0.95 ± 0.01  0.97 ± 0.01
         5        0.51 ± 0.11  0.51 ± 0.11  0.92 ± 0.06  0.95 ± 0.07

       For a threshold of 8 Å, recall and precision are basically the same in experiments 1 to 4. We found that our predictor does not need many proteins for training: it seems to find good similar fragments even in poor training sets. In experiment 5 we achieved better recall and precision than in the other experiments; however, the standard deviation values are higher. This may be due to the great number of different types of proteins (structural classes or number of domains, for instance) in the whole of PDBselect. We included detailed information about each protein in the five experiments as supplemental material at http://www.upo.es/eps/asencio/asppred. We indicate for each protein …

       Table 4: Study of the number (K) of nearest training profiles.
         Dataset  K   Recall       Precision    Accuracy     Specificity
         4        1   0.40 ± 0.03  0.41 ± 0.03  0.95 ± 0.01  0.97 ± 0.01
                  3   0.40 ± 0.04  0.72 ± 0.01  0.96 ± 0.01  0.97 ± 0.00
                  5   0.39 ± 0.04  0.81 ± 0.01  0.97 ± 0.00  0.98 ± 0.00
                  7   0.39 ± 0.04  0.84 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
                  9   0.38 ± 0.04  0.86 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
                  11  0.38 ± 0.04  0.87 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
                  13  0.37 ± 0.04  0.88 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
                  15  0.37 ± 0.04  0.88 ± 0.00  0.99 ± 0.00  0.99 ± 0.00
  • 21. ② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based approach for viral protein structure prediction (EvoBio 2011)
       Feature selection (FS): 3 properties from 544 of AAindex
       FS Algorithm: BIRS algorithm + CFS search algorithm (Roberto Ruiz Sánchez, José Cristóbal Riquelme Santos, Jesús S. Aguilar-Ruiz. Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognition 39(12): 2383-2392 (2006))
       Selected properties:
         Residue accessible surface area in folded protein (Chothia, 1976)
         Average relative fractional occurrence in ER(i-1) (Rackovsky-Scheraga, 1982)
         RF value in high salt chromatography (Weber-Lacey, 1978)
       Profile:
         Profile 1 (Fig. 1): P1 P2 ... Pm D, with $P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j)$
         Profile 2 (Fig. 2): B1 E1 ... Bm Em D, with
           $B_i = P_i(s_b) + \sum_{j=1,\, j \neq b}^{L} \frac{P_i(s_j)}{L^{|b-j|}}, \quad \forall i \in \{1..m\}$
           $E_i = P_i(s_e) + \sum_{j=1,\, j \neq e}^{L} \frac{P_i(s_j)}{L^{|e-j|}}, \quad \forall i \in \{1..m\}$
       Note that prediction vectors represent fragments of different lengths, but these lengths are not included in them. The physico-chemical properties included in the prediction vectors are explained in the next subsection. From the point of view of data mining, $B_i$ and $E_i$ are the attributes of training instances and D is the class to predict.
       2.3 Physico-chemical feature selection: With the aim of using the smallest and most effective set of …
  • 22. Protein Distance Map Prediction based on a Nearest Neighbors Approach Gualberto Asencio Cortés Datasets and configuration  Leave-one-out CrossValidation (previously 10-fold CV)  Minimum sequence separation of 7 amino acids (previously not used)  Distance threshold (cut-off) of 8 angstroms  Beta-carbon distances  Training/test protein dataset: Viral capsid proteins, from PDB, identity ≤ 30%, 63 proteins with maximum length of 1284 amino acids. ② G Asencio, J S Aguilar-Ruiz,A E Marquez (2011) A nearest neighbour-based approach for viral protein structure prediction (EvoBio 2011) Motivation Our proposal Recent results Conclusions and future work 22 / 33
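
A minimum sequence separation restricts evaluation to residue pairs that are far apart along the chain, so that trivially close neighbors do not inflate the scores. The sketch below lists the pairs that would be evaluated; whether the separation is applied as "at least 7" or "more than 7" is not stated on the slide, so the inclusive reading used here is an assumption, and the function name is hypothetical.

```python
def evaluated_pairs(length, min_separation=7):
    """Residue pairs (i, j), i < j, whose sequence separation is at least
    min_separation. Indices are 0-based.

    Assumption: the bound is inclusive (j - i >= min_separation).
    """
    return [(i, j)
            for i in range(length)
            for j in range(i + min_separation, length)]


# For a 9-residue chain only three pairs remain.
toy_pairs = evaluated_pairs(9)
```

For a chain of length n this keeps (n - 7)(n - 6) / 2 pairs when n is at least 7, so short proteins contribute very few evaluated pairs.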
  • 23. Results. ② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based approach for viral protein structure prediction (EvoBio 2011)
  • 24. ③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012) Prediction of mitochondrial matrix protein structures based on feature selection and fragment assembly. (EvoBio 2012) (accepted)
    Feature selection (FS): 16 properties from 544 of AAindex
    FS algorithm: BARS algorithm + CFS search algorithm
    Roberto Ruiz, José C. Riquelme, and Jesús S. Aguilar-Ruiz. Best agglomerative ranked subset for feature selection. Journal of Machine Learning Research – Proceedings Track, 4:148–162, 2008.
    Table 1. The 16 physico-chemical properties of amino acids considered from AAindex
    CHOC760104  Proportion of residues 100% buried
    LEVM760104  Side chain torsion angle phi(AAAR)
    MEIH800103  Average side chain orientation angle
    PALJ810107  Normalized frequency of alpha-helix in all-alpha class
    QIAN880112  Weights for alpha-helix at the window position of 5
    WOLS870101  Principal property value z1
    ONEK900101  Delta G values for the peptides extrapolated to 0 M urea
    BLAM930101  Alpha helix propensity of position 44 in T4 lysozyme
    PARS000101  p-Values of mesophilic proteins based on the distributions of B values
    NADH010102  Hydropathy scale based on self-information values in the two-state model (9% accessibility)
    SUYM030101  Linker propensity index
    WOLR790101  Hydrophobicity index
    JACR890101  Weights from the IFH scale
    MIYS990103  Optimized relative partition energies - method B
    MIYS990104  Optimized relative partition energies - method C
    MIYS990105  Optimized relative partition energies - method D
    Profile 2: $B_1, E_1, \ldots, B_m, E_m, D$, with
    $P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j)$  (2)
    $B_i = P_i(s_b) + \sum_{j=1,\, j \neq b}^{L} \frac{P_i(s_j)}{L\,|b-j|}, \quad \forall i \in \{1..m\}$  (3)
    $E_i = P_i(s_e) + \sum_{j=1,\, j \neq e}^{L} \frac{P_i(s_j)}{L\,|e-j|}, \quad \forall i \in \{1..m\}$  (4)
    Excerpt from the paper (cropped on the slide): "Note that prediction vectors represent fragments of different lengths, but […] these lengths is not included in them. The physico-chemical […] in the prediction vectors are explained in the next subsection. […] view of data mining, Bi and Ei are the attributes of […] the class to predict. 2.3 Physico-chemical feature selection. To the aim of using the smallest and most effective […] properties, we performed a feature selection from […] physico-chemical properties of amino acids. This […] 544 amino acid properties. We used BARS to perform the feature selection […]"
  • 25. Datasets and configuration
    ▫ Leave-one-out cross-validation
    ▫ Minimum sequence separation of 7 amino acids
    ▫ Distance threshold (cut-off) of 8 angstroms
    ▫ Beta-carbon distances
    ▫ Training/test protein dataset: mitochondrial matrix proteins from PDB, identity ≤ 30%, 74 proteins with a maximum length of 1094 amino acids.
    ③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012) Prediction of mitochondrial matrix protein structures based on feature selection and fragment assembly. (EvoBio 2012) (accepted)
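Leave-one-out cross-validation over proteins, combined with a nearest-neighbour predictor, might look roughly like the sketch below. It is a simplification and not the authors' implementation: it uses a single nearest neighbour with Euclidean distance between feature vectors, and the names `predict_distance` and `leave_one_out` as well as the data layout are assumptions made for illustration.

```python
import math


def predict_distance(query, training):
    """Return the distance label D of the training vector closest to `query`.

    training: list of (feature_vector, distance) pairs.
    """
    nearest = min(training, key=lambda item: math.dist(query, item[0]))
    return nearest[1]


def leave_one_out(profiles_by_protein):
    """Hold out one protein at a time and train on all the others.

    profiles_by_protein: dict protein id -> list of (features, true_distance).
    Returns: dict protein id -> list of (true_distance, predicted_distance).
    """
    results = {}
    for held_out, test_profiles in profiles_by_protein.items():
        # Training set: every profile that does not come from the held-out protein.
        training = [profile
                    for pid, profiles in profiles_by_protein.items()
                    if pid != held_out
                    for profile in profiles]
        results[held_out] = [(true_d, predict_distance(features, training))
                             for features, true_d in test_profiles]
    return results
```

Holding out whole proteins, rather than individual residue pairs, keeps fragments of the test protein out of the training set.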
  • 26. Results
    Table 3. Efficiency of our method predicting mitochondrial matrix proteins
    Protein set         Recall  Precision  Accuracy  Specificity  MCC
    All proteins (74)   0.80    0.79       0.97      0.97         0.82
    L ≤ 300 (20)        0.77    0.76       0.98      0.98         0.75
    300 < L ≤ 450 (27)  0.84    0.83       0.99      0.99         0.83
    L > 450 (27)        0.77    0.76       0.95      0.95         0.82
    Excerpt from the paper (cropped on the slide): "[…] a cross validation, cut-off of 8 angstroms and minimum sequence separation of 7 amino acids, achieved a precision value of 0.11 for proteins of more than 300 amino acids."
    Fig. 2. Predicted distance maps for the mitochondrial matrix proteins 1TG6 (a, 277 amino acids) and 3BLX (b, 349 amino acids) with their color scale (c).
    ③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012) Prediction of mitochondrial matrix protein structures based on feature selection and fragment assembly. (EvoBio 2012) (accepted)
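The five measures reported in Table 3 are standard functions of the confusion matrix over residue pairs (contact versus non-contact). The sketch below shows the usual definitions; the function name and the choice to return 0.0 when a denominator is zero are assumptions, and the counts in the usage example are made up.

```python
import math


def contact_metrics(tp, fp, tn, fn):
    """Recall, precision, accuracy, specificity and MCC from raw counts."""
    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    print(contact_metrics(tp=6, fp=2, tn=90, fn=2))
```

Because non-contacts vastly outnumber contacts in a contact map, accuracy and specificity tend to sit close to 1, which is why recall, precision and MCC are the more informative columns.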
  • 27. Results
    Table 4. Comparison at 8 Å with RBFNN on the same benchmark
    PDB code (length)   RBFNN                  PDMpred
                        Np     Nd     Ap       Np     Nd     Ap
    1TTF (94)           376    1421   26.46    1307   1421   91.96
    1E88 (160)          1006   3352   30.01    3075   3352   91.73
    1NAR (290)          3346   10524  31.79    1797   10524  17.07
    1BTJ B (337)        3796   14283  26.58    14026  14283  98.20
    1J7E (458)          6589   25026  26.33    23407  25026  93.53
    Average                           27.67                  78.49
    Np: predicted numbers; Nd: desired numbers; Ap: prediction recall (%).
    Excerpt from the paper (cropped on the slide): "[…] is the count of the predicted contacts by the algorithm and desired numbers Nd is the total number of contacts. The contact threshold was set at 8 Å. In Table 4 we show the results of this experimentation. As we can see in […]"
    ③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012) Prediction of mitochondrial matrix protein structures based on feature selection and fragment assembly. (EvoBio 2012) (accepted)
  • 28. 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
  • 29. Conclusions
    • A new protein distance map predictor has been built using a nearest neighbors-based approach and feature selection over physico-chemical properties.
    • We predict distance maps, which provide more information than contact maps and are easy to convert into contact maps.
    • We achieved up to 0.80 recall and 0.79 precision with a minimum sequence separation of 7 amino acids on some non-homologous protein sets.
    • Our results are a large improvement in recall (50.82 percentage points higher on average) over the results of a previous study (Zhang et al. 2005).
    A nearest neighbour-based approach for viral protein structure prediction. Gualberto Asencio Cortés, Jesús S. Aguilar-Ruíz and Alfonso E. Márquez Chamorro
  • 30. Current research
    ▫ Five new measures implemented:
      ▫ Recursive Convex Hull of amino acids (RCH)
      ▫ Solvent Accessibility (SA)
      ▫ Secondary Structure (SS) from PSIPRED
      ▫ Coordination Number (CN)
      ▫ Position-Specific Scoring Matrix (PSSM) from PSI-BLAST
    ▫ Protein set used in the current experiments (from ICOS, named INF2010):
      ▫ 3262 proteins from PDB-REPRDB, identity ≤ 30%, with 90% of the set used for training and 10% for testing.
  • 31. Current research
    ▫ Distance map post-processing
    ▫ New feasibility measures
      ▫ Based on the geometry of the predicted distance maps
      ▫ Using the triangle inequality
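One way a triangle-inequality feasibility check on a predicted distance map could work is sketched below. The slide does not define the actual feasibility measures, so this is only an assumed illustration: it lists every residue triple whose three predicted distances cannot form a triangle, and the function name and tolerance parameter are hypothetical.

```python
from itertools import combinations


def triangle_violations(dist, tol=1e-9):
    """Return residue triples (i, j, k) whose distances break the triangle inequality.

    dist: symmetric matrix of predicted distances (list of lists).
    tol:  slack that absorbs floating-point noise.
    """
    violations = []
    for i, j, k in combinations(range(len(dist)), 3):
        a, b, c = dist[i][j], dist[j][k], dist[i][k]
        # Real 3D points always satisfy: each side <= sum of the other two.
        if a > b + c + tol or b > a + c + tol or c > a + b + tol:
            violations.append((i, j, k))
    return violations
```

The fraction of violating triples could then serve as a rough score of how far a predicted map is from any realisable 3D structure. The loop is cubic in the number of residues, so long proteins would need sampling or a vectorised version.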
  • 32. Future work
    ▫ Perform feature selection over RCH, SA, SS, CN and PSSM with different statistics and window sizes.
    ▫ Build a Top L/x ranking (as in CASP) of predicted contacts (from predicted distances) using the standard deviation of the distances of the nearest neighbors (profiles).
    ▫ Use 24 amino acids as the minimum sequence separation and CASP9 target domains in the free modelling category.
    ▫ Divide training profiles into bags according to the evolutionary information (PSSM) of amino acid pairs. Then predict distances using only the appropriate bag for the test fragment.
    ▫ Use protein domains for training instead of whole sequences.
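The Top L/x ranking idea on slide 32 can be sketched as follows. The slide only states that the standard deviation of the nearest neighbors' distances would drive the ranking, so everything else is an assumption made for illustration: the predicted distance is taken as the mean over the neighbors, a lower standard deviation is treated as higher confidence, and the function name and input layout are hypothetical.

```python
import statistics


def top_l_over_x(neighbor_distances, seq_len, x=5, cutoff=8.0, min_sep=24):
    """Rank predicted contacts and keep the best seq_len // x of them.

    neighbor_distances: dict (i, j) -> list of distances taken from the
                        nearest-neighbor profiles of residue pair (i, j).
    """
    candidates = []
    for (i, j), distances in neighbor_distances.items():
        if abs(i - j) < min_sep:
            continue  # too close in sequence to count
        mean_d = statistics.fmean(distances)
        if mean_d <= cutoff:
            # Assumption: neighbors that agree (low spread) signal confidence.
            candidates.append((statistics.pstdev(distances), mean_d, i, j))
    candidates.sort()
    return [(i, j) for _, _, i, j in candidates[:seq_len // x]]
```

With `x=5` this yields a Top L/5 list, one of the list sizes commonly reported in CASP contact assessments.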
  • 33. Thank you. Acknowledgements: Bioinformatics group in Pablo de Olavide University; ICOS group in the University of Nottingham. Contact: guaasecor@upo.es

Editor's Notes

  1. - Proteins are macromolecules composed of one or more chains of amino acids. There are 20 natural amino acids.
  2. - Proteins are macromolecules composed of one or more chains of amino acids. There are 20 natural amino acids.
  3. PSP is very useful for determining protein functions, because a protein's function is determined by its structure. It is
  4. CASP is a very important biennial competition among protein structure predictors.
  5. Our goal is to compete in CASP, using the same evaluation scheme, and to try to surpass this 22% precision.