  1. DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING Advanced Circuits, Architecture, and Computing Lab BgN-Score & BsN-Score: Bagging & Boosting Based Ensemble Neural Networks Scoring Functions for Accurate Binding Affinity Prediction of Protein-Ligand Complexes Hossam M. Ashtawy ashtawy@egr.msu.edu The 9th IAPR conference on Pattern Recognition in Bioinformatics (PRIB 2014) August 21, 2014 Nihar R. Mahapatra nrm@egr.msu.edu Department of Electrical & Computer Engineering Michigan State University, East Lansing, MI, U.S.A. © 2014
  2. Motivation 2 • BA is the principal determinant of many vital biological processes • Accurate prediction of BA is challenging & remains an unresolved problem • Conventional SFs have limited predictive power
  3. Outline 3 Background & Scope of Work: • Scoring Functions • Our Approach. Materials & Methods: • Compound Characterization • Ensemble Neural Networks. Experiments & Discussion: • SFs' Tuning, Training, & Testing • SFs' Evaluation & Comparison
  4. Background and Scope of Work 4
  5. Docking & Scoring 5 Ensemble NNs = SF2
  6. Scoring Challenges 6 • Lack of accurate accounting of intermolecular physicochemical interactions • More descriptors = curse of dimensionality • Relationship between descriptors & BA could be highly nonlinear
  7. Multi-Layer Neural Network 7 • Theoretically, it can model any nonlinear continuous function; however: • Hard to tune its weights to optimal values • Does not handle high-dimensional data well • Has high variance errors • Is a black-box model, so it lacks descriptive power
  8. Our Approach & Scope of Work 8 Collect a large number of PLCs with known BA → extract a diverse set of features → train ensemble NN models based on boosting and bagging → evaluate the resulting SFs on diverse and homogeneous protein families
  9. Materials and Methods 9
  10. Compound Database: PDBbind [3] 10 • Protein-ligand complexes obtained from PDBbind 2007 • PDBbind is a selective compilation of the Protein Data Bank (PDB) database
  11. Compound Database: PDBbind 11 [Pipeline figure: protein-ligand complexes from the PDB (e.g., 1a30, 9hvp, 1d5j, 1f9g, 2qtg, 2usn) pass through PDBbind filters and binding-affinity collection, followed by feature calculation using the X-Score, AffiScore, RF-Score, and GOLD tools.]
  12. PDBbind: Refined Set 12 Filters applied to PDB complexes: ligand MW ≤ 1000; ligand has ≥ 6 non-hydrogen atoms; only one ligand is bound to the protein; protein & ligand are non-covalently bound; resolution of the complex crystal structure ≤ 2.5 Å; elements in the complex limited to C, N, O, P, S, F, Cl, Br, I, H; known Kd or Ki. Preprocessing: hydrogenation, protonation & deprotonation. The result is the refined set of PDBbind.
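As a reading aid, here is a minimal sketch of these filters as a single predicate; the attribute names on `cplx` are hypothetical placeholders, and the real refined-set curation is performed by the PDBbind maintainers.

```python
# Hedged sketch of the PDBbind refined-set filters listed above.
ALLOWED_ELEMENTS = {"C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "H"}

def passes_refined_set_filters(cplx) -> bool:
    return (
        cplx.ligand_mw <= 1000                          # ligand molecular weight <= 1000
        and cplx.ligand_heavy_atoms >= 6                # >= 6 non-hydrogen ligand atoms
        and cplx.num_bound_ligands == 1                 # only one ligand bound to the protein
        and not cplx.covalently_bound                   # protein & ligand non-covalently bound
        and cplx.resolution <= 2.5                      # crystal resolution <= 2.5 Angstrom
        and set(cplx.elements) <= ALLOWED_ELEMENTS      # allowed element types only
        and cplx.affinity_type in {"Kd", "Ki"}          # known Kd or Ki
    )
```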
  13. PDBbind: Core Set 13 Starting from the refined set, a BLAST similarity search with a 90% similarity cutoff groups the proteins into clusters. Clusters with ≥ 4 complexes in which the binding affinity of the highest-affinity complex is 100-fold that of the lowest one are retained, and the first (highest-), middle-, and lowest-affinity complexes from each cluster form the core set of PDBbind.
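A minimal sketch of this cluster-based selection, assuming `clusters` maps a BLAST cluster to (PDB id, binding affinity) pairs and that larger affinity values mean stronger binding; the actual core-set construction is part of PDBbind's own curation.

```python
# Hedged sketch of the core-set selection described above.
def select_core_set(clusters):
    core = []
    for members in clusters.values():
        if len(members) < 4:
            continue                                   # keep clusters with >= 4 complexes
        members = sorted(members, key=lambda m: m[1])  # sort by binding affinity
        lowest, highest = members[0], members[-1]
        if highest[1] < 100 * lowest[1]:
            continue                                   # highest affinity must be 100-fold the lowest
        middle = members[len(members) // 2]
        core.extend([highest, middle, lowest])         # highest-, middle-, and lowest-affinity complexes
    return core
```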
  14. Compound Characterization 14 • Extracted features calculated with the following scoring-function tools: X-Score (6 features), AffiScore (30 features), RF-Score (36 features), GOLD (14 features); 86 features in total. [Pipeline figure repeated: protein-ligand complexes from the PDB → PDBbind filters and binding-affinity collection → feature calculation → training and test datasets.]
  15. Training and Test Datasets 15 • Primary training set (1105 complexes): Pr • Core test set (195 complexes): Cr [Workflow figure: feature calculation with the X-Score, AffiScore, RF-Score, and GOLD tools yields the feature-set datasets X, A, R, G, XA, ..., XARG, which feed the ensemble NN boosting & bagging algorithms and are used to score test complexes.]
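The dataset names suggest that each of X, A, R, G, XA, ..., XARG is formed by concatenating the feature blocks of the named tools; below is a sketch under that assumption (the dictionary keys and the itertools-based enumeration are illustrative, not the authors' code).

```python
# Hedged sketch: assemble the X, A, R, G, XA, ..., XARG feature-set datasets by
# concatenating per-tool feature blocks (rows = complexes, columns = features).
from itertools import combinations
import numpy as np

def build_datasets(blocks):
    # blocks example: {"X": x_feats (6 cols), "A": affi_feats (30), "R": rf_feats (36), "G": gold_feats (14)}
    datasets = {}
    tools = list(blocks)                              # ["X", "A", "R", "G"]
    for k in range(1, len(tools) + 1):
        for combo in combinations(tools, k):
            name = "".join(combo)                     # "X", ..., "XA", ..., "XARG"
            datasets[name] = np.hstack([blocks[t] for t in combo])
    return datasets
```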
  16. Base Learner: A Neural Network 16 • The prediction of each network is calculated as shown on the slide • Network weights are optimized to minimize the fitting criterion E shown on the slide. [Figure: feed-forward network with an input layer (features x1 ... xP of a complex), one hidden layer, and an output layer producing the binding affinity; weights w_{i,h} and w_{h,o}, bias units +1.]
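The formulas on this slide were images and did not survive extraction; a standard single-hidden-layer form consistent with the figure (weights w_{i,h} and w_{h,o}, bias units +1, and a weight-decay term controlled by the λ tuned on the later parameter-tuning slides) would be:

```latex
% Prediction of one network with H hidden units and activation \phi (e.g., logistic):
\hat{y}(\mathbf{x}) \;=\; w_{0,o} + \sum_{h=1}^{H} w_{h,o}\,
    \phi\!\Big( w_{0,h} + \sum_{i=1}^{P} w_{i,h}\, x_i \Big)

% Fitting criterion E: squared error over the N training complexes,
% plus an (assumed) weight-decay regularizer with coefficient \lambda:
E \;=\; \sum_{n=1}^{N} \big( y_n - \hat{y}(\mathbf{x}_n) \big)^2
      \;+\; \lambda \sum_{\text{weights } w} w^2
```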
  17. BgN-Score 17 • An ensemble of MLP ANNs is grown • Inputs to each ANN are a random subset of p features • Each ANN is trained on a bootstrap dataset randomly sampled with replacement from the training data • After building the ensemble model, the BA of a new protein-ligand complex X is computed by applying the formula on the slide. [Figure: several networks, each fed a different random feature subset of the complex, with their outputs averaged to give the binding affinity.]
  18. BsN-Score 18 [Figure: a chain of networks, each taking the features of a complex as input; per speaker note 3, the networks are fitted stage-wise to residuals and their shrunk predictions are summed to give the binding affinity.]
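A minimal sketch of this bagging scheme, assuming scikit-learn's MLPRegressor as the base learner (the paper's own implementation and settings are not reproduced here); the defaults mirror the tuned values quoted on the parameter-tuning slide (H ≈ 20, p ≈ P/3, λ ≈ 0.001, N ≈ 2000).

```python
# Hedged sketch of a bagging-based ensemble of neural networks (BgN-Score style).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def train_bagged_nns(X, y, n_nets=2000, hidden=20, alpha=0.001):
    """Train an ensemble of MLPs, each on a bootstrap sample and a random feature subset."""
    P = X.shape[1]
    p = max(1, P // 3)                                # feature-subset size ~ P/3
    ensemble = []
    for _ in range(n_nets):                           # N ~ 2000 per the tuning slide; reduce for quick tests
        rows = rng.integers(0, len(X), size=len(X))   # bootstrap sample with replacement
        cols = rng.choice(P, size=p, replace=False)   # random subset of p features
        net = MLPRegressor(hidden_layer_sizes=(hidden,), alpha=alpha, max_iter=500)
        net.fit(X[rows][:, cols], y[rows])
        ensemble.append((net, cols))
    return ensemble

def predict_bagged(ensemble, X_new):
    """BgN-Score prediction: average of all networks' predictions."""
    preds = [net.predict(X_new[:, cols]) for net, cols in ensemble]
    return np.mean(preds, axis=0)
```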
  19. Conventional SFs 19 Empirical SFs (9) DS::PLP DS::Jain DS::Ludi GLIDE::GlideScore SYBYL::ChemScore SYBYL::F-Score GOLD::ChemScore GOLD::ASP X-Score Knowledge Based SFs (4) DS::LigScore DS::PMF SYBYL::PMF DrugScore Force-field SFs (3) SYBYL::D-Score SYBYL::G-Score GOLD::GoldScore
  20. Experiments, Results, and Discussion 20
  21. SF Construction & Application Workflow 21 [Workflow figure: scoring function building and evaluation — collecting data → feature generation → training set formation → parameter tuning (build/validate on training data, yielding optimal parameters) → model building (BsN-Score & BgN-Score); application — protein 3D structure + ligand → feature generation → trained model → predicted binding affinity.]
  22. Parameter Tuning: BgN-Score 22 For each generated parameter set (example: H = 23, p = 15, λ = 0.031), build a BgN-Score model (training networks 1 ... 2000) and test it on the out-of-bag (OOB) examples; choose the parameter set with the minimum OOB MSE (example OOB MSE values: 1.56, 1.04, 3.17). Optimal parameters: H ≈ 20, p ≈ P/3, λ ≈ 0.001, N ≈ 2000.
  23. Parameter Tuning: BsN-Score 23 For each generated parameter set (example: H = 23, p = 15, λ = 0.031), build 10 BsN-Score models and test each on its respective validation fold; choose the parameter set with the minimum average cross-validation MSE (example values: 1.56, 1.04, 3.17). Optimal parameters: H ≈ 20, p ≈ P/3, λ ≈ 0.001, N ≈ 2000.
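A sketch of this OOB-based selection, reusing the hypothetical bagged-MLP setup from the BgN-Score sketch above; the candidate grid values are illustrative, not the authors'.

```python
# Hedged sketch of out-of-bag (OOB) parameter selection for a bagged NN ensemble.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def oob_mse(X, y, n_nets=200, hidden=20, p_frac=1/3, alpha=0.001):
    """Train a small bagged ensemble and return its out-of-bag mean squared error."""
    n, P = X.shape
    p = max(1, int(P * p_frac))
    sums, counts = np.zeros(n), np.zeros(n)
    for _ in range(n_nets):
        rows = rng.integers(0, n, size=n)             # bootstrap sample
        cols = rng.choice(P, size=p, replace=False)   # random feature subset
        net = MLPRegressor(hidden_layer_sizes=(hidden,), alpha=alpha, max_iter=500)
        net.fit(X[rows][:, cols], y[rows])
        oob = np.setdiff1d(np.arange(n), rows)        # examples this network never saw
        sums[oob] += net.predict(X[oob][:, cols])
        counts[oob] += 1
    seen = counts > 0
    return np.mean((y[seen] - sums[seen] / counts[seen]) ** 2)

# Candidate parameter sets; keep the one with the lowest OOB MSE.
grid = [dict(hidden=h, p_frac=f, alpha=a)
        for h in (10, 20, 30) for f in (0.25, 1/3, 0.5) for a in (0.001, 0.01, 0.1)]
# best = min(grid, key=lambda ps: oob_mse(X_train, y_train, **ps))
```

For BsN-Score (the next slide), the selection loop is the same, with the OOB estimate replaced by the average MSE over 10 cross-validation folds.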
  24. Ensemble NN vs. Conventional SFs on Diverse Complexes 24 [Bar chart of correlation coefficient Rp on diverse complexes for 19 scoring functions: SYBYL::F-Score, SYBYL::PMF-Score, GOLD::GoldScore, DS::Jain, SYBYL::D-Score, GOLD::ChemScore, DS::PMF, GlideScore-XP, DS::LigScore2, DS::LUDI3, SYBYL::G-Score, GOLD::ASP, DS::PLP1, SYBYL::ChemScore, DrugScoreCSD, X-Score::HMScore, SNN-Score::X, BgN-Score::XARG, and BsN-Score::XARG; labeled values include 0.644, 0.657, 0.804, and 0.816.]
  25. Ensemble NN vs. Conventional SFs on HIV Complexes 25 [Two bar charts of correlation coefficient Rp per scoring function: left, disjoint training and test complexes; right, overlapping training and test complexes.]
  26. Ensemble NN vs. Conventional SFs on Trypsin Complexes 26 [Two bar charts of correlation coefficient Rp per scoring function: disjoint vs. overlapping training and test complexes.]
  27. Ensemble NN vs. Conventional SFs on Carbonic Anhydrase Cmpxs. 27 [Two bar charts of correlation coefficient Rp per scoring function: disjoint vs. overlapping training and test complexes.]
  28. Ensemble NN vs. Conventional SFs on Thrombin Complexes 28 [Two bar charts of correlation coefficient Rp per scoring function: disjoint vs. overlapping training and test complexes.]
  29. Conclusion 29
  30. Concluding Remarks 30 BsN-Score & BgN-Score are the most accurate SFs. They are ~20% more accurate (Rp of 0.804 & 0.816 vs. 0.675) than SNN-Score and ~25% more accurate (0.804 & 0.816 vs. 0.644) than the best conventional SF, X-Score. Moreover, their accuracies are even higher when they are used to predict BAs of protein-ligand complexes related to their training sets.
  31. Future Work 31 Collect more PLCs from other databases. Consider other techniques to extract more descriptors. Analyze variable importance and descriptor interactions. Consider other types & topologies of ANNs, such as recurrent NNs and deep NNs.
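For reference, and assuming the quoted percentages refer to relative improvement in Rp, the arithmetic is:

```latex
\frac{0.804}{0.675} \approx 1.19,\quad \frac{0.816}{0.675} \approx 1.21
  \;\Rightarrow\; \text{about } 20\% \text{ higher } R_p \text{ than SNN-Score}
\\[4pt]
\frac{0.804}{0.644} \approx 1.25,\quad \frac{0.816}{0.644} \approx 1.27
  \;\Rightarrow\; \text{about } 25\% \text{ higher } R_p \text{ than X-Score}
```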
  32. Thank You! 32

Editor's notes

  1. We used the same complex database that Cheng et al. used as a benchmark in their comparative assessment of 16 popular SFs. This database, PDBbind, is a popular benchmark that has been cited and used to evaluate SFs in hundreds of other studies (according to Google Scholar). PDBbind is a high-quality and comprehensive compilation of biomolecular complexes deposited in the Protein Data Bank (PDB).
  2. [The slide itself carries the talking points.]
  3. Boosting is an ensemble machine-learning technique based on stage-wise fitting of base learners. The technique attempts to minimize the overall loss by boosting the complexes with the highest prediction errors, i.e., by fitting NNs to the (accumulated) residuals made by previous networks in the ensemble model. The algorithm starts by fitting the first network to all training complexes. A small fraction (ν < 1) of the first network's predictions is used to calculate the first iteration of residuals Y1. The network f1 is the first term in the boosting additive model. In each subsequent stage, a network is trained on a bootstrap sample of the training complexes described by a random subset of p < P features. The values of the dependent variable of the training data for network l are the current residuals corresponding to the sampled protein-ligand complexes. The residuals for each network are the differences between the previous residuals and a small fraction of its predicted errors; this fraction is controlled by the shrinkage parameter ν < 1 to avoid overfitting. Network generation continues as long as the number of networks does not exceed a predefined limit L. Each network joins the ensemble as a shrunk version of itself. In our experiments, we fixed the shrinkage parameter to 0.001, which gave the lowest out-of-sample error. The final prediction of a PLC x is: [read the formula given in the slide]
  4. A total of sixteen popular SFs are compared to NN SFs in this study. The sixteen SFs are used in mainstream commercial docking tools and/or were developed in academia. The functions were recently compared against each other in a study conducted by Cheng et al. The set includes 9 empirical SFs, 4 knowledge-based SFs, and 3 force-field SFs.
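A minimal sketch of this stage-wise procedure (BsN-Score style), assuming a scikit-learn MLPRegressor base learner; the shrinkage ν = 0.001 follows the note, the hidden-layer and feature-subset sizes follow the values quoted on the tuning slides, and the first network's feature set is an assumption.

```python
# Hedged sketch of boosting an ensemble of neural networks on accumulated residuals.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def train_boosted_nns(X, y, n_nets=2000, hidden=20, shrinkage=0.001, alpha=0.001):
    """Stage-wise fitting: each network is trained on the current residuals."""
    n, P = X.shape
    p = max(1, P // 3)
    residuals = y.astype(float).copy()
    ensemble = []
    for stage in range(n_nets):                       # stop at a predefined limit L (= n_nets)
        if stage == 0:
            rows, cols = np.arange(n), np.arange(P)   # first net: all complexes (feature choice assumed)
        else:
            rows = rng.integers(0, n, size=n)         # bootstrap sample of complexes
            cols = rng.choice(P, size=p, replace=False)  # random subset of p < P features
        net = MLPRegressor(hidden_layer_sizes=(hidden,), alpha=alpha, max_iter=500)
        net.fit(X[rows][:, cols], residuals[rows])    # target = current residuals
        # Update residuals with a shrunk version of this network's predictions (nu < 1).
        residuals -= shrinkage * net.predict(X[:, cols])
        ensemble.append((net, cols))
    return ensemble

def predict_boosted(ensemble, X_new, shrinkage=0.001):
    """Final prediction: sum of the shrunk predictions of all networks in the ensemble."""
    return sum(shrinkage * net.predict(X_new[:, cols]) for net, cols in ensemble)
```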