Proteins. 2013 Nov;81(11):1885-99. doi: 10.1002/prot.24330. Epub 2013 Aug 16.
DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches.
Liu R, Hu J.
2013-10-19 Biophysics Young Researchers Journal Club
1. DNABind: A hybrid algorithm for structure-based prediction of
DNA-binding residues by combining machine learning- and
template-based approaches. Proteins. 2013 Jun 5.
2013-10-19
Biophysics Young Researchers, Kansai Branch, Journal Club
4. Result: DNABind, a hybrid of machine learning and template-based
approaches, showed excellent performance in predicting DNA-binding residues.
[Figure: predicted DNA-binding residues on EcoRV (1RVE:A) and CprK (3E6C:C). Panels: Template, Machine learning, DNABind. Legend: query protein, template protein, TP, FN.]
True positive residues.
DNABind improves classification.
5. Aim
Protein-DNA interactions are important in cell biology.
Determining them experimentally is time-consuming and costly.
Computational approaches are desirable.
6. Computational approaches
Data bank (PDB)
Characteristics of binding residues
Solvent-exposed
Higher electrostatic potential
More conserved
Hotspots as clusters of conserved residues
Structural properties (DNA-binding residue vs surface)
Packing density
Surface curvature
B-factor
Residue fluctuation
Hydrogen bond donor
http://www.rcsb.org/pdb/home/home.do
10. Features used in machine learning
Structure-based
PSSM (position specific scoring matrix)
Evolutionary conservation
Solvent accessibility
Local geometry (depth and protrusion index)
Topological features
degree, closeness, betweenness, clustering coefficient
Relative position (distance to centroid)
Statistical potential (Boltzmann distribution)
Sequence-based (more difficult than structure-based)
Amino acid identity
Residue physicochemical properties
polarity, secondary structure, molecular volume, codon diversity, electrostatic charge
Predicted structure (no 3D structure needed!)
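The "relative position (distance to centroid)" feature can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each residue is represented by a single 3D coordinate (for example its C-alpha atom), and the function names are hypothetical.

```python
import math

def centroid(coords):
    """Geometric center of a list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(point[axis] for point in coords) / n for axis in range(3))

def distance_to_centroid(coords):
    """Distance of every residue from the protein centroid.

    Assumption: one representative coordinate per residue (e.g. C-alpha).
    """
    center = centroid(coords)
    return [math.dist(point, center) for point in coords]

# Toy example: four "residues" on the corners of a 2 x 2 square.
toy_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
toy_distances = distance_to_centroid(toy_coords)
```

Each value could then be used as one entry of a residue's feature vector, possibly after normalizing by protein size.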
11. Features used in machine learning
Structure-based
PSSM
Relative solvent accessibility
Depth and protrusion index
Topological features
Distance to centroid
Statistical potentials
Sequence-based
PSSM
Predicted structures
Amino acid indices
Statistical potentials
Build the machine learning model (SVM)
15. Network
Degree is a commonly used measure to reflect the local
connectivity of a node.
Closeness is a global centrality metric used to determine
how critical a residue is in a residue interaction network.
Betweenness of residue i is defined to be the sum of the
fraction of shortest paths between all pairs of residues
that pass through residue i.
Motifs, hubs, and communities are also important…
Clustering coefficient (transitivity) quantifies how close
a node's neighbors are to being a clique, i.e., the probability
that two neighbors of a vertex are connected to each other.
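The four topological features above can be computed on a small residue interaction network as in the sketch below. It is a pure-Python illustration under assumptions: the graph is undirected, unweighted, and connected, the toy contact list is made up, and betweenness is left unnormalized (computed with Brandes' algorithm). How the paper defines residue contacts is not stated on these slides.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree(adj, node):
    """Number of direct contacts of a residue."""
    return len(adj[node])

def closeness(adj, node):
    """(reachable nodes) / (sum of shortest-path lengths from node)."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

def clustering(adj, node):
    """Fraction of neighbor pairs that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def betweenness(adj):
    """Unnormalized betweenness for all nodes (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, dist = [], {s: 0}
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back toward s
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2.0 for v, b in score.items()}

# Toy contact map (hypothetical): a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
```

In this toy graph, residue C has the highest degree and betweenness because every path between the triangle and the tail passes through it.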
16. Network sample; human protein interactome
Scale-free
Small-world
Cluster
Power law (Pareto distribution)
Bioinformatics. 2012 Jan 1;28(1):84-90.
18. Machine learning
Support vector machine (SVM)
Decision tree
Random forest
Logistic regression
LASSO (Elastic net and Ridge)
Neural networks (Deep learning)
Evolutionary algorithm
Gaussian process
k-nearest neighbors
Clustering
Bayesian networks
Association rule learning
Inductive logic programming (ILP)
19. Support vector machine (SVM)
Finds a hyperplane that separates the groups.
Kernel method: maps a non-linear problem to a linear one.
Easy to use.
Computationally expensive.
Tuning is very difficult.
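To make the "separating hyperplane" idea concrete, here is a minimal sketch of a linear soft-margin SVM trained by subgradient descent on the hinge loss. It is an illustration only: it uses no kernel, the hyperparameters (lam, lr, epochs) and the toy data are arbitrary assumptions, and in practice one would normally use an established SVM library instead.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b))).

    X: list of feature vectors, y: list of labels in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (hypothetical binding vs. non-binding).
X_toy = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
y_toy = [1, 1, 1, -1, -1, -1]
w_toy, b_toy = train_linear_svm(X_toy, y_toy)
```

The regularization weight lam and the learning rate are exactly the kind of settings that make tuning difficult. A kernel SVM adds kernel parameters on top of them.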
23. LASSO, Elastic net, and Ridge regression
Least Absolute Shrinkage and Selection Operator
LASSO
Elastic Net
Ridge
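The three methods differ only in the penalty added to the least-squares loss. The standard textbook forms are given below. The notation is an assumption, not taken from the slides: y is the response vector, X the feature matrix, beta the coefficients, and lambda, lambda_1, lambda_2 non-negative tuning parameters.

```latex
\begin{align}
\hat{\beta}_{\text{LASSO}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\hat{\beta}_{\text{elastic net}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \\
\hat{\beta}_{\text{ridge}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
\end{align}
```

The L1 term can drive coefficients exactly to zero, which is why LASSO also acts as a feature selector. The L2 term only shrinks them.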
24. Neural networks
Artificial model of the mammalian brain (perceptron).
Multiple hidden layers.
Deep learning is a hot topic!!
(hard to understand…)
http://opencv.jp/opencv-1.0.0/document/opencvref_ml_nn.html
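As a minimal sketch of the perceptron mentioned above, the code below trains a single unit with a step activation to reproduce the logical AND function. It has no hidden layers, so it illustrates only the basic building block, not a multi-layer or deep network. The learning rate and epoch count are arbitrary assumptions.

```python
def step(value):
    """Step activation: fire (1) only when the weighted input is positive."""
    return 1 if value > 0 else 0

def train_perceptron(samples, targets, lr=1.0, epochs=20):
    """Classic perceptron learning rule for a single unit."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            out = step(sum(wj * xj for wj, xj in zip(w, x)) + b)
            err = t - out  # +1, 0, or -1
            w = [wj + lr * err * xj for wj, xj in zip(w, x)]
            b += lr * err
    return w, b

def perceptron_predict(w, b, x):
    return step(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Logical AND is linearly separable, so the rule converges.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]
w_and, b_and = train_perceptron(and_inputs, and_targets)
```

A single unit like this can only learn linearly separable functions. Hidden layers are what allow a network to represent non-linear decision boundaries.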
25. n-fold cross validation
To evaluate how the results of a statistical analysis will
generalize to an independent data set.
Train data
Test 1
Leave-one-out CV
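The splitting scheme behind n-fold cross validation can be sketched as below. This is a minimal illustration with a hypothetical function name: samples are split in their given order without shuffling, and each fold serves once as test data while the rest are used for training. Leave-one-out CV is the special case where the number of folds equals the number of samples.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 3-fold split of 6 samples.
three_fold = list(k_fold_indices(6, 3))

# Leave-one-out: as many folds as samples, one test sample per fold.
leave_one_out = list(k_fold_indices(4, 4))
```

In a real evaluation the model would be retrained on each training split and scored on the matching test split, and the scores averaged over folds.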
34. Statistical features of structure
A: Binding residues are highly solvent
accessible.
B, C: Binding residues have low depth and
high protrusion.
D-G: Little difference in the network features.
H: Binding residues are closer to the
centroid.
36. Performance
A higher TM-score is required for good prediction.
TM-score measures the similarity between the tertiary structures of two proteins.
A score < 0.2 indicates a random relationship; a score > 0.5 indicates highly related structures.
Proteins. 2004 Dec 1;57(4):702-10.
Nucleic Acids Res. 2005 Apr 22;33(7):2302-9.
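For reference, the TM-score of one fixed superposition can be computed as in the sketch below. It uses the standard length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8. The function names are hypothetical, and the sketch assumes the aligned residue pairs and their distances are already known. The full TM-score additionally maximizes this sum over superpositions, which is not done here.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms)."""
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8

def tm_score(distances, target_length):
    """TM-score for one given superposition.

    distances: distance (angstroms) between each aligned residue pair.
    target_length: number of residues in the target protein.
    """
    if target_length <= 15 or tm_d0(target_length) <= 0:
        # Assumption: the d0 formula is only meaningful for longer chains.
        raise ValueError("target too short for the d0 formula")
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length

# A perfect superposition of all 100 residues scores 1.0.
perfect = tm_score([0.0] * 100, 100)
```

Because the sum is divided by the target length, unaligned residues contribute nothing and lower the score.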