20131019 Biophysics Young Researchers Journal Club

DNABind: A hybrid algorithm for structure-based prediction of
DNA-binding residues by combining machine learning- and
template-based approaches. Proteins. 2013 Jun 5.

20131019
Biophysics Young Researchers, Kansai Branch, Journal Club

Topics
Prediction of protein-DNA binding residues
Network statistics
Machine learning

Result: DNABind, a hybrid of machine learning and template-based
approaches, showed excellent performance in predicting DNA-binding residues.
[Figure: example predictions by the template-based method, DNABind, and machine learning for EcoRV (1RVE:A) and CprK (3E6C:C). Legend: query protein, template protein, TP, [label lost], FN.]

True positive residues.
DNABind improves classification.

Aim

Protein-DNA interactions are important in cell biology.
Determining them experimentally is time-consuming and costly.

Computational approaches are desirable.

Computational approaches
Data bank (PDB): http://www.rcsb.org/pdb/home/home.do
Characteristics of binding residues
Solvent exposed
Higher electrostatic potential
More conserved
Hotspots as clusters of conserved residues

Structural properties (DNA-binding residues vs. the rest of the surface)
Packing density
Surface curvature
B-factor
Residue fluctuation
Hydrogen bond donors

Computational algorithms
Feature-based
Extract effective features

Template-based
Align the query against templates and retrieve the best match

Template!!

Features used in machine learning
Structure-based
PSSM (position specific scoring matrix)
Evolutionary conservation
Solvent accessibility
Local geometry (depth and protrusion index)
Topological features
degree, closeness, betweenness, clustering coefficient

Relative position (distance to centroid)
Statistical potential (Boltzmann distribution)

Sequence-based (more difficult than structure)
Amino acid identity
Residue physicochemical properties
polarity, secondary structure, molecular volume, codon diversity, electrostatic charge

Predicted structure (no 3D structure needed!)
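The "statistical potential (Boltzmann distribution)" feature listed above can be illustrated with an inverse-Boltzmann calculation, E = -kT ln(f_obs / f_ref). This is only a minimal sketch: the residue counts are hypothetical, the function name is made up for illustration, and the potential actually used by DNABind may be defined differently.

```python
import math

def statistical_potential(observed, reference, kT=1.0):
    """Inverse-Boltzmann energy per category: E = -kT * ln(f_obs / f_ref)."""
    obs_total = sum(observed.values())
    ref_total = sum(reference.values())
    energies = {}
    for key in observed:
        f_obs = observed[key] / obs_total      # frequency at the DNA interface
        f_ref = reference[key] / ref_total     # frequency in the reference state
        energies[key] = -kT * math.log(f_obs / f_ref)
    return energies

# Hypothetical counts, for illustration only (not data from the paper).
observed = {"ARG": 30, "LYS": 20, "ASP": 5, "LEU": 45}    # residues at the interface
reference = {"ARG": 10, "LYS": 10, "ASP": 10, "LEU": 70}  # residues on the whole surface

energies = statistical_potential(observed, reference)
for residue, energy in sorted(energies.items(), key=lambda item: item[1]):
    print(f"{residue}: {energy:+.3f} kT")
```

A negative energy means the residue type is enriched at the interface relative to the reference. Zero counts would need pseudocounts, which this sketch leaves out.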

Features used in machine learning
Structure-based
PSSM
Relative solvent accessibility
Depth and protrusion index
Topological features
Distance to centroid
Statistical potentials

Sequence-based
PSSM
Predicted structures
Amino acid indices
Statistical potentials

Build the machine learning model (SVM)

Template-based approach
Used in image recognition, etc.
Example: recognizing faces in a camera image.
[Figure labels: Template, Match]

Template-based prediction
Template-based
Structural alignment and statistical potential
Binding residues are predicted only if the target protein
is judged to be a DNA-binding protein.

312 templates were selected.

Network

Degree is a commonly used measure to reflect the local
connectivity of a node.
Closeness is a global centrality metric used to determine
how critical a residue is in a residue interaction network.
Betweenness of residue i is defined to be the sum of the
fraction of shortest paths between all pairs of residues
that pass through residue i.
Motifs, hubs, and communities are also important…

The clustering coefficient (transitivity) of a vertex quantifies how close
its neighbors are to being a clique: the probability that two
neighbors of the vertex are themselves connected.
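The four topological measures defined above can be sketched in plain Python on a toy undirected graph. The graph, node names, and function names are hypothetical. Closeness is normalized as (reachable nodes) / (sum of distances), which is one common convention and not necessarily the one used in the paper. Betweenness counts each unordered pair once.

```python
from collections import deque

def bfs_counts(graph, source):
    """Distances from source and the number of shortest paths to each node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:                 # first time nb is reached
                dist[nb] = dist[node] + 1
                sigma[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:     # node precedes nb on a shortest path
                sigma[nb] += sigma[node]
    return dist, sigma

def degree(graph, v):
    return len(graph[v])

def closeness(graph, v):
    dist, _ = bfs_counts(graph, v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

def clustering_coefficient(graph, v):
    nbs = list(graph[v])
    k = len(nbs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbs[j] in graph[nbs[i]])
    return 2.0 * links / (k * (k - 1))

def betweenness(graph, v):
    """Sum over pairs (s, t) of the fraction of shortest s-t paths through v."""
    info = {s: bfs_counts(graph, s) for s in graph}
    dist_v, sigma_v = info[v]
    others = [n for n in graph if n != v]
    total = 0.0
    for i, s in enumerate(others):
        dist_s, sigma_s = info[s]
        for t in others[i + 1:]:
            if v in dist_s and t in dist_s and t in dist_v:
                if dist_s[v] + dist_v[t] == dist_s[t]:
                    total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total

# Toy residue contact network (hypothetical): a triangle A-B-C with a tail C-D-E.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
         "D": {"C", "E"}, "E": {"D"}}

print(degree(graph, "C"), closeness(graph, "C"),
      clustering_coefficient(graph, "C"), betweenness(graph, "C"))
```

In this toy graph, node C bridges the triangle and the tail, so every shortest path between {A, B} and {D, E} passes through it.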

Network example: human protein interactome
Scale-free
Small-world
Cluster
Power law (Pareto distribution)

Bioinformatics. 2012 Jan 1;28(1):84-90.

Machine learning
Example: spam
4601 samples, 57 parameters.
Classification: spam or non-spam

Machine learning
Support vector machine (SVM)
Decision tree
RandomForest
Logistic regression
LASSO (Elastic net and Ridge)
Neural networks (Deep learning)
Evolutionary algorithm
Gaussian process
k nearest neighbor
Clustering
Bayesian networks
Association rule learning
Inductive logic programming (ILP)

Support vector machine (SVM)
Finds a hyperplane that separates the groups.
Kernel method: turns a non-linear problem into a linear one.
Easy to run.
Computationally expensive.
Tuning is very difficult.
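The "non-linear to linear" idea behind the kernel method can be shown with a small sketch. It checks that a degree-2 polynomial kernel equals an ordinary dot product after an explicit feature map, and that points inside and outside a circle become linearly separable in that feature space. The points and the hand-picked separating plane are assumptions for illustration; a trained SVM would instead choose the maximum-margin plane.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def feature_map(x):
    """Explicit map phi with phi(x) . phi(y) == (x . y)^2 for 2-D inputs."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

# Hypothetical data: points inside vs. outside the unit circle.
# No straight line in the original 2-D space separates the two groups.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

# In feature space the plane z1 + z3 = 1 (that is, x1^2 + x2^2 = 1) separates them.
w, b = (1.0, 0.0, 1.0), -1.0

def decision(x):
    return dot(w, feature_map(x)) + b

for x in inner + outer:
    print(x, "outer" if decision(x) > 0 else "inner")
```

The kernel lets the classifier work with the feature-space dot product without ever building the mapped vectors, which matters when the feature space is large.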

Decision tree
Splits the data recursively into a tree of decision rules.
Easy to understand graphically.
Performance is not so good.
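A single node of a decision tree can be sketched as a search for the threshold that minimizes the weighted Gini impurity. Gini is one common splitting criterion and is assumed here; the feature values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n            # fraction of class 1
    return 2.0 * p * (1.0 - p)

def best_split(xs, ys):
    """Return (threshold, weighted Gini) of the best rule 'x <= threshold'."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0      # candidate threshold between neighboring values
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (thr, score)
    return best

# Hypothetical feature (e.g. relative solvent accessibility) and binding labels.
xs = [0.05, 0.10, 0.20, 0.45, 0.60, 0.80]
ys = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(xs, ys)
print(threshold, impurity)
```

A full tree repeats this search on each resulting subset until a stopping rule is met.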

RandomForest
Make many decision trees.
Much more accurate.
Somewhat time-consuming.

Logistic regression
Widely used by medical researchers…
Easy to use, but tuning is very difficult.
(to tell the truth…)
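As a rough illustration of what logistic regression fits, the sketch below estimates p(y = 1 | x) = sigmoid(w x + b) for one feature by batch gradient descent on the log-loss. The data, learning rate, and epoch count are assumptions; statistical packages normally use a Newton-type solver instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit w and b of sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical, deliberately non-separable data (the labels overlap near x = 0),
# so the maximum-likelihood estimate is finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print(w, b, sigmoid(w * 2.0 + b))
```

The fitted w is the log-odds change per unit of x, which is why the model is easy to interpret.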

LASSO, Elastic net, and Ridge regression
Least Absolute Shrinkage and Selection Operator

LASSO
Elastic Net
Ridge
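The three methods differ only in the penalty added to the least-squares loss. One common way to write them together is the elastic-net parameterization used by the glmnet package. The symbols below (n samples, response y, design matrix X, coefficients β, penalty strength λ, mixing parameter α) are notation assumed here, not taken from the slides.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda \left[ \alpha\,\lVert \beta \rVert_1
  + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right],
\qquad
\alpha = 1:\ \text{LASSO},\quad
\alpha = 0:\ \text{Ridge},\quad
0 < \alpha < 1:\ \text{Elastic Net}
```

The L1 term can shrink coefficients exactly to zero (variable selection), while the L2 term only shrinks them toward zero.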

Neural networks
Modeled on the mammalian brain (perceptron).
Multiple hidden layers.
Deep learning is a hot topic!
(hard to understand…)

http://opencv.jp/opencv-1.0.0/document/opencvref_ml_nn.html
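Why hidden layers matter can be seen on XOR, which no single perceptron can compute but a network with one hidden layer can. In this sketch the weights are set by hand for illustration; a real network would learn them by backpropagation.

```python
def step(z):
    """Threshold activation of a classic perceptron."""
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    # Hidden layer: one unit computes OR, the other computes AND.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output unit fires when OR is on and AND is off, which is XOR.
    return step(h_or - h_and - 0.5)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_network(a, b) for a, b in inputs]
print(outputs)
```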

n-fold cross validation
To evaluate how the results of a statistical analysis will
generalize to an independent data set.

[Figure: data split into training data and a held-out test fold (Test 1)]
Leave-one-out CV
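The splitting step of n-fold cross validation can be sketched as follows. The function name is hypothetical, and the folds are contiguous for simplicity; in practice the samples are usually shuffled first. Setting k equal to the number of samples gives leave-one-out CV.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in k_fold_indices(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

Each sample is used for testing exactly once, and the k test scores are averaged to estimate performance on unseen data.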

Performance

Metric     SVM    Tree   RandomForest  LASSO  Elastic net  Ridge  Logistic  nnet
Recall     0.917  0.872  0.927         0.894  0.892        0.852  0.893     0.930
Precision  0.948  0.914  0.954         0.932  0.926        0.926  0.930     0.935
F          0.932  0.893  0.940         0.913  0.911        0.887  0.911     0.932
MCC        0.890  0.826  0.902         0.858  0.856        0.821  0.856     0.888
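Recall, precision, the F-measure, and the Matthews correlation coefficient (MCC) are all computed from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and unrelated to the scores reported here.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification scores from confusion-matrix counts."""
    recall = tp / (tp + fn)                     # fraction of positives found
    precision = tp / (tp + fp)                  # fraction of predictions that are correct
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"recall": recall, "precision": precision,
            "f": f_measure, "mcc": mcc}

# Hypothetical counts for illustration.
scores = classification_metrics(tp=90, fp=10, fn=10, tn=90)
print(scores)
```

The sketch does not guard against empty classes (division by zero). MCC uses all four counts, so it is less inflated than the F-measure when the classes are imbalanced.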

Statistical features of structure
A: Binding residues are highly solvent accessible.
B, C: Binding residues have low depth and high protrusion.
D-G: Little difference in the network features.
H: Binding residues are closer to the centroid.

Performance

A higher TM-score is required for good prediction.

TM-score measures the similarity between two protein structures.
A score below 0.2 indicates a random relationship; a score above 0.5 indicates highly related structures.
Proteins. 2004 Dec 1;57(4):702-10.
Nucleic Acids Res. 2005 Apr 22;33(7):2302-9.
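For reference, the TM-score of one fixed superposition is the length-normalized sum of 1 / (1 + (d_i / d0)^2) over aligned residue pairs, with d0 = 1.24 (L - 15)^(1/3) - 1.8 for a target of length L. The sketch below evaluates that sum only. The reported TM-score is the maximum over superpositions, which is not searched here. The function names and the lower bound on d0 are assumptions added so that short chains do not fail.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms)."""
    if target_length <= 15:
        raise ValueError("d0 is undefined for target_length <= 15")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)   # guard for very short chains (assumption of this sketch)

def tm_score(distances, target_length):
    """TM-score of one superposition from aligned-pair distances (angstroms)."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# Hypothetical cases for a 100-residue target.
print(tm_score([0.0] * 100, 100))   # every residue aligned perfectly
print(tm_score([0.0] * 50, 100))    # only half of the residues aligned
```

Because the sum is divided by the target length and not by the number of aligned pairs, unaligned residues lower the score.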

Performance
Comparison among ML, TL, and DNABind.

Comparison between DNABind and other software.
