Barcelona sabatica

Protein loop classification using Artificial Neural Networks

Armando Vieira¹ and Baldomero Oliva²
¹ ISEP and Centro de Física Computacional, Coimbra, Portugal
www.defi.isep.ipp.pt/~asv
² Structural Bioinformatics Laboratory (GRIB), IMIM/Universitat Pompeu Fabra, Barcelona, Spain

BIOINFORMATICS
joining two worlds apart

Outline
Brief review of protein structure
Statement of the problem and why it is so hard
Data pre-processing, corrections, updates and beyond multiple alignments…
Neural Networks in protein structure prediction
HLVQ
Results and future work

Proteins

All proteins are chains built from 20 kinds of amino acids
Not all chains of amino acids are proteins
Proteins fold rapidly and repeatedly
Proteins are the machinery of life
Essential to all (known) organisms

The Gist of it

Amino acid sequence → Physical structure → Function

Typical globular protein
MEKKEFHIVAETG
IHARPATLLVQTA
SLFNSDINLETLG
KSVNLKSIMGVMS
LGVGQGSDVTITV
DGADEADGMAAI
VETLQLQGLAQ

ψ (rows, +180 at top to -180 at bottom) versus φ (columns, -180 at left to +180 at right):

b b b p o M e e e
b b b p o M M e e
b b b p . l l s e
a a a T . l l g N
N a a a . U l g N
N a a a . U g g N
I a a a . G G G I
e F F F o e e e e
b b b p o e e e e
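
The grid above can be read as a lookup table from a pair of backbone angles (φ, ψ) to a conformation letter. The following is a minimal sketch of that lookup, assuming the nine rows and nine columns are uniform 40° bins (the slide does not state the bin edges); the names `GRID` and `conformation_letter` are hypothetical.

```python
# Letter grid copied from the slide: first row is psi near +180,
# last row is psi near -180; columns run from phi = -180 to phi = +180.
GRID = [
    "bbbpoMeee",
    "bbbpoMMee",
    "bbbp.llse",
    "aaaT.llgN",
    "Naaa.UlgN",
    "Naaa.UggN",
    "Iaaa.GGGI",
    "eFFFoeeee",
    "bbbpoeeee",
]


def conformation_letter(phi, psi):
    """Return the grid letter for angles given in degrees.

    ASSUMPTION: bins are uniform (360 / 9 = 40 degrees wide); the slide
    only shows the grid, not its bin edges.
    """
    n = len(GRID)
    width = 360.0 / n
    # Clamp so that an angle of exactly +180 (phi) or -180 (psi) stays in range.
    col = min(int((phi + 180.0) // width), n - 1)
    row = min(int((180.0 - psi) // width), n - 1)
    return GRID[row][col]


helix_letter = conformation_letter(-60.0, -45.0)    # typical alpha-helix angles
strand_letter = conformation_letter(-120.0, 130.0)  # typical beta-strand angles
```

Under this assumption, typical helical angles fall in the "a" cells and typical strand angles in the "b" cells, which matches the usual layout of a Ramachandran map.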

Ramachandran Alphabet
[Figure: Ramachandran plot of ψ versus φ, both axes from -180° to 180°, with regions labeled B, A, G and E]

5-letter alphabet
Residue Sequence 3° Structure

MEKKEFHIVAET ACCDECBAABDE
GIHARPATLLVQT CBDABCDBEABD
ASLFNSDINLETL BCBDBAEBDBDB
GKSVNLKSIMGV AEBABDCBBDBA
MSLGVGQGSDVT DDCBDBCBDBEB
ITVDGADEADGM DBCBBDCAABDE
AAIVETLQLQGLA DCDCEAABACAA
Q... AADC…

What shall we do?
• Ab initio: Quantum Mechanics + big computers + large # configurations = huge problems…

• Machine Learning: use known cases to learn a suitable map: sequence → structure

Artificial Neural Networks
• A problem-solving paradigm modeled after the physiological functioning of the human brain.

• Neurons in the brain are modeled by computational nodes, and synapses by weighted connections between them.

• The firing of a neuron is modeled by input, output, and threshold functions.

• The network “learns” from problems to which the answers are known (supervised learning).

• The network can then produce answers to entirely new problems of the same type.
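
As an illustration of the points above, here is a minimal sketch of a single node with a threshold function, trained by supervised learning with the classic perceptron rule. It is a toy example, not the network used in this work; the logical AND task and all names (`step`, `predict`, `train`) are assumptions chosen for brevity.

```python
def step(total):
    """Threshold function: the node fires only if its summed input is positive."""
    return 1 if total > 0 else 0


def predict(weights, bias, inputs):
    """Weighted sum of the inputs followed by the threshold function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(total)


def train(examples, epochs=50, rate=1.0):
    """Perceptron rule: nudge the weights whenever the known answer is missed."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Known cases (inputs, answer) for the toy task: logical AND.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND_EXAMPLES)
outputs = [predict(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
```

A single node can only separate classes with a straight line; the hidden layers of a full network are what allow more complicated maps to be learned.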

Neural Networks
Input Layer → Hidden Layers → Output Layer

Overfitting – high risk!

A less complicated hypothesis can have a lower error rate on unseen data

Hidden Layer Vector Quantization (HLVQ)

[Figure: two panels, "Traditional NN" and "HLVQ", each showing points of two classes (o and x)]

Main advantage: detects outliers and corrects their predictions
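
The slides do not spell out the HLVQ algorithm, so the following is only a rough sketch of the general vector quantization idea: a hidden-layer activation vector is assigned the class of its nearest code vector, and a point far from every code vector is flagged as a possible outlier. The code vectors, the `outlier_radius` threshold and all function names are hypothetical.

```python
import math


def nearest_code_vector(hidden, code_vectors):
    """Return (label, distance) of the code vector closest to `hidden`."""
    best_label, best_distance = None, math.inf
    for label, vectors in code_vectors.items():
        for vector in vectors:
            distance = math.dist(hidden, vector)  # Euclidean distance
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label, best_distance


def classify(hidden, code_vectors, outlier_radius):
    """Nearest code vector decides the class; far-away points are flagged.

    ASSUMPTION: a fixed distance threshold marks outliers. The slides only
    say that outliers are detected, not how.
    """
    label, distance = nearest_code_vector(hidden, code_vectors)
    return label, distance > outlier_radius


# Hypothetical 2-D hidden-layer activations, one code vector per class.
CODE_VECTORS = {"link": [(0.1, 0.2)], "hairpin": [(0.9, 0.8)]}
typical = classify((0.2, 0.1), CODE_VECTORS, outlier_radius=0.5)
unusual = classify((0.5, 3.0), CODE_VECTORS, outlier_radius=0.5)
```

Here `typical` lies close to the "link" code vector, while `unusual` is far from both and is flagged so that its prediction can be treated with caution.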

Loop Types

α-α: α-helix - α-helix
α-β: α-helix - β-strand
β-α: β-strand - α-helix

β-hairpin: β-strand - β-strand
β-link: β-strand - β-strand

α−α
Similar conformation aa{b}aa / aa{p}aa
Identical geometry (4,6)(0,45)(45,90)(180,225)

Pro 75%

Ser 75%

1.3.1 aa{p}aa
1.1.2 aa{b}aa

© Baldomero Oliva

ArchDB database

~ 20 000 loops classified into ~ 3000 classes.
EE-3.4.1
Loop type - loop size . consensus . motif
TASK: classify a loop from sequence alone
If that is not possible, extract as much information as possible
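
The class identifier shown above, EE-3.4.1, follows the pattern loop type - loop size . consensus . motif. A minimal sketch of splitting such an identifier into its four fields is given below; it assumes the three fields after the loop type are integers, and the names `LoopClass` and `parse_loop_class` are hypothetical rather than part of ArchDB.

```python
from typing import NamedTuple


class LoopClass(NamedTuple):
    loop_type: str   # e.g. "EE"
    loop_size: int
    consensus: int
    motif: int


def parse_loop_class(identifier: str) -> LoopClass:
    """Split 'TYPE-size.consensus.motif' into its four fields.

    ASSUMPTION: size, consensus and motif are always integers.
    """
    loop_type, _, rest = identifier.partition("-")
    size, consensus, motif = rest.split(".")
    return LoopClass(loop_type, int(size), int(consensus), int(motif))


example = parse_loop_class("EE-3.4.1")
```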

Problems

• Coding of amino acids

• Huge search space, sparsely populated

• How to assign the loop classes?

• High dimensionality → Large Networks → poor generalization

Amino acid coding: the classical way
A → (1, 0, …, 0)
C → (0, 1, …, 0)
Y → (0, 0, …, 1)

Useful but not efficient!!!
I am working to improve it…
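
The classical coding above can be sketched as follows: each of the 20 amino acids gets a vector with a single 1. The alphabetical ordering of the one-letter codes is an assumption that happens to match the slide (A first, C second, Y last); `one_hot` and `encode` are hypothetical names.

```python
# ASSUMPTION: one-letter codes in alphabetical order, so A is position 0,
# C is position 1 and Y is position 19, as in the slide.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Return a 20-component vector with a single 1 at the residue's position."""
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue)] = 1
    return vector


def encode(sequence):
    """Concatenate the one-hot vectors of every residue in the sequence."""
    return [bit for residue in sequence for bit in one_hot(residue)]


inputs = encode("MEKKEF")  # first six residues of the example protein
```

A sequence of n residues becomes a vector of 20 × n components, almost all of them zero, which is one way to see why this coding leads to high-dimensional, sparsely populated inputs.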

Theory; but how about applications?!

β-β links and β-β hairpins from sequence

HLVQ (MLP), %       Predicted β-β link   Predicted β-β hairpin
Real β-β link       88.4 (79.4)          11.6 (20.6)
Real β-β hairpin    12.5 (16.1)          87.5 (83.9)

Prediction of all loop types from sequence alone

Real \ Predicted (%)   β-β lk   α-β    β-β hp   β-α    α-α
β-β lk                 45.9     28.5   3.7      19.8   2.1
α-β                    8.8      67.4   1.2      18.0   4.6
β-β hp                 0.4      0.9    96.1     2.1    0.5
β-α                    4.4      6.2    2.4      79.5   7.6
α-α                    4.0      15.7   1.3      20.3   58.6
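
The table above can be summarized by reading off its diagonal (the percentage of each real loop type that is predicted correctly) and, for each type, the most frequent wrong prediction. The sketch below does this with the values from the slide, assuming rows are the real classes and columns the predicted ones, as in the previous table. The mean is unweighted, because the slide does not give the number of loops in each class.

```python
LABELS = ["β-β link", "α-β", "β-β hairpin", "β-α", "α-α"]

# Percentages copied from the slide; ASSUMPTION: row = real class,
# column = predicted class (each row sums to about 100).
CONFUSION = [
    [45.9, 28.5, 3.7, 19.8, 2.1],
    [8.8, 67.4, 1.2, 18.0, 4.6],
    [0.4, 0.9, 96.1, 2.1, 0.5],
    [4.4, 6.2, 2.4, 79.5, 7.6],
    [4.0, 15.7, 1.3, 20.3, 58.6],
]


def per_class_accuracy(matrix):
    """Diagonal entries: share of each real class predicted correctly."""
    return [row[i] for i, row in enumerate(matrix)]


def most_common_confusion(matrix, labels):
    """For each real class, the wrong class it is most often predicted as."""
    result = {}
    for i, row in enumerate(matrix):
        wrong = [(value, labels[j]) for j, value in enumerate(row) if j != i]
        result[labels[i]] = max(wrong)[1]
    return result


accuracies = per_class_accuracy(CONFUSION)
mean_accuracy = sum(accuracies) / len(accuracies)  # unweighted mean
confusions = most_common_confusion(CONFUSION, LABELS)
```

The unweighted mean of the diagonal is 69.5%. β-β hairpins are recognized best, and β-β links are most often mistaken for α-β loops.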

What does it all mean?
Given a loop residue sequence, we can (usually) identify its native structure.
Not ab initio: we cannot tell the structure of a novel sequence.
HLVQ is superior to MLP

Future Work

Better coding of amino acids
Larger sequences / low complexity
Going beyond structure
A clever alphabet that exploits similarities
Multiobjective Genetic Algorithms

Beyond Multiple Alignments

• Alignments are good … but expensive and boring ...
• Information contained in a multiple alignment can, in principle, be expressed using an adequate amino acid coding scheme

• How? Genetic Algorithm

Coded Amino Acids

Alanine (A) Arginine (R) Asparagine (N) Aspartic Acid (D) Cysteine (C)

Glutamic Acid (E) Glutamine (Q) Glycine (G) Histidine (H) Isoleucine (I)

Leucine (L) Lysine (K) Methionine (M) Phenylalanine (F) Proline (P)

Serine (S) Threonine (T) Tryptophan (W) Tyrosine (Y) Valine (V)
http://www.chemie.fu-berlin.de/chemistry/bio/

ArchDB database
Protein Data Bank (PDB), http://www.rcsb.org, contains ~ 25 000 proteins with known structure, out of ~ 10^6 entries in SWISS-PROT

ArchDB ~ 20 000 classified loops
