Protein Structure Alignment and Comparison

The ProCKSI-Server
An on-line Decision Support System for
Protein Structure Comparison

Natalio Krasnogor
www.cs.nott.ac.uk/~nxk
Natalio.Krasnogor@Nottingham.ac.uk

Interdisciplinary Optimisation Laboratory
Automated Scheduling, Optimisation & Planning Research Group
School of Computer Science and Information Technology

Centre for Integrative Systems Biology
School of Biology

Centre for Healthcare Associated Infections
Institute of Infection, Immunity & Inflammation

University of Nottingham
27th November 2008, University of Warwick

Outline
 Introduction
− Brief introduction to proteins
− Protein structures Comparison
− Methods
 ProCKSI
− Motivation
− External Methods
− USM & MAX-CMO
− Consensus building
 Results
− From a structural bioinformatics perspective
− From a Computational perspective
 Conclusions
 Acknowledgement


Introduction
www.procksi.org


What are Proteins?

 Proteins are
biological molecules
of primary
importance to the
functioning of living
organisms

 Perform many and
varied functions


Structural Proteins: the organism's basic building blocks, e.g.
collagen, nails, hair, etc.
Enzymes: biological engines which mediate a multitude of biochemical
reactions. Usually enzymes are very specific and catalyze only a
single type of reaction, but they can play a role in more than one
pathway.
Transmembrane proteins: they are the cell's housekeepers, e.g. by
regulating cell volume, extraction and concentration of small
molecules from the extracellular environment and generation of ionic
gradients essential for muscle and nerve cell function (the sodium/
potassium pump is an example)


Protein Structures

 Varying: size, shape, structure

 Structure determines their
biological activity

 “Nature's Robots”

 Understanding protein structure
is key to understanding function
and dysfunction


Components of Proteins

 Building Blocks:
− Amino Acids
− Common Basic Unit

Livingstone and Barton (1993)
• Distinct “side chains”
• 20 Amino Acid Types



• Thousands of different physicochemical and biochemical properties (AAIndex)
• Thus proteins are beautiful combinatorial beasts!


Protein Synthesis
 Amino Acid Sequences
− AAs polymerised into
Chains (Residues)
− Gene sequence
determines Protein
sequence
 Protein Structure
− Chains fold into
specific compact
structures
 Structure formation (folding)
is spontaneous
 Sequence determines
Structure
 Structure determines
function

Determining Protein Structures

 Protein Structure
determination is
slow and difficult
 Determining protein
sequence is
relatively easy
(Genomics)
 PDB vs Genbank

Thomas Splettstoesser


Comparing Protein Structures

• Proteins build the majority of cellular structures
and perform most life functions

• Extend knowledge about the protein universe:
– Understand interrelations
between structures and functions of proteins
through measured similarities
– Group (cluster) proteins by
structural similarities so as to infer commonalities

• Goal is to predict functions of proteins
from their structure, or design new
proteins for specific functions

• Considering any two objects:

What does “similar” mean?

Similar or not? How / Where similar?



Similarity comparison of protein structures is not trivial even though it is
obvious that proteins may share certain common patterns (motifs)

 Many different similarity comparison
methods available, each with its own
strengths and weaknesses

 Different concepts of similarity:
sequence vs. structural, local vs. global,
chemical role vs. biological function vs. evolution
sequence vs. …

 Different algorithms and implementations:
exact vs. approximation vs. heuristic,
local vs. global search

Maximum Contact Map Overlap
using e.g. Memetic algorithms,
Variable Neighbourhood Search, Tabu Search
Picture source: http://www.cathdb.info

Existing Approaches
A variety of structure comparison methodologies exist, e.g.:

•SSAP (Orengo & Taylor, 96)

•ProSup (Feng & Sippl, 96)

•DALI (Holm & Sander, 93)

•CE (Shindyalov & Bourne, 98)

•Max-CMO (Goldman, Papadimitriou, Istrail, Lancia, 99 & 2001)

•LGA (Zemla, 2003)

•USM (Krasnogor & Pelta, 2004)

•SCOP (Murzin, Brenner, Hubbard & Chothia, 95)

•CATH (Orengo, Michie, Jones, Jones, Swindells & Thornton, 97)

Computational Underpinning

•Dynamic programming (Taylor, 99)

•Comparison of distance matrices (Holm & Sander, 93, 96)

•Maximal common sub-graph detection (Artymiuk, Poirrette, Rice & Willett,
95)

•Geometrical matching (Wu, Schmidler, Hastie & Brutlag, 98)

•Root-mean-square-distances (Maiorov & Crippen, 94 – Cohen &
Sternberg,80)

•Other methods (e.g. Lackner, Koppensteiner, Domingues & Sippl, 99 –
Zemla, Vendruscolo, Moult & Fidelis, 2001)

A survey of various similarity measures can be found in (Koehl P:
Protein structure similarities. Curr Opin Struct Biol 2001, 11:348-353)


Some Observations
•No agreement on which of these is the best method
• Various difficulties are associated with each.
• They assume that a suitable scoring function can be defined for which
optimum values correspond to the best possible structural match between
two structures (clearly not always true, e.g. RMSD)
• Some methods cannot produce a proper ranking due to:
• ambiguous definitions of the similarity measures or
• neglect of alternative solutions with equivalent similarity values.

Structure comparison is at its core a multi-competence (multi-objective)
problem, but it is seldom treated as such, e.g.:
 ProSup (Feng & Sippl, 96) optimizes the number of equivalent residues with the RMSD being an
additional constraint (and not another search dimension).
 DALI (Holm & Sander, 93) combines various derived measures into one value, effectively
transforming a multi-objective problem into a (weighted) single objective one.


What/How are we comparing?

Models, Measures, Metrics & Methods

or other tasks...


Until very recently researchers would:
 Focus on steps 1-4 , often collapsed into one single
step
 Compare one algorithm against others on a given
data set
 Conclude that their algorithm “is best” for that data
set and write a paper
Meanwhile, in the real world…
 No method is best in all data sets.
 The biologist will only use the method (s)he is most
familiar with! Regardless of the suitability to his/her
problem.


Q: How do we change this reality?

A: We make it easy for the biologist to use the correct method
(and more)


ProCKSI
www.procksi.org


The ProCKSI-Server

ProCKSI: Protein Comparison, Knowledge,
Similarity, and Information
 Web Server for protein structure comparison

 Workbench / portal for established
methods and repositories for
protein structure information
– Integrates results from many
comparison methods in one place
– Home-grown comparison methods,
Max-CMO and USM (using contact
maps as their input)

 Decision Support System / analysis tool
– Visualises, compares and clusters all similarity measure results
– Incorporates all results and suggests a similarity consensus


The ProCKSI-Server

Minimise the Management Overhead for Experiments
• Upload your own dataset or download structures from the PDB repository
• Validate your PDB file, and extract desired models and chains
• Choose from multiple similarity comparison methods at one place (including
your own similarities) or don’t choose and use all!
• Submit and monitor the
progress of your experiment
• Integrate results from all
pair-wise comparisons
• Analyse and visualise results
from different similarity
comparison methods
• Combine results and produce a
similarity consensus profile
• Download desired results

[Architecture diagram labels: Overview Manager; Dataset Manager; Structure Manager; Task Managers; Task / Job Scheduling; Calculation Manager (Local, External); USM; MaxCMO; Similarity Comparison; Results Management; Requests and Results; Analysis Manager; DataBase / Filesystem]


Protein Comparison Methods United
 Home-grown methods:
− USM
− Max-CMO
 External methods:
− DaliLite
− FAST
− CE
− TMalign
− Vorolign
− URMS
 Additional informational sources:
− CATH, iHOP, RCSB, SCOP


Home-Grown Methods
• Representation of 3D protein structures as 2D contact maps
- Atoms that are far away in the linear chain,
come close together in the folded state

- If the distance between two atoms
i,j is below a threshold t, they are
said to form a contact

• Mathematical description of contact maps
- Calculation of all pairwise Euclidean distances between atoms i,j
- Translation into a binary, symmetrical
matrix, called the contact map C
[Figure: contact map, both axes labelled "Sequence of atoms"]

• Contact maps in ProCKSI
Input for the two main similarity measures:
- Universal Similarity Metric (USM)
- Maximum Contact Map Overlap (MaxCMO)
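
The contact map construction described above can be sketched in a few lines of Python. This is only an illustration, not ProCKSI's own code: the function name `contact_map`, the toy coordinates and the threshold value are assumptions made for the example, and the diagonal is left at 0 (self-contacts ignored), which the slides do not specify.

```python
import math

def contact_map(coords, threshold):
    """Build a binary, symmetric contact map from 3D coordinates.

    coords: list of (x, y, z) tuples, one per atom, in chain order.
    threshold: distance cut-off t; a pair closer than t forms a contact.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Euclidean distance between atoms i and j
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # keep the matrix symmetric
    return cmap

# Toy coordinates (made up for illustration, not taken from a PDB file).
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (3.0, 3.0, 0.0)]
toy_map = contact_map(toy_coords, threshold=4.0)
for row in toy_map:
    print("".join(map(str, row)))
```

In practice the choice of atoms (e.g. one representative atom per residue) and of the threshold are parameters of the contact map definition.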


 An Example of a contact map

1C7W.PDB



• Secondary structure elements can
be identified in the contact map:
− α-helix: wide bands on main diagonal
− β-sheet: parallel or perpendicular bands to main
diagonal

• Comparison of contact maps
- using different similarity measures, e.g.
number of alignments, overlap values,
information content, …

• Protein relationships
- Pair-wise comparison of multiple proteins
results in a (standardised) similarity matrix
- Comparison of all possible proteins describes
the protein universe
Protein 1NAT with α-helices and β-sheets 



• Maximum Contact Map Overlap (MaxCMO) method
is a specific measure of equivalence

- Number of aligned residues (dashed lines) and equivalent contacts
(aligned bows, called overlap)

- Overlap gives strong indication for topological similarity taking the
local environment into account
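
To make the notion of overlap concrete, the sketch below scores one given candidate alignment by counting the contacts that are matched in both maps. It does not search for the best alignment, which is the hard part of MaxCMO; the helper names `make_map` and `overlap` and the toy maps are hypothetical.

```python
def make_map(n, contacts):
    """Helper: n x n symmetric contact map from a list of (i, j) contacts."""
    m = [[0] * n for _ in range(n)]
    for i, j in contacts:
        m[i][j] = m[j][i] = 1
    return m

def overlap(cmap_a, cmap_b, alignment):
    """Count contacts shared by two contact maps under an alignment.

    alignment: list of (i, k) pairs meaning residue i of A is aligned
    with residue k of B, increasing in both coordinates.
    """
    shared = 0
    for x in range(len(alignment)):
        for y in range(x + 1, len(alignment)):
            (i, k), (j, l) = alignment[x], alignment[y]
            # contact i-j in A is matched by contact k-l in B
            if cmap_a[i][j] and cmap_b[k][l]:
                shared += 1
    return shared

# Toy example (made-up maps): A has 4 residues, B has 5.
map_a = make_map(4, [(0, 2), (1, 3)])
map_b = make_map(5, [(0, 3), (2, 4)])
candidate = [(0, 0), (1, 2), (2, 3), (3, 4)]
print(len(candidate), overlap(map_a, map_b, candidate))
```

Maximising this count over all order-preserving alignments is what the search heuristics are used for.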


1ash 1hlm

Two related proteins taken from the PDB which share a 6 helices structural motif.


Two locally and globally similar contact maps.


A candidate alignment between the contact maps of these protein structures.



• Universal Similarity Metric (USM) is the most concept/domain
independent measure in ProCKSI

- detects similarities between (quite) divergent structures

- based on the concept of Kolmogorov complexity

- compares the information content of two contact maps by compression
(NCD)



• Contact maps are the input to Universal Similarity Metric
(USM)

• Basic concept is Kolmogorov Complexity:
- Prior Kolmogorov complexity K(o):
Measures the amount of information contained in a given object o

- Conditional Kolmogorov complexity K(o1|o2):
How much (more) information is needed to produce object o1 if one
knows object o2 (as input)

• Calculation of the Normalized Information Distance (NID),
which is a proper, universal and normalized similarity metric
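
For reference, the NID is usually written as follows in the information-distance literature. The slide itself does not print the formula, so this is the standard textbook form (it holds up to logarithmic terms), using the K(o) and K(o1|o2) notation introduced above:

```latex
\mathrm{NID}(o_1, o_2) \;=\;
\frac{\max\{\,K(o_1 \mid o_2),\; K(o_2 \mid o_1)\,\}}
     {\max\{\,K(o_1),\; K(o_2)\,\}}
```

A value near 0 means that each object is almost fully described by the other; a value near 1 means that they share almost no information.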


• Kolmogorov complexity is not computable directly, but can be heuristically
approximated

• Approximation of the Normalised Information Distance (NID) by the Normalised
Compression Distance (NCD):
– Objects are represented as bit strings s
(or files) that can be concatenated (.)
– Objects are compressed by any lossless
real-world compressor (e.g. zip, bzip2, …)
– Length of the compressed string/file
approximates the Kolmogorov complexity
– Compression of the concatenation (the second string/file is compressed
using the dictionary of the first one) gives the conditional Kolmogorov
complexity
– NCD lies in [ 0 + ε; 1 + ε ]

[Figure: two example binary contact maps (17 × 17 and 12 × 12 bit matrices), shown separately and concatenated, each compressed to obtain the NCD]
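
The NCD approximation can be tried out with any standard compressor. The sketch below uses Python's `zlib` as a stand-in (bzip2 via the `bz2` module would work the same way) and the usual formula NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The serialisation in `map_to_bytes` and the toy maps are assumptions for illustration, not ProCKSI's file format.

```python
import zlib

def compressed_len(data: bytes) -> int:
    """Length of the zlib-compressed data, a stand-in for K(data)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # compress the concatenation x.y
    return (cxy - min(cx, cy)) / max(cx, cy)

def map_to_bytes(cmap) -> bytes:
    """Serialise a binary contact map row by row as ASCII '0'/'1'."""
    return "\n".join("".join(str(bit) for bit in row) for row in cmap).encode("ascii")

# Two tiny made-up contact maps, for illustration only.
map_a = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
map_b = [[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
print(ncd(map_to_bytes(map_a), map_to_bytes(map_b)))
```

Because real compressors are imperfect, the value can fall slightly outside [0, 1], which is what the ε in the interval above expresses. Very small inputs like these toy maps are dominated by compressor overhead, so meaningful values need realistically sized contact maps.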



• Analysis of similarity matrices by hierarchical clustering:
– Similarity matrices not easy to analyse,
especially for very large datasets
– Similar proteins (with small distance values)
are grouped together (clustered)
– Many clustering algorithms available,
e.g. Ward’s Minimum Variance

• Results of the hierarchical clustering
can be visualised as linear or
hyperbolic tree
– Hyperbolic tree is favourable for
large sets of proteins
– Fish-eye perspective
– Navigation through the tree
possible
– Tree comparison across
methods/data sets
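
As a rough illustration of how a distance matrix turns into a tree, here is a minimal agglomerative clustering in plain Python. Note that it uses average linkage (UPGMA) rather than Ward's Minimum Variance, simply because it is shorter to write down; the function name, labels and distances are made up for the example.

```python
def average_linkage(dist, labels):
    """Agglomerative clustering with average linkage (UPGMA).

    dist: symmetric matrix of pairwise distances (small = similar).
    Returns the merge history as (members_a, members_b, height) tuples.
    """
    clusters = {i: (labels[i],) for i in range(len(labels))}
    # distances between current clusters, keyed by the pair of cluster ids
    d = {frozenset((i, j)): dist[i][j]
         for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair)
        height = d.pop(pair)
        na, nb = len(clusters[a]), len(clusters[b])
        merged = clusters[a] + clusters[b]
        merges.append((clusters[a], clusters[b], height))
        del clusters[a], clusters[b]
        for k in clusters:
            # size-weighted average of the two old distances
            dak = d.pop(frozenset((a, k)))
            dbk = d.pop(frozenset((b, k)))
            d[frozenset((next_id, k))] = (na * dak + nb * dbk) / (na + nb)
        clusters[next_id] = merged
        next_id += 1
    return merges

# Made-up standardised distances between four proteins.
names = ["P1", "P2", "P3", "P4"]
toy = [[0.0, 0.1, 0.8, 0.9],
       [0.1, 0.0, 0.7, 0.8],
       [0.8, 0.7, 0.0, 0.2],
       [0.9, 0.8, 0.2, 0.0]]
for left, right, height in average_linkage(toy, names):
    print(left, right, round(height, 3))
```

The merge history is exactly the information a linear or hyperbolic tree viewer needs: which groups join, and at what distance.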

Total Evidence Consensus

• Comparison of a pair of proteins P1 and P2 with a given
similarity method $^{1}M$ results in a similarity score $^{1}S_{12}$

• Comparison of a dataset with multiple proteins P1 … Pn
with the same similarity method $^{1}M$ results in similarity
matrix $^{1}S$:

$$
{}^{1}S =
\begin{pmatrix}
{}^{1}S_{11} & {}^{1}S_{12} & \dots & {}^{1}S_{1n} \\
{}^{1}S_{21} & {}^{1}S_{22} & \dots & \vdots \\
\vdots & \vdots & \ddots & \vdots \\
{}^{1}S_{n1} & \dots & \dots & {}^{1}S_{nn}
\end{pmatrix}
$$

• Comparison of the same dataset with multiple similarity
methods $^{1}M \dots {}^{m}M$ results in multiple similarity matrices
$^{1}S \dots {}^{m}S$ providing multiple similarity measures


Consensus Analysis

Consensus/Greedy
– Standardisation of similarity distances: [0;1]
– Assumption: For a given pair of structures,
the best method produces the best similarity values
– Compilation of a similarity matrix including the
best values from the best similarity method for each pair

Consensus/Average
– Expert user selects similarity measures; included measures contribute equally to the
consensus
– The intelligent combination of similarity comparison measures leads to better results
than any single one can provide!

Consensus/Weighted
– Assign weights to similarity measures according to
preference by ranking, e.g. Z-score > N-Align > RMSD
– Optimise weights: Determine minimum, average and
maximum weights by solving linear programming problem


Total Evidence Consensus

• Each similarity matrix must be standardised [0;1] as different
methods produce different qualities and ranges of measures

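• Integration of multiple similarity matrices $^{1}S \dots {}^{m}S$
(from methods $^{1}M \dots {}^{m}M$)
in order to build a consensus similarity matrix C

$$
\left\{\, {}^{1}S = ({}^{1}S_{ij}),\ \dots,\ {}^{m}S = ({}^{m}S_{ij}) \,\right\}
\;\longrightarrow\;
C =
\begin{pmatrix}
C_{11} & C_{12} & \dots & C_{1n} \\
C_{21} & C_{22} & \dots & \vdots \\
\vdots & \vdots & \ddots & \vdots \\
C_{n1} & \dots & \dots & C_{nn}
\end{pmatrix},
\qquad i, j = 1, \dots, n
$$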

• The consensus operator determines
how the different similarity matrices are
weighted and averaged, e.g.:
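
One simple consensus operator is a weighted average of the standardised matrices. The sketch below assumes min-max scaling for the standardisation step and assumes that all input matrices are oriented the same way (all distances or all similarities); both are choices made for this example and not necessarily what ProCKSI does internally. The function names and numbers are hypothetical.

```python
def standardise(matrix):
    """Min-max scale all entries of a matrix into [0, 1]."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant matrices
    return [[(v - lo) / span for v in row] for row in matrix]

def consensus(matrices, weights=None):
    """Weighted average of standardised matrices (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(matrices)
    total = sum(weights)
    scaled = [standardise(m) for m in matrices]
    n = len(scaled[0])
    return [[sum(w * s[i][j] for w, s in zip(weights, scaled)) / total
             for j in range(n)] for i in range(n)]

# Two made-up distance matrices on very different scales.
s1 = [[0, 2, 4], [2, 0, 1], [4, 1, 0]]
s2 = [[0, 10, 5], [10, 0, 20], [5, 20, 0]]
for row in consensus([s1, s2]):
    print(row)
```

With equal weights this corresponds to the Consensus/Average idea; passing explicit `weights` gives a Consensus/Weighted variant.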


Results
www.procksi.org


Evaluation of CASP6 Results

• Evaluation of CASP6 competition results
• Prediction of protein structure against a given target
– Evaluation of predictions with similarity comparison methods
[Figure panels: CASP Target (T0196); ProCKSI CONSENSUS; MaxCMO Overlap; CASP Evaluation GDT-TS]

• Similarity ranking with different methods
– CONSENSUS =
Unweighted arithmetic average of
USM + MaxCMO/Overlap + DaliLite/Z
– Comparable results between ProCKSI's CONSENSUS method and the
community's gold standard GDT-TS supplemented with expert curation
– CONSENSUS detects a better model for target T0196

Clustering of Protein Kinases
Comparison of sequence-based classification with structure-based
clustering from single similarity comparison methods and ProCKSI's
consensus method

• Biological background:
– Kinases are enzymes that catalyse the transfer of a phosphate to a protein substrate
– Play essential role in most of the cellular processes
e.g. cellular differentiation and repair, cell proliferation

• Kinases dataset:

http://www.nih.go.jp/mirror/Kinases
− 45 structures published at the Protein Kinase Resource (PKR) web site

• Hanks' and Hunter's (HH) classification as gold standard:
– Based on sequence information
– HH-Clusters: Mainly 9 different groups (super-families)
– Sub-Clusters: Common features according to the SCOP database

• Experiments with 3 different comparison methods (USM, MaxCMO, DaliLite), 3 different
contact map thresholds, 7 different clustering methods (e.g. Ward's, UPGMA)


Single Similarity Measures
[Figure panels: DaliLite/Z; USM/USM; MaxCMO/Overlap]

• Best results with clustering
with Ward's Minimum
Variance method
• Each method/measure has
its own strengths and flaws

Strengths:
• Green: Classification on
Class level, e.g. α+β/PK-like
• Blue: Detect similarities
up to Species level with e.g.
mice, pigs, cows
• Red: Produce mixed bag of proteins
being least similar in Blue

Flaws:
• MaxCMO/Overlap only distinguishes proteins on Class level
• DaliLite/Z adds fairly wrong protein 1IAN to Green
• USM/USM reverses order of last two clustering steps (Blue and Green)



Similarity Consensus
[Figure panels: USM/USM + DaliLite/Z; USM/USM + DaliLite/Z + MaxCMO/Overlap]
• Exhaustive combination of all
available similarity measures

Best Results:
● Correct clustering with
USM/USM + DaliLite/Z
compensating for each
other's flaws

General Trends:
● Including similarity measures
derived from the number of
alignments (e.g. MaxCMO/Align,
DaliLite/Align) partially destroys
good clustering outside Green
● Adding noisier measures (e.g.
MaxCMO/Overlap) still produces
comparably good and robust
results


Consensus Analysis

Comparison of the influence of the combination of different similarity
measures on the quality of the consensus method

• Rost/Sander dataset:
– Designed for secondary structure prediction
– Pairwise sequence similarity of less than 25%
– 126 globular proteins incl. 18 multi-domain proteins

• SCOP classification as gold standard:
– Manually curated database containing expert knowledge
– Hierarchical classification levels:
Class, Fold, Superfamily, Family, Protein, Species

• Analyse performance of each established comparison method against
consensus method using ROC analysis
– Compare true positives against false positives
– Performance measure is Area under the Curve (AUC)


Consensus Analysis - Technique

ROC = Receiver Operating Characteristic
– Technique for comparing the overall performance of
different methods / algorithms / tests on the same dataset
– Widely employed e.g. in signal detection theory,
machine learning, and diagnostic testing in medicine

• ROC curves depict the relative trade-off between benefits
(True Positives) and costs (False Positives)

• Confusion matrix of a binary test

                   True class p    True class n
Test class Y       TP              FP
Test class N       FN              TN
Column totals      P               N

– Hit rate: True Positive rate TPr = TP / P
– False alarm: False Positive rate FPr = FP / N


Consensus Analysis - Technique

Important points in ROC space
(0,1) : high TPr and low FPr;
perfect classification
(0,0) : never issue positive
classifications; useless
(1,1) : always issue positive
classifications; useless
{y=x} : randomly guessing a
classification; useless

ROC curves for methods with continuous output
– Not a simple binary (discrete) decision problem (yes/no)
– Ranking or scoring output estimates the class membership probability
of an instance [0;1]
– Application of a variable threshold in order to produce and validate
discrete classifiers
– The best method has an uppermost (north-western) curve
– Area Under the Curve (AUC) quantifies the performance
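
The threshold sweep and the AUC can be written down compactly. The following sketch assumes that a higher score means "more likely positive" and uses the trapezoidal rule for the area; the function names and the example scores are made up.

```python
def roc_points(scores, labels):
    """ROC points (FPr, TPr) obtained by sweeping a threshold over the scores.

    scores: higher means "more likely positive"; labels: True/False.
    """
    P = sum(labels)          # number of true positives in the data
    N = len(labels) - P      # number of true negatives in the data
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # instances with tied scores cross the threshold together
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += not ranked[j][1]
            j += 1
        points.append((fp / N, tp / P))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up similarity scores; True = same class in the gold standard.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [True, True, False, True, False, False]
print(auc(roc_points(scores, labels)))
```

An AUC of 1.0 corresponds to the perfect point (0,1) being reached, and 0.5 to the diagonal y = x of random guessing.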


Consensus Analysis

Analysis of SCOP’s Class level (as an example for all levels)

- RMSD values are not good similarity measures (except for DaliLite)
- Best performance with FAST/SN and FAST/Align (Class level),
and with CE/Z, DaliLite/Z, and DaliLite/Align (all other levels)
- Consensus/All gives worse AUC value than best method but very close to it

Consensus Analysis
 Results from Comparisons/Singles

rating ranking
*** first
** second
* third


Consensus Analysis
 Results from Consensus/Average

rating ranking
*** first
** second
* third


Consensus Analysis

Analysis of SCOP’s Superfamily level (exemplary for all levels)

Consensus/
Average-Best3

- Consensus/Average-Best3 gives better AUC values than any of
the contributing similarity measures (except Protein level)
- Further reduction to Consensus/Average-Best2 improved
performance only for Protein and Superfamily level


Distributed Computing

Similarity comparison of proteins with multiple methods and
large datasets is very time consuming and needs to be
parallelised / distributed / gridified

– Simple automated scheduling system for job distribution
works well on dedicated ProCKSI cluster (5 nodes, dual)

– Research on how to bundle jobs including fast/slow
methods and small/large datasets
► Optimise the ratio between calculation time and
overhead (data transfer time, waiting time, ...)

– Generalised scheduler for usage of clusters on the GRID
and/or the University of Nottingham's cluster (> 1000
nodes)


Problem / Solution Space
All-against-all comparison of a dataset of S protein structures
using M different similarity comparison methods can be
represented as a 3D cube.

[Figure: 3D cube with axes Structures × Structures × Methods]

Heterogeneity:
1. Each structure has a different length, i.e. number
of residues
2. Each method has a different execution time even for
the same pair of structures
3. Back-end computational nodes may have different
speeds etc.


Possible Strategies
1. Comparison of one pair of proteins using one method
in the task list => SxSxM jobs, each performing 1 comparison
>> far too fine-grained
2. All-against-all comparison of the entire dataset with one
method => M jobs, each performing SxS comparisons
>> currently running, valid only for |S| < 500 proteins
3. Comparison of one pair of proteins using all methods in the
task list => SxS jobs, each performing M comparisons
>> Slightly different from 1, does not allow intelligent load
balancing
4. Intelligent partitioning of the 3D problem space, comparing a
subset of proteins with a set/subset of methods
>> under investigation
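
The granularity of the four strategies can be compared by counting jobs. The sketch below does this for strategies 1 to 3 and adds a naive uniform tiling of the cube as a stand-in for strategy 4; the real "intelligent" partitioning is described as under investigation, so `block_partition` is purely a hypothetical illustration that ignores load balancing.

```python
from itertools import product

def job_counts(s, m):
    """(number of jobs, comparisons per job) for strategies 1-3."""
    return {
        1: (s * s * m, 1),   # one pair, one method per job
        2: (m, s * s),       # whole dataset, one method per job
        3: (s * s, m),       # one pair, all methods per job
    }

def block_partition(s, m, block):
    """Naive strategy-4 sketch: tile the S x S x M cube.

    Each job covers a block x block tile of structure pairs for one method.
    """
    starts = range(0, s, block)
    return [(range(i, min(i + block, s)), range(j, min(j + block, s)), k)
            for i, j, k in product(starts, starts, range(m))]

# Made-up sizes: 5 structures, 2 methods, tiles of 2 x 2 pairs.
print(job_counts(5, 2))
jobs = block_partition(5, 2, block=2)
print(len(jobs), sum(len(rows) * len(cols) for rows, cols, _ in jobs))
```

A load-aware scheduler would additionally vary the tile size with structure length and method speed, which is the heterogeneity noted on the previous slide.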


Distributed (grid-enabled) architecture

• p = number of nodes

• N1, N2, …, Np = cluster
or grid nodes

• The system is able to
run both on a parallel
environment using the
MPI libraries and on a
grid computing
environment using the
MPICH-G2 libraries.

• The complexity of the proteins
is estimated and bags of
proteins are distributed
to different nodes


Experimental results: CK34


Experimental results: RS119


Experimental results: overall speed-up

Speed-up = Ts / Tp
where
Ts: sequential execution time
Tp: parallel execution time on p processors

Ideal speed-up = p
where
p: number of processors
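
As a small worked example of these definitions, the snippet below computes the speed-up and, additionally, the parallel efficiency (speed-up divided by p, a standard companion measure that is not on the slide). The timings are invented for illustration and are not ProCKSI measurements.

```python
def speed_up(t_sequential, t_parallel):
    """Speed-up = Ts / Tp."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, processors):
    """Fraction of the ideal speed-up p that is actually achieved."""
    return speed_up(t_sequential, t_parallel) / processors

# Hypothetical timings: 120 s sequentially, 20 s on 8 processors.
print(speed_up(120.0, 20.0), efficiency(120.0, 20.0, 8))
```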


Conclusions
www.procksi.org


Conclusions
• ProCKSI is a workbench for protein structure comparison
– Implements multiple different similarity comparison methods with different
similarity concepts and algorithms
– Facilitates the comparison and analysis of large datasets of protein
structures through a single, user-friendly interface

• ProCKSI is a decision-support system
– Integrates many different similarity measures and suggests a consensus
similarity profile, taking their strengths and weaknesses into account

The combination of multi-competence similarity comparison measures
leads to better results than any single one can provide!
• Additional Tools:
• One of the most tested PDB parsers out there
• Very flexible tool for generating contact maps under a variety of definitions
and parameters
• Flexible contact maps visualisation
• Trees comparison and visualisation
• You can add your own distance matrix

Conclusions
• ProCKSI keeps expanding:

• More methods are being added.

• If you have a method and want it included contact us!

• More sophisticated data fusion and visualisation are on their
way!

• Hardware is evolving.

• ProCKSI is publicly available at:

http://www.procksi.net


Literature

Journal Papers
– The ProCKSI Server: a decision support system for Protein (Structure)
Comparison, Knowledge, Similarity and Information
Daniel Barthel, Jonathan D. Hirst, Jacek Błażewicz, Edmund K. Burke, Natalio
Krasnogor. BMC Bioinformatics 2007, 8, 416.

– Web and Grid Technologies in Bioinformatics, Computational and Systems
Biology: A Review
Azhar A. Shah, Daniel Barthel, Piotr Lukasiak, Jacek Błażewicz, Natalio
Krasnogor. Current Bioinformatics 2008, 3, 10-31.

Conference Papers
– Grid and Distributed Public Computing Schemes for Structural Proteomics: A Short
Overview
Azhar A. Shah, Daniel Barthel, Natalio Krasnogor. In Frontiers of High Performance Computing and
Networking (ISPA2007), Lecture Notes in Computer Science 4743, 424-434. Springer-Verlag, Niagara Falls,
Canada, August 2007.
– Protein Structure Comparison, Clustering and Analysis:
An Overview of the ProCKSI Decision Support System
Azhar Ali Shah, Daniel Barthel, Natalio Krasnogor. In Proceedings of the 4th International Symposium on
Biotechnology (IBS) and 1st Pakistan-China-Iran International Conference on Biotechnology, Bioengineering
and Biophysical Chemistry (ICBBB'07), Jamshoro, Pakistan, November 2007.


Acknowledgements

