The ProCKSI-Server
                 An on-line Decision Support System for
                     Protein Structure Comparison

                                         Natalio Krasnogor
                                      www.cs.nott.ac.uk/~nxk
                               Natalio.Krasnogor@Nottingham.ac.uk


                           Interdisciplinary Optimisation Laboratory
               Automated Scheduling, Optimisation & Planning Research Group
                   School of Computer Science and Information Technology

                               Centre for Integrative Systems Biology
                                          School of Biology

                            Centre for Healthcare Associated Infections
                           Institute of Infection, Immunity & Inflammation

                                 University of Nottingham
27th November 2008, University of Warwick                                     1
Outline
• Introduction
    − Brief introduction to proteins
    − Protein Structure Comparison
    − Methods
• ProCKSI
    − Motivation
    − External Methods
    − USM & MAX-CMO
    − Consensus building
• Results
    − From a structural bioinformatics perspective
    − From a computational perspective
• Conclusions
• Acknowledgement

Introduction
                                  www.procksi.org




What are Proteins?

• Proteins are biological molecules of primary importance
  to the functioning of living organisms

• Perform many and varied functions




Structural proteins: the organism's basic building blocks, e.g.
collagen, nails, hair, etc.
Enzymes: biological engines which mediate a multitude of biochemical
reactions. Usually enzymes are very specific and catalyze only a
single type of reaction, but they can play a role in more than one
pathway.
Transmembrane proteins: they are the cell's housekeepers, e.g. by
regulating cell volume, extraction and concentration of small
molecules from the extracellular environment and generation of ionic
gradients essential for muscle and nerve cell function (the
sodium/potassium pump is an example).




Protein Structures

• Varying: size, shape, structure

• Structure determines their
  biological activity

• “Nature's Robots”

• Understanding protein structure
  is key to understanding function
  and dysfunction




Components of Proteins

• Building Blocks:
   − Amino Acids
   − Common Basic Unit

                                              Livingstone and Barton (1993)
• Distinct “side chains”
• 20 Amino Acid Types


Components of Proteins




    •Thousands of different physicochemical and biochemical properties (AAIndex)
    • Thus proteins are beautiful combinatorial beasts!




Protein Synthesis
• Amino Acid Sequences
     − AAs polymerised into
       chains (residues)
     − Gene sequence
       determines protein
       sequence
• Protein Structure
     − Chains fold into
       specific compact
       structures
• Structure formation (folding)
  is spontaneous
• Sequence determines
  structure
• Structure determines
  function

Determining Protein Structures

• Protein structure
  determination is
  slow and difficult
• Determining protein
  sequence is
  relatively easy
  (Genomics)
• PDB vs GenBank

                                             Thomas Splettstoesser


Comparing Protein Structures

• Proteins build the majority of cellular structures
  and perform most life functions

• Extend knowledge about the protein universe:
     – Understand interrelations
       between structures and functions of proteins
       through measured similarities
     – Group (cluster) proteins by
       structural similarities so as to infer commonalities

                                              • Goal is to predict functions of proteins
                                                from their structure, or design new
                                                proteins for specific functions

                                              • Considering any two objects:

                                                What does “similar” mean?

    Similar or not?    How / Where similar?


Protein Structure Comparison

Similarity comparison of protein structures is not trivial even though it is
  obvious that proteins may share certain common patterns (motifs)

   • Many different similarity comparison
     methods available, each with its own
     strengths and weaknesses

   • Different concepts of similarity:
      sequence vs. structural, local vs. global,
      chemical role vs. biological function vs. evolution
      sequence vs. …

   • Different algorithms and implementations:
      exact vs. approximation vs. heuristic,
      local vs. global search

            Maximum Contact Map Overlap
            using e.g. Memetic algorithms,
            Variable Neighbourhood Search, Tabu Search

   Picture source: http://www.cathdb.info

Existing Approaches
A variety of structure comparison methodologies exist, e.g.:

•SSAP (Orengo & Taylor, 96)

•ProSup (Feng & Sippl, 96)

•DALI (Holm & Sander, 93)

•CE (Shindyalov & Bourne, 98)

•Max-CMO (Goldman, Papadimitriou, Istrail, Lancia, 99 & 2001)

•LGA (Zemla, 2003)

•USM (Krasnogor & Pelta, 2004)

•SCOP (Murzin, Brenner, Hubbard & Chothia, 95)

•CATH (Orengo, Michie, Jones, Jones, Swindells & Thornton, 97)

Computational Underpinning

•Dynamic programming (Taylor, 99)

•Comparison of distance matrices (Holm & Sander, 93, 96)

•Maximal common sub-graph detection (Artymiuk, Poirrette, Rice & Willett, 95)

•Geometrical matching (Wu, Schmidler, Hastie & Brutlag, 98)

•Root-mean-square distances (Maiorov & Crippen, 94; Cohen & Sternberg, 80)

•Other methods (e.g. Lackner, Koppensteiner, Domingues & Sippl, 99;
Zemla, Vendruscolo, Moult & Fidelis, 2001)

A survey of various similarity measures can be found in (Koehl P:
Protein structure similarities. Curr Opin Struct Biol 2001, 11:348-353)

Some Observations
•No agreement on which of these is the best method
• Various difficulties are associated with each.
• They assume that a suitable scoring function can be defined for which
optimum values correspond to the best possible structural match between
two structures (clearly not always true, e.g. RMSD)
• Some methods cannot produce a proper ranking due to:
    • ambiguous definitions of the similarity measures or
    • neglect of alternative solutions with equivalent similarity values.

Structure comparison is, at its core, a multi-competence (multi-objective)
problem, but it is seldom treated as such, e.g.:
      • ProSup (Feng & Sippl, 96) optimizes the number of equivalent residues with the RMSD being an
        additional constraint (and not another search dimension).
      • DALI (Holm & Sander, 93) combines various derived measures into one value, effectively
        transforming a multi-objective problem into a (weighted) single objective one.



What/How are we comparing?

                   Models, Measures, Metrics & Methods




                                                or other tasks...


Until very recently researchers would:
• Focus on steps 1-4, often collapsed into one single
  step
• Compare one algorithm against others on a given
  data set
• Conclude that their algorithm “is best” for that data
  set and write a paper
Meanwhile, in the real world…
• No method is best on all data sets.
• The biologist will only use the method (s)he is most
  familiar with, regardless of its suitability to his/her
  problem.

Q: How do we change this reality?
A: We make it easy for the biologist to use the correct method
   (and more)

ProCKSI
                                  www.procksi.org




The ProCKSI-Server

ProCKSI: Protein Comparison, Knowledge,
         Similarity, and Information
   • Web Server for protein structure comparison

   • Workbench / portal for established
     methods and repositories for
     protein structure information
       – Integrates results from many
         comparison methods in one place
       – Home-grown comparison methods,
         Max-CMO and USM (using contact
         maps as their input)

   • Decision Support System / analysis tool
       – Visualises, compares and clusters all similarity measure results
       – Incorporates all results and suggests a similarity consensus

The ProCKSI-Server

Minimise the Management Overhead for Experiments
 • Upload your own dataset or download structures from the PDB repository
 • Validate your PDB file, and extract desired models and chains
 • Choose from multiple similarity comparison methods in one place (including
   your own similarities), or don't choose and use all!
 • Submit and monitor the progress of your experiment
 • Integrate results from all pair-wise comparisons
 • Analyse and visualise results from different similarity comparison methods
 • Combine results and produce a similarity consensus profile
 • Download desired results

[Figure: ProCKSI architecture. Components: Overview Manager; Structure Manager; Dataset Manager; Calculation Manager; Results Management; Analysis Manager; Task Managers; Similarity Comparison (Local: USM, MaxCMO; External); Task / Job Scheduling; Requests and Results Database / Filesystem]


Protein Comparison Methods United
• Home-grown methods:
   − USM
   − Max-CMO
• External methods:
   − DaliLite
   − FAST
   − CE
   − TMalign
   − Vorolign
   − URMS
• Additional informational sources:
   − CATH, iHOP, RCSB, SCOP

Home-Grown Methods
• Representation of 3D protein structures as 2D contact maps
   - Atoms that are far away in the linear chain,
     come close together in the folded state

   - If the distance between two atoms
     i,j is below a threshold t, they are
     said to form a contact



• Mathematical description of contact maps
    - Calculation of all pairwise Euclidean distances between atoms i,j
    - Translation into a binary, symmetrical
      matrix, called the contact map C

[Figure: contact map; both axes labelled "Sequence of atoms"]

• Contact maps in ProCKSI
  Input for the two main similarity measures:
    - Universal Similarity Metric (USM)
    - Maximum Contact Map Overlap (MaxCMO)
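
The construction of a contact map described on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `contact_map`, the toy coordinates and the threshold of 5.0 are hypothetical, and the slides do not state which atoms or which threshold ProCKSI uses. The diagonal is left at 0 here, which is a choice of this sketch.

```python
import math

def contact_map(coords, threshold):
    """Return a binary, symmetric contact map for a list of (x, y, z) points.

    Entry [i][j] is 1 when the Euclidean distance between points i and j
    is below `threshold`, and 0 otherwise (the diagonal is left at 0).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # keep the matrix symmetric
    return cmap

# Hypothetical toy coordinates, not taken from any PDB entry.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.0, 0.0)]
toy_map = contact_map(toy_coords, threshold=5.0)
for row in toy_map:
    print("".join(str(bit) for bit in row))
```

Points 0 and 2 are 7.6 apart and therefore the only pair not in contact in this toy example.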

 An Example of a contact map




 1C7W.PDB




Protein Structure Comparison

• Secondary structure elements can
  be identified in the contact map:
   − α-helix: wide bands on main diagonal
    − β-sheet: parallel or perpendicular bands to main
      diagonal


• Comparison of contact maps
    - using different similarity measures, e.g.
      number of alignments, overlap values,
      information content, …


• Protein relationships
    - Pair-wise comparison of multiple proteins
      results in a (standardised) similarity matrix
    - Comparison of all possible proteins describes
      the protein universe
[Figure: protein 1NAT with α-helices and β-sheets]




Protein Structure Comparison

• Maximum Contact Map Overlap (MaxCMO) method
  is a specific measure of equivalence

     - Number of aligned residues (dashed lines) and equivalent contacts
       (aligned bows, called overlap)

     - Overlap gives strong indication for topological similarity taking the
       local environment into account
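
To make the notion of overlap concrete, the sketch below counts the equivalent contacts for one given alignment of two contact maps. It is an illustration only: the function name `overlap`, the toy maps and the alignment are hypothetical, and the sketch scores a fixed alignment rather than searching for the alignment that maximises the overlap, which is the hard part of MaxCMO.

```python
def overlap(cmap_a, cmap_b, alignment):
    """Count contacts of A that are matched by contacts of B.

    `alignment` is a list of (a, b) pairs meaning residue a of structure A
    is aligned with residue b of structure B. A contact (a1, a2) in A counts
    as overlapping when the aligned pair (b1, b2) is also a contact in B.
    """
    count = 0
    for x in range(len(alignment)):
        for y in range(x + 1, len(alignment)):
            (a1, b1), (a2, b2) = alignment[x], alignment[y]
            if cmap_a[a1][a2] and cmap_b[b1][b2]:
                count += 1
    return count

# Hypothetical toy maps: A has contacts (0,2) and (1,3); B has (0,3), (1,2), (2,4).
map_a = [[0, 0, 1, 0],
         [0, 0, 0, 1],
         [1, 0, 0, 0],
         [0, 1, 0, 0]]
map_b = [[0, 0, 0, 1, 0],
         [0, 0, 1, 0, 0],
         [0, 1, 0, 0, 1],
         [1, 0, 0, 0, 0],
         [0, 0, 1, 0, 0]]

good_alignment = [(0, 0), (1, 2), (2, 3), (3, 4)]
poor_alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(overlap(map_a, map_b, good_alignment), overlap(map_a, map_b, poor_alignment))
```

Both alignments use four aligned residues, yet they differ in overlap, which is why the number of aligned residues and the overlap are reported as separate measures.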




1ash                                              1hlm

      Two related proteins taken from the PDB which share a six-helix structural motif.
                          Two locally and globally similar contact maps.

A candidate alignment between the contact maps of these protein structures.




Protein Structure Comparison




• Universal Similarity Metric (USM) is the most concept/domain
  independent measure in ProCKSI

    - detects similarities between (quite) divergent structures

    - based on the concept of Kolmogorov complexity

    - compares the information content of two contact maps by compression
      (NCD)




Protein Structure Comparison

   • Contact maps are the input to Universal Similarity Metric
     (USM)

   • Basic concept is Kolmogorov Complexity:
        - Prior Kolmogorov complexity K(o):
          Measures the amount of information contained in a given object o



        - Conditional Kolmogorov complexity K(o1|o2):
          How much (more) information is needed to produce object o1 if one
          knows object o2 (as input)




   • Calculation of the Normalized Information Distance (NID),
     which is a proper, universal and normalized similarity metric
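
A commonly cited form of the NID, written with the two complexities defined above, is given below. It is taken from the general literature on information distance rather than from the slide itself, so the exact notation used in ProCKSI may differ.

```latex
\[
  \mathrm{NID}(o_1, o_2) \;=\;
  \frac{\max\{\,K(o_1 \mid o_2),\; K(o_2 \mid o_1)\,\}}
       {\max\{\,K(o_1),\; K(o_2)\,\}}
\]
```

Two objects that can each be produced cheaply from the other get a value near 0, while unrelated objects get a value near 1.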


Protein Structure Comparison
• Kolmogorov complexity is not computable directly, but can be heuristically
  approximated

• Approximation of the Normalised Information Distance (NID) by the Normalised
  Compression Distance (NCD):
      – Objects are represented as bit strings s
        (or files) that can be concatenated (.)
      – Objects are compressed by any lossless
        real-world compressor (e.g. zip, bzip2, …)
      – Length of the compressed string/file
        approximates the Kolmogorov complexity
      – Compression of the second string/file using the
        dictionary of the first one gives the conditional
        Kolmogorov complexity

[Figure: two binary contact maps and their concatenation as inputs to the NCD; NCD values lie in [0 + ε; 1 + ε]]
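
A minimal sketch of the compression-based approximation follows. It uses the widely published NCD formula, (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), with `zlib` standing in for the "real-world compressor"; the choice of compressor, the row-by-row serialisation and the names `ncd`, `serialise` and `compressed_size` are assumptions of this sketch, not details taken from ProCKSI.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (proxy for K)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)

def serialise(cmap) -> bytes:
    """Flatten a binary contact map row by row into a string of '0'/'1'."""
    return "".join(str(bit) for row in cmap for bit in row).encode("ascii")

# Hypothetical toy maps: a regular banded map and a random one.
size = 40
regular = [[1 if abs(i - j) == 1 else 0 for j in range(size)] for i in range(size)]
rng = random.Random(0)
noisy = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]

d_same = ncd(serialise(regular), serialise(regular))
d_diff = ncd(serialise(regular), serialise(noisy))
print(f"NCD(regular, regular) = {d_same:.3f}")
print(f"NCD(regular, noisy)   = {d_diff:.3f}")
```

Because real compressors are imperfect, the distance of an object to itself is small but not exactly 0, which is the ε in the value range quoted above.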



Protein Structure Comparison

• Analysis of similarity matrices by hierarchical clustering:
      – Similarity matrices not easy to analyse,
        especially for very large datasets
      – Similar proteins (with small values)
        are grouped together (clustered)
      – Many clustering algorithms available,
        e.g. Ward’s Minimum Variance


                 • Results of the hierarchical clustering
                   can be visualised as linear or
                   hyperbolic tree
                         – Hyperbolic tree is favourable for
                           large sets of proteins
                         – Fish-eye perspective
                         – Navigation through the tree
                           possible
                         – Tree comparison across
                           methods/data sets
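
The grouping step can be illustrated with a small agglomerative clustering sketch. Note that it uses average linkage rather than Ward's Minimum Variance, purely because average linkage is short to write in plain Python; the function name `average_linkage`, the labels and the distance values are hypothetical.

```python
def average_linkage(dist, labels):
    """Agglomerative clustering with average linkage.

    `dist` is a symmetric matrix of pairwise distances (small = similar).
    Returns the merge history as a list of (sorted member labels, distance).
    """
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        best = None
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        members = clusters[a] + clusters[b]
        merges.append((sorted(labels[i] for i in members), d))
        del clusters[b]          # merge cluster b into cluster a
        clusters[a] = members
    return merges

# Hypothetical standardised distances between four proteins.
names = ["P1", "P2", "P3", "P4"]
toy_dist = [[0.0, 0.1, 0.8, 0.9],
            [0.1, 0.0, 0.7, 0.8],
            [0.8, 0.7, 0.0, 0.2],
            [0.9, 0.8, 0.2, 0.0]]
history = average_linkage(toy_dist, names)
for members, d in history:
    print(members, round(d, 3))
```

The merge history is what a linear or hyperbolic tree visualises: each merge becomes an internal node placed at the merge distance.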
Total Evidence Consensus

• Comparison of a pair of proteins P1 and P2 with a given
  similarity method ${}^{1}M$ results in a similarity score ${}^{1}S_{12}$

• Comparison of a dataset with multiple proteins P1 … Pn
  with the same similarity method ${}^{1}M$ results in the similarity
  matrix ${}^{1}S$:

  $$
  {}^{1}S =
  \begin{pmatrix}
    {}^{1}S_{11} & {}^{1}S_{12} & \dots  & {}^{1}S_{1n} \\
    {}^{1}S_{21} & {}^{1}S_{22} & \dots  & {}^{1}S_{2n} \\
    \vdots       & \vdots       & \ddots & \vdots       \\
    {}^{1}S_{n1} & {}^{1}S_{n2} & \dots  & {}^{1}S_{nn}
  \end{pmatrix}
  $$

• Comparison of the same dataset with multiple similarity
  methods ${}^{1}M \dots {}^{m}M$ results in multiple similarity matrices
  ${}^{1}S \dots {}^{m}S$ providing multiple similarity measures


Consensus Analysis

Consensus/Greedy
   – Standardisation of similarity distances: [0;1]
   – Assumption: For a given pair of structures,
     the best method produces the best similarity values
   – Compilation of a similarity matrix including the
     best values from the best similarity method for each pair

Consensus/Average
   – Expert user selects similarity measures; included measures contribute equally to the
     consensus
   – The intelligent combination of similarity comparison measures leads to better results
     than any single one can provide!

Consensus/Weighted
   – Assign weights to similarity measures according to
     preference by ranking, e.g. Z-score > N-Align > RMSD
   – Optimise weights: Determine minimum, average and
     maximum weights by solving a linear programming problem


Total Evidence Consensus

• Each similarity matrix must be standardised to [0;1], as different
  methods produce different qualities and ranges of measures

• Integration of multiple similarity matrices ${}^{1}S \dots {}^{m}S$
  in order to build a consensus similarity matrix C:

  $$
  \underbrace{
  \begin{pmatrix}
    {}^{k}S_{11} & \dots  & {}^{k}S_{1n} \\
    \vdots       & \ddots & \vdots       \\
    {}^{k}S_{n1} & \dots  & {}^{k}S_{nn}
  \end{pmatrix}
  }_{k = 1, \dots, m}
  \;\longrightarrow\;
  C =
  \begin{pmatrix}
    C_{11} & \dots  & C_{1n} \\
    \vdots & \ddots & \vdots \\
    C_{n1} & \dots  & C_{nn}
  \end{pmatrix}
  $$

• The consensus operator determines
  how the different similarity matrices are
  weighted and averaged, e.g.:
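
One possible consensus operator is a weighted arithmetic average of the standardised matrices, which reduces to the unweighted average when all weights are equal. The sketch below is an illustration under assumptions: min-max rescaling is used for the standardisation because the slides do not say how [0;1] is reached, all input matrices are assumed to be oriented the same way (small = similar), and the names `standardise` and `consensus` are hypothetical.

```python
def standardise(matrix):
    """Min-max rescale all entries of a matrix to the range [0, 1]."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant matrices
    return [[(v - lo) / span for v in row] for row in matrix]

def consensus(matrices, weights=None):
    """Weighted average of standardised matrices (equal weights by default)."""
    weights = weights or [1.0] * len(matrices)
    total = sum(weights)
    std = [standardise(m) for m in matrices]
    n = len(std[0])
    return [[sum(w * s[i][j] for w, s in zip(weights, std)) / total
             for j in range(n)]
            for i in range(n)]

# Hypothetical raw results of two methods on three proteins, on different scales.
method_1 = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
method_2 = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]

c_average = consensus([method_1, method_2])
c_weighted = consensus([method_1, method_2], weights=[3.0, 1.0])
print(c_average[0][1], c_weighted[0][1])
```

In the toy data the two methods disagree on which pair is closest; the weights decide how much each method's opinion counts in the consensus entry.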

Results
                                  www.procksi.org




Evaluation of CASP6 Results

• Evaluation of CASP6 competition results
• Prediction of protein structure against a given target
       – Evaluation of predictions with similarity comparison methods

[Figure panels: CASP Target (T0196); ProCKSI CONSENSUS; MaxCMO Overlap; CASP Evaluation GDT-TS]

• Similarity ranking with different methods
      – CONSENSUS = unweighted arithmetic average of
        USM + MaxCMO/Overlap + DaliLite/Z
      – Comparable results between ProCKSI's CONSENSUS method and the
        community's gold standard GDT-TS supplemented with expert curation
      – CONSENSUS detects a better model for target T0196

Clustering of Protein Kinases
      Comparison of sequence-based classification with structure-based
      clustering from single similarity comparison methods and ProCKSI's
                               consensus method

• Biological background:
      – Kinases are enzymes that catalyse the transfer of a phosphate to a protein substrate
      – Play an essential role in most cellular processes,
        e.g. cellular differentiation and repair, cell proliferation

• Kinases dataset:
      − 45 structures published at the Protein Kinase Resource (PKR) web site
        (http://www.nih.go.jp/mirror/Kinases)

• Hanks' and Hunter's (HH) classification as gold standard:
      – Based on sequence information
      – HH-Clusters: Mainly 9 different groups (super-families)
      – Sub-Clusters: Common features according to the SCOP database

• Experiments with 3 different comparison methods (USM, MaxCMO, DaliLite), 3 different
  contact map thresholds, 7 different clustering methods (e.g. Ward's, UPGMA)

Clustering of Protein Kinases

Single Similarity Measures

[Figure panels: DaliLite/Z; USM/USM; MaxCMO/Overlap]

 • Best results with clustering
   with Ward's Minimum
   Variance method
 • Each method/measure has
   its own strengths and flaws

 Strengths:
 • Green: Classification on
   Class level, e.g. α+β/PK-like
 • Blue: Detect similarities
   up to Species level with e.g.
   mice, pigs, cows
 • Red: Produce mixed bag of proteins
   being least similar in Blue


 Flaws:
 • MaxCMO/Overlap only distinguishes proteins on Class level
 • DaliLite/Z wrongly adds protein 1IAN to Green
 • USM/USM reverses order of last two clustering steps (Blue and Green)

Clustering of Protein Kinases

Similarity Consensus

[Figure panels: USM/USM + DaliLite/Z; USM/USM + DaliLite/Z + MaxCMO/Overlap]

 • Exhaustive combination of all
   available similarity measures

Best Results:
 ● Correct clustering with
   USM/USM + DaliLite/Z
   compensating for each
   other's flaws

General Trends:
 ● Including similarity measures
   derived from the number of
   alignments (e.g. MaxCMO/Align,
   DaliLite/Align) partially destroys
   good clustering outside Green
 ● Adding noisier measures (e.g.
   MaxCMO/Overlap) still produces
   comparably good and robust
   results

Consensus Analysis

Comparison of the influence of the combination of different similarity
 measures on the quality of the consensus method

• Rost/Sander dataset:
     – Designed for secondary structure prediction
     – Pairwise sequence similarity of less than 25%
     – 126 globular proteins incl. 18 multi-domain proteins


• SCOP classification as gold standard:
     – Manually curated database containing expert knowledge
     – Hierarchical classification levels:
       Class, Fold, Superfamily, Family, Protein, Species


• Analyse performance of each established comparison method against
  consensus method using ROC analysis
     – Compare true positives against false positives
     – Performance measure is Area under the Curve (AUC)

Consensus Analysis - Technique

ROC = Receiver Operating Characteristic
    – Technique for comparing the overall performance of
      different methods / algorithms / tests on the same dataset
    – Widely employed e.g. in signal detection theory,
      machine learning, and diagnostic testing in medicine


• ROC curves depict the relative trade-off between benefits
  (True Positives) and costs (False Positives)

• Confusion matrix of a binary test

                          True Classes
                            p      n
    Test Classes     Y     TP     FP
                     N     FN     TN
    Column Totals           P      N

    – Hit rate: True Positive rate TPr
    – False alarm: False Positive rate FPr
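
In terms of the confusion matrix and its column totals, the two rates are conventionally defined as below. These are the standard textbook definitions, given here as a reading aid in the slide's own symbols.

```latex
\[
  \mathrm{TPr} = \frac{TP}{P} = \frac{TP}{TP + FN},
  \qquad
  \mathrm{FPr} = \frac{FP}{N} = \frac{FP}{FP + TN}
\]
```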

Consensus Analysis - Technique

Important points in ROC space (FPr, TPr)
 (0,1)      :   high TPr and low FPr;
                perfect classification
 (0,0)      :   never issues a positive
                classification; useless
 (1,1)      :   always issues a positive
                classification; useless
 {y=x}      :   randomly guessing a
                classification; useless


ROC curves for methods with continuous output
   – Not a simple binary (discrete) decision problem (yes/no)
   – Ranking or scoring output estimates the class membership probability
     of an instance [0;1]
   – Application of a variable threshold in order to produce and validate
     discrete classifiers
   – The best method has an uppermost (north-western) curve
   – Area Under the Curve (AUC) quantifies the performance
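
The threshold sweep can be sketched in Python as follows. This is an illustration only, not the ProCKSI implementation: it assumes higher scores mean "more likely positive", that a positive label means the pair shares the same SCOP category, and it uses the trapezoidal rule for the area. The scores and labels are made-up toy values.

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores and return the ROC points (FPr, TPr)."""
    pos = sum(1 for label in labels if label)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    # Lowering the threshold admits instances in order of decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Instances with equal scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under a ROC curve given as sorted (FPr, TPr) points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: six scored pairs, three of them true positives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [True, True, False, True, False, False]
print(auc(roc_points(scores, labels)))
```

An AUC of 1.0 corresponds to the perfect classifier at (0,1), and 0.5 to random guessing along y=x.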

Consensus Analysis

Analysis of SCOP’s Class level (as an example for all levels)




   - RMSD values are not good similarity measures (except for DaliLite)
   - Best performance with FAST/SN and FAST/Align (Class level),
     and with CE/Z, DaliLite/Z, and DaliLite/Align (all other levels)
   - Consensus/All gives a worse AUC value than the best method, but very close to it
Consensus Analysis
 Results from Comparisons/Singles




                                                  rating   ranking
                                                  ***      first
                                                  **       second
                                                  *        third

Consensus Analysis
 Results from Consensus/Average




                                                  rating   ranking
                                                  ***      first
                                                  **       second
                                                  *        third

Consensus Analysis

Analysis of SCOP’s Superfamily level (exemplary for all levels)

[Figure: Consensus/Average-Best3]

     - Consensus/Average-Best3 gives better AUC values than any of
       the contributing similarity measures (except at Protein level)
     - Further reduction to Consensus/Average-Best2 improved
       performance only for the Protein and Superfamily levels
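
One way to picture an "Average-Best k" consensus is sketched below in Python. This slide does not give the normalisation or selection rules ProCKSI uses, so the sketch rests on two assumptions: every measure has already been rescaled to [0, 1] with higher meaning more similar, and "Best k" means the k measures with the highest individual AUC. The measure names A to D and all numbers are hypothetical.

```python
def consensus_average(matrices, aucs, k=3):
    """Average the similarity matrices of the k measures with the highest AUC.

    matrices: dict mapping measure name -> square matrix (list of lists),
              assumed to be normalised to [0, 1]
    aucs:     dict mapping measure name -> AUC of that measure on its own
    """
    best = sorted(aucs, key=aucs.get, reverse=True)[:k]
    size = len(matrices[best[0]])
    consensus = [
        [sum(matrices[name][i][j] for name in best) / len(best) for j in range(size)]
        for i in range(size)
    ]
    return best, consensus


# Hypothetical measures and values, for illustration only.
matrices = {
    "A": [[1.0, 0.8], [0.8, 1.0]],
    "B": [[1.0, 0.6], [0.6, 1.0]],
    "C": [[1.0, 0.4], [0.4, 1.0]],
    "D": [[1.0, 0.0], [0.0, 1.0]],
}
aucs = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.5}
best, consensus = consensus_average(matrices, aucs, k=3)
print(best, consensus[0][1])
```

Setting k=2 gives the corresponding "Best2" variant.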

Distributed Computing

Similarity comparison of proteins with multiple methods and
large datasets is very time consuming and needs to be
parallelised / distributed / gridified

   – Simple automated scheduling system for job distribution
     works well on dedicated ProCKSI cluster (5 nodes, dual)

   – Research on how to bundle jobs including fast/slow
     methods and small/large datasets
      ► Optimise the ratio between calculation time and
        overhead (data transfer time, waiting time, ...)

   – Generalised scheduler for usage of clusters on the GRID
     and/or the University of Nottingham's cluster (> 1000
     nodes)




Problem / Solution Space
All-against-all comparison of a dataset of S protein structures
using M different similarity comparison methods can be
represented as a 3D cube.

[Figure: cube with axes Structures × Structures × Methods]

Heterogeneity:
1. Each structure has a different length, i.e. number of residues
2. Each method has a different execution time, even for the
   same pair of structures
3. Back-end computational nodes may have different speeds etc.


Possible Strategies
1. Comparison of one pair of proteins using one method
   in the task list => SxSxM jobs, each performing 1 comparison
   >> far too fine-grained
2. All-against-all comparison of the entire dataset with one
   method => M jobs, each performing SxS comparisons
   >> currently running, valid only for |S|<500 proteins
3. Comparison of one pair of proteins using all methods in the
   task list => SxS jobs, each performing M comparisons
   >> Slightly different from 1, does not allow intelligent load
   balancing
4. Intelligent partitioning of the 3D problem space, comparing a
   subset of proteins with a set/subset of methods
   >> under investigation

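The granularity of the strategies can be compared by counting jobs, as in the Python sketch below. The function names and the example sizes are illustrative. The `block_jobs` tiling is only one possible reading of strategy 4, which the slide marks as under investigation.

```python
import math


def job_counts(s, m):
    """Return {strategy: (number of jobs, comparisons per job)} for strategies 1-3."""
    return {
        1: (s * s * m, 1),  # one pair, one method per job
        2: (m, s * s),      # entire dataset, one method per job
        3: (s * s, m),      # one pair, all methods per job
    }


def block_jobs(s, m, block, group):
    """Strategy 4 (assumed form): tile the cube into block x block x group sub-cubes."""
    tiles_per_axis = math.ceil(s / block)
    return tiles_per_axis ** 2 * math.ceil(m / group)


# Illustrative sizes: 100 structures, 10 methods.
for strategy, (jobs, per_job) in job_counts(100, 10).items():
    print(strategy, jobs, per_job)
print(4, block_jobs(100, 10, block=25, group=5))
```

In every strategy the total work is the same SxSxM comparisons; only the number of jobs, and hence the scheduling overhead per comparison, changes.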



Distributed (grid-enabled) architecture

• p = number of nodes

• N1, N2, ..., Np = cluster or Grid nodes

• The system can run both in a parallel environment using the
  MPI libraries and in a grid computing environment using the
  MPICH-G2 libraries.

• The complexity of the proteins is estimated and bags of
  proteins are distributed across the nodes
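
The slide does not say how complexity is estimated or how the bags are formed. As a hedged illustration only, the Python sketch below assumes the number of residues is the complexity proxy and uses a greedy "largest first, least-loaded node next" heuristic. The protein identifiers and lengths are hypothetical.

```python
import heapq


def distribute(lengths, p):
    """Split proteins into p bags with roughly equal summed residue counts.

    lengths: dict mapping protein id -> number of residues (assumed complexity proxy)
    Returns a list of p bags (lists of protein ids), one per node.
    """
    heap = [(0, node) for node in range(p)]  # (current load, node index)
    heapq.heapify(heap)
    bags = [[] for _ in range(p)]
    # Place the largest proteins first, always on the currently least-loaded node.
    for pid, length in sorted(lengths.items(), key=lambda item: -item[1]):
        load, node = heapq.heappop(heap)
        bags[node].append(pid)
        heapq.heappush(heap, (load + length, node))
    return bags


# Hypothetical proteins and lengths.
lengths = {"a": 300, "b": 250, "c": 200, "d": 150, "e": 100}
print(distribute(lengths, p=2))
```

A real scheduler would also have to account for per-method execution times and node speeds, the other two sources of heterogeneity.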




Experimental results: CK34




Experimental results: RS119




Experimental results: overall speed-up

Speed-up = Ts / Tp
where
    Ts: sequential execution time
    Tp: parallel execution time on p processors

Ideal speed-up = p
where
    p: number of processors
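
As a worked example, the speed-up can be computed and compared with the ideal value p. In the Python sketch below, the efficiency ratio (speed-up divided by p) is a standard companion measure added here for illustration and is not on the slide. The timings are made up, not measured ProCKSI results.

```python
def speed_up(t_seq, t_par):
    """Speed-up = Ts / Tp."""
    return t_seq / t_par


def efficiency(t_seq, t_par, p):
    """Fraction of the ideal speed-up p that was actually achieved."""
    return speed_up(t_seq, t_par) / p


# Made-up timings: 1200 s sequentially, 200 s on 8 processors.
print(speed_up(1200, 200))
print(efficiency(1200, 200, p=8))
```

An efficiency of 1.0 corresponds to the ideal speed-up line.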




Conclusions
                                  www.procksi.org




Conclusions
• ProCKSI is a workbench for protein structure comparison
    – Implements multiple different similarity comparison methods with different
      similarity concepts and algorithms
    – Facilitates the comparison and analysis of large datasets of protein
      structures through a single, user-friendly interface

• ProCKSI is a decision-support system
    – Integrates many different similarity measures and suggests a consensus
      similarity profile, taking their strengths and weaknesses into account

       The combination of multi-competence similarity comparison measures
       leads to better results than any single one can provide!
• Additional Tools:
    • One of the most tested PDB parsers out there
    • Very flexible tool for generating contact maps under a variety of definitions
      and parameters
    • Flexible contact map visualisation
    • Tree comparison and visualisation
    • You can add your own distance matrix
Conclusions
• ProCKSI keeps expanding:

    • More methods are being added.

    • If you have a method and want it included contact us!

    • More sophisticated data fusion and visualisation are on their
      way!

    • Hardware is evolving.


     • ProCKSI is publicly available at:

                                 http://www.procksi.net

Literature

  Journal Papers
   – The ProCKSI Server: a decision support system for Protein (Structure)
     Comparison, Knowledge, Similarity and Information
     Daniel Barthel, Jonathan D. Hirst, Jacek Błażewicz, Edmund K. Burke, Natalio
     Krasnogor. BMC Bioinformatics 2007, 8, 416.

    – Web and Grid Technologies in Bioinformatics, Computational and Systems
      Biology: A Review
      Azhar A. Shah, Daniel Barthel, Piotr Lukasiak, Jacek Błażewicz, Natalio
      Krasnogor. Current Bioinformatics 2008, 3, 10-31.


  Conference Papers
    – Grid and Distributed Public Computing Schemes for Structural Proteomics: A Short
      Overview
       Azhar A. Shah, Daniel Barthel, Natalio Krasnogor. In Frontiers of High Performance Computing and
       Networking (ISPA2007), Lecture Notes in Computer Science 4743, 424-434. Springer-Verlag, Niagara Falls,
       Canada, August 2007.
    – Protein Structure Comparison, Clustering and Analysis:
      An Overview of the ProCKSI Decision Support System
       Azhar Ali Shah, Daniel Barthel, Natalio Krasnogor. In Proceedings of the 4th International Symposium on
       Biotechnology (IBS) and 1st Pakistan-China-Iran International Conference on Biotechnology, Bioengineering
       and Biophysical Chemistry (ICBBB'07), Jamshoro, Pakistan, November 2007.




Acknowledgements




A Critique of the Proposed National Education Policy Reform
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 

Protein Structure Alignment and Comparison

  • 1. The ProCKSI-Server An on-line Decision Support System for Protein Structure Comparison Natalio Krasnogor www.cs.nott.ac.uk/~nxk Natalio.Krasnogor@Nottingham.ac.uk Interdisciplinary Optimisation Laboratory Automated Scheduling, Optimisation & Planning Research Group School of Computer Science and Information Technology Centre for Integrative Systems Biology School of Biology Centre for Healthcare Associated Infections Institute of Infection, Immunity & Inflammation University of Nottingham 27th November 2008, University of Warwick 1
  • 2. Outline: Introduction (Brief introduction to proteins; Protein structure comparison; Methods). ProCKSI (Motivation; External methods; USM & MAX-CMO; Consensus building). Results (From a structural bioinformatics perspective; From a computational perspective). Conclusions. Acknowledgement.
  • 3. Introduction www.procksi.org
  • 4. What are Proteins? Proteins are biological molecules of primary importance to the functioning of living organisms. They perform many and varied functions.
  • 5. Structural proteins: the organism's basic building blocks, e.g. collagen, nails, hair, etc. Enzymes: biological engines which mediate a multitude of biochemical reactions. Usually enzymes are very specific and catalyze only a single type of reaction, but they can play a role in more than one pathway. Transmembrane proteins: the cell's housekeepers, e.g. regulating cell volume, extracting and concentrating small molecules from the extracellular environment, and generating the ionic gradients essential for muscle and nerve cell function (the sodium/potassium pump is an example).
  • 6. Protein Structures: Varying size, shape and structure. Structure determines their biological activity. “Nature's Robots”. Understanding protein structure is key to understanding function and dysfunction.
  • 7. Components of Proteins. Building blocks: amino acids, which share a common basic unit and carry distinct “side chains”; there are 20 amino acid types. (Figure: Livingstone and Barton, 1993)
  • 8. Components of Proteins
  • 9. Components of Proteins • Thousands of different physicochemical and biochemical properties (AAIndex) • Thus proteins are beautiful combinatorial beasts!
  • 10. Protein Synthesis. Amino acid sequences: AAs are polymerised into chains (residues); the gene sequence determines the protein sequence. Protein structure: chains fold into specific compact structures; structure formation (folding) is spontaneous; sequence determines structure; structure determines function.
  • 11. Determining Protein Structures: Protein structure determination is slow and difficult. Determining protein sequence is relatively easy (genomics). PDB vs GenBank. (Image: Thomas Splettstoesser)
  • 12. Comparing Protein Structures • Proteins build the majority of cellular structures and perform most life functions • Extend knowledge about the protein universe: – Understand interrelations between structures and functions of proteins through measured similarities – Group (cluster) proteins by structural similarity so as to infer commonalities • The goal is to predict the functions of proteins from their structure, or to design new proteins for specific functions • Considering any two objects: What does “similar” mean? Similar or not? How / where similar?
  • 13. Protein Structure Comparison: Similarity comparison of protein structures is not trivial, even though it is obvious that proteins may share certain common patterns (motifs). Many different similarity comparison methods are available, each with its own strengths and weaknesses. Different concepts of similarity: sequence vs. structural, local vs. global, chemical role vs. biological function vs. evolution sequence vs. … Different algorithms and implementations: exact vs. approximation vs. heuristic, local vs. global search, e.g. Maximum Contact Map Overlap using Memetic Algorithms, Variable Neighbourhood Search or Tabu Search. (Picture source: http://www.cathdb.info)
  • 14. Existing Approaches. A variety of structure comparison methodologies exist, e.g.: • SSAP (Orengo & Taylor, 96) • ProSup (Feng & Sippl, 96) • DALI (Holm & Sander, 93) • CE (Shindyalov & Bourne, 98) • Max-CMO (Goldman, Papadimitriou, Istrail, Lancia, 99 & 2001) • LGA (Zemla, 2003) • USM (Krasnogor & Pelta, 2004) • SCOP (Murzin, Brenner, Hubbard & Chothia, 95) • CATH (Orengo, Michie, Jones, Jones, Swindells & Thornton, 97)
  • 15. Computational Underpinning • Dynamic programming (Taylor, 99) • Comparison of distance matrices (Holm & Sander, 93, 96) • Maximal common sub-graph detection (Artymiuk, Poirrette, Rice & Willett, 95) • Geometrical matching (Wu, Schmidler, Hastie & Brutlag, 98) • Root-mean-square distances (Maiorov & Crippen, 94; Cohen & Sternberg, 80) • Other methods (e.g. Lackner, Koppensteiner, Domingues & Sippl, 99; Zemla, Vendruscolo, Moult & Fidelis, 2001). A survey of various similarity measures can be found in Koehl P: Protein structure similarities. Curr Opin Struct Biol 2001, 11:348-353.
  • 16. Some Observations • No agreement on which of these is the best method • Various difficulties are associated with each • They assume that a suitable scoring function can be defined for which optimum values correspond to the best possible structural match between two structures (clearly not always true, e.g. RMSD) • Some methods cannot produce a proper ranking due to ambiguous definitions of the similarity measures or neglect of alternative solutions with equivalent similarity values. Structure comparison is at its core a multi-competence (multi-objective) problem, but it is seldom treated as such, e.g.: ProSup (Feng & Sippl, 96) optimizes the number of equivalent residues, with the RMSD being an additional constraint (and not another search dimension); DALI (Holm & Sander, 93) combines various derived measures into one value, effectively transforming a multi-objective problem into a (weighted) single-objective one.
  • 17. What/How are we comparing? Models, Measures, Metrics & Methods or other tasks...
  • 18. Until very recently researchers would: focus on steps 1-4, often collapsed into one single step; compare one algorithm against others on a given data set; conclude that their algorithm “is best” for that data set and write a paper. Meanwhile, in the real world: no method is best on all data sets. The biologist will only use the method (s)he is most familiar with, regardless of its suitability to his/her problem.
  • 19. [Slide 18 repeated, with overlay] Q: How do we change this reality?
  • 20. [Slide 18 repeated, with overlay] Q: How do we change this reality? A: We make it easy for the biologist to use the correct method (and more).
  • 21. ProCKSI www.procksi.org
  • 22. The ProCKSI-Server. ProCKSI: Protein Comparison, Knowledge, Similarity, and Information • Web server for protein structure comparison • Workbench / portal for established methods and repositories for protein structure information – Integrates results from many comparison methods in one place – Home-grown comparison methods, Max-CMO and USM (using contact maps as their input) • Decision Support System / analysis tool – Visualises, compares and clusters all similarity measure results – Incorporates all results and suggests a similarity consensus
  • 23. The ProCKSI-Server: Minimise the Management Overhead for Experiments • Upload your own dataset or download structures from the PDB repository • Validate your PDB file, and extract desired models and chains • Choose from multiple similarity comparison methods in one place (including your own similarities), or don't choose and use all! • Submit and monitor the progress of your experiment • Integrate results from all pair-wise comparisons • Analyse and visualise results from different similarity comparison methods • Combine results and produce a similarity consensus profile • Download desired results [Architecture diagram; legible labels include Dataset Manager, Structure Manager, Task Manager, Calculation Managers (local: USM, MaxCMO; external), Analysis Manager, Overview Manager, Task / Job Scheduling, Similarity Comparison, Results Management, Requests and Results, DataBase / Filesystem]
  • 24. Protein Comparison Methods United. Home-grown methods: USM, Max-CMO. External methods: DaliLite, FAST, CE, TMalign, Vorolign, URMS. Additional information sources: CATH, iHOP, RCSB, SCOP.
  • 25. Home-Grown Methods • Representation of 3D protein structures as 2D contact maps - Atoms that are far apart in the linear chain come close together in the folded state - If the distance between two atoms i, j is below a threshold t, they are said to form a contact • Mathematical description of contact maps - Calculation of all pairwise Euclidean distances between atoms i, j - Translation into a binary, symmetrical matrix, called the contact map C (both axes: sequence of atoms) • Contact maps in ProCKSI are the input for the two main similarity measures: - Universal Similarity Metric (USM) - Maximum Contact Map Overlap (MaxCMO)
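The contact-map definition on slide 25 can be sketched in a few lines of Python. This is a minimal illustration, not ProCKSI's own implementation: the function name `contact_map`, the toy coordinates and the threshold value are assumptions made for the example, and leaving the diagonal at zero (an atom is not counted as being in contact with itself) is a choice of this sketch.

```python
import math

def contact_map(coords, threshold):
    """Return a binary, symmetric contact map for a list of 3D points.

    Entry [i][j] is 1 when the Euclidean distance between points i and j
    is below `threshold`, and 0 otherwise. The diagonal is left at 0
    (an assumption of this sketch).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # keep the matrix symmetric
    return cmap

# Hypothetical toy coordinates, not taken from a real PDB entry.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.0, 0.0)]
cmap = contact_map(toy, threshold=5.0)
for row in cmap:
    print(row)
```

In practice the points would be one representative atom per residue read from a PDB file, and the threshold would be chosen per experiment.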
  • 26. An example of a contact map: 1C7W.PDB
  • 27. Protein Structure Comparison • Secondary structure elements can be identified in the contact map: – α-helix: wide bands on the main diagonal – β-sheet: bands parallel or perpendicular to the main diagonal • Comparison of contact maps - using different similarity measures, e.g. number of alignments, overlap values, information content, … • Protein relationships - Pair-wise comparison of multiple proteins results in a (standardised) similarity matrix - Comparison of all possible proteins describes the protein universe (Figure: protein 1NAT with α-helices and β-sheets)
  • 28. Protein Structure Comparison • The Maximum Contact Map Overlap (MaxCMO) method is a specific measure of equivalence - Number of aligned residues (dashed lines) and equivalent contacts (aligned bows, called overlap) - Overlap gives a strong indication of topological similarity, taking the local environment into account
  • 29-31. 1ash and 1hlm: two related proteins taken from the PDB which share a 6-helix structural motif.
  • 32. 1ash and 1hlm: two related proteins taken from the PDB which share a 6-helix structural motif. Two locally and globally similar contact maps.
  • 33. A candidate alignment between the contact maps of these protein structures.
  • 34. Protein Structure Comparison • The Universal Similarity Metric (USM) is the most concept/domain-independent measure in ProCKSI - detects similarities between (quite) divergent structures - based on the concept of Kolmogorov complexity - compares the information content of two contact maps by compression (NCD)
  • 35. Protein Structure Comparison • Contact maps are the input to the Universal Similarity Metric (USM) • The basic concept is Kolmogorov complexity: - Prior Kolmogorov complexity K(o): measures the amount of information contained in a given object o - Conditional Kolmogorov complexity K(o1|o2): how much (more) information is needed to produce object o1 if one knows object o2 (as input) • Calculation of the Normalized Information Distance (NID), which is a proper, universal and normalized similarity metric
  • 36. Protein Structure Comparison • Kolmogorov complexity is not computable directly, but can be heuristically approximated • Approximation of the Normalised Information Distance (NID) by the Normalised Compression Distance (NCD): – Objects are represented as bit strings s (or files) that can be concatenated (.) – Objects are compressed by any lossless real-world compressor (e.g. zip, bzip2, …) – The length of the compressed string/file approximates the Kolmogorov complexity – Compression of the second string/file using the dictionary of the first one gives the conditional Kolmogorov complexity – NCD values lie in [0 + ε; 1 + ε] [Figure: two binary contact maps and their concatenation]
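The compression-based approximation on slide 36 can be illustrated with a short Python sketch. It uses the commonly cited form of the Normalised Compression Distance, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), with `zlib` standing in for "any lossless real-world compressor"; the slides do not say which formula variant or compressor ProCKSI uses, so both are assumptions here, as are the helper names and the randomly generated toy maps.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a rough stand-in for K(data)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)

def random_map_bits(n_bits: int, seed: int) -> bytes:
    """Hypothetical toy 'contact map', flattened to a string of 0/1 characters."""
    rng = random.Random(seed)
    return "".join(rng.choice("01") for _ in range(n_bits)).encode("ascii")

map_a = random_map_bits(4000, seed=1)
map_b = random_map_bits(4000, seed=2)

print("NCD(a, a) =", round(ncd(map_a, map_a), 3))  # small: nothing new to encode
print("NCD(a, b) =", round(ncd(map_a, map_b), 3))  # larger: unrelated content
```

Because real compressors are imperfect, the distance of an object to itself is slightly above zero, which is what the ε in the interval on the slide accounts for.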
  • 37. Protein Structure Comparison • Analysis of similarity matrices by hierarchical clustering: – Similarity matrices are not easy to analyse, especially for very large datasets – Similar proteins (with small distance values) are grouped together (clustered) – Many clustering algorithms are available, e.g. Ward's Minimum Variance • Results of the hierarchical clustering can be visualised as a linear or hyperbolic tree – A hyperbolic tree is favourable for large sets of proteins – Fish-eye perspective – Navigation through the tree is possible – Tree comparison across methods/data sets
  • 38. Total Evidence Consensus • Comparison of a pair of proteins $P_1$ and $P_2$ with a given similarity method ${}^{1}M$ results in a similarity score ${}^{1}S_{12}$ • Comparison of a dataset with multiple proteins $P_1 \ldots P_n$ with the same similarity method ${}^{1}M$ results in an $n \times n$ similarity matrix ${}^{1}S$ with entries ${}^{1}S_{ij}$ • Comparison of the same dataset with multiple similarity methods ${}^{1}M \ldots {}^{m}M$ results in multiple similarity matrices ${}^{1}S \ldots {}^{m}S$, providing multiple similarity measures
  • 39. Consensus Analysis. Consensus/Greedy: – Standardisation of similarity distances to [0;1] – Assumption: for a given pair of structures, the best method produces the best similarity values – Compilation of a similarity matrix including the best values from the best similarity method for each pair. Consensus/Average: – Expert user selects similarity measures; included measures contribute equally to the consensus – The intelligent combination of similarity comparison measures leads to better results than any single one can provide! Consensus/Weighted: – Assign weights to similarity measures according to preference by ranking, e.g. Z-score > N-Align > RMSD – Optimise weights: determine minimum, average and maximum weights by solving a linear programming problem
  • 40. Total Evidence Consensus • Each similarity matrix must be standardised to [0;1], as different methods produce different qualities and ranges of measures • Integration of multiple similarity matrices ${}^{1}S \ldots {}^{m}S$ in order to build a consensus similarity matrix $C$ with entries $C_{ij}$ • The consensus operator determines how the different similarity matrices are weighted and averaged [example formula not preserved in this transcript]
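The standardise-then-combine step described on slides 39 and 40 might look like the following Python sketch. It is an illustration under stated assumptions rather than ProCKSI's consensus operator: min-max scaling is one possible way to standardise to [0;1], every input matrix is assumed to already be a distance (smaller means more similar, so scores such as Z-scores would need inverting first), and the names `standardise` and `consensus_average` are invented for the example.

```python
def standardise(matrix):
    """Min-max scale all entries of a square matrix into [0, 1]."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:  # constant matrix: nothing to distinguish
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / span for v in row] for row in matrix]

def consensus_average(matrices, weights=None):
    """Weighted average of standardised matrices; equal weights by default."""
    if weights is None:
        weights = [1.0] * len(matrices)
    total = sum(weights)
    scaled = [standardise(m) for m in matrices]
    n = len(scaled[0])
    return [
        [sum(w * s[i][j] for w, s in zip(weights, scaled)) / total
         for j in range(n)]
        for i in range(n)
    ]

# Two hypothetical distance matrices on very different scales.
method_1 = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
method_2 = [[0, 10, 20], [10, 0, 20], [20, 20, 0]]

consensus = consensus_average([method_1, method_2])
for row in consensus:
    print(row)
```

Passing unequal `weights` gives a simple weighted variant; choosing those weights by linear programming, as the Consensus/Weighted option describes, is outside the scope of this sketch.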
  • 41. Results www.procksi.org
  • 42. Evaluation of CASP6 Results • Evaluation of CASP6 competition results • Prediction of protein structure against a given target – Evaluation of predictions with similarity comparison methods [Figure panels: CASP Target (T0196); ProCKSI CONSENSUS; MaxCMO Overlap; CASP Evaluation GDT-TS] • Similarity ranking with different methods – CONSENSUS = unweighted arithmetic average of USM + MaxCMO/Overlap + DaliLite/Z – Comparable results between ProCKSI's CONSENSUS method and the community's gold standard GDT-TS supplemented with expert curation – CONSENSUS detects a better model for target T0196
  • 43. Clustering of Protein Kinases. Comparison of sequence-based classification with structure-based clustering from single similarity comparison methods and ProCKSI's consensus method • Biological background: – Kinases are enzymes that catalyse the transfer of a phosphate to a protein substrate – They play an essential role in most cellular processes, e.g. cellular differentiation and repair, cell proliferation • Kinases dataset: http://www.nih.go.jp/mirror/Kinases – 45 structures published at the Protein Kinase Resource (PKR) web site • Hanks' and Hunter's (HH) classification as gold standard: – Based on sequence information – HH-Clusters: mainly 9 different groups (super-families) – Sub-Clusters: common features according to the SCOP database • Experiments with 3 different comparison methods (USM, MaxCMO, DaliLite), 3 different contact map thresholds, 7 different clustering methods (e.g. Ward's, UPGMA)
  • 44. Clustering of Protein Kinases: Single Similarity Measures [Figure panels: DaliLite/Z; USM/USM; MaxCMO/Overlap] • Best results when clustering with Ward's Minimum Variance method • Each method/measure has its own strengths and flaws. Strengths: • Green: classification on Class level, e.g. α+β/PK-like • Blue: detect similarities up to Species level, e.g. mice, pigs, cows • Red: produce a mixed bag of proteins being least similar in Blue. Flaws: • MaxCMO/Overlap only distinguishes proteins on Class level • DaliLite/Z wrongly adds protein 1IAN to Green • USM/USM reverses the order of the last two clustering steps (Blue and Green)
  • 45. Clustering of Protein Kinases: Similarity Consensus [Figure panels: USM/USM + DaliLite/Z; USM/USM + DaliLite/Z + MaxCMO/Overlap] • Exhaustive combination of all available similarity measures. Best results: • Correct clustering with USM/USM + DaliLite/Z, which compensate for each other's flaws. General trends: • Including similarity measures derived from the number of alignments (e.g. MaxCMO/Align, DaliLite/Align) partially destroys good clustering outside Green • Adding noisier measures (e.g. MaxCMO/Overlap) still produces comparably good and robust results
  • 46. Consensus Analysis. Comparison of the influence of the combination of different similarity measures on the quality of the consensus method • Rost/Sander dataset: – Designed for secondary structure prediction – Pairwise sequence similarity of less than 25% – 126 globular proteins incl. 18 multi-domain proteins • SCOP classification as gold standard: – Manually curated database containing expert knowledge – Hierarchical classification levels: Class, Fold, Superfamily, Family, Protein, Species • Analyse the performance of each established comparison method against the consensus method using ROC analysis – Compare true positives against false positives – The performance measure is the Area Under the Curve (AUC)
  • 47. Consensus Analysis - Technique. ROC = Receiver Operating Characteristic – Technique for comparing the overall performance of different methods / algorithms / tests on the same dataset – Widely employed, e.g. in signal detection theory, machine learning, and diagnostic testing in medicine • ROC curves depict the relative trade-off between benefits (True Positives) and costs (False Positives) • Confusion matrix of a binary test (test classes Y/N against true classes p/n): row Y = TP, FP; row N = FN, TN; column totals P, N – Hit rate: True Positive rate TPr = TP / P – False alarm: False Positive rate FPr = FP / N
  • 48. Consensus Analysis - Technique. Important points in ROC space: (0,1): high TPr and low FPr, perfect classification; (0,0): never issue positive classifications, useless; (1,1): always issue positive classifications, useless; {y=x}: randomly guessing a classification, useless. ROC curves for methods with continuous output: – Not a simple binary (discrete) decision problem (yes/no) – Ranking or scoring output estimates the class membership probability of an instance in [0;1] – Application of a variable threshold in order to produce and validate discrete classifiers – The best method has the uppermost (north-western) curve – The Area Under the Curve (AUC) quantifies the performance
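The threshold sweep described on slides 47 and 48 can be sketched in Python as follows. This is a generic ROC/AUC computation, not the evaluation code behind the reported results: the function names, the toy scores and the labels are assumptions for illustration, and the sketch expects at least one positive and one negative label.

```python
def roc_points(scores, labels):
    """Return ROC points (FPr, TPr) obtained by lowering a score threshold.

    scores: higher means "more likely positive"; labels: 1 = positive, 0 = negative.
    Instances with equal scores are added together so ties yield one point.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical similarity scores and "same SCOP class?" labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print("AUC =", auc(curve))
```

An AUC of 1.0 corresponds to a curve through the (0,1) corner, and 0.5 to the {y=x} diagonal of random guessing.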
  • 49. Consensus Analysis. Analysis of SCOP's Class level (as an example for all levels) - RMSD values are not good similarity measures (except for DaliLite) - Best performance with FAST/SN and FAST/Align (Class level), and with CE/Z, DaliLite/Z, and DaliLite/Align (all other levels) - Consensus/All gives a worse AUC value than the best method, but very close to it
  • 50. Consensus Analysis: Results from Comparisons/Singles [Table not preserved; rating legend: *** first, ** second, * third]
  • 51. Consensus Analysis: Results from Consensus/Average [Table not preserved; rating legend: *** first, ** second, * third]
  • 52. Consensus Analysis. Analysis of SCOP's Superfamily level (as an example for all levels) [Figure label: Consensus/Average-Best3] - Consensus/Average-Best3 gives better AUC values than any of the contributing similarity measures (except at Protein level) - Further reduction to Consensus/Average-Best2 improved performance only for the Protein and Superfamily levels
  • 53. Distributed Computing. Similarity comparison of proteins with multiple methods and large datasets is very time consuming and needs to be parallelised / distributed / gridified – A simple automated scheduling system for job distribution works well on the dedicated ProCKSI cluster (5 nodes, dual) – Research on how to bundle jobs including fast/slow methods and small/large datasets, to optimise the ratio between calculation time and overhead (data transfer time, waiting time, ...) – Generalised scheduler for usage of clusters on the GRID and/or the University of Nottingham's cluster (> 1000 nodes)
  • 54. Problem / Solution Space. All-against-all comparison of a dataset of S protein structures using M different similarity comparison methods can be represented as a 3D cube (axes: Structures × Structures × Methods). Heterogeneity: 1. Each structure has a different length, i.e. number of residues 2. Each method has a different execution time, even for the same pair of structures 3. Back-end computational nodes may have different speeds, etc.
  • 55. Possible Strategies 1. Comparison of one pair of proteins using one method in the task list => S×S×M jobs, each performing 1 comparison >> far too fine-grained 2. All-against-all comparison of the entire dataset with one method => M jobs, each performing S×S comparisons >> currently running, valid only for |S| < 500 proteins 3. Comparison of one pair of proteins using all methods in the task list => S×S jobs, each performing M comparisons >> slightly different from 1, does not allow intelligent load balancing 4. Intelligent partitioning of the 3D problem space, comparing a subset of proteins with a set/subset of methods >> under investigation
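The granularity trade-off between the first three strategies on slide 55 can be made concrete with a small Python sketch. The function name `job_layout` and the dataset size in the example are hypothetical; the counts simply restate the S×S×M arithmetic from the slide.

```python
def job_layout(num_structures, num_methods):
    """Map each strategy to (number of jobs, comparisons per job)."""
    s, m = num_structures, num_methods
    return {
        "1: one pair, one method": (s * s * m, 1),
        "2: whole dataset, one method": (m, s * s),
        "3: one pair, all methods": (s * s, m),
    }

# Hypothetical experiment: 100 structures compared with 8 methods.
layout = job_layout(100, 8)
for strategy, (jobs, per_job) in layout.items():
    print(f"{strategy}: {jobs} jobs x {per_job} comparisons each")
```

All three layouts perform the same total number of comparisons; they differ only in how that work is cut into schedulable jobs, which is what the fourth, partition-based strategy tries to tune.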
  • 56. Distributed (grid-enabled) architecture • p = number of nodes • N1, N2, …, Np = cluster or grid nodes • The system is able to run both in a parallel environment using the MPI libraries and in a grid computing environment using the MPICH-G2 libraries • The complexity of the proteins is estimated and bags of proteins are distributed over different nodes
  • 57-58. Experimental results: CK34
  • 59-60. Experimental results: RS119
  • 61. Experimental results: overall speed-up. Speed-up = $T_s / T_p$, where $T_s$ is the sequential execution time and $T_p$ is the parallel execution time on $p$ processors. Ideal speed-up = $p$, where $p$ is the number of processors.
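The speed-up definition on slide 61 translates directly into code. In the Python sketch below the timings are hypothetical and not taken from the CK34 or RS119 experiments; the `efficiency` helper, which divides the measured speed-up by the ideal value p, is an addition for illustration and does not appear on the slides.

```python
def speed_up(t_sequential, t_parallel):
    """Speed-up = Ts / Tp."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, processors):
    """Fraction of the ideal speed-up (p) actually achieved."""
    return speed_up(t_sequential, t_parallel) / processors

# Hypothetical timings in seconds for a run on 8 processors.
ts, tp, p = 1200.0, 200.0, 8
print("speed-up  :", speed_up(ts, tp))
print("efficiency:", efficiency(ts, tp, p))
```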
  • 62. Conclusions www.procksi.org
  • 63. Conclusions • ProCKSI is a workbench for protein structure comparison – Implements multiple different similarity comparison methods with different similarity concepts and algorithms – Facilitates the comparison and analysis of large datasets of protein structures through a single, user-friendly interface • ProCKSI is a decision-support system – Integrates many different similarity measures and suggests a consensus similarity profile, taking their strengths and weaknesses into account. The combination of multi-competence similarity comparison measures leads to better results than any single one can provide! • Additional tools: • One of the most tested PDB parsers out there • Very flexible tool for generating contact maps under a variety of definitions and parameters • Flexible contact map visualisation • Tree comparison and visualisation • You can add your own distance matrix
  • 64. Conclusions • ProCKSI keeps expanding: • More methods are being added • If you have a method and want it included, contact us! • More sophisticated data fusion and visualisation are on their way! • Hardware is evolving • ProCKSI is publicly available at: http://www.procksi.net
  • 65. Literature. Journal papers: – The ProCKSI Server: a decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information. Daniel Barthel, Jonathan D. Hirst, Jacek Błażewicz, Edmund K. Burke, Natalio Krasnogor. BMC Bioinformatics 2007, 8, 416. – Web and Grid Technologies in Bioinformatics, Computational and Systems Biology: A Review. Azhar A. Shah, Daniel Barthel, Piotr Lukasiak, Jacek Błażewicz, Natalio Krasnogor. Current Bioinformatics 2008, 3, 10-31. Conference papers: – Grid and Distributed Public Computing Schemes for Structural Proteomics: A Short Overview. Azhar A. Shah, Daniel Barthel, Natalio Krasnogor. In Frontiers of High Performance Computing and Networking (ISPA2007), Lecture Notes in Computer Science 4743, 424-434. Springer-Verlag, Niagara Falls, Canada, August 2007. – Protein Structure Comparison, Clustering and Analysis: An Overview of the ProCKSI Decision Support System. Azhar Ali Shah, Daniel Barthel, Natalio Krasnogor. In Proceedings of the 4th International Symposium on Biotechnology (IBS) and 1st Pakistan-China-Iran International Conference on Biotechnology, Bioengineering and Biophysical Chemistry (ICBBB'07), Jamshoro, Pakistan, November 2007.
  • 66. Acknowledgements