Machine Learning Algorithms
for Protein Structure Prediction


                Jianlin Cheng

    Institute for Genomics and Bioinformatics
  School of Information and Computer Sciences
           University of California Irvine
                       2006
Outline
I.     Introduction
II.    1D Prediction
III.   2D Prediction (Beta-Sheet Topology)
IV.    3D Prediction (Fold Recognition)
V.     Publications and Bioinformatics Tools
Importance of Protein Structure
             Prediction
AGCWY……




                                 Cell

Sequence          Structure    Function
Four Levels of Protein Structure
Primary Structure (a directional sequence of amino acids/residues)


  N                                                       C
                                                         …

      Residue1       Residue2
             Peptide bond
Secondary Structure (helix, strand, coil)




        Alpha Helix             Beta Strand / Sheet   Coil
Four Levels of Protein Structure

Tertiary Structure   Quaternary Structure (complex)




                                G Protein Complex
1D: Secondary Structure Prediction

                                       MWLKKFGINLLIGQSV…
                          Helix


                                                Neural Networks
Coil
                                                 + Alignments



                                        CCCCHHHHHCCCSSSSS…
 Strand
                                        Accuracy: 78%

                  Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
1D: Solvent Accessibility Prediction
Exposed
                                     MWLKKFGINLLIGQSV…



                                              Neural Networks
                                               + Alignments



                                      eeeeeeebbbbbbbbeeeebbb…
Buried
                                      Accuracy: 79%

                Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
1D: Disordered Region Prediction Using Neural
Networks
                                            MWLKKFGINLLIGQSV…
  Disordered Region




                                                                         1D-RNN




                                             OOOOODDDDOOOOO…

                                             93% TP at 5% FP
              Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005
1D: Protein Domain Prediction Using Neural
Networks
                                                  MWLKKFGINLLIGQSV…
                           Boundary
                                                         + SS and SA

                                                                             1D-RNN



                                                NNNNNNNBBBBBNNNN…
   HIV capsid protein                                              Inference/Cut
   Domain 1                 Domain 2                        Domains
                                        Top ab-initio domain predictor in CAFASP4

                  Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.
1D: Predict Single-Site Mutation From Sequence
 Using Support Vector Machine
                                                    Correlation = 0.76




                     Support
…MWLAVFILINLK…        Vector
                     Machine

 • First method to predict energy changes from sequence
   accurately
 • Useful for protein engineering, protein design, and
   mutagenesis analysis
                               Cheng, Randall, and Baldi. Proteins, 2006
2D: Contact Map Prediction
3D Structure  ->  2D Contact Map (n × n matrix over residue pairs i, j = 1 … n)
Distance Threshold = 8 Å

               Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
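
A minimal sketch of how a contact map can be derived from 3D coordinates with the 8 Å threshold named on this slide. The function name `contact_map`, the toy coordinates, the choice of one point per residue (for example the C-alpha atom), and the strict `<` comparison are assumptions made for illustration, not details taken from the talk.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix; entry [i][j] is 1 when the points for
    residues i and j are closer than `threshold` angstroms."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap

# Hypothetical toy chain: four residues spaced 3.8 angstroms apart on a line.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

The map is symmetric by construction, so a predictor only needs to fill in one triangle of the matrix.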
2D: Disulfide Bond Prediction
Cysteine i
                                       Support yes
                                                       2D-RNN
                                       Vector
                                       Machine

                          Disulfide Bond

                                                       Graph
Cysteine j                                             Matching


             [1] Baldi, Cheng, Vullo. NIPS, 2004.
             [2] Cheng, Saigo, Baldi. Proteins, 2006
2D: Prediction of Beta-Sheet Topology
                        N terminus
  Beta Sheet                           • Ab-Initio Structure
                                         Prediction
                                       • Fold Recognition
Beta
Strand
                                       • Protein Design
                                       • Protein Folding

                                        Cheng and Baldi, Bioinformatics, 2005



                                     C terminus
         Beta Residue
         Pair
An Example of Beta-Sheet Topology
               Level 1                 Level 2                Level 3
                   4 5                      Antiparallel




                                                                 H-bond
                  2   1 3    6 7
                                             Parallel




Structure of   Beta Sheets         Strand                  Beta Residue
Protein 1VJG                       Strand Pair             Residue Pair
                                   Strand Alignment
                                   Pairing Direction
Three-Stage Prediction of Beta-
            Sheets
• Stage 1
 Predict beta-residue pairing probabilities
  using 2D-Recursive Neural Networks (2D-
  RNN, Baldi and Pollastri, 2003)

• Stage 2
 Use beta-residue pairing probabilities to
  align beta-strands
• Stage 3
 Predict beta-strand pairs and beta-sheet
  topology using graph algorithms
Stage 1: Prediction of Beta-Residue Pairings
     Using 2D-Recursive Neural Networks
Input Matrix I (m×m)  ->  2D-RNN, O = f(I)  ->  Output / Target Matrix (m×m)

Iij: input features for residue pair (i,j) of the sequence
   …AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK….
   20 for Residues, 3 SS, 2 SA
Oij: Pairing Prob.
Tij: 0/1
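
The 20 + 3 + 2 input layout can be sketched as a 25-dimensional vector per residue. This is only an illustration: the one-hot residue code, the alphabet order, and the helper name `encode_residue` are assumptions (the actual predictor may well use profile frequencies for the 20 residue inputs). The state letters H/S/C and e/b follow the output strings shown on the earlier 1D slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residue types (order is an assumption)
SS_STATES = "HSC"                     # helix, strand, coil
SA_STATES = "eb"                      # exposed, buried

def encode_residue(aa, ss, sa):
    """Encode one residue as 20 residue + 3 SS + 2 SA inputs (25 values)."""
    vec = [0.0] * 25
    vec[AMINO_ACIDS.index(aa)] = 1.0
    vec[20 + SS_STATES.index(ss)] = 1.0
    vec[23 + SA_STATES.index(sa)] = 1.0
    return vec

# Hypothetical example: a buried lysine in a strand.
example = encode_residue("K", "S", "b")
```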
An Example (Target)
                                     1        2      3   45 6   7
               Antiparallel




               Parallel

Protein 1VJG
                          Beta-Residue Pairing Map (Target Matrix)
An Example (Prediction)
Stage 2: Beta-Strand Alignment
• Use output probability
  matrix as scoring matrix
• Dynamic programming
• Disallow gaps and use
  the simplified search
  algorithm

Antiparallel: strand 1 … m against strand n … 1
Parallel:     strand 1 … m against strand 1 … n

                          Total number of alignments = 2(m+n-1)
Strand Alignment and Pairing Matrix
• The alignment score is the
  sum of the pairing
  probabilities of the aligned
  residues
• The best alignment is the
  alignment with the
  maximum score
• Strand Pairing Matrix

                                 Strand Pairing Matrix of 1VJG
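
The gapless search described on the last two slides can be sketched as follows: slide one strand along the other in both directions, sum the pairing probabilities of the aligned residues, and keep the maximum. The function name, the `offset` convention, and the toy probability matrix are assumptions for illustration only.

```python
def best_gapless_alignment(prob, strand_a, strand_b):
    """prob[i][j] is the pairing probability of residues i and j;
    strand_a and strand_b are lists of residue indices.
    Returns ((score, direction, offset), number_of_alignments_tried)."""
    m, n = len(strand_a), len(strand_b)
    best = None
    tried = 0
    for direction in ("parallel", "antiparallel"):
        b = strand_b if direction == "parallel" else strand_b[::-1]
        # m + n - 1 relative shifts with at least one aligned residue pair
        for offset in range(-(n - 1), m):
            score = 0.0
            for k in range(n):
                pos = offset + k
                if 0 <= pos < m:
                    score += prob[strand_a[pos]][b[k]]
            tried += 1
            if best is None or score > best[0]:
                best = (score, direction, offset)
    return best, tried

# Hypothetical toy case: residues 0-2 pair antiparallel with residues 7-5.
toy_prob = [[0.0] * 8 for _ in range(8)]
for i, j in [(0, 7), (1, 6), (2, 5)]:
    toy_prob[i][j] = toy_prob[j][i] = 0.5
best, tried = best_gapless_alignment(toy_prob, [0, 1, 2], [5, 6, 7])
```

With two strands of length 3 the search tries 2(3+3-1) = 10 alignments, matching the count given on the Stage 2 slide.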
Stage 3: Prediction of Beta-Strand
     Pairings and Beta-Sheet Topology


(a) Seven strands of protein 1VJG in sequence order




(b) Beta-sheet topology of protein 1VJG
Minimum Spanning Tree Like
             Algorithm
                            Strand Pairing Graph (SPG)




                          (a) Complete SPG       (b) True Weighted SPG
Strand Pairing Matrix

     Goal: Find a set of connected subgraphs that maximize the
         sum of the alignment scores and satisfy the constraints
     Algorithm: Minimum Spanning Tree Like Algorithm
An Example of MST Like Algorithm

    Strand Pairing Matrix of 1VJG
    1      2    3     4     5     6     7

1   0
2   1.3   0
3   .94   .37   0
4   .02   .02   .04   0
5   .02   .02   .03   1.9   0
6   .10   .05   .74   .04   .04   0
7   .02   .02   .03   .02   .02   .20   0

Step 1: Pair strand 4 and 5
Step 2: Pair strand 1 and 2
Step 3: Pair strand 1 and 3
Step 4: Pair strand 3 and 6
Step 5: Pair strand 6 and 7

Resulting strand order in the sheets: 2 1 3 6 7 and 4 5
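
The five steps of the 1VJG example can be reproduced with a simple greedy sketch: repeatedly take the highest-scoring strand pair that does not violate the constraints. The two constraints used here (a strand has at most two partners, and the search stops once every strand has a partner) are assumptions chosen because they reproduce the example; the talk does not spell out its constraint set, and this sketch does not check for cycles.

```python
# Lower triangle of the 1VJG strand pairing matrix, keyed by (strand, strand).
SCORES_1VJG = {
    (1, 2): 1.3,
    (1, 3): .94, (2, 3): .37,
    (1, 4): .02, (2, 4): .02, (3, 4): .04,
    (1, 5): .02, (2, 5): .02, (3, 5): .03, (4, 5): 1.9,
    (1, 6): .10, (2, 6): .05, (3, 6): .74, (4, 6): .04, (5, 6): .04,
    (1, 7): .02, (2, 7): .02, (3, 7): .03, (4, 7): .02, (5, 7): .02, (6, 7): .20,
}

def greedy_strand_pairing(scores, n_strands, max_partners=2):
    """Pick strand pairs in order of decreasing score, subject to the
    assumed constraints described above."""
    partners = {s: set() for s in range(1, n_strands + 1)}
    chosen = []
    for (a, b), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if all(partners.values()):        # every strand already has a partner
            break
        if len(partners[a]) >= max_partners or len(partners[b]) >= max_partners:
            continue
        partners[a].add(b)
        partners[b].add(a)
        chosen.append((a, b))
    return chosen

pairs_1vjg = greedy_strand_pairing(SCORES_1VJG, 7)
```

Under these assumptions the pair (2, 3) with score .37 is skipped because strand 3 already has two partners, which is why the fifth step pairs strands 6 and 7 instead.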
1.Beta Residue Pairing
 Method                         Specificity/              Ratio of
                                Sensitivity               Improvement
 BetaPairing                    41%                       17.8
 CMAPpro                        27%                       11.7
 (Pollastri and Baldi, 2002)


2. Beta Strand Alignment
 Method                                                 Alignment Pairing
                                                        Accuracy Direction
 BetaPairing                                            66%          84%
 Statistical Potential (Hubbard, 1994)                  40%          X
 Pseudo-energy (Zhu and Braun, 1999)                    35%          X

 Information Theory (Steward and Thornton, 2002)        37%          X

3. Beta Strand Pairing
 Method               Specificity     Sensitivity   % of non-local pairs

 MST Like             53%             59%           20%
3D Structure Prediction
                                                 MWLKKFGINLLIGQSV…
•Ab-Initio Structure Prediction
                                                                 Simulation
 Physical force field – protein folding               ……
 Contact map - reconstruction
                                                          Select structure with
                                                          minimum free energy
•Template-Based Structure Prediction


Query protein
                                          Fold
MWLKKFGINKH…
                                          Recognition    Alignment

                                                    Template
                      Protein Data Bank
A Machine Learning Information Retrieval
         Framework for Fold Recognition

                Fold Recognition
                                       Cheng and Baldi, Bioinformatics, 2006




Query Protein                                      Alignment
MWLKKFGIN……




                                          Template
                Protein Data Bank

                           Machine Learning Ranking
Classic Fold Recognition Approaches

           Sequence - Sequence Alignment
           (Needleman and Wunsch, 1970. Smith and Waterman, 1981)




Query      ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL



Template   ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL


                     Alignment (similarity) score

            Works for >40% sequence identity
            (Close homologs in protein family)
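
A minimal sketch of the global alignment score in the style of Needleman and Wunsch, computed by dynamic programming. The scoring values (match +1, mismatch -1, linear gap -1) are placeholder assumptions; real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    prev = [j * gap for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

For example, aligning "ACGT" with "AGT" under these placeholder scores gives three matches and one gap, for a score of 2.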
Classic Fold Recognition Approaches
                      Profile - Sequence Alignment
                      (Altschul et al., 1997)


          ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
Query     ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
Family    ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
          ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL            Average
                                                           Score

Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN


         More sensitive for distant homologs in superfamily.
         (> 25% identity)
Classic Fold Recognition Approaches
                      Profile - Sequence Alignment
                      (Altschul et al., 1997)

         12………………………………….………………n                               1   2   …   n
          ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL         A    0.4
Query     ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL         C    0.1
Family    ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL         …
          ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL         W    0.5

                                                  Position Specific Scoring Matrix
                                                  Or Hidden Markov Model
Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN


         More sensitive for distant homologs in superfamily.
         (> 25% identity)
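
A rough sketch of how a position-specific profile is derived from the query family: count residue frequencies in each alignment column. The four sequences are the query family shown on this slide; the function name is hypothetical, and real position-specific scoring matrices add pseudocounts and log-odds scaling, which are omitted here.

```python
from collections import Counter

QUERY_FAMILY = [
    "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL",
    "ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL",
]

def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({aa: c / len(alignment) for aa, c in counts.items()})
    return profile

query_profile = column_frequencies(QUERY_FAMILY)
```

Conserved columns end up with a single residue at frequency 1.0, while variable columns (such as the sixth, with A, E, and Q) spread their mass over several residues.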
Classic Fold Recognition Approaches
                     Profile - Profile Alignment
                     (Rychlewski et al., 2000)

                                                           1   2   …   n
         ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL       A   0.1
Query    ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL       C   0.4
Family   ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL       …
         ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL       W   0.5



                                                           1   2   …   m

Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN        A   0.3
         IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN        C   0.5
Family IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM          …
                                                     W   0.2




         More sensitive for very distant homologs.
         (> 15% identity)
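
One common way to compare a query profile column with a template profile column is a dot product of their frequency vectors; a full profile-profile aligner would then feed this column score into dynamic programming. The sketch below assumes that scoring choice and uses only the three entries visible on the slide (A, C, W); the elided rows are left out rather than guessed.

```python
def column_score(col_a, col_b):
    """Dot product of two {residue: frequency} profile columns."""
    return sum(p * col_b.get(aa, 0.0) for aa, p in col_a.items())

# Visible entries of the first query and template columns on the slide.
query_col = {"A": 0.1, "C": 0.4, "W": 0.5}
template_col = {"A": 0.3, "C": 0.5, "W": 0.2}
score = column_score(query_col, template_col)
```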
Classic Fold Recognition Approaches
           Sequence - Structure Alignment (Threading)
           (Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994)




 Query                   Fit
                                                                                             Fitness
MWLKKFGINLLIGQS….                                                                            Score




                                      Template Structure
 Useful for recognizing similar folds without sequence similarity.
 (no evolutionary relationship)
Integration of Complementary Approaches

                                                            FR Server1


Query
            Meta Server                                     FR server2
Consensus
        (Lundstrom et al.,2001. Fischer, 2003)



                                                            FR server3

                                                 Internet

    1. Reliability depends on the availability of external servers
    2. Makes decisions on only a handful of candidates
Machine Learning Classification Approach

                                                  Support Vector Machine (SVM)   Class 1


                     Proteins                                                    Class 2




                                                                                 Class m

Classifies individual proteins into several or dozens of structural classes
(Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004)

        Problem 1: can’t scale up to thousands of protein classes
        Problem 2: doesn’t provide templates for structure modeling
Machine Learning Information
               Retrieval Framework
Query-Template Pair
                      Relevance Function (e.g., SVM)   Score 1


                                         +
                                                       Score 2   Rank
                                                         .
                                                         .
                            -                            .

                                                       Score n


               • Extract pairwise features
               • Comparison of two pairs (four proteins)
               • Relevant or not (one score) vs. many classes
               • Ranking of templates (retrieval)
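
The retrieval step can be sketched as: score every query-template pair with one relevance function, then sort the templates by that score. The template names, the two-value feature vectors, and the linear `toy_relevance` function below are hypothetical stand-ins; in the framework described here the relevance function would be a trained model such as an SVM.

```python
def rank_templates(pair_features, relevance):
    """Sort template names by decreasing relevance of their pair features."""
    return sorted(pair_features,
                  key=lambda name: relevance(pair_features[name]),
                  reverse=True)

def toy_relevance(features):
    """Hypothetical relevance function: positive means 'same fold'."""
    return 0.5 * features[0] + 0.5 * features[1] - 0.5

# Hypothetical pairwise features for one query against three templates.
toy_pairs = {
    "templateA": [0.9, 0.8],
    "templateB": [0.2, 0.1],
    "templateC": [0.5, 0.9],
}
ranking = rank_templates(toy_pairs, toy_relevance)
```

Because every pair gets a single score, the number of fold classes never enters the model, and the top-ranked entries double as templates for structure modeling.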
Pairwise Feature Extraction
• Sequence / Family Information Features
  Cosine, correlation, and Gaussian kernel
• Sequence – Sequence Alignment Features
  Palign, ClustalW
• Sequence – Profile Alignment Features
  PSI-BLAST, IMPALA, HMMer, RPS-BLAST
• Profile – Profile Alignment Features
  ClustalW, HHSearch, Lobster, Compass, PRC-HMM
• Structural Features
  Secondary structure, solvent accessibility, contact map, beta-
  sheet topology
Pairwise Feature Extraction
Relevance Function: Support Vector
             Machine Learning
                                               Feature Space
  Positive Pairs
  (Same Folds)


                        Support
 Negative Pairs
                         Vector
 (Different Folds)
                        Machine
                     Training/Learning


                                         Hyperplane
Training Data Set
Relevance Function: Support Vector
              Machine Learning
(1)                           (2)




                                                        Margin
                     Margin




                     f(x) =
                                K is Gaussian Kernel:
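
For reference, the standard textbook form of a kernel SVM decision function and of the Gaussian kernel is given below. The symbols (support vectors x_i with labels y_i in {+1, -1}, learned weights alpha_i, bias b, and width sigma) are the usual notation and are assumed here; the slide's exact parameterization of the kernel may differ.

```latex
f(x) = \sum_{i} \alpha_i \, y_i \, K(x, x_i) + b,
\qquad
K(x, y) = \exp\!\left( -\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}} \right)
```

A query-template pair with feature vector x is predicted to share a fold when f(x) > 0, and the value of f(x) can serve as the relevance score used for ranking.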
Training and Cross-Validation
 • Standard benchmark (Lindahl’s dataset, 976 proteins)
 • 976 x 975 query-template pairs (about 7,468 positives)
Query
                        Query 1’s pairs
 1        975 pairs
 2                      Query 2’s pairs     Train / Learn
 3        975 pairs         .
 .                          .
 .                          .
 .                       (90%: 1- 878)
                                                    Rank 975
 .                                           Test   templates
 .                       (10%: 879 – 976)
          975 pairs                                 for each
 976
                                                    query
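
The pair counts and the split shown on this slide can be checked with a short sketch. Splitting by query index (1 to 878 for training, 879 to 976 for testing) follows the slide; the helper name and the use of plain integer IDs for proteins are assumptions.

```python
N_PROTEINS = 976

def pairs_for_query(query):
    """All (query, template) pairs for one query, excluding the query itself."""
    return [(query, t) for t in range(1, N_PROTEINS + 1) if t != query]

train_queries = range(1, 879)     # 90%: queries 1 to 878
test_queries = range(879, 977)    # 10%: queries 879 to 976

train_pairs = [p for q in train_queries for p in pairs_for_query(q)]
test_pairs = [p for q in test_queries for p in pairs_for_query(q)]
```

Splitting by query rather than by pair keeps all 975 pairs of a test query out of training, so each test query can be evaluated by ranking its full template list.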
Results for Top Five Ranked Templates
 Method                  Family         Superfamily       Fold
 PSI-BLAST               72.3           27.9              4.7
 HMMER                   73.5           31.3              14.6
 SAM-T98                 75.4           38.9              18.7
 BLASTLINK               78.9           40.6              16.5
 SSEARCH                 75.5           32.5              15.6
 SSHMM                   71.7           31.6              24
 THREADER                58.9           24.7              37.7
 FUGUE                   85.8           53.2              26.8
 RAPTOR                  77.8           50                45.1
 SPARKS3                 86.8           67.7              47.4
 FOLDpro                 89.9           70.0              48.3
          •Family: close homologs, more identity
          •Superfamily: distant homologs, less identity
          •Fold: no evolutionary relation, no identity
Specificity-Sensitivity Plot (Family)
Specificity-Sensitivity Plot (Superfamily)
Specificity-Sensitivity Plot (Fold)
Advantages of MLIR Framework
 •   Integration
 •   Accuracy
 •   Extensibility
 •   Simplicity
 •   Reliability
 •   Completeness
 •   Potentials
Disadvantages
  Slower than some alignment methods
A CASP7 Example: T0290
Query sequence (173 residues):
RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFM
VQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVV
FGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP
                     FOLDpro




                                        Compare with the experimental
                                        structure:
                                        RMSD = 1 Å




          Predicted Structure
Publications and Bioinformatics Tools
 1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond
 Connectivity. NIPS 2004.
                                                                             [DIpro 1.0]
 2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide
 Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks,
 and Weighted Graph Matching. Proteins, 2006.
                                                                             [DIpro 2.0]
 3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by
 Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005.
                                                                             [BETApro]
 4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein
 Structure and Structural Feature Prediction Server. Nucleic Acids Research,
 2005.
                                                        [SSpro 4/ACCpro 4/CMAPpro 2]
 5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein
 Disordered Regions by Mining Protein Structure Data. Data Mining and
 Knowledge Discovery, 2005.
                                                                                [DISpro]
Publications and Bioinformatics Tools
 6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a
 Generative, Scalable, Software Infrastructure for Pathway Bioinformatics
 and Systems Biology. IEEE Intelligent Systems, 2005.
                                                                              [Sigmoid]
 7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes
 for Single Site Mutations Using Support Vector Machines. Proteins, 2006.
                                                                                [MUpro]
 8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J.
 Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H.
 Lathrop. Functional Census of Mutation Sequence Spaces: The Example of
 p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology
 and Bioinformatics, 2006.

 9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction
 Using Profiles, Secondary Structure, Relative Solvent Accessibility, and
 Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006.
                                                                           [DOMpro]
 10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach
 to Protein Fold Recognition. Bioinformatics, 2006.
                                                                          [FOLDpro]
Acknowledgements
• Pierre Baldi
• G. Wesley Hatfield, Eric Mjolsness, Hal
  Stern, Dennis Decoste, Suzanne Sandmeyer,
  Richard Lathrop, Gianluca Pollastri, Chin-
  Rang Yang
• Mike Sweredoski, Arlo Randall, Liza Larsen,
  Sam Danziger, Trent Su, Hiroto Saigo,
  Alessandro Vullo, Lucas Scharenbroich
Markov Models
1D-Recursive Neural Network
2D-Recursive Neural Network
2D-RNNs
2D RNNs

LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

[Talk]

  • 1. Machine Learning Algorithms for Protein Structure Prediction Jianlin Cheng Institute for Genomics and Bioinformatics School of Information and Computer Sciences University of California Irvine 2006
  • 2. Outline I. Introduction II. 1D Prediction III. 2D Prediction (Beta-Sheet Topology) IV. 3D Prediction (Fold Recognition) V. Publications and Bioinformatics Tools
  • 3. Importance of Protein Structure Prediction AGCWY…… Cell Sequence Structure Function
  • 4. Four Levels of Protein Structure Primary Structure (a directional sequence of amino acids/residues) N C … Residue1 Residue2 Peptide bond Secondary Structure (helix, strand, coil) Alpha Helix Beta Strand / Sheet Coil
  • 5. Four Levels of Protein Structure Tertiary Structure Quaternary Structure (complex) G Protein Complex
  • 6. 1D: Secondary Structure Prediction. Input: MWLKKFGINLLIGQSV… Method: Neural Networks + Alignments. Output (Helix, Strand, Coil): CCCCHHHHHCCCSSSSS… Accuracy: 78%. Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
  • 7. 1D: Solvent Accessibility Prediction. Input: MWLKKFGINLLIGQSV… Method: Neural Networks + Alignments. Output (Exposed, Buried): eeeeeeebbbbbbbbeeeebbb… Accuracy: 79%. Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
  • 8. 1D: Disordered Region Prediction Using Neural Networks. Input: MWLKKFGINLLIGQSV… Method: 1D-RNN. Output (Disordered Region): OOOOODDDDOOOOO… 93% TP at 5% FP. Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005
  • 9. 1D: Protein Domain Prediction Using Neural Networks. Input: MWLKKFGINLLIGQSV… + SS and SA. Method: 1D-RNN. Output (Boundary): NNNNNNNBBBBBNNNN… Inference/Cut into Domains. Example: HIV capsid protein, Domain 1 and Domain 2. Top ab-initio domain predictor in CAFASP4. Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.
  • 10. 1D: Predict Single-Site Mutation From Sequence Using Support Vector Machine. Input: …MWLAVFILINLK… Method: Support Vector Machine. Correlation = 0.76 • First method to predict energy changes from sequence accurately • Useful for protein engineering, protein design, and mutagenesis analysis. Cheng, Randall, and Baldi. Proteins, 2006
  • 11. 2D: Contact Map Prediction. 3D Structure → 2D Contact Map: an n × n matrix indexed by residues 1 … n, with entry (i, j) for the residue pair i, j. Distance Threshold = 8 Å. Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
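
The contact-map definition on this slide can be sketched in a few lines. This is only an illustration: the function name `contact_map`, the toy coordinates, the use of `<=` at the threshold, and the choice to leave the diagonal at 0 are all assumptions, not details taken from the talk.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix; entry (i, j) is 1 when residues i and j
    lie within `threshold` angstroms of each other (diagonal left at 0)."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = 1
    return cmap

# Hypothetical coordinates (angstroms) for a four-residue toy chain.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

The resulting matrix is symmetric, so a predictor only needs to fill in one triangle.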
  • 12. 2D: Disulfide Bond Prediction. Components shown: Support Vector Machine (yes) → 2D-RNN over cysteine pairs (Cysteine i, Cysteine j) → Graph Matching → Disulfide Bond. [1] Baldi, Cheng, Vullo. NIPS, 2004. [2] Cheng, Saigo, Baldi. Proteins, 2006
  • 13. 2D: Prediction of Beta-Sheet Topology. Applications: • Ab-Initio Structure Prediction • Fold Recognition • Protein Design • Protein Folding. Figure labels: N terminus, C terminus, Beta Sheet, Beta Strand, Beta Residue Pair. Cheng and Baldi, Bioinformatics, 2005
  • 14. An Example of Beta-Sheet Topology (Protein 1VJG). Level 1: Structure of Beta Sheets (strands 4 5 in one sheet; strands 2 1 3 6 7 in the other)
  • 15. An Example of Beta-Sheet Topology (Protein 1VJG). Level 1: Structure of Beta Sheets (strands 4 5; strands 2 1 3 6 7). Level 2: Strand, Strand Pair, Strand Alignment, Pairing Direction (Antiparallel or Parallel)
  • 16. An Example of Beta-Sheet Topology (Protein 1VJG). Level 1: Structure of Beta Sheets (strands 4 5; strands 2 1 3 6 7). Level 2: Strand, Strand Pair, Strand Alignment, Pairing Direction (Antiparallel or Parallel). Level 3: Beta Residue, Residue Pair, H-bond
  • 17. Three-Stage Prediction of Beta-Sheets • Stage 1: Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003) • Stage 2: Use beta-residue pairing probabilities to align beta-strands • Stage 3: Predict beta-strand pairs and beta-sheet topology using graph algorithms
  • 18. Stage 1: Prediction of Beta-Residue Pairings Using 2D-Recursive Neural Networks. Input Matrix I (m×m), with entry Iij for residue pair (i, j) → 2D-RNN, O = f(I) → Output / Target Matrix (m×m). Oij: Pairing Prob. Tij: 0/1. Example sequence: …AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK…. Input features per position: 20 for Residues, 3 SS, 2 SA
  • 19. An Example (Target): Beta-Residue Pairing Map (Target Matrix) of Protein 1VJG, strands 1 to 7
  • 20. An Example (Target): Beta-Residue Pairing Map (Target Matrix) of Protein 1VJG, strands 1 to 7, with Antiparallel and Parallel pairings marked
  • 22. Stage 2: Beta-Strand Alignment • Use output probability matrix as scoring matrix • Dynamic programming • Disallow gaps and use the simplified search algorithm. For two strands of lengths m and n, aligned in the Antiparallel or Parallel direction: Total number of alignments = 2(m+n-1)
  • 23. Strand Alignment and Pairing Matrix • The alignment score is the sum of the pairing probabilities of the aligned residues • The best alignment is the alignment with the maximum score • Strand Pairing Matrix (figure: Strand Pairing Matrix of 1VJG)
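
A minimal sketch of the gapless strand alignment described on the two slides above. It enumerates every ungapped offset in both directions, which gives the 2(m+n-1) candidates, and keeps the one with the largest summed pairing probability. The function name, the indexing convention for the antiparallel case, and the toy probabilities are assumptions made for illustration; the talk does not spell out its exact search procedure.

```python
def best_ungapped_alignment(prob):
    """prob[i][j]: pairing probability of residue i of strand A (length m)
    with residue j of strand B (length n). Returns (direction, pairs, score)."""
    m, n = len(prob), len(prob[0])
    candidates = []
    for offset in range(-(m - 1), n):  # m + n - 1 gapless offsets
        valid = [i for i in range(m) if 0 <= i + offset < n]
        # Parallel: residue i of A pairs with residue i + offset of B.
        candidates.append(("parallel", [(i, i + offset) for i in valid]))
        # Antiparallel: same offset, but strand B is read in reverse.
        candidates.append(("antiparallel", [(i, n - 1 - (i + offset)) for i in valid]))

    def score(pairs):
        # Alignment score = sum of pairing probabilities of aligned residues.
        return sum(prob[i][j] for i, j in pairs)

    direction, pairs = max(candidates, key=lambda c: score(c[1]))
    return direction, pairs, score(pairs), len(candidates)

# Hypothetical pairing probabilities for strands of length 2 and 3.
toy_prob = [[0.1, 0.2, 0.9],
            [0.1, 0.8, 0.1]]
direction, pairs, total, n_candidates = best_ungapped_alignment(toy_prob)
```

Because gaps are disallowed, exhaustive enumeration is cheap: the number of candidates grows linearly with the strand lengths.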
  • 24. Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Topology (a) Seven strands of protein 1VJG in sequence order (b) Beta-sheet topology of protein 1VJG
  • 25. Minimum Spanning Tree Like Algorithm Strand Pairing Graph (SPG) (a) Complete SPG Strand Pairing Matrix
  • 26. Minimum Spanning Tree Like Algorithm Strand Pairing Graph (SPG) (a) Complete SPG (b) True Weighted SPG Strand Pairing Matrix Goal: Find a set of connected subgraphs that maximize the sum of the alignment scores and satisfy the constraints Algorithm: Minimum Spanning Tree Like Algorithm
  • 27. An Example of MST Like Algorithm. Strand Pairing Matrix of 1VJG:
           1    2    3    4    5    6    7
      1    0
      2    1.3  0
      3    .94  .37  0
      4    .02  .02  .04  0
      5    .02  .02  .03  1.9  0
      6    .10  .05  .74  .04  .04  0
      7    .02  .02  .03  .02  .02  .20  0
      Step 1: Pair strand 4 and 5 (score 1.9). Pairs so far: 4-5
  • 28. An Example of MST Like Algorithm (same Strand Pairing Matrix of 1VJG). Step 2: Pair strand 1 and 2 (score 1.3). Pairs so far: 4-5, 1-2
  • 29. An Example of MST Like Algorithm (same Strand Pairing Matrix of 1VJG). Step 3: Pair strand 1 and 3 (score .94). Pairs so far: 4-5, 1-2, 1-3
  • 30. An Example of MST Like Algorithm (same Strand Pairing Matrix of 1VJG). Step 4: Pair strand 3 and 6 (score .74). Pairs so far: 4-5, 1-2, 1-3, 3-6
  • 31. An Example of MST Like Algorithm (same Strand Pairing Matrix of 1VJG). Step 5: Pair strand 6 and 7 (score .20). Pairs so far: 4-5, 1-2, 1-3, 3-6, 6-7
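
The five steps above can be reproduced with a small greedy procedure over the 1VJG strand pairing matrix. The scores are the ones on the slides; the constraints used here (at most two partners per strand, no pairing within an already connected group, stop once every strand has a partner) are assumptions chosen to illustrate a Kruskal-style greedy selection, and may differ from the constraints the authors actually apply.

```python
# Strand pairing scores for 1VJG, copied from the slide (strands numbered 1-7).
SCORES_1VJG = {
    (1, 2): 1.3, (1, 3): .94, (2, 3): .37,
    (1, 4): .02, (2, 4): .02, (3, 4): .04,
    (1, 5): .02, (2, 5): .02, (3, 5): .03, (4, 5): 1.9,
    (1, 6): .10, (2, 6): .05, (3, 6): .74, (4, 6): .04, (5, 6): .04,
    (1, 7): .02, (2, 7): .02, (3, 7): .03, (4, 7): .02, (5, 7): .02, (6, 7): .20,
}

def mst_like_pairing(scores, n_strands, max_partners=2):
    """Greedily pick strand pairs in order of decreasing score.
    Assumed constraints: a strand has at most `max_partners` partners, and
    two strands already in the same connected group are not paired again."""
    partners = {s: 0 for s in range(1, n_strands + 1)}
    group = {s: s for s in partners}  # connected-component label per strand
    chosen = []
    for (a, b), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if all(partners.values()):  # assumed stop rule: every strand is paired
            break
        if partners[a] >= max_partners or partners[b] >= max_partners:
            continue
        if group[a] == group[b]:
            continue
        chosen.append((a, b))
        partners[a] += 1
        partners[b] += 1
        old, new = group[b], group[a]
        for s in group:  # merge the two groups
            if group[s] == old:
                group[s] = new
    return chosen

pairs_1vjg = mst_like_pairing(SCORES_1VJG, n_strands=7)
```

Under these assumed constraints the pair 2-3 (score .37) is skipped because strand 3 already has two partners, so the selected pairs match the order shown on the slides.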
  • 32. Results
      1. Beta Residue Pairing
      Method                               Specificity/Sensitivity  Ratio of Improvement
      BetaPairing                          41%                      17.8
      CMAPpro (Pollastri and Baldi, 2002)  27%                      11.7
      2. Beta Strand Alignment
      Method                                            Alignment Accuracy  Pairing Direction
      BetaPairing                                       66%                 84%
      Statistical Potential (Hubbard, 1994)             40%                 X
      Pseudo-energy (Zhu and Braun, 1999)               35%                 X
      Information Theory (Steward and Thornton, 2002)   37%                 X
      3. Beta Strand Pairing
      Method    Specificity  Sensitivity  % of non-local pairs
      MST Like  53%          59%          20%
  • 33. 3D Structure Prediction. Input: MWLKKFGINLLIGQSV… • Ab-Initio Structure Prediction: Simulation with a physical force field (protein folding), or reconstruction from a contact map; select structure with minimum free energy • Template-Based Structure Prediction: Query protein (MWLKKFGINKH…) → Fold Recognition against the Protein Data Bank → Template → Alignment
  • 34. A Machine Learning Information Retrieval Framework for Fold Recognition. Query Protein (MWLKKFGIN……) → Fold Recognition by Machine Learning Ranking over the Protein Data Bank → Template → Alignment. Cheng and Baldi, Bioinformatics, 2006
  • 35. Classic Fold Recognition Approaches: Sequence - Sequence Alignment (Needleman and Wunsch, 1970. Smith and Waterman, 1981)
      Query:    ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
      Template: ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL
      Output: Alignment (similarity) score. Works for >40% sequence identity (close homologs in protein family)
  • 36. Classic Fold Recognition Approaches: Profile - Sequence Alignment (Altschul et al., 1997)
      Query and Family:
        ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
        ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
        ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
        ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
      Template: ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
      Output: Average Score. More sensitive for distant homologs in superfamily (> 25% identity)
  • 37. Classic Fold Recognition Approaches: Profile - Sequence Alignment (Altschul et al., 1997)
      Query and Family (same four sequences as the previous slide), summarized over positions 1 … n as a Position Specific Scoring Matrix or Hidden Markov Model. Example column: A 0.4, C 0.1, …, W 0.5
      Template: ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
      More sensitive for distant homologs in superfamily (> 25% identity)
  • 38. Classic Fold Recognition Approaches: Profile - Profile Alignment (Rychlewski et al., 2000)
      Query and Family (positions 1 … n; example column: A 0.1, C 0.4, …, W 0.5):
        ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
        ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
        ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
        ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
      Template and Family (positions 1 … m; example column: A 0.3, C 0.5, …, W 0.2):
        ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
        IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
        IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM
      More sensitive for very distant homologs (> 15% identity)
  • 39. Classic Fold Recognition Approaches: Sequence - Structure Alignment (Threading) (Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994). The Query (MWLKKFGINLLIGQS….) is fit to a Template Structure to give a Fitness Score. Useful for recognizing similar folds without sequence similarity (no evolutionary relationship)
  • 40. Integration of Complementary Approaches. Query → Meta Server → FR Server 1, FR Server 2, FR Server 3 (over the Internet) → Consensus (Lundstrom et al., 2001. Fischer, 2003). Limitations: 1. Reliability depends on availability of external servers 2. Makes decisions on a handful of candidates
  • 41. Machine Learning Classification Approach. A Support Vector Machine (SVM) classifies individual proteins into several or dozens of structure classes (Class 1, Class 2, …, Class m) (Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004). Problem 1: can’t scale up to thousands of protein classes. Problem 2: doesn’t provide templates for structure modeling
  • 42. Machine Learning Information Retrieval Framework. Query-Template Pair → Relevance Function (e.g., SVM) → Score 1, Score 2, …, Score n (+ / -) → Rank • Extract pairwise features • Comparison of two pairs (four proteins) • Relevant or not (one score) vs. many classes • Ranking of templates (retrieval)
  • 43. Pairwise Feature Extraction • Sequence / Family Information Features: Cosine, correlation, and Gaussian kernel • Sequence – Sequence Alignment Features: Palign, ClustalW • Sequence – Profile Alignment Features: PSI-BLAST, IMPALA, HMMer, RPS-BLAST • Profile – Profile Alignment Features: ClustalW, HHSearch, Lobster, Compass, PRC-HMM • Structural Features: Secondary structure, solvent accessibility, contact map, beta-sheet topology
  • 45. Relevance Function: Support Vector Machine Learning. Training Data Set in Feature Space: Positive Pairs (Same Folds) and Negative Pairs (Different Folds) → Support Vector Machine Training/Learning → Hyperplane
  • 46. Relevance Function: Support Vector Machine Learning. Panels (1) and (2): Margin. f(x) = … where K is Gaussian Kernel: …
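
As a reference point for the slide above, the textbook form of an SVM decision function with a Gaussian kernel is sketched below. The symbols $\alpha_i$, $y_i$, $b$, $N$, and $\gamma$ are assumed notation for the learned weights, training labels, bias, number of training pairs, and kernel width; the slide's own parameterization may differ.

```latex
f(x) = \sum_{i=1}^{N} \alpha_i \, y_i \, K(x, x_i) + b,
\qquad
K(x, x') = \exp\!\left(-\gamma \, \lVert x - x' \rVert^{2}\right)
```

In this form, $x$ is the feature vector of a query-template pair, and the sign and magnitude of $f(x)$ serve as the relevance score used for ranking.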
  • 47. Training and Cross-Validation • Standard benchmark (Lindahl’s dataset, 976 proteins) • 976 x 975 query-template pairs (about 7,468 positives) • Each of the 976 queries has 975 pairs (Query 1’s pairs, Query 2’s pairs, …) • Train / Learn on 90% of the queries (1 - 878) • Test on 10% (879 - 976): rank the 975 templates for each query
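
The pair counts and the query-level split on this slide can be checked with a short sketch. The function names are hypothetical, only the single fold shown on the slide (queries 879 to 976 held out) is reproduced, and the scores passed to `rank_templates` are made up for the example.

```python
def query_template_pairs(n_proteins):
    """All ordered (query, template) pairs with query != template, 1-based ids."""
    return [(q, t)
            for q in range(1, n_proteins + 1)
            for t in range(1, n_proteins + 1)
            if q != t]

def split_by_query(pairs, test_queries):
    """Split pairs so that all pairs of a given query land on the same side."""
    test_queries = set(test_queries)
    train = [p for p in pairs if p[0] not in test_queries]
    test = [p for p in pairs if p[0] in test_queries]
    return train, test

def rank_templates(scored_pairs, query):
    """scored_pairs: iterable of (query, template, score) triples.
    Returns the query's templates sorted by decreasing relevance score."""
    hits = [(t, s) for q, t, s in scored_pairs if q == query]
    return sorted(hits, key=lambda ts: ts[1], reverse=True)

pairs = query_template_pairs(976)                      # 976 x 975 pairs
train_pairs, test_pairs = split_by_query(pairs, range(879, 977))
toy_ranking = rank_templates([(1, 2, 0.1), (1, 3, 0.9), (2, 1, 0.5)], query=1)
```

Splitting by query rather than by pair keeps every test query's 975 candidate templates together, which is what ranking evaluation needs.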
  • 48. Results for Top Five Ranked Templates
      Method      Family  Superfamily  Fold
      PSI-BLAST   72.3    27.9         4.7
      HMMER       73.5    31.3         14.6
      SAM-T98     75.4    38.9         18.7
      BLASTLINK   78.9    4.06         16.5
      SSEARCH     75.5    32.5         15.6
      SSHMM       71.7    31.6         24
      THREADER    58.9    24.7         37.7
      FUGUE       85.8    53.2         26.8
      RAPTOR      77.8    50           45.1
      SPARKS3     86.8    67.7         47.4
      FOLDpro     89.9    70.0         48.3
      • Family: close homologs, more identity • Superfamily: distant homologs, less identity • Fold: no evolutionary relation, no identity
  • 52. Advantages of MLIR Framework • Integration • Accuracy • Extensibility • Simplicity • Reliability • Completeness • Potentials. Disadvantages: Slower than some alignment methods
  • 53. A CASP7 Example: T0290. Query sequence (173 residues): RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFM VQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVV FGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP. FOLDpro Predicted Structure compared with the experimental structure: RMSD = 1 Å
  • 54. Publications and Bioinformatics Tools 1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond Connectivity. NIPS 2004. [DIpro 1.0] 2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching. Proteins, 2006. [DIpro 2.0] 3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005. [BETApro] 4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, 2005. [SSpro 4/ACCpro 4/CMAPpro 2] 5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery, 2005. [DISpro]
  • 55. Publications and Bioinformatics Tools 6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a Generative, Scalable, Software Infrastructure for Pathway Bioinformatics and Systems Biology. IEEE Intelligent Systems, 2005. [Sigmoid] 7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes for Single Site Mutations Using Support Vector Machines. Proteins, 2006. [MUpro] 8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J. Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H. Lathrop. Functional Census of Mutation Sequence Spaces: The Example of p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology and Bioinformatics, 2006. 9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006. [DOMpro] 10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics, 2006. [FOLDpro]
  • 56. Acknowledgements • Pierre Baldi • G. Wesley Hatfield, Eric Mjolsness, Hal Stern, Dennis Decoste, Suzanne Sandmeyer, Richard Lathrop, Gianluca Pollastri, Chin-Rang Yang • Mike Sweredoski, Arlo Randall, Liza Larsen, Sam Danziger, Trent Su, Hiroto Saigo, Alessandro Vullo, Lucas Scharenbroich