Pittsburgh Learning Classifier Systems for Protein
Structure Prediction: Scalability and Explanatory
Power


          Jaume Bacardit & Natalio Krasnogor
          ASAP research group
          School of Computer Science & IT
          University of Nottingham
Outline

    Introduction


    The GAssist Pittsburgh LCS


    Problem definition


    Experimentation design


    Results and discussion


    Conclusions and further work

Introduction

    Proteins are biological molecules of

    primary importance to the functioning
    of living organisms
    Proteins are constructed as chains of

    amino acid residues
    This chain folds to create a 3D

    structure
Introduction

    It is relatively easy to know the primary

    sequence of a protein, but much more
    difficult to know its 3D structure
    Can we predict the 3D structure of a protein

    from its primary sequence?  Protein
    Structure Prediction (PSP)
    The PSP problem is divided into several

    sub-problems. We focus on Coordination
    Number (CN) prediction
Introduction

    We will use the GAssist [Bacardit,

    2004] Pittsburgh LCS to solve this
    problem
    We analyze two sides of the problem:

        Performance
    

        Interpretability
    

    We also study the scalability issue

The GAssist Pittsburgh LCS
    GAssist uses a generational GA where each

    individual is a complete solution to the
    classification problem
    A solution is encoded as an ordered set of

    rules
    GABIL [De Jong & Spears, 93]

    representation and crossover operator are
    used
    Representation is extended with an explicit

    default rule
    Covering style initialization operator

The GAssist Pittsburgh LCS
    Representation of GABIL for nominal

    attributes
        Predicate → Class
    
        Predicate: Conjunctive Normal Form (CNF)
    
        (A1=V11 ∨ … ∨ A1=V1n) ∧ … ∧ (An=Vn1 ∨ … ∨ An=Vnm)
              Ai : ith attribute
        
              Vij : jth value of the ith attribute
        

        “If (attribute ‘number’ takes values ‘1’ or ‘2’)
              and (attribute ‘letter’ takes values ‘a’ or ‘b’)
              and (attribute ‘color’ takes values ‘blue’ or ‘red’)
        Then predict class ‘class1’”
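
A minimal, illustrative Python sketch of how an ordered GABIL-style rule set of this kind could be matched against an instance; the attribute names and values are the hypothetical ones from the example above, and the helper names are assumptions, not GAssist code.

```python
# Illustrative sketch of GABIL-style CNF rule matching (not GAssist code).
# A predicate is a list of allowed-value sets, one per nominal attribute;
# an instance matches if every attribute value falls in its allowed set.

def matches(predicate, instance):
    return all(value in allowed for value, allowed in zip(instance, predicate))

def classify(rule_set, default_class, instance):
    """Ordered rule set: the first matching rule fires, otherwise the
    explicit default rule predicts the default class."""
    for predicate, rule_class in rule_set:
        if matches(predicate, instance):
            return rule_class
    return default_class

# Hypothetical rule corresponding to the example above:
rule = ([{'1', '2'}, {'a', 'b'}, {'blue', 'red'}], 'class1')
print(classify([rule], 'other', ('1', 'b', 'red')))   # -> class1
print(classify([rule], 'other', ('3', 'b', 'red')))   # -> other (default rule)
```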
    
The GAssist Pittsburgh LCS

    Fitness function based on the MDL

    principle [Rissanen,78] to balance
    accuracy and complexity
        Our aim is to create compact individuals
    
        for two reasons
            Avoiding over-learning
        
            Increasing the explanatory power of the rules
        

        Complexity metric will promote
    
            Individuals with few rules
        
            Individuals with few expressed attributes
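
An MDL-based fitness of the kind described here is commonly written as the weighted sum of a theory-length term (rule-set complexity) and an exceptions-length term (cost of the training errors). The sketch below uses assumed notation (W, TL, EL) rather than the exact formulation used in GAssist:

```latex
% Sketch of an MDL-style fitness to be minimized (notation assumed):
%   TL(R) : theory length  -- complexity of rule set R (number of rules,
%           number of expressed attributes per rule)
%   EL(R) : exceptions length -- cost of the training examples that R
%           misclassifies
%   W     : weight balancing complexity against accuracy
\mathrm{Fitness}(R) \;=\; W \cdot TL(R) \;+\; EL(R)
```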
        
The GAssist Pittsburgh LCS
    Costly evaluation process if dataset is big

    Computational cost is alleviated by using a

    windowing mechanism called ILAS
    [Diagram: the training set (0 … Ex examples) is split into n strata of
    Ex/n examples each (boundaries at Ex/n, 2·Ex/n, 3·Ex/n, …); the GA
    iterations (0 … Iter) are mapped onto the strata so that each iteration
    is evaluated on a different stratum in turn]

    This mechanism also introduces some

    generalization pressure
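
A minimal Python sketch of the stratified windowing idea (illustrative only; the stratification and scheduling details of the actual ILAS implementation may differ):

```python
# Illustrative sketch of ILAS-style windowing: the training set is split
# into n strata and each GA iteration evaluates fitness on one stratum only.
import random

def make_strata(training_set, n_strata, seed=0):
    """Shuffle the training set and deal it into n roughly equal strata."""
    rng = random.Random(seed)
    data = list(training_set)
    rng.shuffle(data)
    return [data[k::n_strata] for k in range(n_strata)]

def stratum_for_iteration(strata, iteration):
    """Successive iterations cycle through the strata."""
    return strata[iteration % len(strata)]

# With the settings used later (150 strata, 20000 iterations), each stratum
# is revisited roughly 20000 / 150 ≈ 133 times during a run.
```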
Problem definition

Primary Structure = Sequence
  MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFI
  LLTYLFHQQENTLPFKKIVSDLCYKQSDLVQHIKVLV
  KHSYISKVRSKIDERNTYISISEEQREKIAERVTLFDQII
  KQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIK
  KHLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQT
  VRALNNLKKQGYLIKERSTEDERKILIHMDDAQQDH
  AEQLLAQVNQLLADKDHLHLVFE


[Figure: Secondary Structure (local interactions) and Tertiary Structure
(global interactions)]
Problem definition

    Coordination Number (CN) prediction

        Two residues of a chain are said to be in
    
        contact if their distance is less than a certain
        threshold
        CN of a residue: the number of contacts that
        the residue has
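
For reference, a naive crisp-threshold computation of CN might look like the sketch below (purely illustrative: the coordinate source, the minimum sequence separation and the default cut-off are assumptions here; the experiments use the smoothed definition described on the next slide):

```python
# Illustrative crisp coordination-number computation (not the definition
# used in the experiments, which smooth the contact with a sigmoid).
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def crisp_cn(coords, cutoff=10.0, min_seq_sep=3):
    """coords: one (x, y, z) position per residue (e.g. its Cbeta atom).
    CN of residue i = number of residues j closer than `cutoff`,
    ignoring sequence neighbours (|i - j| < min_seq_sep)."""
    n = len(coords)
    cn = [0] * n
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= min_seq_sep and distance(coords[i], coords[j]) < cutoff:
                cn[i] += 1
    return cn
```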
Problem definition
    Definition of CN [Kinjo et al., 05]

        Distance between two residues is defined as the
    
        distance between their Cβ atoms (Cα for Glycine)
        Uses a smooth definition of contact based on a

        sigmoid function instead of the usual crisp
        definition (a sketch of such a definition is
        given after this slide)
        Discards local contacts
    




        Unsupervised discretization is needed to
    
        transform the CN domain into a classification
        problem
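
The sketch referenced above: a smooth contact count of this kind is typically written as a sum of sigmoids over the pairwise distances. The symbols w (sharpness) and s (minimum sequence separation) are assumptions, not values taken from the slides:

```latex
% Sketch of a sigmoid ("smooth") coordination number:
%   r_{ij} : Cbeta-Cbeta distance between residues i and j (Calpha for Glycine)
%   d_c    : distance cut-off (10 Angstrom in these experiments)
%   w      : sharpness of the sigmoid
%   s      : minimum sequence separation used to discard local contacts
CN_i \;=\; \sum_{j\,:\,|i-j| \ge s} \frac{1}{1 + \exp\!\bigl(w\,(r_{ij} - d_c)\bigr)}
```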
Experimentation design

    Protein dataset

        Used the same set used by Kinjo et al.
    

        1050 protein chains
    

        259768 residues
    

        Ten lists of the chains are available; the first 950

        chains in each list are used for training, the rest
        for testing (10× bootstrap)
        Data was extracted from PDB format and
    
        cleaned of inconsistencies and exceptions
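
A minimal sketch of the resulting train/test protocol (the chain-list representation is an assumption; only the 950/100 split per list comes from the slide):

```python
# Illustrative sketch of the 10x train/test protocol described above.
def make_splits(chain_lists, n_train=950):
    """chain_lists: ten orderings of the 1050 protein chain identifiers.
    The first n_train chains of each list form a training set and the
    remaining chains the corresponding test set."""
    return [(chains[:n_train], chains[n_train:]) for chains in chain_lists]
```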
Experimentation design
    We have to transform the data into a regular

    structure so that it can be processed by standard
    machine learning techniques
    Each residue is characterized by several features.

    We use one (e.g., the AA type) or more of them as
    input information and one of them as the target (CN)

         [Diagram: a window slides along the residue chain Ri-5 … Ri+5,
         each residue Ri having its coordination number CNi. Example
         instances for a window of 3 residues:]

                                 Ri-1, Ri,   Ri+1  → CNi
                                 Ri,   Ri+1, Ri+2  → CNi+1
                                 Ri+1, Ri+2, Ri+3  → CNi+2
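
A minimal Python sketch of this windowing step (illustrative; the padding symbol used at the chain ends and the exact attribute encoding are assumptions):

```python
# Illustrative sketch: turn one protein chain into fixed-size windowed
# instances for a standard classifier.
def windowed_instances(residues, cn_classes, half_window=4, pad='X'):
    """residues: AA type per residue; cn_classes: discretized CN class per
    residue. Each instance holds the AA types of the target residue plus
    half_window residues on each side (padded at the chain ends), labelled
    with the target residue's CN class."""
    n = len(residues)
    instances = []
    for i in range(n):
        window = [residues[j] if 0 <= j < n else pad
                  for j in range(i - half_window, i + half_window + 1)]
        instances.append((window, cn_classes[i]))
    return instances

# half_window=4 gives the window size of 9 used in these experiments.
```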
Experimentation design

    GAssist adjustments

        150 strata for the ILAS windowing
    
        system
        20000 iterations
    

        Majority class will be used by the
    
        default rule
Experimentation design
    Distance cut-off of 10Å

    Window size of 9 (4 residues at each side of

    the target)
    3 numbers of states (2, 3, 5) and 2 definitions

    of the state partitions
    3 sources of input data that generate 6

    combinations of attributes
    GAssist is compared against SVM,

    Naive Bayes and C4.5
Results and discussion

    Summary of results

       Global ranking of performance:
    

    LIBSVM > GAssist > Naive Bayes > C4.5
     CN prediction accuracy
            2 states: ~80%
        

            3 states: ~67%
        

            5 states: ~47%
        
Results and discussion

    LIBSVM consistently achieved better

    performance than GAssist. However:
        Learning time for these experiments was
    
        up to 4.5 days
        Test time could take hours
    

        Interpretability is impossible. 70-90% of
    
        instances were selected as support
        vectors
Results and discussion
    Interpretability analysis of GAssist

        Example of a rule set for the CN1-UL-2
    
        classes dataset
Results and discussion

    [Figure: example rule set for the CN1-UL-2 classes dataset]
Results and discussion

    Interpretability analysis of GAssist

        All AA types associated with the central

        residue are hydrophobic (the core of a
        protein)
        D, E consistently do not appear in the

        predicates. They are negatively charged
        residues (the surface of a protein)
Results and discussion

    Interpretability analysis. Relevance and

    specificity of the attributes
Results and discussion
    Is GAssist learning?

        [Figure: evolution of the training accuracy of the best solution]

        The knowledge learnt by the different
        strata does not integrate successfully
        into a single solution
Conclusions and further work
    Initial experiments with the GAssist Pittsburgh

    LCS applied to Bioinformatics
    GAssist can provide compact and

    interpretable solutions
    Its accuracy, however, needs to be

    improved, but it is not far from the state of the art
    These datasets are a challenge for LCS in

    several aspects such as scalability and
    representations, but LCS can give more
    explanatory power than other methods
Conclusions and further work

    Future directions

        Domain solutions
    
            Subdividing the problem into several subproblems and

            aggregating the solutions
            Smooth predictions of neighbouring residues
        

        Improving GAssist
    
            Finding a more stable (slower) tuning and parallelizing
        

            With purely ML techniques
        

            Feeding back information from the interpretability
        
            analysis to bias/restrict the search
Coordination number prediction using
Learning Classifier Systems




          Questions?
