Random Forest Ensembles Applied to
MLSCN Screening Data for Prediction
      and Feature Selection


             Rajarshi Guha
             School of Informatics
              Indiana University
                     and
            Stephan Schürer
                 Scripps, FL


            23rd August, 2007
Broad Goals
   Understand and possibly predict cytotoxicity
     −   Utilizing MLSCN screening data and external data
     −   Characterize and visualize various screening results
     −   Relate screening data to known information
   Model and predict acute toxicity in animals
     −   Relate large cytotoxicity data sets to animal toxicity(?)
   Modelling protocols to handle the characteristics of HTS data
     −   Large datasets, imbalanced classes, applicability
   Make models publicly available
     −   For use in multiple scenarios and accessible by a variety of methods
Datasets
   Animal Acute Toxicity Data was extracted from the
    ToxNet database (available from MDL)
    −   Selected only LD50 data for mouse and rat and three routes of
        administration
    −   Summarized LD50 data by structure, species and route
        (140,808 LD50 data points, 103,040 structures)
    −   Classified into Toxic/Nontoxic using a cutoff

   Cytotoxicity Data was taken as published in PubChem
    from Scripps and NCGC
    −   Scripps Jurkat cytotoxicity assay
        (59,805 structures with %Inhib, 801 IC50 values)
    −   NCGC data from PubChem for 13 cell lines (non-MLSMR structures):
        summarized multiple sample data by unique structures and extracted IC50
        data: 1,334 structures, 13 x 1,334 IC50 values for different cell lines
Scripps Cytotoxicity Models
   57,469 valid structures
   775 structures with measured IC50
    −   Skipped 26 structures that BCI could not parse
   How do we model this dataset?
    −   Use all data. Very poor results
    −   Use a sampling procedure to get an ensemble of
        models
    −   Consider just the 775 structures
Scripps Cytotoxicity Models
   First considered the 775 structures
   Evaluated 1052 bit BCI fingerprints
   Selected a cutoff pIC50
    −   >= 5.5: toxic
    −   < 5.5: nontoxic
   Used sampling to
    create 10-member
    ensemble
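
A minimal sketch of the labelling step, assuming IC50 values are expressed in mol/L (the slides do not state the unit) and using the standard definition pIC50 = -log10(IC50). The function names are illustrative, not taken from the original work.

```python
import math

def pic50_from_ic50_molar(ic50_molar: float) -> float:
    """Convert an IC50 in mol/L to pIC50 (negative base-10 logarithm)."""
    return -math.log10(ic50_molar)

def classify(pic50: float, cutoff: float = 5.5) -> str:
    """Label a compound using the slide's rule: >= cutoff is toxic."""
    return "toxic" if pic50 >= cutoff else "nontoxic"

# Example: a 1 micromolar IC50 gives a pIC50 of about 6, above the 5.5 cutoff.
print(classify(pic50_from_ic50_molar(1e-6)))
```

With this convention the 5.5 cutoff corresponds to an IC50 of roughly 3 micromolar.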
Handling Imbalanced Classes
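   (Flow diagram) The dataset is split by class into Toxic and Nontoxic
    −   Toxic: PSET, TSET
    −   Nontoxic: PSET, TSET-1, TSET-2, ..., TSET-N
   RF 1, RF 2, ..., RF N: one model for each of TSET-1 ... TSET-N
   Preds 1, Preds 2, ..., Preds N are combined into Consensus Predictions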
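
The sampling scheme above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the toxic class is the minority, that each ensemble member sees all toxic training compounds plus an equally sized random draw of nontoxic ones, and that the consensus is a vote-fraction threshold. Model training itself is left out; `predictions_per_model` stands in for the output of the N random forests.

```python
import random

def balanced_training_sets(toxic, nontoxic, n_members, seed=0):
    """Return n_members (toxic, nontoxic_sample) pairs of equal class size.

    Assumes len(toxic) <= len(nontoxic); each nontoxic sample is drawn
    independently without replacement from the full nontoxic pool.
    """
    rng = random.Random(seed)
    k = len(toxic)
    return [(list(toxic), rng.sample(nontoxic, k)) for _ in range(n_members)]

def consensus(predictions_per_model, cutoff=0.5):
    """Call a compound toxic when the fraction of toxic votes >= cutoff."""
    n_models = len(predictions_per_model)
    result = []
    for votes in zip(*predictions_per_model):
        fraction = sum(v == "toxic" for v in votes) / n_models
        result.append("toxic" if fraction >= cutoff else "nontoxic")
    return result

# Hypothetical compound identifiers: 5 toxic, 100 nontoxic, 10 ensemble members.
sets = balanced_training_sets(list(range(5)), list(range(100, 200)), 10)
print(len(sets), [len(s[1]) for s in sets][:3])
```

Using a vote fraction rather than a plain majority avoids ambiguous ties when the ensemble has an even number of members.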
Scripps Cytotoxicity Models
   % correct
    (ensemble average) = 69%
   % correct
    (consensus) = 71%

         Nontoxic Toxic
Nontoxic   39      17
 Toxic     15      41


   Not very good
    performance
Do More Negatives Help?

   Include 10,000 structures, randomly selected
    −   Primary data, assumed to be nontoxic
   Selected a cutoff pIC50
    −   >= 5.0: toxic
    −   < 5.0: nontoxic
   Used sampling to
    create 10-member
    ensemble
Expanded Cytotoxicity Dataset
   % correct (averaged over
    the ensemble) = 69%
   % correct (consensus
    prediction) = 71%
         Nontoxic Toxic
Nontoxic   79      24
 Toxic     35      68

   Not much
    improvement
   Insufficient sampling
    of the nontoxics
We Need More Positives

   The two datasets (775 vs 10,775 compounds)
    are quite similar in terms of bit spectrum
   Normalized Manhattan distance = 0.016
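
The slides do not define the bit spectrum or the normalization. One plausible reading, sketched here as an assumption, is that the bit spectrum is the per-bit frequency across a dataset and the distance is the summed absolute difference divided by the number of bits.

```python
def bit_spectrum(fingerprints):
    """Fraction of compounds in which each bit is set.

    fingerprints: list of equal-length lists of 0/1 values.
    """
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    return [sum(fp[i] for fp in fingerprints) / n for i in range(n_bits)]

def normalized_manhattan(spec_a, spec_b):
    """Mean absolute per-bit difference between two bit spectra (0 to 1)."""
    return sum(abs(a - b) for a, b in zip(spec_a, spec_b)) / len(spec_a)

# Toy 4-bit fingerprints for two small hypothetical datasets.
set_a = [[1, 0, 1, 0], [1, 1, 0, 0]]
set_b = [[1, 0, 0, 0], [0, 0, 0, 0]]
print(normalized_manhattan(bit_spectrum(set_a), bit_spectrum(set_b)))
```

Under this definition a value near 0 means the two datasets set each bit at nearly the same rate.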
Selecting Important Features
   Identifying the important features in an ensemble
    −   Each RF model can rank the input features
    −   A consistent ensemble should have similar, but not
        necessarily identical, features highly ranked
    −   Consider the unique set of top N
    −   The size of the unique set of top N features provides
        information about the robustness of the ensemble

    Ranked features from each member of the ensemble (top 4 shown)
     −   Member 1: 4, 3, 15, 23
     −   Member 2: 4, 8, 16, 15
     −   Member 3: 3, 23, 15, 8
     −   Member 4: 4, 5, 28, 23
    Unique set of top 4 features: 3, 4, 5, 8, 15, 16, 23, 28
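
The unique top-N set can be computed with a simple union. The sketch below uses the feature numbers from the slide's example; the function name is illustrative.

```python
def unique_top_features(rankings, n):
    """Union of the n highest-ranked features from each ensemble member.

    rankings: one list per member, features ordered most to least important.
    """
    top = set()
    for ranked in rankings:
        top.update(ranked[:n])
    return sorted(top)

# Ranked features for the four ensemble members in the slide's example.
rankings = [
    [4, 3, 15, 23],
    [4, 8, 16, 15],
    [3, 23, 15, 8],
    [4, 5, 28, 23],
]
unique = unique_top_features(rankings, 4)
print(unique, len(unique))
```

For M members the size of the set lies between N (all members agree) and M x N (no overlap), so a smaller set suggests a more consistent ensemble.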
Important Structural Features
   Taking the 10 most important features from each
    member of the ensemble leads to 43 unique
    important bits
   These bits map to a total of 66 structural features
    −   The toxic compounds are characterized by having a
        slightly larger number of these features, on average
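
A sketch of how the per-class average could be computed, assuming fingerprints are 0/1 lists and counting important bits rather than the structural features they map to. The data below are invented for illustration.

```python
def count_important(fp, important_bits):
    """Number of important bits that are set in one fingerprint."""
    return sum(1 for i in important_bits if fp[i])

def mean_count_by_class(fingerprints, labels, important_bits):
    """Average number of important bits present, per class label."""
    totals, counts = {}, {}
    for fp, label in zip(fingerprints, labels):
        totals[label] = totals.get(label, 0) + count_important(fp, important_bits)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

# Toy 4-bit fingerprints; bits 0, 1 and 3 play the role of important bits.
fps = [[1, 1, 0, 1], [1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
labels = ["toxic", "toxic", "nontoxic", "nontoxic"]
print(mean_count_by_class(fps, labels, [0, 1, 3]))
```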
Predicting Animal Toxicity

   Should we use the cytotoxicity model to predict
    animal toxicity?
   Normalized Manhattan distance = 0.037
Predicting Animal Toxicity
    Performance really depends on the model
     cutoff and our goals

           Nontoxic Toxic
Nontoxic    43072   1683    Cutoff = 0.6, 93% correct
 Toxic      1748     140

           Nontoxic Toxic
Nontoxic    34674   1158    Cutoff = 0.5, 75% correct
 Toxic      10146    665

           Nontoxic Toxic
Nontoxic    20369    587    Cutoff = 0.4, 46% correct
 Toxic      24451   1236
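
The trade-off between cutoffs is easier to see when overall accuracy is set beside the fraction of toxic compounds recovered. The sketch below assumes the columns hold the observed class and the rows the predicted class. That reading is an inference: the column totals here match those of the later ToxNet Mouse/IP table, where the quoted "70% correct on the toxic class" agrees with a column-wise calculation.

```python
def summarize(matrix):
    """Return (overall accuracy, fraction of observed toxics predicted toxic).

    matrix = [[nn, nt], [tn, tt]] laid out as on the slide, with rows taken
    as predicted (nontoxic, toxic) and columns as observed (nontoxic, toxic).
    """
    (nn, nt), (tn, tt) = matrix
    total = nn + nt + tn + tt
    overall = (nn + tt) / total
    toxic_recovered = tt / (nt + tt)
    return overall, toxic_recovered

# Confusion matrices copied from the slide, keyed by model cutoff.
matrices = {
    0.6: [[43072, 1683], [1748, 140]],
    0.5: [[34674, 1158], [10146, 665]],
    0.4: [[20369, 587], [24451, 1236]],
}
for cutoff, m in matrices.items():
    overall, toxic = summarize(m)
    print(f"cutoff {cutoff}: overall {overall:.1%}, toxic class {toxic:.1%}")
```

The overall figures agree with the slide to within about a percentage point. Under the stated orientation, the high overall accuracy at cutoff 0.6 comes with very few toxic compounds being recovered, which is why the choice of cutoff depends on the goal.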
NCGC Toxicity Dataset
   Considered 13 cell lines, pIC50's
   1334 compounds, including
    −   metals
    −   inorganics
   Classified into toxic / nontoxic using a cutoff
    −   mean + 2 * SD
   Built models for each cell line
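
A minimal sketch of the mean + 2 * SD rule for one cell line. It assumes the sample standard deviation and treats values at or above the cutoff as toxic. The slides specify neither detail, and the pIC50 values below are invented.

```python
import statistics

def toxicity_cutoff(pic50_values, n_sd=2.0):
    """Cutoff = mean + n_sd * sample standard deviation of the pIC50 values."""
    mean = statistics.mean(pic50_values)
    sd = statistics.stdev(pic50_values)  # sample SD (assumption)
    return mean + n_sd * sd

def label_by_cutoff(pic50_values, cutoff):
    """Label each compound toxic when its pIC50 is at or above the cutoff."""
    return ["toxic" if v >= cutoff else "nontoxic" for v in pic50_values]

# Hypothetical pIC50 values for one cell line: mostly inactive, one potent.
values = [3.0] * 9 + [8.0]
cutoff = toxicity_cutoff(values)
print(round(cutoff, 3), label_by_cutoff(values, cutoff).count("toxic"))
```

For roughly normally distributed values, only about 2 to 3% of compounds lie above mean + 2 * SD, so a rule of this kind tends to produce a small toxic class.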
NCGC – Class Distributions

   Cutoff values ranged from 3.56 to 4.72
   Classes are severely imbalanced
   Developed ensembles of RF models
NCGC Model ROC Curves
NCGC – Model Performance (Prediction Set)
NCGC – Using the Models
   Predicted toxicity class for the Scripps
    Cytotoxicity dataset (775 compounds) using
    model built for NCGC Jurkat cell line


   Predictions for the Scripps Cytotox dataset, using the
    original cutoffs (32% correct)

         Nontoxic Toxic
Nontoxic   67       49
 Toxic     432     227

   Predictions for the Scripps Cytotox dataset, using the
    NCGC cutoff (75% correct)

         Nontoxic Toxic
Nontoxic   26       90
 Toxic     109     550
Comparing NCGC & Scripps Datasets
   Comparing the datasets as a whole
Comparing NCGC & Scripps Datasets
   Comparing datasets class-wise
Important Features
   We consider the NCGC Jurkat cell line
   Taking the 10 most important features from each
    member of the ensemble leads to 53 unique
    important bits
   These bits map to a total of 72 structural features
    −   The toxic compounds are characterized by having a
        larger number of these features, on average
Feature matches for example structure
   Example structure: Digoxin (structure diagram not reproduced)
   Matched features (SMARTS)
    −   [C;D2](-;@[C,c])-;@[C,c]
    −   [O,o;D2](~[C,c])~[C,c]
    −   [O,o]!@[C,c]@[C,c]@[C,c]!@[C,c]
    −   [*]-;@[A]-;@[A]=;@[A]-;!@[*]
    −   [O,o]~[C,c]~[O,o]~[C,c]~[C,c]~[O,o]
    −   [*]!@[*]@[*]@[*]@[*]!@[*]
    −   [O,o]!@[C,c]@[C,c]@[C,c]@[C,c]!@[O,o]
    −   [*]1@[*]@[*]@[*]@[*]@[*]@1
    −   [O,o,S,s,Se,Te,Po]!@[C,c,Si,si,Ge,Sn,Pb]@[C,c,Si,si,Ge,Sn,Pb]@[C,c,Si,si,Ge,Sn,Pb]
Important Features: Animal Toxicity vs. Cytotoxicity
   The ToxNet (Mouse/IP) and NCGC Jurkat models
    have 130 important features in common
   These features are more common in the NCGC toxic
    compounds than in the NCGC nontoxic compounds
   The average number of these features present in the
    NCGC dataset, overall, is 18.8
    −   Very low, might indicate that the NCGC model is not going to
        be applicable to the ToxNet data
Toxicity vs No. of Features - NCGC Dataset
   Example structures shown: Dactinomycin, Leukerin deriv. (HCl), Daunorubicin
    (structure diagrams and plot not reproduced)
   The important features focus on the toxic class
   No correlation between number of features and pIC50
    for nontoxic class
Predicting Animal Toxicity


         Nontoxic Toxic          Predictions for the ToxNet
Nontoxic 12182     558           Mouse/IP dataset.
 Toxic    32638 1265             29% correct overall.
                                 70% correct on the toxic class


   Overall predictive performance is poor
   Possible causes
    −   Poor sampling of the nontoxics during training
    −   Feature distributions between the two datasets
Feature Distributions – ToxNet vs NCGC
   Panel shown: ToxNet (Mouse/IP) (plot not reproduced)
What's Next?
   Relate structural features to mechanisms of toxicity
   Incorporate these into models / build class models
    −   Different cell-lines vs. animal toxicity
    −   Structural features vs. mechanisms?
   Based on prediction confidence and model
    applicability, can we suggest alternative assays?
   Use the vote fraction & common bit counts to prioritize
    compounds that may be toxic
    −   Improve assessment of model applicability
Summary

   Applying models to predict other datasets is a
    tricky affair
    −   Are the features distributed in a similar manner
        between training data & the new data?
    −   Do toxic/nontoxic labels transfer between datasets?
   More secondary data required
    −   But more data alone is not the whole answer: the NCGC
        dataset is small yet leads to (some) good models
Summary
   Fingerprints may not be the optimal way to get the best
    predictive ability
    −   They do let us look at structural features easily
   We have investigated Molconn-Z descriptors
    −   Preliminary results don't indicate significant improvements
   We cannot globally model animal toxicity based on
    cytotoxicity
    −   Animal data sets are biased to toxic compounds
    −   Different structural classes behave differently (mechanism of
        action, metabolic effects)
Acknowledgements
   MLSCN data sets / PubChem
   NCGC
   Scripps
    −   Screening (Peter Hodder)
    −   Informatics (Nick Tsinoremas, Chris Mader)
    −   Hugh Rosen
   Alex Tropsha, UNC
   Digital Chemistry
   Tudor Oprea, UNM

   NIH
Extras
Structure Sets: Fingerprint Similarity
   Only a small fraction of MLSMR structures are similar to
    ToxNet structures, and vice versa: 4 to 5% of the structures
    in each set have at least one structure in the other set
    that is more than 50% similar
   NCGC structures are much more similar to ToxNet (86% with
    >50% maximum similarity) than MLSMR (9% with >50%
    maximum similarity)
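
The "at least one >50% similar structure" statistic can be sketched as a nearest-neighbour search between two fingerprint sets. The similarity measure is not named on the slide, so the Tanimoto coefficient used here is an assumption, and the fingerprints are toy sets of on-bit indices.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def fraction_with_neighbor(query, reference, threshold=0.5):
    """Fraction of query structures whose maximum similarity to any
    reference structure exceeds the threshold."""
    hits = sum(
        1 for q in query
        if any(tanimoto(q, r) > threshold for r in reference)
    )
    return hits / len(query)

# Toy example: one of the two query fingerprints has a close neighbour.
query = [{1, 2, 3}, {7, 8}]
reference = [{1, 2, 3, 4}, {9}]
print(fraction_with_neighbor(query, reference))
```

The statistic is asymmetric: swapping the query and reference sets generally gives a different fraction, which is why the slide reports it in both directions.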
Important Features - Distributions

   Features shown (SMARTS; distribution plots not reproduced)
    −   [*]1@[*]@[*]@[*]@[*]@[*]@1
    −   [N,n]~[C,c]~[C,c]~[C,c]
    −   [C,c]~[N,n]~[C,c]~[C,c]
    −   [A]=;@[A]-;@[A]-;@[A]=;@[A]
    −   [C,c;D1]-;!@[N,n]
Are The Cytotox Classes Distinct?

   Poor predictive ability may be explained by the
    lack of separation between toxic & nontoxic
   Normalized Manhattan distance = 0.017
Are The Cytotox Classes Distinct?

   But the situation is a little better if we just look
    at the important bits
   Normalized Manhattan distance = 0.06
Standardization Issues - Data
   Extracting data sets out of PubChem requires manual
    curation, post-processing and aggregation of data
    −   No standard measures or column definitions
    −   Activity score and outcome only valid within one experiment
    −   Assay results are not globally comparable
    −   No standardization of assay format (e.g. type, readout, etc.)
    −   Limited ability to query PubChem for specific data sets
         rpubchem package for R is one option
    −   Need better way to access specific bulk data sets
    −   No aggregation of assay (sample) data by compound
   PubChem seems better suited to browsing individual data
    than to accessing large standardized data sets
Model Deployment

   Final models are deployed in our R WS
    infrastructure
    −   Currently the Scripps Jurkat model is available
   Model file can be downloaded
    −   http://www.chembiogrid.org/cheminfo/rws/mlist
   A web page client is available at
    −   http://www.chembiogrid.org/cheminfo/rws/scripps
   Incorporated the model into a Pipeline Pilot
    workflow
Toxicity vs No. of Features - Mouse/IP
   Example structures shown: Batrachotoxin, Ciguatoxin CTX 3C, Conotoxin GI
    (structure diagrams and plot not reproduced)
   No correlation between the number of important
    features and the pLD50 for the toxic class

Contenu connexe

Similaire à Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...Vahid Taslimitehrani
 
Homology Modeling.pptx
Homology Modeling.pptxHomology Modeling.pptx
Homology Modeling.pptxAmnaAkram29
 
Ben Kelty Summer Research Poster Presentation
Ben Kelty Summer Research Poster Presentation Ben Kelty Summer Research Poster Presentation
Ben Kelty Summer Research Poster Presentation Benjamin Kelty
 
Molecular modelling for in silico drug discovery
Molecular modelling for in silico drug discoveryMolecular modelling for in silico drug discovery
Molecular modelling for in silico drug discoveryLee Larcombe
 
Computational Chemistry: From Theory to Practice
Computational Chemistry: From Theory to PracticeComputational Chemistry: From Theory to Practice
Computational Chemistry: From Theory to PracticeDavid Thompson
 
Strategies in Drug Design
Strategies in Drug DesignStrategies in Drug Design
Strategies in Drug DesignAshwani Dhingra
 
Data Mining Protein Structures' Topological Properties to Enhance Contact Ma...
Data Mining Protein Structures' Topological Properties  to Enhance Contact Ma...Data Mining Protein Structures' Topological Properties  to Enhance Contact Ma...
Data Mining Protein Structures' Topological Properties to Enhance Contact Ma...jaumebp
 
Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Alex Clark
 
Increasingly Accurate Representation of Biochemistry (v2)
Increasingly Accurate Representation of Biochemistry (v2)Increasingly Accurate Representation of Biochemistry (v2)
Increasingly Accurate Representation of Biochemistry (v2)Michel Dumontier
 
Computer aided detection of pulmonary nodules using genetic programming
Computer aided detection of pulmonary nodules using genetic programmingComputer aided detection of pulmonary nodules using genetic programming
Computer aided detection of pulmonary nodules using genetic programmingWookjin Choi
 
2015 10-7-11am-reproducible research
2015 10-7-11am-reproducible research2015 10-7-11am-reproducible research
2015 10-7-11am-reproducible researchYannick Wurm
 
Protein Crystallography of Lysozyme
Protein Crystallography of LysozymeProtein Crystallography of Lysozyme
Protein Crystallography of LysozymeGuillermo Llamas
 
CCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression DataCCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression DataIRJET Journal
 
A comparison of three chromatographic retention time prediction models
A comparison of three chromatographic retention time prediction modelsA comparison of three chromatographic retention time prediction models
A comparison of three chromatographic retention time prediction modelsAndrew McEachran
 
Circular Dichroism ppt,
Circular Dichroism ppt, Circular Dichroism ppt,
Circular Dichroism ppt, Manu MS
 
A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...CSCJournals
 

Similaire à Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection (20)

Computational Chemistry Robots
Computational Chemistry RobotsComputational Chemistry Robots
Computational Chemistry Robots
 
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
 
Homology Modeling.pptx
Homology Modeling.pptxHomology Modeling.pptx
Homology Modeling.pptx
 
NMR Chemical Shift Prediction by Atomic Increment-Based Algorithms
Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection

  • 1. Random Forest Ensembles Applied to MLSCN Screening Data for Prediction and Feature Selection Rajarshi Guha School of Informatics Indiana University and Stephan Schürer Scripps, FL 23rd August, 2007
  • 2. Broad Goals  Understand and possibly predict cytotoxicity − Utilizing MLSCN screening data and external data − Characterize and visualize various screening results − Relate screening data to known information  Model and predict acute toxicity in animals − Relate large cytotoxicity data sets to animal toxicity(?)  Modelling protocols to handle the characteristics of HTS data − Large datasets, imbalanced classes, applicability  Make models publicly available − For use in multiple scenarios and accessible by a variety of methods
  • 3. Datasets  Animal Acute Toxicity Data was extracted from the ToxNet database (available from MDL) − Selected only LD50 data for mouse and rat and three routes of administration − Summarized LD50 data by structure, species and route (140,808 LD50 data points, 103,040 structures) − Classified into Toxic/Nontoxic using a cutoff  Cytotoxicity Data was taken as published in PubChem from Scripps and NCGC − Scripps Jurkat cytotoxicity assay (59,805 structures with %Inhib, 801 IC50 values) − NCGC data from PubChem for 13 cell lines (non-MLSMR structures): summarized multiple sample data by unique structures and extracted IC50 data: 1,334 structures, 13 x 1,334 IC50 values for different cell lines
  • 4. Scripps Cytotoxicity Models  57,469 valid structures  775 structures with measured IC50 − Skipped 26 structures that BCI could not parse  How do we model this dataset? − Use all data. Very poor results − Use a sampling procedure to get an ensemble of models − Consider just the 775 structures
  • 5. Scripps Cytotoxicity Models  First considered the 775 structures  Evaluated 1052 bit BCI fingerprints  Selected a cutoff pIC50 − >= 5.5 - toxic − < 5.5 – nontoxic  Used sampling to create 10-member ensemble
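The pIC50 cutoff on slide 5 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes IC50 values are given in micromolar and that pIC50 = -log10(IC50 in mol/L), neither of which is stated on the slides. The function names are hypothetical.

```python
import math

def pic50_from_micromolar(ic50_um):
    """Convert an IC50 in micromolar to pIC50 (assumed: -log10 of the molar IC50)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um * 1e-6)

def classify(pic50, cutoff=5.5):
    """Slide 5 rule: pIC50 >= cutoff is toxic, otherwise nontoxic."""
    return "toxic" if pic50 >= cutoff else "nontoxic"

# Illustrative IC50 values only; not taken from the Scripps data.
for ic50 in (0.5, 1.0, 20.0):
    p = pic50_from_micromolar(ic50)
    print(f"IC50 = {ic50:5.1f} uM  pIC50 = {p:.2f}  {classify(p)}")
```

Passing `cutoff=5.0` gives the rule used later for the expanded dataset on slide 8.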
  • 6. Handling Imbalanced Classes [Diagram. Box labels as extracted: Dataset; Toxic, Nontoxic; PSET, TSET, PSET; TSET-1, TSET-2, ..., TSET-N; RF 1, RF 2, ..., RF N; Preds 1, Preds 2, ..., Preds N; Consensus Predictions]
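The sampling scheme on slide 6 can be sketched roughly as below. The slides do not spell out the exact procedure, so this assumes each training set (TSET-i) pairs an equal number of toxic and nontoxic compounds, with the larger class randomly undersampled, and that the consensus is a simple majority vote with ties going to "toxic". Fitting a random forest on each training set is left out; all names are hypothetical.

```python
import random
from collections import Counter

def balanced_training_sets(toxic, nontoxic, n_members=10, seed=0):
    """Build n_members class-balanced training sets of (compound, label) pairs.

    Assumption: each set draws, without replacement, as many compounds from
    each class as the smaller class contains.
    """
    rng = random.Random(seed)
    n = min(len(toxic), len(nontoxic))
    tsets = []
    for _ in range(n_members):
        tox = rng.sample(toxic, n)
        non = rng.sample(nontoxic, n)
        tsets.append([(c, "toxic") for c in tox] + [(c, "nontoxic") for c in non])
    return tsets

def consensus(member_predictions):
    """Majority vote over one compound's per-member labels (ties go to 'toxic')."""
    counts = Counter(member_predictions)
    return "toxic" if counts["toxic"] >= counts["nontoxic"] else "nontoxic"

# Toy compound identifiers for illustration only.
toxic = [f"T{i}" for i in range(5)]
nontoxic = [f"N{i}" for i in range(50)]
tsets = balanced_training_sets(toxic, nontoxic, n_members=3)
print([len(t) for t in tsets])                      # three sets of 5 toxic + 5 nontoxic
print(consensus(["toxic", "nontoxic", "toxic"]))    # toxic
```

Each training set would then be used to fit one ensemble member, and `consensus` would be applied per compound to the members' predictions on the prediction set.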
  • 7. Scripps Cytotoxicity Models
       % correct (ensemble average) = 69%
       % correct (consensus) = 71%
                   Nontoxic   Toxic
         Nontoxic        39      17
         Toxic           15      41
       Not very good performance
  • 8. Do More Negatives Help?  Include 10,000 structures, randomly selected − Primary data, assumed to be nontoxic  Selected a cutoff pIC50 − >= 5.0 - toxic − < 5.0 – nontoxic  Used sampling to create 10-member ensemble
  • 9. Expanded Cytotoxicity Dataset
       % correct (averaged over the ensemble) = 69%
       % correct (consensus prediction) = 71%
                   Nontoxic   Toxic
         Nontoxic        79      24
         Toxic           35      68
       Not much improvement
       Insufficient sampling of the nontoxics
  • 10. We Need More Positives  The two datasets (775 vs 10,775 compounds) are quite similar in terms of bit spectrum  Normalized Manhattan distance = 0.016
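A distance of this kind between two datasets can be computed along the following lines. The slides do not define the bit spectrum or the normalization, so this sketch assumes the bit spectrum is the fraction of compounds with each fingerprint bit set, and that the Manhattan distance between two spectra is divided by the number of bits. Fingerprints are represented as sets of on-bit indices; the data are toy values.

```python
def bit_spectrum(fingerprints, n_bits):
    """Fraction of compounds with each bit set (assumed definition of the bit spectrum)."""
    counts = [0] * n_bits
    for fp in fingerprints:
        for bit in fp:
            counts[bit] += 1
    n = len(fingerprints)
    return [c / n for c in counts]

def normalized_manhattan(spec_a, spec_b):
    """Sum of absolute per-bit differences, divided by the number of bits."""
    if len(spec_a) != len(spec_b):
        raise ValueError("spectra must have the same length")
    return sum(abs(a - b) for a, b in zip(spec_a, spec_b)) / len(spec_a)

# Two toy datasets with 4-bit fingerprints; illustration only.
set_a = [{0, 1}, {1, 2}, {1}]
set_b = [{0, 1}, {1, 3}]
d = normalized_manhattan(bit_spectrum(set_a, 4), bit_spectrum(set_b, 4))
print(round(d, 3))
```

Under these assumptions the value lies between 0 (identical bit frequencies) and 1 (every bit always set in one dataset and never in the other), so values such as 0.016 indicate very similar bit usage.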
  • 11. Selecting Important Features  Identifying the important features in an ensemble − Each RF model can rank the input features − A consistent ensemble should have similar, but not necessarily identical, features highly ranked − Consider the unique set of top N − The size of the unique set of top N features provides information about the robustness of the ensemble
       Ranked features from each member of the ensemble (one column per member, top 4 shown):
          4    4    3    4
          3    8   23    5
         15   16   15   28
         23   15    8   23
       Unique set of top 4 features: 3 4 5 8 15 16 23 28
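The unique set of top N features can be computed as in this short sketch, using the numbers from the slide's illustration. It reads each column of the figure as one ensemble member's ranking, which is an interpretation of the flattened figure; the function name is hypothetical.

```python
def unique_top_features(rankings, n):
    """Union of the top-n features over all ensemble members.

    rankings: one list per member, features ordered from most to least important.
    """
    unique = set()
    for ranked in rankings:
        unique.update(ranked[:n])
    return sorted(unique)

# Four members, top 4 features each (columns of the slide's illustration).
rankings = [
    [4, 3, 15, 23],
    [4, 8, 16, 15],
    [3, 23, 15, 8],
    [4, 5, 28, 23],
]
top = unique_top_features(rankings, 4)
print(top)        # [3, 4, 5, 8, 15, 16, 23, 28]
print(len(top))   # 8
```

With 4 members and N = 4, the size can range from 4 (all members agree) to 16 (no overlap at all), so a size of 8 indicates partial agreement between members.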
  • 12. Important Structural Features  The 10 most important features for predictive ability across the ensemble leads to 43 unique important bits  This is a total of 66 structural features − The toxic compounds are characterized by having a slightly larger number of these features, on average
  • 13. Predicting Animal Toxicity  Should we use cytotoxicity model to predict animal toxicity?  Normalized Manhattan distance = 0.037
  • 14. Predicting Animal Toxicity
       Performance really depends on the model cutoff and our goals
       Cutoff = 0.6, 93% correct
                   Nontoxic   Toxic
         Nontoxic     43072    1683
         Toxic         1748     140
       Cutoff = 0.5, 75% correct
                   Nontoxic   Toxic
         Nontoxic     34674    1158
         Toxic        10146     665
       Cutoff = 0.4, 46% correct
                   Nontoxic   Toxic
         Nontoxic     20369     587
         Toxic        24451    1236
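The trade-off on slide 14 can be made concrete by recomputing accuracy and toxic-class recall from the three confusion matrices. The slide does not label the matrix axes; this sketch assumes rows are predicted classes and columns are observed classes, which fits the fact that the column totals (44,820 and 1,823) are identical at all three cutoffs.

```python
# Confusion matrices from slide 14, keyed by model cutoff.
# Layout: [[Nontoxic row], [Toxic row]], columns (Nontoxic, Toxic).
# Assumption: rows = predicted class, columns = observed class.
matrices = {
    0.6: [[43072, 1683], [1748, 140]],
    0.5: [[34674, 1158], [10146, 665]],
    0.4: [[20369, 587], [24451, 1236]],
}

def accuracy(m):
    """Fraction of compounds on the diagonal (predicted class equals observed class)."""
    total = sum(sum(row) for row in m)
    return (m[0][0] + m[1][1]) / total

def toxic_recall(m):
    """Fraction of observed toxic compounds that were predicted toxic."""
    return m[1][1] / (m[0][1] + m[1][1])

for cutoff, m in matrices.items():
    print(f"cutoff {cutoff}: accuracy {accuracy(m):.1%}, toxic recall {toxic_recall(m):.1%}")
```

Under that reading, lowering the cutoff from 0.6 to 0.4 reduces overall accuracy from roughly 93% to 46% while raising recall on the toxic class from roughly 8% to 68%, which is why the choice of cutoff depends on the goal.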
  • 15. NCGC Toxicity Dataset  Considered 13 cell lines, pIC50's  1334 compounds, including − metals − inorganics  Classified into toxic / nontoxic using a cutoff − mean + 2 * SD  Built models for each cell line
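The mean + 2 * SD rule from slide 15 can be sketched as below. The slides do not say whether the sample or population standard deviation was used, or whether a value exactly at the cutoff counts as toxic; this sketch assumes the sample SD and ">=". The pIC50 values are made up for illustration.

```python
import statistics

def toxicity_cutoff(pic50_values):
    """Cutoff = mean + 2 * SD of one cell line's pIC50 values (sample SD assumed)."""
    return statistics.mean(pic50_values) + 2 * statistics.stdev(pic50_values)

def classify_cell_line(pic50_values):
    """Return the cutoff and a toxic/nontoxic label for every compound."""
    cutoff = toxicity_cutoff(pic50_values)
    labels = ["toxic" if v >= cutoff else "nontoxic" for v in pic50_values]
    return cutoff, labels

# Made-up pIC50 values for one cell line; not taken from the NCGC data.
values = [3.0, 3.1, 2.9, 3.2, 3.0, 3.1, 2.8, 3.0, 3.1, 6.5]
cutoff, labels = classify_cell_line(values)
print(round(cutoff, 2), labels.count("toxic"), labels.count("nontoxic"))
```

Because the cutoff sits two standard deviations above the mean, only a small tail of compounds ends up in the toxic class, which matches the severe class imbalance noted on slide 16.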
  • 16. NCGC – Class Distributions  Cutoff values ranged from 3.56 to 4.72  Classes are severely imbalanced  Developed ensembles of RF models
  • 17. NCGC Model ROC Curves
  • 18. NCGC Model ROC Curves
  • 19. NCGC – Model Performance (Prediction Set)
  • 20. NCGC – Using the Models
       Predicted toxicity class for the Scripps Cytotoxicity dataset (775 compounds) using model built for NCGC Jurkat cell line
       Predictions for the Scripps Cytotox dataset, using the original cutoffs (32% correct)
                   Nontoxic   Toxic
         Nontoxic        67      49
         Toxic          432     227
       Predictions for the Scripps Cytotox dataset, using the NCGC cutoff (75% correct)
                   Nontoxic   Toxic
         Nontoxic        26      90
         Toxic          109     550
  • 21. Comparing NCGC & Scripps Datasets  Comparing the datasets as a whole
  • 22. Comparing NCGC & Scripps Datasets  Comparing datasets class-wise
  • 23. Important Features  We consider the NCGC Jurkat cell line  The 10 most important features for predictive ability across the ensemble leads to 53 unique important bits  This is a total of 72 structural features − The toxic compounds are characterized by having a larger number of these features, on average
  • 24. Feature matches for example structure
       [Structure shown: Digoxin]
       Matching feature patterns:
       − [C;D2](-;@[C,c])-;@[C,c]
       − [O,o;D2](~[C,c])~[C,c]
       − [O,o]!@[C,c]@[C,c]@[C,c]!@[C,c]
       − [*]-;@[A]-;@[A]=;@[A]-;!@[*]
       − [O,o]~[C,c]~[O,o]~[C,c]~[C,c]~[O,o]
       − [*]!@[*]@[*]@[*]@[*]!@[*]
       − [O,o]!@[C,c]@[C,c]@[C,c]@[C,c]!@[O,o]
       − [*]1@[*]@[*]@[*]@[*]@[*]@1
       − [O,o,S,s,Se,Te,Po]!@[C,c,Si,si,Ge,Sn,Pb]@[C,c,Si,si,Ge,Sn,Pb]@[C,c,Si,si,Ge,Sn,Pb]
  • 25. Important Features Animal Toxicity vs. Cytotoxicity  The ToxNet (Mouse/IP) and NCGC Jurkat models have 130 important features in common  These features are more common in the NCGC toxic compounds than in the NCGC nontoxic compounds  The average number of these features present in the NCGC dataset, overall, is 18.8 − Very low, might indicate that the NCGC model is not going to be applicable to the ToxNet data
  • 26. Toxicity vs No. of Features - NCGC Dataset [Structures shown: Dactinomycin, Leukerin deriv., Daunorubicin]  The important features focus on the toxic class  No correlation between number of features and pIC50 for nontoxic class
  • 27. Predicting Animal Toxicity
       Predictions for the ToxNet Mouse/IP dataset. 29% correct overall. 70% correct on the toxic class
                   Nontoxic   Toxic
         Nontoxic     12182     558
         Toxic        32638    1265
       Overall predictive performance is poor
       Possible causes
       − Poor sampling of the nontoxics during training
       − Feature distributions between the two datasets
  • 28. Feature Distributions – ToxNet vs NCGC ToxNet (Mouse/IP)
  • 29. What's Next?  Relate structural features to mechanisms of toxicity  Incorporate these into models / build class models − Different cell-lines vs. animal toxicity − Structural features vs. mechanisms?  Based on prediction confidence and model applicability, can we suggest alternative assays?  Use the vote fraction & common bit counts to prioritize compounds that may be toxic − Improve assessment of model applicability
  • 30. Summary  Applying models to predict other datasets is a tricky affair − Are the features distributed in a similar manner between training data & the new data? − Do toxic/nontoxic labels transfer between datasets?  More secondary data required − But this is not the final solution since the NCGC dataset is small but leads to (some) good models
  • 31. Summary  Fingerprints may not be the optimal way to get the best predictive ability − They do let us look at structural features easily  We have investigated Molconn-Z descriptors − Preliminary results don't indicate significant improvements  We cannot globally model animal toxicity based on cytotoxicity − Animal data sets are biased to toxic compounds − Different structural classes behave differently (mechanism of action, metabolic effects)
  • 32. Acknowledgements  MLSCN data sets / PubChem  NCGC  Scripps − Screening (Peter Hodder) − Informatics (Nick Tsinoremas, Chris Mader) − Hugh Rosen  Alex Tropsha, UNC  Digital Chemistry  Tudor Oprea, UNM  NIH
  • 34. Structure Sets: Fingerprint Similarity  Only a small fraction of MLSMR structures are similar to ToxNet structures; and vice versa; 4 to 5 % of MLSMR and ToxNet have at least one >50 % similar structure to each other  NCGC structures are much more similar to ToxNet (86% >50 % max similar) than MLSMR (9% >50 % max similar)
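The kind of comparison reported on slide 34 can be sketched as follows: for each structure in one set, find its most similar structure in the other set and count how many exceed 50% similarity. The slides do not name the similarity measure; the Tanimoto coefficient on fingerprint bit sets is assumed here, and the fingerprints are toy data.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def fraction_with_similar_neighbour(query_set, reference_set, threshold=0.5):
    """Fraction of query fingerprints whose best match in reference_set exceeds threshold."""
    hits = sum(
        1 for q in query_set
        if max(tanimoto(q, r) for r in reference_set) > threshold
    )
    return hits / len(query_set)

# Toy fingerprints for illustration only.
queries = [{1, 2, 3}, {7, 8}, {1, 2, 9, 10}]
reference = [{1, 2, 3, 4}, {5, 6}]
print(fraction_with_similar_neighbour(queries, reference))
```

The measure is not symmetric: swapping `queries` and `reference` generally gives a different fraction, which is why the slide reports the comparison in both directions.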
  • 35. Important Features - Distributions [*]1@[*]@[*]@[*]@[*]@[*]@1 [N,n]~[C,c]~[C,c]~[C,c] [C,c]~[N,n]~[C,c]~[C,c] [A]=;@[A]-;@[A]-;@[A]=;@[A] [C,c;D1]-;!@[N,n]
  • 36. Are The Cytotox Classes Distinct?  Poor predictive ability may be explained by the lack of separation between toxic & nontoxic  Normalized Manhattan distance = 0.017
  • 37. Are The Cytotox Classes Distinct?  But the situation is a little better if we just look at the important bits  Normalized Manhattan distance = 0.06
  • 38. Standardization Issues - Data  Extracting data sets out of PubChem requires manual curation, post-processing and aggregation of data − No standard measures or column definitions − Activity score and outcome only valid within one experiment − Assay results are not globally comparable − No standardization of assay format (e.g. type, readout, etc.) − Limited ability to query PubChem for specific data sets (the rpubchem package for R is one option) − Need better way to access specific bulk data sets − No aggregation of assay (sample) data by compound  PubChem seems better suited to browsing individual data than to accessing large standardized data sets
  • 39. Model Deployment  Final models are deployed in our R WS infrastructure − Currently the Scripps Jurkat model is available  Model file can be downloaded − http://www.chembiogrid.org/cheminfo/rws/mlist  A web page client is available at − http://www.chembiogrid.org/cheminfo/rws/scripps  Incorporated the model into a Pipeline Pilot workflow
  • 40. Toxicity vs No. of Features - Mouse/IP [Structures shown: Batrachotoxin, Ciguatoxin CTX 3C, Conotoxin GI]  No correlation between the number of important features and the pLD50 for the toxic class