The Chemistry-to-Protein Relationship Quality
        Challenge: Confounding Linked Data?
                      (Poster, Chris Southan, BioIT Boston, 2012)

                                      Introduction

As evidenced at this meeting, data integration to facilitate the generation of new
knowledge is undergoing a quantum jump, driven by the generation of larger data sets,
expanded computational capacity and semantic web federated queries across linked open
sources.

However, the cloud in this bright future is that molecular mechanistic relationships inferred
from data of equivocal quality can become a house of cards. On a good day, these may
remain local artefacts in the uber-network. On a bad day, the very linking on which utility
depends can propagate errors instantly, remorselessly, globally and permanently.

This poster compares inferred mechanistic mappings between chemical structures and
proteins, both in curated drug databases and large chemogenomic data portals. A
surprising degree of discordance and different error types were found. It could also be
shown that various curatorial and automated parsing errors were being transitively passed
on between databases.

The results are given below as a series of problems that are potentially confounding for
linking between chemistry <> protein databases.
                                                                                            [1]
Problem I: Constitutive Mapping Challenges
We know that mapping between chemicals and proteins is neither pure nor simple. The following is
not even a complete list of what “compound X <> protein Y” relationships can encompass in databases.

•   Binds-to and modulates activity
•   Binds-to with known specificity (e.g. active or allosteric site in PDB)
•   Binds-to with molecular mechanism-of-action (mmoa), e.g. inhibitor, activator, agonist, antagonist
•   Binds-to with quantitative mmoa (Ki, IC50, Kd etc.)
•   Binds-to and is metabolically transformed by (e.g. P450)
•   Binds-to and is transported by (e.g. multidrug resistance-associated protein)
•   Binds-to but no activity modulation (e.g. albumin)
•   X transformation affects binding to Y (e.g. prodrug > drug > salt > metabolite)
•   X is non-canonical (e.g. enantiomers with different affinity for Y)
•   One X to-many proteins (panel screen)
•   Data source ambiguous in description of X (e.g. errors or tautomers)
•   Data source ambiguous in description of Y (e.g. protein ID not resolved)
•   X does not bind Y, thus mmoa is indirect (e.g. up or down regulation of Y)
•   Many compounds to-one Y (a throughput assay)
•   X has relevant linked data in addition to binding Y (e.g. plasma clearance)
•   Y is part of a functional complex (e.g. gamma secretase)
•   X-Y mechanistic coupling at different system levels (e.g. in vitro, in cellulo, in vivo and in clinico)
•   Y is species-specific
•   Y is non-canonical (e.g. splice variant, phosphorylated, activation clipped etc)
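
One way to keep these relationship types from collapsing into a single undifferentiated "X <> Y" link is to store the type explicitly on each record. The following is only a minimal sketch: the class names, field names and the handful of relationship kinds chosen are hypothetical illustrations drawn from the list above, not a schema used by any of the databases discussed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationKind(Enum):
    # A small, illustrative subset of the relationship types listed above.
    MODULATES = "binds-to and modulates activity"
    METABOLISED_BY = "binds-to and is metabolically transformed by"
    TRANSPORTED_BY = "binds-to and is transported by"
    BINDS_NO_MODULATION = "binds-to but no activity modulation"
    INDIRECT = "does not bind; effect on Y is indirect"


@dataclass(frozen=True)
class CompoundProteinLink:
    # Hypothetical record layout; identifiers are free-text placeholders here.
    compound_id: str
    protein_id: str
    kind: RelationKind
    species: Optional[str] = None   # Y may be species-specific
    evidence: Optional[str] = None  # e.g. a PMID or source database name


def is_direct_binding(link: CompoundProteinLink) -> bool:
    """True for every kind except the explicitly indirect one."""
    return link.kind is not RelationKind.INDIRECT


example = CompoundProteinLink("X", "Y", RelationKind.BINDS_NO_MODULATION)
print(example.kind.value, "| direct binding:", is_direct_binding(example))
```

With the type stored per record, a downstream user could filter out indirect or non-modulating links before inferring mechanism, rather than treating every link as a target.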
                                                                                                             [2]
Problem II: The Numbers Don’t Add Up
    A collation of entity and relationship counts between databases and curated sets,
    ranked by compounds-per-protein




•     The order-of-magnitude statistical differences are only partially interpretable
•     No consensus definitions or hierarchies of “target” or “interaction” as concepts
•     Ipso facto curation and/or parsing rules are very different
•     Evidence filtration functionality differs
•     Extraction substrates mostly similar (e.g. journals, PubMed and other dbs)
•     Explicit but also cryptic circularity (e.g. large dbs subsuming smaller dbs)
                                                                                          [3]
Problem III: Differential Chemistry Capture
•     We can compare the two premier academic drug mapping resources, DrugBank and
      Therapeutic Target Database, in principle having convergent capture concepts.
•     Both use expert curation teams to extract from the same primary data corpora.
•     The intra-PubChem comparison of chemical content (at the CID level) is shown below
    DB = 6720     TTD = 14631     Union = 19803     Intersect = 1548




•     Results show very different capture (e.g. union is over 10x larger than the intersect)
•     Some of this is explicable (e.g. DB’s historical emphasis on PDB ligands and TTD picking up
      BioAssayed compounds from ChEMBL) but reasons for other differences are less clear.
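
The arithmetic behind the counts above can be checked with a short sketch. It uses only the four CID counts reported on this slide; the variable names and the choice of the Jaccard index as a summary statistic are illustrative additions, not part of the original analysis.

```python
# CID counts as reported on the poster (intra-PubChem comparison).
drugbank = 6720
ttd = 14631
intersect = 1548

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
union = drugbank + ttd - intersect

# Overlap statistics (Jaccard index = intersect / union).
jaccard = intersect / union
union_to_intersect = union / intersect

print(f"union = {union}")
print(f"union / intersect = {union_to_intersect:.1f}x")
print(f"Jaccard index = {jaccard:.3f}")
print(f"DrugBank CIDs also in TTD = {intersect / drugbank:.1%}")
print(f"TTD CIDs also in DrugBank = {intersect / ttd:.1%}")
```

The computed union matches the reported 19803, so the four numbers are internally consistent, and the union-to-intersect ratio comes out near 12.8.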
                                                                                                      [4]
Problem IV: Differential Target Capture

                                        •   The Venn compares DrugBank with TTD
                                            and a re-curated DrugBank sub-set (R-An,
                                            “Trends in the exploitation of novel
                                            drug targets”, 2011, PMID: 21804595)

                                        •   While there are caveats related to set
                                            definitions, species filters and protein ID
                                            cross-mappings, the differential capture of
                                            the three manually curated sets is clear

                                        •   The intersect at only 170 human UniProt
                                            IDs is ~ ½ the expected primary targets

• Some of this is explicable (i.e. R-An picking up new targets) but the cause of
  other differences is unclear

•   Over 900 targets (this comparison excluded enzymes and transporters) are
    unique to DrugBank so their curatorial rules are clearly different

                                                                                   [5]
Problem V: Large chemistry <> protein Dbs
 • Leading expert teams and significant resources
 • Overlaps in concepts and utility
 • Differences in approaches and technical implementation




                                                            [6]
Problem V (ctd): Too Large to Verify but too
           Divergent to Trust?

                             •   Comparing atorvastatin <>
                                 proteins in four large-scale
                                 Dbs

                             •   The 4-database intersect is
                                 only 8 from 143

                             •   6 of these are probably
                                 indirect (no binding) and
                                 mechanistically unclear

                             •   Significant database-unique
                                 capture (e.g. CTD)

                             •   There are caveats with these
                                 exact numbers because they
                                 depend on protein database
                                 x-mappings
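
A comparison of this kind can be sketched as a multi-way set operation over protein identifiers. The function name and the toy identifiers below are hypothetical placeholders, not the atorvastatin protein lists behind the figures on this slide, and the sketch assumes all four sources have already been cross-mapped to one identifier scheme (which, as noted above, is itself a source of uncertainty).

```python
from collections import Counter


def overlap_summary(sets_by_db):
    """Return (union, intersect, database-unique IDs) for named ID sets."""
    all_sets = list(sets_by_db.values())
    union = set().union(*all_sets)
    intersect = set.intersection(*all_sets) if all_sets else set()
    # Count in how many databases each identifier appears.
    support = Counter(pid for ids in all_sets for pid in ids)
    unique = {
        db: {pid for pid in ids if support[pid] == 1}
        for db, ids in sets_by_db.items()
    }
    return union, intersect, unique


# Toy placeholder data only; real input would be cross-mapped protein IDs.
toy = {
    "db_a": {"P1", "P2", "P3"},
    "db_b": {"P1", "P2", "P4"},
    "db_c": {"P1", "P5"},
    "db_d": {"P1", "P2", "P6", "P7"},
}

union, intersect, unique = overlap_summary(toy)
print(f"intersect is {len(intersect)} from {len(union)}")
for db, ids in unique.items():
    print(db, "unique:", sorted(ids))
```

Reporting the database-unique sets alongside the intersect makes it easier to see which source is responsible for most of the divergence.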


                                                                [7]
Problem VI: Whose curation is “correct”?




•   Protein <> atorvastatin results, automated vs curated (ChEMBL and DrugBank)
•   Sum is the proteins from the four dbs in the previous slide
•   Consensus is only HMGCR and cytochrome P450 3A4
•   Unique capture of transporters and metabolic enzymes by DrugBank
•   Targets unique to DrugBank: human dipeptidyl peptidase 4, aryl hydrocarbon
    receptor
•   Targets unique to ChEMBL: cruzipain, pig dipeptidyl peptidase 4
                                                                               [8]
Problem VII. The PDB Hetero Entry Trap:
      False Drug/ligands and False Targets

E.g. STITCH makes high-scoring links from DPPIV to galactose and fucose




                                                                         [9]
Problem VII ctd. STITCH X-refs the Same Errors
      in DrugBank that Passed them to PubChem




DrugBank links to the wrong sugar isomer as CID 671379 and
PubChem inherited the 40 targets in the “Biomolecular
Interactions and Pathways” field. The DB entry is now deprecated.



                                                             [10]
Problem VII ctd. Mixed mappings of the
“Wrong” and “Right” (drug-relevant) Ligands




                            Most of the mappings above are
                            “right”; the one on the left is “wrong”
                            (the sugar is in the crystal but is not a
                            ligand or a drug in this context)




                                                             [11]
Problem VIII: False-negatives




• This clinically significant inferred interaction is missed by (all?) Dbs

• A guess is that neither text mining nor curation rules (as implemented in the 7
  dbs checked here) connected the individual drug names to the general case
  triple “statins-inhibited-PAR-1”

• We can grapple with false-positives via filtration rules and heuristic tuning but
  false-negatives are a more difficult and potentially more serious problem
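
The gap described above, where a class-level statement never reaches the individual drug records, can be illustrated with a small sketch that expands a class-level triple into per-member candidate triples. The function name, the membership table and the provenance labels are assumptions made for illustration; none of the databases checked is known to work this way, and the expanded triples are candidates for curator review, not asserted facts.

```python
def expand_class_triples(triples, class_members):
    """Expand (subject, predicate, target) triples whose subject is a drug class.

    Returns 4-tuples with a provenance label so that inferred links are never
    confused with directly asserted ones.
    """
    expanded = []
    for subject, predicate, target in triples:
        members = class_members.get(subject)
        if members is None:
            # Subject is not a known class: keep the triple as asserted.
            expanded.append((subject, predicate, target, "asserted"))
        else:
            for member in sorted(members):
                label = f"inferred-from-class:{subject}"
                expanded.append((member, predicate, target, label))
    return expanded


# Hypothetical class membership table (a real one would come from a curated ontology).
class_members = {"statins": {"atorvastatin", "rosuvastatin", "simvastatin"}}

# The general-case triple quoted on this slide.
triples = [("statins", "inhibited", "PAR-1")]

for row in expand_class_triples(triples, class_members):
    print(row)
```

Keeping the provenance label on each expanded triple is the key design choice here: it lets a quality filter or a curator treat class-derived links more cautiously than links backed by compound-specific evidence.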
                                                                                      [12]
Ameliorating the Problems
•   Avoid “brainless parsing” and go for precision over recall
•   Make circularity explicit (e.g. dbs within dbs and curatorial recycling)
•   Refresh and update cross-links between dbs
•   Define biochemical and pharmacological relationships
•   Rigorous and deep QC (e.g. actually eyeball records)
•   Referential integrity checks (e.g. spot orphaned entities)
•   Display relationship distributions, inspect the extreme tails and attempt
    to understand them
•   Document curatorial practice (e.g. equivocality handling rules)
•   Facilitate annotation judgments and quality-based filtration (i.e.
    curatorial empowerment)
•   Consider canonical merging of chemical structures with multiplexed
    bioactivity mappings
•   Crowdsourcing (e.g. DrugBank comments > fixes and deprecations)
•   Encourage author mark-up at source (e.g. MIABE, PMID: 21878981)
•   “But wait, hold on – did anyone peer review the database?”
    (Williams and Ekins 2012 ACS presentation)
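
Two of the suggestions above, referential integrity checks and inspecting the tails of relationship distributions, lend themselves to a short sketch. The function names and toy records are hypothetical; a real check would run against the database's own compound, protein and link tables.

```python
from collections import Counter


def find_orphans(compounds, proteins, links):
    """Spot entities with no links and links pointing at unknown entities."""
    linked_compounds = {c for c, _ in links}
    linked_proteins = {p for _, p in links}
    return {
        "compounds_without_links": compounds - linked_compounds,
        "proteins_without_links": proteins - linked_proteins,
        "links_to_unknown_compounds": linked_compounds - compounds,
        "links_to_unknown_proteins": linked_proteins - proteins,
    }


def compounds_per_protein(links):
    """Count distinct compounds linked to each protein."""
    return Counter(p for _, p in set(links))


# Toy placeholder records, not real database content.
compounds = {"C1", "C2", "C3", "C4"}
proteins = {"P1", "P2", "P3"}
links = [("C1", "P1"), ("C2", "P1"), ("C3", "P1"), ("C1", "P2"), ("C9", "P2")]

for name, ids in find_orphans(compounds, proteins, links).items():
    print(name, sorted(ids))

# The extreme tail of the distribution is where manual inspection pays off first.
print(compounds_per_protein(links).most_common(1))
```

Records flagged by either check are candidates for the "actually eyeball records" step, not automatic deletions.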
                                                                                [13]
Conclusions
• Linked Open Data is the new mining rock and roll; but...
• Even just chemistry <> protein is subject to the caveats in this poster (and
  more besides)
• At the very least circumspection is needed if inferences from database
  linking are to be acted upon, validated and exploited
• In the end, nothing saves us from database quality so this has to be
  addressed by all of us

Dr Christopher Southan
ChrisDS Consulting:
http://www.cdsouthan.info/Consult/CDS_cons.htm
Email: cdsouthan@hotmail.com
Twitter: @cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIN: http://www.linkedin.com/in/cdsouthan
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan
                                                                             [14]

Chemistry-to-Protein Relastionship Quality

  • 1. The Chemistry-to-Protein Relationship Quality Challenge: Confounding Linked Data? (Poster, Chris Southan, BioIT Boston, 2012) Introduction As evidenced from this meeting data integration to facilitate the generation of new knowledge is undergoing a quantum jump driven by the generation of larger data sets, expanded computational capacity and semantic web federated queries across linked open sources. However, the cloud in this bright future is that molecular mechanistic relationships inferred from data of equivocal quality can become a house of cards. On a good day, these may remain local artefacts in the uber-network. On a bad day, the very linking on which utility depends can propagate errors instantly, remorselessly, globally and permanently. This poster compares inferred mechanistic mappings between chemical structures and proteins, both in curated drug databases and large chemogenomic data portals. A surprising degree of discordance and different error types were found. It could also be shown that various curatorial and automated parsing errors were being transitively passed on between databases. The results are given below as a series of problems that are potentially confounding for linking between chemistry <> protein databases. [1]
  • 2. Problem I: Constitutive Mapping Challenges We know mapping between chemicals and proteins is neither pure nor simple. This is not even a complete list of what ”compound X <> protein Y ” relationships can encompass in databases. • Binds-to and modulates activity • Binds-to with known specificity (e.g. active or allosteric site in PDB) • Binds-to with molecular mechanism-of-action (mmoa) inhibitor, activator, agonist, antagonist • Binds-to with quantiative mmo (Ki, IC50, Kd etc) • Binds-to and is metabolicaly transformed by (e.g. P450) • Binds-to and is transported by (e.g. multidrug resistance-associated protein) • Binds-to but no activity modulation (e.g. albumin) • X transformation affects binding to Y (e.g. prodrug > drug > salt > metabolite) • X is non-canonical (e.g. enatiomers with different affinity for Y) • One X to-many proteins (panel screen) • Data source ambigous in description of X (e.g. errors or tautomers) • Data source ambigous in description of Y (e.g. protein ID not resolved) • X does not bind Y, thus mmmo is indirect (e.g. up or down regulation of Y) • Many cpds to-one Y (a throughput assay) • X has relevant linked data in addtion to binding Y (e.g. plasma clearance) • Y is part of a functional complex (e.g. gamma secretase) • X-Y mechanistic coupling at different system levels (e.g. in vitro, in celluo, in vivo and in clinico) • Y is species-specific • Y is non-canonical (e.g. splice variant, phosphorylated, activation clipped etc) [2]
  • 3. Problem II: The Numbers Don’t Add Up A collation of entity and relatishionship counts between databases and curated sets, ranked by compounds-per-protein • The statistical differences in orders of magnitude are only partialy intepretable • No concencus defintions or heirachies of ”target” or ”interaction” as concepts • Ipso facto curation and/or parsing rules are very different • Evidence filtration functionality different • Extraction substrates mostly simillar (e.g. Journals, PubMed and other dbs) • Explicit but also cryptic circularity (e.g. large dbs subsuming smaller dbs) [3]
  • 4. Problem III: Differential Chemistry Capture
      • We can compare the two premier academic drug mapping resources, DrugBank (DB) and the Therapeutic Target Database (TTD), which in principle have convergent capture concepts.
      • Both use expert curation teams to extract from the same primary data corpora.
      • The intra-PubChem comparison of chemical content (at the CID level) gives:
          DB = 6720
          TTD = 14631
          Union = 19803
          Intersect = 1548
      • The results show very different capture (e.g. the union is over 10x larger than the intersect).
      • Some of this is explicable (e.g. DB’s historical emphasis on PDB ligands and TTD picking up BioAssayed compounds from ChEMBL), but the reasons for other differences are less clear.
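The four counts on this slide can be cross-checked with simple set arithmetic. The sketch below uses only the numbers reported above; the variable names and the derived overlap fractions are illustrative additions, not figures from the poster.

```python
# CID-level counts as reported on the slide.
db = 6720          # DrugBank CIDs in PubChem
ttd = 14631        # Therapeutic Target Database CIDs in PubChem
intersect = 1548   # CIDs present in both sources

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
union = db + ttd - intersect

# How much larger the union is than the shared core.
ratio = union / intersect

# Fraction of each source that the other source also captured.
db_shared = intersect / db
ttd_shared = intersect / ttd

print(f"union = {union}, union/intersect = {ratio:.1f}")
print(f"shared fraction: DB {db_shared:.0%}, TTD {ttd_shared:.0%}")
```

The computed union matches the reported 19803, and the shared core amounts to roughly a quarter of DrugBank and about a tenth of TTD, which is the discordance the slide points to.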
  • 5. Problem IV: Differential Target Capture
      • The Venn diagram compares DrugBank with TTD and a re-curated DrugBank subset (R-An, “Trends in the exploitation of novel drug targets”, 2011, PMID: 21804595).
      • While there are caveats related to set definitions, species filters and protein ID cross-mappings, the differential capture of the three manually curated sets is clear.
      • The intersect, at only 170 human UniProt IDs, is ~½ the expected number of primary targets.
      • Some of this is explicable (i.e. R-An picking up new targets), but the causes of other differences are unclear.
      • Over 900 targets (this comparison excluded enzymes and transporters) are unique to DrugBank, so their curatorial rules are clearly different.
  • 6. Problem V: Large Chemistry <> Protein Databases
      • Leading expert teams and significant resources
      • Overlaps in concepts and utility
      • Differences in approaches and technical implementation
  • 7. Problem V (ctd): Too Large to Verify but Too Divergent to Trust?
      • Comparing atorvastatin <> protein mappings in four large-scale databases
      • The four-database intersect is only 8 proteins out of 143
      • 6 of these are probably indirect (no binding) and mechanistically unclear
      • Significant database-unique capture (e.g. CTD)
      • There are caveats with these exact numbers because they depend on protein database cross-mappings
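The kind of comparison summarised on this slide (union, all-database intersect, and database-unique capture) can be sketched as below. This is a minimal illustration only: the database names and protein identifiers are made-up placeholders, not the poster's data, and real use would first require resolving every source to a common protein identifier, which is exactly the cross-mapping caveat noted above.

```python
def overlap_report(mappings):
    """Summarise overlap between per-database protein sets for one compound.

    mappings: dict of database name -> set of protein identifiers.
    Returns (union, consensus, unique), where `unique` maps each database
    to the proteins that no other database reports.
    """
    sets = list(mappings.values())
    union = set().union(*sets)
    consensus = set.intersection(*sets)
    unique = {}
    for name, proteins in mappings.items():
        others = set().union(*(s for n, s in mappings.items() if n != name))
        unique[name] = proteins - others
    return union, consensus, unique


# Placeholder data (hypothetical identifiers, for illustration only).
toy = {
    "db_a": {"P1", "P2", "P3"},
    "db_b": {"P1", "P2", "P4"},
    "db_c": {"P1", "P5"},
    "db_d": {"P1", "P2", "P6", "P7"},
}
union, consensus, unique = overlap_report(toy)
print(len(union), sorted(consensus), {k: sorted(v) for k, v in unique.items()})
```

Reporting the database-unique sets alongside the consensus makes it easier to see which source is responsible for the extreme tail of a mapping.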
  • 8. Problem VI: Whose Curation is “Correct”?
      • Protein <> atorvastatin results, automated vs curated (ChEMBL and DrugBank)
      • “Sum” is the set of proteins from the four databases on the previous slide
      • The consensus is only HMGCR and cytochrome P450 3A4 (CYP3A4)
      • Unique capture of transporters and metabolic enzymes by DrugBank
      • Targets unique to DrugBank: human dipeptidyl peptidase 4, aryl hydrocarbon receptor
      • Targets unique to ChEMBL: cruzipain, pig dipeptidyl peptidase 4
  • 9. Problem VII: The PDB Hetero Entry Trap: False Drugs/Ligands and False Targets
    E.g. STITCH makes high-scoring links from DPPIV to galactose and fucose.
  • 10. Problem VII (ctd): STITCH Cross-references the Same Errors in DrugBank, Which Passed Them to PubChem
    DrugBank links to the wrong sugar isomer as CID 671379, and PubChem inherited the 40 targets in the “Biomolecular Interactions and Pathways” field. The DrugBank entry is now deprecated.
  • 11. Problem VII (ctd): Mixed Mappings of the “Wrong” and “Right” (Drug-relevant) Ligands
    Most of the mappings above are “right”; the one on the left is “wrong” (the sugar is in the crystal but is not a ligand or a drug in this context).
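One way to guard against the hetero entry trap is to route suspicious structure-derived links to manual review instead of loading them as compound-target relationships. The sketch below is an assumption-laden illustration, not a description of how any of the databases named here work: the review list is a small hand-picked example of PDB chemical component codes (water, sulfate, glycerol and three sugars), `DRUG_X` is a placeholder for a genuine ligand, and since the same sugar can be a real ligand in another context the function flags links for inspection and deletes nothing.

```python
# Illustrative review list of PDB het codes that often appear in crystals
# as solvent, additives or glycosylation rather than as drug-like ligands.
# This list is an assumption for the example, not a curated standard.
REVIEW_HET_CODES = {"HOH", "SO4", "GOL", "NAG", "GAL", "FUC"}


def split_het_links(links):
    """Split (het_code, protein) links into accepted and needs-review lists."""
    keep, review = [], []
    for het_code, protein in links:
        bucket = review if het_code.upper() in REVIEW_HET_CODES else keep
        bucket.append((het_code, protein))
    return keep, review


# Placeholder links: DRUG_X stands in for a drug-relevant ligand.
links = [("DRUG_X", "DPP4"), ("GAL", "DPP4"), ("FUC", "DPP4")]
keep, review = split_het_links(links)
print("keep:", keep)
print("review:", review)
```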
  • 12. Problem VIII: False-negatives
      • This clinically significant inferred interaction is missed by (all?) databases
      • A guess is that neither text mining nor curation rules (as implemented in the 7 databases checked here) connected the individual drug names to the general-case triple “statins-inhibited-PAR-1”
      • We can grapple with false-positives via filtration rules and heuristic tuning, but false-negatives are a more difficult and potentially more serious problem
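The missing step guessed at above, connecting a class-level statement to the individual drugs in that class, can be sketched as a simple expansion. Everything here is hypothetical scaffolding: the class membership table, the `inhibits` predicate label and the provenance strings are assumptions for the example, and whether every member of a class really shares the reported effect remains a curatorial judgment, which is why expanded rows are marked as inferred.

```python
# Hypothetical class membership table (three well-known statins as examples).
CLASS_MEMBERS = {
    "statins": ["atorvastatin", "simvastatin", "rosuvastatin"],
}


def expand_class_triples(triples, class_members):
    """Expand class-level (subject, predicate, object) triples to members.

    Triples whose subject is not a known class pass through as asserted;
    expanded triples carry a provenance note so they can be filtered later.
    """
    expanded = []
    for subject, predicate, obj in triples:
        members = class_members.get(subject)
        if members is None:
            expanded.append((subject, predicate, obj, "asserted"))
            continue
        for member in members:
            note = f"inferred from class '{subject}'"
            expanded.append((member, predicate, obj, note))
    return expanded


rows = expand_class_triples([("statins", "inhibits", "PAR-1")], CLASS_MEMBERS)
for row in rows:
    print(row)
```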
  • 13. Ameliorating the Problems
      • Avoid “brainless parsing” and go for precision over recall
      • Make circularity explicit (e.g. databases within databases and curatorial recycling)
      • Refresh and update cross-links between databases
      • Define biochemical and pharmacological relationships
      • Rigorous and deep QC (e.g. actually eyeball records)
      • Referential integrity checks (e.g. spot orphaned entities)
      • Display relationship distributions, inspect the extreme tails and attempt to understand them
      • Document curatorial practice (e.g. equivocality handling rules)
      • Facilitate annotation judgments and quality-based filtration (i.e. curatorial empowerment)
      • Consider canonical merging of chemical structures with multiplexed bioactivity mappings
      • Crowdsourcing (e.g. DrugBank comments > fixes and deprecations)
      • Encourage author mark-up at source (i.e. MIABE, PMID: 21878981)
      • “But wait, hold on – did anyone peer review the database?” (Williams and Ekins, 2012 ACS presentation)
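Two of the recommendations above, referential integrity checks and inspecting relationship distributions, lend themselves to a short sketch. It assumes a deliberately simplified model in which relationships are (compound, protein) pairs and the entity tables are plain sets; the identifiers are placeholders and the function names are invented for this example.

```python
from collections import Counter


def find_orphans(relationships, compounds, proteins):
    """Return identifiers used in relationships but absent from entity tables."""
    orphan_compounds = {c for c, _ in relationships if c not in compounds}
    orphan_proteins = {p for _, p in relationships if p not in proteins}
    return orphan_compounds, orphan_proteins


def compounds_per_protein(relationships):
    """Count distinct compounds per protein, most heavily mapped first."""
    counts = Counter()
    for _, protein in set(relationships):  # de-duplicate repeated pairs
        counts[protein] += 1
    return counts.most_common()


# Placeholder data for illustration only.
compounds = {"C1", "C2", "C3"}
proteins = {"P1", "P2"}
relationships = [
    ("C1", "P1"), ("C2", "P1"), ("C3", "P1"),
    ("C1", "P2"), ("C9", "P2"),   # C9 is not in the compound table
    ("C2", "P7"),                 # P7 is not in the protein table
]

orphan_compounds, orphan_proteins = find_orphans(relationships, compounds, proteins)
ranking = compounds_per_protein(relationships)
print(orphan_compounds, orphan_proteins, ranking)
```

The head and tail of the ranked list are the records worth eyeballing first: a protein with a very large number of compounds, or one that appears only via an orphaned identifier, is a likely parsing or cross-mapping artefact.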
  • 14. Conclusions
      • Linked Open Data is the new mining rock and roll, but...
      • Even just chemistry <> protein is subject to the caveats in this poster (and more besides)
      • At the very least, circumspection is needed if inferences from database linking are to be acted upon, validated and exploited
      • In the end, nothing saves us from database quality, so this has to be addressed by all of us

    Dr Christopher Southan
    ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
    Email: cdsouthan@hotmail.com
    Twitter: @cdsouthan
    Blog: http://cdsouthan.blogspot.com/
    LinkedIn: http://www.linkedin.com/in/cdsouthan
    Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
    Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
    Presentations: http://www.slideshare.net/cdsouthan

Editor's Notes

  1. Over