This document discusses how advanced analytics and big data techniques can be applied in the chemistry industry. It provides examples of how matched molecular pair analysis has been used to extract statistically valid structure-activity relationships from large datasets and summarize them in the form of transformation rules. These rules have helped suggest new molecules, explore structure-activity relationships, identify exceptional structure-property relationships, and enable the rapid optimization of drug candidates. The document argues that combining data from multiple sources yields more comprehensive rules and that interfaces must be designed with the intended users in mind.
SCI What can Big Data do for Chemistry 2017 MedChemica
1. MedChemica
What have we done? What could we do with Advanced Analytics in the Chemistry Industry?
Ed Griffen
MedChemica Ltd
2. MedChemica
Big Data – Focus on Benefits not Features
From the Gartner IT Glossary:
What is Big Data?
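Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
Features: high-volume, high-velocity and/or high-variety information assets
Benefits: enhanced insight, decision making, and process automation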
3. MedChemica
Where is Big Data proving most Successful?
• Customer analysis
• Targeted advertising
• Language translation
• What do these have in common?
• Underlying theoretical model insufficiently accurate or unknown
• Very, very large data sets
• Straightforward statistical methods
• Most users are unskilled and not interested in mechanics
4. MedChemica
What are the classes of chemical problem?
‘Potency’:
• Lead finding
• Potency improvement
Properties:
• Pharmacokinetics
• Solubility
• Off target toxicity
Production:
• First successful route
• ‘Best’ route
Patents:
• Freedom to Operate
Outcomes: Product Size, Duration of action, Safety margin, Speed, Cost, Commercial Position
Common to all
• Pharmaceuticals
• Agrochemicals
• Flavors and Fragrances
• Consumer products
• Materials science
• Underlying theoretical model insufficiently accurate or unknown
• Very, very large data sets
• Straightforward statistical methods
• Most users are unskilled and not interested in mechanics
5. MedChemica
‘Big Data’ analysis for Chemistry
Making and testing compounds is expensive!
• No new compounds to make
• No new testing to do
• Exploit the compounds and data you’ve already paid for
• Accelerate all new projects
• Augment the skills and experience of your chemists
• Mythbusting…
All very cost effective
6. MedChemica
Help the HiPPOs – or they’ll crush you
“Companies often make most of their important decisions by relying on “HiPPO” – the highest-paid person’s opinion.”1
1. McAfee & Brynjolfsson, “Big Data: The Management Revolution”, Harvard Business Review, October 2012
Chemistry HiPPs:
• experts in pattern recognition
• judged on their ability to make the best decisions with partial data
• highly trained
• time poor
• delivery focused
• gatekeepers to the adoption of new approaches
7. MedChemica
Making a real textbook of Medicinal Chemistry
[Diagram: ADMET data from multiple Pharma companies → MMPA on each dataset → Combine and Extract Rules → >437000 rules → Better Project decisions and Increased Medicinal Chemistry learning]
Kramer, Robb, Ting, Zheng, Griffen, et al: J. Med. Chem 2017
http://pubs.acs.org/doi/10.1021/acs.jmedchem.7b00935
‘Potency’: Lead finding, Potency improvement
Properties: Pharmacokinetics, Solubility, Off target toxicity
Production: First successful route, ‘Best’ route
Patents: Freedom to Operate
8. MedChemica
Making the complicated simple: HOT-Fit
Learning from the development of clinical decision support software
Technology: Algorithms, Data, Speed
Human: System Use, User Satisfaction
Organization: Structure, Environment
Together these determine the Benefits
E. Kilsdonk, L.W. Peute, M.W.M. Jaspers, Factors Influencing Implementation Success of Guideline-based Clinical Decision Support Systems: a systematic review and gaps analysis, International Journal of Medical Informatics
http://dx.doi.org/10.1016/j.ijmedinf.2016.12.001
9. MedChemica
Chemistry Knowledge extraction methods
Remember: your HiPPO needs to understand
[Figure: knowledge extraction approaches arranged by descriptor type and method. Shading labels: Dark, Black]
Descriptors: substructures; physical chemistry descriptors (Hansch, Taft, Fujita, Abraham); atomic, pair, triplet descriptors; indices
Method: counts & descriptive statistics; MMPA; (M)LR; Free Wilson; PLS; Trees / Forests; SVM; Bayesian NN; Deep Learning
It’s a summit – but what else is out there?
10. MedChemica
Advanced MMPA with MCPairs
• Matched Molecular Pairs – molecules that differ only by a particular, well-defined structural transformation
Griffen, E. et al. J. Med. Chem. 2011, 54(22), pp. 7739-7750.
• Transformation with environment capture – MMPs can be recorded as transformations from A → B, together with Δ Data A-B
[Figure: example transformation A → B with numbered atom environments]
Environment is key - must be captured in the chemical encoding
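As a rough illustration of recording pairs as transformations with their environment, the sketch below indexes compounds by their constant part and pairs those that share it. It is a toy, not MCPairs itself: the fragmentations, compound names and measured values are hypothetical, the environment is simplified to the whole constant part, and Δ is taken here as B minus A purely as a convention for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, pre-computed single-cut fragmentations: each compound is
# listed as (constant part with attachment point "*", variable fragment).
# A real workflow would generate these with a cheminformatics toolkit.
FRAGMENTATIONS = {
    "cpd1": [("*c1ccccc1", "[H]")],
    "cpd2": [("*c1ccccc1", "C")],
    "cpd3": [("*c1ccccc1", "F")],
    "cpd4": [("*C1CCCCC1", "[H]")],
    "cpd5": [("*C1CCCCC1", "C")],
}

# Hypothetical measured values (for example log solubility) per compound.
MEASURED = {"cpd1": -2.0, "cpd2": -2.6, "cpd3": -2.3, "cpd4": -3.0, "cpd5": -3.2}


def find_pairs(fragmentations, measured):
    """Index compounds by constant part, then pair those that share it."""
    index = defaultdict(list)
    for cpd, cuts in fragmentations.items():
        for constant, variable in cuts:
            index[constant].append((cpd, variable))

    pairs = []
    for constant, members in index.items():
        for (cpd_a, frag_a), (cpd_b, frag_b) in combinations(members, 2):
            if frag_a == frag_b:
                continue  # identical fragments are not a transformation
            pairs.append({
                "transformation": f"{frag_a}>>{frag_b}",
                "environment": constant,  # simplified environment capture
                "A": cpd_a,
                "B": cpd_b,
                # Assumed convention for this sketch: delta = B minus A.
                "delta": round(measured[cpd_b] - measured[cpd_a], 3),
            })
    return pairs


if __name__ == "__main__":
    for pair in find_pairs(FRAGMENTATIONS, MEASURED):
        print(pair)
```

In this toy data the same `[H]>>C` transformation gives a different Δ on the aromatic and the saturated ring, which is the reason the environment has to travel with the transformation.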
11. MedChemica
MMPA: Environment really matters
H→Me:
• Median Δlog(Solubility)
• 225 different environments
H→Me:
• Median Δlog(Clint), human microsomal clearance
• 278 different environments
[Figure annotations: 2.5 log, 1.5 log]
12. MedChemica
Matched Molecular Pair methods matter
If you don’t use both you’ll miss 12-56% of the pairs
2 Methods:
• Maximum Common SubStructure (MCSS) – Warner, Sheridan. Strengths: ring replacement, macrocycle ring pairs
• Fragment and Index (FI) – Hussain & Rea. Strengths: linker and core swaps
[Figure: fF+I vs fMCSS (axes 0.1 to 0.9) for the EGF, D1 and Cav3.2 datasets]
Leach et al. J. Chem. Inf. Model. 2017 http://dx.doi.org/10.1021/acs.jcim.7b00335
13. MedChemica
Rule selection
• Identify and group matching SMIRKS
• Calculate statistical parameters for each unique SMIRKS (n, median, sd, se, n_up/n_down)
• Is n ≥ 6?
 - fail: not enough data, ignore transformation
 - pass: continue
• Is the |median| ≤ 0.05 and the intercentile range (10-90%) ≤ 0.3?
 - yes: transformation is classified as ‘neutral’
 - no: continue
• Perform two-tailed binomial test on the transformation to determine the significance of the up/down frequency
 - significant: transformation classified as ‘increase’ or ‘decrease’ depending on which direction the property is changing
 - not significant: transformation classified as ‘NED’ (No Effect Determined)
[Figure: median data difference axis (-ve, 0, +ve) showing the Decrease, Neutral, Increase and NED classes]
• No assumption of normal distribution
• Manage ‘censored’ = qualified / out-of-range data
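The rule-selection steps can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the production implementation: the significance level `alpha=0.05` is assumed (the slide does not give one), the function names are hypothetical, percentiles use Python's "inclusive" method, and censored data handling is left out.

```python
from math import comb
from statistics import median, quantiles


def binomial_two_tailed_p(n_up, n_down):
    """Two-tailed p-value for the up/down split under a 50:50 null."""
    n = n_up + n_down
    if n == 0:
        return 1.0
    k = max(n_up, n_down)
    # Probability of a split at least this extreme in one direction...
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # ...doubled for the two-tailed test and capped at 1.
    return min(1.0, 2 * tail)


def classify_transformation(deltas, min_n=6, alpha=0.05):
    """Classify one transformation from its list of pair differences.

    alpha is an assumed significance level; the thresholds 6, 0.05 and
    0.3 are the ones quoted on the slide.
    """
    if len(deltas) < min_n:
        return "ignored"  # not enough data

    deciles = quantiles(deltas, n=10, method="inclusive")
    spread = deciles[-1] - deciles[0]  # 10th to 90th percentile range
    if abs(median(deltas)) <= 0.05 and spread <= 0.3:
        return "neutral"

    n_up = sum(d > 0 for d in deltas)
    n_down = sum(d < 0 for d in deltas)
    if binomial_two_tailed_p(n_up, n_down) < alpha:
        return "increase" if n_up > n_down else "decrease"
    return "NED"  # No Effect Determined


if __name__ == "__main__":
    print(classify_transformation([0.5, 0.6, 0.4, 0.7, 0.3, 0.5]))
    print(classify_transformation([0.5, -0.5, 0.6, -0.4, 0.7, -0.6, 0.3, -0.2]))
```

Because the test only counts how often the property goes up or down, it needs no assumption about the shape of the distribution, which matches the first bullet above.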
14. MedChemica
Making the complicated simple: HOT-Fit
Technology: Algorithms, Data, Speed
Human: System Use, User Satisfaction
Organization: Structure, Environment
Together these determine the Benefits
15. MedChemica
Where to get data?
• Public data is unrepresentative
• Censored by publication bias
• Pharma data – can’t share structures due to IP.
• Use chemical transformations to encode knowledge from matched molecular pair (MMP) analysis – now sharable
Novartis: Kramer, C.; Kalliokoski, T. et al. The Experimental Uncertainty of Heterogeneous Public Ki Data. J. Med. Chem. 2012, 55, 5165
“If project data really looked like that, there would be no problem in the Pharma industry.”
17. MedChemica
Merging knowledge – GRDv1
Merge:
• Pharma 1: 100k rules
• Pharma 2: 92k rules
• Pharma 3: 37k rules
• 5.8k rules in common (pre-merge), ~2%
• New rules: 88k, ~26% of total
Combining data yields brand new rules
Gains: 300 - 900%
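One plausible way merging can produce brand new rules is that a transformation with too few pairs in any single company passes the n ≥ 6 threshold once the pairs are pooled. The sketch below illustrates only that counting effect; the company data, transformation strings and function names are hypothetical, and the actual merge procedure may differ.

```python
from collections import defaultdict

MIN_PAIRS = 6  # assumed to be the same "n >= 6" threshold used for rule selection

# Hypothetical per-company pair differences keyed by transformation.
# Only transformations and differences are pooled, never full structures.
COMPANY_PAIRS = {
    "Pharma 1": {
        "[H]>>C": [0.4, 0.5, 0.3, 0.6, 0.5, 0.4, 0.7],
        "[H]>>F": [0.1, 0.2, 0.1],
    },
    "Pharma 2": {"[H]>>F": [0.2, 0.1], "C>>OC": [-0.3, -0.2, -0.4]},
    "Pharma 3": {"[H]>>F": [0.3], "C>>OC": [-0.5, -0.3, -0.2]},
}


def rules_with_enough_data(pairs_by_transformation, min_pairs=MIN_PAIRS):
    """Return the transformations that have at least min_pairs examples."""
    return {t for t, d in pairs_by_transformation.items() if len(d) >= min_pairs}


def merge(companies):
    """Pool the pair differences for each transformation across companies."""
    pooled = defaultdict(list)
    for pairs in companies.values():
        for transformation, deltas in pairs.items():
            pooled[transformation].extend(deltas)
    return dict(pooled)


single_company = set().union(
    *(rules_with_enough_data(p) for p in COMPANY_PAIRS.values())
)
merged = rules_with_enough_data(merge(COMPANY_PAIRS))
new_rules = merged - single_company

if __name__ == "__main__":
    print("rules available before merging:", sorted(single_company))
    print("rules only available after merging:", sorted(new_rules))
```

In this toy data only one transformation qualifies inside a single company, while two more qualify after pooling.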
18. MedChemica
Knowledge Extracted
Numbers of statistically valid transforms
Grand Rule Database v3

Grouped Datasets                                                      Number of Rules
logD7.4                                                               153449
Merged solubility                                                     46655
In vitro microsomal clearance: Human, rat, mouse, cyno, dog           88423
In vitro hepatocyte clearance: Human, rat, mouse, cyno, dog           26627
MDCK permeability A-B / B-A efflux                                    1852
Cytochrome P450 inhibition: 2C9, 2D6, 3A4, 2C19, 1A2                  40605
Cardiac ion channels: NaV 1.5, hERG ion channel inhibition            15636
Glutathione Stability                                                 116
Plasma protein or albumin binding: Human, rat, mouse, cyno, dog       64622
19. MedChemica
Single company vs merged
Comparison between Roche-only and GRD rules for human
microsomal clearance. Overall R2 is 0.76 and RMSE 0.11.
20. MedChemica
Chemists use logD as a benchmark:
• Standard to use lipophilicity as a design surrogate
• Provides a context for changes
• Key multi-objective design issues are centered round
conflicting logD correlations:
• Solubility & metabolic stability vs. potency & permeability
• Particularly useful to look at chemical transformations that
‘ break the dogma’ of logD correlation
21. MedChemica
Solubility : logD – trends & exceptions
>=20 examples per rule, n=13,453
R2 = 0.66, slope = -0.57, intercept = 0.
Magenta line: line of slope -1, intercept 0, dark blue line linear best fit, pale blue density ellipse contains
99% and the mid blue ellipse contains 50% of the transformations.
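Transformations that ‘break the dogma’ can be picked out as the ones lying far from the best-fit line. The sketch below uses the slope and intercept quoted for the solubility plot; the rule names, their values and the residual threshold of 0.5 log units are assumptions made for illustration.

```python
SLOPE = -0.57    # best-fit slope quoted for Δlog(solubility) against ΔlogD
INTERCEPT = 0.0  # intercept quoted on the slide

# Hypothetical rules: (name, median ΔlogD, median Δlog solubility)
RULES = [
    ("rule_a", 0.5, -0.30),
    ("rule_b", 1.0, -0.55),
    ("rule_c", 0.8, 0.60),   # more lipophilic yet more soluble
    ("rule_d", -0.6, 0.35),
]


def find_exceptions(rules, slope=SLOPE, intercept=INTERCEPT, threshold=0.5):
    """Return rules whose solubility change is far from the logD trend.

    threshold is an assumed cut-off in log units, not a value from the slide.
    """
    flagged = []
    for name, d_logd, d_sol in rules:
        expected = slope * d_logd + intercept
        residual = d_sol - expected
        if abs(residual) > threshold:
            flagged.append((name, round(residual, 2)))
    return flagged


if __name__ == "__main__":
    for name, residual in find_exceptions(RULES):
        print(f"{name}: {residual:+.2f} log units away from the trend")
```

The same residual check could be applied to the clearance plot by swapping in its fitted slope.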
23. MedChemica
Clearance : logD – trends & exceptions
>=20 examples per rule, n=11,572
R2 = 0.40, slope = 0.23, intercept = 0.
Magenta line: line of slope 1, intercept 0, dark blue line linear best fit, pale blue density
ellipse contains 99% and the mid blue ellipse contains 50% of the transformations.
25. MedChemica
Making the complicated simple: HOT-Fit
Technology: Algorithms, Data, Speed
Human: System Use, User Satisfaction
Organization: Structure, Environment
Together these determine the Benefits
26. MedChemica
MMPA: Engineering challenges
• Quick to implement on a small scale
• Always becomes an n² problem…
• ‘Challenging’ at enterprise scales 100,000+
- Cheminformatics ‘gotchas’
• Tautomers, charge states
• Unusual aromatic systems
• Highly symmetric molecules
• Capturing and coding environments accurately
- Structure and data integrity
- Assay ontologies
- Database schema optimized for cluster I/O
Speed at scale essential – time poor users
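To give a feel for the n² growth, the short sketch below counts the candidate pairs a naive all-against-all comparison would have to consider. The compound counts are illustrative; only the 100,000 figure comes from the slide.

```python
from math import comb


def candidate_pairs(n_compounds):
    """Number of distinct compound pairs in a naive all-vs-all comparison."""
    return comb(n_compounds, 2)  # n * (n - 1) / 2


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9,} compounds -> {candidate_pairs(n):,} candidate pairs")
```

At 100,000 compounds this is already about five billion comparisons, which is why indexing schemes and I/O-aware database design matter at enterprise scale.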
27. MedChemica
Interface Design depends on the User
• > 2 × 10^12 searches / year
• Totally unskilled users
• Simple consistent interface
• Rocket scientists
?
Meet your HiPPO where they’re skilled
• Intuitive ( = fast & familiar)
• Summary data + option to drill into the
detail
• Web browsers
• Excel
28. MedChemica
Exploiting Knowledge for Compound Optimization
[Diagram: Measured Data → rule finder → Rule Database; Problem molecule + Rule Database → Compounds from Rules → New molecule suggestions (MCPairs)]
“..it’s like asking 150 of your peers for ideas in just a few seconds” – AZ Principal Scientist
33. MedChemica
There is so much more…
• MMP based clustering
• QSAR from MMPA
• Matched molecular series
• Interface design is key
34. MedChemica
What can we do with Advanced Analytics?
Accelerate Chemistry by using:
• right algorithms that our users understand
• as much data as possible
• fast, “user appropriate” interfaces
to deliver better products into development faster.
Lots of people come forward with ideas to ‘revolutionise drug discovery’, but being more data driven is surprisingly cheap compared to most of them, e.g. ‘new modalities’ like therapeutic RNAs or chimeric antigen receptors, or even large ring macrocycles.
We may be at the summit, but who can tell? And what is around us?
Alternatively, we may want a completely clear view of potential cliffs and valleys, but by the time you get there so much has been published that compounds are probably in the clinic, if not on the market. Of course, there may still be opportunities.