SCI What can Big Data do for Chemistry 2017 MedChemica

MedChemica
What have we done? What could we do with Advanced Analytics in the Chemistry Industry?
Ed Griffen
MedChemica Ltd

Big Data – Focus on Benefits not Features
From the Gartner IT Glossary:
What is Big Data?
Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
Features: high-volume, high-velocity and/or high-variety information assets
Benefits: enhanced insight, decision making, and process automation

Where is Big Data proving most Successful?
• Customer analysis
• Targeted advertising
• Language translation
• What do these have in common?
• Underlying theoretical model insufficiently accurate or unknown
• Very, very large data sets
• Straightforward statistical methods
• Most users are unskilled and not interested in mechanics

What are the classes of chemical problem?
‘Potency’: Lead finding; Potency improvement
Properties: Pharmacokinetics; Solubility; Off target toxicity
Production: First successful route; ‘Best’ route
Patents: Freedom to Operate
Product Size; Duration of action; Safety margin; Speed; Cost; Commercial Position
Common to all
• Pharmaceuticals
• Agrochemicals
• Flavors and Fragrances
• Consumer products
• Materials science
• Underlying theoretical model insufficiently accurate or unknown
• Very, very large data sets
• Straightforward statistical methods
• Most users are unskilled and not interested in mechanics

‘Big Data’ analysis for Chemistry
Making and testing compounds is expensive!
• No new compounds to make
• No new testing to do
• Exploit the compounds and data you’ve already paid for
• Accelerate all new projects
• Augment the skills and experience of your chemists
• Mythbusting…
All very cost effective

Help the HiPPOs – or they’ll crush you
1. McAfee & Brynjolfsson “Big Data: The Management Revolution”,
Harvard Business Review October 2012
“Companies often make most of
their important decisions by
relying on “HiPPO”—the highest-
paid person’s opinion.”1
Chemistry HiPPs:
• experts in pattern recognition
• judged on their ability to make the best decisions with partial data
• highly trained
• time poor
• delivery focused
• gatekeepers to the adoption of new approaches

Making a real textbook of Medicinal Chemistry
[Diagram: Multiple Pharma ADMET data → MMPA (×3) → Combine and Extract Rules → >437000 rules → Better Project decisions; Increased Medicinal Chemistry learning]
Kramer, Robb, Ting, Zheng, Griffen, et al: J. Med. Chem 2017
http://pubs.acs.org/doi/10.1021/acs.jmedchem.7b00935
‘Potency’: Lead finding; Potency improvement
Properties: Pharmacokinetics; Solubility; Off target toxicity
Production: First successful route; ‘Best’ route
Patents: Freedom to Operate

Making the complicated simple: HOT-Fit
Learning from the development of clinical decision support software
[Diagram: Human (System Use, User Satisfaction), Organization (Structure, Environment) and Technology (Algorithms, Data, Speed) lead to Benefits]
E. Kilsdonk, L.W. Peute, M.W.M. Jaspers, Factors Influencing Implementation Success of Guideline-based Clinical Decision Support Systems: a systematic review and gaps analysis, International Journal of Medical Informatics
http://dx.doi.org/10.1016/j.ijmedinf.2016.12.001

Chemistry Knowledge extraction methods
Remember: your HiPPO needs to understand
[Chart: Descriptors against Method; shading labels: Dark, Black]
Descriptors: substructures; Physical chemistry descriptors (Hansch, Taft, Fujita, Abraham); Atomic, pair, triplet descriptors; Indices; Counts & descriptive statistics
Method: MMPA; (M)LR Free Wilson; PLS; Trees / Forests; SVM; Bayesian NN; Deep Learning
It’s a summit – but what else is out there?

Advanced MMPA with MCPairs
• Matched Molecular Pairs – Molecules that differ only by a particular, well-defined structural transformation
Griffen, E. et al. J. Med. Chem. 2011, 54(22), pp.7739-7750.
• Transformation with environment capture – MMPs can be recorded as transformations from A → B
[Figure: structures A and B with numbered atom labels; Δ Data A-B]
Environment is key - must be captured in the chemical encoding
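As a minimal sketch of the idea on this slide (not the MCPairs implementation), the Python below groups Δ values by transformation and environment. It assumes the molecules have already been fragmented into a constant part, a variable fragment and an environment label. The record layout, the names `find_pairs` and `median_delta_by_rule`, and every numeric value are hypothetical, and ΔData is taken here as B minus A.

```python
from collections import defaultdict
from itertools import combinations
from statistics import median

# Hypothetical pre-fragmented records (invented values, not data from the talk):
# (molecule id, constant part, variable fragment, environment label, log value)
RECORDS = [
    ("m1", "scaffoldA", "H", "aromatic C", 1.0),
    ("m2", "scaffoldA", "Me", "aromatic C", 0.5),
    ("m3", "scaffoldB", "H", "aromatic C", 2.0),
    ("m4", "scaffoldB", "Me", "aromatic C", 1.25),
    ("m5", "scaffoldC", "H", "amide N", 1.0),
    ("m6", "scaffoldC", "Me", "amide N", 1.75),
]


def find_pairs(records):
    """Pair molecules that share a constant part but differ in one fragment."""
    by_constant = defaultdict(list)
    for rec in records:
        by_constant[rec[1]].append(rec)

    pairs = []
    for bucket in by_constant.values():
        for a, b in combinations(bucket, 2):
            if a[2] == b[2] or a[3] != b[3]:
                continue  # same fragment, or environments do not match
            a, b = sorted((a, b), key=lambda r: r[2])  # fixed A -> B direction
            transformation = f"{a[2]}>>{b[2]}"
            pairs.append((transformation, a[3], b[4] - a[4]))  # delta = B - A
    return pairs


def median_delta_by_rule(pairs):
    """Median delta for each (transformation, environment) key."""
    grouped = defaultdict(list)
    for transformation, environment, delta in pairs:
        grouped[(transformation, environment)].append(delta)
    return {key: median(deltas) for key, deltas in grouped.items()}


summary = median_delta_by_rule(find_pairs(RECORDS))
```

With these invented numbers the same H>>Me change has a median of -0.625 in one environment and 0.75 in the other, which is why the environment is kept as part of the key.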

MMPA: Environment really matters
H→Me:
• Median Δlog(Solubility)
• 225 different environments
H→Me:
• Median Δlog(Clint), human microsomal clearance
• 278 different environments
[Figure annotations: 2.5log, 1.5log]

Matched Molecular Pair methods matter
If you don’t use both you’ll miss 12-56% of the pairs
2 Methods:
• Maximum Common SubStructure (MCSS): Warner, Sheridan
• Fragment and Index (FI): Hussain & Rea
Strengths:
Ring replacement linker and core swaps
Macrocycle ring pairs
[Figure: panels EGF, D1, Cav3.2; axes fF+I and fMCSS]
Leach et al J. Chem. Inf. Model. 2017 http://dx.doi.org/10.1021/acs.jcim.7b00335

Rule selection
• Identify and group matching SMIRKS
• Calculate statistical parameters for each unique SMIRKS (n, median, sd, se, n_up/n_down)
• Is n ≥ 6? No → not enough data: ignore transformation. Yes → continue
• Is the |median| ≤ 0.05 and the intercentile range (10-90%) ≤ 0.3? Yes → transformation is classified as ‘neutral’. No → continue
• Perform two-tailed binomial test on the transformation to determine the significance of the up/down frequency
• Fail → transformation classified as ‘NED’ (No Effect Determined)
• Pass → transformation classified as ‘increase’ or ‘decrease’ depending on which direction the property is changing
[Figure: median data difference axis (-ve, 0, +ve) with regions Decrease, Neutral, Increase and NED]
• No assumption of normal distribution
• Manage ‘censored’ = qualified / out-of-range data
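The rule selection flow above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not MedChemica's code: the thresholds n ≥ 6, |median| ≤ 0.05 and intercentile range ≤ 0.3 come from the slide, while the 0.05 significance level, the decision to ignore zero differences when counting up/down, and the names `classify` and `binomial_two_tailed_p` are assumptions. Censored (qualified) values are not handled here.

```python
from math import comb
from statistics import median, quantiles

MIN_PAIRS = 6          # from the slide: "Is n >= 6?"
NEUTRAL_MEDIAN = 0.05  # from the slide: |median| <= 0.05
NEUTRAL_SPREAD = 0.3   # from the slide: 10-90% intercentile range <= 0.3
ALPHA = 0.05           # assumption: the slide does not state a significance level


def binomial_two_tailed_p(n_up, n_down):
    """Exact two-tailed binomial p-value against a 50:50 up/down split."""
    n = n_up + n_down
    if n == 0:
        return 1.0
    k = min(n_up, n_down)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def classify(deltas):
    """Classify one transformation from its list of pair differences."""
    if len(deltas) < MIN_PAIRS:
        return "ignore"  # not enough data
    deciles = quantiles(deltas, n=10)    # 9 cut points: 10th ... 90th percentile
    spread = deciles[-1] - deciles[0]    # 10-90% intercentile range
    if abs(median(deltas)) <= NEUTRAL_MEDIAN and spread <= NEUTRAL_SPREAD:
        return "neutral"
    n_up = sum(d > 0 for d in deltas)    # assumption: zero differences are ignored
    n_down = sum(d < 0 for d in deltas)
    if binomial_two_tailed_p(n_up, n_down) >= ALPHA:
        return "NED"  # No Effect Determined
    return "increase" if n_up > n_down else "decrease"
```

Because the test only counts how often the property goes up or down, it makes no assumption about the shape of the distribution, in line with the first bullet on the slide.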

Making the complicated simple: HOT-Fit
[Diagram: Human (System Use, User Satisfaction), Organization (Structure, Environment) and Technology (Algorithms, Data, Speed) lead to Benefits]

Where to get data?
• Public data is unrepresentative
• Censored by publication bias
• Pharma data – can’t share structures due to IP.
• Use chemical transformations to encode knowledge from matched molecular pair (MMP) analysis → now sharable
Novartis: Kramer, C.; Kalliokoski, T. et al. The Experimental Uncertainty of Heterogeneous Public Ki Data. J. Med. Chem. 2012, 55, 5165
If project data really looked like that, there would be no problem in the Pharma industry.

Data Sources
[Diagram: behind each individual company firewall, AZ, Roche and Genentech each run an MMP finder over their own Data and Database. MedChemica Aggregation combines >500 million pairs into the Grand Rule Database (0.5 million rules), which is returned to each company for AZ Exploitation, Roche Exploitation and Genentech Exploitation.]
Merging knowledge – GRDv1
Merge
• Pharma 1: 100k rules
• Pharma 2: 92k rules
• Pharma 3: 37k rules
5.8k rules in common (pre-merge) ~ 2%
New Rules: 88k, ~26% of total
Combining data yields brand new rules
Gains: 300 - 900%
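One way to see why merging produces brand new rules: a transformation that has too few pairs in any single company can cross the minimum pair count once the pairs are pooled. The sketch below illustrates only that counting effect. The company data, the transformation labels and the names `pool` and `rule_candidates` are hypothetical, and the n ≥ 6 threshold is borrowed from the rule selection slide.

```python
from collections import defaultdict

MIN_PAIRS = 6  # threshold taken from the rule selection slide (n >= 6)


def pool(*sources):
    """Concatenate the pair differences for each transformation across sources."""
    pooled = defaultdict(list)
    for source in sources:
        for transformation, deltas in source.items():
            pooled[transformation].extend(deltas)
    return dict(pooled)


def rule_candidates(source):
    """Transformations with enough pairs to be considered for a rule."""
    return {t for t, deltas in source.items() if len(deltas) >= MIN_PAIRS}


# Hypothetical pair-level data: transformation -> list of differences
pharma_1 = {"H>>Me": [0.4] * 7, "H>>F": [0.1] * 3}
pharma_2 = {"H>>Me": [0.5] * 6, "H>>F": [0.2] * 2, "Cl>>Br": [0.3] * 4}
pharma_3 = {"H>>Me": [0.3] * 6, "H>>F": [0.1] * 2, "Cl>>Br": [0.2] * 1}

single = [rule_candidates(s) for s in (pharma_1, pharma_2, pharma_3)]
merged = rule_candidates(pool(pharma_1, pharma_2, pharma_3))
in_common = set.intersection(*single)    # candidates every company already had
new_rules = merged - set.union(*single)  # candidates that exist only after merging
```

In this toy data `H>>F` has 3, 2 and 2 pairs in the three sources, so it qualifies only after pooling, while `Cl>>Br` still falls short with 5 pairs in total.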

Knowledge Extracted
Numbers of statistically valid transforms (Grand Rule Database v3)
Grouped Datasets                                                      Number of Rules
logD7.4                                                               153449
Merged solubility                                                     46655
In vitro microsomal clearance: Human, rat, mouse, cyno, dog           88423
In vitro hepatocyte clearance: Human, rat, mouse, cyno, dog           26627
MDCK permeability A-B / B-A efflux                                    1852
Cytochrome P450 inhibition: 2C9, 2D6, 3A4, 2C19, 1A2                  40605
Cardiac ion channels: NaV 1.5, hERG ion channel inhibition            15636
Glutathione Stability                                                 116
Plasma protein or albumin binding: Human, rat, mouse, cyno, dog       64622

Single company vs merged
Comparison between Roche-only and GRD rules for human
microsomal clearance. Overall R2 is 0.76 and RMSE 0.11.

Chemists use logD as a benchmark:
• Standard to use lipophilicity as a design surrogate
• Provides a context for changes
• Key multi-objective design issues are centered round conflicting logD correlations:
• Solubility & metabolic stability versus potency & permeability
• Particularly useful to look at chemical transformations that ‘break the dogma’ of logD correlation

Solubility : logD – trends & exceptions
>=20 examples per rule, n=13,453
R2 = 0.66, slope = -0.57, intercept = 0.
Magenta line: line of slope -1, intercept 0, dark blue line linear best fit, pale blue density ellipse contains
99% and the mid blue ellipse contains 50% of the transformations.

Exceptional Solubility transformations
[Transformation column: structure drawings]
median ΔlogD ± std (nPairs) | median ΔlogSol ± std (nPairs) | Comment
0.00 ± 0.67 (91) | 0.73 ± 0.72 (87) | ΔlogD ==, Solubility ↑
-0.10 ± 0.83 (83) | 0.65 ± 0.96 (69) |
0.07 ± 0.50 (108) | 0.52 ± 0.77 (80) |
-0.10 ± 0.54 (208) | 0.40 ± 0.78 (115) |
-0.59 ± 0.49 (82) | 0.03 ± 0.72 (98) | ΔlogD ↓, Solubility ==
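To make "exceptional" concrete, one hedged reading is the distance of each transformation from the solubility : logD trend line. The sketch below uses the slope (-0.57) and intercept (0) reported on the solubility : logD slide and the five pairs of medians from the table above. Treating the residual as the measure of exception is an assumption, not a method stated in the talk, and `residual` is a hypothetical helper.

```python
SLOPE = -0.57    # solubility : logD best-fit slope reported in the talk
INTERCEPT = 0.0  # reported intercept

# (median delta logD, median delta logSol) for the five table rows
ROWS = [(0.00, 0.73), (-0.10, 0.65), (0.07, 0.52), (-0.10, 0.40), (-0.59, 0.03)]


def residual(dlogd, dlogsol):
    """Observed solubility change minus the change predicted from logD alone."""
    return dlogsol - (SLOPE * dlogd + INTERCEPT)


residuals = [round(residual(d, s), 2) for d, s in ROWS]
# residuals == [0.73, 0.59, 0.56, 0.34, -0.31]
```

The first four rows sit well above the trend (more soluble than the logD change predicts), while the last row sits below it: logD falls by 0.59 yet solubility barely moves.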

Clearance : logD – trends & exceptions
>=20 examples per rule, n=11,572
R2 = 0.40, slope 0.23, intercept = 0.
Magenta line: line of slope 1, intercept 0, dark blue line linear best fit, pale blue density
ellipse contains 99% and the mid blue ellipse contains 50% of the transformations.

Exceptional HLM transformations
[Transformation column: structure drawings]
median ΔlogD ± std (nPairs) | HLM median Δlog(Clint) ± std (nPairs) | Comment
0.35 ± 0.45 (15) | -0.34 ± 0.71 (13) | ΔlogD ↑, Clint ↓
0.70 ± 0.74 (117) | -0.32 ± 0.51 (53) |
0.73 ± 0.61 (26) | -0.23 ± 0.36 (18) |
0.00 ± 0.11 (19) | -0.59 ± 0.38 (14) | ΔlogD ==, Clint ↓
-0.69 ± 0.42 (8) | 0.76 ± 0.59 (7) | ΔlogD ↓, Clint ↑

MMPA: Engineering challenges
• Quick to implement on a small scale
• Always becomes an n² problem…
• ‘Challenging’ at enterprise scales 100,000+
- Cheminformatics ‘gotchas’
• Tautomers, charge states
• Unusual aromatic systems
• Highly symmetric molecules
• Capturing and coding environments accurately
- Structure and data integrity
- Assay ontologies
- Database schema optimized for cluster I/O
Speed at scale essential – time poor users
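A quick calculation shows why pairing becomes an n² problem at enterprise scale. The sketch counts the candidate pairs for a brute-force all-against-all comparison and for an indexed approach in which only molecules sharing a key are compared. The bucket sizes are hypothetical and chosen for round numbers. Real fragment-and-index schemes place each molecule in several buckets, so this is an optimistic illustration.

```python
def all_pairs(n):
    """Number of unordered pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


def indexed_pairs(bucket_sizes):
    """Pairs to examine when only items in the same bucket are compared."""
    return sum(all_pairs(size) for size in bucket_sizes)


brute_force = all_pairs(100_000)        # 4,999,950,000 comparisons
indexed = indexed_pairs([100] * 1_000)  # hypothetical: 1,000 buckets of 100
```

Under these assumptions the indexed search examines 4,950,000 pairs, roughly a thousand times fewer than the brute-force count.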

Interface Design depends on the User
• > 2 × 10¹² searches / year
• Totally unskilled users
• Simple consistent interface
• Rocket scientists
?
Meet your HiPPO where they’re skilled
• Intuitive ( = fast & familiar)
• Summary data + option to drill into the
detail
• Web browsers
• Excel

Exploiting Knowledge for Compound Optimization
[Diagram: Measured Data → rule finder → Rule Database; Problem molecule + Rule Database → rule finder → Compounds from Rules → New molecule suggestions]
MCPairs = “..it’s like asking 150 of your peers for ideas in just a few seconds” – AZ Principal Scientist

Exploiting Knowledge for Compound Optimization
https://www.youtube.com/watch?v=nQxXddJDTfc

More examples of Success
Thompson, M.J. et al. J. Med. Chem., 2015, 58 (23), pp 9309–9333
DOI: 10.1021/acs.jmedchem.5b01312

“Me-Betters” on a Massive scale
[Diagram: 1162 Marketed Drugs + Grand Rule Database v3 → Enumerator System → Wealth of Follow-on opportunities; ~425 improvement suggestions / drug]
Improve solubility & metabolism = lower dose = uid from bid/tid
Safer, better compliance

‘Instant’ SAR exploration
https://www.youtube.com/watch?v=_FGSnD6PG3I

There is so much more…
• MMP based clustering
• QSAR from MMPA
• Matched molecular series
• Interface design is key

What can we do with Advanced Analytics?
Accelerate Chemistry by using:
• the right algorithms that our users understand
• as much data as possible
• fast, “user appropriate” interfaces
to deliver better products into development faster.

Collaborators and Users - experience
