Structure-Activity Relationships and Networks: A Generalized Approach to Exploring Structure-Activity Landscapes
1. Structure-Activity Relationships and Networks: A Generalized Approach to Exploring Structure-Activity Landscapes
Rajarshi Guha
NIH Chemical Genomics Center / NIH Center for Translational Therapeutics
March 29, 2011
2. NIH Chemical Genomics Center
• Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
– NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
– Post-doc program
• Mission
– MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
• Complements individual investigator-initiated research programs
• Enables "pharma-level" HTS and early chemical optimization
– Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
– New paradigms & applications of HTS for chemical biology / chemical genomics
• All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
– 75% NIH extramural, 10% NIH intramural, 15% Foundations/Research Consortia/Pharma/Biotech
4. qHTS: High Throughput Dose Response
[Figure: qHTS workflow, panels A–C]
• Assay concentration ranges over 4 logs (high: ~100 μM)
• 1536-well plates, inter-plate dilution series
• Assay volumes 2–5 μL
• Automated concentration-response data collection, ~1 CRC/sec
• Informatics pipeline: automated curve fitting and classification; 300K samples
5. Background
• Cheminformatics methods
– QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks
• More recently
– RNAi screening, high content imaging
• Extensive use of machine learning
• All tied together with software development
– User-facing GUI tools
– Low level programmatic libraries
• Believer & practitioner of Open Source
6. Outline
• Structure-activity relationships
• Characterizing activity cliffs
• Working with the structure-activity landscape
7. Structure Activity Relationships
• Similar molecules will have similar activities
• Small changes in structure will lead to small changes in activity
• One implication is that SARs are additive
• This is the basis for QSAR modeling
Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358
8. Exceptions Are Easy to Find
[Figure: four closely related structures with Ki = 39.0 nM, Ki = 1.8 nM, Ki = 10.0 nM and Ki = 1.0 nM]
Tran, J.A. et al., Bioorg. Med. Chem. Lett., 2007, 15, 5166–5176
9. Structure Activity Landscapes
• Rugged gorges or rolling hills?
– Small structural changes associated with large activity changes represent steep slopes in the landscape
– But traditionally, QSAR assumes gentle slopes
– Machine learning is not very good for special cases
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
11. Characterizing the Landscape
• A cliff can be numerically characterized
• Structure Activity Landscape Index (SALI)
$$ \mathrm{SALI}_{i,j} = \frac{|A_i - A_j|}{1 - \mathrm{sim}(i,j)} $$
• Cliffs are characterized by elements of the matrix with very large values
Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
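The SALI definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the activities and the similarity matrix are made-up values, and the handling of a similarity of exactly 1 (where the denominator vanishes) is an assumption.

```python
from itertools import combinations


def sali(activity_i, activity_j, similarity):
    """SALI for one pair: |A_i - A_j| / (1 - sim(i, j))."""
    if similarity >= 1.0:
        # Assumption: identical representations give an infinite cliff
        # unless the activities are also identical.
        return float("inf") if activity_i != activity_j else 0.0
    return abs(activity_i - activity_j) / (1.0 - similarity)


def sali_matrix(activities, sim):
    """Symmetric matrix of pairwise SALI values (diagonal left at 0)."""
    n = len(activities)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = sali(activities[i], activities[j], sim[i][j])
    return matrix


# Hypothetical data: three compounds with pIC50-like activities.
activities = [9.0, 7.0, 6.5]
sim = [
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]
m = sali_matrix(activities, sim)
```

In this toy set the pair (0, 1) is both very similar and far apart in activity, so it carries the largest matrix element and would be flagged as the cliff.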
13. Fingerprints
1 0 1 1 0 0 0 1 0
• Lots of types of fingerprints
• Indicates the presence or absence of a structural feature
• Length can vary from 166 to 4096 bits or more
• Fingerprints usually compared using the Tanimoto metric
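As a small illustration of the Tanimoto comparison mentioned above, the sketch below treats a fingerprint as a list of 0/1 bits. The first fingerprint is the bit string shown on the slide; the second is a hypothetical one added for comparison, and returning 1.0 for two all-zero fingerprints is an assumed convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: bits set in both / bits set in either."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Assumption: two empty fingerprints are treated as identical.
    return both / either if either else 1.0


fp1 = [1, 0, 1, 1, 0, 0, 0, 1, 0]  # bit string from the slide
fp2 = [1, 0, 1, 0, 0, 0, 0, 1, 1]  # hypothetical second fingerprint
similarity = tanimoto(fp1, fp2)    # 3 shared bits out of 5 set in either
```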
14. Varying Fingerprint Methods
[Figure: density of pairwise Tanimoto similarity for three fingerprints: BCI 1052 bit, MACCS 166 bit, CDK 1024 bit]
• Shorter fingerprints will lead to more "similar" pairs
• Requires a higher cutoff to focus on significant cliffs
16. Different Activity Representations
• Using the Hill parameters from a dose-response curve represents richer data than a single IC50
$$ P = \{S_0, S_{\mathrm{inf}}, AC_{50}, H\}, \qquad \mathrm{SALI}_{i,j} = \frac{d(P_i, P_j)}{1 - \mathrm{sim}(i,j)} $$
[Figure: dose-response curve, Activity vs Concentration, marking S0, SInf, the 50% level and AC50]
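A minimal sketch of the Hill-parameter variant follows. The slide does not say which distance d is used, so Euclidean distance is an assumption here, as are the parameter values; in practice the four parameters live on different scales and would probably need scaling first.

```python
import math


def hill_distance(p_i, p_j):
    """Euclidean distance between two (S0, Sinf, AC50, H) tuples.

    Assumption: the slide only writes d(P_i, P_j); Euclidean is one choice.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_i, p_j)))


def sali_hill(p_i, p_j, similarity):
    """SALI with the activity difference replaced by d(P_i, P_j)."""
    if similarity >= 1.0:
        raise ValueError("SALI is undefined for a similarity of 1")
    return hill_distance(p_i, p_j) / (1.0 - similarity)


# Hypothetical curves: same S0, Sinf and Hill slope, AC50 (log units) differs by 3.
p1 = (0.0, -100.0, -6.0, 1.0)
p2 = (0.0, -100.0, -9.0, 1.0)
value = sali_hill(p1, p2, 0.5)
```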
17. Visualizing SALI Values
• Alternatives?
– A heatmap is an easy to understand visualization
– Coupled with brushing, can be a handy tool
– A more flexible approach is to consider a network view of the matrix
• The SALI graph
– Compounds are nodes
– Nodes i,j are connected if SALI(i,j) > X
– Only display connected nodes
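The graph construction described above can be sketched as follows. The SALI matrix and activities are hypothetical, and directing each edge from the less active to the more active compound is an assumption (it presumes larger values mean more active, as with pIC50).

```python
def sali_graph(activities, sali, cutoff):
    """Return (nodes, edges) for all pairs with SALI(i, j) > cutoff.

    Only compounds that take part in at least one edge are returned as nodes.
    Assumption: edges point from the less active to the more active compound.
    """
    edges = []
    n = len(activities)
    for i in range(n):
        for j in range(i + 1, n):
            if sali[i][j] > cutoff:
                lo, hi = (i, j) if activities[i] <= activities[j] else (j, i)
                edges.append((lo, hi))
    nodes = sorted({k for edge in edges for k in edge})
    return nodes, edges


# Hypothetical activities and SALI matrix for four compounds.
activities = [9.0, 7.0, 6.5, 6.4]
sali = [
    [0.0, 20.0, 3.6, 3.0],
    [20.0, 0.0, 0.8, 0.7],
    [3.6, 0.8, 0.0, 0.2],
    [3.0, 0.7, 0.2, 0.0],
]
nodes, edges = sali_graph(activities, sali, cutoff=3.5)
```

With this cutoff compound 3 has no edge above the threshold, so it is dropped from the display.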
21. What Can We Do With SALIs?
• SALI characterizes cliffs & non-cliffs
• For a given molecular representation, SALI gives us an idea of the smoothness of the SAR landscape
• Models try to encode this landscape
• Use the landscape to guide descriptor or model selection
22. Descriptor Space Smoothness
[Figure: SALI graphs at several cutoffs for a set of drugs (node labels include astemizole, cisapride, E-4031, sertindole, terfenadine and others), alongside a plot of Number of Edges in SALI Graph vs SALI Cutoff]
• Edge count of the SALI graph for varying cutoffs
• Measures smoothness of the descriptor space
• Can reduce this to a single number (AUC)
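The edge-count curve and its reduction to one number can be sketched like this. Two things are assumptions: the cutoff is read as a fraction of the largest SALI value (matching the 0.0 to 1.0 axis), and the AUC is taken with the trapezoidal rule. The SALI values are made up.

```python
def edge_count_curve(sali_values, cutoffs):
    """Number of pairs whose SALI exceeds each fractional cutoff.

    Assumption: a cutoff c means "SALI > c * max(SALI)".
    """
    top = max(sali_values)
    return [sum(1 for v in sali_values if v > c * top) for c in cutoffs]


def auc(xs, ys):
    """Trapezoidal area under the curve defined by points (xs, ys)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    )


# Hypothetical pairwise SALI values (upper triangle of the matrix).
sali_values = [20.0, 3.6, 3.0, 0.8, 0.7, 0.2]
cutoffs = [0.0, 0.25, 0.5, 0.75, 1.0]
counts = edge_count_curve(sali_values, cutoffs)
area = auc(cutoffs, counts)
```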
23. Other Examples
• Instead of fingerprints, we use molecular descriptors
• SALI denominator now uses Euclidean distance
• 2D & 3D random descriptor sets
– None are really good
– Too rough, or
– Too flat
[Figure: Number of Edges in SALI Graph vs SALI Cutoff, one panel for 2D and one for 3D descriptors]
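One reading of "SALI denominator now uses Euclidean distance" is that the distance between descriptor vectors replaces 1 − sim(i, j). The sketch below follows that reading, which is an assumption; the descriptor values are hypothetical, and real descriptors would probably need scaling before distances are meaningful.

```python
import math


def sali_descriptor(activity_i, activity_j, desc_i, desc_j):
    """SALI with the Euclidean descriptor distance as denominator."""
    dist = math.dist(desc_i, desc_j)  # available in Python 3.8+
    if dist == 0.0:
        raise ValueError("SALI is undefined for identical descriptor vectors")
    return abs(activity_i - activity_j) / dist


# Hypothetical pair: activities differ by 2, descriptors are 5 units apart.
value = sali_descriptor(8.0, 6.0, (0.0, 0.0), (3.0, 4.0))
```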
24. Feature Selection Using SALI
• Surprisingly, exhaustive search of 66,000 4-descriptor combinations did not yield semi-smoothly decreasing curves
• Not entirely clear what type of curve is desirable
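An exhaustive search over 4-descriptor subsets can be sketched with `itertools.combinations`. The descriptor names and the scoring function below are hypothetical stand-ins; the slide does not state the pool size or the criterion used. As a rough check on scale, a pool of 37 descriptors would give 66,045 subsets of size 4, which is in line with the "66,000" quoted above, but that pool size is only a guess.

```python
from itertools import combinations
from math import comb


def exhaustive_search(descriptor_names, score, size=4):
    """Score every subset of the given size; return the lowest-scoring one."""
    best = min(combinations(descriptor_names, size), key=score)
    return best, score(best)


# Hypothetical pool of six descriptors named d0..d5.
names = [f"d{k}" for k in range(6)]


def toy_score(subset):
    """Toy stand-in for a curve-quality measure: sum of descriptor indices."""
    return sum(int(name[1:]) for name in subset)


best, value = exhaustive_search(names, toy_score)
n_subsets_for_37 = comb(37, 4)  # size of the search for a hypothetical 37-descriptor pool
```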
25. SALI Graphs & Predictive Models
• The graph view allows us to view SARs and identify trends easily
• The aim of a QSAR model is to encode SARs
• Traditionally, we consider the quality of a model in terms of RMSE or R2
• But in general, we're not as interested in RMSEs as we are in whether the model predicted something as more active than something else
– What we want to have is the correct ordering
– We assume the model is statistically significant
26. Measuring Model Quality
• A QSAR model should easily encode the "rolling hills"
• A good model captures the most significant cliffs
• Can be formalized as: How many of the edge orderings of a SALI graph does the model predict correctly?
• Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X
• Repeat for varying X and obtain the SALI curve
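The SALI curve can be sketched as below. The slide describes S(X) only as counting correctly ordered edges; the normalization used here (correct minus incorrect, divided by the number of edges, so that values fall in [-1, 1]) is an assumption, as is reading X as a fraction of the largest SALI value. All data are hypothetical.

```python
def s_of_x(pairs, observed, predicted, cutoff):
    """Signed fraction of edges with SALI > cutoff whose ordering is predicted.

    pairs: list of (i, j, sali_value). Each edge scores +1 if the model
    orders i and j as observed, -1 if it reverses them, 0 for a predicted tie.
    """
    score, count = 0, 0
    for i, j, value in pairs:
        if value <= cutoff:
            continue
        count += 1
        agreement = (observed[i] - observed[j]) * (predicted[i] - predicted[j])
        if agreement > 0:
            score += 1
        elif agreement < 0:
            score -= 1
    return score / count if count else 0.0


def sali_curve(pairs, observed, predicted, fractions):
    """S(X) evaluated at cutoffs given as fractions of the largest SALI."""
    top = max(value for _, _, value in pairs)
    return [s_of_x(pairs, observed, predicted, f * top) for f in fractions]


# Hypothetical observed and predicted activities for three compounds.
observed = [9.0, 7.0, 6.5]
predicted = [8.5, 7.2, 7.4]
pairs = [(0, 1, 20.0), (0, 2, 3.6), (1, 2, 0.8)]
curve = sali_curve(pairs, observed, predicted, [0.0, 0.1, 0.5])
```

Here the model misorders only the low-SALI pair (1, 2), so the curve rises once the cutoff filters that edge out.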
28. Model Search Using the SCI
• We've used the SALI to retrospectively analyze models
• Can we use SALI to develop models?
– Identify a model that captures the cliffs
• Tricky
– Cliffs are fundamentally outliers
– Optimizing for good SALI values implies overfitting
– Need to trade-off between SALI & generalizability
29. The Objective Function
• S0 is a measure of the model's ability to summarize the dataset (analogous to RMSE)
• S100 measures the model's ability to capture cliffs
• Ideally, the curve starts high and stays high
[Figure: S(X) vs SALI Cutoff, with S0 and S100 marked on the curve]
$$ F = \frac{1}{S_{100}}, \qquad F = \frac{1}{S_0} + \frac{S_{100} - S_0}{2}, \qquad F = \frac{1}{\mathrm{SCI}} $$
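The three objective functions can be evaluated from a sampled SALI curve as sketched below. Several things here are assumptions: S0 and S100 are taken as the first and last points of the curve, SCI is computed as the trapezoidal integral of S(X) over the cutoff range, and the curve values are invented. The reciprocal forms break down when S0, S100 or SCI is zero, which this sketch does not guard against.

```python
def sci(cutoffs, s_values):
    """Trapezoidal integral of S(X) over the cutoffs (assumed SCI definition)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for x0, x1, y0, y1 in zip(cutoffs, cutoffs[1:], s_values, s_values[1:])
    )


def objectives(cutoffs, s_values):
    """Evaluate F = 1/S100, F = 1/S0 + (S100 - S0)/2 and F = 1/SCI."""
    s0, s100 = s_values[0], s_values[-1]  # assumed: curve end points
    return {
        "1/S100": 1.0 / s100,
        "1/S0 + D/2": 1.0 / s0 + (s100 - s0) / 2.0,
        "1/SCI": 1.0 / sci(cutoffs, s_values),
    }


# Hypothetical SALI curve sampled at three cutoffs.
cutoffs = [0.0, 0.5, 1.0]
s_values = [0.8, 0.9, 1.0]
f = objectives(cutoffs, s_values)
```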
30. SALI Based Model Selection
• Considered the BZR dataset from Sutherland et al
• Identified "best" models using a GA to select from a pool of 2D descriptors
• While SALI based optimization can lead to a "better" curve, it doesn't give the best model
[Figure: S(X) vs SALI Cutoff for models selected with the RMSE, SCI and S(100) objectives; two panels, cutoff ranges 0.0–1.0 and 0.00–0.08]
Sutherland, J et al, J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915
31. SALI Based Model Selection
• 107 aryl azoles as ER-β agonists
• Used a GA and 2D descriptors to identify models
• In this case, a SALI based objective function was able to identify the best model
• Interestingly, SCI does not seem to perform very well
[Figure: S(X) vs SALI Cutoff for models selected with the RMSE, SCI and S(0) + D/2 objectives; two panels, cutoff ranges 0.0–1.0 and 0.00–0.08]
Malamas, M.S. et al, J Med Chem, 2004, 47, 5021–5040
32. SALI Based Model Selection
• The size of the solution space explored depends on the SALI objective function
[Figure: RMSE by Objective Function for the BZR and ER-β datasets; objective functions shown include RMSE, S(100), SCI and 1/S(0) + D/2]
33. Predicting the Landscape
• Rather than predicting activity directly, we can try to predict the SAR landscape
• Implies that we attempt to directly predict cliffs
– Observations are now pairs of molecules
• A more complex problem
– Choice of features is trickier
– Still face the problem of cliffs as outliers
– Somewhat similar to predicting activity differences
Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115–122
34. Predicting Cliffs
• Dependent variables are pairwise SALI values, calculated using fingerprints
• Independent variables are molecular descriptors, but considered pairwise
– Absolute difference of descriptor pairs, or
– Geometric mean of descriptor pairs
– …
• Develop a model to correlate pairwise descriptors to pairwise SALI values
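The two pairwise feature constructions named above can be sketched as follows. The descriptor values are hypothetical, and the geometric mean as written assumes non-negative descriptors (a negative product has no real square root).

```python
import math
from itertools import combinations


def pairwise_features(descriptors, mode="absdiff"):
    """Map each molecule pair (i, j) to a pairwise descriptor vector.

    mode="absdiff": |x_i - x_j| per descriptor.
    mode="geomean": sqrt(x_i * x_j) per descriptor (needs non-negative values).
    """
    rows = {}
    for i, j in combinations(range(len(descriptors)), 2):
        a, b = descriptors[i], descriptors[j]
        if mode == "absdiff":
            rows[(i, j)] = [abs(x - y) for x, y in zip(a, b)]
        elif mode == "geomean":
            rows[(i, j)] = [math.sqrt(x * y) for x, y in zip(a, b)]
        else:
            raise ValueError(f"unknown mode: {mode}")
    return rows


# Hypothetical descriptor table: three molecules, two descriptors each.
descriptors = [[1.0, 4.0], [4.0, 9.0], [9.0, 16.0]]
abs_diff = pairwise_features(descriptors, "absdiff")
geo_mean = pairwise_features(descriptors, "geomean")
```

Each row of the resulting table would then be paired with the SALI value of the same (i, j) pair as the dependent variable.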
35. A Test Case
• We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50s
• Evaluate topological and physicochemical descriptors
• Developed random forest models
– On the original observed values (30 obs)
– On the SALI values (435 observations)
Cavalli, A. et al, J Med Chem, 2002, 45, 3844–3853
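The jump from 30 to 435 observations follows from counting unordered pairs: 30 molecules give 30 × 29 / 2 = 435 distinct pairs. A short check, with indices standing in for the molecules:

```python
from itertools import combinations
from math import comb

n_molecules = 30
pairs = list(combinations(range(n_molecules), 2))  # every unordered pair (i, j), i < j
n_pairs = len(pairs)                                # 435, matching comb(30, 2)
```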
36. Double Counting Structures?
• The dependent and independent variables both encode structure.
• But pretty low correlations between individual pairwise descriptors and the SALI values
[Figure: histograms (Percent of Total) of R2 between individual pairwise descriptors and SALI values, panels GeoMean and AbsDiff; R2 axis runs from 0.00 to 0.15]
38. Test Case 2
• Considered the Holloway docking dataset, 32 molecules with pIC50s and Einter
• Similar strategy as before
• Need to transform SALI values
• Descriptors show minimal correlation
[Figure: histograms (Percent of Total) of SALI and of log10(SALI)]
Holloway, M.K. et al, J Med Chem, 1995, 38, 305–317
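The figure suggests a log10 transform of the SALI values. A minimal sketch, with made-up values; dropping non-positive entries (pairs with identical activity have SALI = 0, where the logarithm is undefined) is an assumption rather than something the slides specify.

```python
import math


def log_transform(sali_values):
    """log10 of each positive SALI value; non-positive values are skipped."""
    return [math.log10(v) for v in sali_values if v > 0]


# Hypothetical, strongly skewed SALI values spanning several orders of magnitude.
raw = [0.0, 0.1, 1.0, 10.0, 100.0]
transformed = log_transform(raw)
```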