Structure-Activity Relationships and Networks: A Generalized Approach to Exploring Structure-Activity Landscapes


Rajarshi Guha
NIH Chemical Genomics Center / NIH Center for Translational Therapeutics
March 29, 2011

NIH Chemical Genomics Center

•  Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
–  NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
–  Post-doc program
•  Mission
–  MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
•  Complements individual investigator-initiated research programs
•  Enables "pharma-level" HTS and early chemical optimization
–  Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
–  New paradigms & applications of HTS for chemical biology / chemical genomics
•  All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
–  75% NIH extramural, 10% NIH intramural, 15% Foundations/Research Consortia/Pharma/Biotech

NCGC Project Diversity

[Figure: (A) Disease areas; (B) Target types; (C) Detection methods]

qHTS: High Throughput Dose Response

[Figure: qHTS workflow, panels A–C]
•  Assay concentration ranges over 4 logs (high: ~100 μM)
•  1536-well plates, inter-plate dilution series
•  Assay volumes 2–5 μL
•  Automated concentration-response data collection, ~1 CRC/sec
•  Informatics pipeline: automated curve fitting and classification; 300K samples

Background

•  Cheminformatics methods
–  QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks
•  More recently
–  RNAi screening, high content imaging
•  Extensive use of machine learning
•  All tied together with software development
–  User-facing GUI tools
–  Low level programmatic libraries
•  Believer & practitioner of Open Source

Outline

•  Structure-activity relationships
•  Characterizing activity cliffs
•  Working with the structure-activity landscape

Structure Activity Relationships

•  Similar molecules will have similar activities
•  Small changes in structure will lead to small changes in activity
•  One implication is that SARs are additive
•  This is the basis for QSAR modeling

Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358

Exceptions Are Easy to Find

[Figure: four closely related structures (chemical drawings not recoverable from the extraction) with Ki = 39.0 nM, 1.8 nM, 10.0 nM and 1.0 nM]

Tran, J.A. et al., Bioorg. Med. Chem. Lett., 2007, 15, 5166–5176

Structure Activity Landscapes

•  Rugged gorges or rolling hills?
–  Small structural changes associated with large activity changes represent steep slopes in the landscape
–  But traditionally, QSAR assumes gentle slopes
–  Machine learning is not very good for special cases

Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535

Structure Activity Landscapes

Characterizing the Landscape

•  A cliff can be numerically characterized
•  Structure Activity Landscape Index (SALI)

$$\mathrm{SALI}_{i,j} = \frac{|A_i - A_j|}{1 - \mathrm{sim}(i,j)}$$

where A_i and A_j are the activities of molecules i and j, and sim(i, j) is their structural similarity.

•  Cliffs are characterized by elements of the matrix with very large values

Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
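As a minimal sketch of the SALI definition on this slide, the pure-Python function below fills the pairwise matrix from a list of activities and a precomputed similarity matrix. The function name, the toy numbers and the handling of identical representations (similarity of 1) are assumptions made for illustration, not part of the talk.

```python
import math

def sali_matrix(activities, similarity):
    """Pairwise SALI values for n molecules.

    activities: list of n activity values (e.g. pIC50)
    similarity: n x n nested list of pairwise similarities in [0, 1]
    """
    n = len(activities)
    sali = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = abs(activities[i] - activities[j])
            denom = 1.0 - similarity[i][j]
            # sim == 1 makes the ratio undefined; marking such pairs as
            # infinite is a choice made for this sketch only.
            value = diff / denom if denom > 0 else math.inf
            sali[i][j] = sali[j][i] = value
    return sali

# Toy data (made up): molecules 0 and 1 are similar but differ in activity.
activities = [7.4, 8.7, 5.1]
similarity = [[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.3],
              [0.2, 0.3, 1.0]]
sali = sali_matrix(activities, similarity)
```

In this toy example the similar pair (0, 1) receives a much larger SALI value than the dissimilar pair (0, 2), even though the latter has the larger activity difference.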

Visualizing the SALI Matrix

Fingerprints

1 0 1 1 0 0 0 1 0

•  Lots of types of fingerprints
•  Indicates the presence or absence of a structural feature
•  Length can vary from 166 to 4096 bits or more
•  Fingerprints usually compared using the Tanimoto metric
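The Tanimoto comparison mentioned above can be sketched in a few lines: the number of bits set in both fingerprints divided by the number set in either. The first fingerprint is the bit string shown on the slide; the second is made up, and returning 0.0 for two all-zero fingerprints is a convention assumed here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length bit lists."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)    # bits set in both
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)   # bits set in at least one
    # Two all-zero fingerprints: 0.0 is a convention chosen for this sketch.
    return both / either if either else 0.0

fp1 = [1, 0, 1, 1, 0, 0, 0, 1, 0]   # bit string from the slide
fp2 = [1, 0, 0, 1, 0, 1, 0, 1, 0]   # made-up second fingerprint
sim = tanimoto(fp1, fp2)            # 3 shared bits / 5 bits set in either
```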

Varying Fingerprint Methods

[Figure: density of pairwise Tanimoto similarity for three fingerprint types: BCI 1052 bit, MACCS 166 bit, CDK 1024 bit]

•  Shorter fingerprints will lead to more "similar" pairs
•  Requires a higher cutoff to focus on significant cliffs

Varying the Similarity Metric

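Different Activity Representations

•  Using the Hill parameters from a dose-response curve represents richer data than a single IC50

[Figure: dose-response curve, Activity versus Concentration, annotated with S0, Sinf, AC50 and the 50% response level]

$$P = \begin{pmatrix} S_0 \\ S_{\mathrm{inf}} \\ AC_{50} \\ H \end{pmatrix}, \qquad \mathrm{SALI}_{i,j} = \frac{d(P_i, P_j)}{1 - \mathrm{sim}(i,j)}$$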
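A rough sketch of this variant follows. The slide does not say which distance d is used between the Hill parameter vectors, so Euclidean distance is assumed here; the function name and the parameter values are made up. In practice the four parameters live on very different scales, so some rescaling would probably be needed before taking a distance.

```python
import math

def hill_sali(params_i, params_j, sim_ij):
    """SALI with the activity difference replaced by a distance between
    Hill parameter vectors (S0, Sinf, AC50, H)."""
    d = math.dist(params_i, params_j)   # Euclidean distance: an assumption
    if sim_ij >= 1.0:
        return math.inf                 # identical representations, sketch-only choice
    return d / (1.0 - sim_ij)

# Made-up Hill parameters for two compounds with similarity 0.75.
p_i = (0.0, 100.0, 1.0, 1.0)
p_j = (0.0, 100.0, 4.0, 5.0)
value = hill_sali(p_i, p_j, 0.75)       # distance 5.0 divided by 0.25
```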

Visualizing SALI Values

•  Alternatives?
–  A heatmap is an easy to understand visualization
–  Coupled with brushing, can be a handy tool
–  A more flexible approach is to consider a network view of the matrix
•  The SALI graph
–  Compounds are nodes
–  Nodes i,j are connected if SALI(i,j) > X
–  Only display connected nodes
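The three rules for the SALI graph translate almost directly into code. The sketch below works on a symmetric SALI matrix and returns plain node and edge lists; the function name and the toy matrix are assumptions, and the edges are left undirected because the slide does not specify a direction.

```python
def sali_graph(sali, cutoff):
    """Edges (i, j) with i < j whose SALI value exceeds the cutoff,
    plus the nodes that take part in at least one edge."""
    n = len(sali)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if sali[i][j] > cutoff]
    # Only connected nodes are kept, as on the slide.
    nodes = sorted({k for edge in edges for k in edge})
    return nodes, edges

# Toy symmetric SALI matrix (made up) for four compounds.
sali = [[0.0, 13.0, 2.9, 0.5],
        [13.0, 0.0, 6.0, 0.4],
        [2.9, 6.0, 0.0, 0.3],
        [0.5, 0.4, 0.3, 0.0]]
nodes, edges = sali_graph(sali, cutoff=2.0)
```

Compound 3 has no SALI value above the cutoff, so it drops out of the displayed graph.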

Visualizing SALI Values

[Figure: example SALI graph; nodes are numbered compounds and only connected nodes are shown]

Varying the Cutoff

•  The cutoff controls the complexity of the graph
•  Higher cutoffs will highlight the most significant activity cliffs

[Figure: SALI graphs of the same dataset at Cutoff = 90%, Cutoff = 50% and Cutoff = 20%]

Better Visualization - SALIViewer

http://sali.rguha.net

What Can We Do With SALIs?

•  SALI characterizes cliffs & non-cliffs
•  For a given molecular representation, SALIs give us an idea of the smoothness of the SAR landscape
•  Models try to encode this landscape
•  Use the landscape to guide descriptor or model selection

Descriptor Space Smoothness

[Figure: number of edges in the SALI graph (0 to 400) versus SALI cutoff (0.0 to 1.0), with example SALI graphs whose nodes are labelled with compound names such as astemizole, cisapride and gatifloxacin]

•  Edge count of the SALI graph for varying cutoffs
•  Measures smoothness of the descriptor space
•  Can reduce this to a single number (AUC)
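The edge-count curve and its AUC can be sketched as follows. The plot runs the cutoff from 0.0 to 1.0, so this sketch assumes the cutoff is a fraction of the largest SALI value in the dataset; that reading, the function names and the toy values are assumptions, and the AUC uses the plain trapezoidal rule.

```python
def edge_count_curve(sali_values, cutoffs):
    """Number of SALI graph edges at each cutoff, where a cutoff is taken
    to be a fraction of the largest SALI value (assumption)."""
    top = max(sali_values)
    return [sum(1 for v in sali_values if v > c * top) for c in cutoffs]

def auc(xs, ys):
    """Trapezoidal area under the curve defined by the points (xs, ys)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))

# Made-up pairwise SALI values (one per molecule pair).
sali_values = [0.3, 0.4, 0.5, 2.9, 7.0, 13.0]
cutoffs = [0.0, 0.25, 0.5, 0.75, 1.0]
counts = edge_count_curve(sali_values, cutoffs)
area = auc(cutoffs, counts)
```

A curve that collapses quickly, as in this toy case, means a few pairs dominate the SALI range; a curve that stays high means many pairs have comparable SALI values.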

Other Examples

•  Instead of fingerprints, we use molecular descriptors
•  SALI denominator now uses Euclidean distance
•  2D & 3D random descriptor sets
–  None are really good
–  Too rough, or
–  Too flat

[Figure: two curves plotted against SALI cutoff (0.0 to 1.0; y-axis 0 to 400), one for the 2D and one for the 3D descriptor set]

Feature Selection Using SALI

•  Surprisingly, exhaustive search of 66,000 4-descriptor combinations did not yield semi-smoothly decreasing curves
•  Not entirely clear what type of curve is desirable

SALI Graphs & Predictive Models

•  The graph view allows us to view SARs and identify trends easily
•  The aim of a QSAR model is to encode SARs
•  Traditionally, we consider the quality of a model in terms of RMSE or R²
•  But in general, we're not as interested in RMSEs as we are in whether the model predicted something as more active than something else
–  What we want to have is the correct ordering
–  We assume the model is statistically significant

Measuring Model Quality

•  A QSAR model should easily encode the "rolling hills"
•  A good model captures the most significant cliffs
•  Can be formalized as: How many of the edge orderings of a SALI graph does the model predict correctly?
•  Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X
•  Repeat for varying X and obtain the SALI curve
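One way to compute a single point S(X) of the SALI curve is sketched below. The plotted curves range from -1 to 1, so this sketch assumes the count is normalized as (correct - incorrect) / total edges; that normalization, the function name and the toy data are assumptions rather than something stated on the slide.

```python
def sali_curve_point(edges, observed, predicted):
    """S(X) for the SALI graph at one cutoff X.

    edges: list of (i, j) pairs in the SALI graph at cutoff X
    observed, predicted: activity lists indexed by compound
    """
    if not edges:
        return 0.0
    score = 0
    for i, j in edges:
        obs = observed[i] - observed[j]
        pred = predicted[i] - predicted[j]
        if obs * pred > 0:
            score += 1      # model ranks the pair in the observed order
        elif obs * pred < 0:
            score -= 1      # model reverses the observed order
        # ties in either ordering contribute nothing
    return score / len(edges)

# Toy example (made up): two orderings right, one wrong.
observed = [5.0, 8.0, 6.0, 7.5]
predicted = [5.5, 7.0, 6.5, 6.0]
edges = [(0, 1), (0, 2), (2, 3)]
s_x = sali_curve_point(edges, observed, predicted)
```

Repeating this for the edge lists obtained at a series of cutoffs X gives the SALI curve.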

SALI Curves

[Figure: two panels of SALI curves, S(X) from -1.0 to 1.0 versus X from 0.0 to 1.0; legend: 3-descriptor, 5-descriptor and scrambled 3-descriptor models; one panel annotated SCI = 0.12]

Model Search Using the SCI

•  We've used the SALI to retrospectively analyze models
•  Can we use SALI to develop models?
–  Identify a model that captures the cliffs
•  Tricky
–  Cliffs are fundamentally outliers
–  Optimizing for good SALI values implies overfitting
–  Need to trade off between SALI & generalizability

The Objective Function

•  S0 is a measure of the model's ability to summarize the dataset (analogous to RMSE)
•  S100 measures the model's ability to capture cliffs
•  Ideally, the curve starts high and stays high

[Figure: SALI curve, S(X) versus SALI cutoff (0.0 to 1.0), with S0 and S100 marked at the two ends]

$$F = \frac{1}{S_{100}} \qquad F = \frac{1}{S_0} + \frac{S_{100} - S_0}{2} \qquad F = \frac{1}{\mathrm{SCI}}$$
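The three candidate objective functions can be evaluated from a sampled SALI curve as sketched below. Here S0 and S100 are taken as the first and last points of the curve, and SCI is assumed to be the trapezoidal area under S(X) over the cutoff range; the slides do not define SCI, so that reading, along with the function name and the toy curve, is an assumption. The sketch also does not guard against a zero denominator.

```python
def objective_values(cutoffs, s_values):
    """Evaluate the three objective functions from a sampled SALI curve."""
    s0, s100 = s_values[0], s_values[-1]
    # SCI taken as the trapezoidal area under S(X): an assumption.
    sci = sum((x1 - x0) * (y0 + y1) / 2.0
              for x0, x1, y0, y1 in zip(cutoffs, cutoffs[1:],
                                        s_values, s_values[1:]))
    return {
        "1/S100": 1.0 / s100,
        "1/S0 + D/2": 1.0 / s0 + (s100 - s0) / 2.0,   # D = S100 - S0
        "1/SCI": 1.0 / sci,
    }

# Made-up SALI curve sampled at three cutoffs.
cutoffs = [0.0, 0.5, 1.0]
s_values = [0.8, 0.6, 0.5]
f = objective_values(cutoffs, s_values)
```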

SALI Based Model Selection

•  Considered the BZR dataset from Sutherland et al.
•  Identified "best" models using a GA to select from a pool of 2D descriptors
•  While SALI based optimization can lead to a "better" curve, it doesn't give the best model

[Figure: SALI curves, S(X) versus SALI cutoff, for models selected with the RMSE, SCI and S(100) objective functions; shown over the full cutoff range 0.0 to 1.0 and zoomed to 0.00 to 0.08]

Sutherland, J. et al., J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915

SALI Based Model Selection

•  107 aryl azoles as ER-β agonists
•  Used a GA and 2D descriptors to identify models
•  In this case, a SALI based objective function was able to identify the best model
•  Interestingly, SCI does not seem to perform very well

[Figure: SALI curves, S(X) versus SALI cutoff, for models selected with the RMSE, SCI and S(0) + D/2 objective functions; shown over the full cutoff range 0.0 to 1.0 and zoomed to 0.00 to 0.08]

Malamas, M.S. et al., J. Med. Chem., 2004, 47, 5021–5040

SALI Based Model Selection

•  The size of the solution space explored depends on the SALI objective function

[Figure: RMSE by objective function for the BZR and ER-β datasets; objective functions shown include RMSE, S(100), SCI and 1/S(0) + D/2]

Predicting the Landscape

•  Rather than predicting activity directly, we can try to predict the SAR landscape
•  Implies that we attempt to directly predict cliffs
–  Observations are now pairs of molecules
•  A more complex problem
–  Choice of features is trickier
–  Still face the problem of cliffs as outliers
–  Somewhat similar to predicting activity differences

Scheiber et al., Statistical Analysis and Data Mining, 2009, 2, 115–122

Predicting Cliffs

•  Dependent variables are pairwise SALI values, calculated using fingerprints
•  Independent variables are molecular descriptors, but considered pairwise
–  Absolute difference of descriptor pairs, or
–  Geometric mean of descriptor pairs
–  …
•  Develop a model to correlate pairwise descriptors to pairwise SALI values
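The two pairwise aggregations named above can be sketched as follows: each unordered pair of molecules becomes one observation whose features combine the two molecules' descriptor values column by column. The function name and the toy descriptor values are assumptions, and the geometric mean as written only makes sense for non-negative descriptors.

```python
import math
from itertools import combinations

def pairwise_descriptors(descriptors, how="absdiff"):
    """One feature row per unordered molecule pair.

    descriptors: list of per-molecule descriptor lists (equal length)
    how: "absdiff" for |x - y|, "geomean" for sqrt(x * y)
    """
    rows = []
    for a, b in combinations(descriptors, 2):
        if how == "absdiff":
            rows.append([abs(x - y) for x, y in zip(a, b)])
        elif how == "geomean":
            # Requires non-negative descriptor values (assumption of this sketch).
            rows.append([math.sqrt(x * y) for x, y in zip(a, b)])
        else:
            raise ValueError("unknown aggregation: " + how)
    return rows

# Made-up descriptor table: three molecules, two descriptors each.
descriptors = [[1.0, 4.0], [4.0, 9.0], [9.0, 16.0]]
abs_rows = pairwise_descriptors(descriptors, "absdiff")
geo_rows = pairwise_descriptors(descriptors, "geomean")
```

Because every unordered pair becomes a row, n molecules yield n(n-1)/2 observations, so a 30-molecule dataset gives 435 pairs.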

A Test Case

•  We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50s
•  Evaluate topological and physicochemical descriptors
•  Developed random forest models
–  On the original observed values (30 obs)
–  On the SALI values (435 observations)

Cavalli, A. et al., J. Med. Chem., 2002, 45, 3844–3853

Double Counting Structures?

•  The dependent and independent variables both encode structure.
•  But pretty low correlations between individual pairwise descriptors and the SALI values

[Figure: histograms (percent of total) of R² between individual pairwise descriptors and SALI values, for the GeoMean and AbsDiff aggregations; R² ranges from 0.00 to 0.15]

Model Summaries

[Figure: predicted versus observed scatter plots for three models: original pIC50 (RMSE = 0.97), SALI with AbsDiff (RMSE = 1.10), SALI with GeoMean (RMSE = 1.04)]

•  All models explain similar % of variance of their respective datasets
•  Using geometric mean as the descriptor aggregation function seems to perform best
•  SALI models are more robust due to larger size of the dataset

Test Case 2

•  Considered the Holloway docking dataset, 32 molecules with pIC50s and Einter
•  Similar strategy as before
•  Need to transform SALI values
•  Descriptors show minimal correlation

[Figure: histograms (percent of total) of SALI values (0 to 120) and of log10(SALI) (-1 to 2)]

Holloway, M.K. et al., J. Med. Chem., 1995, 38, 305–317
