Chemogenomics in the cloud: Is the sky the limit?

Rajarshi Guha, Ph.D.
NIH Center for Translational Therapeutics
June 28, 2012

The cloud as infrastructure

•  Cloud computing is a service for
–  Infrastructure
–  Platform
–  Software
•  Many of the benefits of cloud computing are
–  Economic
–  Political
•  Won't be discussing the remote hosting aspects of clouds

Characteristics of the cloud

[Figure: characteristics surrounding "Cloud Computing": virtually assemble, pay-per-use, offsite technology, shared workloads, massive scale, on-demand self service]

http://www.slideshare.net/haslinatuanhim/slides-cloud-computing

Parallel computing in the cloud

•  Modern cloud vendors make provisioning compute resources easy
–  Allows one to handle unpredictable loads easily
–  Pay only for what you need
•  Chemistry applications don't usually have very dynamic loads
•  But large scale resources are an opportunity for large scale (parallel) computations

Storing chemical information

•  Fill up a hard drive, mail to Amazon
•  Copy over the network
–  Aspera
–  GridFTP
•  Still need to pay for storage space
•  Lots of options on the cloud: S3, relational DBs
•  See Chris Dagdigian's talk for views on storage

http://www.slideshare.net/chrisdag/2012-trends-from-the-trenches

Recoding for the cloud?

•  Only if we really have to
•  Large amounts of legacy code run perfectly well on local clusters
–  May not make sense to recode as a map-reduce job
–  May not be possible to
•  Different levels of HPC on the cloud
–  Legacy HPC
–  'Cloudy' HPC
–  Big Data HPC

http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud

Recoding for the cloud?

Legacy HPC
•  Use cloud resources in the same way as a local cluster
•  MIT StarCluster makes this easy to do

Cloudy HPC
•  Make use of cloud capabilities
•  Old algorithms, new infrastructure
•  Spot instances, SNS, SQS, SimpleDB, S3, etc.

Big Data HPC
•  Huge datasets
•  Candidates for map-reduce
•  Involves algorithm (re)design

http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud

How does the cloud enable science?

•  How does the cloud change computational chemistry, cheminformatics, …
–  The way we do them
–  The scale at which we do them

Are there problems we can address that we could not have addressed without on-demand, scalable cloud resources?

Big data & cheminformatics

•  Computation over large chemical databases
–  PubChem, ChEMBL, …
•  What types of computations?
–  Searches (substructure, pharmacophore, …)
–  QSAR models over large data
–  Predictions for large data
•  Certain applications just need structures
•  Access to correspondingly massive experimental datasets is tough (impossible?)

Big data & cheminformatics

•  GDB-13 is a truly big database: 977 million different structures
–  Current search interface is based on NN searches using a reduced representation
–  Could be a good candidate for a Hadoop based analysis
•  More generally, enumerated virtual libraries can also lead to very big data
–  Time required to enumerate is a bottleneck

Big data & cheminformatics

•  Fundamentally, "big chemical data" lets us explore larger chemical spaces
–  Can plow through large catalogs
–  e.g., identifying PKR inhibitors by LBVS of the ChemNavigator collection [Bryk et al]
•  This can push predictive models to their limits
–  Brings us back to the global vs local arguments

The Hadoop ecosystem

•  A framework for the map-reduce algorithm
–  Not something you can download and just run
–  Need to implement the infrastructure and then develop code to run using the infrastructure
•  Low level Hadoop programs can be large, complex and tedious
•  Abstractions have been developed that make Hadoop queries more SQL-like, which results in much more concise code
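
To make the map-reduce pattern concrete, here is a minimal sketch that counts atoms over a few SMILES strings in plain Python, with no Hadoop involved. The tokenizer is a deliberate simplification (organic-subset symbols only, bracket atoms ignored), and all function names are hypothetical and chosen for illustration.

```python
import re
from collections import defaultdict
from itertools import chain

# Simplified tokenizer (assumption): two-letter halogens first, then
# single-letter organic-subset symbols, aromatic lowercase included.
# This is not a full SMILES parser.
ATOM_RE = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def map_atoms(smiles):
    """Map step: emit one (element, 1) pair per atom token."""
    return [(tok.capitalize(), 1) for tok in ATOM_RE.findall(smiles)]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: sum the counts for one element."""
    return key, sum(values)

def count_atoms(smiles_list):
    mapped = chain.from_iterable(map_atoms(s) for s in smiles_list)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())

counts = count_atoms(["CCO", "c1ccccc1", "ClCCl"])
```

In a real Hadoop job the map and reduce functions would run on different nodes and the framework would handle the grouping step; the sketch only shows how the work is divided.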

The Hadoop ecosystem

[Figure: Hadoop ecosystem stack. Components shown: Chukwa, Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hive, Hama, Map Reduce Engine, Hadoop Distributed Filesystem, Hadoop Common]

Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem

Simplifying Hadoop applications

•  Raw Hadoop programs can be very tedious to write

[Figure: SMARTS based substructure search]

Pig & Pig Latin

•  Pig Latin programs are much simpler to write and get translated to Hadoop code
•  SQL-like, requires UDF to be implemented to perform non-standard tasks

SMARTS search in Pig Latin

    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';

UDF for SMARTS search

    public class SMATCH extends FilterFunc {
        static SMARTSQueryTool sqt;
        static {
            try {
                sqt = new SMARTSQueryTool("C");
            } catch (CDKException e) {
                System.out.println(e);
            }
        }
        static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() < 2) return false;
            String target = (String) tuple.get(0);
            String query = (String) tuple.get(1);
            try {
                sqt.setSmarts(query);
                IAtomContainer mol = sp.parseSmiles(target);
                return sqt.matches(mol);
            } catch (CDKException e) {
                throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
            }
        }
    }

Working on top of Hadoop

•  Hadoop doesn't know anything about cheminformatics
–  Need to write your own code, UDFs, etc.
•  But application layers have been developed for other purposes
–  Apache Mahout: a library for machine learning on data stored in Hadoop clusters
–  Possible to build virtual screening pipelines based on the Hadoop framework

What Hadoop is not for

•  Doesn't replace an actual database
•  It's not uniformly fast or efficient
•  Not good for ad hoc or realtime analysis
•  Not effective unless dealing with massive datasets
•  Not all algorithms are amenable to the map-reduce method
–  CPU bound methods and those requiring communication

Cheminformatics on Hadoop

•  Hadoop and Atom Counting
•  Hadoop and SD Files
•  Cheminformatics, Hadoop and EC2
•  Pig and Cheminformatics

But are cheminformatics problems really big enough to justify all of this?

How big is big?

•  Bryk et al performed a LBVS of 5 million compounds to identify PKR inhibitors
–  Pharmacophore fingerprints + perceptron
–  Required conformer generation
•  Given that conformer and descriptor generation are one-time tasks, screening 5M compounds doesn't take long
•  Example: RF models built on 512 bit binary fingerprints give us predictions for 5M fingerprints in 12 min [Single core, 3 GHz Xeon, OS X 10.6.8]
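
As a back-of-the-envelope check on the timing quoted above, the following sketch converts it into a per-second rate and, purely as an illustration, extrapolates to a 977 million structure collection (the GDB-13 size mentioned earlier). The extrapolation assumes the same single-core rate would hold and that fingerprints already exist; the slides do not claim this.

```python
def throughput(n_items, minutes):
    """Items processed per second for a run of the given length."""
    return n_items / (minutes * 60.0)

# 5M fingerprints in 12 minutes on a single core (figures from the slide)
rate = throughput(5_000_000, 12)

# Hypothetical extrapolation: same rate applied to 977 million structures
hours_for_977m = 977_000_000 / rate / 3600.0

print(f"{rate:.0f} predictions/s, about {hours_for_977m:.1f} h for 977M structures")
```

Under these assumptions the rate is roughly 6,900 predictions per second, so even a collection nearly 200 times larger stays within a couple of days on one core, before any parallelism.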

Going beyond chunking?

•  All the preceding use cases are embarrassingly parallel
–  Chunking the input data and applying the same operation to each chunk
–  Very nice when you have a big cluster

Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
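
The chunking pattern described above can be sketched as follows. The per-molecule operation is a hypothetical stand-in (it just counts letters in a SMILES string), and a thread pool stands in for cluster nodes; a real workload would call a cheminformatics toolkit and distribute chunks across machines.

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def letter_count(smiles):
    # Stand-in per-molecule operation (hypothetical): count alphabetic characters.
    return sum(ch.isalpha() for ch in smiles)

def process_chunk(chunk):
    """Apply the same operation to every molecule in one chunk."""
    return [letter_count(s) for s in chunk]

def run_chunked(items, size=2, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so chunks can be re-joined directly
        results = pool.map(process_chunk, chunks(items, size))
        return [value for part in results for value in part]

scores = run_chunked(["CCO", "CCN", "c1ccccc1", "CC(=O)O", "C"])
```

No chunk depends on any other, which is what makes the pattern embarrassingly parallel and also why it does not require map-reduce at the algorithmic level.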

Going beyond chunking?

•  Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
–  Doesn't always avoid the O(N²) barrier
–  Bioisostere identification is one case that could be rephrased as a map-reduce problem
•  Search algorithms such as GAs, particle swarms can make use of map-reduce
–  GA based docking
–  Feature selection for QSAR models
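
One way a pairwise calculation might be phrased in map-reduce terms is sketched below: the map step emits a similarity for each pair, keyed by molecule, and the reduce step keeps the best partner per key. Tanimoto similarity over sets of on-bits is an assumption chosen for illustration; the slides do not name a metric, and the molecule ids and fingerprints are made up.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def map_pairs(fps):
    """Map: emit (id, (other_id, similarity)) for both ends of every pair."""
    for (i, a), (j, b) in combinations(fps.items(), 2):
        s = tanimoto(a, b)
        yield i, (j, s)
        yield j, (i, s)

def reduce_nearest(pairs):
    """Reduce: keep the most similar partner seen for each id."""
    best = {}
    for key, (other, s) in pairs:
        if key not in best or s > best[key][1]:
            best[key] = (other, s)
    return best

# Hypothetical fingerprints as sets of on-bit positions
fps = {"m1": {1, 2, 3}, "m2": {2, 3, 4}, "m3": {7, 8}}
nearest = reduce_nearest(map_pairs(fps))
```

The map step still evaluates every pair, so the quadratic cost remains; what map-reduce offers here is a way to spread those evaluations across many workers.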

Going beyond chunking?

•  Machine learning for massive chemical datasets?
–  MR jobs (descriptor generation) + Mahout (model building) let us handle this in a straightforward manner
•  But will QSAR models benefit from more data?
–  Helgee et al suggest global models are preferable
–  But diversity and the structure of the chemical space will affect performance of global models
–  Unsupervised methods may be more relevant
–  Philosophical question?

Going beyond chunking?

•  Many clustering algorithms are amenable to map-reduce style
–  K-means, Spectral, EM, minhash, …
–  Many are implemented in Mahout

Problems where we generate large numbers of combinations can be amenable to map-reduce
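
To illustrate why k-means fits the map-reduce style, here is a sketch of a single iteration in plain Python: the map step assigns each point to its nearest centroid, and the reduce step averages the points per centroid. The function names and the toy 2D data are hypothetical, and this is not how Mahout implements it.

```python
from collections import defaultdict

def assign(point, centroids):
    """Map: emit (index of nearest centroid, point)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists)), point

def recompute(points):
    """Reduce: average the points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_step(points, centroids):
    groups = defaultdict(list)
    for key, point in (assign(p, centroids) for p in points):
        groups[key].append(point)
    # Centroids that attract no points are kept unchanged.
    return [recompute(groups[i]) if groups[i] else centroids[i]
            for i in range(len(centroids))]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
new_centroids = kmeans_step(points, [(1.0, 1.0), (9.0, 9.0)])
```

Each iteration is one map-reduce pass; a driver would repeat `kmeans_step` until the centroids stop moving.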

Networks & integration

•  Network models of molecules and targets are common
–  Allows for the incorporation of lots of associated information
–  Diseases, pathways, OTEs, …

Yildirim, M.A. et al

•  When linked with clinical data & outcomes, we can generate massive networks
–  Adverse events (FDA AERS)
–  Analysis by Cloudera considered > 10E6 drug-drug-reaction triples

Networks & integration

•  SAR data can be viewed in a network form
–  SALI, SARI based networks
–  Usually requires pairwise calculations of the metric

Peltason, L et al
http://sali.rguha.net/

•  Current studies have focused on small datasets (< 1000 molecules)
•  Hadoop + Giraph could let us apply this to HTS-scale datasets

Networks & integration

•  When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
–  Network based similarity
–  Community detection (aka clustering)
–  PageRank style ranking (of targets, compounds, …)
–  Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)

Bauer-Mehren et al
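
As an illustration of PageRank style ranking on a compound-target network, the sketch below runs a basic power iteration over a tiny, made-up graph. The node names, edges, damping factor and iteration count are all assumptions; a cloud-scale version would distribute the per-node updates, for example with a graph framework such as Giraph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Basic PageRank by power iteration over an adjacency-list graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new[t] += damping * rank[node] / n
        rank = new
    return rank

# Hypothetical network: edges point from compounds to the targets they hit
graph = {
    "cmpd1": ["targetA"],
    "cmpd2": ["targetA", "targetB"],
    "targetA": [],
    "targetB": [],
}
ranks = pagerank(graph)
```

In this toy graph the target hit by both compounds ends up ranked above the target hit by only one.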

Conclusions

•  Cheminformatics applications can be rewritten to take advantage of cloud resources
–  Remotely hosted
–  Embarrassingly parallel / chunked
–  Map/reduce
•  Ability to process larger structure collections lets us explore more chemical space
•  Integrating chemistry with clinical & pharmacological data can lead to big datasets

Conclusions

•  Q: But are cheminformatics problems really big enough to justify all of this?
•  A: Yes (virtual libraries, integrating chemical structure with other types and scales of data)
•  Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
•  A: Yes, especially when we consider problems with a combinatorial flavor
