Chemogenomics in the cloud: Is the sky the limit?
1. Chemogenomics in the cloud: Is the sky the limit?
Rajarshi Guha, Ph.D.
NIH Center for Translational Therapeutics
June 28, 2012
2. The cloud as infrastructure
• Cloud computing is a service for
  – Infrastructure
  – Platform
  – Software
• Many of the benefits of cloud computing are
  – Economic
  – Political
• Won’t be discussing the remote hosting aspects of clouds
3. Characteristics of the cloud
[Diagram: "Cloud Computing" at the center, surrounded by: virtually assemble, pay-per-use, offsite technology, shared workloads, massive scale, on-demand self service]
http://www.slideshare.net/haslinatuanhim/slides-cloud-computing
4. Parallel computing in the cloud
• Modern cloud vendors make provisioning compute resources easy
  – Allows one to handle unpredictable loads easily
  – Pay only for what you need
• Chemistry applications don’t usually have very dynamic loads
• But large scale resources are an opportunity for large scale (parallel) computations
5. Storing chemical information
• Fill up a hard drive, mail to Amazon
• Copy over the network
  – Aspera
  – GridFTP
• Still need to pay for storage space
• Lots of options on the cloud: S3, relational DBs
• See Chris Dagdigian’s talk for views on storage
http://www.slideshare.net/chrisdag/2012-trends-from-the-trenches
6. Recoding for the cloud?
• Only if we really have to
• Large amounts of legacy code run perfectly well on local clusters
  – May not make sense to recode as a map-reduce job
  – May not be possible to?
• Different levels of HPC on the cloud
  – Legacy HPC
  – ‘Cloudy’ HPC
  – Big Data HPC
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
7. Recoding for the cloud?
Legacy HPC
• Use cloud resources in the same way as a local cluster
• MIT StarCluster makes this easy to do
Cloudy HPC
• Make use of cloud capabilities
• Old algorithms, new infrastructure
• Spot instances, SNS, SQS, SimpleDB, S3, etc
Big Data HPC
• Huge datasets
• Candidates for map-reduce
• Involves algorithm (re)design
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
8. How does the cloud enable science?
• How does the cloud change computational chemistry, cheminformatics, …
  – The way we do them
  – The scale at which we do them
Are there problems that we can address that we could not address without on-demand, scalable cloud resources?
9. Big data & cheminformatics
• Computation over large chemical databases
  – PubChem, ChEMBL, …
• What types of computations?
  – Searches (substructure, pharmacophore, …)
  – QSAR models over large data
  – Predictions for large data
• Certain applications just need structures
• Access to correspondingly massive experimental datasets is tough (impossible?)
10. Big data & cheminformatics
• GDB-13 is a truly big database: 977 million different structures
  – Current search interface is based on NN searches using a reduced representation
  – Could be a good candidate for a Hadoop based analysis
• More generally, enumerated virtual libraries can also lead to very big data
  – Time required to enumerate is a bottleneck
11. Big data & cheminformatics
• Fundamentally, “big chemical data” lets us explore larger chemical spaces
  – Can plow through large catalogs
  – e.g., identifying PKR inhibitors by LBVS of the ChemNavigator collection [Bryk et al]
• This can push predictive models to their limits
  – Brings us back to the global vs local arguments
12. The Hadoop ecosystem
• A framework for the map-reduce algorithm
  – Not something you can download and just run
  – Need to implement the infrastructure and then develop code to run using the infrastructure
• Low level Hadoop programs can be large, complex and tedious
• Abstractions have been developed that make Hadoop queries more SQL-like, which results in much more concise code
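To make the map-reduce idea concrete, here is a minimal in-memory sketch in plain Python. It does not use Hadoop. The records, the function names and the atom-counting task are illustrative assumptions; a real Hadoop job would spread the same three phases (map, shuffle, reduce) across a cluster.

```python
from collections import defaultdict

def map_atoms(record):
    """Map step: emit an (element, 1) pair for every atom in one molecule."""
    _mol_id, atoms = record
    return [(element, 1) for element in atoms]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: sum all the counts emitted for one element."""
    return key, sum(values)

def run_job(records):
    emitted = [pair for record in records for pair in map_atoms(record)]
    return dict(reduce_counts(key, values) for key, values in shuffle(emitted).items())

# Hypothetical, pre-parsed input: (molecule name, list of heavy-atom symbols).
molecules = [
    ("ethanol", ["C", "C", "O"]),
    ("acetamide", ["C", "C", "N", "O"]),
]
atom_counts = run_job(molecules)
print(atom_counts)
```

Each map call sees only one record and each reduce call sees only one key, which is what lets a framework run them on separate machines.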
13. The Hadoop ecosystem
[Diagram of the Hadoop stack: Chukwa, Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hive, Hama, Map Reduce Engine, Hadoop Distributed Filesystem, Hadoop Common]
Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
15. Pig & Pig Latin
• Pig Latin programs are much simpler to write and get translated to Hadoop code
• SQL-like, requires UDF to be implemented to perform non-standard tasks
SMARTS search in Pig Latin:
    -- load one SMILES string per line
    A = load 'medium.smi' as (smiles:chararray);
    -- keep only the molecules that match the SMARTS pattern
    B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';
UDF for SMARTS search:
    public class SMATCH extends FilterFunc {
        // Shared query tool; the initial pattern is replaced on every call
        static SMARTSQueryTool sqt;
        static {
            try {
                sqt = new SMARTSQueryTool("C");
            } catch (CDKException e) {
                System.out.println(e);
            }
        }
        static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

        // tuple field 0 is the target SMILES, field 1 is the SMARTS query
        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() < 2) return false;
            String target = (String) tuple.get(0);
            String query = (String) tuple.get(1);
            try {
                sqt.setSmarts(query);
                IAtomContainer mol = sp.parseSmiles(target);
                return sqt.matches(mol);
            } catch (CDKException e) {
                throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
            }
        }
    }
16. Working on top of Hadoop
• Hadoop doesn’t know anything about cheminformatics
  – Need to write your own code, UDFs etc
• But application layers have been developed for other purposes
  – Apache Mahout: a library for machine learning on data stored in Hadoop clusters
  – Possible to build virtual screening pipelines based on the Hadoop framework
17. What Hadoop is not for
• Doesn’t replace an actual database
• It’s not uniformly fast or efficient
• Not good for ad hoc or realtime analysis
• Not effective unless dealing with massive datasets
• Not all algorithms are amenable to the map-reduce method
  – CPU bound methods and those requiring communication
18. Cheminformatics on Hadoop
• Hadoop and Atom Counting
• Hadoop and SD Files
• Cheminformatics, Hadoop and EC2
• Pig and Cheminformatics
But are cheminformatics problems really big enough to justify all of this?
19. How big is big?
• Bryk et al performed a LBVS of 5 million compounds to identify PKR inhibitors
  – Pharmacophore fingerprints + perceptron
  – Required conformer generation
• Given that conformer and descriptor generation are one-time tasks, screening 5M compounds doesn’t take long
• Example: RF models built on 512-bit binary fingerprints give us predictions for 5M fingerprints in 12 min [Single core, 3 GHz Xeon, OS X 10.6.8]
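As a rough back-of-the-envelope check, the sketch below turns the quoted 12 minutes for 5M fingerprints into a per-second rate and extrapolates it to the 977 million structures quoted earlier for GDB-13. The linear scaling, and the assumption that fingerprints are already computed, are illustrative assumptions and not measurements from the talk.

```python
# Figures quoted on the slides.
n_screened = 5_000_000       # fingerprints predicted
minutes = 12                 # wall time on a single core
gdb13_size = 977_000_000     # structures in GDB-13

# Throughput implied by the single-core timing.
rate_per_second = n_screened / (minutes * 60)

# Naive linear extrapolation to GDB-13 on the same single core.
gdb13_hours = gdb13_size / rate_per_second / 3600

print(f"{rate_per_second:.0f} predictions/s, about {gdb13_hours:.0f} h for GDB-13 on one core")
```

Under these assumptions even GDB-13 is a matter of a day or two on one core. That is consistent with the slide's point that prediction alone is not what makes a problem "big".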
20. Going beyond chunking?
• All the preceding use cases are embarrassingly parallel
  – Chunking the input data and applying the same operation to each chunk
  – Very nice when you have a big cluster
Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
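The chunking pattern can be sketched in a few lines of Python. The descriptor here is a deliberately trivial placeholder (string length). The thread pool is only a stand-in that keeps the example self-contained; a process pool or a cluster scheduler would fill that role for real CPU-bound work. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def process_chunk(chunk):
    """Apply the same operation to every item in one chunk.

    The 'descriptor' is a placeholder: the length of the SMILES string.
    """
    return [len(smiles) for smiles in chunk]

def run_chunked(items, size, workers=4):
    """Process chunks independently and stitch the results back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunked(items, size))
        return [value for chunk_result in results for value in chunk_result]

smiles = ["CCO", "c1ccccc1", "CC(=O)N", "O", "CCN(CC)CC"]
print(run_chunked(smiles, size=2))
```

No chunk needs anything from any other chunk, which is what makes the workload embarrassingly parallel.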
21. Going beyond chunking?
• Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
  – Doesn’t always avoid the O(N^2) barrier
  – Bioisostere identification is one case that could be rephrased as a map-reduce problem
• Search algorithms such as GAs, particle swarms can make use of map-reduce
  – GA based docking
  – Feature selection for QSAR models
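A pairwise calculation phrased in map-reduce terms might look like the sketch below. It assumes fingerprints stored as sets of on-bit positions and Tanimoto similarity as the pairwise measure. The map step scores one pair, and the reduce step keeps the nearest neighbour for each molecule. The data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def map_pair(pair):
    """Map step: score one pair and emit the result under both molecule ids."""
    (id_a, fp_a), (id_b, fp_b) = pair
    sim = tanimoto(fp_a, fp_b)
    return [(id_a, (sim, id_b)), (id_b, (sim, id_a))]

def reduce_nearest(key, values):
    """Reduce step: keep the most similar neighbour for one molecule."""
    return key, max(values)

# Hypothetical toy fingerprints.
fingerprints = {
    "m1": frozenset({1, 2, 3, 4}),
    "m2": frozenset({1, 2, 3, 5}),
    "m3": frozenset({7, 8}),
}

pairs = list(combinations(fingerprints.items(), 2))  # N(N-1)/2 pairs
grouped = defaultdict(list)
for pair in pairs:
    for key, value in map_pair(pair):
        grouped[key].append(value)
nearest = dict(reduce_nearest(key, values) for key, values in grouped.items())
print(nearest)
```

The number of pairs still grows quadratically with the number of molecules. Map-reduce spreads that work across machines but does not remove it.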
22. Going beyond chunking?
• Machine learning for massive chemical datasets?
  – MR jobs (descriptor generation) + Mahout (model building) let us handle this in a straightforward manner
• But will QSAR models benefit from more data?
  – Helgee et al suggest global models are preferable
  – But diversity and the structure of the chemical space will affect performance of global models
  – Unsupervised methods may be more relevant
  – Philosophical question?
23. Going beyond chunking?
• Many clustering algorithms are amenable to map-reduce style
  – K-means, Spectral, EM, minhash, …
  – Many are implemented in Mahout
Problems where we generate large numbers of combinations can be amenable to map-reduce
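To illustrate why k-means fits the map-reduce style, here is a minimal sketch of a single iteration in plain Python. It is not Mahout's implementation. The map step assigns each point to its nearest centroid, and the reduce step averages the points collected under each centroid. The points and starting centroids are made up.

```python
from collections import defaultdict

def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans_step(points, centroids):
    # Map: emit (centroid index, point) for every point.
    grouped = defaultdict(list)
    for point in points:
        grouped[nearest_centroid(point, centroids)].append(point)
    # Reduce: the new centroid is the mean of the points assigned to it.
    # Centroids that received no points are left unchanged.
    updated = list(centroids)
    for index, members in grouped.items():
        updated[index] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(points, centroids)
print(new_centroids)
```

A full run repeats this step until the centroids stop moving, so each iteration becomes one map-reduce job.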
24. Networks & integration
• Network models of molecules and targets are common
  – Allows for the incorporation of lots of associated information
  – Diseases, pathways, OTEs
[Yildirim, M.A. et al]
• When linked with clinical data & outcomes, we can generate massive networks
  – Adverse events (FDA AERS)
  – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
25. Networks & integration
• SAR data can be viewed in a network form
  – SALI, SARI based networks
  – Usually requires pairwise calculations of the metric
[Peltason, L et al; http://sali.rguha.net/]
• Current studies have focused on small datasets (< 1000 molecules)
• Hadoop + Giraph could let us apply this to HTS-scale datasets
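For a pair of molecules, SALI is commonly defined as |A_i - A_j| / (1 - sim(i, j)), where A is the activity and sim is a structural similarity. The sketch below assumes Tanimoto similarity on sets of on-bits, computes SALI for every pair and keeps the pairs above a cutoff as network edges. The molecules, activities and cutoff are hypothetical.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def sali(activity_a, activity_b, similarity):
    """SALI for one pair; identical structures (similarity 1) give infinity."""
    if similarity >= 1.0:
        return float("inf")
    return abs(activity_a - activity_b) / (1.0 - similarity)

# Hypothetical molecules: id -> (fingerprint on-bits, activity such as pIC50).
molecules = {
    "a": (frozenset({1, 2, 3, 4}), 5.0),
    "b": (frozenset({1, 2, 3, 5}), 8.0),
    "c": (frozenset({7, 8}), 5.5),
}

def sali_edges(mols, cutoff):
    """Return {(id_i, id_j): SALI} for every pair whose SALI exceeds the cutoff."""
    edges = {}
    for (id_i, (fp_i, act_i)), (id_j, (fp_j, act_j)) in combinations(mols.items(), 2):
        value = sali(act_i, act_j, tanimoto(fp_i, fp_j))
        if value > cutoff:
            edges[(id_i, id_j)] = value
    return edges

edges = sali_edges(molecules, cutoff=5.0)
print(edges)
```

Every pair is scored independently, so the loop over pairs is the part that could be distributed for HTS-scale collections.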
26. Networks & integration
• When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
  – Network based similarity
  – Community detection (aka clustering) [Bauer-Mehren et al]
  – PageRank style ranking (of targets, compounds, …)
  – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
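PageRank-style ranking is a natural fit for iterated map-reduce. The sketch below is a minimal power iteration on a toy directed graph. It assumes every node has at least one outgoing edge and that every neighbour is itself a key of the graph. The node names and the damping factor of 0.85 are illustrative choices.

```python
from collections import defaultdict

def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration; assumes every node has at least one outgoing edge."""
    nodes = list(graph)
    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Map: every node emits an equal share of its rank to each neighbour.
        contributions = defaultdict(float)
        for node, neighbours in graph.items():
            share = ranks[node] / len(neighbours)
            for neighbour in neighbours:
                contributions[neighbour] += share
        # Reduce: sum the shares received by each node and apply damping.
        ranks = {
            node: (1.0 - damping) / len(nodes) + damping * contributions[node]
            for node in nodes
        }
    return ranks

# Hypothetical toy network (nodes could be compounds or targets).
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Each iteration needs only the previous ranks and the edge list. That is why graph frameworks built on Hadoop can run it at scale.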
27. Conclusions
• Cheminformatics applications can be rewritten to take advantage of cloud resources
  – Remotely hosted
  – Embarrassingly parallel / chunked
  – Map/reduce
• Ability to process larger structure collections lets us explore more chemical space
• Integrating chemistry with clinical & pharmacological data can lead to big datasets
28. Conclusions
• Q: But are cheminformatics problems really big enough to justify all of this?
• A: Yes – virtual libraries, integrating chemical structure with other types and scales of data
• Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
• A: Yes – especially when we consider problems with a combinatorial flavor