Report: "MolGAN: An implicit generative model for small molecular graphs"
1. MolGAN: An implicit generative
model for small molecular graphs
N. De Cao and T. Kipf
(Informatics Institute, University of Amsterdam)
ICML Deep Generative Models Workshop (2018)
arXiv:1805.11973
Gpat Journal Club 2018.10.12, Ryohei Suzuki
2. Research Summary
• Automatic generation of drug-like small molecules
• Generative Adversarial Net + Graph Neural Network
+ Reinforcement Learning
• Optimization of biochemical properties (e.g., solubility)
→ first step toward in-silico screening by ML
※ Note: the work is not aimed at designing drugs for a specific purpose
3. About the authors
T. Kipf (Ph.D cand.)
• https://tkipf.github.io/
• Supervisor: Max Welling (ML)
N. De Cao (Ph.D cand.)
• https://nicola-decao.github.io/
• Supervisor: Ivan Titov (NLP)
Max Welling: supervisor of D. Kingma (author of Adam, VAE, etc.)
and himself a pupil of G. 't Hooft (quantum gravity, string theory; 1999 Nobel Prize for electroweak theory)
4. Drug design / drug discovery (DD)
Properties required for drugs
• Useful bioactivity
• Controllable side effects
• Synthesizability
• Retains its effect after metabolism (cf. drug delivery)
Vast time and monetary cost of animal/human experiments
→ in-silico screening using computers
5. Screening by simulation
Case of target drug:
1. Structure determination of
target protein
2. Decision of target site
3. Static affinity prediction
4. Dynamic binding simulation (MD)
Days to weeks of computation time per molecule
Example: Gefitinib binding mutated EGFR (non-small cell lung cancer)
6. Why is drug design difficult?
1. Very large and high-dimensional search space
- over 60,000 permutations for only 10 C/N/O atoms
- only a very limited set of atomic arrangements gives valid structures
2. Discrete optimization of molecular structure
- continuous/gradual optimization is not possible
3. Slight change in structure results in large effects
- COH and COOH are completely different (aldehyde vs. carboxylic acid)
7. Why is drug design difficult?
4. No appropriate data structure for molecular structure
5. Predicting biochemical properties is essentially difficult
- Even QM/MM has limitations; wet experiments are still necessary
Figure: the same molecule shown as a 2D image, as the SMILES string CN1CCC[C@H]1c2cccnc2, and as a 3D structure (3D is important for proteins)
8. Will ML solve the problems?
1. Very large and high-dimensional search space
→ Generative models (e.g. GAN) can
effectively represent complex/high-dimensional data
2. Discrete optimization of molecular structure
→ Goal of this study is just rough screening
(not fine-tuning of specific drugs)
3. Slight change in structure results in large effects
→ Pinpoint affinity prediction can be difficult for ML;
ML is better suited to predicting general properties like solubility
9. Will ML solve the problems?
4. No appropriate data structure for molecular structure
→ Graph representation
+ Graph convolutional neural network
5. Predicting biochemical properties is essentially difficult
→ ML wouldn’t solve this fundamental problem.
Improved simulation methods are also needed
10. Problem definition
Generating molecular structures without a specific target application
• Generated molecules are evaluated by:
1. Druglikeness (QED: Bickerton et al., 2012)
2. Synthesizability (Synthetic Accessibility: Ertl & Schuffenhauer, 2009)
3. Solubility (logP: Comer & Tam, 2001)
• Methods are evaluated by:
1. Validity = valid structures / output structures
2. Novelty = ratio of valid structures not included in training dataset
3. Uniqueness = unique valid molecules / total valid molecules
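The three method-level metrics above can be sketched in a few lines. This is an illustrative computation, assuming `generated` is a list of SMILES strings from a model; `is_valid` is a hypothetical stand-in for a real validity check (in practice RDKit's `Chem.MolFromSmiles` would play this role).

```python
def is_valid(smiles: str) -> bool:
    # Hypothetical validity check for illustration only:
    # treat non-empty strings without '?' as valid.
    return bool(smiles) and "?" not in smiles

def evaluate(generated, training_set):
    """Return (validity, novelty, uniqueness) as defined on this slide."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated) if generated else 0.0
    # novelty: fraction of valid structures absent from the training set
    novel = [s for s in valid if s not in set(training_set)]
    novelty = len(novel) / len(valid) if valid else 0.0
    # uniqueness: distinct valid molecules over all valid molecules
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, novelty, uniqueness

print(evaluate(["CCO", "C?", "CCO", "CCN"], {"CCO"}))
# → (0.75, 0.3333333333333333, 0.6666666666666666)
```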
11. Overview
Generator:
Transforms noise
into a structure
Generated
structure Discriminator:
Judges structure
is valid or not
Reward Network:
Predict the properties
of molecular structures
Goal: obtaining a generator that can output
valid molecular structures with good properties
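The generator's output can be pictured as follows: a noise vector is mapped to a node-feature matrix X (atom-type logits) and an adjacency tensor A (bond-type logits). A minimal numpy sketch, where the sizes (N = 9 atoms, T = 5 atom types, Y = 4 bond types) and the one-hidden-layer MLP are illustrative assumptions, not the paper's exact hyperparameters:

```python
import numpy as np

N, T, Y, Z = 9, 5, 4, 32          # atoms, atom types, bond types, noise dim
rng = np.random.default_rng(0)

W1 = rng.normal(size=(Z, 128))                 # toy MLP weights
W2 = rng.normal(size=(128, N * T + N * N * Y))

def generate(z):
    h = np.tanh(z @ W1)
    out = h @ W2
    X = out[: N * T].reshape(N, T)             # atom-type logits per node
    A = out[N * T :].reshape(N, N, Y)          # bond-type logits per pair
    A = (A + A.transpose(1, 0, 2)) / 2         # symmetrize: bonds are undirected
    return X, A

X, A = generate(rng.normal(size=Z))
print(X.shape, A.shape)  # → (9, 5) (9, 9, 4)
```

The discriminator and reward network would then consume (X, A) through graph convolutions.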
13. Generative models
• classification: judge whether an image is a cat or a dog
• regression: predict f(0.5) from f(0) and f(1)
• generation: generate data distributed like the training data
https://blog.openai.com/generative-models/
14. Generative models
• discriminative model: takes an image and judges its category (dog or cat)
• regression model: predicts f(0.5) when f(0) and f(1) are known
• generative model: generates data following the same distribution as the dataset
https://blog.openai.com/generative-models/
Challenge:
how to calculate the “loss” value needed to train a model
to generate “a distribution like the given dataset”?
15. Generative Adversarial Net (GAN)
“A cat-and-mouse game between a counterfeiter and the police”
• generator: generates data resembling dataset samples as closely as possible
• discriminator: distinguishes real from fake data as precisely as possible
→ the two modules are trained alternately;
the actual data distribution is never computed explicitly
→ danger of mode collapse
https://towardsdatascience.com/generative-adversarial-networks-explained-34472718707a
16. Power of GANs
e.g., BigGAN (Brock et al., 2018)
Figure: generated images; continuous morphing of the input noise
A continuous change of the noise
gives a semantically continuous
change of the image
= the model has learned a useful representation
20. Graph convolution (Kipf&Welling ICLR2017)
Convolution can also be defined for graphs!
http://tkipf.github.io/misc/SlidesCambridge.pdf
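A single layer of the Kipf & Welling (ICLR 2017) graph convolution computes H' = ReLU(D̂^{-1/2}(A + I)D̂^{-1/2} H W). A minimal numpy sketch; the 4-node path graph and the weight shapes are made up for illustration:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: symmetric-normalized propagation + ReLU."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)                       # degrees incl. self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)              # ReLU

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)       # toy path graph
H = np.eye(4)                                   # one-hot node features
W = np.ones((4, 2))                             # toy weight matrix
print(gcn_layer(A, H, W).shape)  # → (4, 2)
```

Each node's new feature is a degree-normalized average over its neighborhood (including itself), followed by a learned linear map.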
21. Reinforcement Learning
A learning framework best known from robot/game control
An action taken in an environment yields
a reward reflecting its goodness
e.g., moving toward a hole results in the death of Mario
The policy is optimized to maximize the reward
e.g., jump when a hole is located in front of Mario
https://en.wikipedia.org/wiki/Reinforcement_learning
22. RL for Molecular Design
Action: generation of a molecule
Environment/Reward: biochemical evaluation of the molecule by external software
Policy: the generative model
Feedback example: druglikeness 0.9, synthesizability 0.1, solubility 0.3, …
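The reward signal above can be sketched as a weighted combination of per-property scores returned by an external evaluator (RDKit-based in practice). The property names and weights here are hypothetical placeholders:

```python
def reward(scores, weights=None):
    """Combine property scores (dict of name -> value in [0, 1]) into a scalar.

    If no weights are given, average the properties uniformly.
    """
    if weights is None:
        weights = {k: 1.0 / len(scores) for k in scores}
    return sum(weights[k] * v for k, v in scores.items())

r = reward({"druglikeness": 0.9, "synthesizability": 0.1, "solubility": 0.3})
print(round(r, 4))  # → 0.4333
```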
23. Design of MolGAN (1) GAN
• The generator directly outputs a graph
in adjacency-matrix form
• The generator is an MLP
• The discriminator judges the validity
of a molecule
• The discriminator is a graph convolutional network
• WGAN-GP* loss
*Please refer to the material of Fukuta-san’s lecture
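For reference, the WGAN-GP critic objective (Gulrajani et al., 2017) has the following form, where P_r is the data distribution, P_g the generator distribution, and x̂ interpolates between real and generated samples:

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\!\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\!\big[D(x)\big]
    + \alpha \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\!\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big]
```

The third term is the gradient penalty that keeps the critic approximately 1-Lipschitz.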
24. Design of MolGAN (2) RL
Deep deterministic policy gradient (DDPG)
• The reward network mimics the external
program that evaluates molecules
• The reward network has the same structure
as the discriminator
• Reward loss = output of the reward network
• The GAN loss and the reward loss are blended
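The blending on this slide can be sketched as a convex combination of the WGAN loss and the (negated) reward-network output, controlled by a mixing weight lambda. The function below is an illustrative sketch, not the paper's exact formulation; lambda = 0.5 is just a placeholder default:

```python
def generator_loss(wgan_loss, reward_value, lam=0.5):
    """Blend GAN and RL objectives for the generator.

    lam = 1.0 -> pure GAN objective; lam = 0.0 -> pure RL objective
    (maximizing the reward = minimizing its negation).
    """
    return lam * wgan_loss + (1.0 - lam) * (-reward_value)

print(generator_loss(wgan_loss=0.8, reward_value=0.6, lam=0.0))  # → -0.6
```

Exp. 1 on the next slide varies exactly this balance between the two losses.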
26. Exp. 1: balance of GAN loss vs. reward loss
Generated molecules are evaluated while the loss balance is varied
Result: the reward loss alone turns out to be sufficient
27. Exp.2: comparison with other methods
• Validity:
Others: 85-95%
MolGAN: 98-100%
• Uniqueness:
Others: 10-70%
MolGAN: 2%
• Time consumption:
1/10 to 1/2 of the other methods
28. Exp.2: comparison with other methods
• druglikeness
• synthesizability
• solubility
Higher score than other methods
for all the properties
29. Discussion
Pros
• Very high (~100%) ratio of valid output structures
• GraphNN + RL is effective for biochemical property optimization
• Light computational cost, fast training
Cons / Future work
• mode collapse: the same structure is generated repeatedly
→ normalization techniques (e.g., spectral normalization) may help
• Fixed atom count