Applications of machine learning and their importance in chemistry, drug discovery, and materials science, and the need for the right dataset of chemical structures and activities. Sound drug candidate selection criteria are important for avoiding failures.
Machine Learning in Chemistry and Drug Candidate Selection
1. @giribio
Girinath G. Pillai, PhD @giribio
Machine Learning in Drug Discovery for
Drug Candidate Selection
2. @giribio
● We are not yet completely ready for AI/ML in drug discovery (it takes time, like the Human Genome Project)
● These slides contain content, pictures, and videos taken from the web, articles, lectures, and tutorials; the respective authors own their copyrights.
Technical Slides : slideshare.net/giribio
Case Studies : youtube.com/giribio
Workflows & Notebooks : github.com/giribio
NOTE
6. @giribio
AGENDA
01 AI & Machine Learning: What? Why? How?
02 ML in Chemistry: Chemical data in DS
03 Drug Discovery: How to avoid failures?
04 What to do Next? Are you ready for AI/ML?
8. @giribio
“Learning denotes changes in a system
that ... enable a system to do the same task
more efficiently the next time”
—Herbert Simon
9. @giribio
● Understand and improve efficiency of human learning
○ Improve methods for teaching and tutoring people (better CAI)
● Discover new things or structures that were previously unknown to humans
○ Examples: data mining, scientific discovery
● Fill in skeletal or incomplete specifications about a domain
○ Large, complex AI systems cannot be completely derived by hand and
require dynamic updating to incorporate new information.
○ Learning new characteristics expands the domain or expertise and lessens
the “brittleness” of the system
● Build software agents that can adapt to their users or to other software agents
● Reproduce an important aspect of intelligent behavior
Why Learn?
10. @giribio
Specifying the task T, the performance P and the experience E
defines the learning problem.
Specifying the learning system requires us to define:
– Exactly what knowledge is to be learnt
– How this knowledge is to be represented
– How this knowledge is to be learnt
Specify Learning System
16. @giribio
Machine learning is a branch of computer science that deals with programming systems so that they automatically learn and improve with experience.
For example, robots are programmed to perform a task based on data they gather from sensors; the system automatically learns programs from data.
Machine Learning
17. @giribio
● Many machine learning systems can be viewed as an iterative process of
○ produce a result,
○ evaluate it against the expected results
○ tweak the system
● Machine learning is also used for systems which discover patterns without prior
expected results.
● May be open or black box
○ Open: changes are clearly visible in the knowledge base (KB) and understandable to humans
○ Black Box: changes are to a system whose internals are not readily visible or
understandable.
Learning Systems
18. @giribio
● Any learning system needs to somehow implement four components:
○ Knowledge base: what is being learned. Representation of a problem space
or domain.
○ Performer: does something with the knowledge base to produce results
○ Critic: evaluates results produced against expected results
○ Learner: takes output from critic and modifies something in KB or
performer.
● May also need a “problem generator” to test performance against.
Learner Architecture
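The four components can be made concrete with a toy example. The sketch below is a minimal illustration and is not taken from the talk: the knowledge base is a single weight `w`, the performer predicts `w * x`, the critic computes the error, and the learner nudges `w`. All function names, the learning rate, and the example data are assumptions chosen for illustration.

```python
# Toy learning system that fits y = w * x from examples.

def performer(knowledge_base, x):
    """Use the knowledge base to produce a result."""
    return knowledge_base["w"] * x

def critic(predicted, expected):
    """Evaluate the produced result against the expected result."""
    return predicted - expected

def learner(knowledge_base, x, error, learning_rate=0.05):
    """Take the critic's output and modify the knowledge base."""
    knowledge_base["w"] -= learning_rate * error * x

def train(examples, epochs=200):
    knowledge_base = {"w": 0.0}  # what is being learned
    for _ in range(epochs):
        for x, expected in examples:
            result = performer(knowledge_base, x)  # produce a result
            error = critic(result, expected)       # evaluate it
            learner(knowledge_base, x, error)      # tweak the system
    return knowledge_base

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hidden rule: y = 2x
kb = train(examples)
print(round(kb["w"], 3))
```

The same loop also mirrors the produce, evaluate, tweak cycle described for learning systems in general; here the knowledge base is "open" because the learned weight can be read directly.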
19. @giribio
● Rote learning
○ Hand-encoded mapping from inputs to stored representation. “Learning by
memorization.”
● Interactive learning
○ Human/system interaction producing explicit mapping.
● Induction
○ Using specific examples to reach general conclusions.
● Analogy
○ Determining correspondence between two different representations. Case-based
reasoning
● Clustering
○ Unsupervised identification of natural groups in data
● Discovery
○ Unsupervised, specific goal not given
● Genetic algorithms
○ “Evolutionary” search techniques, based on an analogy to “survival of the fittest”
Major Paradigms of ML
21. @giribio
a) Supervised Learning
b) Unsupervised Learning
c) Semi-supervised Learning
d) Reinforcement Learning
e) Transduction
f) Learning to Learn
Types of Techniques in ML
a) Decision Trees
b) Neural Networks (back propagation)
c) Probabilistic networks
d) Nearest Neighbor
e) Support vector machines
5 Popular Algorithms in ML
22. @giribio
a) Model building
b) Model testing
c) Applying the model
Stages of ML
a) Artificial Intelligence
b) Rule based inference
What is not ML?
25. @giribio
ML Workflow in Chemistry
Rodrigues Jr et al. A survey on Big Data and Machine Learning for Chemistry
26. @giribio
Chemistry Data used in ML
https://chemintelligence.com/
Project-oriented datasets
Fundamentals of working with active learning algorithms
Framework for working with an in-house database
27. @giribio
Chemistry Data used in ML
Public databases
Using NN to predict reaction conditions
Data from simulations
https://chemintelligence.com/
28. @giribio
doi.org/10.3389/fchem.2019.00809
DL algorithms for solving different chemical challenges and the respective tasks
29. @giribio
doi.org/10.3389/fchem.2019.00809
Schematic representation of the main components of atomistic ML
30. @giribio
Meta-analysis of DNN-based model performance relative to state-of-the-art non-DNN models in various computational chemistry applications
Deep Learning for Computational Chemistry. Garrett B. Goh, Nathan O. Hodas, Abhinav Vishnu
31. @giribio
Descriptors for Chemistry
Issues for ML:
● arbitrary size
● arbitrary order
Ideal features:
● general
● compact
● unique
● invariant *
● smooth
● fast
ML methods need a computer-friendly way to input the atomistic system, for example as a bit string:
010110101010001011100100010001111110
Figure labels: easy for us; easy for CPU
* invariants are determined by the physics of the quantity to predict from the descriptor!
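To illustrate what "invariant" means in practice, here is a minimal sketch of a descriptor built from sorted interatomic distances. This is an illustrative choice, not a descriptor named in the talk: the result does not change when atoms are reordered or the molecule is translated, but its length still depends on the number of atoms, so it does not address the arbitrary-size issue. The function name and the example coordinates are assumptions.

```python
import math
from itertools import combinations

def sorted_distance_descriptor(coords, ndigits=6):
    """Return the sorted list of all pairwise interatomic distances.

    Distances are unchanged by translation and rotation, and sorting
    removes the dependence on atom order.
    """
    distances = [math.dist(a, b) for a, b in combinations(coords, 2)]
    return sorted(round(d, ndigits) for d in distances)

# Approximate water-like geometry (angstroms, illustrative values only).
water = [(0.000, 0.000, 0.000), (0.957, 0.000, 0.000), (-0.240, 0.927, 0.000)]

# Same atoms listed in reverse order and shifted by a constant vector.
shift = (1.5, -2.0, 0.5)
reordered_shifted = [
    tuple(c + s for c, s in zip(atom, shift)) for atom in reversed(water)
]

original = sorted_distance_descriptor(water)
transformed = sorted_distance_descriptor(reordered_shifted)
print(original)
print(original == transformed)
```

Rounding to a fixed number of digits is only there so that floating-point noise from the shift does not break the equality check.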
32. @giribio
Descriptors for Chemistry
010110101010001011100100010001111110
ML methods need a computer-friendly way to input the atomistic system:
Global Descriptor
110100011110000110010111111110
110100011110001011100001111110
010110101010001011100001111110
Local/Atomic Descriptor
36. @giribio
Mol. Docking - then and now!
1894
The Key-Lock Hypothesis
“To exercise a chemical action a ligand interacting with a protein must fit into the
binding cavity like a key into a key hole”.
(Emil Fischer)
2020
This is only half of the truth…
37. @giribio
Why Molecular Docking?
Determine the optimal binding structure of a
ligand (a drug candidate, a small molecule)
to a receptor (a drug target, a protein or
DNA) and quantify the strength of the
ligand-receptor interaction.
● Where will the ligand bind?
● How will it bind?
● How strong is the binding?
● What is the role of solvation/desolvation?
● What makes one ligand bind to the receptor better than the others?
● Translation and rotation of ligands
● Torsions
40. @giribio
Where is docking score?
Green = frequently observed in CSD small molecule crystals
Yellow = unusual, however several times observed in CSD
Red = very rarely observed in CSD
Green = good for affinity
Red = bad for affinity
The larger the size, the stronger the contribution.
47. @giribio
➔ Identify chemistries with an optimal balance of properties
➔ Quickly identify situations when such a balance is not possible
➔ Fail fast, fail cheap
➔ Only when confident
➔ Avoid missed opportunities
The Objectives of Drug Discovery
Multi-parameter optimisation
49. @giribio
Apply data to Guide Decisions
[Figure labels: In silico, In vitro, In vivo; Importance, Uncertainty; Quality, Diversity; ‘Manual’]
51. @giribio
● A rule is a set of property criteria that, in combination, identify ‘good’ compounds, for example Lipinski’s Rule of Five.
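Lipinski's Rule of Five is commonly stated as four criteria for oral drug-likeness: molecular weight no more than 500 Da, logP no more than 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. The sketch below is a minimal illustration that assumes these property values have already been computed elsewhere; the function names, dictionary keys, and example values are hypothetical. Tolerating one violation is a common convention and is exposed here as a parameter.

```python
# Thresholds as commonly stated for Lipinski's Rule of Five.
RULE_OF_FIVE = {
    "mol_weight": 500.0,     # Da
    "logp": 5.0,
    "h_bond_donors": 5,
    "h_bond_acceptors": 10,
}

def rule_of_five_violations(properties):
    """Return the names of the criteria that a compound violates."""
    return [name for name, limit in RULE_OF_FIVE.items()
            if properties[name] > limit]

def passes_rule_of_five(properties, max_violations=1):
    """Apply the rule as a combined filter over all criteria."""
    return len(rule_of_five_violations(properties)) <= max_violations

# Hypothetical, precomputed property values for two compounds.
compound_a = {"mol_weight": 342.4, "logp": 2.1,
              "h_bond_donors": 2, "h_bond_acceptors": 5}
compound_b = {"mol_weight": 712.9, "logp": 6.3,
              "h_bond_donors": 4, "h_bond_acceptors": 12}

print(passes_rule_of_five(compound_a))      # no criteria exceeded
print(rule_of_five_violations(compound_b))  # three criteria exceeded
```

Returning the list of violated criteria, rather than only a pass/fail flag, makes it easier to see which property is responsible when a compound is rejected.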
What is a Rule?
52. @giribio
● 74% of marketed CNS drugs achieved CNS MPO > 4 vs. 60% of Pfizer
candidates
● Correlations observed between high CNS MPO score and good in vitro ADME properties, e.g. MDCK Papp, HLM stability, P-gp transport
Avoid Missed Opportunities
CNS MPO Score*
CNS MPO = sum of desirabilities for each parameter
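The "sum of desirabilities" idea can be sketched as follows: each parameter is mapped to a desirability between 0 and 1, and the individual values are added. The parameter names and threshold values below are illustrative placeholders, not the published CNS MPO definition, and only a simple monotonically decreasing desirability shape is shown.

```python
def decreasing_desirability(value, ideal_max, worst_min):
    """Desirability of 1 up to ideal_max, falling linearly to 0 at worst_min."""
    if value <= ideal_max:
        return 1.0
    if value >= worst_min:
        return 0.0
    return (worst_min - value) / (worst_min - ideal_max)

# Placeholder thresholds: (ideal_max, worst_min) for each parameter.
THRESHOLDS = {
    "logp": (3.0, 5.0),
    "mol_weight": (360.0, 500.0),
    "h_bond_donors": (1.0, 4.0),
}

def mpo_score(properties):
    """Sum the per-parameter desirabilities into a single score."""
    return sum(decreasing_desirability(properties[name], *limits)
               for name, limits in THRESHOLDS.items())

# Hypothetical compound: logP is halfway between ideal and worst.
compound = {"logp": 4.0, "mol_weight": 360.0, "h_bond_donors": 1}
print(mpo_score(compound))  # 0.5 + 1.0 + 1.0 = 2.5
```

With three parameters the maximum score in this sketch is 3. A soft desirability of this kind penalises a compound gradually as a property drifts from its ideal range, instead of rejecting it at a hard cut-off.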
54. @giribio
QSAR
Molecular Structure → Molecular Descriptors (x)
Experimental Assay Activity/Property (Chemical, Physical, Biomedical) → Response Variable (y)
Statistical/Machine Learning Modelling: y = f(x), with f(x) unknown
Validation → Prediction (Experimental Data vs. Predicted Data)
Workflow: Molecular Structure → Descriptor Calc. → Classification → Feature Selection → Model Generation → Validation
55. @giribio
• Split data set
• Calculate descriptors (2D SMARTS, logP, TPSA, MW, charge etc.)
• Multiple modelling techniques
• Select the best model by performance on the validation set
• Test with an independent set
Model Generation
Data Set: train | validate | test
Build models (PLS, RBF, GPs, RF) → Evaluate models → Best model → Test the best model
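The train / validate / test procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data are synthetic, there is a single descriptor, and simple k-nearest-neighbour regressors with different k stand in for the PLS, RBF, GP, and RF models named on the slide. All function and variable names are hypothetical.

```python
import math
import random

def knn_predict(train, x, k):
    """Predict y as the mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, data, k):
    """Root-mean-square error of a k-NN model on a data set."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in data]
    return math.sqrt(sum(errors) / len(errors))

# Synthetic data: one descriptor x and a noisy response y (illustrative only).
rng = random.Random(0)
data = []
for _ in range(120):
    x = rng.uniform(0.0, 10.0)
    data.append((x, math.sin(x) + rng.gauss(0.0, 0.2)))

# Split the data set into train, validation, and test sets.
rng.shuffle(data)
train, validate, test = data[:80], data[80:100], data[100:]

# Build several candidate models and evaluate each on the validation set.
candidates = [1, 3, 5, 15]
validation_rmse = {k: rmse(train, validate, k) for k in candidates}

# Select the best model by validation performance, then test it once.
best_k = min(validation_rmse, key=validation_rmse.get)
test_rmse = rmse(train, test, best_k)

print(validation_rmse)
print(best_k, test_rmse)
```

The test set is touched only once, after model selection, so the reported test error is not biased by the choice among candidates.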
56. @giribio
• The diversity of the training set defines the domain of applicability of the model
• The position of a new compound relative to chemical space impacts the confidence in the prediction
Domain of Applicability
[Plot axes: Descriptor 1 vs. Descriptor 2]
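One simple way to put the domain-of-applicability idea into practice is a distance check in descriptor space. The sketch below is an assumption-laden illustration, not the method used in the talk: a new compound is treated as inside the domain when its Euclidean distance to the nearest training compound is below a chosen threshold. The descriptor values, the threshold, and the function names are hypothetical.

```python
import math

def nearest_training_distance(training_set, compound):
    """Distance from a compound to its nearest neighbour in the training set."""
    return min(math.dist(compound, reference) for reference in training_set)

def in_domain(training_set, compound, threshold):
    """Flag a compound as inside the domain if it lies close to training data."""
    return nearest_training_distance(training_set, compound) <= threshold

# Hypothetical training compounds as (descriptor 1, descriptor 2) pairs.
training_set = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (2.5, 2.4), (3.0, 1.8)]
threshold = 1.0  # assumed cut-off; in practice this would need tuning

similar_compound = (2.2, 1.5)   # sits among the training compounds
distant_compound = (8.0, 7.5)   # far from anything the model has seen

print(in_domain(training_set, similar_compound, threshold))
print(in_domain(training_set, distant_compound, threshold))
```

In practice descriptors are usually scaled before distances are computed, so that no single descriptor dominates the measure.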
58. @giribio
Selected Extensions/Nodes
Some are under the Partner and Community categories
Some require additionally installed binaries, and some extensions come with binaries
Commercial extensions require a license from their providers
Selected Drug Discovery KNIME Nodes
62. @giribio
Energy = E: E is an approximate activation energy, in kJ/mol, for the reaction of the catalytic site of a CYP with the molecule at this atom.
Accessibility = A: The accessibility is a relative measure of the topological distance of an atom from the center of the molecule, and is always a number between 0.5 (atom at the center) and 1 (atom at the end).
Solvent Accessible Surface Area = SASA: The SASA describes the local accessibility of an atom and is computed using the 2DSASA algorithm, which predicts this value from the molecular topology.
Site of Metabolism - 3A4, 2D6, and 2C9
65. @giribio
Gerontology
Study of the social, cultural, psychological, cognitive, and
biological aspects of ageing.
The word was coined by Ilya Ilyich Mechnikov in 1903.
Geriatrics is a medical specialty focused on care and
treatment of older persons.
71. @giribio
38 years for radio to reach 35 million people
13 years for TV
9 years for iPhone
3 years for Internet
1 year for Facebook
9 months for Twitter
35 days for Angry Birds
19 days for Pokemon
Where are we/am I?
What happens on the Internet per second:
3.88 million Google searches
4.3 million YouTube videos
155 emails
1.2 MB per person
65% of students will not have ready jobs when they graduate, as per the current skill set!
72. @giribio
● Graduates with the right attitude and aptitude - CBI/Pearson Survey 2015
● Communication
● Individuality – Behavioral Traits
● Critical Thinking
● Collaboration
● Etiquette & Manners
● Accountability & Responsibility
● Work Life Balance & Priorities
● Career Building
● SKILL SET
Future Tech Ready
75. @giribio
CONCLUSION
Initially weight selection as 25% score/quality and 75% diversity; as the size of the lead set shrinks, shift to 75% score/quality and 25% diversity.
Consider enrichment factors
Good synergy between human expertise and computational tools
Avoid Missed Opportunities
Understand significance of parameters/properties
Evaluate and decide the tool/approach
Check reliability of data used
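For the enrichment factor mentioned in the conclusions, a commonly used definition is the hit rate among the top-ranked fraction of compounds divided by the hit rate in the whole set. The sketch below is a minimal illustration under that assumption; the function name and the example ranking are hypothetical.

```python
def enrichment_factor(ranked_labels, fraction):
    """Enrichment factor for the top fraction of a ranked list.

    ranked_labels: 1 for an active and 0 for an inactive, best-scored first.
    """
    total = len(ranked_labels)
    total_actives = sum(ranked_labels)
    if total == 0 or total_actives == 0:
        raise ValueError("need a non-empty list with at least one active")
    n_selected = max(1, round(total * fraction))
    actives_selected = sum(ranked_labels[:n_selected])
    # (actives_selected / n_selected) / (total_actives / total),
    # rearranged so that only one division is performed.
    return (actives_selected * total) / (n_selected * total_actives)

# Hypothetical screening result: 20 compounds ranked by score, 4 actives.
ranked = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(ranked, 0.2))  # top 4 hold 3 of the 4 actives: 3.75
```

A value of 1 means the ranking does no better than random selection; larger values mean the actives are concentrated near the top of the list.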
76. @giribio
CREDITS: This presentation template was created by Slidesgo,
including icons by Flaticon, and infographics & images by Freepik
THANKS
Do you have any questions?
@giribio