Basics of QSAR Modeling by Prof Rahul D. Jawarkar.pptx
1. Basics of QSAR Modeling
Prof. Rahul D. Jawarkar,
Department of Pharmaceutical Chemistry,
Dr Rajendra Gode Institute of Pharmacy,
University Mardi Road, Amravati, Maharashtra, India (444602)
2. Drug: A drug is a single active chemical moiety which is found
in medicine and used for diagnosis, prevention, treatment and
cure of a disease.
Chemotherapy: It is the treatment of infection or malignancy
with the specific chemical which possesses selective adverse
effects on the infecting organism, malignant cell or host cell.
Natural source (80%)
3. Drug Discovery- Finding therapeutic actions of the
molecule. e.g. Penicillin, anti-platelet action of aspirin.
Drug Designing- Modifying the molecule for high
activity and favorable Absorption-Distribution-Metabolism-
Excretion-Toxicity (ADMET). e.g. Tamiflu, Relenza.
Drug Delivery- Developing methods for drug
administration. e.g. Gelatin, starch, etc.
4. Conventional Procedure for drug
Cost for designing a new drug is
about $300 million
Needs 10-15 years to launch a drug
Resources like time, chemicals, etc.
Slower, frustrating, lower success,
6. QSAR is not theoretical !!!!
• Collection of experimental bioactivity like IC50, EC50,
LD50, Kd, Ki, etc.
• Use of chemical structures of reported molecules only
• Comparison of bioactivity of one molecule with another
• Finding reasons for high and low activity
• Validating analysis using Statistical techniques
• OECD guidelines
In short, the experimental part has already been accomplished;
QSAR analysis is then performed on the experimental data to
identify the reasons for the bioactivity of a molecule.
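Bioactivity values such as IC50 are usually put on a logarithmic scale before modeling. A minimal sketch, assuming IC50 is reported in nM and using the standard definition pIC50 = -log10(IC50 in mol/L); the function name and the compound labels are hypothetical:

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)

# Illustrative values only; the compound labels are placeholders.
for label, ic50_nm in [("cpd_A", 8000.0), ("cpd_B", 80.0), ("cpd_C", 0.06)]:
    print(f"{label}: IC50 = {ic50_nm} nM, pIC50 = {pic50_from_nm(ic50_nm):.2f}")
```

A higher pIC50 means a more potent compound, which makes comparison between molecules more convenient than raw concentrations.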
7. Quantitative Structure-Activity Relationship (QSAR)
“Similar compounds behave similarly
Activity or Property varies with Structure.”
Activity = Lipophilicity + Steric + Electronic + Unknown
A QSAR is a multivariate, mathematical relationship
between a set of 2D- and 3D- physicochemical
properties (molecular descriptors) and a biological
Do you agree?
8. Important steps involved in
• Experimental data collection
• Structure drawing and appropriate 3D-
• Molecular descriptor calculation and pruning
• Model building
• Model validation
• Model interpretation
9. Experimental data collection:
1. ChEMBL Database - EMBL-EBI: ChEMBL is a manually
curated database of bioactive molecules with drug-like
properties. It brings together chemical, bioactivity and
genomic data.
2. Binding Database: BindingDB is a public, web-accessible
database of measured binding affinities, focusing chiefly on
the interactions of proteins considered to be drug targets with
small, drug-like molecules.
3. Enzyme Database – BRENDA: A comprehensive enzyme
information system. https://www.brenda-enzymes.org/
12. Step-2: Calculation of Descriptors
Charge on atom
Hydrogen bond donor/acceptor
Note: At present, more than 45,000 descriptors can be calculated!!!
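To make the idea of a descriptor concrete, here is a toy sketch that counts a few atom-based descriptors directly from a SMILES string. It is an illustration only: the tokenizer is a simplification that ignores much of the SMILES grammar, the function names are hypothetical, and real work would use a dedicated descriptor tool. The N + O count is the acceptor count in the sense of Lipinski's rule of five.

```python
import re

# Bracket atoms first, then two-letter halogens, then one-letter atoms.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[bcnops])")

def atom_symbols(smiles):
    """Return element symbols found in a SMILES string (simplified parser)."""
    symbols = []
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            if match:
                symbols.append(match.group(1).capitalize())
        else:
            # Aromatic atoms are lowercase in SMILES; normalize to element symbol.
            symbols.append(token.capitalize())
    return symbols

def toy_descriptors(smiles):
    """Compute three simple count descriptors from the atom list."""
    heavy = [s for s in atom_symbols(smiles) if s != "H"]
    return {
        "heavy_atoms": len(heavy),
        "n_plus_o": sum(s in ("N", "O") for s in heavy),
        "halogens": sum(s in ("F", "Cl", "Br", "I") for s in heavy),
    }

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(toy_descriptors(aspirin))
```

Each molecule ends up as a row of numbers like this, and the model is built on those rows rather than on the drawn structure.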
13. Step-3: Descriptor selection & Model building
Not all descriptors contain useful information.
Many descriptors provide the same information.
Use of too many descriptors results in “Over Fitting”.
Use of improper descriptors results in poor and misleading models.
Use of many descriptors can lead to chance correlation.
Use SR, GA, MA, etc. to select the best descriptors.
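Before any search method is applied, a simple pruning pass can remove descriptors that carry no information (constant columns) or the same information (highly inter-correlated columns). A minimal sketch, assuming a 0.95 correlation cutoff; the cutoff, function names and descriptor values are illustrative assumptions:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def prune_descriptors(table, max_abs_r=0.95):
    """Drop constant columns and columns correlated with an already kept one.

    table maps descriptor name -> list of values (one per molecule).
    """
    kept = []
    for name, values in table.items():
        if max(values) == min(values):
            continue  # constant: no information
        if any(abs(pearson(values, table[k])) > max_abs_r for k in kept):
            continue  # redundant: same information as a kept descriptor
        kept.append(name)
    return kept

# Hypothetical descriptor table for five molecules.
table = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "logP_rescaled": [3.0, 5.0, 7.0, 9.0, 11.0],  # linear copy of logP
    "n_rings": [1, 1, 1, 1, 1],                   # constant
    "hbd": [2, 0, 1, 3, 0],
}
print(prune_descriptors(table))
```

The order of the columns decides which member of a correlated pair survives, so in practice the more interpretable descriptor is usually placed first.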
14. Current Methods for Model Building
A) Multiple Linear Regression (MLR)
Best Multiple Linear Regression (BMLR),
Heuristic Method (HM),
Genetic Algorithm-Multiple Linear Regression (GA-MLR),
Factor Analysis MLR and so on.
B) Partial Least Squares (PLS)
Genetic Partial Least Squares (G/PLS),
Factor Analysis Partial Least Squares (FA-PLS),
Orthogonal Signal Correction Partial Least Squares (OSC-
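As a sketch of the simplest of these methods, plain Multiple Linear Regression can be written with the normal equations in pure Python. This is not any of the named variants (BMLR, GA-MLR, etc.); the descriptor names, the synthetic data and the function names are assumptions for illustration:

```python
def solve_linear_system(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("descriptors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_mlr(X, y):
    """Least-squares fit; returns [intercept, b1, b2, ...]."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict(coefs, x):
    """Predicted activity for one molecule's descriptor values."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))

# Synthetic data: columns are a lipophilic and an electronic descriptor.
X = [(1.0, 0.2), (2.0, 0.1), (3.0, 0.5), (4.0, 0.3), (5.0, 0.9), (2.5, 0.7)]
y = [4.0 + 0.5 * logp - 1.0 * sigma for logp, sigma in X]  # known relationship
coefs = fit_mlr(X, y)
print([round(c, 3) for c in coefs])
```

Because the activities here are generated from a known equation, the fit should recover its coefficients; with real, noisy bioactivity data the fitted coefficients are only estimates and need validation.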
15. Step-4: Validation of model
a) Leave-One-Out Cross-Validation:
b) Leave-Many-Out Cross-Validation:
c) External validation
d) Use PCA, Simulated Annealing, Automated Relevance
Determination (ARD), etc.
e) Use Bayesian Statistics or Gaussian Processes
since they do not require Cross-Validation!!!
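Leave-One-Out cross-validation can be sketched in a few lines for a one-descriptor linear model. Each molecule is left out in turn, the model is refitted on the rest, and the left-out activity is predicted; Q² is then 1 - PRESS / SS_total. The data and function names are hypothetical, and SS_total is taken here about the mean of all observed activities, which is one common convention:

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope

def loo_q2(x, y):
    """Leave-One-Out cross-validated Q2 for a one-descriptor linear model."""
    press = 0.0  # predictive residual sum of squares
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        press += (y[i] - (intercept + slope * x[i])) ** 2
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / ss_total

# Hypothetical descriptor values and activities for six molecules.
logp = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
pic50 = [5.1, 5.4, 6.1, 6.3, 7.0, 7.2]
print(f"Q2(LOO) = {loo_q2(logp, pic50):.3f}")
```

Leave-Many-Out follows the same pattern with several molecules removed per round, and external validation uses molecules that were never part of model building.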
16. Modern trends in QSAR modeling
• Currently, there is much talk about the use of artificial
intelligence (AI) in chemistry.
• AI is the superset of tasks that demonstrate characteristics
of human intelligence, while ML is a subset of AI which
accesses data, analyses trends and generates intelligent,
• Many people use the term AI in the same context as ML in
many data-rich disciplines, ranging from health care to
• In this regard one can say that AI has been used in
chemistry since the 1960s under the name QSAR.
18. troponin I-interacting
IC50 = 8000 nM*
IC50 = 7800 nM
IC50 = 80 nM*
*Lawhorn, B. G. et al., Identification of purines and 7-deazapurines as potent and
selective type I inhibitors of troponin I-interacting kinase (TNNI3K). J. Med. Chem.
2015, 58, 7431−7448.
19. spleen tyrosine
IC50 = 8.8 nM*
IC50 = 10 nM
IC50 = 0.060 nM*
*Ellis, J. M. et al., Overcoming mutagenicity and ion channel activity: optimization of
selective spleen tyrosine kinase inhibitors. J. Med. Chem. 2015, 58, 1929−1939.
21. QSAR based virtual screening
• Molecular docking can rapidly identify large subsets of
molecules with desired activity from large screening
collections of compounds (10^5–10^6 compounds) using
• However, the hit rate ranges between 0.01% and 0.1% !!!
• Most of the screened compounds are routinely reported as
• On the other hand, typical hit rates for QSAR-based virtual
screening range between 1% and 40% !!!!!
Reference: Neves BJ, Braga RC, Melo-Filho CC, Moreira-Filho JT, Muratov EN and
Andrade CH (2018) QSAR-Based Virtual Screening: Advances and Applications in
Drug Discovery. Frontiers in Pharmacology 9. doi: 10.3389/fphar.2018.01275
22. QSAR based virtual screening:
• In Zhang et al. (2013), a data set of 3,133 compounds reported
as active or inactive against P. falciparum was used to
develop QSAR models.
• QSAR models were applied for VS of the ChemBridge
• After VS, 176 potential antimalarial compounds were
identified and submitted to experimental validation along
with 42 putative inactive compounds.
• Twenty-five compounds presented antimalarial activity in P.
• All 42 compounds predicted as inactive by the models were
confirmed experimentally to be inactive.
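The hit rate of this study can be worked out from the numbers on the slide. A small sketch, assuming the 176 compounds submitted to testing are the denominator and the 25 active compounds are the hits; the function name is hypothetical:

```python
def hit_rate_percent(hits, tested):
    """Percentage of experimentally tested compounds that were active."""
    if tested <= 0:
        raise ValueError("number of tested compounds must be positive")
    return 100.0 * hits / tested

predicted_active_tested = 176  # compounds selected by the QSAR models
confirmed_active = 25          # compounds that showed antimalarial activity

rate = hit_rate_percent(confirmed_active, predicted_active_tested)
print(f"Hit rate: {rate:.1f}%")
```

The result is about 14%, which falls inside the 1% to 40% range quoted for QSAR-based virtual screening.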
23. QSAR based virtual screening:
• In Alves et al. (2020), a data set of 113 compounds (40 actives
and 73 inactives) was used to develop QSAR models for the
SARS-CoV Mpro.
• QSAR models were applied for VS of the DrugBank database
of FDA approved drugs.
• After VS, 42 potential drugs were identified but only 11 were
tested for experimental validation.
• Three compounds presented strong activity for the SARS-
1. Zhang, L. et al. (2013) Discovery of novel antimalarial compounds enabled by
QSAR-based virtual screening, J. Chem. Inf. Model. 53, 475–492. DOI:
2. Alves et al. (2020) QSAR Modeling of SARS-CoV Mpro Inhibitors Identifies
Sufugolix, Cenicriviroc, Proglumetacin, and Other Drugs as Candidates for
Repurposing against SARS-CoV-2, Mol. Inform. (Wiley). DOI: 10.1002/minf.202000113
24. Disadvantages of QSAR
• False correlations may arise because biological data
are subject to considerable experimental error (noisy data).
• If the training dataset is not large enough, the data collected
may not reflect the complete property space.
Consequently, many QSAR results cannot be used to
confidently predict the most likely compounds of best
• Features may not be reliable either. This is particularly
serious for 3D features because 3D structures of ligands
bound to the receptor may not be available. A common
approach is to use an energy-minimized structure, but that
may not represent reality well.
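The point about incomplete property space can be checked before a prediction is trusted. A minimal sketch, assuming a simple range test (one basic way of defining what is often called the applicability domain): a query molecule is flagged when any of its descriptors lies outside the range seen in the training set. All names and values are illustrative:

```python
def descriptor_ranges(training_rows):
    """Return (min, max) for each descriptor column of the training set."""
    columns = list(zip(*training_rows))
    return [(min(col), max(col)) for col in columns]

def inside_ranges(query, ranges):
    """True if every descriptor of the query lies within the training range."""
    return all(low <= value <= high for value, (low, high) in zip(query, ranges))

# Hypothetical training set: (logP, electronic descriptor) per molecule.
training = [(1.0, 0.2), (3.0, 0.8), (2.0, 0.5)]
ranges = descriptor_ranges(training)

for query in [(2.5, 0.3), (4.0, 0.3)]:
    status = "inside" if inside_ranges(query, ranges) else "outside"
    print(query, "is", status, "the training descriptor ranges")
```

A prediction for a molecule outside these ranges is an extrapolation and should be treated with caution.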
25. Free Software for QSAR
1. ACD ChemSketch (www.acdlabs.com)
5. Avogadro software (https://avogadro.cc/)
6. OpenBabel (http://openbabel.org/wiki/Main_Page)
7. MMTK (http://dirac.cnrs-orleans.fr/MMTK.html)
8. PyDescriptor (available from Dr. V. H. Masand)
9. PaDEL (http://www.yapcwsoft.com/dd/padeldescriptor/)
12. ‘R’ packages like GA-MLR, caret, etc.
26. 1. ChEMBL Database - EMBL-EBI: ChEMBL is a
manually curated database of bioactive molecules with
drug-like properties. It brings together chemical,
bioactivity and genomic data
2. Enzyme Database – BRENDA: A comprehensive
enzyme information system.