Chemoinformatics: a computational tool for medicinal chemistry

Matteo Floris

CRS4 - 2012 seminar series
3 May 2012

About me

Matteo Floris

Degree in Pharmaceutical Chemistry and Technology (C.T.F.), University of Padua
Master's in Bioinformatics, University of Cologne
PhD in Biochemistry, Sapienza University of Rome

Chemoinformatics: development of methods for ligand-based
drug design

Bioinformatics at CRS4 for 6 years (computational
genomics)

matteo.floris@gmail.com

Chemoinformatics or cheminformatics?

Chemoinformatics is a vast discipline, standing on the
interface of chemistry, biology and computer science

D. Agrafiotis, J&J

Background

Drug design

Rational drug design, or rational design

The search for new (potential!) drugs based on
knowledge of a biological target

Drug design often relies on computational modeling
techniques (computer-aided drug design, CADD).

If the three-dimensional structure of the molecular target is known,
this is called structure-based drug design.

Background

Ligand-based CADD
Structure-based CADD

Background

Ligand-based CADD

Based on knowledge of other molecules that are able to bind
to the biological target of interest.

These molecules can be used to build a pharmacophore
hypothesis that defines the minimum features required
for the interaction.

Alternatively, quantitative structure-activity relationship
(QSAR) techniques can be used to look for a correlation
between the physicochemical properties of a molecule
and its biological activity.

Structure-based CADD

Based on knowledge of the structure of the biological target
of interest, obtained by X-ray crystallography or
NMR spectroscopy.

If the structure of the target is not available,
three-dimensional models can be built by homology instead.

Computational tools make it possible to estimate the affinity
and selectivity of one or more compounds for the target.

A virtual space odyssey

One of the main goals in drug discovery is to identify and
develop new ligands with high binding affinity towards
a protein target. Today, there is increased reliance on
computer-based tools […]. These help select molecules
from the vast expanse of chemical space and aid
optimization of compounds of interest into drugs.

Cath O'Driscoll, Nature, 2004

The chemical universe

Chemical space is the space spanned by all possible (i.e.
energetically stable) molecules and chemical compounds –
that is, all stoichiometric combinations of electrons and
atomic nuclei, in all possible topology isomers.

Chemical reactions allow us to move in chemical space.

The mapping between chemical space and molecular
properties is often not unique, meaning that there can be
multiple molecules which exhibit the same properties

The chemical universe

CAS REGISTRY

is the most authoritative collection of disclosed chemical substance
information, containing more than 65 million organic and inorganic
substances and 63 million sequences.

67,370,815
Commercially available chemicals in CAS

PubChem

PCSubstance contains about 85 million records.

PCCompound contains nearly 30 million unique structures.

PCBioAssay contains more than 585,000 BioAssays. Each BioAssay
contains a varying number of data points.

The chemical universe

GDB-13 enumerates small organic molecules up to 13
atoms of C, N, O, S and Cl following simple chemical
stability and synthetic feasibility rules.

With 977,468,314 structures, GDB-13 is the largest publicly
available small organic molecule database to date.

The chemical universe

150 possible substituents
from 1 to 14 substituents
10^29 theoretical derivatives

The chemical universe

Navigating chemical space for biology and medicine

Christopher Lipinski & Andrew Hopkins
Nature 432, 855–861 (16 December 2004) doi:10.1038/nature03193

Despite over a century of applying organic synthesis to the search for drugs, we are still far
from even a cursory examination of the vast number of possible small molecules that could
be created. Indeed, a thorough examination of all ‘chemical space’ is practically
impossible. Given this, what are the best strategies for identifying small molecules
that modulate biological targets?

Salvarsan (also known as arsphenamine or 606) is a
drug used in the treatment of syphilis and
African trypanosomiasis. It was the first known
chemotherapeutic agent.

Trust, but verify

Many scientists TRUST chemistry and biology databases that are so
often reused, reanalyzed and integrated with new cheminformatics or
bioinformatics tools.

The authors of such articles do not appear to analyze for problems
caused by poor DATA QUALITY or hypotheses that are incorrect due
to poor underlying data.

Antony Williams, ChemSpider

Representing molecules

MOLECULES: real objects
MOLECULE REPRESENTATIONS: models
MOLECULAR DESCRIPTORS: information


Chemical table file

benzene

6 6 0 0 0 0 0 0 0 0 1 V2000
1.9050 -0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9050 -2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7531 -0.1282 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7531 -2.7882 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.3987 -0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.3987 -2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0
3 1 2 0 0 0 0
4 2 2 0 0 0 0
5 3 1 0 0 0 0
6 4 1 0 0 0 0
6 5 2 0 0 0 0
M END
$$$$
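
The connection table above can also be read programmatically. What follows is only a minimal sketch, assuming whitespace-separated fields exactly as printed on the slide. A real V2000 molfile uses fixed-width columns and has three header lines before the counts line, so `parse_ctab` is a hypothetical helper for illustration, not a full parser.

```python
# Benzene connection table as printed on the slide (header lines omitted).
CTAB = """\
6 6 0 0 0 0 0 0 0 0 1 V2000
1.9050 -0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9050 -2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7531 -0.1282 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7531 -2.7882 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.3987 -0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.3987 -2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0
3 1 2 0 0 0 0
4 2 2 0 0 0 0
5 3 1 0 0 0 0
6 4 1 0 0 0 0
6 5 2 0 0 0 0
M END
"""


def parse_ctab(text):
    """Return (atoms, bonds) from a whitespace-separated connection table."""
    lines = text.strip().splitlines()
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[1:1 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))  # atom indices are 1-based
    return atoms, bonds


atoms, bonds = parse_ctab(CTAB)

# Sum of bond orders on each atom: 3 for every carbon of this Kekulé
# structure, which leaves room for one implicit hydrogen per carbon.
bond_order_sum = {index: 0 for index in range(1, len(atoms) + 1)}
for first, second, order in bonds:
    bond_order_sum[first] += order
    bond_order_sum[second] += order

print(len(atoms), "atoms,", len(bonds), "bonds")
print(bond_order_sum)
```

The atom block carries coordinates and element symbols, while the bond block is an edge list with bond orders, so the molecule is stored as a labeled graph.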


SMILES ®

Benzene: c1ccccc1
Methane: C
Ethyne: C#C
Sildenafil citrate (Viagra):
OC(=O)CC(O)(CC(O)=O)C(O)=O.CCCc1nn(C)c2c1nc([nH]c2=O)-c1cc(ccc1OCC)S(=O)(=O)N1CCN(C)CC1
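
A SMILES string is a linear text, so some information can be pulled out with plain string processing. The sketch below counts heavy atoms by tokenizing the string with a regular expression. It is an illustration under simplifying assumptions: `ATOM_PATTERN` and `heavy_atom_counts` are hypothetical names, only the organic subset and simple bracket atoms are recognized, and no chemistry (valence, aromaticity, ring closure) is checked. A toolkit such as the ones listed later in this talk should be used for real work.

```python
import re
from collections import Counter

# Either a bracket atom such as [nH] or [NH4+], or an organic-subset atom.
ATOM_PATTERN = re.compile(
    r"\[\d*(se|as|[A-Z][a-z]?|[bcnops])[^\]]*\]"  # bracket atom, optional isotope
    r"|(Cl|Br|[BCNOPSFI]|[bcnops])"               # unbracketed atom
)


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for match in ATOM_PATTERN.finditer(smiles):
        symbol = match.group(1) or match.group(2)
        if symbol != "H":
            counts[symbol.capitalize()] += 1  # aromatic 'c' is counted as 'C'
    return dict(counts)


SILDENAFIL_CITRATE = (
    "OC(=O)CC(O)(CC(O)=O)C(O)=O."
    "CCCc1nn(C)c2c1nc([nH]c2=O)-c1cc(ccc1OCC)S(=O)(=O)N1CCN(C)CC1"
)

for name, smiles in [
    ("benzene", "c1ccccc1"),
    ("methane", "C"),
    ("ethyne", "C#C"),
    ("sildenafil citrate", SILDENAFIL_CITRATE),
]:
    print(name, heavy_atom_counts(smiles))
```

The dot in the sildenafil citrate string separates the two components (citric acid and sildenafil), which are counted together here.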


InChI

The IUPAC International Chemical Identifier (InChI) is a
non-proprietary identifier for chemical substances that can
be used in printed and electronic data sources thus
enabling easier linking of diverse data compilations

http://www.inchi-trust.org/


InChI is short for International Chemical Identifier.

InChIs are text strings comprising different layers and
sublayers of information separated by slashes (/).

Each InChI string starts with the InChI version number
followed by the main layer. This main layer contains
sublayers for chemical formula, atom connections and
hydrogen atoms.

Depending on the structure of the molecule the main layer
may be followed by additional layers, e.g. for charge,
stereochemical and/or isotopic information.

InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H
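
The layered structure can be made visible by splitting the benzene InChI on its slashes. This is a minimal sketch, assuming the string has a formula layer and that every later layer starts with a one-letter prefix (`c` for connections, `h` for hydrogens); `split_inchi` is a hypothetical helper and does not validate the identifier.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string: %r" % inchi)
    version, formula, *rest = body.split("/")
    # Each remaining layer is keyed by its one-letter prefix.
    layers = {layer[0]: layer[1:] for layer in rest if layer}
    return version, formula, layers


version, formula, layers = split_inchi("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H")
print("version:", version)          # 1S = version 1, standard InChI
print("formula:", formula)
print("connections:", layers["c"])
print("hydrogens:", layers["h"])
```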


Abstractions:

graphs
abstract graphs
Markush structures
descriptors (numerical representations)
fingerprints (binary representations)

Molecular descriptors

"The molecular descriptor is the final result of a logic and mathematical
procedure which transforms chemical information encoded within a symbolic
representation of a molecule into a useful number or the result of some
standardized experiment.

The field of molecular descriptors is strongly interdisciplinary and involves a mass of
different theories. For the definition of molecular descriptors, a knowledge of
algebra, graph theory, information theory, computational chemistry, theories of
organic reactivity and physical chemistry is usually required, although at
different levels.

For the use of the molecular descriptors, a knowledge of statistics, chemometrics,
and the principles of the QSAR/QSPR approaches is necessary in addition to
the specific knowledge of the problem. Moreover, programming, sophisticated
software and hardware are often inseparable fellow-travelers of the researcher in
this field."

From the introduction to the "Handbook of Molecular Descriptors"
by Roberto Todeschini and Viviana Consonni, Wiley-VCH, 2000.

Molecular descriptors

The main classes of theoretical molecular descriptors are:

• 0D-descriptors (i.e. constitutional descriptors, count descriptors),
• 1D-descriptors (i.e. list of structural fragments, fingerprints),
• 2D-descriptors (i.e. graph invariants),
• 3D-descriptors (such as, for example, 3D-MoRSE descriptors,
WHIM descriptors, GETAWAY descriptors, quantum-chemical
descriptors, size, steric, surface and volume descriptors),
• 4D-descriptors (such as those derived from GRID or CoMFA
methods, Volsurf).
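
As an illustration of the simplest class, 0D (constitutional) descriptors can be computed from the molecular formula alone. The sketch below is an assumption-laden example: `parse_formula` and `constitutional_descriptors` are hypothetical helpers, the formula parser handles only flat formulas without parentheses or charges, and the mass table covers just the elements needed here.

```python
import re

# Standard atomic weights (rounded) for the elements used in this example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def parse_formula(formula):
    """Turn a flat formula such as 'C6H6' into {'C': 6, 'H': 6}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def constitutional_descriptors(formula):
    """Compute a few 0D descriptors that need no structural information."""
    counts = parse_formula(formula)
    return {
        "n_atoms": sum(counts.values()),
        "n_heavy_atoms": sum(n for s, n in counts.items() if s != "H"),
        "molecular_weight": sum(ATOMIC_MASS[s] * n for s, n in counts.items()),
    }


benzene = constitutional_descriptors("C6H6")
sildenafil = constitutional_descriptors("C22H30N6O4S")
print(benzene)
print(sildenafil)
```

Descriptors of higher dimensionality need progressively more information: fragments (1D), the molecular graph (2D), and 3D coordinates or interaction fields (3D and 4D).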

QSAR

More than a century ago, Crum-Brown and Fraser
expressed the idea that the physiological action of a
substance in a certain biological system (A) was a
function (f) of its chemical constitution C:

A = f(C)

To explain the complex relationships between molecules and
observed quantities, two main streams were developed, the first
related to the search for relationships between molecular
structures and physico-chemical properties (QSPR,
Quantitative Structure-Property Relationships) and the second
between molecular structures and biological activities (QSAR,
Quantitative Structure-Activity Relationships).

QSAR

There is a consensus among current predictive toxicologists that Corwin
Hansch is the founder of modern QSAR. In the classic article it was
illustrated that, in general, biological activity for a group of ‘congeneric’
chemicals can be described by a comprehensive model:

log(1/C50) = a·π + b·ε + c·S + d

in which C50, the toxicant concentration at which an endpoint is manifested
(e.g. 50% mortality or effect), is related to a hydrophobicity term, π, an
electronic term, ε, and a steric term, S (typically Taft’s substituent
constant, ES).
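
A Hansch-type model of the form log(1/C50) = a·π + b·ε + c·S + d is linear in its coefficients, so a, b, c and d can be estimated by ordinary least squares. The sketch below shows the idea in plain Python. The descriptor values are synthetic and noise-free, generated from arbitrarily chosen coefficients purely to show that the fit recovers them; they are not measurements, and `fit_hansch` and `solve_linear` are hypothetical helper names.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def fit_hansch(descriptors, activities):
    """Least-squares fit of log(1/C50) = a*pi + b*eps + c*S + d.

    Solves the normal equations (X^T X) beta = X^T y.
    """
    X = [list(row) + [1.0] for row in descriptors]  # last column = intercept d
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, activities)) for i in range(k)]
    return solve_linear(XtX, Xty)


# Synthetic data for six hypothetical congeners: (pi, eps, S) per compound.
TRUE_COEFFS = (1.0, 0.5, 0.3, 2.0)  # a, b, c, d: arbitrary illustration values
descriptors = [
    (0.0, 0.0, 0.0),
    (0.5, -0.2, -1.2),
    (1.0, 0.2, -0.5),
    (1.5, 0.7, -1.7),
    (2.0, -0.1, -2.5),
    (0.7, 0.5, -0.9),
]
activities = [
    TRUE_COEFFS[0] * p + TRUE_COEFFS[1] * e + TRUE_COEFFS[2] * s + TRUE_COEFFS[3]
    for p, e, s in descriptors
]

a, b, c, d = fit_hansch(descriptors, activities)
print("a=%.3f b=%.3f c=%.3f d=%.3f" % (a, b, c, d))
```

With real data the activities are noisy, so the fit should be judged with the usual statistics (residuals, cross-validation) rather than by exact recovery of the coefficients.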

Computational libraries

CDK

Open Babel

CACTVS

RDKit

SVL/MOE


CDK

Web: cdk.sf.net
Language: Java (Jython, Groovy)
GUI: n/a
Pros: LGPL license, Jmol
Cons: for programmers only

Related projects:
AMBIT, Bioclipse, CDK Taverna, CDKDescUI, Evince, HyperDossier,
JChemPaint, JOELib, Jumbo, KNIME CDK feature, LICSS, NMRShiftDB,
Nomen, PaDEL, QueryConstructor, rcdk, SafeBase(TM), Scaffold Hunter,
SENECA, SmileMS, Jmol
Obsolete projects: XB Edit (working title)


Open Babel

Web: openbabel.org
Language: C++, with Python/Java/Perl bindings
GUI: yes!
Pros: flexibility
Cons: none listed


CACTVS

Web: xemistry.com
Language: Tcl
GUI: paid
Pros: free for academics, team
Cons: Tcl


RDKit

Web: www.rdkit.org/
Language: Python, C++
GUI: n/a
Pros: SMIRKS, team
Cons: installation, age

Algorithms: similarity search

Similarity measures, calculations that quantify the similarity
of two molecules, and screening, a way of rapidly
eliminating molecules as candidates in a substructure
search, are both processes that use fingerprints.

Fingerprints are a very abstract representation of certain
structural features of a molecule
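
The screening idea can be sketched in a few lines: if the query has a bit set that the candidate lacks, the candidate cannot contain the query as a substructure and can be discarded without running the expensive graph match. The fingerprints below are hypothetical integers used as bit sets, and `passes_screen` is an illustrative name, not a function from any of the toolkits mentioned in this talk.

```python
def passes_screen(query_fp, target_fp):
    """Return True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


# Hypothetical 8-bit fingerprints, written in binary for readability.
query = 0b00010110
database = {
    "mol_1": 0b10110110,  # contains all query bits: still a candidate
    "mol_2": 0b00010010,  # lacks one query bit: eliminated
    "mol_3": 0b01000001,  # shares no bits: eliminated
}

candidates = [name for name, fp in database.items() if passes_screen(query, fp)]
print(candidates)
```

Passing the screen is necessary but not sufficient: the surviving candidates still need a full substructure match.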


Structural keys

• The presence/absence of each element, or if an element is common
(nitrogen, for example), several bits might represent "at least 1 N", "at
least 2 N", "at least 4 N", and so forth.
• Unusual or important electronic configurations, such as "sp3 carbon" or
"triple-bonded nitrogen."
• Rings and ring systems, such as cyclohexane, pyridine, or naphthalene.
• Common functional groups, such as alcohols, amines, hydrocarbons, and
so forth.
• Functional groups of special importance in a particular database. For
example, a database of organo-metallic molecules might have bits
assigned for metal-containing functional groups; in a drug database one
might have bits for specific skeletal features such as steroids and
barbiturates.
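
The element-count bits in the first bullet can be sketched as a fixed list of yes/no questions, one bit per question. This is only an illustration: `STRUCTURAL_KEYS` is a hypothetical five-bit key, and the bits for rings and functional groups are left out because they require substructure matching with a chemistry toolkit. The sildenafil counts come from its molecular formula, C22H30N6O4S.

```python
# Each key is (description, test on a dict of heavy-atom counts).
STRUCTURAL_KEYS = [
    ("at least 1 N", lambda counts: counts.get("N", 0) >= 1),
    ("at least 2 N", lambda counts: counts.get("N", 0) >= 2),
    ("at least 4 N", lambda counts: counts.get("N", 0) >= 4),
    ("contains S", lambda counts: counts.get("S", 0) >= 1),
    ("contains Cl", lambda counts: counts.get("Cl", 0) >= 1),
]


def structural_key(counts):
    """Return one bit per entry of STRUCTURAL_KEYS, in the same order."""
    return [int(test(counts)) for _, test in STRUCTURAL_KEYS]


benzene_bits = structural_key({"C": 6})
sildenafil_bits = structural_key({"C": 22, "N": 6, "O": 4, "S": 1})
print("benzene:   ", benzene_bits)
print("sildenafil:", sildenafil_bits)
```

Because every bit has a fixed, predefined meaning, structural keys are easy to interpret but only as good as the chosen list of patterns.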


For example, the molecule OC=CN would generate the
following patterns:

0-bond paths:
C
O
N
1-bond paths:
OC
C=C
CN
2-bond paths:
OC=C
C=CN

3-bond paths:
OC=CN
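
The path patterns listed above can be generated by walking the molecular graph. The sketch below enumerates all linear paths of up to three bonds for OC=CN and then folds them into a small bit string. The graph is written by hand, and the hashing step (`zlib.crc32` modulo the fingerprint size) is an assumption chosen for illustration, not the scheme of any particular toolkit.

```python
import zlib

# OC=CN as a labeled graph: atoms by index, bonds with their SMILES symbol.
ATOMS = ["O", "C", "C", "N"]
BONDS = {(0, 1): "", (1, 2): "=", (2, 3): ""}


def enumerate_paths(atoms, bonds, max_bonds):
    """Return {number of bonds: set of path strings} for all simple paths."""
    neighbors = {i: [] for i in range(len(atoms))}
    for (i, j), symbol in bonds.items():
        neighbors[i].append((j, symbol))
        neighbors[j].append((i, symbol))

    patterns = {n: set() for n in range(max_bonds + 1)}

    def walk(path, text):
        n_bonds = len(path) - 1
        # Each path is found from both ends; keep only one direction.
        if path[0] <= path[-1]:
            patterns[n_bonds].add(text)
        if n_bonds == max_bonds:
            return
        for nxt, symbol in neighbors[path[-1]]:
            if nxt not in path:
                walk(path + [nxt], text + symbol + atoms[nxt])

    for start in range(len(atoms)):
        walk([start], atoms[start])
    return patterns


def fold_to_bits(patterns, n_bits=64):
    """Hash every pattern to one bit position (illustrative hashing scheme)."""
    fingerprint = 0
    for group in patterns.values():
        for text in group:
            fingerprint |= 1 << (zlib.crc32(text.encode("ascii")) % n_bits)
    return fingerprint


paths = enumerate_paths(ATOMS, BONDS, max_bonds=3)
for n_bonds, group in paths.items():
    print("%d-bond paths:" % n_bonds, sorted(group))
print(format(fold_to_bits(paths), "064b"))
```

Unlike structural keys, the bits of such a hashed fingerprint have no predefined meaning, and different patterns may collide on the same bit.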


10001001010001001010001001110100100010101
001001011100111010010010000100101011010010

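Tanimoto index = c/(a + b + c)

where c is the number of bits set in both fingerprints, and a and b are
the numbers of bits set only in the first and only in the second fingerprint.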
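
The Tanimoto index can be computed directly from two bit strings of equal length. In the sketch below, c counts the bits set in both fingerprints, while a and b count the bits set only in the first and only in the second, which is the reading under which c/(a + b + c) gives the usual Tanimoto coefficient. The example fingerprints are short hypothetical strings, and returning 0.0 for two empty fingerprints is a convention assumed here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto index of two equal-length bit strings made of '0' and '1'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    c = sum(x == "1" and y == "1" for x, y in zip(fp_a, fp_b))  # common bits
    a = sum(x == "1" and y == "0" for x, y in zip(fp_a, fp_b))  # only in A
    b = sum(x == "0" and y == "1" for x, y in zip(fp_a, fp_b))  # only in B
    if a + b + c == 0:
        return 0.0  # assumed convention when no bit is set in either string
    return c / (a + b + c)


print(tanimoto("1101", "0111"))  # c = 2, a = 1, b = 1
print(tanimoto("1010", "1010"))  # identical fingerprints
```

The index ranges from 0 (no bits in common) to 1 (identical fingerprints), so a similarity search ranks database molecules by this value against the query.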

Algorithms: substructure search

Public databases

PubChem

ChEMBL

ZINC

Various vendors

BindingDB

Chemical libraries

Issues

Registration

Uniqueness

Tools

1. filtering
2. normalization
3. tautomer generation
4. ionic states
5. deduplication
6. conformer generation
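
Step 5 of the list above can be sketched by grouping records on a canonical identifier. The example uses the InChI string as the key, which is one possible choice and not necessarily the one used in the pipeline described in this talk. The catalog records are hypothetical, and the redundancy measure (one minus distinct structures over records, per vendor) is an assumed definition for illustration.

```python
from collections import defaultdict

# Hypothetical catalog records: (vendor, catalog id, InChI).
RECORDS = [
    ("vendor_a", "A-001", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),  # benzene
    ("vendor_a", "A-002", "InChI=1S/CH4/h1H4"),                   # methane
    ("vendor_a", "A-003", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),  # benzene again
    ("vendor_b", "B-001", "InChI=1S/C2H2/c1-2/h1-2H"),            # ethyne
    ("vendor_b", "B-002", "InChI=1S/CH4/h1H4"),                   # methane
]


def deduplicate(records):
    """Group catalog entries by identifier: one key per unique structure."""
    unique = defaultdict(list)
    for vendor, catalog_id, inchi in records:
        unique[inchi].append((vendor, catalog_id))
    return dict(unique)


def intra_vendor_redundancy(records):
    """Fraction of each vendor's records that repeat one of its own structures."""
    total = defaultdict(int)
    distinct = defaultdict(set)
    for vendor, _, inchi in records:
        total[vendor] += 1
        distinct[vendor].add(inchi)
    return {vendor: 1 - len(distinct[vendor]) / total[vendor] for vendor in total}


unique = deduplicate(RECORDS)
redundancy = intra_vendor_redundancy(RECORDS)
print(len(RECORDS), "records,", len(unique), "unique structures")
print(redundancy)
```

Deduplication only works after normalization (step 2), because two different drawings of the same compound must first be brought to the same representation before their identifiers can match.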

MMsINC 1.0

3,967,056 total compounds
3,297,001 parent compounds
449,482 ionic states
220,573 tautomers

283,464,647 conformers (about 30 conformers per molecule);
ordered by empirical potential energy (E-pot);
max 5 conformers per molecule (about 4.6 conformers per compound on average)

Final number of conformers: 18,461,878 (for which we have ph4-FP and USR
descriptors)

Fanton et al., IEEE, 2008; Masciocchi et al., Nucleic Acids Research, 2009

MMsINC 2.0

92,355,744 compounds from 65 public data sources and commercial catalogs
71,206,303 after single-vendor-based cleaning
42,073,344 unique compounds after redundancy washing
40 M alternative tautomers
5 M ionic states
Expected number of conformers: about 220 M

Average intra-vendor redundancy: 14%
10 vendors with redundancy of more than 40%!
4 vendors with redundancy = 0% (small sets, 100 - 2000 compounds)

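The impact of tautomerism

[Figure: bar chart, y-axis from 0 to 250,000. For each model (Skin, DevTox,
LC50DM, LC50FM, Carcinogenicity, Mutagenicity, BCF): total tautomer/neutral
pairs, pairs with different predictions, pairs with different AD, and pairs
with both different predictions and different AD.]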

Mimicking peptides... in silico

Floris et al., Nucleic Acids Research, 2011; Floris M and Moro S, Molecular Informatics, 2011

Large-scale pharmacophore screening

• 2 minutes to screen 1 ph4 model on the CRS4 cluster resources against 17 M
conformers (4 M commercial compounds)
• Output: SDF with the top commercial compounds with the highest overlap with the
original pharmacophore hypothesis
• Possibility of multiple simultaneous screenings and parameter tuning in a
reasonable time

The chemoinformatician's toolbox

• Python, Java
• R, Weka
• Open Babel, CDK
• Marvin Beans
• a personal database
• the Blue Obelisk

The importance of a healthy working environment

Acknowledgments

• Alessandro Bulfone
• Prof. Stefano Moro
• Silvana Urru, Andrea Cristiani, Ricardo Medda, Stefania Olla
• colleagues from CRS4 Outreach
• colleagues at CNR (IRGB-CNR, Prof. F. Cucca)
• Marco Fanton, Mattia Sturlese, Fabian Cedrati, Davide Sabbadin
• all other collaborators: Alberto Manganaro, Emilio Benfenati, the
colleagues of the ministerial QSAR-Reach group, the colleagues of the
Blue Obelisk, the TNBC group
• my family (Lolli, Ric, Vera, assorted grandparents, various sisters)

matteo.floris@gmail.com
