The PubChemQC Project
A big data construction by first-principles
calculations of molecules
中田真秀(NAKATA Maho)
ACCC RIKEN
maho@riken.jp
2014/12/3 10:35-11:05
JST CREST International Symposium on Post
Petascale System Software

Background
• All matter is composed of atoms and molecules.
• The dream of theoretical chemists: do chemistry
without experiments!
• On computers 
• We treat big data in chemistry!
– Chemical space is really huge!
• The number of candidates for drugs: 10^60
(http://onlinelibrary.wiley.com/doi/10.1002/wcms.1104/abstract)
• Cf. Exa: 10^18

Current status of computational
chemistry
• Relatively good agreement with experiments.
• Can explain nature in many cases.
– Many good quantum chemistry programs are
available!
– “DFT B3LYP 6-31G*” calculations rule!
• We want to lead chemistry
– So far, we only explain what has already happened.

Difference between experiment and
calculation/theory
• Finding interesting phenomena or problems
– How do we convert CO2 to O2? N2 + H2 to NH3?
– How to synthesize a compound?
• Design a key chemical reaction.
• Calculations
• Experiments
– Analyze
• Analysis of results
• Propose new experiments
Only One Difference

Difference between experiment and
calculation/theory
• No difference as science
• Most important thing is curiosity!
New insights from
big data and
my sensitivity!
Unfortunately, there is not much easy-to-use
big data for chemistry

Googling molecules
+
Gives you recommended molecules!

What is needed for Googling molecules?
1. Types, kinds, variety of molecules
– The number of molecules is infinite, but we cover the important ones
2. Required properties of molecules
– Molecular structure, energy, UV excitation energy,
dipole moment
3. Getting properties of molecules by calculation?
– Accuracy of calculation, and computer resources…
4. Coding or encoding molecules
– IUPAC nomenclature is not suitable
– Do not think about graph theory

Databases for lists of molecules
• PubChem: 50,000,000 molecules listed, made by NIH,
public domain, not curated (imported from catalogs,
etc.), obtainable via FTP.
• ChemSpider: 28,000,000 entries, better curated, no
FTP. Redistribution and download are restricted.
• Web-GDB13: 900,000,000 entries, just generated by
combinatorics. No
• ZINC, ChEMBL, DrugBank …
• CAS: 70,000,000 molecules, proprietary
• Nikkaji: 6,000,000, proprietary
We use PubChem as our source of molecules

Ex. A molecule listed in PubChem

Database for molecular properties by
experiments
• Experiments are needed to obtain molecular
properties.
– No free comprehensive database is known so far.
– Pharmaceutical companies do O(1,000,000)
experiments for high-throughput screening.
• Experiments are very expensive!
– Time consuming, large facilities, high costs, hazardous
We do not do experiments!

Database for molecular properties by computer
calculation
• Gold standard method: “Density functional
theory (B3LYP functional) + 6-31G(d) basis set”
– Accuracy is quite satisfactory (1-10 kcal/mol) for
biological systems and organic chemistry.
– Good implementations are available.
– Costs less (fast, needs only a supercomputer, nothing hazardous)
– Calculation time keeps decreasing
• Intel Core i7 (esp. Sandy Bridge) is very fast.
• Still we need huge resources, though.
We calculate by computer instead!

What is a molecule?
No rigorous definition for a molecule
[Figure (from Wikipedia): representations of propionaldehyde: wavefunction, 3D coordinates, structure, IUPAC nomenclature, common name. One end of the range is hard to understand but rigorous; the other is easy to understand but has many corner cases.]

What is a molecule?
• No rigorous definition for “what is a molecule”
• Nomenclature
– 3D coordinates of nuclei
– Structural formula
– IUPAC nomenclature
– Higher abstraction or less abstraction?
• Better molecular encoding method?
– Easy to understand for humans
– Easy to understand for computers as well
– Can describe most cases, with fewer corner cases.
– Compromise between dream and reality

Encoding molecules: SMILES
IUPAC nomenclature
tert-butyl N-[(2S,3S,5S)-5-[[4-[(1-benzyltetrazol-5-yl)methoxy]phenyl]methyl]-3-hydroxy-6-[[(1S,2R)-2-hydroxy-2,3-dihydro-1H-inden-1-yl]amino]-6-oxo-1-phenylhexan-2-yl]carbamate
We can encode molecules
• SMILES
CN(C)CCOC12CCC(C3C1CCCC3)C4=CC=CC=C24
• InChI (made by IUPAC)
InChI=1S/C20H29NO/c1-21(2)13-14-22-20-12-11-15(16-7-3-5-9-18(16)20)17-8-4-6-10-19(17)20/h3,5,7,9,15,17,19H,4,6,8,10-14H2,1-2H3
…
SMILES is a good encoding method for molecules

What is SMILES?
• Simplified Molecular Input Line Entry System
– A linear representation of molecule using ASCII.
– Stereochemistry (chirality, cis/trans) can also be encoded
– Human readable, and also machine readable.
– Almost one-to-one mapping between a molecule and
SMILES via universal SMILES
• David Weininger at USEPA Mid-Continent Ecology Division Laboratory invented SMILES
• InChI by IUPAC
– International Chemical Identifier : open standard (non proprietary)
– NM O’Boyle invented “Universal SMILES” via InChI

Example by SMILES
http://en.wikipedia.org/wiki/SMILES
Molecule (structure): SMILES
• Nitrogen molecule (N≡N): N#N
• Copper sulfate (Cu2+ SO4 2-): [Cu+2].[O-]S(=O)(=O)[O-]
• Oenanthotoxin: CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO
• Vitamin B1: OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2
• Aflatoxin B1: O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5

Some corner cases
Two different SMILES for Ferrocene
• C12C3C4C5C1[Fe]23451234C5C1C2C3C45
• [CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]
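One hedged way to see that the two ferrocene strings above describe the same set of atoms is to count heavy atoms directly from the SMILES text. The sketch below is a minimal standard-library tokenizer written for illustration only: `count_heavy_atoms` is a hypothetical helper, not part of OpenBABEL or of the PubChemQC tooling, and it ignores bonds, ring closures, charges and hydrogens.

```python
import re
from collections import Counter

# A bracket atom such as [Fe+2] or [C@@H], or an unbracketed atom from the
# organic subset (two-letter symbols first, then one-letter and aromatic).
ATOM_PATTERN = re.compile(r"\[([^\]]+)\]|(Cl|Br|[BCNOPSFI]|[bcnops])")
# Inside brackets: optional isotope number, then the element symbol.
ELEMENT_IN_BRACKET = re.compile(r"^\d*([A-Z][a-z]?|[a-z]{1,2})")


def count_heavy_atoms(smiles):
    """Count non-hydrogen atoms per element in a SMILES string (sketch)."""
    counts = Counter()
    for match in ATOM_PATTERN.finditer(smiles):
        bracket, plain = match.groups()
        if bracket is not None:
            symbol = ELEMENT_IN_BRACKET.match(bracket).group(1)
        else:
            symbol = plain
        symbol = symbol.capitalize()  # aromatic "c" is counted as "C"
        if symbol != "H":
            counts[symbol] += 1
    return counts


covalent = "C12C3C4C5C1[Fe]23451234C5C1C2C3C45"
ionic = "[CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]"
print(count_heavy_atoms(covalent) == count_heavy_atoms(ionic))  # True
```

The two strings differ as text but agree on composition (ten carbons, one iron), which is why string comparison alone cannot detect such duplicates.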

Construction of ab initio chemical
database
• Molecular information is from PubChem
• Properties are calculated from first principles by
computer
– Many program packages are available
– DFT (B3LYP)
– 6-31G(d) basis set and geometry optimization
– Excited states calculation by TD-DFT 6-31+G(d)
– Best for organic molecules or bio molecules
• Molecular encoding : SMILES / InChI
• Huge computer resources
• Dream come true
– Google like search engine for chemistry

The PubChemQC Project
• http://pubchemqc.riken.jp/
• An open database for molecules
– Public domain
• Ab initio (first-principles) calculation of
molecular properties of PubChem molecules
• 2014/1/15: 13,000 molecules
• 2014/7/29 : 155,792 molecules
• 2014/10/30 : 906,798 molecules
• 2014/12/3 : 1,137,286 molecules

The PubChemQC project
http://pubchemqc.riken.jp/
WIP: no search engine, just data

PubChemQC

Related works
• Related works
– NIST Web Book
• http://webbook.nist.gov/chemistry/
• Small number of molecules; compares many methods
– Harvard Clean Energy Project
• http://cleanenergy.molecularspace.org/
• 25,000,000 (?) molecules for photo devices, made by
combinatorics
– Sugimoto et al.: 2013 CBI symposium poster
• Almost the same as our database, currently not open to the
public (now??)

How do we do it?
• Generate initial 3D conformation by OpenBABEL
– The SDF contains a 3D conformation, but we don’t use it.
– OpenBABEL -h (add hydrogens) --gen3d (generate 3D
coordinates)
• Ab initio calculation by GAMESS+Firefly
– Using Gaussian can lead to a political problem(?)
– PM3 optimization
– Hartree-Fock/STO-6G geometry optimization
– Firefly+GAMESS geometry optimization in B3LYP/6-31G*
– Ten excitation energies by TDDFT/6-31+G* (no geometry
optimization)
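The staged workflow above can be written down as plain data, which makes the order of the steps explicit. This is a minimal sketch, not the project's actual scripts: the `Stage` class, its field names and `obabel_gen3d_command` are hypothetical, and the command is only built, not executed. The `obabel` options used (`-:` for an inline SMILES, `-O` for the output file, `-h`, `--gen3d`) are standard Open Babel options.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    method: str
    basis: Optional[str]          # None for the semi-empirical PM3 step
    optimize_geometry: bool
    n_excited_states: int = 0


# Stage order as listed on the slide; names are illustrative.
PIPELINE = (
    Stage("pre-optimization", "PM3", None, True),
    Stage("coarse optimization", "Hartree-Fock", "STO-6G", True),
    Stage("final optimization", "B3LYP", "6-31G*", True),
    Stage("excited states", "TDDFT", "6-31+G*", False, n_excited_states=10),
)


def obabel_gen3d_command(smiles, out_path):
    """Build (but do not run) an Open Babel call that adds hydrogens
    and generates initial 3D coordinates from a SMILES string."""
    return ["obabel", f"-:{smiles}", "-O", out_path, "-h", "--gen3d"]


print(obabel_gen3d_command("N#N", "n2.mol"))
for stage in PIPELINE:
    print(stage.name, stage.method, stage.basis)
```

Keeping the stages in one tuple means each cheaper step can feed its geometry to the next one, which is the point of running PM3 and Hartree-Fock/STO-6G before B3LYP/6-31G*.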

How do we do it?
• Heavily using OpenBABEL
• Extracting molecular information
– Sort PubChem compounds by molecular weight
– OpenBABEL
• Encoded by SMILES
– Isomeric SMILES: stereochemistry retained
– OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
– CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO
– CC(=O)OCCC(/C)=CC[C@H](C(C)=C)CCC=C

From PubChem compound to
quantum chemistry calculation
aflatoxin
O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
[Figure: workflow from the SMILES string via OpenBABEL to the ab initio calculation]

Final results will be
• Uploaded to http://pubchemqc.riken.jp/
• Currently we upload
– input file (ground / excited state)
– Output file (ground / excited state)
– Final geometry in Mol file

Scaling of computation
• Embarrassingly parallel for each molecule
• Very roughly speaking, required time for
calculation scales like N^4
– N : molecular weight
• Problems are very hard (complexity theory)
– Hartree-Fock calculation
– DFT (b3lyp) calculation
– geometry optimization
• Practically many molecules can be solved
efficiently
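To make the rough N^4 rule above concrete, the sketch below compares the expected cost of two molecules by molecular weight. It takes the slide's "very roughly speaking" estimate at face value; `relative_cost` is a hypothetical helper and real timings depend on the molecule, the basis set and the program.

```python
def relative_cost(mw_reference, mw_target, exponent=4.0):
    """Estimated cost of mw_target relative to mw_reference,
    assuming time scales like (molecular weight) ** exponent."""
    return (mw_target / mw_reference) ** exponent


# Doubling the molecular weight multiplies the estimated time by 2**4.
print(relative_cost(160, 320))  # 16.0
# A molecule at the MW = 500 limit versus a typical MW = 160 one.
print(round(relative_cost(160, 500)))  # 95
```

Because each molecule is independent, this cost growth affects throughput per node, not the parallel structure of the job.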

Computer Resources
• RICC: Intel Xeon 5570 Westmere (2.93 GHz, 8
cores/node) x 1000
– 1000-10000 molecules/day (MW 160)
– Heavily depends on the load from other users
– Time limit: 8 hours
• Quest: Intel Core 2 Duo (1.6 GHz/node) x 700
– 3000-8000 molecules/day (MW 160)
– 100-1000 molecules/day (MW 200-300)
– Time limit: 20 hours
• Compounds whose calculations fail are skipped for
now.

Computer Resources
• Storage
– Approx. 500GB for 1,000,000 molecules (xz
compressed)
– Approx. 20 TB for 40,000,000 molecules (xz
compressed)
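The two storage figures above follow from one per-molecule estimate. A minimal sketch of that extrapolation, assuming linear growth and decimal units (1 TB = 1000 GB); `storage_tb` is a hypothetical helper:

```python
GB_PER_MILLION_MOLECULES = 500  # xz-compressed, figure from the slide


def storage_tb(n_molecules):
    """Estimated xz-compressed storage in TB, assuming linear growth."""
    return n_molecules / 1_000_000 * GB_PER_MILLION_MOLECULES / 1000


print(storage_tb(1_000_000))   # 0.5
print(storage_tb(40_000_000))  # 20.0
```

This works out to roughly 0.5 MB of compressed input and output per molecule.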

Molecular weight and Lipinski Rule
• Lipinski’s rule of five (Pfizer's rule of five): rule of
thumb for drug discovery
• No more than 5 hydrogen bond donors
• No more than 10 hydrogen bond acceptors
• A molecular mass less than 500 daltons
• An octanol-water partition coefficient log P not greater than 5
• A molecular weight below 500 is also
very good for computational chemistry
– For routine calculations without experimental data
other than molecular formula
– If larger than 500, secondary or higher structure
becomes important. E.g., protein
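The four criteria above translate directly into a small filter. This is a sketch under stated assumptions: the `MoleculeProperties` class and `lipinski_violations` function are hypothetical, the property values must come from elsewhere (for example a database record), and the example numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MoleculeProperties:
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float  # daltons
    log_p: float             # octanol-water partition coefficient


def lipinski_violations(p):
    """Return the list of rule-of-five criteria that the molecule breaks."""
    violations = []
    if p.h_bond_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if p.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    if p.molecular_weight >= 500:
        violations.append("molecular mass not below 500 daltons")
    if p.log_p > 5:
        violations.append("log P greater than 5")
    return violations


# Hypothetical property values, for illustration only.
small = MoleculeProperties(2, 5, 320.0, 2.1)
large = MoleculeProperties(6, 12, 650.0, 6.2)
print(lipinski_violations(small))
print(len(lipinski_violations(large)))
```

For this project only the molecular weight criterion is used as a cutoff; the other three are listed for completeness.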

Molecular Weight distribution at
PubChem
Lipinski limit MW=500
We are still here
30,000,000 molecules
(excluding mixtures)

How long will it take to finish?
• For drug design, we need to calculate all
molecules of MW < 500
• Total 30,000,000 molecules
– This number may increase in the future
• Current (2014/12/4) 1,100,000 molecules
– Only 3%
• 10,000 molecules/day -> 8.2 years
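The 8.2-year figure above is the total divided by the daily throughput. A minimal sketch of that arithmetic, assuming a constant rate and 365.25 days per year; `years_to_finish` is a hypothetical helper:

```python
def years_to_finish(total_molecules, molecules_per_day, already_done=0):
    """Years needed at a constant daily throughput."""
    remaining = total_molecules - already_done
    return remaining / molecules_per_day / 365.25


# All 30,000,000 molecules at 10,000 molecules/day, as on the slide.
print(round(years_to_finish(30_000_000, 10_000), 1))  # 8.2
# Subtracting the 1,100,000 molecules already calculated.
print(round(years_to_finish(30_000_000, 10_000, 1_100_000), 1))  # 7.9
```

The estimate is linear in the throughput, so any speed-up in hardware or method shortens it proportionally.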

How long will it take to finish?
• 10+ years? No, maybe far less.
• 25 years ago (1990) computers were so slow
– Even ab initio calculations were very difficult on
486DX@25MHz or
68000@10MHz

Outlook, prospect, hope…
• Far better in silico screening
– Fewer or no experiments are necessary
• Even faster calculation using machine learning
– 10,000 molecules / second ?
– Using our data as learning set.
– Not difficult for bio or organic molecules
– Far better initial guess
• Database for chemical reaction
– Precise calculation is required
– GRRM method + machine learning (?)
• Geometry optimization for Protein (PDB)
– Only X ray crystal structures are available

Difficulties in this project
• Parameters needed for calculations vary by
molecule
• Properties can differ depending on the initial guess
• Computer Resources
– Raspberry Pi? NVIDIA Jetson? BOINC?
• Molecular encoding never ends
– SMILES or InChI is not complete
– Some corner cases may be chemically interesting.
