Carles Bo, from ICIQ, presents ioChem-BD, a data repository for computational chemistry. The goal is to build a database in a standardized way, defining the processes, what is stored, and how it is stored.
This presentation was given at TSIUC'14, held at the Universitat Autònoma de Barcelona on 2 December 2014 under the title "Reptes en Big Data a la universitat i la Recerca" (Challenges in Big Data at the University and in Research).
ioChem-BD, a solution for managing Big Data in Computational Chemistry
1. 30/11/14
A solution for managing Big Data in Computational Chemistry
TSIUC’14, Universitat Autònoma de Barcelona, 2-XII-2014
Carles Bo, ICIQ - URV
cbo@iciq.cat
Computational Chemistry
2. Computational Chemistry: Taking experiment to cyberspace
Nobel Prize in Chemistry 2013 (1981, 1998)
POPULAR SCIENCE BACKGROUND
Taking the experiment to cyberspace
Chemical reactions occur at lightning speed; electrons jump between atoms hidden from the prying
eyes of scientists. The Nobel Laureates in Chemistry 2013 have made it possible to map the mysterious
ways of chemistry by using computers. Detailed knowledge of chemical processes makes it possible
to optimize catalysts, drugs and solar cells.
Chemists all over the world devise and carry out experiments on their computers on a daily basis.
With the help of the methods that Martin Karplus, Michael Levitt and Arieh Warshel began to
develop in the 1970s, they examined every tiny little step in complex chemical processes that are
invisible to the naked eye.
In order for you, the reader, to get an idea of how mankind can benefit from this, we begin with an
example. Put your lab coat on, because we have a challenge for you: to create artificial photosynthesis.
The chemical reaction occurring in green leaves fills the atmosphere with oxygen and is one
prerequisite for life on Earth. But it is also interesting from an environmental perspective. If you can
mimic photosynthesis you will be able to create more efficient solar cells. When water molecules
are split, oxygen is created, but also hydrogen that could be used to power our vehicles. So there is
ample reason for you to get engaged in this project. If you succeed, you could contribute to solving
the problem of the greenhouse effect.
Figure 1. Today chemists experiment just as much on their computers as they do in their labs. Theoretical results
from computers are confirmed by real experiments that yield new clues to how the world of atoms works. Theory and
practice cross-fertilize each other.
Permanent storage. Certify results. Re-use results.
3. Our Big Data Problem (1)
Help researchers in their daily tasks (manage & store results, apps & tools)
Our Big Data Problem (2)
Manage files of former group members
4. Our Big Data Problem (3)
Supporting Information files
Certify results - Reuse results
Yes, Comp Chem is a Big Data Problem
5. 5★ Open Data (Tim Berners-Lee)
OL: Open license
OF: Open format
LD: Linked
RE: Readable data
URI: Accessible
Present
[Diagram labels: Scientists; Submit jobs; Data Collection: Manually; Reports (pdf files): Manually; HPC; Files: TeraBytes, >95% waste; Publishers: Files; Public: Information]
6. Future
[Diagram labels: Scientists; Submit jobs; Workflows; Data Collection: Automated; Reports (XML): Automated; Cloud HPC; HPC on demand; Results; Databases (XML); Publishers: Information; Public: Files, Information]
ioChem-BD
[Diagram labels: Scientists; Submit jobs; Data Collection: Manually; Reports (XML): Automated; HPC; Results; Databases (XML); Publishers: Files; Public: Files, Information]
7. 5★ Open Data (Tim Berners-Lee)
[Diagram labels: Present; ioChem-BD]
ioChem-BD Definition
ioChem-BD is a digital repository that manages and stores Computational Chemistry files (inputs & outputs). It fills the gap between the generation of results and the publication of manuscripts, and raises data to 5★ quality.
Created by the fusion of previous projects:
8. Goals
• Build a distributed database of computational chemistry results: reduce size and increase value.
• Set a common data standard among all quantum chemistry legacy formats (XML - CML).
• Become a daily tool in data management, search and manipulation
• Redefine workflows: store results and publishing, open-data
• Be open to add future functionalities for data manipulation and analysis
ioChem-BD features
• Dynamic, independent templates for data extraction and data display
• Data representation is a top priority (XML-CML)
• Responsive design (any device is able to render our content)
• Data easily exportable to other formats
• Secure connections
• Fully compliant with latest web standards
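To make the XML-CML goal concrete, here is a minimal Python sketch of what storing a geometry as CML can look like. It is not the ioChem-BD converter (the slides do not show its output), and the function name, the molecule id and the water coordinates are illustrative assumptions. Only the CML namespace and the `molecule`, `atomArray` and `atom` elements with their `elementType`, `x3`, `y3`, `z3` attributes come from the CML schema.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # CML namespace URI


def molecule_to_cml(mol_id, atoms):
    """Build a minimal CML <molecule> from (element, x, y, z) tuples in angstrom.

    Hypothetical helper for illustration; not part of ioChem-BD.
    """
    ET.register_namespace("", CML_NS)  # serialize CML as the default namespace
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})
    atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for index, (element, x, y, z) in enumerate(atoms, start=1):
        ET.SubElement(
            atom_array,
            f"{{{CML_NS}}}atom",
            {
                "id": f"a{index}",
                "elementType": element,
                "x3": f"{x:.6f}",
                "y3": f"{y:.6f}",
                "z3": f"{z:.6f}",
            },
        )
    return ET.tostring(molecule, encoding="unicode")


# Illustrative geometry only, not a computed result.
water = [
    ("O", 0.0, 0.0, 0.117),
    ("H", 0.0, 0.757, -0.469),
    ("H", 0.0, -0.757, -0.469),
]
cml_text = molecule_to_cml("water", water)
print(cml_text)
```

Once results are in a common XML form like this, searching, displaying and exporting to other formats can work on one structure instead of one parser per legacy output format.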
9. Performance of our new extraction library
[Chart: Conversion time vs File size, plain text to CompChem CML. Series: jumbo-converters, jumbo-saxon, jumbo-saxon with keep field. Y axis: Parsing time (s), 0 to 450. X axis: File size (kB), 112.73 to 68,328.04. Annotations: ≈14x, ≈4x]
[Diagram labels: User interfaces (Shell, Web); Upload; Convert; Store; User files (input/output); Conversion templates; Create & Browse: Search, Manage, Convert, Share, Publish]
10. Workflow steps (1): Create
Results files are uploaded from the user’s disk space
- Create shell client
- Create web interface
- Certify results (True Data)
- Validation (Convergence WF, Geometries)
Create: Shell client
11. Create: Shell client
Basic commands
Command          Description
start-rep-shell  Connect to repository (mandatory)
exit-rep         Disconnect from repository
lspro            List current path contents
pwdpro           Print current path
Project related commands
Command          Description
catpro           Display project information
cdpro            Change to project
cpro             Create a new project
mpro             Modify a project
dpro             Delete a project
findpro          Find project by its name (regex allowed)
Calculation related commands
Command          Description
loadcalc         Load calculation into repository
viewcalc         View calculation information
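The note that findpro accepts a regular expression can be illustrated with a short Python sketch. This is not the shell client's code: the slides do not say whether the pattern must match the whole name or any part of it, so the function below assumes a match anywhere in the name, and the project names are made up.

```python
import re


def find_projects(project_names, pattern):
    """Return the names in which the regular expression matches anywhere.

    Hypothetical stand-in for a regex project search; the real findpro
    matching rules are not documented in the slides.
    """
    regex = re.compile(pattern)
    return [name for name in project_names if regex.search(name)]


# Made-up project names for illustration.
projects = ["catalysis-2013", "catalysis-2014", "solar-cells", "pom-clusters"]

# Anchors (^ and $) force a whole-name match under the "match anywhere" rule.
matches = find_projects(projects, r"^catalysis-\d{4}$")
print(matches)
```

Under this assumption, a bare pattern such as `cells` would select `solar-cells`, while anchored patterns restrict the search to complete names.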
Create: Web interface
12. Workflow steps (2): Create
The Create module manages results and facilitates advanced data treatment
Create: Web interface
• Manage – Post-processing
– Organize projects, collections
– Enrich Data: Description, keywords, additional files
– Reports: Generate Sup. Info. files (pdf) for publishing
– Reaction Energy paths
– Consistency (level of theory)
– Thermodynamic corrections
– Kinetic Analysis (TOF, % e.e.)
– Molecular descriptors (QSAR)
– etc …
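Two of the post-processing items above reduce to short formulas, sketched here in Python. This is only an illustration under stated assumptions, not the Create module's implementation: the function names and the energy values are hypothetical, the energy path is taken relative to its first point, and the % e.e. estimate assumes kinetic control with equal prefactors for the two competing transition states, so that k_major/k_minor = exp(ΔΔG‡/RT) and e.e. = tanh(ΔΔG‡/2RT).

```python
import math

HARTREE_TO_KCAL_MOL = 627.5095  # 1 hartree expressed in kcal/mol
R_KCAL = 1.987204e-3            # gas constant in kcal/(mol*K)


def relative_energies(energies_hartree):
    """Energies along a reaction path in kcal/mol, relative to the first point."""
    reference = energies_hartree[0]
    return [(e - reference) * HARTREE_TO_KCAL_MOL for e in energies_hartree]


def percent_ee(ddg_kcal_mol, temperature=298.15):
    """Estimate % e.e. from the free-energy gap between two transition states.

    Assumes the rate ratio is exp(ddG / RT), which gives
    e.e. = (k_major - k_minor) / (k_major + k_minor) = tanh(ddG / (2RT)).
    """
    return 100.0 * math.tanh(ddg_kcal_mol / (2.0 * R_KCAL * temperature))


# Illustrative energies (hartree) for reactant, transition state, product.
path = [-100.000, -99.980, -100.020]
profile = relative_energies(path)
ee_at_1_kcal = percent_ee(1.0)
print(profile, ee_at_1_kcal)
```

A consistency check on the level of theory matters here: relative energies and ΔΔG‡ are only meaningful when every point on the path was computed with the same method and basis set.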
13. Workflow steps (3): Browse
Results can then be published and made available for viewing and downloading by the general public in the Browse module
Handle URL generator
Rich XML Supporting Information files
Linked to a published manuscript
Browse: Web interface
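As a rough idea of what a Handle URL generator produces, the Python sketch below composes a resolvable URL from a handle prefix and suffix using the public Handle System proxy at hdl.handle.net. How ioChem-BD assigns its handles is not described in the slides; the function name, the prefix `99999` and the suffix are hypothetical placeholders.

```python
from urllib.parse import quote

HANDLE_PROXY = "https://hdl.handle.net"  # public Handle System proxy


def handle_url(prefix, suffix):
    """Compose a proxy URL for the handle 'prefix/suffix'.

    Hypothetical helper; percent-encodes characters that are unsafe in a
    URL path while leaving '/' inside the suffix untouched.
    """
    if not prefix or not suffix:
        raise ValueError("both prefix and suffix are required")
    return f"{HANDLE_PROXY}/{quote(prefix)}/{quote(suffix)}"


# Placeholder handle, not a registered identifier.
url = handle_url("99999", "calc-42")
print(url)
```

A persistent identifier of this kind is what lets a published manuscript point to its supporting data without depending on the repository's internal paths.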
14. Current project status
• Private & Demo servers up (www.iochem-bd.org)
• Supported formats:
– Gaussian, ADF, VASP
– Molcas (50%)
• Testing integrity (user-driven tests)
• Checking data captured & displayed
• Reports Module (50%)
• To do: syndicate distributed browsers, links to external databases, …
Acknowledgements
Moises Álvarez
N. Lopez, F. Maseras, J. M. Poblet, C. De Graaf