Exploring Large Chemical Data Sets
1. Exploring Large Chemical Data Sets
Interactive Analysis and Visualization
Kyle Lutz and Marcus D. Hanwell
August 21, 2012
Skolnik Symposium
2. Overview
● An open-source, cross-platform
cheminformatics tool
● A general-purpose tool for chemical data
exploration and analysis
● Interactive, editable and queryable
database of chemical data on the desktop
● Part of the Open Chemistry application
suite (Avogadro and MoleQueue)
● Leverages several open-source projects:
Qt, VTK, Chemkit, Open Babel, MongoDB
3. Architecture
● Native, cross-platform C++ application built with Qt
● Stores chemical data in a NoSQL MongoDB database
● Uses VTK for 2D and 3D data set visualization
8. Charts and Plots
[Figure: Scatter Plot of Polar Surface Area (TPSA) against Volume (VABC)]
[Figure: Histogram of logP]
9. Multidimensional Analysis
● Provide tools for viewing and analyzing large
amounts of data with multiple dimensions
○ Scatter Plot Matrix
○ Parallel Coordinates
○ K-Means Clustering
● Interactive charts supporting selection
● Easy to add new chemical descriptors
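A minimal sketch of the data preparation behind the two chart types listed above, assuming a small in-memory table of descriptors. The descriptor keys and all numeric values are hypothetical and only illustrate the shape of the data; this is not the application's own code (the application is C++ with VTK).

```python
from itertools import combinations

# Hypothetical descriptor table: one dict per molecule, values are illustrative only.
molecules = [
    {"tpsa": 20.2, "logp": -0.1, "mass": 46.1, "rotatable_bonds": 0, "volume": 53.0},
    {"tpsa": 37.3, "logp": 1.2, "mass": 122.1, "rotatable_bonds": 1, "volume": 110.0},
    {"tpsa": 63.6, "logp": 2.3, "mass": 180.2, "rotatable_bonds": 3, "volume": 155.0},
    {"tpsa": 0.0, "logp": 3.9, "mass": 114.2, "rotatable_bonds": 5, "volume": 146.0},
]
descriptors = ["tpsa", "logp", "mass", "rotatable_bonds", "volume"]


def column(rows, key):
    """Return one descriptor column as a list of values."""
    return [row[key] for row in rows]


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Scatter plot matrix: one panel per unordered pair of descriptors
# (the off-diagonal cells of one triangle of the matrix).
panels = list(combinations(descriptors, 2))

# Parallel coordinates: every descriptor becomes an axis scaled to [0, 1],
# and every molecule becomes one polyline crossing all axes.
scaled = {name: min_max_scale(column(molecules, name)) for name in descriptors}
polylines = [[scaled[name][i] for name in descriptors] for i in range(len(molecules))]
```

Scaling each axis independently lets descriptors with very different units, such as mass and logP, share one parallel-coordinates plot.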
10. Scatter Plot Matrix
Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
11. Parallel Coordinates
Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
12. K-Means Clustering
● ~30 numeric molecular descriptors
● 1D, 2D, and 3D visualization
● Selection and extraction of molecules from clusters
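As an illustration of the clustering step, here is a minimal sketch of standard k-means (Lloyd's algorithm) in plain Python. It is not the application's implementation. The molecule names and the two-column descriptor vectors are hypothetical, standing in for the roughly 30 descriptors mentioned above.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing, so the result has converged
        labels = new_labels
    return labels, centroids


# Hypothetical (logP, polar surface area) vectors for six made-up molecules.
names = ["mol_a", "mol_b", "mol_c", "mol_d", "mol_e", "mol_f"]
vectors = [(0.1, 20.0), (0.3, 22.0), (0.2, 19.0), (4.1, 90.0), (3.9, 95.0), (4.3, 88.0)]

labels, centroids = kmeans(vectors, k=2)

# Selection and extraction: pull out every molecule that shares mol_d's cluster.
selected = [name for name, label in zip(names, labels) if label == labels[3]]
```

With real descriptors, the columns would normally be scaled to comparable ranges first, since Euclidean distance is otherwise dominated by the descriptor with the largest numeric spread.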
15. ChemicalJSON
Example: ethane.cjson
● JSON (JavaScript Object Notation) is
a "lightweight data-interchange
format"
● Store molecular structure, geometry,
identifiers and descriptors all as a
single JSON object
● Benefits:
○ More compact than XML/CML
○ Native language of MongoDB and
JSON-RPC
○ Easily converted to a binary
representation (BSON)
Specification available at: http://wiki.openchemistry.org/Chemical_JSON
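To make the "single JSON object" idea concrete, here is a rough sketch that builds a ChemicalJSON-style document for ethane and round-trips it through text. The key names are assumptions chosen to mirror the categories on the slide (structure, identifiers, descriptors), so consult the specification for the authoritative schema. Coordinates are left out rather than invented.

```python
import json

# Illustrative ChemicalJSON-style document; key names are assumptions, not the spec.
ethane = {
    "name": "ethane",
    "inchi": "InChI=1S/C2H6/c1-2/h1-2H3",
    "atoms": {
        # Atomic numbers: two carbons followed by six hydrogens.
        "elements": {"number": [6, 6, 1, 1, 1, 1, 1, 1]},
    },
    "bonds": {
        # Flat list of atom index pairs: C-C, then three C-H bonds per carbon.
        "connections": {"index": [0, 1, 0, 2, 0, 3, 0, 4, 1, 5, 1, 6, 1, 7]},
        "order": [1, 1, 1, 1, 1, 1, 1],
    },
    "properties": {"molecular mass": 30.07},
}

# Serialize to text (what a .cjson file would hold) and parse it back.
cjson_text = json.dumps(ethane, indent=2)
restored = json.loads(cjson_text)
```

Because the whole molecule is one nested object, the same structure can be written to a file, stored as a database document, or sent in a message without a separate schema layer.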
16. ChemicalJSON in MongoDB
● Nearly identical to what is stored in a file
○ A few extra fields stored
■ 2D diagram (as PNG)
■ Heavy atom count (for substructure searching)
■ Binary fingerprints (for similarity searching)
■ InChIKey for indexing and as a unique key
■ Mongo's OID ("_id") field
● Trivial to write out to a .cjson file:
db.molecules.find({"name" : "ethanol"},
                  {"diagram" : 0,
                   "heavyAtomCount" : 0,
                   "fp2_fingerprint" : 0,
                   "_id" : 0})
17. Open Chemistry with ParaViewWeb
● Uses ParaView's client-server architecture
● Interactive 3D rendering
● Runs in any modern web browser
URL: http://paraviewweb.kitware.com/OpenChemistry/
19. RPC / Avogadro Integration
● Uses JSON-RPC to communicate with other
applications (most notably Avogadro)
● Visualize data directly from the database
● Uses ChemicalJSON to represent molecular
structures and transfer molecular information
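A sketch of what such an exchange could look like, using the JSON-RPC 2.0 message layout. The method name `loadMolecule` and the parameter names `format` and `content` are hypothetical, chosen for illustration, and the transport (socket, pipe, or otherwise) is deliberately left out because the slide does not specify it.

```python
import json
from itertools import count

_request_ids = count(1)  # simple source of unique request ids


def make_request(method, params):
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": next(_request_ids), "method": method, "params": params}


def parse_response(text, expected_id):
    """Return the result of a JSON-RPC 2.0 response, raising on mismatch or error."""
    message = json.loads(text)
    if message.get("id") != expected_id:
        raise ValueError("response id does not match the request id")
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]


# Hypothetical payload: a tiny ChemicalJSON-style molecule (key names assumed).
molecule = {"name": "ethane", "atoms": {"elements": {"number": [6, 6, 1, 1, 1, 1, 1, 1]}}}

# Hypothetical method and parameter names for asking another application to show it.
request = make_request("loadMolecule", {"format": "cjson", "content": molecule})
wire_text = json.dumps(request)

# A reply the receiving application might send back (simulated here).
reply_text = json.dumps({"jsonrpc": "2.0", "id": request["id"], "result": True})
result = parse_response(reply_text, request["id"])
```

Since both the envelope and the molecule are JSON, the structure travels inside the request without any conversion step.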
20. Future Directions
● Direct integration with third-party databases
(PubChem, PDB, ...)
● Broader support for storing and analyzing
computational job results
○ Linked with molecular structures
○ Direct from CML or converted/parsed
● Plugins to facilitate extension
○ Descriptors
○ Visualization
○ Chemical file input/output
● Scaling studies, working with multiple data
servers and terabytes of data