Some "challenges" on the
open-source/open-data front
Along with a few thoughts on solutions
Greg Landrum
MIOSS, Hinxton
May 2016
T5 Informatics GmbH
greg.landrum@t5informatics.com
@dr_greg_landrum
This work is licensed under a
Creative Commons Attribution 4.0
International License.
First things first: what's T5 Informatics?
● Commercial organization built around the open-source RDKit toolkit.
● Very new: founded in March 2016
● Offers maintenance contracts, support, and training for the RDKit, as well as custom development work
● Still very much an experiment
● Some thoughts about the business model here: https://medium.com/@greg.landrum_t5
Background
Flashback to earlier this year
The interoperability problem
The simple, one-slide version
Task: generate a set of standard “Lipinski” parameters for Esomeprazole
[Figure: RDKit output and CDK output side by side, compared on # rotatable bonds, exact mass, AMW, TPSA, calculated logP, # heavy atoms, and donors and acceptors (oh my!)]
Good luck if any of those descriptors are used in your QSAR model and you pick the wrong software.
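As a rough illustration of the RDKit half of this comparison, the sketch below computes a similar descriptor set in Python. The SMILES is written out for this example (an assumption, not taken from the slide) and omits the sulfoxide stereo label, so strictly it describes omeprazole, the racemic parent of esomeprazole. The mapping from the slide's labels to RDKit functions is also one plausible choice among several, which is rather the point: a different toolkit, or a different function in the same toolkit, may define "donors", "acceptors", or "rotatable bonds" differently.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors

# Assumption: omeprazole skeleton without the sulfoxide stereo label;
# esomeprazole is the (S)-enantiomer of this structure.
SMILES = "COc1ccc2[nH]c(nc2c1)S(=O)Cc1ncc(C)c(OC)c1C"


def lipinski_style_descriptors(smiles):
    """Return a dict of descriptors as RDKit defines them."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return {
        "rotatable_bonds": rdMolDescriptors.CalcNumRotatableBonds(mol),
        "exact_mass": Descriptors.ExactMolWt(mol),
        "amw": Descriptors.MolWt(mol),
        "tpsa": rdMolDescriptors.CalcTPSA(mol),
        "logp": Crippen.MolLogP(mol),
        "heavy_atoms": mol.GetNumHeavyAtoms(),
        "h_donors": Lipinski.NumHDonors(mol),
        "h_acceptors": Lipinski.NumHAcceptors(mol),
    }


descriptors = lipinski_style_descriptors(SMILES)
for name, value in descriptors.items():
    print(f"{name}: {value}")
```

Running the same molecule through a second toolkit and comparing the two dictionaries key by key is a quick way to see which descriptors are safe to exchange between tools.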
Looking things up is hard too...
[Screenshots from ChEMBL, PubChem, and ChemSpider]
Amusingly, they all have different structure drawings
The interoperability problem
● Processing chemical and biological data is hard, and people have different workflows.
● We will always be using multiple tools to analyze and present results.
● There are standard algorithms, but different implementations lead to different results.
● One thing that would help is a single implementation that’s usable in many different places.
● If the source is open, it can be archived and packaged to provide reproducibility and allow new work to build on a standard framework.
● This is the approach we’ve taken with the RDKit.
Note: there’s another big mess around file formats and data quality, but that’s the topic for another session (or three).
The RDKit code ecosystem¹
[Diagram: the C++ core (core data structures and algorithms) feeds PostgreSQL directly, Python via Boost.Python, and Java and C# via SWIG; Jupyter, Pandas, and KNIME sit on top of those wrappers]
The exact same implementation is available in all endpoints
¹ “ecodesystem”? Probably not.
● Business-friendly BSD license
● Runs on Linux/Mac/Windows
● Commercial support available
● Releases every six months
● Active and engaged community
● Usable from Python (2 or 3), C++, C#, or Java
● Basic functionality highlights:
○ Chemical reactions
○ 2D depiction
○ Substructure searching
○ Canonical SMILES
○ Gasteiger-Marsili charges
○ Molecular standardization
● 2D Functionality highlights:
○ RECAP and BRICS support
○ Multi-molecule MCS
○ Similarity maps
○ Functional group filters
○ Diversity picking
● Supported fingerprint highlights:
○ Morgan/Feature Morgan (ECFP/FCFP-like)
○ RDKit (Daylight-like)
○ Atom-pairs and topological torsions
○ MACCS keys
○ Avalon
○ Fast similarity searching from FPB files
● Descriptor highlights:
○ Hall-Kier kappa and chi descriptors
○ SLogP, SMR, TPSA
○ MQN
○ “MOE-like” VSA
○ Compositional (number of donors, number of rings, number of heterocycles, etc.)
● 3D Functionality highlights:
○ 2D->3D conversion/conformational analysis via distance geometry
○ UFF and MMFF94/MMFF94S implementations for cleaning up structures
○ Feature maps and feature-map vectors
○ Shape-based similarity
○ RMSD-based molecule-molecule alignment
○ Open3DAlign implementation
○ Integration with PyMOL
○ Torsion Fingerprint Differences
The RDKit
An open-source toolkit for cheminformatics
www.rdkit.org
Let's go back a few slides
End of the flashback
Some questions
1. Where are our most common file/interchange formats actually defined? How do we know what they mean?
2. Do we need new interchange format(s)?
3. How should we standardize molecules?
Question 3: standardizing molecules
● I want to see this molecule the way it'd be stored in PubChem, or ChEMBL, or Open PHACTS, or ...
● I want to standardize this molecule so that I can register it, if necessary
● … but I want to standardize it using my rules.
Looks like we're going to be talking about this tomorrow.
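To make "standardize it using my rules" a little more concrete, here is a minimal sketch built on the `rdMolStandardize` module found in recent RDKit releases. The function name `standardize` and its two switches are hypothetical, chosen for this example only; the particular steps (general cleanup, keep the largest fragment, neutralize charges) are just one possible rule set, not a recommendation and not what any of the databases named above actually do.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize(smiles, strip_counterions=True, neutralize=True):
    """Apply a small, configurable rule set and return canonical SMILES.

    The rule set is illustrative: each caller decides which steps count
    as "my rules" by toggling the keyword arguments.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # General cleanup with RDKit's default standardization parameters.
    mol = rdMolStandardize.Cleanup(mol)
    if strip_counterions:
        # Keep only the largest fragment (drops simple counterions).
        mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    if neutralize:
        # Add or remove hydrogens to neutralize charges where possible.
        mol = rdMolStandardize.Uncharger().uncharge(mol)
    return Chem.MolToSmiles(mol)


# Sodium acetate under two different rule sets.
print(standardize("CC(=O)[O-].[Na+]"))
print(standardize("CC(=O)[O-].[Na+]", neutralize=False))
```

Two groups running this with different switches will register different structures for the same input, which is exactly why the rules need to be explicit and shareable.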
Question 1: formats
● Definitions: what's the syntax? What does this term mean?
○ SMILES:
■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
■ OpenSMILES: http://www.opensmiles.org/opensmiles.html
○ CTAB/MOL/SDF:
■ ctfile.pdf (somewhat publicly available)
■ Various MDL/Symyx/Accelrys manuals (not publicly available)
○ SMARTS:
■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
● Testing/Visualization: is this valid? What does this represent?
○ SMILES: used to be the depict.cgi server.
○ CTAB/MOL/SDF: your most trusted chemical editor, maybe two of them
○ SMARTS: used to be depictmatch.cgi
I picked this subset because I think it covers the most common molecular interchange formats. There are, of course, many other possibilities.
Formats and validation
Reasons you might want this:
● Is "C1.C1" a valid SMILES? What does it correspond to?
● Is "C1CCC=1" a valid SMILES? What does it correspond to?
● What does this mean? [figure]
Amusing fact: there's a 12+ page explanation of how tetrahedral stereochemistry should be handled in MOL blocks in one of those non-public documents.
That's bad enough, and I didn't even talk about S-groups, R-groups, or query features in CTAB/MOL...
… or recursive SMARTS
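One way to probe questions like the two SMILES above is simply to ask a toolkit. The sketch below does that with the RDKit; the helper name `describe_smiles` is made up for this example. Note that this only tells you what one implementation does with the string, not what the format definition says, which is the gap the slide is pointing at.

```python
from rdkit import Chem


def describe_smiles(smiles):
    """Return RDKit's canonical SMILES for the input, or None if rejected."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)


for candidate in ("C1.C1", "C1CCC=1"):
    result = describe_smiles(candidate)
    if result is None:
        print(f"{candidate!r}: rejected by this toolkit")
    else:
        print(f"{candidate!r}: accepted, canonical form {result!r}")
```

A `None` result here means "this parser refused it", which may or may not agree with other toolkits or with the OpenSMILES document.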
A concrete suggestion
● Formats:
○ OpenSMILES: revive this effort and address outstanding questions (already happening)
○ OpenSMARTS: find a group of interested participants, then assemble and publish an open definition (similar to what happened with OpenSMILES).
■ Requires: organizer, participants, sample data
○ OpenCTAB: find a group of interested participants, agree on the subset that will be included, and assemble and publish an open definition
■ Requires: organizer, participants, sample data
● Validation/Visualization:
○ A fully open-source (and permissively licensed) web service that returns images (PNG or SVG) for a provided input in one of the supported formats. This service would ideally have good error reporting to help identify problems in the input
○ A hosted version of this service usable by the community
○ A fully open-source (and permissively licensed) basic web application for providing input and seeing the results
○ A hosted version of the web application
As long as we don't extend any of the formats, we don't need to worry (too much) about adoption or vendor support: it's already there
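The core of such a validation/visualization service could be quite small. The sketch below is one possible shape for it, using the RDKit's SVG drawing code; the function name `render_svg`, the format keys, and the (ok, payload) return convention are all assumptions made for this example, and the HTTP layer is left out entirely. The error messages are deliberately crude: the "good error reporting" asked for above would need considerably more work than a failed parse check.

```python
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

# Hypothetical format keys mapped to RDKit parsers.
PARSERS = {
    "smiles": Chem.MolFromSmiles,
    "mol": Chem.MolFromMolBlock,
}


def render_svg(fmt, text, width=300, height=300):
    """Return (True, svg_text) on success or (False, error_message)."""
    parser = PARSERS.get(fmt)
    if parser is None:
        return False, f"unsupported format: {fmt!r}"
    mol = parser(text)
    if mol is None:
        return False, f"could not parse input as {fmt}"
    drawer = rdMolDraw2D.MolDraw2DSVG(width, height)
    # Generates 2D coordinates if needed and prepares the molecule for drawing.
    rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol)
    drawer.FinishDrawing()
    return True, drawer.GetDrawingText()


ok, payload = render_svg("smiles", "CCO")
print(ok, payload[:60] if ok else payload)
```

A web front end would only need to map the boolean to an HTTP status and set the content type to SVG or plain text accordingly.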
Question 2: new format(s)?
Some possible reasons for this:
● Efficiently storing large groups of molecules with associated data, perhaps including data beyond basic types like text and numbers
● Having something well documented and clear
● Having something a bit easier to parse (for both computers and humans)
● Andrew provided others in his talk
Functional:
● Doing something reasonable with partial or "odd" stereochemistry
● Doing something reasonable with non-traditional bond types (like what you find in organometallics)
Dealing with metals
Just a quick example to show what a train wreck things currently are
Dealing with metals: cisplatin
Dealing with metals: hemin
Representation from DrugBank
Representation from PubChem
A concrete suggestion
OK, really just a collection of bullet points, mainly reasons why this is nuts
● The biggest problem is going to be adoption
● Assumption: anything that is used only (or mostly) by toolkits is going to be easier than anything requiring a sketcher
● Some parts are easier than others:
○ A format for dealing with large numbers of molecules + data is probably not that bad. Adoption is at the toolkit level
○ A format for molecules is harder… It needs support within both sketchers and readers. Oh, and reference data that can be used to develop and validate the format.
● Still, maybe HELM and (maybe) MMTF show that this is possible?
● Get a group of interested people together and start a discussion?
Wrapping up
The questions:
1. Where are our most common file/interchange formats actually defined?
2. Do we need new interchange format(s)?
3. How should we standardize molecules?
And the RDKit:
● Liberally licensed open-source chemistry toolkit accessible from many places
Thanks!
greg.landrum@t5informatics.com
Interested? Want More?
www.rdkit.org
5th User Group meeting 26-28 October in Basel
@RDKit_org
@dr_greg_landrum

Contenu connexe

Tendances

Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
MLconf
 

Tendances (20)

10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems
 
GraphQL & Ratpack
GraphQL & RatpackGraphQL & Ratpack
GraphQL & Ratpack
 
The road ahead for scientific computing with Python
The road ahead for scientific computing with PythonThe road ahead for scientific computing with Python
The road ahead for scientific computing with Python
 
Graph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise GraphGraph Gurus Episode 1: Enterprise Graph
Graph Gurus Episode 1: Enterprise Graph
 
PyData Introduction
PyData IntroductionPyData Introduction
PyData Introduction
 
Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018
 
Graph Gurus Episode 28: In-Database Machine Learning Solution for Real-Time R...
Graph Gurus Episode 28: In-Database Machine Learning Solution for Real-Time R...Graph Gurus Episode 28: In-Database Machine Learning Solution for Real-Time R...
Graph Gurus Episode 28: In-Database Machine Learning Solution for Real-Time R...
 
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
 
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ...
 
On Contracts and Sandboxes for JavaScript
On Contracts and Sandboxes for JavaScriptOn Contracts and Sandboxes for JavaScript
On Contracts and Sandboxes for JavaScript
 
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
 
Samsung SDS OpeniT - The possibility of Python
Samsung SDS OpeniT - The possibility of PythonSamsung SDS OpeniT - The possibility of Python
Samsung SDS OpeniT - The possibility of Python
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
 
Apache flink
Apache flinkApache flink
Apache flink
 
Avogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsAvogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and Semantics
 
Nikhil Garg, Engineering Manager, Quora at MLconf SF 2016
Nikhil Garg, Engineering Manager, Quora at MLconf SF 2016Nikhil Garg, Engineering Manager, Quora at MLconf SF 2016
Nikhil Garg, Engineering Manager, Quora at MLconf SF 2016
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
 
Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...
Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...
Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...
 
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algori...
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algori...Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algori...
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algori...
 
PechaKucha (FormaliSE'2018)
PechaKucha (FormaliSE'2018)PechaKucha (FormaliSE'2018)
PechaKucha (FormaliSE'2018)
 

En vedette

Cuestionario de ava y ova ricardo andres paz acu%80%a0%a0%f1a 2
Cuestionario  de ava y ova ricardo andres paz acu%80%a0%a0%f1a 2Cuestionario  de ava y ova ricardo andres paz acu%80%a0%a0%f1a 2
Cuestionario de ava y ova ricardo andres paz acu%80%a0%a0%f1a 2
richiandres
 
יהודה_הופמן - יועץ_ארגוני
יהודה_הופמן - יועץ_ארגונייהודה_הופמן - יועץ_ארגוני
יהודה_הופמן - יועץ_ארגוני
יהודה הופמן
 
Elaboración de material didáctico maría vede
Elaboración de material didáctico maría vedeElaboración de material didáctico maría vede
Elaboración de material didáctico maría vede
mariavede
 
CV Bolaños 2016
CV Bolaños 2016CV Bolaños 2016
CV Bolaños 2016
Bol nene
 
Public Opinion Landscape - Election 2016
Public Opinion Landscape  - Election 2016 Public Opinion Landscape  - Election 2016
Public Opinion Landscape - Election 2016
GloverParkGroup
 
기조발제 황상민 다양성이 경쟁력이다 인쇄용
기조발제 황상민 다양성이 경쟁력이다 인쇄용기조발제 황상민 다양성이 경쟁력이다 인쇄용
기조발제 황상민 다양성이 경쟁력이다 인쇄용
gojipcap
 
Synthesis of chromium(ii)acetate hydrate
Synthesis of chromium(ii)acetate hydrateSynthesis of chromium(ii)acetate hydrate
Synthesis of chromium(ii)acetate hydrate
Diponegoro University
 

En vedette (15)

Cuestionario de ava y ova ricardo andres paz acu%80%a0%a0%f1a 2
Cuestionario  de ava y ova ricardo andres paz acu%80%a0%a0%f1a 2Cuestionario  de ava y ova ricardo andres paz acu%80%a0%a0%f1a 2
Cuestionario de ava y ova ricardo andres paz acu%80%a0%a0%f1a 2
 
יהודה_הופמן - יועץ_ארגוני
יהודה_הופמן - יועץ_ארגונייהודה_הופמן - יועץ_ארגוני
יהודה_הופמן - יועץ_ארגוני
 
Elaboración de material didáctico maría vede
Elaboración de material didáctico maría vedeElaboración de material didáctico maría vede
Elaboración de material didáctico maría vede
 
CV Bolaños 2016
CV Bolaños 2016CV Bolaños 2016
CV Bolaños 2016
 
111011 jlpt n2_presen
111011 jlpt n2_presen111011 jlpt n2_presen
111011 jlpt n2_presen
 
Public Opinion Landscape - Election 2016
Public Opinion Landscape  - Election 2016 Public Opinion Landscape  - Election 2016
Public Opinion Landscape - Election 2016
 
기조발제 황상민 다양성이 경쟁력이다 인쇄용
기조발제 황상민 다양성이 경쟁력이다 인쇄용기조발제 황상민 다양성이 경쟁력이다 인쇄용
기조발제 황상민 다양성이 경쟁력이다 인쇄용
 
Photos from the Microsoft Challenge
Photos from the Microsoft ChallengePhotos from the Microsoft Challenge
Photos from the Microsoft Challenge
 
Workshop Usability
Workshop UsabilityWorkshop Usability
Workshop Usability
 
Role of IT in Mangement by Prof. Amit Chandra - GSBA College
Role of IT in Mangement by Prof. Amit Chandra - GSBA CollegeRole of IT in Mangement by Prof. Amit Chandra - GSBA College
Role of IT in Mangement by Prof. Amit Chandra - GSBA College
 
Fotos 1°
Fotos 1°Fotos 1°
Fotos 1°
 
Very Technology: Marketing on Mobile Platforms
Very Technology: Marketing on Mobile PlatformsVery Technology: Marketing on Mobile Platforms
Very Technology: Marketing on Mobile Platforms
 
Synthesis of chromium(ii)acetate hydrate
Synthesis of chromium(ii)acetate hydrateSynthesis of chromium(ii)acetate hydrate
Synthesis of chromium(ii)acetate hydrate
 
The tablighi jamat pashto by abul hassan zaid farooqi
The tablighi jamat pashto by abul hassan zaid farooqiThe tablighi jamat pashto by abul hassan zaid farooqi
The tablighi jamat pashto by abul hassan zaid farooqi
 
Virtuoso, The Prometheus of RDF -- Sematics 2014 Conference Keynote
 Virtuoso, The Prometheus of RDF -- Sematics 2014 Conference Keynote Virtuoso, The Prometheus of RDF -- Sematics 2014 Conference Keynote
Virtuoso, The Prometheus of RDF -- Sematics 2014 Conference Keynote
 

Similaire à Some "challenges" on the open-source/open-data front

Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentation
gustavosouto
 

Similaire à Some "challenges" on the open-source/open-data front (20)

ACS San Diego - The RDKit: Open-source cheminformatics
ACS San Diego - The RDKit: Open-source cheminformaticsACS San Diego - The RDKit: Open-source cheminformatics
ACS San Diego - The RDKit: Open-source cheminformatics
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
OpenTelemetry For Architects
OpenTelemetry For ArchitectsOpenTelemetry For Architects
OpenTelemetry For Architects
 
Data Lineage, Property Based Testing & Neo4j
Data Lineage, Property Based Testing & Neo4j Data Lineage, Property Based Testing & Neo4j
Data Lineage, Property Based Testing & Neo4j
 
Msr2021 tutorial-di penta
Msr2021 tutorial-di pentaMsr2021 tutorial-di penta
Msr2021 tutorial-di penta
 
A few questions about large scale machine learning
A few questions about large scale machine learningA few questions about large scale machine learning
A few questions about large scale machine learning
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
GenAi LLMs Zero to Hero: Mastering GenAI
GenAi LLMs Zero to Hero: Mastering GenAIGenAi LLMs Zero to Hero: Mastering GenAI
GenAi LLMs Zero to Hero: Mastering GenAI
 
Model Drift Monitoring using Tensorflow Model Analysis
Model Drift Monitoring using Tensorflow Model AnalysisModel Drift Monitoring using Tensorflow Model Analysis
Model Drift Monitoring using Tensorflow Model Analysis
 
【FIT2016チュートリアル】ここから始める情報処理 ~機械学習編~
【FIT2016チュートリアル】ここから始める情報処理  ~機械学習編~【FIT2016チュートリアル】ここから始める情報処理  ~機械学習編~
【FIT2016チュートリアル】ここから始める情報処理 ~機械学習編~
 
Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentation
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
 
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
 
10 more lessons learned from building Machine Learning systems - MLConf
10 more lessons learned from building Machine Learning systems - MLConf10 more lessons learned from building Machine Learning systems - MLConf
10 more lessons learned from building Machine Learning systems - MLConf
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao Paulo
 
Software Engineering Primer
Software Engineering PrimerSoftware Engineering Primer
Software Engineering Primer
 
How to build and run a big data platform in the 21st century
How to build and run a big data platform in the 21st centuryHow to build and run a big data platform in the 21st century
How to build and run a big data platform in the 21st century
 
Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!
 

Plus de Greg Landrum

How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them? How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
Greg Landrum
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Greg Landrum
 

Plus de Greg Landrum (15)

Chemical registration
Chemical registrationChemical registration
Chemical registration
 
Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022
 
Google BigQuery for analysis of scientific datasets: Interactive exploration ...
Google BigQuery for analysis of scientific datasets: Interactive exploration ...Google BigQuery for analysis of scientific datasets: Interactive exploration ...
Google BigQuery for analysis of scientific datasets: Interactive exploration ...
 
Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)
 
Moving from Artisanal to Industrial Machine Learning
Moving from Artisanal to Industrial Machine LearningMoving from Artisanal to Industrial Machine Learning
Moving from Artisanal to Industrial Machine Learning
 
Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)
 
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them? How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
 
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent data
 
Machine learning in the life sciences with knime
Machine learning in the life sciences with knimeMachine learning in the life sciences with knime
Machine learning in the life sciences with knime
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databases
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
 

Dernier

Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
Silpa
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
Silpa
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Silpa
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 

Dernier (20)

Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Genetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditionsGenetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditions
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 

Some "challenges" on the open-source/open-data front

  • 1. Some "challenges" on the open-source/open-data front Along with a few thoughts on solutions Greg Landrum MIOSS, Hinxton May 2016 T5 Informatics GmbH greg.landrum@t5informatics.com @dr_greg_landrum This work is licensed under a Creative Commons Attribution 4.0 International License.
  • 2. T5 Informatics 2 First things first: what's T5 Informatics? ● Commercial organization built around the open-source RDKit toolkit. ● Very new: founded in March 2016 ● Offers maintenance contracts, support, training, for the RDKit as well as custom development work ● Still very much an experiment ● Some thoughts about the business model here: https://medium.com/@greg. landrum_t5
  • 4. Flashback to earlier this year
  • 5. The interoperability problem: the simple, one-slide version
      Task: generate a set of standard “Lipinski” parameters for Esomeprazole.
      [Figure: RDKit output vs. CDK output for # rotatable bonds, exact mass, AMW, TPSA, calculated logP, and # heavy atoms]
      Donors and acceptors, oh my!
      Good luck if any of those descriptors are used in your QSAR model and you pick the wrong software.
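Two of the descriptors on this slide, exact mass and AMW, can be reproduced by hand, which makes it easier to see what a toolkit is supposed to be reporting. The following is a minimal stdlib-only Python sketch, not RDKit or CDK code. It assumes the molecular formula C17H19N3O3S for esomeprazole, abridged standard atomic weights for the average mass, and most-abundant-isotope masses for the exact mass. The function and table names are invented for illustration.

```python
# Abridged standard atomic weights (g/mol). Assumed table; other tables differ slightly.
AVERAGE_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

# Masses of the most abundant isotopes (12C, 1H, 14N, 16O, 32S), in unified atomic mass units.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
}

# Esomeprazole, C17H19N3O3S (typed in by hand here, not read from a structure file).
ESOMEPRAZOLE = {"C": 17, "H": 19, "N": 3, "O": 3, "S": 1}


def formula_mass(formula, table):
    """Sum count * mass over the elements of a formula given as {symbol: count}."""
    return sum(count * table[symbol] for symbol, count in formula.items())


amw = formula_mass(ESOMEPRAZOLE, AVERAGE_MASS)
exact = formula_mass(ESOMEPRAZOLE, MONOISOTOPIC_MASS)
heavy_atoms = sum(n for symbol, n in ESOMEPRAZOLE.items() if symbol != "H")

print(f"AMW         {amw:.3f}")
print(f"Exact mass  {exact:.4f}")
print(f"Heavy atoms {heavy_atoms}")
```

The two masses differ by roughly 0.3, so mixing them up shifts a descriptor systematically. The choice of atomic weight table is one place where two implementations of the "same" descriptor can diverge in the trailing digits. Descriptors such as donor and acceptor counts depend on definitions that a sketch like this cannot capture.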
  • 6. Looking things up is hard too...
      [Screenshots: ChEMBL, PubChem, ChemSpider]
      Amusingly, they all have different structure drawings.
  • 7. The interoperability problem
      ● Processing chemical and biological data is hard, and people have different workflows.
      ● We will always be using multiple tools to analyze and present results.
      ● There are standard algorithms, but different implementations lead to different results.
      ● One thing that would help is a single implementation that's usable in many different places.
      ● If the source is open, it can be archived and packaged to provide reproducibility and allow new work to build on a standard framework.
      ● This is the approach we've taken with the RDKit.
      Note: there's another big mess around file formats and data quality, but that's the topic for another session (or three).
  • 8. The RDKit code ecosystem (1)
      [Diagram: C++ (core data structures and algorithms); PostgreSQL, Boost.Python, SWIG; Python, Java, C#; Jupyter, Pandas, KNIME]
      The exact same implementation is available in all endpoints.
      (1) “ecodesystem”? Probably not.
  • 9. The RDKit: an open-source toolkit for cheminformatics (www.rdkit.org)
      ● Business-friendly BSD license
      ● Runs on Linux/Mac/Windows
      ● Commercial support available
      ● Releases every six months
      ● Active and engaged community
      ● Usable from Python (2 or 3), C++, C#, or Java
      ● Basic functionality highlights:
          ○ Chemical reactions
          ○ 2D depiction
          ○ Substructure searching
          ○ Canonical SMILES
          ○ Gasteiger-Marsili charges
          ○ Molecular standardization
      ● 2D functionality highlights:
          ○ RECAP and BRICS support
          ○ Multi-molecule MCS
          ○ Similarity maps
          ○ Functional group filters
          ○ Diversity picking
      ● Supported fingerprint highlights:
          ○ Morgan/Feature Morgan (ECFP/FCFP-like)
          ○ RDKit (Daylight-like)
          ○ Atom-pairs and topological torsions
          ○ MACCS keys
          ○ Avalon
          ○ Fast similarity searching from FPB files
      ● Descriptor highlights:
          ○ Hall-Kier and descriptors
          ○ SLogP, SMR, TPSA
          ○ MQN
          ○ “MOE-like” VSA
          ○ Compositional (number of donors, number of rings, number of heterocycles, etc.)
      ● 3D functionality highlights:
          ○ 2D->3D conversion/conformational analysis via distance geometry
          ○ UFF and MMFF94/MMFF94S implementations for cleaning up structures
          ○ Feature maps and feature-map vectors
          ○ Shape-based similarity
          ○ RMSD-based molecule-molecule alignment
          ○ Open3DAlign implementation
          ○ Integration with PyMOL
          ○ Torsion Fingerprint Differences
  • 10. Let's go back a few slides
  • 11. End of the flashback
  • 12. Some questions
      1. Where are our most common file/interchange formats actually defined? How do we know what they mean?
      2. Do we need new interchange format(s)?
      3. How should we standardize molecules?
  • 13. Question 3: standardizing molecules
      ● I want to see this molecule the way it'd be stored in PubChem, or ChEMBL, or OpenPhacts, or ...
      ● I want to standardize this molecule so that I can register it, if necessary
      ● … but I want to standardize it using my rules.
      Looks like we're going to be talking about this tomorrow.
  • 14. Question 1: formats
      ● Definitions (what's the syntax? what does this term mean?):
          ○ SMILES:
              ■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
              ■ OpenSMILES: http://www.opensmiles.org/opensmiles.html
          ○ CTAB/MOL/SDF:
              ■ ctfile.pdf (somewhat publicly available)
              ■ Various MDL/Symyx/Accelrys manuals (not publicly available)
          ○ SMARTS:
              ■ Daylight's reference: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
      ● Testing/Visualization (is this valid? what does this represent?):
          ○ SMILES: used to be the depict.cgi server
          ○ CTAB/MOL/SDF: your most trusted chemical editor, maybe two of them
          ○ SMARTS: used to be depictmatch.cgi
      I picked this subset because I think it covers the most common molecular interchange formats. There are of course many other possibilities.
  • 15. Formats and validation
      Reasons you might want this:
      ● Is "C1.C1" a valid SMILES? What does it correspond to?
      ● Is "C1CCC=1" a valid SMILES? What does it correspond to?
      ● What does this mean?
      Amusing fact: there's a 12+ page explanation of how tetrahedral stereochemistry should be handled in MOL blocks in one of those non-public documents.
      That's bad enough, and I didn't even talk about S-groups, R-groups, or query features in CTAB/MOL... … or recursive SMARTS
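The two SMILES questions above can be made concrete with a small ring-closure bookkeeping sketch. It is stdlib-only Python and handles only a tiny subset: unbracketed organic-subset atoms, `-`, `=`, `#`, `.`, and single-digit ring closures, with no branches, aromatic atoms, or stereo. It follows the OpenSMILES reading, in which a ring-closure digit simply bonds the two atoms that carry it and the bond symbol may appear at either end. `parse_bonds` is a name invented for this example, not a toolkit API, and whether a given toolkit accepts these strings is a separate question.

```python
import re

# Tokens of the tiny subset handled here (assumption: no brackets, branches, or %nn).
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]|\.|\d")


def parse_bonds(smiles):
    """Return (atoms, bonds); bonds are (atom_index_a, atom_index_b, bond_symbol)."""
    atoms = []        # element symbols in order of appearance
    bonds = []
    open_rings = {}   # ring digit -> (atom index, bond symbol or None)
    prev = None       # index of the previous atom in the current chain
    pending = None    # explicit bond symbol waiting for its second atom
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported character at position {pos}: {smiles[pos]!r}")
        tok = match.group()
        pos = match.end()
        if tok == ".":
            # A dot means "no bond to the next atom"; open ring digits stay open.
            prev, pending = None, None
        elif tok in "-=#":
            pending = tok
        elif tok.isdigit():
            if prev is None:
                raise ValueError(f"ring digit {tok} has no atom to attach to")
            if tok in open_rings:
                other, other_bond = open_rings.pop(tok)
                if other == prev:
                    raise ValueError(f"ring digit {tok} closes on its own atom")
                if pending and other_bond and pending != other_bond:
                    raise ValueError(f"conflicting bond symbols on ring digit {tok}")
                # "-" stands in for the default bond when neither end names one.
                bonds.append((other, prev, pending or other_bond or "-"))
            else:
                open_rings[tok] = (prev, pending)
            pending = None
        else:
            atoms.append(tok)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, pending or "-"))
            prev, pending = index, None
    if open_rings:
        raise ValueError(f"unclosed ring digits: {sorted(open_rings)}")
    return atoms, bonds


for text in ("C1.C1", "C1CCC=1"):
    print(text, parse_bonds(text))
```

Under that reading, `C1.C1` comes out as two carbons joined by a single bond (ethane), and `C1CCC=1` as a four-membered ring with one double bond (cyclobutene).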
  • 16. A concrete suggestion
      ● Formats:
          ○ OpenSMILES: revive this effort and address outstanding questions (already happening)
          ○ OpenSMARTS: find a group of interested participants and assemble and publish an open definition (similar to what happened with OpenSMILES).
              ■ Requires: organizer, participants, sample data
          ○ OpenCTAB: find a group of interested participants, agree on the subset that will be included, and assemble and publish an open definition
              ■ Requires: organizer, participants, sample data
      ● Validation/Visualization:
          ○ A fully open-source (and permissively licensed) web service that returns images (PNG or SVG) for a provided input in one of the supported formats. This service would ideally have good error reporting to help identify problems in the input
          ○ A hosted version of this service usable by the community
          ○ A fully open-source (and permissively licensed) basic web application for providing input and seeing the results
          ○ A hosted version of the web application
      As long as we don't extend any of the formats, we don't need to worry (too much) about adoption or vendor support: it's already there.
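One way to picture the validation/visualization service is to separate request handling and error reporting from the depiction itself. The sketch below is stdlib-only Python and covers only the first half: depiction is left to a pluggable callable, which a real service would back with a cheminformatics toolkit. Everything here is hypothetical, including the function name, the format names, the HTTP status codes, and the JSON error shape. None of it comes from the talk or from an existing service.

```python
import json

# Format names are an assumption; the slide only lists SMILES, SMARTS, and CTAB.
SUPPORTED_FORMATS = ("smiles", "smarts", "ctab")


def handle_depict_request(fmt, text, renderers):
    """Return (http_status, content_type, body) for one depiction request.

    `renderers` maps a format name to a callable that takes the input text and
    returns SVG text, raising ValueError with a readable message on bad input.
    """
    def error(status, message):
        # Echo the input back so the caller can see exactly what was rejected.
        body = json.dumps({"format": fmt, "input": text, "error": message})
        return status, "application/json", body

    if fmt not in SUPPORTED_FORMATS:
        return error(400, "unsupported format; expected one of: " + ", ".join(SUPPORTED_FORMATS))
    if fmt not in renderers:
        return error(501, "no renderer installed for " + fmt)
    if not text.strip():
        return error(400, "empty input")
    try:
        svg = renderers[fmt](text)
    except ValueError as exc:
        return error(422, str(exc))
    return 200, "image/svg+xml", svg


def placeholder_renderer(text):
    """Stand-in for a toolkit-backed depiction call; it only checks for '?'."""
    if "?" in text:
        raise ValueError(f"unexpected character '?' at position {text.index('?')}")
    return "<svg xmlns='http://www.w3.org/2000/svg'/>"


print(handle_depict_request("smiles", "CCO", {"smiles": placeholder_renderer}))
print(handle_depict_request("smiles", "C?O", {"smiles": placeholder_renderer}))
```

Keeping the renderer behind a plain callable means the same handler could sit in front of more than one toolkit, which fits the goal of checking what an input means independently of any single implementation.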
  • 17. Question 2: new format(s)?
      Some possible reasons for this:
      ● Efficiently storing large groups of molecules with associated data. Perhaps data beyond basic types like text and numbers
      ● Having something well documented and clear
      ● Having something a bit easier to parse (for both computers and humans)
      ● Andrew provided others in his talk
      Functional:
      ● Doing something reasonable with partial or "odd" stereochemistry
      ● Doing something reasonable with non-traditional bond types (like what you find in organometallics)
  • 18. Dealing with metals: just a quick example to show what a train-wreck things currently are
  • 19. Dealing with metals: cisplatin
  • 20. Dealing with metals: cisplatin
  • 21. Dealing with metals: cisplatin
  • 22. Dealing with metals: cisplatin
  • 23. Dealing with metals: hemin [Images: representation from DrugBank; representation from PubChem]
  • 24. Dealing with metals: hemin
  • 25. A concrete suggestion
      Ok, really just a collection of bullet points, mainly reasons why this is nuts
      ● The biggest problem is going to be adoption
      ● Assumption: anything that is used only (or mostly) by toolkits is going to be easier than anything requiring a sketcher
      ● Some parts are easier than others:
          ○ A format for dealing with large numbers of molecules + data is probably not that bad. Adoption is at the toolkit level
          ○ A format for molecules is harder… It needs support within both sketchers and readers. Oh, and reference data that can be used to develop and validate the format.
      ● Still, maybe HELM and (maybe) MMTF show that this is possible?
      ● Get a group of interested people together and start a discussion?
  • 26. Wrapping up
      The questions:
      1. Where are our most common file/interchange formats actually defined?
      2. Do we need new interchange format(s)?
      3. How should we standardize molecules?
      And the RDKit:
      ● Liberally licensed open-source chemistry toolkit accessible from many places
  • 27. Thanks! greg.landrum@t5informatics.com
      Interested? Want more? www.rdkit.org
      5th User Group meeting, 26-28 October in Basel
      @RDKit_org, @dr_greg_landrum