SlideShare une entreprise Scribd logo
1  sur  32
Télécharger pour lire hors ligne
minutes
Python for Chemistry in 21 days



                Dr. Noel O'Boyle

   Dr. John Mitchell and Prof. Peter Murray-Rust

                 UCC Talk, Sept 2005
 Available at http://www-mitchell.ch.cam.ac.uk/noel/
Introduction
●   This talk will cover
     –   what Python is
     –   why you should be interested in it
     –   how you can use Python in chemistry
Introduction
●   This talk will cover
     –   what Python is
     –   why you should be interested in it
     –   how you can use Python in chemistry
●   This talk will not cover
     –   how to program in Python
●   See references at end of talk



                       Word of warning!!
“My mental eye could now distinguish larger structures, of
manifold conformation; long rows, sometimes more closely
fitted together; all twining and twisting in snakelike motion.
But look! What was that? One of the snakes had seized
hold of its own tail, and the form whirled mockingly before
my eyes. As if by the flash of lightning I awoke...Let us learn
to dream, gentlemen”
                          Friedrich August Kekulé (1829-1896)
What is Python?
●   For a computer scientist...
    –   a high-level programming language
    –   interpreted (byte-compiled)
    –   dynamic typing
    –   object-oriented
What is Python?
●   For a computer scientist...
    –   a high-level programming language
    –   interpreted (byte-compiled)
    –   dynamic typing
    –   object-oriented
●   For everyone else...
    –   a scripting language (like Perl or Ruby) released by
        Guido von Rossum in 1991
    –   easy to learn
    –   easy to read (!)
What is Python?
●   For a computer scientist...
    –   a high-level programming language
    –   interpreted (byte-compiled)
    –   dynamic typing
    –   object-oriented
●   For everyone else...
    –   a scripting language (like Perl or Ruby) released by
        Guido von Rossum in 1991
    –   easy to learn
    –   easy to read (!)
    –   named after Cambridge comedians
The Great Debate

Sir Lancelot:
We were in the nick of time. You were in great Perl.
Sir Galahad:
I don't think I was.
Sir Lancelot:
You were, Sir Galahad. You were in terrible Perl.
Sir Galahad:
Look, let me go back in there and face the Perl.
Sir Lancelot:
No, it's too perilous.

                      (adapted from Monty Python and the Holy Grail)
Why you should be interested (1)

●   Python has been adopted by the cheminformatics community
●   For example, AstraZeneca has moved some of its codebase
    from 'the other scripting language' to Python


Job Description – Research Software Developer/Informatician
[section deleted]
Required Skills:

At least one object oriented programming language, e.g., Python, C++, Java.
Web-based application development (design/construction/maintenance)
UNIX, UNIX scripting & Linux OS

                            Position in AstraZeneca R&D, 02-09-05
Why you should be interested (2)

●   Scientific computing: scipy/pylab (like Matlab)
●   Molecular dynamics: MMTK
●   Statistics: scipy, rpy (R), pychem
●   3D-visualisation: VTK (mayavi)
●   2D-visualisation: scipy, pylab, rpy
●   coming soon, a wrapper around OpenBabel
●   cheminformatics: OEChem, frowns, PyDaylight, pychem
●   bioinformatics: BioPython
●   structural biology: PyMOL
●   computational chemistry: GaussSum
●   you can still use Java libraries...like the CDK
>>> from scipy import *
         >>> from pylab import *
         >>> a = arange(0,4)       > a <- seq(0,3)
         >>> a                     >a
scipy/                                               R
         [0,1,2,3]                 [1] 0 1 2 3
pylab    >>> mean(a)               > mean(a)
         1.5                       [1] 1.5
         >>> a**2                  > a**2
         [0,1,4,9]                 [1] 0 1 4 9
         >>> plot(a,a**2)          > plot(a,a**2)
         >>> show()                > lines(a,a**2)
Scipy
–   cluster: information theory functions (currently, vq and kmeans)
–   weave: compilation of numeric expressions to C++ for fast execution
–   cow: parallel programming via a Cluster Of Workstations
–   fftpack: fast Fourier transform module based on fftpack and fftw when
    available
–   ga: genetic algorithms
–   io: reading and writing numeric arrays, MATLAB .mat, and Matrix
    Market .mtx files
–   integrate: numeric integration for bounded and unbounded ranges. ODE
    solvers.
–   interpolate: interpolation of values from a sample data set.
–   optimize: constrained and unconstrained optimization methods and
    root-finding algorithms
–   signal: signal processing (1-D and 2-D filtering, filter design, LTI
    systems, etc.)
–   special: special function types (bessel, gamma, airy, etc.)
–   stats: statistical functions (stdev, var, mean, etc.)
–   linalg: linear algebra and BLAS routines based on the ATLAS
    implementation of LAPACK
–   sparse: Some sparse matrix support. LU factorization and solving
    sparse linear systems
Scipy statistical functions

●   descriptive statistics: variance, standard deviation, standard error, mean,
    mode, median
●   correlation: Pearson r, Spearman r, Kendall tau
●   statistical tests: chi-squared, t-tests, binomial, Wilcoxon, Kruskal,
    Kolmogorov-Smirnov, Anderson, etc.

●   linear regression

●   analysis of variance (ANOVA)

●   (and more)
pylab
pychem: Using scipy for chemoinformatics

●   Many multivariate analysis techniques are based on matrix
    algebra
●   scipy has wrappers around well-known C and Fortran
    numerical libraries (ATLAS, LAPACK)
●   pychem can do:
     –   principal component analysis, partial least squares
         regression, Fisher's discriminant analysis
     –   clustering: k-means, hierarchical (used Open Source
         clustering library)
     –   feature selection: genetic algorithms with PLS
     –   additional methods can be added
Python and R
  ●   Advantages of R:
       –   a large number of statistical libraries are available
  ●   Disadvantages of R:
       –   difficult to write algorithms
       –   slow (most R libraries are written in C)
       –   chokes on large datasets (use scan instead of read.table)

           Reading in data                   Principal component analysis
Method         300K    600K    1.6M        Method         300K    600K     1.6M
Python             6.8    13.9      41     Python             2.2      3.6      42
R (read.table)      42     105             R (read.table)       5       10
R (scan)             9      20      56     R (scan)             3        5      29




                                http://www.redbrick.dcu.ie/~noel/RversusPython.html
Python and R
●   rpy module allows Python programs to interface with R
●   have the best of both worlds
     –   access to the statistical functions of R
     –   access to the numerous modules available for Python
     –   can program in Python, instead of in R!!

             Python                                    R
>>> from rpy import r
>>> x = [5.05, 6.75, 3.21, 2.66]         > x <- c(5.05, 6.75, 3.21, 2.66)
>>> y = [1.65, 26.5, -5.93, 7.96]        > y <- c(1.65, 26.5, -5.93, 7.96)
>>> print r.lsfit(x,y)['coefficients']   > lsfit(x, y)$coefficients
{'X': 5.3935773611970212,                 Intercept        X
 'Intercept': -16.281127993087839}       -16.281128 5.393577
Python and R
                                             > hc$merge
Problem: Analyse a hierarchical clustering         [,1] [,2]
                                               [1,] -32 -33
Solution: Use R to cluster, and Python to      [2,] -39 -71
analyse the merge object of the cluster        [3,] -43 -47
                                               [4,] -10 -55
                                               [5,] -19 -36
                                               [6,] -5 -24
                                               [7,] -62 -63
                                               [8,] -74 -75
                                               [9,] -35 -76
                                              [10,] -1 -84
                                              [11,] -41 -42
                                              [12,] -83 -96
                                              [13,] -2 -29
                                              [14,] -7 -21
                                              [15,] -61 4
Problem 1
Graphically show the distribution of molecular weights of
molecules in an SD file. The molecular weight is stored in
a field of the SD file.

1,2-Diaminoethane
  MOE2004           3D

  6 3 0 0 0 0 0 0 0 0999 V2000
   -0.6900   -0.6620  0.0000 C 0   0   0   0   0   0   0   0   0   0   0   0
    0.5850   -1.9590  0.8240 H 0   0   0   0   0   0   0   0   0   0   0   0
    0.5350    1.5040  0.0000 N 0   0   0   0   0   0   0   0   0   0   0   0
    0.6060    2.0500  0.8460 H 0   0   0   0   0   0   0   0   0   0   0   0
    1.3040    0.8520 -0.0460 H 0   0   0   0   0   0   0   0   0   0   0   0
  1 2 1 0 0 0 0
  1 3 1 0 0 0 0
  1 4 1 0 0 0 0
M END
> <chem.name>
1,2-Diaminoethane

> <molecular.weight>
60.0995

$$$$
Object-oriented approach

 object                        SD file



 object           Molecule 1      Molecule 2      Molecule n


attributes                               fields
               name        atoms

  The code for creating these objects is stored in sdparser.py,
  which can be imported and used by any scripts that need to
  parse SD files.
Solution

from sdparser import SDFile
from rpy import *

inputfile = "mddr_complete.sd"

allmolweights = []

for molecule in SDFile(inputfile):
   molweight = molecule.fields['molecular.weight']
   allmolweights.append(float(molweight))

r.png(file="molwt_r.png")
r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red")
r.dev_off()
Solution

from sdparser import SDFile
from rpy import *

inputfile = "mddr_complete.sd"

allmolweights = []

for molecule in SDFile(inputfile):
   molweight = molecule.fields['molecular.weight']
   allmolweights.append(float(molweight))

r.png(file="molwt_r.png")
r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red")
r.dev_off()
Solution

from sdparser import SDFile
from rpy import *

inputfile = "mddr_complete.sd"

allmolweights = []

for molecule in SDFile(inputfile):
   molweight = molecule.fields['molecular.weight']
   allmolweights.append(float(molweight))

r.png(file="molwt_r.png")
r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red")
r.dev_off()
Solution

from sdparser import SDFile
from rpy import *

inputfile = "mddr_complete.sd"

allmolweights = []

for molecule in SDFile(inputfile):
   molweight = molecule.fields['molecular.weight']
   allmolweights.append(float(molweight))

r.png(file="molwt_r.png")
r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red")
r.dev_off()
Problem 2
Every molecule in an SD file is missing the name. To be
compatible with proprietary program X, we need to set the
name equal to the value of the field “chem.name”.

(MISSING NAME!)
  MOE2004          3D

  6 3 0 0 0 0 0 0 0 0999 V2000
   -0.6900   -0.6620  0.0000 C 0   0   0   0   0   0   0   0   0   0   0   0
    0.5850   -1.9590  0.8240 H 0   0   0   0   0   0   0   0   0   0   0   0
    0.5350    1.5040  0.0000 N 0   0   0   0   0   0   0   0   0   0   0   0
    0.6060    2.0500  0.8460 H 0   0   0   0   0   0   0   0   0   0   0   0
    1.3040    0.8520 -0.0460 H 0   0   0   0   0   0   0   0   0   0   0   0
  1 2 1 0 0 0 0
  1 3 1 0 0 0 0
  1 4 1 0 0 0 0
M END
> <chem.name>
1,2-Diaminoethane

> <molecular.weight>
60.0995

$$$$
Solution

from sdparser import SDFile

inputfile = "mddr_complete.sd"
outputfile = "mddr_withnames.sd"

for molecule in SDFile(inputfile):
   molecule.name = molecule.fields['chemical.name']
   outputfile.write(molecule)

inputfile.close()
outputfile.close()

print "We are the knights who say....SD!!!"
Solution

from sdparser import SDFile

inputfile = "mddr_complete.sd"
outputfile = "mddr_withnames.sd"

for molecule in SDFile(inputfile):
   molecule.name = molecule.fields['chemical.name']
   outputfile.write(molecule)

inputfile.close()
outputfile.close()

print "We are the knights who say....SD!!!"
Python and Java
 ●  It's easy to use Java libraries from Python
      – using either Jython or JPype
      – see http://www.redbrick.dcu.ie/~noel/CDKJython.html

Example: using the CDK to calculate the number of rings in a molecule
(given a string variable containing CML)
from jpype import *
startJVM("jdk1.5.0_03/jre/lib/i386/server/libjvm.so")
cdk = JPackage("org").openscience.cdk
SSSRFinder = cdk.ringsearch.SSSRFinder
CMLReader = cdk.io.CMLReader

def getNumRings(molecule):
  # Convert to a CDK molecule
  reader = CMLReader(java.io.StringReader(molXmlValue))
  chemFile = reader.read(cdk.ChemFile())
  cdkMol = chemFile.getChemSequence(0).getChemModel(0).getSetOfMolecules().getMolecule(0)
  # Calculate the number of rings
  sssrFinder = SSSRFinder(cdkMol)
  sssr = sssrFinder.findSSSR().size()
  return sssr
3D visualisation
●   VTK (Visualisation Toolkit) from Kitware
    –   open source, freely available
    –   scalar, tensor, vector and volumetric methods
    –   advanced modeling techniques such as implicit modelling,
        polygon reduction, mesh smoothing, cutting, contouring, and
        Delaunay triangulation
●   MayaVi
    –   easy to use GUI interface to VTK, written in Python
    –   can create input files and visualise them using Python scripts

                                 Demo
Python Resources
●   http://www.python.org
●   Guido's Tutorial
     –   http://www.python.org/doc/current/tut/tut.html
●   O'Reilly's “Learning Python” or Visual Quickstart Guide
    to Python
     –   Make sure it's Python 2.3 or 2.4 though
●   For Windows, consider the Enthought edition
     –   http://www.enthought.com/
Python for Chemistry

Contenu connexe

Tendances

General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics Hardware
Daniel Blezek
 
Creating Profiling Tools to Analyze and Optimize FiPy Presentation
Creating Profiling Tools to Analyze and Optimize FiPy PresentationCreating Profiling Tools to Analyze and Optimize FiPy Presentation
Creating Profiling Tools to Analyze and Optimize FiPy Presentation
dmurali2
 

Tendances (20)

.NET Fest 2019. Николай Балакин. Микрооптимизации в мире .NET
.NET Fest 2019. Николай Балакин. Микрооптимизации в мире .NET.NET Fest 2019. Николай Балакин. Микрооптимизации в мире .NET
.NET Fest 2019. Николай Балакин. Микрооптимизации в мире .NET
 
The Ring programming language version 1.2 book - Part 83 of 84
The Ring programming language version 1.2 book - Part 83 of 84The Ring programming language version 1.2 book - Part 83 of 84
The Ring programming language version 1.2 book - Part 83 of 84
 
The Ring programming language version 1.2 book - Part 80 of 84
The Ring programming language version 1.2 book - Part 80 of 84The Ring programming language version 1.2 book - Part 80 of 84
The Ring programming language version 1.2 book - Part 80 of 84
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics Hardware
 
Distributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with RayDistributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with Ray
 
A compact bytecode format for JavaScriptCore
A compact bytecode format for JavaScriptCoreA compact bytecode format for JavaScriptCore
A compact bytecode format for JavaScriptCore
 
The Ring programming language version 1.2 book - Part 82 of 84
The Ring programming language version 1.2 book - Part 82 of 84The Ring programming language version 1.2 book - Part 82 of 84
The Ring programming language version 1.2 book - Part 82 of 84
 
Multiprocessing with python
Multiprocessing with pythonMultiprocessing with python
Multiprocessing with python
 
PyTorch crash course
PyTorch crash coursePyTorch crash course
PyTorch crash course
 
The Ring programming language version 1.2 book - Part 84 of 84
The Ring programming language version 1.2 book - Part 84 of 84The Ring programming language version 1.2 book - Part 84 of 84
The Ring programming language version 1.2 book - Part 84 of 84
 
Deep Learning in the Wild with Arno Candel
Deep Learning in the Wild with Arno CandelDeep Learning in the Wild with Arno Candel
Deep Learning in the Wild with Arno Candel
 
RDKit Gems
RDKit GemsRDKit Gems
RDKit Gems
 
Collections forceawakens
Collections forceawakensCollections forceawakens
Collections forceawakens
 
Creating Profiling Tools to Analyze and Optimize FiPy Presentation
Creating Profiling Tools to Analyze and Optimize FiPy PresentationCreating Profiling Tools to Analyze and Optimize FiPy Presentation
Creating Profiling Tools to Analyze and Optimize FiPy Presentation
 
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
 
Basanta jtr2009
Basanta jtr2009Basanta jtr2009
Basanta jtr2009
 
[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務
 
2011.jtr.pbasanta.
2011.jtr.pbasanta.2011.jtr.pbasanta.
2011.jtr.pbasanta.
 
Genetic Programming in Python
Genetic Programming in PythonGenetic Programming in Python
Genetic Programming in Python
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017
 

En vedette

Social Media Tactics for Rotarians-7070 2011 DC
Social Media Tactics for Rotarians-7070 2011 DCSocial Media Tactics for Rotarians-7070 2011 DC
Social Media Tactics for Rotarians-7070 2011 DC
Terepac Corporation
 
Inps lecce invalidi civili
Inps lecce invalidi civiliInps lecce invalidi civili
Inps lecce invalidi civili
guestc8825e
 
Motilal Oswal MOST50
Motilal  Oswal MOST50Motilal  Oswal MOST50
Motilal Oswal MOST50
granny2010
 
Umnenge Lodge De Kelders
Umnenge Lodge De KeldersUmnenge Lodge De Kelders
Umnenge Lodge De Kelders
umnenge
 
Breathing systems open circuit- shoeib
Breathing systems  open circuit- shoeibBreathing systems  open circuit- shoeib
Breathing systems open circuit- shoeib
havalprit
 
Anaesthetic mgt for pt with hyperthyroidism pritam
Anaesthetic mgt for pt with hyperthyroidism  pritamAnaesthetic mgt for pt with hyperthyroidism  pritam
Anaesthetic mgt for pt with hyperthyroidism pritam
havalprit
 
Sterilization of ot & ot equipments pritam
Sterilization of ot & ot equipments  pritamSterilization of ot & ot equipments  pritam
Sterilization of ot & ot equipments pritam
havalprit
 
Face masks, laryngeal tube, airways yuvaraj
Face masks, laryngeal tube, airways  yuvarajFace masks, laryngeal tube, airways  yuvaraj
Face masks, laryngeal tube, airways yuvaraj
havalprit
 
Anatomy & physiology of neuromuscular junction & monitoring
Anatomy & physiology of neuromuscular junction & monitoringAnatomy & physiology of neuromuscular junction & monitoring
Anatomy & physiology of neuromuscular junction & monitoring
havalprit
 

En vedette (19)

Evaluation
EvaluationEvaluation
Evaluation
 
Ogs designs
Ogs designsOgs designs
Ogs designs
 
Reliance Small Cap Fund
Reliance Small Cap Fund Reliance Small Cap Fund
Reliance Small Cap Fund
 
Slideshare atg
Slideshare atgSlideshare atg
Slideshare atg
 
Review Reliance Index Fund - Nifty Plan
Review Reliance Index Fund - Nifty Plan Review Reliance Index Fund - Nifty Plan
Review Reliance Index Fund - Nifty Plan
 
2010 District 7070 Conference - Social Media
2010 District 7070 Conference - Social Media2010 District 7070 Conference - Social Media
2010 District 7070 Conference - Social Media
 
Benchmark InfraBees Fund
Benchmark InfraBees FundBenchmark InfraBees Fund
Benchmark InfraBees Fund
 
Social Media Tactics for Rotarians-7070 2011 DC
Social Media Tactics for Rotarians-7070 2011 DCSocial Media Tactics for Rotarians-7070 2011 DC
Social Media Tactics for Rotarians-7070 2011 DC
 
Inps lecce invalidi civili
Inps lecce invalidi civiliInps lecce invalidi civili
Inps lecce invalidi civili
 
Motilal Oswal MOST50
Motilal  Oswal MOST50Motilal  Oswal MOST50
Motilal Oswal MOST50
 
Umnenge Lodge De Kelders
Umnenge Lodge De KeldersUmnenge Lodge De Kelders
Umnenge Lodge De Kelders
 
Baroda Pioneer Infrastructure Fund Presentation
Baroda Pioneer Infrastructure Fund PresentationBaroda Pioneer Infrastructure Fund Presentation
Baroda Pioneer Infrastructure Fund Presentation
 
Social Media & Rotary
Social Media & RotarySocial Media & Rotary
Social Media & Rotary
 
Breathing systems open circuit- shoeib
Breathing systems  open circuit- shoeibBreathing systems  open circuit- shoeib
Breathing systems open circuit- shoeib
 
Reliance Small Cap Fund
Reliance Small Cap FundReliance Small Cap Fund
Reliance Small Cap Fund
 
Anaesthetic mgt for pt with hyperthyroidism pritam
Anaesthetic mgt for pt with hyperthyroidism  pritamAnaesthetic mgt for pt with hyperthyroidism  pritam
Anaesthetic mgt for pt with hyperthyroidism pritam
 
Sterilization of ot & ot equipments pritam
Sterilization of ot & ot equipments  pritamSterilization of ot & ot equipments  pritam
Sterilization of ot & ot equipments pritam
 
Face masks, laryngeal tube, airways yuvaraj
Face masks, laryngeal tube, airways  yuvarajFace masks, laryngeal tube, airways  yuvaraj
Face masks, laryngeal tube, airways yuvaraj
 
Anatomy & physiology of neuromuscular junction & monitoring
Anatomy & physiology of neuromuscular junction & monitoringAnatomy & physiology of neuromuscular junction & monitoring
Anatomy & physiology of neuromuscular junction & monitoring
 

Similaire à Python for Chemistry

Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009
A Jorge Garcia
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
ActiveState
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
g3_nittala
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer
tirlukachaitanya
 
DRUG - RDSTK Talk
DRUG - RDSTK TalkDRUG - RDSTK Talk
DRUG - RDSTK Talk
rtelmore
 

Similaire à Python for Chemistry (20)

Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009
 
Python Orientation
Python OrientationPython Orientation
Python Orientation
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
 
IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
Using R in remote computer clusters
Using R in remote computer clustersUsing R in remote computer clusters
Using R in remote computer clusters
 
RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in R
 
DS LAB MANUAL.pdf
DS LAB MANUAL.pdfDS LAB MANUAL.pdf
DS LAB MANUAL.pdf
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer
 
Python For Scientists
Python For ScientistsPython For Scientists
Python For Scientists
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language Instead
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
DRUG - RDSTK Talk
DRUG - RDSTK TalkDRUG - RDSTK Talk
DRUG - RDSTK Talk
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
R and RMarkdown crash course
R and RMarkdown crash courseR and RMarkdown crash course
R and RMarkdown crash course
 

Dernier

Poster_density_driven_with_fracture_MLMC.pdf
Poster_density_driven_with_fracture_MLMC.pdfPoster_density_driven_with_fracture_MLMC.pdf
Poster_density_driven_with_fracture_MLMC.pdf
Alexander Litvinenko
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
CaitlinCummins3
 
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
Krashi Coaching
 
The basics of sentences session 4pptx.pptx
The basics of sentences session 4pptx.pptxThe basics of sentences session 4pptx.pptx
The basics of sentences session 4pptx.pptx
heathfieldcps1
 

Dernier (20)

TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT VẬT LÝ 2024 - TỪ CÁC TRƯỜNG, TRƯ...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT VẬT LÝ 2024 - TỪ CÁC TRƯỜNG, TRƯ...TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT VẬT LÝ 2024 - TỪ CÁC TRƯỜNG, TRƯ...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT VẬT LÝ 2024 - TỪ CÁC TRƯỜNG, TRƯ...
 
philosophy and it's principles based on the life
philosophy and it's principles based on the lifephilosophy and it's principles based on the life
philosophy and it's principles based on the life
 
Championnat de France de Tennis de table/
Championnat de France de Tennis de table/Championnat de France de Tennis de table/
Championnat de France de Tennis de table/
 
Poster_density_driven_with_fracture_MLMC.pdf
Poster_density_driven_with_fracture_MLMC.pdfPoster_density_driven_with_fracture_MLMC.pdf
Poster_density_driven_with_fracture_MLMC.pdf
 
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
 
MOOD STABLIZERS DRUGS.pptx
MOOD     STABLIZERS           DRUGS.pptxMOOD     STABLIZERS           DRUGS.pptx
MOOD STABLIZERS DRUGS.pptx
 
ANTI PARKISON DRUGS.pptx
ANTI         PARKISON          DRUGS.pptxANTI         PARKISON          DRUGS.pptx
ANTI PARKISON DRUGS.pptx
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
 
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
 
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjjStl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjj
 
Exploring Gemini AI and Integration with MuleSoft | MuleSoft Mysore Meetup #45
Exploring Gemini AI and Integration with MuleSoft | MuleSoft Mysore Meetup #45Exploring Gemini AI and Integration with MuleSoft | MuleSoft Mysore Meetup #45
Exploring Gemini AI and Integration with MuleSoft | MuleSoft Mysore Meetup #45
 
Implanted Devices - VP Shunts: EMGuidewire's Radiology Reading Room
Implanted Devices - VP Shunts: EMGuidewire's Radiology Reading RoomImplanted Devices - VP Shunts: EMGuidewire's Radiology Reading Room
Implanted Devices - VP Shunts: EMGuidewire's Radiology Reading Room
 
How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17
 
An overview of the various scriptures in Hinduism
An overview of the various scriptures in HinduismAn overview of the various scriptures in Hinduism
An overview of the various scriptures in Hinduism
 
Improved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppImproved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio App
 
The basics of sentences session 4pptx.pptx
The basics of sentences session 4pptx.pptxThe basics of sentences session 4pptx.pptx
The basics of sentences session 4pptx.pptx
 
Word Stress rules esl .pptx
Word Stress rules esl               .pptxWord Stress rules esl               .pptx
Word Stress rules esl .pptx
 
Capitol Tech Univ Doctoral Presentation -May 2024
Capitol Tech Univ Doctoral Presentation -May 2024Capitol Tech Univ Doctoral Presentation -May 2024
Capitol Tech Univ Doctoral Presentation -May 2024
 
Including Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdfIncluding Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdf
 
BỘ LUYỆN NGHE TIẾNG ANH 8 GLOBAL SUCCESS CẢ NĂM (GỒM 12 UNITS, MỖI UNIT GỒM 3...
BỘ LUYỆN NGHE TIẾNG ANH 8 GLOBAL SUCCESS CẢ NĂM (GỒM 12 UNITS, MỖI UNIT GỒM 3...BỘ LUYỆN NGHE TIẾNG ANH 8 GLOBAL SUCCESS CẢ NĂM (GỒM 12 UNITS, MỖI UNIT GỒM 3...
BỘ LUYỆN NGHE TIẾNG ANH 8 GLOBAL SUCCESS CẢ NĂM (GỒM 12 UNITS, MỖI UNIT GỒM 3...
 

Python for Chemistry

  • 1. minutes Python for Chemistry in 21 days Dr. Noel O'Boyle Dr. John Mitchell and Prof. Peter Murray-Rust UCC Talk, Sept 2005 Available at http://www-mitchell.ch.cam.ac.uk/noel/
  • 2. Introduction ● This talk will cover – what Python is – why you should be interested in it – how you can use Python in chemistry
  • 3. Introduction ● This talk will cover – what Python is – why you should be interested in it – how you can use Python in chemistry ● This talk will not cover – how to program in Python ● See references at end of talk Word of warning!!
  • 4. “My mental eye could now distinguish larger structures, of manifold conformation; long rows, sometimes more closely fitted together; all twining and twisting in snakelike motion. But look! What was that? One of the snakes had seized hold of its own tail, and the form whirled mockingly before my eyes. As if by the flash of lightning I awoke...Let us learn to dream, gentlemen” Friedrich August Kekulé (1829-1896)
  • 5. What is Python? ● For a computer scientist... – a high-level programming language – interpreted (byte-compiled) – dynamic typing – object-oriented
  • 6. What is Python? ● For a computer scientist... – a high-level programming language – interpreted (byte-compiled) – dynamic typing – object-oriented ● For everyone else... – a scripting language (like Perl or Ruby) released by Guido von Rossum in 1991 – easy to learn – easy to read (!)
  • 7. What is Python? ● For a computer scientist... – a high-level programming language – interpreted (byte-compiled) – dynamic typing – object-oriented ● For everyone else... – a scripting language (like Perl or Ruby) released by Guido von Rossum in 1991 – easy to learn – easy to read (!) – named after Cambridge comedians
  • 8. The Great Debate Sir Lancelot: We were in the nick of time. You were in great Perl. Sir Galahad: I don't think I was. Sir Lancelot: You were, Sir Galahad. You were in terrible Perl. Sir Galahad: Look, let me go back in there and face the Perl. Sir Lancelot: No, it's too perilous. (adapted from Monty Python and the Holy Grail)
  • 9. Why you should be interested (1) ● Python has been adopted by the cheminformatics community ● For example, AstraZeneca has moved some of its codebase from 'the other scripting language' to Python Job Description – Research Software Developer/Informatician [section deleted] Required Skills: At least one object oriented programming language, e.g., Python, C++, Java. Web-based application development (design/construction/maintenance) UNIX, UNIX scripting & Linux OS Position in AstraZeneca R&D, 02-09-05
  • 10. Why you should be interested (2) ● Scientific computing: scipy/pylab (like Matlab) ● Molecular dynamics: MMTK ● Statistics: scipy, rpy (R), pychem ● 3D-visualisation: VTK (mayavi) ● 2D-visualisation: scipy, pylab, rpy ● coming soon, a wrapper around OpenBabel ● cheminformatics: OEChem, frowns, PyDaylight, pychem ● bioinformatics: BioPython ● structural biology: PyMOL ● computational chemistry: GaussSum ● you can still use Java libraries...like the CDK
  • 11. >>> from scipy import * >>> from pylab import * >>> a = arange(0,4) > a <- seq(0,3) >>> a >a scipy/ R [0,1,2,3] [1] 0 1 2 3 pylab >>> mean(a) > mean(a) 1.5 [1] 1.5 >>> a**2 > a**2 [0,1,4,9] [1] 0 1 4 9 >>> plot(a,a**2) > plot(a,a**2) >>> show() > lines(a,a**2)
  • 12. Scipy – cluster: information theory functions (currently, vq and kmeans) – weave: compilation of numeric expressions to C++ for fast execution – cow: parallel programming via a Cluster Of Workstations – fftpack: fast Fourier transform module based on fftpack and fftw when available – ga: genetic algorithms – io: reading and writing numeric arrays, MATLAB .mat, and Matrix Market .mtx files – integrate: numeric integration for bounded and unbounded ranges. ODE solvers. – interpolate: interpolation of values from a sample data set. – optimize: constrained and unconstrained optimization methods and root-finding algorithms – signal: signal processing (1-D and 2-D filtering, filter design, LTI systems, etc.) – special: special function types (bessel, gamma, airy, etc.) – stats: statistical functions (stdev, var, mean, etc.) – linalg: linear algebra and BLAS routines based on the ATLAS implementation of LAPACK – sparse: Some sparse matrix support. LU factorization and solving sparse linear systems
  • 13. Scipy statistical functions ● descriptive statistics: variance, standard deviation, standard error, mean, mode, median ● correlation: Pearson r, Spearman r, Kendall tau ● statistical tests: chi-squared, t-tests, binomial, Wilcoxon, Kruskal, Kolmogorov-Smirnov, Anderson, etc. ● linear regression ● analysis of variance (ANOVA) ● (and more)
  • 14. pylab
  • 15. pychem: Using scipy for chemoinformatics ● Many multivariate analysis techniques are based on matrix algebra ● scipy has wrappers around well-known C and Fortran numerical libraries (ATLAS, LAPACK) ● pychem can do: – principal component analysis, partial least squares regression, Fisher's discriminant analysis – clustering: k-means, hierarchical (used Open Source clustering library) – feature selection: genetic algorithms with PLS – additional methods can be added
  • 16. Python and R ● Advantages of R: – a large number of statistical libraries are available ● Disadvantages of R: – difficult to write algorithms – slow (most R libraries are written in C) – chokes on large datasets (use scan instead of read.table) Reading in data Principal component analysis Method 300K 600K 1.6M Method 300K 600K 1.6M Python 6.8 13.9 41 Python 2.2 3.6 42 R (read.table) 42 105 R (read.table) 5 10 R (scan) 9 20 56 R (scan) 3 5 29 http://www.redbrick.dcu.ie/~noel/RversusPython.html
  • 17. Python and R ● rpy module allows Python programs to interface with R ● have the best of both worlds – access to the statistical functions of R – access to the numerous modules available for Python – can program in Python, instead of in R!! Python R >>> from rpy import r >>> x = [5.05, 6.75, 3.21, 2.66] > x <- c(5.05, 6.75, 3.21, 2.66) >>> y = [1.65, 26.5, -5.93, 7.96] > y <- c(1.65, 26.5, -5.93, 7.96) >>> print r.lsfit(x,y)['coefficients'] > lsfit(x, y)$coefficients {'X': 5.3935773611970212, Intercept X 'Intercept': -16.281127993087839} -16.281128 5.393577
  • 18. Python and R > hc$merge Problem: Analyse a hierarchical clustering [,1] [,2] [1,] -32 -33 Solution: Use R to cluster, and Python to [2,] -39 -71 analyse the merge object of the cluster [3,] -43 -47 [4,] -10 -55 [5,] -19 -36 [6,] -5 -24 [7,] -62 -63 [8,] -74 -75 [9,] -35 -76 [10,] -1 -84 [11,] -41 -42 [12,] -83 -96 [13,] -2 -29 [14,] -7 -21 [15,] -61 4
  • 19. Problem 1 Graphically show the distribution of molecular weights of molecules in an SD file. The molecular weight is stored in a field of the SD file. 1,2-Diaminoethane MOE2004 3D 6 3 0 0 0 0 0 0 0 0999 V2000 -0.6900 -0.6620 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.5850 -1.9590 0.8240 H 0 0 0 0 0 0 0 0 0 0 0 0 0.5350 1.5040 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 0.6060 2.0500 0.8460 H 0 0 0 0 0 0 0 0 0 0 0 0 1.3040 0.8520 -0.0460 H 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 1 3 1 0 0 0 0 1 4 1 0 0 0 0 M END > <chem.name> 1,2-Diaminoethane > <molecular.weight> 60.0995 $$$$
  • 20. Object-oriented approach object SD file object Molecule 1 Molecule 2 Molecule n attributes fields name atoms The code for creating these objects is stored in sdparser.py, which can be imported and used by any scripts that need to parse SD files.
  • 21. Solution from sdparser import SDFile from rpy import * inputfile = "mddr_complete.sd" allmolweights = [] for molecule in SDFile(inputfile): molweight = molecule.fields['molecular.weight'] allmolweights.append(float(molweight)) r.png(file="molwt_r.png") r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red") r.dev_off()
  • 22. Solution from sdparser import SDFile from rpy import * inputfile = "mddr_complete.sd" allmolweights = [] for molecule in SDFile(inputfile): molweight = molecule.fields['molecular.weight'] allmolweights.append(float(molweight)) r.png(file="molwt_r.png") r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red") r.dev_off()
  • 23. Solution from sdparser import SDFile from rpy import * inputfile = "mddr_complete.sd" allmolweights = [] for molecule in SDFile(inputfile): molweight = molecule.fields['molecular.weight'] allmolweights.append(float(molweight)) r.png(file="molwt_r.png") r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red") r.dev_off()
  • 24. Solution from sdparser import SDFile from rpy import * inputfile = "mddr_complete.sd" allmolweights = [] for molecule in SDFile(inputfile): molweight = molecule.fields['molecular.weight'] allmolweights.append(float(molweight)) r.png(file="molwt_r.png") r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red") r.dev_off()
  • 25.
  • 26. Problem 2 Every molecule in an SD file is missing the name. To be compatible with proprietary program X, we need to set the name equal to the value of the field “chem.name”. (MISSING NAME!) MOE2004 3D 6 3 0 0 0 0 0 0 0 0999 V2000 -0.6900 -0.6620 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.5850 -1.9590 0.8240 H 0 0 0 0 0 0 0 0 0 0 0 0 0.5350 1.5040 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 0.6060 2.0500 0.8460 H 0 0 0 0 0 0 0 0 0 0 0 0 1.3040 0.8520 -0.0460 H 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 1 3 1 0 0 0 0 1 4 1 0 0 0 0 M END > <chem.name> 1,2-Diaminoethane > <molecular.weight> 60.0995 $$$$
  • 27. Solution from sdparser import SDFile inputfile = "mddr_complete.sd" outputfile = "mddr_withnames.sd" for molecule in SDFile(inputfile): molecule.name = molecule.fields['chemical.name'] outputfile.write(molecule) inputfile.close() outputfile.close() print "We are the knights who say....SD!!!"
  • 28. Solution from sdparser import SDFile inputfile = "mddr_complete.sd" outputfile = "mddr_withnames.sd" for molecule in SDFile(inputfile): molecule.name = molecule.fields['chemical.name'] outputfile.write(molecule) inputfile.close() outputfile.close() print "We are the knights who say....SD!!!"
  • 29. Python and Java ● It's easy to use Java libraries from Python – using either Jython or JPype – see http://www.redbrick.dcu.ie/~noel/CDKJython.html Example: using the CDK to calculate the number of rings in a molecule (given a string variable containing CML) from jpype import * startJVM("jdk1.5.0_03/jre/lib/i386/server/libjvm.so") cdk = JPackage("org").openscience.cdk SSSRFinder = cdk.ringsearch.SSSRFinder CMLReader = cdk.io.CMLReader def getNumRings(molecule): # Convert to a CDK molecule reader = CMLReader(java.io.StringReader(molXmlValue)) chemFile = reader.read(cdk.ChemFile()) cdkMol = chemFile.getChemSequence(0).getChemModel(0).getSetOfMolecules().getMolecule(0) # Calculate the number of rings sssrFinder = SSSRFinder(cdkMol) sssr = sssrFinder.findSSSR().size() return sssr
  • 30. 3D visualisation ● VTK (Visualisation Toolkit) from Kitware – open source, freely available – scalar, tensor, vector and volumetric methods – advanced modeling techniques such as implicit modelling, polygon reduction, mesh smoothing, cutting, contouring, and Delaunay triangulation ● MayaVi – easy to use GUI interface to VTK, written in Python – can create input files and visualise them using Python scripts Demo
  • 31. Python Resources ● http://www.python.org ● Guido's Tutorial – http://www.python.org/doc/current/tut/tut.html ● O'Reilly's “Learning Python” or Visual Quickstart Guide to Python – Make sure it's Python 2.3 or 2.4 though ● For Windows, consider the Enthought edition – http://www.enthought.com/