2013-03-15 - Institut Jacques Monod - bioinfoclub

Doing computational science better
Some sources of inspiration
Some tools
Getting help

Your turn

Education

A Quick Guide to Organizing Computational Biology Projects
William Stafford Noble 1,2*
1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America
2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America

Introduction

Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.

The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting [...]

[...] understanding your work or who may be evaluating your research skills. Most commonly, however, that "someone" is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

This leads to the second principle, which is actually more like a version of Murphy's Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you've been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new [...]

[...] under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.

Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.

Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final [...]

Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts.
doi:10.1371/journal.pcbi.1000424.g001

[...] with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new [...]

The Lab Notebook

In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments [...]

[...] These types of entries provide a complete picture of the development of the project over time. In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collabo[...]
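The dated lab notebook described in Noble's guide could be kept as a plain-text file and updated from a script. The following is a minimal Ruby sketch, not Noble's own tooling: the helper name `append_notebook_entry`, the file name `notebook.md`, and the Markdown-style date heading are all assumptions made for illustration.

```ruby
require 'date'

# Hypothetical helper: append a dated entry to a plain-text lab notebook
# kept in the root of the results directory.
# The default file name 'notebook.md' is an assumption, not from the source.
def append_notebook_entry(results_dir, text, date: Date.today, file: 'notebook.md')
  path = File.join(results_dir, file)
  # ISO dates (YYYY-MM-DD) keep entries easy to scan and sort.
  entry = "\n## #{date.strftime('%Y-%m-%d')}\n\n#{text.strip}\n"
  # Mode 'a' appends, so earlier entries are never overwritten.
  File.open(path, 'a') { |f| f.write(entry) }
  path
end
```

Calling `append_notebook_entry('results', 'Re-ran runall with the new data set.')` would add an entry under today's date. Appending rather than rewriting keeps older entries intact, which matches the idea of scrolling up to previous entries.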

In each results folder:
• script: getResults.rb or WHATIDID.txt
• intermediates
• output
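One way to set up this per-experiment layout is sketched below in Ruby. It is only an illustration under stated assumptions: the helper name `create_results_folder`, the optional topic suffix, and the stub text written to WHATIDID.txt are hypothetical. The folder names come from the slide, and the date format follows Figure 1 of Noble's guide.

```ruby
require 'date'
require 'fileutils'

# Hypothetical helper: create results/<YYYY-MM-DD>[-topic]/ containing
# intermediates/, output/ and a WHATIDID.txt stub.
def create_results_folder(project_root, date: Date.today, topic: nil)
  name = date.strftime('%Y-%m-%d') # year-month-day sorts chronologically
  name += "-#{topic}" if topic
  folder = File.join(project_root, 'results', name)

  FileUtils.mkdir_p(File.join(folder, 'intermediates'))
  FileUtils.mkdir_p(File.join(folder, 'output'))

  notes = File.join(folder, 'WHATIDID.txt')
  # Never overwrite notes that already exist.
  File.write(notes, "#{name}\nWhat I did, and why:\n") unless File.exist?(notes)
  folder
end
```

For example, `create_results_folder(Dir.pwd, topic: 'msms')` would create a folder named after today's date under `results/`. Running it twice on the same day is harmless, because `mkdir_p` ignores existing directories and the notes file is left untouched.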

Best Practices for Scientific Computing
Greg Wilson ∗ , D.A. Aruliah † , C. Titus Brown ‡ , Neil P. Chue Hong § , Matt Davis ¶ , Richard T. Guy ,
Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,
Ethan P. White ∗∗∗ , Paul Wilson †††
∗ Software Carpentry (gvwilson@software-carpentry.org), † University of Ontario Institute of Technology (Dhavide.Aru[...]
[...] State University (ctb@msu.edu), § Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk), ¶ Space Telescope [...]
(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca), ∗∗ Monterey Bay Aquarium Research Institute
(steve@practicalcomputing.org), †† University of Wisconsin (khuff@cae.wisc.edu), ‡‡ University of British Columbia (mi[...]
[...] Mary University of London (mark.plumbley@eecs.qmul.ac.uk), ¶¶ University College London (b.waugh@ucl.ac.uk), ∗∗[...]
University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)

arXiv:1210.0530v3 [cs.MS] 29 Nov 2012
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.

Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific [...]

[...] and open source software development [61 [...] ical studies of scientific computing [4, 31, [...] development in general (summarized in [...] practices will guarantee efficient, error-fr[...] ment, but used in concert they will red[...] errors in scientific software, make it easie[...] the authors of the software time and effo[...] focusing on the underlying scientific ques[...]

1. Write programs for people, not computers.

Scientists writing software need to write [...] cutes correctly and can be easily read and [...] programmers (especially the author’s fut[...] cannot be easily read and understood it is [...] to know that it is actually doing what it i[...] be productive, software developers must t[...] aspects of human cognition into account [...] human working memory is limited, huma[...]

1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.

Ruby.
(or maybe Python)

“Friends don’t let friends do Perl” - reddit user

Programming better
• “being able to use, understand and improve your code in 6 months & in 60 years” - approximate Damian Conway
• variable naming
• coding width: 100 characters
• indenting
• Follow conventions - e.g. “Google R Style” or https://github.com/hadley/devtools/wiki/Style
• Versioning: Dropbox & http://github.com/
• Automated testing. e.g.:

preprocess_snps <- function(snp_table, testing=FALSE) {
  if (testing) {
    # run a bunch of tests of extreme situations.
    # quit if a test gives a weird result.
  }
  # real part of function.
}
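The built-in self-test pattern from the slide can be made concrete. The sketch below uses Ruby rather than R. Its "real part", which drops SNP rows containing a missing call (represented here as nil), is a hypothetical stand-in, since the slide does not say what preprocess_snps actually does.

```ruby
# Hypothetical "real part": drop SNP rows that contain a missing call (nil).
def drop_incomplete_snps(snp_table)
  snp_table.reject { |row| row.any?(&:nil?) }
end

def preprocess_snps(snp_table, testing: false)
  if testing
    # Extreme situations as [input, expected] pairs:
    # an empty table, only incomplete rows, and nothing to drop.
    checks = [
      [[], []],
      [[[nil, 'A'], ['G', nil]], []],
      [[['A', 'T'], ['G', 'C']], [['A', 'T'], ['G', 'C']]]
    ]
    checks.each do |input, expected|
      actual = drop_incomplete_snps(input)
      # Quit if a test gives a weird result.
      raise "self-test failed for #{input.inspect}" unless actual == expected
    end
  end
  drop_incomplete_snps(snp_table)
end
```

Calling `preprocess_snps(table, testing: true)` runs the checks before processing the real data, so a broken change stops the analysis instead of silently producing wrong results.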

Take notes in Markdown: converts to html, pdf, ...

knitr (Sweave): Analyzing & Reporting in a single file.

MyFile.Rnw
\documentclass{article}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{url}

\begin{document}

<<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
# this is equivalent to \SweaveOpts{...}
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(replace.assign=TRUE,width=90)
@

\title{A Minimal Demo of knitr}

\author{Yihui Xie}

\maketitle
You can test if \textbf{knitr} works with this minimal demo. OK, let's
get started with some boring random numbers:

<<boring-random,echo=TRUE,cache=TRUE>>=
set.seed(1121)
(x=rnorm(20))
mean(x);var(x)
@

The first element of \texttt{x} is \Sexpr{x[1]}. Boring boxplots
and histograms recorded by the PDF device:

<<boring-plots,cache=TRUE,echo=TRUE>>=
## two plots side by side
par(mar=c(4,4,.1,.1),cex.lab=.95,cex.axis=.9,mgp=c(2,.7,0),tcl=-.3,las=1)
boxplot(x)
hist(x,main='')
@

Do the above chunks work? You should be able to compile the \TeX{}

### in R:
library(knitr)
knit("MyFile.Rnw")
# --> creates MyFile.tex

### in shell:
pdflatex MyFile.tex
# --> creates MyFile.pdf

A Minimal Demo of knitr
Yihui Xie
February 26, 2012

You can test if knitr works with this minimal demo. OK, let’s get started with some boring random numbers:

set.seed(1121)
(x <- rnorm(20))
## [1] 0.14496 0.43832 0.15319 1.08494 1.99954 -0.81188 0.16027 0[...]
## [10] -0.02531 0.15088 0.11008 1.35968 -0.32699 -0.71638 1.80977 0[...]
## [19] 0.13272 -0.15594
mean(x)
## [1] 0.3217
var(x)
