3. Education
A Quick Guide to Organizing Computational Biology Projects
William Stafford Noble 1,2*
1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America
2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
Introduction

Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.

The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting [...]

[...] understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new [...]

[...] under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.

Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.

Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final [...]
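The top-level layout described on this slide (data, results, doc, src, bin) can be set up with a few lines of code. The following is only a minimal sketch: the function name make_project_skeleton is hypothetical, and the choice of Python is an assumption, since the paper does not prescribe a tool for creating the directories.

```python
from pathlib import Path

# Top-level directories named in the text above.
TOP_LEVEL_DIRS = ("data", "results", "doc", "src", "bin")


def make_project_skeleton(root):
    """Create the logical top-level layout under `root` and return the paths.

    Hypothetical helper for illustration; it is not part of the paper.
    """
    root = Path(root)
    created = []
    for name in TOP_LEVEL_DIRS:
        directory = root / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created
```

Calling make_project_skeleton("msms") would create msms/data, msms/results, and so on. Dated subdirectories under results are left to be created per experiment.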
4. Education
Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts.
doi:10.1371/journal.pcbi.1000424.g001

[...] with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new [...]

The Lab Notebook

In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments [...]

These types of entries provide a complete picture of the development of the project over time.

In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collabo-
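The dated directory names and the dated notebook entries described on this slide can be sketched in code. This is an illustration under stated assumptions: the function names, the optional topic suffix format, and the plain-text file name notebook.txt are all hypothetical, and the paper does not specify a notebook file format.

```python
from datetime import date
from pathlib import Path


def experiment_dir(results_root, day, topic=""):
    """Return results_root/<year>-<month>-<day>[-topic], creating it if needed."""
    # ISO dates such as 2008-12-19 sort chronologically when sorted as text.
    name = day.isoformat()
    if topic:
        name = f"{name}-{topic}"
    path = Path(results_root) / name
    path.mkdir(parents=True, exist_ok=True)
    return path


def append_notebook_entry(results_root, day, text, notebook_name="notebook.txt"):
    """Append a dated entry to a notebook kept in the root of the results directory.

    The plain-text format and the default file name are assumptions.
    """
    notebook = Path(results_root) / notebook_name
    notebook.parent.mkdir(parents=True, exist_ok=True)
    with notebook.open("a", encoding="utf-8") as handle:
        handle.write(f"{day.isoformat()}\n{text.rstrip()}\n\n")
    return notebook
```

Because entries are only ever appended, the notebook stays in chronological order, matching the order of the dated directories next to it.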
5. Education
In each results folder:
• script: getResults.rb or WHATIDID.txt
• intermediates
• output
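The three items listed for each results folder (a script or WHATIDID.txt record, intermediates, and output) could be maintained by a small wrapper that runs each command and logs it. This is a sketch only: run_and_record is a hypothetical name, and the one-line-per-command log format is an assumption, not something the slide specifies.

```python
import shlex
import subprocess
import sys
from pathlib import Path


def run_and_record(results_dir, command):
    """Run `command` (a list of strings) inside results_dir and log it.

    Creates intermediates/ and output/ if missing, then appends the command
    and its exit status to WHATIDID.txt. The log format is an assumption.
    """
    results_dir = Path(results_dir)
    for sub in ("intermediates", "output"):
        (results_dir / sub).mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(
        command, cwd=results_dir, capture_output=True, text=True
    )
    with (results_dir / "WHATIDID.txt").open("a", encoding="utf-8") as log:
        # shlex.join quotes arguments so the logged line can be pasted into a shell.
        log.write(f"{shlex.join(command)}  # exit status {completed.returncode}\n")
    return completed
```

For example, run_and_record("results/2009-01-15", [sys.executable, "analyze.py"]) would run a (hypothetical) analyze.py from inside that folder and leave a record of the call behind.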
6. Best Practices for Scientific Computing
Greg Wilson ∗ , D.A. Aruliah † , C. Titus Brown ‡ , Neil P. Chue Hong § , Matt Davis ¶ , Richard T. Guy ,
Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,
Ethan P. White ∗∗∗ , Paul Wilson †††
∗ Software Carpentry (gvwilson@software-carpentry.org),† University of Ontario Institute of Technology (Dhavide.Aru
State University (ctb@msu.edu),§ Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk),¶ Space Telescope
(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca),∗∗ Monterey Bay Aquarium Research Institute
(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wisc.edu),‡‡ University of British Columbia (mi
Mary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ University College London (b.waugh@ucl.ac.uk),∗∗
University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)
arXiv:1210.0530v3 [cs.MS] 29 Nov 2012
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.

Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific

and open source software development [61
ical studies of scientific computing [4, 31,
development in general (summarized in
practices will guarantee efficient, error-fr
ment, but used in concert they will red
errors in scientific software, make it easie
the authors of the software time and effo
focusing on the underlying scientific ques
1. Write programs for people, not c
Scientists writing software need to write
cutes correctly and can be easily read and
programmers (especially the author’s fut
cannot be easily read and understood it is
to know that it is actually doing what it i
be productive, software developers must t
aspects of human cognition into account
human working memory is limited, huma
7. Best Practices for Scientific Computing
8. Best Practices for Scientific Computing
1. Write programs for people, not computers.
9. Best Practices for Scientific Computing
1. Write programs for people, not computers.
2. Automate repetitive tasks.
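One common way to automate a repetitive task is to re-run a step only when its inputs have changed, in the style of a build tool such as make. The sketch below illustrates that idea under assumptions: is_stale and build are hypothetical helper names, and comparing file modification times is just one possible staleness rule.

```python
from pathlib import Path


def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    target = Path(target)
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(Path(src).stat().st_mtime > target_time for src in sources)


def build(target, sources, action):
    """Call action(sources, target) only when the target is stale.

    Returns True if the action ran and False if the target was up to date.
    """
    if is_stale(target, sources):
        action(sources, target)
        return True
    return False
```

With helpers like these, a driver script can be re-run as often as needed: steps whose outputs are current are skipped, and only the steps affected by new data are repeated.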
10. Best Practices for Scientific Computing
1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
17. Best Practices for Scientific Computing
Greg Wilson ∗, D.A. Aruliah †, C. Titus Brown ‡, Neil P. Chue Hong §, Matt Davis ¶, Richard T. Guy, Steven H.D. Haddock ∗∗, Katy Huff ††, Ian M. Mitchell ‡‡, Mark D. Plumbley §§, Ben Waugh ¶¶, Ethan P. White ∗∗∗, Paul Wilson †††
∗ Software Carpentry (gvwilson@software-carpentry.org), † University of Ontario Institute of Technology (Dhavide.Aru…), ‡ … State University (ctb@msu.edu), § Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk), ¶ Space Telescope … (mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca), ∗∗ Monterey Bay Aquarium Research Institute (steve@practicalcomputing.org), †† University of Wisconsin (khuff@cae.wisc.edu), ‡‡ University of British Columbia (mi…), §§ … Mary University of London (mark.plumbley@eecs.qmul.ac.uk), ¶¶ University College London (b.waugh@ucl.ac.uk), ∗∗∗ … University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)
arXiv:1210.0530v3 [cs.MS] 29 Nov 2012
1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.
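As a rough illustration of how a few of these practices might look in R (the language used later in this deck), the sketch below puts a repeated calculation into one clearly named function (practices 1 and 6) and checks its inputs with assertions (practice 7). The function name `allele_frequency`, the column names, and the 0/1/2 genotype coding are assumptions made for this example; they do not come from the paper.

```r
# Hypothetical helper: frequency of the alternate allele for one SNP,
# with genotypes coded as 0, 1 or 2 copies of the alternate allele.
allele_frequency <- function(genotypes) {
  # Plan for mistakes: fail early on input the function cannot handle.
  stopifnot(is.numeric(genotypes), length(genotypes) > 0)
  stopifnot(all(genotypes %in% c(0, 1, 2, NA)))

  observed <- genotypes[!is.na(genotypes)]
  stopifnot(length(observed) > 0)

  # Each individual carries two alleles, so divide by 2 * sample size.
  sum(observed) / (2 * length(observed))
}

# Don't repeat yourself: reuse the same function for every SNP column.
genotype_table <- data.frame(
  snp_a = c(0, 1, 2, 1),
  snp_b = c(0, 0, NA, 1)
)
frequencies <- sapply(genotype_table, allele_frequency)
print(frequencies)
```

Because the checks live inside the function, every caller gets them for free, and a bad input stops the analysis instead of silently producing a wrong number.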
19. Ruby.
(or maybe python)
“Friends don’t let friends do Perl” - reddit user
27. Programming better
• “being able to use, understand, and improve your code in 6
months & in 60 years” - approximate Damian Conway
• variable naming
• coding width: 100 characters
• indenting
• Follow conventions, e.g. “Google R Style”
or https://github.com/hadley/devtools/wiki/Style
• Versioning: DropBox & http://github.com/
• Automated testing, e.g.:
preprocess_snps <- function(snp_table, testing=FALSE) {
  if (testing) {
    # run a bunch of tests of extreme situations.
    # quit if a test gives a weird result.
  }
  # real part of function.
}
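One way the testing skeleton above could be filled in is sketched below. This is only an illustration under stated assumptions: the columns `id` and `maf` (minor allele frequency), the `min_maf` threshold, and the filtering rule itself are all hypothetical and do not come from the slides.

```r
# Hypothetical sketch: snp_table is assumed to be a data frame with
# columns `id` and `maf`; the filtering rule is invented for illustration.
preprocess_snps <- function(snp_table, min_maf = 0.05, testing = FALSE) {
  if (testing) {
    # Extreme situation 1: an empty table should come back empty.
    empty <- data.frame(id = character(0), maf = numeric(0),
                        stringsAsFactors = FALSE)
    stopifnot(nrow(preprocess_snps(empty)) == 0)

    # Extreme situation 2: rows with a missing frequency should be dropped.
    with_na <- data.frame(id = c("rs1", "rs2"), maf = c(NA, 0.2),
                          stringsAsFactors = FALSE)
    stopifnot(identical(preprocess_snps(with_na)$id, "rs2"))
    # stopifnot() raises an error, so a weird result stops the run here.
  }

  # Real part of the function: keep SNPs with a known, large enough frequency.
  keep <- !is.na(snp_table$maf) & snp_table$maf >= min_maf
  snp_table[keep, , drop = FALSE]
}

snps <- data.frame(id = c("rs1", "rs2", "rs3"), maf = c(0.01, 0.30, NA),
                   stringsAsFactors = FALSE)
filtered <- preprocess_snps(snps, testing = TRUE)
```

Calling the function with `testing = TRUE` runs the checks first and then processes the real input, so the tests travel with the function they protect.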
32. knitr (sweave): Analyzing & Reporting in a single file.
MyFile.Rnw
\documentclass{article}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{url}
\begin{document}
<<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
# this is equivalent to \SweaveOpts{...}
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(replace.assign=TRUE,width=90)
@
\title{A Minimal Demo of knitr}
\author{Yihui Xie}
\maketitle
You can test if \textbf{knitr} works with this minimal demo. OK, let's
get started with some boring random numbers:
<<boring-random,echo=TRUE,cache=TRUE>>=
set.seed(1121)
(x=rnorm(20))
mean(x);var(x)
@
The first element of \texttt{x} is \Sexpr{x[1]}. Boring boxplots
and histograms recorded by the PDF device:
<<boring-plots,cache=TRUE,echo=TRUE>>=
## two plots side by side
par(mar=c(4,4,.1,.1),cex.lab=.95,cex.axis=.9,mgp=c(2,.7,0),tcl=-.3,las=1)
boxplot(x)
hist(x,main='')
@
Do the above chunks work? You should be able to compile the \TeX{}
34. knitr (sweave): Analyzing & Reporting in a single file.
MyFile.Rnw (same listing as slide 32)
### in R:
library(knitr)
knit("MyFile.Rnw")
# --> creates MyFile.tex
### in shell:
pdflatex MyFile.tex
# --> creates MyFile.pdf
MyFile.pdf:
A Minimal Demo of knitr
Yihui Xie
February 26, 2012
You can test if knitr works with this minimal demo. OK, let’s get started with some boring random numbers:
set.seed(1121)
(x <- rnorm(20))
## [1] 0.14496 0.43832 0.15319 1.08494 1.99954 -0.81188 0.16027 0…
## [10] -0.02531 0.15088 0.11008 1.35968 -0.32699 -0.71638 1.80977 0…
## [19] 0.13272 -0.15594
mean(x)
## [1] 0.3217
var(x)