Doing computational science better
Some sources of inspiration
Some tools
Getting help

Over to you
Some sources of inspiration
Education

A Quick Guide to Organizing Computational Biology Projects
William Stafford Noble 1,2*
1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America

Introduction

Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.

The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting [...]

[...] understanding your work or who may be evaluating your research skills. Most commonly, however, that "someone" is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

This leads to the second principle, which is actually more like a version of Murphy's Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you've been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new [...]

[...] under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.

Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.

Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final [...]
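The top-level layout described in the excerpt (data, results, doc, src, bin under one project root) can be sketched in a few lines of Python. This is only an illustration: the directory names come from the text, while the function name `create_project` and the use of a temporary directory for the demonstration are assumptions made here, not something the article prescribes.

```python
from pathlib import Path
import tempfile

# Top-level directories named in the excerpt; everything else is an assumption.
TOP_LEVEL_DIRS = ("data", "results", "doc", "src", "bin")


def create_project(root: Path, name: str) -> Path:
    """Create an empty project skeleton under root/name and return its path."""
    project = root / name
    for sub in TOP_LEVEL_DIRS:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(Path(tmp), "msms")
        print(sorted(p.name for p in project.iterdir()))
```

Keeping the skeleton in a small function means every new project starts from the same layout, which is the point of having a common root directory in the first place.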
Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts.
doi:10.1371/journal.pcbi.1000424.g001

[...] with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new [...]

The Lab Notebook

In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments [...]

[...] These types of entries provide a complete picture of the development of the project over time. In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collabo[...]
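The chronological naming scheme and the dated lab notebook can be sketched as follows. The sketch assumes a plain-text notebook; the file name `notebook.txt`, the function names, and the topic suffix `baseline` are hypothetical choices made for illustration. The one property taken from the text is that <year>-<month>-<day> names sort chronologically when sorted as plain strings.

```python
from datetime import date
from pathlib import Path
import tempfile


def experiment_dir(results: Path, day: date, topic: str = "") -> Path:
    """Create results/<year>-<month>-<day>[-topic] and return its path."""
    name = day.isoformat()  # zero-padded, e.g. 2008-12-19
    if topic:
        name = f"{name}-{topic}"
    path = results / name
    path.mkdir(parents=True, exist_ok=True)
    return path


def add_notebook_entry(results: Path, day: date, text: str) -> Path:
    """Append a dated entry to a notebook kept in the root of results/."""
    notebook = results / "notebook.txt"  # hypothetical file name
    with notebook.open("a", encoding="utf-8") as handle:
        handle.write(f"{day.isoformat()}\n{text}\n\n")
    return notebook


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results = Path(tmp) / "results"
        # Created out of order on purpose; sorting by name restores the timeline.
        experiment_dir(results, date(2009, 1, 15))
        experiment_dir(results, date(2008, 12, 19), topic="baseline")
        add_notebook_entry(results, date(2009, 1, 15), "Ran runall on three splits.")
        print(sorted(p.name for p in results.iterdir() if p.is_dir()))
```

Opening the notebook in append mode keeps earlier entries untouched, which matches the idea of a record that grows over time rather than being rewritten.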
In each results folder:
   • script: getResults.rb or WHATIDID.txt
   • intermediates
   • output
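One possible way to set up such a results folder is sketched below. It assumes that "intermediates" and "output" are subdirectories and that WHATIDID.txt is a free-text record of what was run; the function name `init_results_folder` is hypothetical, and a driver script such as getResults.rb could sit alongside or replace the text file.

```python
from pathlib import Path
import tempfile


def init_results_folder(folder: Path, note: str) -> None:
    """Create WHATIDID.txt plus intermediates/ and output/ inside folder."""
    (folder / "intermediates").mkdir(parents=True, exist_ok=True)
    (folder / "output").mkdir(parents=True, exist_ok=True)
    whatidid = folder / "WHATIDID.txt"
    if not whatidid.exists():
        # Never overwrite an existing record of what was done.
        whatidid.write_text(note + "\n", encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp) / "results" / "2009-01-15"
        init_results_folder(folder, "Ran the analysis on the raw data.")
        print(sorted(p.name for p in folder.iterdir()))
```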
Best Practices for Scientific Computing
Greg Wilson ∗, D.A. Aruliah †, C. Titus Brown ‡, Neil P. Chue Hong §, Matt Davis ¶, Richard T. Guy, Steven H.D. Haddock ∗∗, Katy Huff ††, Ian M. Mitchell ‡‡, Mark D. Plumbley §§, Ben Waugh ¶¶, Ethan P. White ∗∗∗, Paul Wilson †††
∗ Software Carpentry (gvwilson@software-carpentry.org), † University of Ontario Institute of Technology (Dhavide.Aru[...] State University (ctb@msu.edu), § Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk), ¶ Space Telescope [...] (mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca), ∗∗ Monterey Bay Aquarium Research Institute (steve@practicalcomputing.org), †† University of Wisconsin (khuff@cae.wisc.edu), ‡‡ University of British Columbia (mi[...] Mary University of London (mark.plumbley@eecs.qmul.ac.uk), ¶¶ University College London (b.waugh@ucl.ac.uk), ∗∗[...] University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)

arXiv:1210.0530v3 [cs.MS] 29 Nov 2012

Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists' productivity and the reliability of their software.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.

Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific [...]

[...] and open source software development [61[...] ical studies of scientific computing [4, 31, [...] development in general (summarized in [...] practices will guarantee efficient, error-fr[...] ment, but used in concert they will red[...] errors in scientific software, make it easie[...] the authors of the software time and effo[...] focusing on the underlying scientific ques[...]

1. Write programs for people, not c[...]

Scientists writing software need to write [...] cutes correctly and can be easily read and [...] programmers (especially the author's fut[...] cannot be easily read and understood it is [...] to know that it is actually doing what it i[...] be productive, software developers must t[...] aspects of human cognition into account [...] human working memory is limited, huma[...]
(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.w
Best Practices for Scientific Computing
Mary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ Unive
           ∗
Greg Wilson , (ethan@weecology.org), and ††† Hong § , Matt of ¶ , Richard T. (wils
University D.A. Aruliah † , C. Titus Brown ‡ , Neil P. ChueUniversityDavisWisconsin Guy ,
Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,
Ethan P. White ∗∗∗ , Paul Wilson †††
∗
 Software Carpentry (gvwilson@software-carpentry.org),† University of Ontario Institute of Technology (Dhavide.Aru
Scientists spend an increasing amount of time building and using                                            a
State University (ctb@msu.edu),§ Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk),¶ Space Telescope
(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca),∗∗ Monterey Bay Aquarium Research Institute
software. However, most scientists are never taught how to do this                                          i
(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wisc.edu),‡‡ University of British Columbia (mi
efficiently. As a result, many are unaware of tools and practices that                                        d
Mary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ University College London (b.waugh@ucl.ac.uk),∗∗
University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)
would allow them to write more reliable and maintainable code with                                          p
less effort. We describe a set of best practices for scientific software




                                                                                                               arXiv:1210.0530v3 [cs.MS] 29 Nov 2012
Scientists spend an increasing amount of time building and using research and software development [61
                                                                   and open source experience,              m
development that have solid foundations in ical studies of scientific computing [4, 31,
software. However, most scientists are never taught how to do this
efficiently. As a improve are unaware of tools and practices thatand the reliability of their
and that result, many scientists’ productivity                                                              e
                                                                   development in general (summarized in
software. describe a set of best practices for scientific software practices will guarantee efficient, error-frt
would allow them to write more reliable and maintainable code with
less effort. We
                                                                         ment, but used in concert they will red                                   f
development that have solid foundations in research and experience,
and that improve scientists’ productivity and the reliability of their   errors in scientific software, make it easie
                                                                         the authors of the software time and effo
Best Practices for Scientific Computing

Greg Wilson ∗, D.A. Aruliah †, C. Titus Brown ‡, Neil P. Chue Hong §, Matt Davis ¶, Richard T. Guy, Steven H.D. Haddock ∗∗, Katy Huff ††, Ian M. Mitchell ‡‡, Mark D. Plumbley §§, Ben Waugh ¶¶, Ethan P. White ∗∗∗, Paul Wilson †††

∗ Software Carpentry (gvwilson@software-carpentry.org), † University of Ontario Institute of Technology (Dhavide.Aru[...]), [...] State University (ctb@msu.edu), § Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk), ¶ Space Telescope [...] (mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca), ∗∗ Monterey Bay Aquarium Research Institute (steve@practicalcomputing.org), †† University of Wisconsin (khuff@cae.wisc.edu), ‡‡ University of British Columbia (mi[...]), [...] Mary University of London (mark.plumbley@eecs.qmul.ac.uk), ¶¶ University College London (b.waugh@ucl.ac.uk), ∗∗[...] University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)

arXiv:1210.0530v3 [cs.MS] 29 Nov 2012

Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems. Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific [...]

[Right-hand column of the paper's first page is cropped on the slide.]

1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
volves around computers. This includes the development of
new algorithms, managing and analyzing the large amounts
                                                                  programmers (especially the author’s fut
of data algorithms, managing and analyzing the large amounts
                                                                  cannot be easily read and understood it isp
new disparate datasets to assess synthetic problems.
         that are generated in single research projects, and
                                                                  to know that it is actually doing what it i
combining
                                                                  be productive, software developers must t c
of Scientists that are generated in single research human cognition andaccount
      data typically develop their own software for these aspects of projects, into
purposes because doing so requires substantial domain-specific
                                                                  human working memory is limited, huma
                                                                                                            t
Best Practices for Scientific Computing

Greg Wilson ∗, D.A. Aruliah †, C. Titus Brown ‡, Neil P. Chue Hong §, Matt Davis ¶, Richard T. Guy ‖, Steven H.D. Haddock ∗∗, Katy Huff ††, Ian M. Mitchell ‡‡, Mark D. Plumbley §§, Ben Waugh ¶¶, Ethan P. White ∗∗∗, Paul Wilson †††

∗ Software Carpentry (gvwilson@software-carpentry.org), † University of Ontario Institute of Technology (Dhavide.Aru…), ‡ … State University (ctb@msu.edu), § Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk), ¶ Space Telescope … (mrdavis@stsci.edu), ‖ University of Toronto (guy@cs.utoronto.ca), ∗∗ Monterey Bay Aquarium Research Institute (steve@practicalcomputing.org), †† University of Wisconsin (khuff@cae.wisc.edu), ‡‡ University of British Columbia (mi…), §§ … Mary University of London (mark.plumbley@eecs.qmul.ac.uk), ¶¶ University College London (b.waugh@ucl.ac.uk), ∗∗∗ … University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)

arXiv:1210.0530v3 [cs.MS] 29 Nov 2012

Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.

            1. Write programs for people, not computers.
            2. Automate repetitive tasks.
            3. Use the computer to record history.
            4. Make incremental changes.
            5. Use version control.
            6. Don’t repeat yourself (or others).
            7. Plan for mistakes.
            8. Optimize software only after it works correctly.
            9. Document the design and purpose of code rather than its mechanics.
            10. Conduct code reviews.
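As a rough illustration of practices 6 ("Don't repeat yourself") and 7 ("Plan for mistakes") from the list above, the R sketch below puts a calculation in one function and checks its input before using it. The function name `normalize_counts` and the sample data are hypothetical, made up for this example; nothing here comes from the paper itself.

```r
# Hypothetical helper: one function instead of copy-pasted normalization
# code (practice 6), with defensive checks on its input (practice 7).
normalize_counts <- function(counts) {
  # Fail early and loudly on input that makes no sense.
  stopifnot(is.numeric(counts), length(counts) > 0, all(counts >= 0))
  total <- sum(counts)
  if (total == 0) {
    stop("cannot normalize: all counts are zero")
  }
  counts / total
}

# Reuse the same function for every sample rather than repeating the code.
samples <- list(sample_a = c(10, 30, 60), sample_b = c(1, 1, 2))
normalized <- lapply(samples, normalize_counts)
print(normalized$sample_a)
```

If the normalization rule ever changes, it changes in one place, and bad input stops the analysis instead of silently producing `NaN` values.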
Ruby.
(or maybe Python)




“Friends don’t let friends do Perl” - reddit user
Programming better
• “being able to use, understand and improve your code in 6 months & in 60 years” - approximate Damian Conway
• variable naming
• coding width: 100 characters
• indenting
• Follow conventions, e.g. “Google R Style” or https://github.com/hadley/devtools/wiki/Style
• Versioning: DropBox & http://github.com/
• Automated testing, e.g.:

      preprocess_snps <- function(snp_table, testing=FALSE) {
          if (testing) {
              # run a bunch of tests of extreme situations.
              # quit if a test gives a weird result.
          }
          # real part of function.
      }
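One possible way to fill in the `testing=TRUE` pattern is sketched below. The slide does not say what `snp_table` looks like or what the "real part" does, so the data layout (a data frame with columns `snp_id` and `maf`) and the filtering rule are assumptions chosen only to make the example runnable.

```r
# Hypothetical concrete version of the slide's pattern. The columns
# `snp_id` and `maf` and the filtering rule are assumptions.
preprocess_snps <- function(snp_table, testing = FALSE) {
  if (testing) {
    # Run the real code on small extreme inputs; stop on any weird result.
    empty <- data.frame(snp_id = character(0), maf = numeric(0),
                        stringsAsFactors = FALSE)
    stopifnot(nrow(preprocess_snps(empty)) == 0)

    with_missing <- data.frame(snp_id = c("rs1", "rs2"), maf = c(NA, 0.2),
                               stringsAsFactors = FALSE)
    stopifnot(identical(preprocess_snps(with_missing)$snp_id, "rs2"))
  }
  # Real part of the function: drop SNPs whose frequency is missing or zero.
  keep <- !is.na(snp_table$maf) & snp_table$maf > 0
  snp_table[keep, , drop = FALSE]
}

snps <- data.frame(snp_id = c("rs1", "rs2", "rs3"), maf = c(0.1, 0, NA),
                   stringsAsFactors = FALSE)
clean <- preprocess_snps(snps, testing = TRUE)
print(clean)
```

Calling the function with `testing = TRUE` first runs the self-checks on an empty table and on a table with a missing value, and only then processes the real data. In a larger project the same checks could move into a dedicated test file.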
A few tools
Take notes in Markdown, convert to html, pdf, …
knitr (sweave)Analyzing & Reporting in a single file.
                                                                         ### in R:
MyFile.Rnw                                                               library(knitr)
documentclass{article}
usepackage[sc]{mathpazo}
usepackage[T1]{fontenc}
knit("MyFile.Rnw")
# --> creates MyFile.tex

### in shell:
pdflatex MyFile.tex
# --> creates MyFile.pdf

\usepackage{url}

\begin{document}
<<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
# this is equivalent to \SweaveOpts{...}
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(replace.assign=TRUE,width=90)
@
\title{A Minimal Demo of knitr}
\author{Yihui Xie}

\maketitle
You can test if \textbf{knitr} works with this minimal demo. OK, let's
get started with some boring random numbers:
<<boring-random,echo=TRUE,cache=TRUE>>=
set.seed(1121)
(x=rnorm(20))
mean(x);var(x)
@

The first element of \texttt{x} is \Sexpr{x[1]}. Boring boxplots
and histograms recorded by the PDF device:

<<boring-plots,cache=TRUE,echo=TRUE>>=
## two plots side by side
par(mar=c(4,4,.1,.1),cex.lab=.95,cex.axis=.9,mgp=c(2,.7,0),tcl=-.3,las=1)
boxplot(x)
hist(x,main='')
@

Do the above chunks work? You should be able to compile the \TeX{}

MyFile.pdf (rendered):

A Minimal Demo of knitr
Yihui Xie
February 26, 2012

You can test if knitr works with this minimal demo. OK, let's get started with s
numbers:

set.seed(1121)
(x <- rnorm(20))
## [1] 0.14496 0.43832 0.15319 1.08494 1.99954 -0.81188 0.16027 0
## [10] -0.02531 0.15088 0.11008 1.35968 -0.32699 -0.71638 1.80977 0
## [19] 0.13272 -0.15594

mean(x)
## [1] 0.3217

var(x)
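
The two steps on this slide (knit in R, then pdflatex in the shell) could be chained in one small shell helper. This is only a minimal sketch: it assumes R with the knitr package and a LaTeX distribution providing pdflatex are installed, and the names build_report, run and DRY_RUN are hypothetical, not part of knitr.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: turn an .Rnw file into a PDF in two steps.
build_report() {
    rnw="$1"                 # e.g. MyFile.Rnw
    base="${rnw%.Rnw}"       # strip the extension -> MyFile
    # Step 1 (R): knit the code chunks and prose into MyFile.tex
    run Rscript -e "knitr::knit('$rnw')" || return 1
    # Step 2 (shell): typeset MyFile.tex into MyFile.pdf
    run pdflatex "$base.tex"
}
```

Calling build_report MyFile.Rnw would run both steps; with DRY_RUN=1 set, it only prints the two commands, which is handy for checking the file names first.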
2013 03-15- Institut Jacques Monod - bioinfoclub

2013 03-15- Institut Jacques Monod - bioinfoclub

  • 1. Doing computational science better: Some sources of inspiration, Some tools, Getting help, Your turn
  • 2. Some sources of inspiration
  • 3. Education: A Quick Guide to Organizing Computational Biology Projects. William Stafford Noble (1,2*). 1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
    Introduction. Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues. The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting
    understanding your work or who may be evaluating your research skills. Most commonly, however, that "someone" is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments. This leads to the second principle, which is actually more like a version of Murphy's Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you've been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new
    under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own. Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts. Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final
  • 4. Education: A Quick Guide to Organizing Computational Biology Projects. William Stafford Noble
    Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts. doi:10.1371/journal.pcbi.1000424.g001
    with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new
    The Lab Notebook. In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments
    These types of entries provide a complete picture of the development of the project over time. In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collabo-
  • 5. Education: A Quick Guide to Organizing Computational Biology Projects. William Stafford Noble
    In each results folder:
      • script: getResults.rb or WHATIDID.txt
      • intermediates
      • output
  • 6. Best Practices for Scientific Computing. Greg Wilson ∗, D.A. Aruliah †, C. Titus Brown ‡, Neil P. Chue Hong §, Matt Davis ¶, Richard T. Guy, Steven H.D. Haddock ∗∗, Katy Huff ††, Ian M. Mitchell ‡‡, Mark D. Plumbley §§, Ben Waugh ¶¶, Ethan P. White ∗∗∗, Paul Wilson †††. ∗ Software Carpentry (gvwilson@software-carpentry.org), † University of Ontario Institute of Technology (Dhavide.Aru State University (ctb@msu.edu), § Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk), ¶ Space Telescope (mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca), ∗∗ Monterey Bay Aquarium Research Institute (steve@practicalcomputing.org), †† University of Wisconsin (khuff@cae.wisc.edu), ‡‡ University of British Columbia (mi Mary University of London (mark.plumbley@eecs.qmul.ac.uk), ¶¶ University College London (b.waugh@ucl.ac.uk), ∗∗∗ University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu). arXiv:1210.0530v3 [cs.MS] 29 Nov 2012.
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists' productivity and the reliability of their software.
Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems. Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific ...
1. Write programs for people, not computers.
  • 7. Best Practices for Scientific Computing
  • 8. Best Practices for Scientific Computing: 1. Write programs for people, not computers.
  • 9. Best Practices for Scientific Computing: 1. Write programs for people, not computers. 2. Automate repetitive tasks.
  • 10. Best Practices for Scientific Computing: 1. Write programs for people, not computers. 2. Automate repetitive tasks. 3. Use the computer to record history.
  • 11. Best Practices for Scientific Computing: 1. Write programs for people, not computers. 2. Automate repetitive tasks. 3. Use the computer to record history. 4. Make incremental changes.
  • 12. Best Practices for Scientific Computing: 1. Write programs for people, not computers. 2. Automate repetitive tasks. 3. Use the computer to record history. 4. Make incremental changes. 5. Use version control.
  • 13. Best Practices for Scientific Computing: 1. Write programs for people, not computers. 2. Automate repetitive tasks. 3. Use the computer to record history. 4. Make incremental changes. 5. Use version control. 6. Don't repeat yourself (or others).
  • 14. Best Practices for Scientific Computing. Greg Wilson (Software Carpentry), D.A. Aruliah (University of Ontario Institute of Technology), C. Titus Brown (Michigan State University), Neil P. Chue Hong (Software Sustainability Institute), Matt Davis (Space Telescope Science Institute), Richard T. Guy (University of Toronto), Steven H.D. Haddock (Monterey Bay Aquarium Research Institute), Katy Huff (University of Wisconsin), Ian M. Mitchell (University of British Columbia), Mark D. Plumbley (Queen Mary University of London), Ben Waugh (University College London), Ethan P. White (Utah State University), Paul Wilson (University of Wisconsin). arXiv:1210.0530v3 [cs.MS] 29 Nov 2012
    Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.
    Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems. Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific …
    1. Write programs for people, not computers.
    2. Automate repetitive tasks.
    3. Use the computer to record history.
    4. Make incremental changes.
    5. Use version control.
    6. Don’t repeat yourself (or others).
    7. Plan for mistakes.
  • 15. Best Practices for Scientific Computing • 8. Optimize software only after it works correctly.
  • 16. Best Practices for Scientific Computing • 9. Document the design and purpose of code rather than its mechanics.
  • 17. Best Practices for Scientific Computing • 10. Conduct code reviews.
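As one illustration of practice 7 (plan for mistakes), here is a minimal R sketch of defensive assertions. The function name `mean_allele_frequency` and the rule that frequencies must lie in [0, 1] are assumptions made up for this example; they do not come from the paper or the slides.

```r
# Hypothetical helper: average a vector of allele frequencies.
mean_allele_frequency <- function(freqs) {
  # Fail early, with an error, on input that cannot be right.
  stopifnot(is.numeric(freqs), length(freqs) > 0)
  stopifnot(all(!is.na(freqs)), all(freqs >= 0 & freqs <= 1))

  result <- mean(freqs)

  # Check the result too: a mean of values in [0, 1] must stay in [0, 1].
  stopifnot(result >= 0, result <= 1)
  result
}

ok_result <- mean_allele_frequency(c(0.1, 0.4, 0.25))

# An impossible input stops with an error instead of returning nonsense.
bad_input <- tryCatch(mean_allele_frequency(c(0.1, 1.7)),
                      error = function(e) "rejected")
```

The point of the sketch is that a wrong input is caught where it enters the function, not several analysis steps later.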
  • 19. Ruby (or maybe Python). “Friends don’t let friends do Perl” - reddit user
  • 20. Programming better • “being able to use, understand, and improve your code in 6 months & in 60 years” - approximate Damian Conway
  • 21. Programming better • variable naming
  • 22. Programming better • coding width: 100 characters
  • 23. Programming better • indenting
  • 24. Programming better • Follow conventions, e.g. “Google R Style” or https://github.com/hadley/devtools/wiki/Style
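A small R sketch of what naming, line width, and indenting buy you. Both functions do the same thing; `filter_above_threshold` and its argument names are hypothetical and chosen only for illustration.

```r
# Hard to maintain: cryptic names, no indentation, everything on one line.
f <- function(x,y){r<-c();for(i in seq_along(x)){if(x[i]>y){r<-c(r,x[i])}};r}

# Easier to maintain: descriptive names, two-space indents, short lines.
filter_above_threshold <- function(values, threshold) {
  kept <- c()
  for (value in values) {
    if (value > threshold) {
      kept <- c(kept, value)
    }
  }
  kept
}

short_version <- f(c(1, 5, 3), 2)
readable_version <- filter_above_threshold(c(1, 5, 3), 2)
```

The second version is the one you will still be able to read in 6 months.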
  • 25. Programming better • Versioning: Dropbox & http://github.com/
  • 27. Programming better • Automated testing, e.g.:
        preprocess_snps <- function(snp_table, testing=FALSE) {
          if (testing) {
            # run a bunch of tests of extreme situations.
            # quit if a test gives a weird result.
          }
          # real part of function.
        }
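The skeleton on the slide can be filled in along these lines. This is only a sketch: the input layout (a data frame with an `allele_freq` column) and the filtering rule are assumptions invented for the example, not the real preprocessing.

```r
# Hypothetical input: one row per SNP, with an `allele_freq` column.
preprocess_snps <- function(snp_table, testing = FALSE) {
  if (testing) {
    # Run the function on small extreme inputs; stop on a weird result.
    empty <- data.frame(snp_id = character(0), allele_freq = numeric(0))
    stopifnot(nrow(preprocess_snps(empty)) == 0)

    all_missing <- data.frame(snp_id = c("rs1", "rs2"),
                              allele_freq = c(NA_real_, NA_real_))
    stopifnot(nrow(preprocess_snps(all_missing)) == 0)

    monomorphic <- data.frame(snp_id = "rs3", allele_freq = 0)
    stopifnot(nrow(preprocess_snps(monomorphic)) == 0)
  }

  # Real part of the function: drop SNPs with a missing allele frequency
  # and SNPs that do not vary (frequency of exactly 0 or 1).
  keep <- !is.na(snp_table$allele_freq) &
    snp_table$allele_freq > 0 & snp_table$allele_freq < 1
  snp_table[keep, , drop = FALSE]
}

snps <- data.frame(snp_id = c("rs10", "rs11", "rs12"),
                   allele_freq = c(0.2, NA, 1))
cleaned <- preprocess_snps(snps, testing = TRUE)
```

Calling the function with `testing = TRUE` runs the checks first and then the real work, so the tests travel with the code. A dedicated unit testing package would be the next step once the checks grow.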
  • 29. Take notes in Markdown; convert to HTML, PDF, …
  • 31. knitr (sweave): Analyzing & Reporting in a single file. MyFile.Rnw
  • 32. knitr (sweave): Analyzing & Reporting in a single file. MyFile.Rnw:
        \documentclass{article}
        \usepackage[sc]{mathpazo}
        \usepackage[T1]{fontenc}
        \usepackage{url}
        \begin{document}

        <<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
        # this is equivalent to \SweaveOpts{...}
        opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
        options(replace.assign=TRUE,width=90)
        @

        \title{A Minimal Demo of knitr}
        \author{Yihui Xie}
        \maketitle

        You can test if \textbf{knitr} works with this minimal demo. OK, let's get started with some boring random numbers:

        <<boring-random,echo=TRUE,cache=TRUE>>=
        set.seed(1121)
        (x=rnorm(20))
        mean(x);var(x)
        @

        The first element of \texttt{x} is \Sexpr{x[1]}. Boring boxplots and histograms recorded by the PDF device:

        <<boring-plots,cache=TRUE,echo=TRUE>>=
        ## two plots side by side
        par(mar=c(4,4,.1,.1),cex.lab=.95,cex.axis=.9,mgp=c(2,.7,0),tcl=-.3,las=1)
        boxplot(x)
        hist(x,main='')
        @

        Do the above chunks work? You should be able to compile the \TeX{} …
  • 33. knitr (sweave): Analyzing & Reporting in a single file.
        ### in R:
        library(knitr)
        knit("MyFile.Rnw")     # --> creates MyFile.tex
        ### in shell:
        pdflatex MyFile.tex    # --> creates MyFile.pdf
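To try the knit step without the full demo file, a self-contained R sketch follows. The tiny document it writes is a made-up stand-in for MyFile.Rnw, and the `requireNamespace` guard is an assumption so that the script still runs where knitr is not installed.

```r
# A tiny, hypothetical .Rnw document written to a temporary directory.
rnw_path <- file.path(tempdir(), "MyFile.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<boring-random, echo=TRUE>>=",
  "set.seed(1121)",
  "x <- rnorm(20)",
  "mean(x)",
  "@",
  "The first element of \\texttt{x} is \\Sexpr{x[1]}.",
  "\\end{document}"
), rnw_path)

# Knit only when knitr is installed; otherwise tex_path stays NULL.
tex_path <- NULL
if (requireNamespace("knitr", quietly = TRUE)) {
  tex_path <- knitr::knit(rnw_path,
                          output = file.path(tempdir(), "MyFile.tex"),
                          quiet = TRUE)
}
```

Running pdflatex on the resulting .tex file, as on the slide, should then produce the PDF.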
  • 34. knitr (sweave): Analyzing & Reporting in a single file. Rendered output (MyFile.pdf):
        A Minimal Demo of knitr
        Yihui Xie
        February 26, 2012
        You can test if knitr works with this minimal demo. OK, let’s get started with some boring random numbers:
        set.seed(1121)
        (x <- rnorm(20))
        ## [1] 0.14496 0.43832 0.15319 1.08494 1.99954 -0.81188 0.16027 0…
        ## [10] -0.02531 0.15088 0.11008 1.35968 -0.32699 -0.71638 1.80977 0…
        ## [19] 0.13272 -0.15594
        mean(x)
        ## [1] 0.3217
        var(x)