For a Bioinformatics Discussion for Students and Post-Docs (BioDSP) meeting: Expands on Sandve's "Ten Simple Rules for Reproducible Computational Research"
Reproducibility: 10 Simple Rules
1. Reproducibility: 10 Simple Rules, and more!
Sandve, Geir Kjetil, et al. "Ten simple rules for reproducible computational research." PLoS computational biology 9.10 (2013): e1003285.
2. Rule 1: For Every Result, Keep Track of How It Was Produced
http://xkcd.com/
3. Rule 2: Avoid Manual Data Manipulation Steps
• “Stop clicking, start typing” – Matt Frost, Charlottesville, VA
• Use scripts for even small changes
• Split commonly used code off into functions/classes, and put these into libraries
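As a minimal sketch of "stop clicking, start typing": a hand-edit such as "delete the header row and fix the decimal commas" captured as a small, reusable function instead of a manual spreadsheet step. The file names and the cleaning rule are illustrative, not from the talk.

```python
import csv

def clean_row(row):
    """Normalize decimal commas to decimal points in every field."""
    return [field.replace(",", ".") for field in row]

def clean_file(src, dst, skip_header=True):
    """Apply clean_row to every data row of src, writing the result to dst."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        if skip_header:
            next(reader, None)  # drop the header row once, scripted
        for row in reader:
            writer.writerow(clean_row(row))
```

Because the step is a function, it can be rerun on regenerated raw data and reused across projects, which a one-off manual edit cannot.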
4. Rule 3: Archive the Exact Versions of All External Programs Used
• Level 0: Note names and versions of all packages
• Level 1: Use a package management system (packrat, anaconda/conda)
• Boss Level: Save an image of the entire system
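"Level 0" can be automated rather than done by hand. One possible sketch, using the standard-library `importlib.metadata` (Python 3.8+), records the installed version of each named package so the list can be written next to your results; the package names passed in are just examples.

```python
from importlib import metadata

def package_versions(names):
    """Return {package_name: installed_version} for the given packages."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

# e.g. save package_versions(["numpy", "pandas"]) alongside each analysis run
```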
5. Rule 4: Version Control All Custom Scripts
http://www.slideshare.net/sjcockell/reproducibility-the-myths-and-truths-of-pipeline-bioinformatics
• Also, version control your workflows (what are good workflow management systems, anyway?)
• Use the commit message to write something useful to your future self (“pwew pwew pwew” is not useful)
6. Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
• “Explicit is better than implicit” – Tim Peters, The Zen of Python
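A small sketch of this rule: dump an intermediate result in a standardized, text-based format (JSON here) rather than a language-specific binary one, so any tool, and any future reader, can inspect and reload it. The file layout is an assumption for illustration.

```python
import json

def save_intermediate(result, path):
    """Write a dict of intermediate values as pretty-printed JSON."""
    with open(path, "w") as f:
        json.dump(result, f, indent=2, sort_keys=True)

def load_intermediate(path):
    """Reload an intermediate result saved by save_intermediate."""
    with open(path) as f:
        return json.load(f)
```

Explicit, human-readable intermediates also make it obvious where a pipeline went wrong when a downstream step misbehaves.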
7. Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
• This goes for all parameters that may change
• Separate code from configuration, e.g. use config files (another gift to your future self!)
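One way to sketch both bullets at once: keep the seed and the other tunable parameters in a config file, and have the analysis read them from there. The config keys (`seed`, `n_samples`) and the JSON format are illustrative choices, not prescribed by the talk.

```python
import json
import random

def run_analysis(config_path):
    """Run a toy random sampling analysis driven entirely by a config file."""
    with open(config_path) as f:
        cfg = json.load(f)
    random.seed(cfg["seed"])  # noted seed -> reproducible randomness
    return [random.random() for _ in range(cfg["n_samples"])]
```

Running the same config twice yields the exact same samples, and changing a parameter means editing the config file, not the code.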
8. Rule 7: Always Store Raw Data behind Plots
• (and the plot-generating code, too)
• Make raw data read-only
• Use separate folders for raw and pre-processed data
https://inspguilfoyle.wordpress.com/2014/02/19/straight-lines/
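Two of the bullets above can be sketched in a few lines of standard-library Python: stripping write permission from raw data files so they cannot be edited by accident, and saving the exact numbers behind a figure next to the figure itself. The paths and JSON layout are illustrative assumptions.

```python
import os
import stat
import json

def make_read_only(path):
    """Remove write permission so a raw data file can't be edited by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~stat.S_IWUSR & ~stat.S_IWGRP & ~stat.S_IWOTH)

def save_plot_data(xs, ys, path):
    """Store the exact values behind a figure as JSON next to it."""
    with open(path, "w") as f:
        json.dump({"x": list(xs), "y": list(ys)}, f)
```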
9. Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
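A minimal sketch of hierarchical output, assuming a results dict keyed by item name: a one-line-per-item summary file at the top level, with full per-item detail files in a subfolder underneath, so a reader can drill down only when they need to. The `summary.json`/`detail/` layout is an illustrative convention, not one from the talk.

```python
import os
import json

def write_hierarchical(results, outdir):
    """results: {name: {"score": float, ...extra detail...}}"""
    os.makedirs(os.path.join(outdir, "detail"), exist_ok=True)
    # Top layer: one headline number per item.
    summary = {name: r["score"] for name, r in results.items()}
    with open(os.path.join(outdir, "summary.json"), "w") as f:
        json.dump(summary, f, indent=2)
    # Deeper layer: everything we know about each item.
    for name, r in results.items():
        with open(os.path.join(outdir, "detail", f"{name}.json"), "w") as f:
            json.dump(r, f, indent=2)
```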
11. Rule 10: Provide Public Access to Scripts, Runs, and Results
• GitHub
• Synapse
• Open Science Framework
• ReadTheDocs
• RunMyCode
• ???
12. Documentation
• Is it clear where to begin? (e.g., can someone picking a project up see where to start running it?)
• Can you determine which file(s) was/were used as input in a process that produced a derived file?
• Who do I cite? (code, data, etc.)
• Is there documentation about every result?
• Have you noted the exact version of every external application used in the process?
• For analyses that include randomness, have you noted the underlying random seed(s)?
• Have you specified the license under which you're distributing your content, data, and code?
• Have you noted the license(s) for other people's content, data, and code used in your analysis?
http://ropensci.github.io/reproducibility-guide/sections/checklist/
13. Organization
• Which is the most recent data file/code?
• Which folders can I safely delete?
• Do you keep older files/code or delete them?
• Can you find a file for a particular replicate of your research project?
• Have you stored the raw data behind each plot?
• Is your analysis output done hierarchically? (allowing others to find more detailed output underneath a summary)
• Do you run backups on all files associated with your analysis?
• How many times has a particular file been generated in the past? Why was the same file generated multiple times?
• Where did a file that I didn't generate come from?
http://ropensci.github.io/reproducibility-guide/sections/checklist/
14. Automation
• Are there lots of manual data manipulation steps?
• Are all custom scripts under version control?
• Is your writing (content) under version control?
http://ropensci.github.io/reproducibility-guide/sections/checklist/
15. Publication
• Have you archived the exact version of every external application used in your process(es)?
• Did you include a reproducibility statement or declaration at the end of your paper(s)?
• Are textual statements connected/linked to the supporting results or data?
• Did you archive preprints of resulting papers in a public repository?
• Did you release the underlying code at the time of publishing a paper?
• Are you providing public access to your scripts, runs, and results?
http://ropensci.github.io/reproducibility-guide/sections/checklist/
16. Best Practices for Scientific Computing
Write programs for people, not computers.
Let the computer do the work.
Make incremental changes.
DRY: Don’t repeat yourself (or others).
Plan for mistakes. (“Defensive Programming”)
Use pair programming.
Wilson, Greg, et al. "Best practices for scientific computing." PLoS biology 12.1 (2014): e1001745.
17. Best Practices for Scientific Computing (continued)
Document design and purpose, not mechanics.
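The "plan for mistakes" bullet above can be sketched as defensive programming in a few lines: check your assumptions loudly at a function's boundary instead of letting bad input propagate into silently wrong results. The function name and the coverage example are hypothetical, chosen only to fit a bioinformatics setting.

```python
def mean_coverage(depths):
    """Mean read depth across positions; fails fast on invalid input."""
    if not depths:
        raise ValueError("depths must be non-empty")
    if any(d < 0 for d in depths):
        raise ValueError("read depths cannot be negative")
    return sum(depths) / len(depths)
```

Failing immediately with a clear message is a gift to your future self: the error surfaces at the point of the mistake, not three pipeline stages later.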
18. Suggested Training Topics
• version control and use of online repositories
• modern programming practice, including unit testing and regression testing
• maintaining “notebooks” or “research compendia”
• recording the provenance of final results relative to code and/or data
• numerical / floating point reproducibility and nondeterminism
• reproducibility on parallel systems
• dealing with large datasets
• dealing with complicated software stacks and use of virtual machines
• documentation and literate programming
• IP and licensing issues, proper citation and attribution
http://icerm.brown.edu/tw12-5-rcem/
19. Resources
• http://projecttemplate.net/ - Project automation (R)
• http://www.nature.com/news/2010/101013/full/467753a.html – Publish your computer code: it is good enough
• http://www.carlboettiger.info/ - Open lab notebook
• http://wiki.stodden.net/ICERM_Reproducibility_in_Computational_and_Experimental_Mathematics:_Readings_and_References
• http://rrcns.readthedocs.org/ - Best practices tutorial
• http://www.bioinformaticszen.com/