For a Bioinformatics Discussion for Students and Post-Docs (BioDSP) meeting: Expands on Sandve's "Ten Simple Rules for Reproducible Computational Research"
Reproducibility: 10 Simple Rules
1. Reproducibility: 10 Simple Rules, and more!
Sandve, Geir Kjetil, et al. "Ten simple rules for reproducible computational research." PLoS computational biology 9.10 (2013): e1003285.
2. Rule 1: For Every Result, Keep Track of How It Was Produced
http://xkcd.com/
3. Rule 2: Avoid Manual Data Manipulation Steps
• “Stop clicking, start typing” – Matt Frost, Charlottesville, VA
• Use scripts for even small changes
• Split commonly used code off into functions/classes, and put these into libraries
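As a minimal sketch of "stop clicking, start typing": a hand-edit such as "delete the header row and fix the decimal commas" captured as a small, reusable function instead of a manual spreadsheet step. The file names and the cleaning rule are illustrative, not from the talk.

```python
import csv

def clean_row(row):
    """Normalize decimal commas to decimal points in every field."""
    return [field.replace(",", ".") for field in row]

def clean_file(src, dst, skip_header=True):
    """Apply clean_row to every data row of src, writing the result to dst."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        if skip_header:
            next(reader, None)  # drop the header row once, scripted
        for row in reader:
            writer.writerow(clean_row(row))
```

Because the step is a function, it can be rerun on regenerated raw data and reused across projects, which a one-off manual edit cannot.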
4. Rule 3: Archive the Exact Versions of All External Programs Used
• Level 0: Note names and versions of all packages
• Level 1: Use a package management system (packrat, anaconda/conda)
• Boss Level: Save an image of the entire system
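"Level 0" can be automated rather than done by hand. One possible sketch, using the standard-library `importlib.metadata` (Python 3.8+), records the installed version of each named package so the list can be written next to your results; the package names passed in are just examples.

```python
from importlib import metadata

def package_versions(names):
    """Return {package_name: installed_version} for the given packages."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

# e.g. save package_versions(["numpy", "pandas"]) alongside each analysis run
```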
5. Rule 4: Version Control All Custom Scripts
http://www.slideshare.net/sjcockell/reproducibility-the-myths-and-truths-of-pipeline-bioinformatics
• Also, version control your workflows (what are good workflow management systems, anyway?)
• Use the commit message to write something useful to your future self (“pwew pwew pwew” is not useful)
6. Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
• “Explicit is better than implicit” – Tim Peters, The Zen of Python
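A small sketch of this rule: dump an intermediate result in a standardized, text-based format (JSON here) rather than a language-specific binary one, so any tool, and any future reader, can inspect and reload it. The file layout is an assumption for illustration.

```python
import json

def save_intermediate(result, path):
    """Write a dict of intermediate values as pretty-printed JSON."""
    with open(path, "w") as f:
        json.dump(result, f, indent=2, sort_keys=True)

def load_intermediate(path):
    """Reload an intermediate result saved by save_intermediate."""
    with open(path) as f:
        return json.load(f)
```

Explicit, human-readable intermediates also make it obvious where a pipeline went wrong when a downstream step misbehaves.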
7. Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
• This goes for all parameters that may change
• Separate code from configuration, e.g. use config files (another gift to your future self!)
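One way to sketch both bullets at once: keep the seed and the other tunable parameters in a config file, and have the analysis read them from there. The config keys (`seed`, `n_samples`) and the JSON format are illustrative choices, not prescribed by the talk.

```python
import json
import random

def run_analysis(config_path):
    """Run a toy random sampling analysis driven entirely by a config file."""
    with open(config_path) as f:
        cfg = json.load(f)
    random.seed(cfg["seed"])  # noted seed -> reproducible randomness
    return [random.random() for _ in range(cfg["n_samples"])]
```

Running the same config twice yields the exact same samples, and changing a parameter means editing the config file, not the code.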
8. Rule 7: Always Store Raw Data behind Plots
• (and the plot-generating code, too)
• Make raw data read-only
• Use separate folders for raw and pre-processed data
https://inspguilfoyle.wordpress.com/2014/02/19/straight-lines/
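Two of the bullets above can be sketched in a few lines of standard-library Python: stripping write permission from raw data files so they cannot be edited by accident, and saving the exact numbers behind a figure next to the figure itself. The paths and JSON layout are illustrative assumptions.

```python
import os
import stat
import json

def make_read_only(path):
    """Remove write permission so a raw data file can't be edited by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~stat.S_IWUSR & ~stat.S_IWGRP & ~stat.S_IWOTH)

def save_plot_data(xs, ys, path):
    """Store the exact values behind a figure as JSON next to it."""
    with open(path, "w") as f:
        json.dump({"x": list(xs), "y": list(ys)}, f)
```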
9. Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
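A minimal sketch of hierarchical output, assuming a results dict keyed by item name: a one-line-per-item summary file at the top level, with full per-item detail files in a subfolder underneath, so a reader can drill down only when they need to. The `summary.json`/`detail/` layout is an illustrative convention, not one from the talk.

```python
import os
import json

def write_hierarchical(results, outdir):
    """results: {name: {"score": float, ...extra detail...}}"""
    os.makedirs(os.path.join(outdir, "detail"), exist_ok=True)
    # Top layer: one headline number per item.
    summary = {name: r["score"] for name, r in results.items()}
    with open(os.path.join(outdir, "summary.json"), "w") as f:
        json.dump(summary, f, indent=2)
    # Deeper layer: everything we know about each item.
    for name, r in results.items():
        with open(os.path.join(outdir, "detail", f"{name}.json"), "w") as f:
            json.dump(r, f, indent=2)
```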
11. Rule 10: Provide Public Access to Scripts, Runs, and Results
• GitHub
• Synapse
• Open Science Framework
• ReadTheDocs
• RunMyCode
• ???
12. Documentation
• Is it clear where to begin? (e.g., can someone picking a project up see where to start running it?)
• Can you determine which file(s) was/were used as input in a process that produced a derived file?
• Who do I cite? (code, data, etc.)
• Is there documentation about every result?
• Have you noted the exact version of every external application used in the process?
• For analyses that include randomness, have you noted the underlying random seed(s)?
• Have you specified the license under which you're distributing your content, data, and code?
• Have you noted the license(s) for other people's content, data, and code used in your analysis?
http://ropensci.github.io/reproducibility-guide/sections/checklist/
13. Organization
• Which is the most recent data file/code?
• Which folders can I safely delete?
• Do you keep older files/code or delete them?
• Can you find a file for a particular replicate of your research project?
• Have you stored the raw data behind each plot?
• Is your analysis output done hierarchically? (allowing others to find more detailed output underneath a summary)
• Do you run backups on all files associated with your analysis?
• How many times has a particular file been generated in the past? Why was the same file generated multiple times?
• Where did a file that I didn't generate come from?
http://ropensci.github.io/reproducibility-guide/sections/checklist/
14. Automation
• Are there lots of manual data manipulation steps?
• Are all custom scripts under version control?
• Is your writing (content) under version control?
http://ropensci.github.io/reproducibility-guide/sections/checklist/
15. Publication
• Have you archived the exact version of every external application used in your process(es)?
• Did you include a reproducibility statement or declaration at the end of your paper(s)?
• Are textual statements connected/linked to the supporting results or data?
• Did you archive preprints of resulting papers in a public repository?
• Did you release the underlying code at the time of publishing a paper?
• Are you providing public access to your scripts, runs, and results?
http://ropensci.github.io/reproducibility-guide/sections/checklist/
16. Best Practices for Scientific Computing
Write programs for people, not computers.
Let the computer do the work.
Make incremental changes.
DRY: Don’t repeat yourself (or others).
Plan for mistakes. (“Defensive Programming”)
Use pair programming.
Wilson, Greg, et al. "Best practices for scientific computing." PLoS biology 12.1 (2014): e1001745.
17. Best Practices for Scientific Computing (continued)
Document design and purpose, not mechanics.
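The "plan for mistakes" bullet above can be sketched as defensive programming in a few lines: check your assumptions loudly at a function's boundary instead of letting bad input propagate into silently wrong results. The function name and the coverage example are hypothetical, chosen only to fit a bioinformatics setting.

```python
def mean_coverage(depths):
    """Mean read depth across positions; fails fast on invalid input."""
    if not depths:
        raise ValueError("depths must be non-empty")
    if any(d < 0 for d in depths):
        raise ValueError("read depths cannot be negative")
    return sum(depths) / len(depths)
```

Failing immediately with a clear message is a gift to your future self: the error surfaces at the point of the mistake, not three pipeline stages later.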
18. Suggested Training Topics
• version control and use of online repositories
• modern programming practice, including unit testing and regression testing
• maintaining “notebooks” or “research compendia”
• recording the provenance of final results relative to code and/or data
• numerical / floating point reproducibility and nondeterminism
• reproducibility on parallel systems
• dealing with large datasets
• dealing with complicated software stacks and use of virtual machines
• documentation and literate programming
• IP and licensing issues, proper citation and attribution
http://icerm.brown.edu/tw12-5-rcem/
19. Resources
• http://projecttemplate.net/ - Project automation (R)
• http://www.nature.com/news/2010/101013/full/467753a.html – Publish your computer code: it is good enough
• http://www.carlboettiger.info/ - Open lab notebook
• http://wiki.stodden.net/ICERM_Reproducibility_in_Computational_and_Experimental_Mathematics:_Readings_and_References
• http://rrcns.readthedocs.org/ - Best practices tutorial
• http://www.bioinformaticszen.com/