This talk was presented at the first annual Duke Docker Day, hosted by the Duke Office of Information Technology. It describes a reproducible analysis pipeline built with Docker images.
Duke Docker Day 2014: Research Applications with Docker
1. Research Analysis Applications with Docker: Automate Analyses, Reuse Them, Allow Reproducibility
Duke Office of Information Technology
9/11/14 | Duke Docker Day
2. Docker Concepts
▪ Build context: a directory containing a Dockerfile, plus any files to be added to the image being built
▪ Image:
▫ Like a VM image
▫ Run to produce a container
▫ Multiple containers can be produced from the same image
▫ Images are shared on a hub
▫ The foundation of reusability and reproducibility
▪ Container:
▫ Like a VM instance
▫ A running instance of an image
▪ Hub:
▫ A network-accessible repository of named Docker images
▫ https://registry.hub.docker.com is the world's repo of Docker images
▫ Can be hosted internally for private images
▫ The docker command line is aware of a hub (registry.hub.docker.com by default, but configurable)
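The concepts above fit together in a minimal build context. Here is a hedged sketch of the pattern, not a file from the talk (the image name, script name, and tag are illustrative):

```dockerfile
# Dockerfile at the root of the build context
FROM centos:centos6                        # base image pulled from the hub

# ADD pulls a file from the build context into the image
ADD analyze.sh /usr/local/bin/analyze.sh
RUN chmod +x /usr/local/bin/analyze.sh

# Default process to start when the image is run as a container
CMD ["/usr/local/bin/analyze.sh"]
```

Building this context with `docker build -t myuser/analysis .` produces an image; each `docker run myuser/analysis` produces a fresh container from it.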
3. Docker commandline interface
▪ Requires sudo (unless on a Mac, or specially configured by sysadmins)
▪ https://docs.docker.com/reference/commandline/cli/
▪ https://docs.docker.com/reference/run/
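The everyday subcommands from those reference pages look like this in practice (image and container names are placeholders):

```shell
# Build an image from the Dockerfile in the current build context
sudo docker build -t myuser/myimage .

sudo docker images                   # list local images
sudo docker run -i -t myuser/myimage # start a container from the image
sudo docker ps -a                    # list containers, including exited ones
sudo docker rm mycontainer           # remove a container
sudo docker push myuser/myimage      # share the image on the hub
```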
4. Dockerfile
▪ https://docs.docker.com/reference/builder/
• http://devo.ps/blog/docker-dos-and-donts/
• There is a tension between many small RUN statements vs. a single RUN of one big, all-inclusive installation process (shell, puppet, ansible, etc.)
• Many RUN statements can become hard to maintain
• A single installation process loses all the benefits of layer caching
• Look for the golden mean: perhaps run multiple installation processes, each adding related functionality as a group
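One way to aim for that golden mean is one RUN per functional group, so each group caches as its own layer (package choices here are illustrative):

```dockerfile
FROM centos:centos6

# Group 1: base build tooling, cached as one layer
RUN yum install -y gcc make tar gzip

# Group 2: analysis tools, a separate layer that can be rebuilt
# without invalidating the tooling layer above
RUN yum install -y bwa samtools
```

Editing group 2 re-runs only that layer; a single all-in-one RUN would rebuild everything on every change.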
5. DEMO: Plasmodium Alignment
A Research Analysis Pipeline WITH a Reproducible Exemplar!
https://github.com/dmlond/docker_bwa_aligner
6. What is a Docker Application?
▪ Wraps the logic for exposing a single process interface (it may have many processes running in the background, but generally exposes only one process to the user)
▪ Can be run much like an installed application
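With an entrypoint in place, arguments after the image name go straight to the wrapped program, so invoking the container feels like invoking an installed command. A hedged illustration (the flag shown is not necessarily one the script accepts):

```shell
# --rm discards the container after the single process exits
sudo docker run --rm dmlond/bwa_aligner --help
```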
7. Example: dmlond/bwa_aligner
▪ It’s a perl script
▫ In a container built to have its own special *nix environment
▪ Starts FROM centos:centos6
▪ Adds its own user ‘bwa_user’ with its own HOMEDIR /home/bwa_user
▪ Adds the EPEL repo
▪ Adds bwa and samtools from EPEL using yum (one could download source and compile just as easily)
▫ Hosted on github so you can view its build context and build it yourself from scratch: https://github.com/dmlond/bwa_aligner
▫ Hosted on dockerhub so you can run it on your own machine: https://registry.hub.docker.com/u/dmlond/bwa_aligner
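Reconstructed from the bullet points above, the Dockerfile might look roughly like this; it is a sketch, not the actual file (the script name and exact install commands are assumptions — see the github build context for the real thing):

```dockerfile
FROM centos:centos6

# Add the EPEL repo, then bwa and samtools from it
RUN yum install -y epel-release
RUN yum install -y bwa samtools

# Run as an unprivileged user with its own HOMEDIR
RUN useradd -m bwa_user
USER bwa_user
WORKDIR /home/bwa_user

# Expose the perl script as the single process interface
ADD bwa_aligner.pl /home/bwa_user/bwa_aligner.pl
ENTRYPOINT ["perl", "/home/bwa_user/bwa_aligner.pl"]
```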
8. What is a Volume Container?
• Its image contains the logic for exposing one or more distinct directory trees to other Docker containers
• Running the image to produce a container exposes its own version of the specified directory tree
• A volume container can run and immediately exit, but its specific directory tree stays around for use in other containers
• Designed to be run with --name $name
• Other containers access a volume container’s exposed directory trees by passing its $name at run time using the --volumes-from run parameter
• When you rm a volume container, all files in its specific directory tree are destroyed
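That lifecycle looks like this on the command line (the container name is illustrative; the image names are the ones used in this talk's exemplar):

```shell
# Create the named volume container; it exits immediately,
# but the directory tree it exposes persists
sudo docker run --name reference_volume dmlond/bwa_reference_volume

# Other containers mount that tree by name with --volumes-from
sudo docker run --volumes-from reference_volume dmlond/bwa_aligner

# rm destroys the volume container AND the files in its tree
sudo docker rm reference_volume
```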
9. Example: dmlond/bwa_reference_volume
▪ Its Dockerfile exposes /home/bwa_user/bwa_indexed
▪ When the container runs (with a name), it exits immediately
▪ A container can add files to the volume container directory (dmlond/bwa_reference)
▪ A container can read files in the volume container directory (dmlond/bwa_aligner)
▪ Each container created from the image has its own distinct existence: writes to the /home/bwa_user/bwa_indexed directory tree in one container do not affect the directory trees in other dmlond/bwa_reference_volume containers
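The volume container image itself can be tiny. A hedged sketch of the pattern (not the actual Dockerfile, which lives in the github build context):

```dockerfile
FROM centos:centos6
RUN useradd -m bwa_user

# VOLUME exposes this directory tree to other containers
VOLUME /home/bwa_user/bwa_indexed

# Nothing long-running: the container exits immediately when run,
# but the tree persists until the container is rm'd
CMD ["/bin/true"]
```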
10. FROM Here to Eternity and Beyond
▪ You can extend an existing image to have new functionality using your own build context
▪ Use intermediate container names, and tagging
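Extending means starting a new build context FROM the existing image, then tagging the result. A sketch under assumptions (the agent script name and tag are illustrative):

```dockerfile
# Dockerfile in the new build context
FROM dmlond/bwa_aligner

# Layer agent behavior on top of the existing aligner image
ADD agent.pl /home/bwa_user/agent.pl
ENTRYPOINT ["perl", "/home/bwa_user/agent.pl"]
```

Build and tag it with `sudo docker build -t myuser/bwa_aligner_agent .`; the original dmlond/bwa_aligner image is untouched.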
11. DEMO: Agents
Reusing and Extending the Applications from the Plasmodium Alignment Exemplar!
https://github.com/dmlond/split_agent
[ https://github.com/dmlond/split_raw/blob/master/split_raw.pl ]
https://github.com/dmlond/bwa_aligner_agent
[ https://github.com/dmlond/bwa_aligner ]
12. Old School *nix is Cool Again!
(For better or worse)
▪ STDIN, STDOUT, STDERR
▪ $?, the exit status
▪ Wrapper scripts
▪ Usage statements
▪ Building a containerized app feels like compiling a C application: edit, build, run, repeat
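The wrapper-script pattern above can be sketched in a few lines of portable shell (the function name is illustrative):

```shell
# A minimal wrapper-script pattern: print a usage statement when
# called without arguments, run the wrapped command, then capture
# and propagate its exit status via $?
run_wrapped() {
    if [ "$#" -lt 1 ]; then
        echo "Usage: run_wrapped <command> [args...]" >&2
        return 64
    fi
    "$@"          # run the wrapped command; STDIN/STDOUT pass through
    status=$?     # $? holds the exit status of the last command
    echo "wrapped command exited with status $status" >&2
    return "$status"
}

run_wrapped true   # succeeds; reports status 0 on STDERR
```

The same edit, build, run loop applies whether the script runs on the host or as a container entrypoint.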
13. Security
▪ Unlike a traditional VM, a Docker container can access some host resources
▪ DO NOT RUN AS ROOT BY DEFAULT!
▫ USER + ENTRYPOINT + CMD + WORKDIR
▫ These can be overridden at run time
▫ These can be overridden in new containers starting FROM them to ‘extend’ them
▪ Be very specific with commands; rely on wildcards and shell/exec commands sparingly
▪ Use the same paranoid practices in your container apps that you use in web/cgi applications:
▫ use the list form, open(my $fh, "-|", "cmd", "arg1", "arg2"), instead of the shell form, open(my $fh, "cmd arg1 arg2 |")
▫ check for tainted input
▫ watch for wildcards in filenames, especially if doing chmod, chown, chgrp, rsync, etc. ( http://www.defensecode.com/public/DefenseCode_Unix_WildCards_Gone_Wild.txt )
http://opensource.com/business/14/9/security-for-docker
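The USER + ENTRYPOINT + CMD + WORKDIR pattern named above looks like this; a minimal sketch, assuming a hypothetical app.pl script:

```dockerfile
FROM centos:centos6
RUN useradd -m app_user

# Drop root before the application ever runs
USER app_user
WORKDIR /home/app_user

# Be specific: exec form, no shell, no wildcards
ADD app.pl /home/app_user/app.pl
ENTRYPOINT ["/usr/bin/perl", "/home/app_user/app.pl"]
CMD ["--help"]
```

ENTRYPOINT fixes what runs; CMD supplies only default arguments, which callers can override at run time.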
14. Acknowledgements:
▪ Duke Office of Research Informatics
▪ ORI Research Application Development Group
▪ Duke Office of Information Technology
▪ Mark Delong (Duke Research Computing)
▪ Chris Collins (OIT)
▪ Erich Huang (Duke School of Medicine)
▪ Greg Crawford (Genomics and Computational Biology)
▪ Rutger Vos (Naturalis)
15. References
1. Stodden VC. Reproducible research: Addressing the need for data and code sharing in computational science. Computing in Science & Engineering. 2010.
2. Stodden V, Guo P, Ma Z. Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals. PLoS ONE. 2013;8(6):e67111.
3. Collins FS, Tabak LA. Policy: NIH plans to enhance reproducibility. Nature. 2014;505:612–613.
4. Announcement: Reducing our irreproducibility. Nature. 2013;496(7446):398.
5. Ince DC, Hatton L, Graham-Cumming J. The Case for Open Computer Programs. Nature. 2012;482:485–488.
6. Dudley JT, Butte AJ. A Quick Guide for Developing Effective Bioinformatics Programming Skills. PLoS Comput Biol. 2009;5(12):e1000589. doi:10.1371/journal.pcbi.1000589.
7. Noble WS. A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol. 2009;5(7):e1000424. doi:10.1371/journal.pcbi.1000424.
17. https://www.docker.com/whatisdocker
▪ An open platform for developers and sysadmins to build, ship, and run distributed applications
▪ Consists of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows
▪ Enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments
▪ IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud
21. Researchers should also like Docker
Computation is becoming more prevalent:
▪ "Computation is becoming central to the scientific enterprise, but the prevalence of relaxed attitudes about communicating computational experiments’ details and the validation of results is causing a large and growing credibility gap.” (1)
▪ “To adhere to the scientific method in the face of the transformations arising from changes in technology and the Internet, we must be able to reproduce computational results.” (1)
Granting agencies and journals have begun to take note:
▪ 2012 saw a one-year increase of 16% in the number of data policies, a 30% increase in code policies, and a 7% increase in the number of supplemental materials policies in journals (2)
▪ NIH has introduced new mandatory training modules and reviewer checklists (3)
▪ Nature has introduced checklists to enhance reproducibility (4)
http://melissagymrek.com/science/2014/08/29/docker-reproducible-research.html
22. What about Good Old Excel Spreadsheets?
Benefits
▪ Reusable
▪ Reproducible
▪ Shareable
▪ Code and data stored in one convenient package
Problems
▪ $$$
▪ Only works on MS Windows and OSX*
▪ Easy to share data not intended for sharing (PHI accidentally left in another worksheet)
▪ Inter-version incompatibilities
▪ Does not scale to big data
▪ Security (macros and viruses)
23. Free and Open Source Code
Benefits
▪ Free for anyone
▪ Code can easily be shared using online repositories (github, sourceforge, etc.), separately from data, and without cost to publisher or peers
▪ Can scale to big data
Problems
▪ Inter-version incompatibilities
▪ Difficult to fully specify software dependencies (especially when moving between architectures and OSes)
▪ Dependency clashes between libraries required by different applications
▪ Data must be structured rigorously, and code must be written in a special way to facilitate automation and reproducibility (6,7)
▪ Code and data distribution must be managed independently
▪ Code can get stale without routine maintenance
“we have reached the point that, with some exceptions, anything less than release of actual source code is an indefensible approach for any scientific results that depend on computation, because not releasing such code raises needless, and needlessly confusing, roadblocks to reproducibility.” (5)
24. Workflow Enactors (Taverna, Galaxy, …)
Benefits
▪ Easy to share workflows with others
▪ Reduces dependency clashes
▪ Can scale to big data with proper parallelization
Problems
▪ Dependence on web-accessible data (security, privacy)
▪ Emphasize web services over commandline applications
▪ Still have inter-version incompatibilities
25. Machine Images (Virtualbox, VMWare)?
Benefits
▪ Eliminates inter-version incompatibility issues
▪ Eliminates dependency clashes
▪ Can be shared between different servers (internal, amazon, google, etc.) running VM hosting technology
▪ Can spin up/tear down as many instances as needed
Problems
▪ Can get quite large
▪ Slow to start
▪ Must become proficient with server provisioning commands (package management, puppet/chef/ansible, etc.)
26. Docker
Benefits
▪ Similar to a VM
▪ Starts and stops much faster than a VM
▪ Docker images are smaller than equivalent VM images
Problems
▪ Must become proficient with server provisioning commands (package management, puppet/chef/ansible, etc.; google ‘DevOps’)
▪ Orchestration and enactment
▪ Docker images may provide access to host resources that are not available to VM images
Editor's notes
You have already seen views on the ways that Docker can be used to solve problems in many different domains of IT. This is why Docker has become so popular, so fast: it is a technology that meets different, unique needs in many different parts of the IT universe.
Imagine if you could
buy a computer
install a specific version of linux on it
Install only the linux packages needed by your application
Override the boot sequence to only allow the user to enter arguments for your application, run the application, store the results on a mounted file system, and exit
This is what Docker allows you to do
Emphasis on Distributed Applications makes it useful for sharing applications as well, get cloud ready apps for free
As you will see, componentization is the foundation for ‘reuse and repurposing’ of existing applications
‘Friction’ is what you experience each time you request an upgrade to R, Perl, Python, Java, or ask for a library such as one supporting HTTPS communication
Docker images are 99% immutable, and soon they will be signable. They can have PROV attached to them.
Excited about the way Docker allows them to more efficiently use their existing metal infrastructure to serve more applications
Excited about a radical new ways to build, deploy, and continuously improve their applications
All are excited about how Docker allows them to mitigate dependency clashes between different software products, and even different versions of the same software product
The standard that is currently emerging will, very soon, require researchers to make it possible for other researchers to run their analysis pipeline on their data and produce values that confirm the original research finding.
Ironically, as technologies such as Docker make this possible for you, they also raise the bar of what others expect from you.
While some might sneer at ‘Those Darn Excel Spreadsheets’, there are valid reasons why this software rose to dominate the research analysis world over the last four decades.
Here I have added a somewhat controversial viewpoint expressed in the pages of the journal Nature
Those developing processes to analyze big data will also have to overcome many of these issues, regardless of whether they are intending to share their code Open Source
Reduces dependency clashes, so long as users use the version of tools existing within the enactor compute nodes
It is no accident that discussions of reproducible research computation accelerated at about the same time that VM technologies began to emerge