This talk was presented at the first annual Duke Docker Day, hosted by the Duke Office of Information Technology. It describes a reproducible analysis pipeline built with Docker images.
Duke Docker Day 2014: Research Applications with Docker
1. Research Analysis Applications with Docker: Automate Analyses, Reuse Them, Allow Reproducibility
Duke Office of Information Technology
9/11/14 | Duke Docker Day
2. Docker Concepts
▪ Build context: a directory containing a Dockerfile, plus any files to be added to the image being built
▪ Image:
▫ Like a VM image
▫ Run to produce a container
▫ Multiple containers can be produced from the same image
▫ Images are shared on a hub
▫ The foundation of reusability and reproducibility
▪ Container:
▫ Like a VM instance
▫ A running instance of an image
▪ Hub:
▫ A network-accessible repository of named Docker images
▫ https://registry.hub.docker.com is the world's repo of Docker images
▫ Can be hosted internally for private images
▫ The docker command line is aware of a hub (registry.hub.docker.com by default, but configurable)
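The concepts above fit together in a minimal build context. Here is a hedged sketch of the pattern, not a file from the talk (the image name, script name, and tag are illustrative):

```dockerfile
# Dockerfile at the root of the build context
FROM centos:centos6                        # base image pulled from the hub

# ADD pulls a file from the build context into the image
ADD analyze.sh /usr/local/bin/analyze.sh
RUN chmod +x /usr/local/bin/analyze.sh

# Default process to start when the image is run as a container
CMD ["/usr/local/bin/analyze.sh"]
```

Building this context with `docker build -t myuser/analysis .` produces an image; each `docker run myuser/analysis` produces a fresh container from it.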
3. Docker commandline interface
▪ Requires sudo (unless on a Mac, or specially configured by sysadmins)
▪ https://docs.docker.com/reference/commandline/cli/
▪ https://docs.docker.com/reference/run/
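The everyday subcommands from those reference pages look like this in practice (image and container names are placeholders):

```shell
# Build an image from the Dockerfile in the current build context
sudo docker build -t myuser/myimage .

sudo docker images                   # list local images
sudo docker run -i -t myuser/myimage # start a container from the image
sudo docker ps -a                    # list containers, including exited ones
sudo docker rm mycontainer           # remove a container
sudo docker push myuser/myimage      # share the image on the hub
```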
4. Dockerfile
▪ https://docs.docker.com/reference/builder/
• http://devo.ps/blog/docker-dos-and-donts/
• There is a tension between many small RUN statements vs. a single RUN of one big, all-inclusive installation process (shell, puppet, ansible, etc.)
• Many RUN statements can become hard to maintain
• A single installation process loses all the benefits of layer caching
• Look for the golden mean: perhaps run multiple installation processes, each adding related functionality as a group
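One way to aim for that golden mean is one RUN per functional group, so each group caches as its own layer (package choices here are illustrative):

```dockerfile
FROM centos:centos6

# Group 1: base build tooling, cached as one layer
RUN yum install -y gcc make tar gzip

# Group 2: analysis tools, a separate layer that can be rebuilt
# without invalidating the tooling layer above
RUN yum install -y bwa samtools
```

Editing group 2 re-runs only that layer; a single all-in-one RUN would rebuild everything on every change.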
5. DEMO: Plasmodium Alignment
A Research Analysis Pipeline WITH a Reproducible Exemplar!
https://github.com/dmlond/docker_bwa_aligner
6. What is a Docker Application?
▪ Wraps the logic for exposing a single process interface (it may have many processes running in the background, but generally exposes only one process to the user)
▪ Can be run much like an installed application
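With an entrypoint in place, arguments after the image name go straight to the wrapped program, so invoking the container feels like invoking an installed command. A hedged illustration (the flag shown is not necessarily one the script accepts):

```shell
# --rm discards the container after the single process exits
sudo docker run --rm dmlond/bwa_aligner --help
```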
7. Example: dmlond/bwa_aligner
▪ It’s a perl script
▫ In a container built to have its own special *nix environment
▪ Starts FROM centos:centos6
▪ Adds its own user ‘bwa_user’ with its own HOMEDIR /home/bwa_user
▪ Adds the EPEL repo
▪ Adds bwa and samtools from EPEL using yum (one could download source and compile just as easily)
▫ Hosted on github so you can view its build context and build it yourself from scratch: https://github.com/dmlond/bwa_aligner
▫ Hosted on dockerhub so you can run it on your own machine: https://registry.hub.docker.com/u/dmlond/bwa_aligner
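Reconstructed from the bullet points above, the Dockerfile might look roughly like this; it is a sketch, not the actual file (the script name and exact install commands are assumptions — see the github build context for the real thing):

```dockerfile
FROM centos:centos6

# Add the EPEL repo, then bwa and samtools from it
RUN yum install -y epel-release
RUN yum install -y bwa samtools

# Run as an unprivileged user with its own HOMEDIR
RUN useradd -m bwa_user
USER bwa_user
WORKDIR /home/bwa_user

# Expose the perl script as the single process interface
ADD bwa_aligner.pl /home/bwa_user/bwa_aligner.pl
ENTRYPOINT ["perl", "/home/bwa_user/bwa_aligner.pl"]
```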
8. What is a Volume Container?
• Its image contains the logic for exposing one or more distinct directory trees to other Docker containers
• Running the image to produce a container exposes its own version of the specified directory tree
• A volume container can run and immediately exit, but its specific directory tree stays around for use in other containers
• Designed to be run with --name $name
• Other containers access a volume container’s exposed directory trees by passing its $name at run time using the --volumes-from run parameter
• When you rm a volume container, all files in its specific directory tree are destroyed
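That lifecycle looks like this on the command line (the container name is illustrative; the image names are the ones used in this talk's exemplar):

```shell
# Create the named volume container; it exits immediately,
# but the directory tree it exposes persists
sudo docker run --name reference_volume dmlond/bwa_reference_volume

# Other containers mount that tree by name with --volumes-from
sudo docker run --volumes-from reference_volume dmlond/bwa_aligner

# rm destroys the volume container AND the files in its tree
sudo docker rm reference_volume
```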
9. Example: dmlond/bwa_reference_volume
▪ Its Dockerfile exposes /home/bwa_user/bwa_indexed
▪ When the container runs (with a name), it exits immediately
▪ A container can add files to the volume container directory (dmlond/bwa_reference)
▪ A container can read files in the volume container directory (dmlond/bwa_aligner)
▪ Each container created from the image has its own distinct existence: writes to the /home/bwa_user/bwa_indexed directory tree in one container do not affect the directory trees in other dmlond/bwa_reference_volume containers
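The volume container image itself can be tiny. A hedged sketch of the pattern (not the actual Dockerfile, which lives in the github build context):

```dockerfile
FROM centos:centos6
RUN useradd -m bwa_user

# VOLUME exposes this directory tree to other containers
VOLUME /home/bwa_user/bwa_indexed

# Nothing long-running: the container exits immediately when run,
# but the tree persists until the container is rm'd
CMD ["/bin/true"]
```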
10. FROM Here to Eternity and Beyond
▪ You can extend an existing image to have new functionality using your own build context
▪ Use intermediate container names, and tagging
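Extending means starting a new build context FROM the existing image, then tagging the result. A sketch under assumptions (the agent script name and tag are illustrative):

```dockerfile
# Dockerfile in the new build context
FROM dmlond/bwa_aligner

# Layer agent behavior on top of the existing aligner image
ADD agent.pl /home/bwa_user/agent.pl
ENTRYPOINT ["perl", "/home/bwa_user/agent.pl"]
```

Build and tag it with `sudo docker build -t myuser/bwa_aligner_agent .`; the original dmlond/bwa_aligner image is untouched.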
11. DEMO: Agents
Reusing and Extending the Applications from the Plasmodium Alignment Exemplar!
https://github.com/dmlond/split_agent
[ https://github.com/dmlond/split_raw/blob/master/split_raw.pl ]
https://github.com/dmlond/bwa_aligner_agent
[ https://github.com/dmlond/bwa_aligner ]
12. Old School *nix is Cool Again!
(For better or worse)
▪ STDIN, STDOUT, STDERR
▪ $?, the exit status
▪ Wrapper scripts
▪ Usage statements
▪ Building a containerized app feels like compiling a C application: edit, build, run, repeat
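The wrapper-script pattern above can be sketched in a few lines of portable shell (the function name is illustrative):

```shell
# A minimal wrapper-script pattern: print a usage statement when
# called without arguments, run the wrapped command, then capture
# and propagate its exit status via $?
run_wrapped() {
    if [ "$#" -lt 1 ]; then
        echo "Usage: run_wrapped <command> [args...]" >&2
        return 64
    fi
    "$@"          # run the wrapped command; STDIN/STDOUT pass through
    status=$?     # $? holds the exit status of the last command
    echo "wrapped command exited with status $status" >&2
    return "$status"
}

run_wrapped true   # succeeds; reports status 0 on STDERR
```

The same edit, build, run loop applies whether the script runs on the host or as a container entrypoint.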
13. Security
▪ Unlike a traditional VM, a Docker container can access some host resources
▪ DO NOT RUN AS ROOT BY DEFAULT!
▫ USER + ENTRYPOINT + CMD + WORKDIR
▫ These can be overridden at run time
▫ These can be overridden in new containers starting FROM them to ‘extend’ them
▪ Be very specific with commands; rely on wildcards and shell/exec commands sparingly
▪ Use the same paranoid practices in your container apps that you use in web/cgi applications:
▫ use the list form, open(my $fh, "-|", "cmd", "arg1", "arg2"), instead of the shell form, open(my $fh, "cmd arg1 arg2 |")
▫ check for tainted input
▫ watch for wildcards in filenames, especially if doing chmod, chown, chgrp, rsync, etc. ( http://www.defensecode.com/public/DefenseCode_Unix_WildCards_Gone_Wild.txt )
http://opensource.com/business/14/9/security-for-docker
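The USER + ENTRYPOINT + CMD + WORKDIR pattern named above looks like this; a minimal sketch, assuming a hypothetical app.pl script:

```dockerfile
FROM centos:centos6
RUN useradd -m app_user

# Drop root before the application ever runs
USER app_user
WORKDIR /home/app_user

# Be specific: exec form, no shell, no wildcards
ADD app.pl /home/app_user/app.pl
ENTRYPOINT ["/usr/bin/perl", "/home/app_user/app.pl"]
CMD ["--help"]
```

ENTRYPOINT fixes what runs; CMD supplies only default arguments, which callers can override at run time.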
14. Acknowledgements:
▪ Duke Office of Research Informatics
▪ ORI Research Application Development Group
▪ Duke Office of Information Technology
▪ Mark Delong (Duke Research Computing)
▪ Chris Collins (OIT)
▪ Erich Huang (Duke School of Medicine)
▪ Greg Crawford (Genomics and Computational Biology)
▪ Rutger Vos (Naturalis)
15. References
1. Stodden VC. Reproducible research: Addressing the need for data and code sharing in computational science. Computing in Science & Engineering. 2010.
2. Stodden V, Guo P, Ma Z. Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals. PLoS ONE. 2013;8(6):e67111.
3. Collins FS, Tabak LA. Policy: NIH plans to enhance reproducibility. Nature. 2014;505:612–613.
4. Announcement: Reducing our irreproducibility. Nature. 2013;496(7446):398.
5. Ince DC, Hatton L, Graham-Cumming J. The Case for Open Computer Programs. Nature. 2012;482:485–488.
6. Dudley JT, Butte AJ. A Quick Guide for Developing Effective Bioinformatics Programming Skills. PLoS Comput Biol. 2009;5(12):e1000589. doi:10.1371/journal.pcbi.1000589.
7. Noble WS. A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol. 2009;5(7):e1000424. doi:10.1371/journal.pcbi.1000424.
17. https://www.docker.com/whatisdocker
▪ An open platform for developers and sysadmins to build, ship, and run distributed applications
▪ Consists of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows
▪ Enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments
▪ IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud
21. Researchers should also like Docker
Computation is becoming more prevalent:
▪ "Computation is becoming central to the scientific enterprise, but the prevalence of relaxed attitudes about communicating computational experiments’ details and the validation of results is causing a large and growing credibility gap.” (1)
▪ “To adhere to the scientific method in the face of the transformations arising from changes in technology and the Internet, we must be able to reproduce computational results.” (1)
Granting agencies and journals have begun to take note:
▪ 2012 saw a one-year increase of 16% in the number of data policies, a 30% increase in code policies, and a 7% increase in the number of supplemental materials policies in journals (2)
▪ NIH has introduced new mandatory training modules and reviewer checklists (3)
▪ Nature has introduced checklists to enhance reproducibility (4)
http://melissagymrek.com/science/2014/08/29/docker-reproducible-research.html
22. What about Good Old Excel Spreadsheets?
Benefits
▪ Reusable
▪ Reproducible
▪ Shareable
▪ Code and data stored in one convenient package
Problems
▪ $$$
▪ Only works on MS Windows and OSX*
▪ Easy to share data not intended for sharing (PHI accidentally left in another worksheet)
▪ Inter-version incompatibilities
▪ Does not scale to big data
▪ Security (macros and viruses)
23. Free and Open Source Code
Benefits
▪ Free for anyone
▪ Code can easily be shared using online repositories (github, sourceforge, etc.), separately from data, and without cost to publisher or peers
▪ Can scale to big data
Problems
▪ Inter-version incompatibilities
▪ Difficult to fully specify software dependencies (especially when moving between architectures and OSes)
▪ Dependency clashes between libraries required by different applications
▪ Data must be structured rigorously, and code must be written in a special way to facilitate automation and reproducibility (6,7)
▪ Code and data distribution must be managed independently
▪ Code can get stale without routine maintenance
“we have reached the point that, with some exceptions, anything less than release of actual source code is an indefensible approach for any scientific results that depend on computation, because not releasing such code raises needless, and needlessly confusing, roadblocks to reproducibility.” (5)
24. Workflow Enactors (Taverna, Galaxy, …)
Benefits
▪ Easy to share workflows with others
▪ Reduces dependency clashes
▪ Can scale to big data with proper parallelization
Problems
▪ Dependence on web-accessible data (security, privacy)
▪ Emphasize web services over commandline applications
▪ Still have inter-version incompatibilities
25. Machine Images (Virtualbox, VMWare)?
Benefits
▪ Eliminates inter-version incompatibility issues
▪ Eliminates dependency clashes
▪ Can be shared between different servers (internal, amazon, google, etc.) running VM hosting technology
▪ Can spin up/tear down as many instances as needed
Problems
▪ Can get quite large
▪ Slow to start
▪ Must become proficient with server provisioning commands (package management, puppet/chef/ansible, etc.)
26. Docker
Benefits
▪ Similar to a VM
▪ Starts and stops much faster than a VM
▪ Docker images are smaller than equivalent VM images
Problems
▪ Must become proficient with server provisioning commands (package management, puppet/chef/ansible, etc.; google ‘DevOps’)
▪ Orchestration and enactment
▪ Docker images may provide access to host resources that are not available to VM images
Editor's notes
You have already seen views on the ways that Docker can be used to solve problems in many different domains of IT. This is why Docker has become so popular, so fast: it is a technology that meets different, unique needs in many different parts of the IT universe.
Imagine if you could
buy a computer
install a specific version of linux on it
Install only the linux packages needed by your application
Override the boot sequence to only allow the user to enter arguments for your application, run the application, store the results on a mounted file system, and exit
This is what Docker allows you to do
Emphasis on Distributed Applications makes it useful for sharing applications as well, get cloud ready apps for free
As you will see, componentization is the foundation for ‘reuse and repurposing’ of existing applications
‘Friction’ is what you experience each time you request an upgrade to R, Perl, Python, Java, or ask for a library such as one supporting HTTPS communication
Docker images are 99% immutable, and soon they will be signable. They can have PROV attached to them.
Excited about the way Docker allows them to more efficiently use their existing metal infrastructure to serve more applications
Excited about a radical new ways to build, deploy, and continuously improve their applications
All are excited about how Docker allows them to mitigate dependency clashes between different software products, and even different versions of the same software product
The standard that is currently emerging will, very soon, require researchers to make it possible for other researchers to run their analysis pipeline on their data and produce values that confirm the original research finding.
Ironically, as technologies such as Docker make this possible for you, they also raise the bar of what others expect from you.
While some might sneer at ‘Those Darn Excel Spreadsheets’, there are valid reasons why this software rose to dominate the research analysis world over the last four decades.
Here I have added a somewhat controversial viewpoint expressed in the pages of the journal Nature
Those developing processes to analyze big data will also have to overcome many of these issues, regardless of whether they are intending to share their code Open Source
Reduces dependency clashes, so long as users use the version of tools existing within the enactor compute nodes
It is no accident that discussions of reproducible research computation accelerated at about the same time that VM technologies began to emerge