This document evaluates the performance of container virtualization using Docker for a bioinformatics application called MEGADOCK. Two experiments were conducted:
1) MEGADOCK was run on a physical machine with and without Docker, showing a 6.32% performance overhead with Docker. With NVIDIA Docker on GPU, performance was comparable to native.
2) MEGADOCK was run on Azure VMs with and without Docker, showing comparable scalability. Docker performance was around 6x faster than VMs.
The results show that Docker introduces small overhead for compute-intensive applications like MEGADOCK. Docker provides advantages of environment isolation and portability without significant performance costs.
Evaluation of Container Virtualized MEGADOCK System in Distributed Computing Environment
1. Evaluation of Container Virtualized MEGADOCK System in Distributed Computing Environment
March 23rd, 2017
SIG BIO 49@Japan Advanced Institute of Science and Technology
Kento Aoyama1,2, Yuki Yamamoto1,2, Masahito Ohue1,3, Yutaka Akiyama1,2,3
1) Department of Computer Science, School of Computing
Tokyo Institute of Technology
2) Education Academy of Computational Life Sciences (ACLS)
Tokyo Institute of Technology
3) Advanced Computational Drug Discovery Unit, Institute of Innovative Research
Tokyo Institute of Technology
3. Docker and Bioinformatics 3
A. Paolo, D. Tommaso, A. B. Ramirez, E. Palumbo, C. Notredame, and D. Gruber,
“Benchmark Report: Univa Grid Engine, Nextflow, and Docker
for running Genomic Analysis Workflows.”
Docker Integration Benchmark Report
@Centre for Genomic Regulation
(Barcelona, Spain)
• Univa Grid Engine (Job Scheduler)
• Nextflow (Workflow manager)
• Docker (Linux Container)
• Reproducibility
• Portability
4. To develop a container-native HPC bioinformatics application
using Linux containers, which has …
• Low Dependency on Environment
• High-Performance
• Parallel execution performance
• Overhead of virtualization
• Dynamically Scaling
Research Purpose 4
5. • To evaluate the performance of Docker container virtualization
in a bioinformatics application
Target Application
• MEGADOCK[1]
• FFT-grid-based Protein-Protein Docking software
• Multi-threading, Multi-node, Multi-GPU (OpenMP, MPI, GPU)
• Extremely compute intensive workloads
Today’s Report 5
[1] Masahito Ohue, et al. “MEGADOCK 4.0: an ultra-high-performance protein-protein docking
software for heterogeneous supercomputers”, Bioinformatics, 30(22): 3281-3283, 2014.
7. Kernel-Shared Virtualization
• Lightweight : small size, fast deploy, easy sharing
• Performance : little virtualization overhead, faster than a VM
Linux Container 7
[Figure: software stacks compared. Virtual Machines: Hardware, Hypervisor, then per virtual machine a Guest OS, Bins/Libs, and App. Containers: Hardware, Linux Kernel, then per container Bins/Libs and App.]
8. Linux Container
• virtualizes host resources as containers
• Filesystem, hostname, IPC, PID, Network, User, etc.
• can be used like Virtual Machines
Linux Kernel Features
• Containers share the same host kernel
• namespace[1], chroot, cgroup, SELinux, etc.
Container-based Virtualization 8
[1] E. W. Biederman. “Multiple instances of the global Linux namespaces.”,
In Proceedings of the 2006 Ottawa Linux Symposium, 2006.
[Figure: one machine with a single Linux kernel space hosting two containers, each running its own processes.]
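The namespace isolation described on this slide can be observed from any Linux process. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function name list_namespaces is hypothetical and not part of the original slides.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: identifier} for a process.

    Reads the symlinks under /proc/<pid>/ns (Linux only). Returns an
    empty dict when that directory does not exist, e.g. on other OSes.
    """
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    if not os.path.isdir(ns_dir):
        return namespaces
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink whose target identifies the namespace.
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Skip entries the current user is not allowed to read.
            continue
    return namespaces


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name}: {ident}")
```

Running this on the host and again inside a container would typically print different identifiers for the isolated namespaces (for example pid, net, and mnt), while two processes in the same container share them.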
9. Linux Container – Performance [1] 9
[1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual
machines and Linux containers,” IEEE International Symposium on Performance Analysis of Systems and
Software, pp.171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)
[Figure: bar chart of performance ratio (relative to Native) for PXZ [MB/s], Linpack [GFLOPS], and Random Access [GUPS]; series: Native, Docker, KVM, KVM-tuned. Data labels as extracted: 0.96, 1.00, 0.98, 0.78, 0.83, 0.99, 0.82, 0.98.]
10. Docker [1]
• Most popular Linux Container management platform
• Many useful components and services
Linux Container Management Tools 10
[1] Solomon Hykes and others. “What is Docker?” - https://www.docker.com/what-docker
[2] W. Bhimji, S. Canon, D. Jacobsen, L. Gerhardt, M. Mustafa, and J. Porter, “Shifter : Containers for
HPC,” Cray User Group, pp. 1–12, 2016.
[3] “Singularity” - http://singularity.lbl.gov/
11. Easy container sharing – Docker Hub 11
Portability & Reproducibility
• Easy to share the application environment via Docker Hub
• Containers can be executed on other host machines
[Figure: a Dockerfile (apt-get install …, wget …, …, make) generates an image (App, Bins/Libs) on an Ubuntu host running Docker Engine. The image is pushed to Docker Hub and pulled onto a CentOS host running Docker Engine, where it runs as a container (App, Bins/Libs). Labels: Generate, Share, Push, Pull.]
13. Why in the field of Bioinformatics?
• Types of Applications
• Data Analysis, Machine Learning
• MD Simulation, Docking calc. , etc.
• Data-centric workload
• Compute : Large
• Data I/O : Case by case
• Communication : Small
• Containers perform well on compute-intensive workloads [1]
For Bioinformatics Apps : 1 13
[1] W. Felter, et al. “An updated performance comparison of virtual
machines and Linux containers,” IEEE International Symposium on
Performance Analysis of Systems and Software, pp.171-172, 2015.
14. Reproducibility
• Different versions of a library can produce different results
• e.g., genomic analysis pipeline [Paolo, 2016]
For Bioinformatics Apps : 2 14
[Figure: Application A (requires Library A version >= 1.2) and Application B (requires version < 1.1) conflict on a shared library; running them in Container A and Container B isolates the dependencies. Application A with library version 1.2 (Container A) gives Result A, while Application A with library version 1.3 (Container A’) gives a different Result A’. Labels: Dependency Isolation, Application Reproducibility.]
Dependency conflict
• Different applications can require different versions of the same library
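The dependency conflict on this slide (one application needing version >= 1.2 of a library, another needing version < 1.1) can be checked mechanically. This is a minimal sketch: the candidate version list and all function names are hypothetical, and it handles only simple dotted numeric versions.

```python
import operator

# Comparison operators supported by this sketch.
OPS = {">=": operator.ge, "<": operator.lt, "<=": operator.le,
       ">": operator.gt, "==": operator.eq}


def parse_version(text):
    """Turn '1.2' into (1, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, constraints):
    """True if version meets every (operator, bound) pair."""
    v = parse_version(version)
    return all(OPS[op](v, parse_version(bound)) for op, bound in constraints)


def find_shared_version(candidates, *constraint_sets):
    """Return the first candidate acceptable to all applications, else None."""
    for candidate in candidates:
        if all(satisfies(candidate, cs) for cs in constraint_sets):
            return candidate
    return None


# Constraints taken from the slide; the candidate list is an assumption.
app_a = [(">=", "1.2")]
app_b = [("<", "1.1")]
available = ["1.0", "1.1", "1.2", "1.3"]

shared = find_shared_version(available, app_a, app_b)  # one shared install
only_a = find_shared_version(available, app_a)         # Container A
only_b = find_shared_version(available, app_b)         # Container B
print(shared, only_a, only_b)
```

With a single shared installation no version satisfies both applications, whereas resolving each application on its own, as separate containers allow, succeeds for both.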
15. Performance
• Little performance overhead
Reproducibility
• Dependency Isolation from other applications/libraries
Portability, Generality
• Sharing/Porting to other environment
Features for Bioinformatics Apps 15
Features                    Native   VM      Container
Performance, Scalability    Great    Bad     Good
Reproducibility             Bad      Good    Great
Portability, Generality     Bad      Great   Great
17. MEGADOCK 17
Masahito Ohue, et al. “MEGADOCK 4.0: an ultra-high-performance protein-protein docking
software for heterogeneous supercomputers”, Bioinformatics, 30(22): 3281-3283, 2014.
High-performance protein-protein interaction predictions
• FFT-grid based docking software
• Extremely compute-intensive
• OpenMP/MPI/GPU support
• Great HPC Performance
18. Container-based Application Distribution 18
[Figure: an application layer of MEGADOCK containers that can be added or removed, running on top of a compute resource layer whose resources can also be added or removed.]
• All application dependencies exist in the Container
• Easy-to-test application
• Easy-to-scale size of resources
Test Environment Production Environment
29. Scalability (Strong Scaling, based VM=1) 29
[Figure: strong-scaling speed-up (normalized to VM=1, axis 0 to 45) versus number of worker cores (axis 0 to 500) for Ideal, VM, and Docker on VM; data points at VM=1, 5, 10, 20, and 30. Annotation: comparable scalability.]
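The speed-up plotted on this slide is normalized to the VM=1 run. The arithmetic can be sketched as below, assuming speed-up is the baseline elapsed time divided by the elapsed time at a larger scale; the timings are hypothetical placeholders, not measurements from the experiments.

```python
def speedup(base_time, time):
    """Strong-scaling speed-up relative to the baseline run."""
    return base_time / time


def parallel_efficiency(base_time, time, base_workers, workers):
    """Measured speed-up divided by the ideal (linear) speed-up."""
    ideal = workers / base_workers
    return speedup(base_time, time) / ideal


# Hypothetical elapsed times in seconds for the same total workload.
t_one_vm = 1000.0   # baseline: VM=1
t_ten_vms = 125.0   # scaled run: VM=10

s = speedup(t_one_vm, t_ten_vms)
e = parallel_efficiency(t_one_vm, t_ten_vms, base_workers=1, workers=10)
print(f"speed-up: {s:.2f}, efficiency: {e:.2f}")
```

Two configurations show comparable scalability when their speed-up curves, computed this way against the same baseline, stay close to each other as workers are added.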
30. Experiment I
• MEGADOCK + Docker on a physical machine showed 6.32% lower performance.
• Docker can reduce compute performance by 0-4% [1]
• Communication goes through Docker NAT (Network Address Translation)
• MEGADOCK (GPU) + NVIDIA-Docker on a physical machine showed performance comparable to native.
• GPU computation is independent of container virtualization
• Container virtualization adds little overhead on memory bandwidth
Experiment II
• MEGADOCK + Docker on Microsoft Azure showed comparable scalability.
• Container virtualization overhead is smaller than other cloud-environment factors
Result & Discussion 30
[1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual
machines and Linux containers”, IEEE International Symposium on Performance Analysis of Systems
and Software, pp.171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)
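A relative overhead figure such as the 6.32% reported for Experiment I can be derived from two elapsed times. The slides do not state the exact formula, so this sketch assumes overhead is the relative increase in elapsed time over the native run; the timings are hypothetical and chosen only to illustrate the arithmetic.

```python
def overhead_percent(native_time, container_time):
    """Relative increase in elapsed time over native, in percent."""
    return (container_time - native_time) / native_time * 100.0


# Hypothetical elapsed times in seconds (not measured values).
native_seconds = 100.0
docker_seconds = 106.32

overhead = overhead_percent(native_seconds, docker_seconds)
print(f"overhead: {overhead:.2f}%")
```

If performance is instead defined as throughput (work per unit time), the same pair of timings gives a slightly different percentage, so the definition should be stated alongside any reported overhead.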
31. • The performance overhead of Docker container virtualization is small.
• Suitable for GPU-accelerated applications and cloud environments
• Container virtualization can isolate the application environment from the host environment.
• The same container image can be used on various machines
• Physical machines in a local environment
• Virtual machines in a cloud environment
• Docker is useful for computational research work
Conclusion 31
32. Multi-Node & Multi-GPU Evaluation on Cloud
• NVIDIA-Docker is not available in Docker Swarm mode
• Kubernetes[1] officially supports one GPU per node
• (experimental feature: multi-GPU support)
Container-based Task Distribution
• Web-service-like container-based distribution
• easy to scale computing resources
• easy to extend to multiple tasks (e.g. GHOST-MP, MEGADOCK)
Future Work 32
[1] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, “Borg, Omega, and
Kubernetes,” acmqueue, vol. 14, no. 1, p. 24, 2016.