1. Cloud BioLinux: pre-configured and on-demand
computing for genomics independently of institutional,
geographic or economic boundaries
Ntino Krampis, PhD
JCVI-NIAID workshop 2011
S. Africa
2. Expensive sequencing and large organizations
● large sequencing centers, multi-million, broad-impact sequencing projects
● dedicated bioinformatics department, coordination with other centers
Commodity sequencing and small labs
● small form-factor, bench-top sequencer available: GS Junior by 454
● sequencing as a standard technique in basic biology and genetics research
● RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome
3. Acquiring the sequence data is only the first step
● downstream bioinformatics analysis for scientific discovery
● many commonly-used bioinformatics tools are difficult to install
● usually available only as source code - needs technical expertise
● large-scale sequence data analysis requires high performance
and expensive computing hardware
4. Alternative: computational capacity on the cloud
● Cloud Computing: large-scale, high-performance computers accessible through the Internet
● Example: using Gmail, Google Docs, Yahoo! Mail, Facebook etc., you store and access data on a remote computer
● Cloud Computing services such as Amazon EC2 (http://aws.amazon.com/ec2) rent out large computational and data storage capacity on remote computers
5. How does Cloud Computing work?
● operating system, bioinformatics software and data are installed in a Virtual Machine (VM)
● a VM is uploaded and executed on a cloud computing service
● run a practically unlimited number of VMs for large-scale sequence data analysis
● access the VM from a desktop computer through the Internet
[Diagram: local desktop computers connect through the Internet to VMs running on the remote Amazon EC2 Cloud Computing service]
6. Cloud BioLinux
● Cloud BioLinux leverages VM technology and the
cloud, offering pre-configured bioinformatics computing
● allows setting up a high-performance data analysis
environment, without any technical expertise
● researchers can perform large-scale data analysis, by
simply using a desktop computer with Internet access
● accessible without any institutional, economic or national
boundaries
7. Launching Cloud BioLinux
1. sign up for an Amazon EC2 cloud account:
http://aws.amazon.com/ec2
You can also use an existing account from the main Amazon.com website
for the cloud usage charges. We have an account ready for you:
Username: aws_nhgri@jcvi.org
Password: Nhg4|CL0ud!
2. using the account credentials sign in to the EC2 cloud console
(select EC2 in the dropdown menu below the sign-in button):
http://aws.amazon.com/console
3. launch Cloud BioLinux through the cloud console wizard
9. Launch instance wizard: steps 1 & 2
1. specify the Cloud BioLinux identifier under the “Community AMIs” tab
2. choose the computational capacity: memory, processor, CPU cores
10. Launch instance wizard: step 3
3. specify a login password for the Cloud BioLinux desktop in the “User Data” box
4. remaining steps: leave everything as default and keep clicking the “Continue” button until the wizard finishes and you are back at the console
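The wizard choices above (Cloud BioLinux identifier, computational capacity, User Data password) amount to a small set of launch parameters. The sketch below only collects them into a dictionary and does not contact Amazon. The function name `build_launch_request` is hypothetical, the AMI identifier and instance type shown are placeholders rather than values from this workshop, and the key names assume you might later hand the dictionary to an SDK call such as boto3's `run_instances`.

```python
def build_launch_request(ami_id, instance_type, desktop_password):
    """Collect the launch wizard choices into one dictionary.

    Assumption: the desktop password is passed as plain text in the
    User Data field, as the wizard step describes.
    """
    if not ami_id.startswith("ami-"):
        raise ValueError("AMI identifiers start with 'ami-'")
    if not desktop_password:
        raise ValueError("a desktop password is required")
    return {
        "ImageId": ami_id,                # wizard step 1: Community AMIs tab
        "InstanceType": instance_type,    # wizard step 2: computational capacity
        "UserData": desktop_password,     # wizard step 3: User Data box
        "MinCount": 1,                    # launch exactly one instance
        "MaxCount": 1,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    request = build_launch_request("ami-00000000", "m1.large", "workshop")
    print(request)
```

Keeping the parameters in one place makes it easy to relaunch the same configuration later without stepping through the wizard again.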
11. Launching Cloud BioLinux
Back at the console after we completed the wizard.
Pick a running instance, select it with your mouse and copy its “Public DNS” address (the Cloud BioLinux server address on the cloud).
12. While waiting for Cloud BioLinux to boot up...
● examples of NCBI public datasets on EC2
● bringing the data to the compute
13. Final step: connecting remotely to Cloud BioLinux
click the NX client icon on your computer's desktop
A. paste the “Public DNS” address in the “Host” box
B. select “Unix”, “Gnome”, remote desktop size
C. “ubuntu” is the default user login; “workshop” is the password we set
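If the NX client cannot connect, it can help to check first whether the instance has finished booting and is reachable. The sketch below assumes the NX client connects over SSH on port 22, which is not stated on the slide; the function name `is_port_open` is hypothetical, and the host string is a placeholder for the Public DNS address you copied from the console.

```python
import socket


def is_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    Assumption: the remote desktop session is carried over SSH (port 22).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


if __name__ == "__main__":
    # Replace the placeholder with the Public DNS address of your instance.
    host = "PASTE-PUBLIC-DNS-HERE"
    print("reachable" if is_port_open(host) else "not reachable yet")
```

A result of "not reachable yet" shortly after launch usually just means the instance is still booting; retry after a minute or two.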
27. What if I want to share my alignments with a collaborator?
save your data as a new VM
$0.10 / GB / month
at 15 GB, it costs $1.50 / month
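The storage arithmetic on this slide can be reproduced for other image sizes. This is a minimal sketch using the $0.10 per GB per month rate quoted above; the function name `monthly_storage_cost` is hypothetical, and current pricing may differ from the slide.

```python
from decimal import Decimal

# Rate quoted on the slide: $0.10 per GB per month.
RATE_PER_GB_MONTH = Decimal("0.10")


def monthly_storage_cost(size_gb, rate=RATE_PER_GB_MONTH):
    """Return the monthly cost in dollars of storing a VM of size_gb gigabytes."""
    # Decimal avoids binary floating-point rounding in currency arithmetic.
    return Decimal(str(size_gb)) * rate


if __name__ == "__main__":
    print(monthly_storage_cost(15))  # 1.50
```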
28. Cloud BioLinux
whole system snapshot exchange
share your analysis results: publicly or only with your
collaborators
authorized users can access the cloud VM/image with
all the software, data, analysis results
29. Cloud BioLinux and Genomic Standards
whole system snapshot exchange
[Diagram: researcher A starts a VM / image, performs analysis, takes a snapshot and shares it; researcher B starts the shared VM / image, performs analysis, takes a snapshot and shares it back]
30. Acknowledgments & Credits
Brad Chapman - development of the fabric scripts and community organizer
Tim Booth, Bela Tiwari, Dawn Field – BioLinux 6.0 development and EC2 documentation
Deepak Singh and AWS - education grant supporting ISMB / BOSC workshop
Justin Johnson – community and sponsorship of cloudbiolinux.com
J. Craig Venter Inst. - time allowed to work on an open-source project
D. Gomez, E. Navarro, J. Shao, I. Singh – JCVI technology innovation
Members of the Cloud BioLinux community:
Enis Afgan
Michael Heuer
Richard Holland
Mark Jensen
Dave Messina
Steffen Möller
Roman Valls
Thank you!