The pulse of cloud computing with bioinformatics as an example
1. The Pulse of Cloud Computing with Bioinformatics as an example
Nuwan Goonasekera†, Enis Afgan*
† University of Melbourne, Melbourne Bioinformatics, Australia
* Johns Hopkins University, Taylor Lab, USA
@ University of Colombo
Feb 2017
3. Overview
• The key characteristics of Cloud Computing
• Using Cloud Computing for bioinformatics
Source: http://dilbert.com/strips/comic/2012-05-25/
5. Data center use before cloud computing
source: http://www.rackspace.com/knowledge_center/whitepaper/revolution-not-evolution-how-cloud-computing-differs-from-traditional-it-and-why-it
6. Cloud Computing: A Definition
• NIST definition: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”
» National Institute of Standards and Technology (http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf)
7. The Cloud Model
Deployment Models: Private, Community, Public, Hybrid
Delivery Models: Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS)
Essential Characteristics:
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service
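Two of these characteristics, rapid elasticity and measured service, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the function names, the queue lengths, the pool bounds, and the hourly rate do not come from any provider.

```python
import math

def desired_instances(queued_jobs, jobs_per_instance, min_instances=1, max_instances=20):
    """Rapid elasticity: size the pool to the current queue, within fixed bounds."""
    needed = math.ceil(queued_jobs / jobs_per_instance)
    return max(min_instances, min(max_instances, needed))

def metered_cost(instance_hours, hourly_rate):
    """Measured service: pay only for the instance-hours actually consumed."""
    return instance_hours * hourly_rate

# Hypothetical queue lengths, sampled once per hour.
queue_by_hour = [0, 12, 95, 40, 3]

# The pool grows when the queue grows and shrinks again afterwards.
pool_by_hour = [desired_instances(q, jobs_per_instance=10) for q in queue_by_hour]
print(pool_by_hour)

# Hypothetical rate of 0.10 per instance-hour.
print(metered_cost(sum(pool_by_hour), hourly_rate=0.10))
```

The bill follows the instance-hours used rather than the peak capacity, which is the contrast with a fixed data center sized for the busiest hour.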
10. Public PaaS Examples
Cloud name | Languages and developer tools | Programming models supported by provider | Target applications and storage options
Google App Engine | Python, Java, Go, PHP + JVM languages (Scala, Groovy, JRuby) | MapReduce, Web, DataStore, Storage and other APIs | Web applications and BigTable storage
Salesforce.com’s Force.com | Apex, Eclipse-based IDE, web-based wizard | Workflow, Excel-like formula, web programming | Business applications such as CRM
Microsoft Azure | .NET, Visual Studio, Azure tools | Unrestricted model | Enterprise and web apps
Amazon Elastic MapReduce | Hive, Pig, Java, Ruby etc. | MapReduce | Data processing and e-commerce
Aneka | .NET, stand-alone SDK | Threads, task, MapReduce | .NET enterprise applications, HPC
11. Public SaaS examples
• Gmail
• SharePoint
• Salesforce.com CRM
• OnLive
• Gaikai
• Microsoft Office 365
• Some definitions include services that do not require payment, e.g. ad-supported sites
12. Things we find most interesting
• Accessibility
• Infrastructure as code
• Elasticity
• Programming models that fit the cloud
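As a rough illustration of "infrastructure as code", the sketch below describes the wanted machines as plain data and computes the actions needed to reach that state. It is a minimal sketch under stated assumptions: the `plan` function, the machine names, and the spec fields are hypothetical and do not correspond to any particular tool.

```python
def plan(desired, current):
    """Return the actions needed to move `current` infrastructure to `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions

# Desired state, kept in version control like any other source file (hypothetical).
desired = {
    "head-node": {"cpus": 4, "disk_gb": 100},
    "worker-1": {"cpus": 16, "disk_gb": 500},
}

# State currently running in the cloud (hypothetical).
current = {
    "head-node": {"cpus": 2, "disk_gb": 100},
    "old-worker": {"cpus": 8, "disk_gb": 200},
}

for action in plan(desired, current):
    print(action)
```

Because the description is data, it can be reviewed, versioned, and re-applied, which is what distinguishes this approach from configuring machines by hand.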
13. Accessibility
● Global availability via public clouds
● On-demand self-service
● A platform for democratisation of computing
● Access is enabled via point-and-click interfaces (blends with the Internet)
18. Bioinformatics
A multi-disciplinary science using computers for acquiring, managing and analyzing biological data.
It is a data-driven science.
It is a tool for genomics research.
[Diagram: Bioinformatics at the intersection of Biology, Medicine, Math & Physics, and Computer Science]
19. Genomics
Oxford dictionaries
“The branch of molecular biology concerned with the
structure, function, evolution, and mapping of genomes.”
Where are the genes and other interesting pieces?
How do sequences change over evolutionary time?
What does all the DNA do?
What are the physical shapes of the genome and its products?
20. Genomics: contrast with biology and genetics
Aspect | Biology and genetics | Genomics
Scope | Targeted studies of one or a few genes | Studies considering all genes in a genome
Technology | Targeted, low-throughput experiments | Global, high-throughput experiments
Hard part | Clever experimental design, painstaking experimentation | Tons of data, uncertainty, computation
* Everything on this slide is a generalization
21. Where is genomics used?
Basic science
● What is the DNA sequence of the genome?
● Where are the genes?
● What does all the DNA in the genome do?
● How did history shape our ethnicities and populations?
Medicine
● What’s the difference between DNA in a tumor vs DNA in healthy tissue?
● Can genomic data help predict what drugs might be appropriate for:
○ a particular cancer patient?
○ a particular genetic disorder?
● Can genomic data help us predict what flu strains will prevail next year?
22. Genome
Oxford dictionaries
“The complete set of genes or genetic material
present in a cell or organism.”
“Blueprint” or “recipe” of life.
Self-copying store of read-only information about
how to develop and maintain an organism.
23. Where do genomes live?
All the trillions of cells in a person have the same genomic DNA in the nucleus.
Picture source: https://publications.nigms.nih.gov/insidethecell/preface.html
24. How do we obtain genome data? Sequencing!
The first methods, called Sanger sequencing, were developed in the mid-1970s.
Starting in 1990, the international Human Genome Project took 13 years to sequence the human genome.
In the 2000s, massively parallel Next Generation Sequencers (NGS) were developed that took days to sequence a human genome at a much lower cost.
Today, nanopore sequencers are emerging, offering real-time sequencing.
There are many public data repositories with free access to data (e.g., TCGA, 1000 Genomes, GenBank).
25. Genome variation
Two unrelated humans have genomes that are ~99.8% similar by sequence.
There are about 3-4 million differences. Most are small, e.g. Single Nucleotide Polymorphisms (SNPs).
Human and chimpanzee genomes are about 96% similar.
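Similarity figures of this kind come from comparing sequences position by position. The sketch below shows the idea on toy data. It assumes the two sequences are already aligned, have equal length, and differ only by substitutions; the sequences and function names are made up for illustration, and real comparisons need an alignment step first.

```python
def count_mismatches(seq_a, seq_b):
    """Count positions where two pre-aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions that carry the same nucleotide."""
    matches = len(seq_a) - count_mismatches(seq_a, seq_b)
    return 100.0 * matches / len(seq_a)

# Two toy sequences that differ by one single-nucleotide substitution.
person_1 = "ACGTACGTAC"
person_2 = "ACGTTCGTAC"

print(count_mismatches(person_1, person_2))
print(percent_identity(person_1, person_2))
```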
26. Making sense of the data through data manipulations
Apply data transformations to extract useful information
This is not always a well-defined process
This is typically done with existing tools, or by developing one’s own
Tools can be chained into workflows
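Chaining tools into a workflow can be sketched as passing the output of each step to the next. The three "tools" below are toy stand-ins written for this example, not real bioinformatics tools, and `run_workflow` is a hypothetical helper rather than the API of any workflow system.

```python
from functools import reduce

def drop_short_reads(reads, min_length=4):
    """Toy filter: discard reads shorter than `min_length`."""
    return [r for r in reads if len(r) >= min_length]

def to_upper(reads):
    """Toy normalisation: use upper-case nucleotide letters throughout."""
    return [r.upper() for r in reads]

def gc_content(reads):
    """Toy analysis: fraction of G and C nucleotides in each read."""
    return {r: (r.count("G") + r.count("C")) / len(r) for r in reads}

def run_workflow(data, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda result, step: step(result), steps, data)

reads = ["acgt", "gg", "ggcc", "ATAT"]
result = run_workflow(reads, [drop_short_reads, to_upper, gc_content])
print(result)
```

Each step only needs to agree with its neighbours on the data it accepts and returns, so steps can be swapped or reordered without touching the rest of the chain.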
27. What does all of this have to do with
Cloud Computing?
33. Real-world infrastructure requirements
Some computers + reliable persistent data storage + bioinformatics tools + reference data + workflow system
[Diagram labels: Raw data, Indexed genomes (10-100's GB), Results; other data sizes shown: 100-1000's GB, few GB; timeline: Aug, Sep, Oct, Nov, ...]
34. A data analysis and integration tool
A (free for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
36. Three ways to use Galaxy
1. Download and run locally
2. Public website (http://usegalaxy.org)
3. Run on the Cloud
37. Bringing cloud resources to genomics
Cloud resources need to be provisioned and configured for use in genomics.
A Cloud Manager that orchestrates all of the steps required to provision, manage, and share a compute platform on a cloud infrastructure, all through a web browser.
43. Architectural stack
Cloud apps
CloudLaunch: CloudLaunch.usegalaxy.org, beta.launch.usegalaxy.org, github.com/galaxyproject/cloudlaunch-ui, github.com/galaxyproject/cloudlaunch
CloudMan: wiki.galaxyproject.org/CloudMan, github.com/galaxyproject/cloudman
CloudBridge: cloudbridge.readthedocs.org, github.com/gvlproject/cloudbridge
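One way to read a stack like this is that applications talk to a provider-neutral layer instead of to each cloud's own API. The sketch below illustrates that pattern only. All classes, method names, and the image name are hypothetical; none of them are the CloudBridge, CloudLaunch, or CloudMan APIs, and the "providers" are in-memory fakes that contact no cloud.

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """Hypothetical provider-neutral interface that applications code against."""

    @abstractmethod
    def launch_instance(self, name, image):
        """Start an instance and return an identifier for it."""

class FakeAWS(CloudProvider):
    """In-memory stand-in for one cloud."""

    def launch_instance(self, name, image):
        return f"aws:{name}:{image}"

class FakeOpenStack(CloudProvider):
    """In-memory stand-in for another cloud."""

    def launch_instance(self, name, image):
        return f"openstack:{name}:{image}"

def launch_app(provider, app_name):
    """Application-level code: identical for every provider."""
    return provider.launch_instance(name=app_name, image="galaxy-image")

for provider in (FakeAWS(), FakeOpenStack()):
    print(launch_app(provider, "galaxy-server"))
```

Supporting another cloud then means adding one more provider class, while `launch_app` and anything built on top of it stay unchanged.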
46. Everything talked about here is an effort from a large community!
Come talk to us; get involved.
enis.afgan@jhu.edu or nuwan.goonasekera@unimelb.edu.au