This talk covers the current challenges and opportunities in using cloud computing for data-heavy research computing.
Talk given at the Marcus Evans "Cloud Computing in the Pharmaceutical Industry" conference, Frankfurt 2011.
3. The Sanger Institute
Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus,
Cambridge, UK.
Large scale genomic research.
• Sequenced 1/3 of the human genome.
(largest single contributor).
• We have active cancer, malaria,
pathogen and genomic variation / human
health studies.
All data is made publicly
available.
• Websites, ftp, direct database access,
programmatic APIs.
8. Ensembl
Ensembl is a system for genome Annotation.
Data visualisation / Mining web services.
• www.ensembl.org
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.
Compute Pipeline (HPTC Workload)
• Take a raw genome and run it through a compute pipeline to find genes
and other features of interest.
• Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate
genomes.
• Software is Open Source (Apache license).
• Data is free for download.
We have web services and HPTC workloads running on
IaaS.
9. Why Cloud?
Web services
• Was hosted in a single datacentre at the Genome Campus, UK.
• 1 datacentre = Single point of failure.
• Access slow if you were not in western Europe.
Cloud Application
• Build worldwide network of mirrors on IaaS.
HPC
• People want to run Ensembl HPC pipeline on their own data.
• Requires a skilled bioinformatician to get the software running and access
to an HPC cluster.
Cloud Application
• Build HPC SaaS.
• Users deploy ready-to-run Ensembl code on AWS, which self-assembles into an
HPC cluster and analyses their data.
14. Economic Trends:
Cost of sequencing halves every 12
months.
• cf Moore's Law
The Human genome project:
• 13 years.
• 23 labs.
• $500 Million.
A Human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 10,000s of
genomes.
Trend will continue:
• Generation 3 sequencers are on their way.
• $500 genome is probable within 5 years.
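The halving trend above can be checked with a little arithmetic. The following is a minimal sketch, assuming the cost keeps halving every 12 months from today's $10,000 figure; the function names are illustrative and not part of the talk.

```python
import math


def cost_after(years, start_cost=10_000.0, halving_years=1.0):
    """Projected cost per genome after `years`, assuming a fixed halving time."""
    return start_cost * 0.5 ** (years / halving_years)


def years_to_reach(target_cost, start_cost=10_000.0, halving_years=1.0):
    """Years until the cost falls to `target_cost` under the same assumption."""
    return halving_years * math.log2(start_cost / target_cost)


if __name__ == "__main__":
    for year in range(6):
        print(f"year {year}: ${cost_after(year):,.2f}")
    print(f"$500 reached after about {years_to_reach(500):.1f} years")
```

Under this assumption the $500 mark is crossed a little after year four, which is consistent with "probable within 5 years".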
15. The scary graph
Peak yearly capillary sequencing: 30 Gbase
Current weekly sequencing: 3000 Gbase
16. Managing Growth
We have exponential growth in storage and compute.
• Storage / compute doubles every 12 months.
• 2009 ~7 PB raw
Gigabase of sequence ≠ Gigabyte of storage.
• 16 bytes per base for sequence data.
• Intermediate analysis typically needs 10x disk space of the raw data.
Moore's law will not save us.
• Transistor/disk density: Td=18 months
• Sequencing cost: Td=12 months
• Sequencing output: Td=3-6 months
[Figure: Disk Storage in Terabytes (0 to 6000) by Year, 1994 to 2009]
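A rough sketch of the numbers on this slide, in Python. It assumes that Td means doubling time, that the 10x figure applies to the sequence data itself, and that the 3000 Gbase weekly figure from the previous slide is the input; the function names are illustrative only.

```python
BYTES_PER_BASE = 16        # from the slide: 16 bytes per base of sequence data
INTERMEDIATE_FACTOR = 10   # from the slide: analysis needs ~10x the raw data


def sequence_bytes(gigabases):
    """Bytes of sequence data for a given number of gigabases."""
    return gigabases * 1e9 * BYTES_PER_BASE


def growth_factor(months, doubling_months):
    """Multiplicative growth over `months` for a given doubling time."""
    return 2 ** (months / doubling_months)


if __name__ == "__main__":
    weekly = sequence_bytes(3000)  # 3000 Gbase per week
    print(f"sequence data per week:  {weekly / 1e12:.0f} TB")
    print(f"intermediate disk space: {INTERMEDIATE_FACTOR * weekly / 1e12:.0f} TB")

    # Compare growth over three years for the doubling times on the slide.
    for label, td in [("disk density", 18), ("sequencing cost", 12),
                      ("sequencing output (slow)", 6),
                      ("sequencing output (fast)", 3)]:
        print(f"{label}: x{growth_factor(36, td):,.0f} in 3 years")
```

Over three years an 18 month doubling time gives a factor of 4, while a 3 to 6 month doubling time gives a factor between 64 and 4096, which is the gap behind "Moore's law will not save us".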
17. What do you need to do sequencing?
[Diagram: LIMS System / Data Tracking; Sample prep; Sequencer; Data repository; Software repository; External analysis; Integrated compute; HPC Resource]
18. What IT do you need to do sequencing?
[Diagram: LIMS System / Data Tracking; Sample prep; Sequencer; Data repository; Software repository; External analysis; Integrated compute; HPC Resource]
Part covered in the grant
19. This is really hard...
We have a whole division of HPC specialists, LIMS
developers, bio-informaticians.
What about smaller labs with 1 or 2 sequencers?
20. ...and then change it.
Sequencing informatics is massively fluid.
• New chemistry.
• More sequencing machines.
• New analysis software.
Constant cycle of development and deployment.
22. What can we put on the Cloud?
[Diagram: LIMS System / Data Tracking; Sample prep; Sequencer; Data repository; Software repository; External analysis; Integrated compute; HPC Resource]
23. Does it Cloud?
How do we decide what to cloud?
Rule of thumb borrowed from HPC.
• Small data / High CPU work better in distributed environments.
[Diagram: spectrum from IO bound / large data to CPU bound / small data]
24. Sequencing Data
Data size per genome, from structured data (databases) to unstructured data (flat files):
• Tracking / LIMS (100s Kbytes)
• Individual features (3 MB)
• Variation data (1 GB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• (Raw data (TB))
25. Sequencing Data
Data size per genome, from cloud friendly (small, structured data in databases) to cloud unfriendly (large, unstructured data in flat files):
• Tracking / LIMS (100s Kbytes)
• Individual features (3 MB)
• Variation data (1 GB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• (Raw data (TB))
26. Can we Cloudify Sequencing?
[Diagram: LIMS System / Data Tracking; Sample prep; Sequencer; Data repository; Software repository; External analysis; Integrated compute; HPC Resource]
27. What are the blockers?
HPC infrastructure is now available in the cloud.
• Good enough for 95% of sequencing.
Doing big data is hard:
1. You have to get the data there first.
2. You may not be allowed to put the data there.
28. Moving data is hard
Tools:
• Standard tools (FTP, ssh/rsync) are not suited to wide-area networks.
• WAN tools: gridFTP/FDT/Aspera.
Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link).
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1TB to Dublin.
• 23 hours to move 1 TB to East coast.
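The transfer times above follow directly from the measured rates. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 Mbyte = 10^6 bytes); the function names are illustrative, not from the talk.

```python
TB = 1e12   # bytes, decimal terabyte (assumption)
MB = 1e6    # bytes, decimal megabyte (assumption)

# Measured rates from the slide, in bytes per second.
RATES = {
    "EC2 Dublin": 25 * MB,
    "EC2 East coast": 12 * MB,
}

SITE_LINK_BITS_PER_S = 2e9  # the 2 Gbit/s site link


def transfer_hours(size_bytes, rate_bytes_per_s):
    """Hours needed to move `size_bytes` at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600


def link_utilisation(rate_bytes_per_s, link_bits_per_s=SITE_LINK_BITS_PER_S):
    """Fraction of the site link that a transfer rate actually uses."""
    return rate_bytes_per_s * 8 / link_bits_per_s


if __name__ == "__main__":
    for destination, rate in RATES.items():
        print(f"Cambridge -> {destination}: "
              f"{transfer_hours(TB, rate):.1f} h per TB, "
              f"{link_utilisation(rate):.1%} of the site link")
```

Even the faster Dublin route uses only about a tenth of the 2 Gbit/s site link, which is why the next question on the slide is what speed should be achievable.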
What speed should we get?
• Once we leave JANET (UK academic network) finding out what the
connectivity is and what we should expect is almost impossible.
Do you have fast enough disks at each end to keep the
network full?
Why not just ship disks?
• Logistical nightmare.
• Format issues, corruption, slow.
29. Networking
How do we improve data
transfers across the public
internet?
• CERN approach; don't.
• Dedicated networking has been
put in between CERN and the T1
centres who get all of the CERN
data.
Can it work for cloud?
• Buy dedicated bandwidth to a
provider.
• Ties you in.
• Should they pay?
We need good connectivity
to everywhere.
31. Are you allowed to put data on
the cloud?
Default policy:
“Our data is confidential/important/critical to our business.
We must keep our data on our computers.”
32. What does “My System” mean?
From “My System” to “Not my system”:
• Purchased computer in my data centre
• Leased computer in my data centre
• Purchased computer in a co-lo facility
• Traditionally outsourced IT
• IaaS on a cloud provider
• SaaS on a cloud service provider
Root / Admin access?
VPN / inside or outside firewall?
Encrypted / non-encrypted?
Legal / IP agreement in place?
33. How confidential is the data?
Low risk → High risk:
• Publicly available datasets
• Anonymised genome data (eg individual genomes with no identifiers)
• Personally identifiable datasets
• Trade secret / patentable data
34. Reasons to be optimistic:
Most (all?) data security issues can be dealt with.
• But the devil is in the details.
• Data can be put on the cloud, if care is taken.
It is probably more secure there than in your own data centre.
• Can you match AWS data availability guarantees?
Are cloud providers different from any other organisation
you outsource to?
35. Outstanding Issues
Audit and compliance:
• If you need IP agreements above your provider's standard T&Cs, how do
you push them through?
Geographical boundaries mean little in the cloud.
• Data can be replicated across national boundaries, without end user
being aware.
Moving personally identifiable data outside of the EU is
potentially problematic.
• (Can be problematic within the EU; privacy laws are not as harmonised as
you might think.)
• More sequencing experiments are trying to link with phenotype data. (ie
personally identifiable medical records).
36. Private Cloud to the rescue?
Sequencing increasingly takes place in large consortiums.
• Eg International Cancer Genome Consortium (http://www.icgc.org)
Can we do private clouds within the consortium?
37. Traditional Collaboration
[Diagram: several sequencing centres and a Sequencing Centre + DCC, each with its own IT]
38. Cloud Collaborations
[Diagram: sequencing centres and Private Cloud IaaS / SaaS]
39. Private Cloud
Advantages:
• LIMS / analysis software easily shared with consortium.
• Small organisations leverage expertise of big IT organisations.
• Academia tends to be linked by fast research networks.
• Moving data is easier.
• Consortium will be signed up to data-access agreements.
• Simplifies data governance.
Problems:
• Big change in funding model.
• Are big centres set up to provide private cloud services?
• Selling services is hard if you are a charity.
• Can we do it as well as the big internet companies?
41. Dark Archives
Storing data in an archive is not
particularly useful.
• You need to be able to access the
data and do something useful with it.
Data in current archives is
“dark”.
• You can put/get data, but cannot
compute across it.
• Is data in an inaccessible archive
really useful?
42. Example problem:
“We want to run our pipeline across 100TB of data
currently in EGA/SRA.”
We will need to de-stage the data to Sanger, and then run
the compute.
• Extra 0.5 PB of storage, 1000 cores of compute.
• 3 month lead time.
• ~$1.5M capex.
43. Cloud / Computable archives
Move the compute to the data.
• Upload workload onto VMs.
• Put VMs on compute that is “attached” to the data.
Federated between centres
• Grid software built on top of cloud components.
• Avoids scaling problems inherent in putting everything in one place.
[Diagram: VMs placed on CPU pools attached to Data stores]
44. Acknowledgements
Sanger
• Phil Butcher
• James Beal
• Pete Clapham
• Simon Kelley
• Gen-Tao Chiang
• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken
EBI
• Glenn Proctor
• Steve Keenan