Cloud Technical Challenges


               Guy Coates
      Wellcome Trust Sanger Institute


           gmpc@sanger.ac.uk
Outline
Background
Cloud Experiences
Barriers
Future Directions
The Sanger Institute
Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus,
    Cambridge, UK.

Large scale genomic research.
• Sequenced 1/3 of the human genome.
    (largest single contributor).
•   We have active cancer, malaria,
    pathogen and genomic variation / human
    health studies.

All data is made publicly
available.
• Websites, FTP, direct database access,
    programmatic APIs.
Lost in the clouds...
Victory!
Our Cloud Experiences
Hype Cycle
[Figure: hype cycle curve, labelled “Awesome!” and “Just works...”]
Ensembl
Ensembl is a system for genome annotation.
Data visualisation / Mining web services.
• www.ensembl.org
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.
Compute Pipeline (HPTC Workload)
• Take a raw genome and run it through a compute pipeline to find genes
    and other features of interest.
•   Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate
    genomes.

• Software is Open Source (Apache license).
• Data is free for download.
We have web services and HPTC workloads running on
IaaS.
Why Cloud?
Web services
• Was hosted in a single datacentre at the Genome Campus, UK.
• 1 datacentre = Single point of failure.
• Access slow if you were not in western Europe.
Cloud Application
• Build worldwide network of mirrors on IaaS.
HPC
• People want to run Ensembl HPC pipeline on their own data.
• Requires skilled bioinformatician to get the software running and access
  to a HPC cluster.

Cloud Application
• Build HPC SaaS.
• Users deploy ready-to-run Ensembl code on AWS, self-assembles into a
  HPC cluster and analyses their data.
Hype Cycle
[Figure: hype cycle curve, marking “Web services / Some HPC”]
That was easy...
Hype cycle
[Figure: hype cycle curve, marking “Sequencing informatics”]
DNA sequencing
Economic Trends:
Cost of sequencing halves every 12
months.
• cf Moore's Law
The Human genome project:
• 13 years.
• 23 labs.
• $500 Million.
A Human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 10,000s of
  genomes.

Trend will continue:
• Generation 3 sequencers are on their way.
• $500 genome is probable within 5 years.
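The trend above can be checked with a little arithmetic. The following is a minimal sketch that assumes the cost keeps halving every 12 months from the $10,000 figure; the function names are illustrative and not part of the talk.

```python
import math


def projected_cost(cost_now: float, years: float, halving_time_years: float = 1.0) -> float:
    """Cost after `years`, assuming it halves every `halving_time_years`."""
    return cost_now * 0.5 ** (years / halving_time_years)


def years_until(cost_now: float, target: float, halving_time_years: float = 1.0) -> float:
    """Years until the cost falls to `target` under the same halving assumption."""
    return halving_time_years * math.log2(cost_now / target)


if __name__ == "__main__":
    # Start from the $10,000 genome quoted on the slide.
    for year in range(6):
        print(f"year {year}: ${projected_cost(10_000, year):,.2f}")
    print(f"$500 reached after about {years_until(10_000, 500):.1f} years")
```

Under that assumption the $500 mark is crossed a little after four years, which is consistent with "probable within 5 years".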
The scary graph




Peak yearly capillary sequencing: 30 Gbase
Current weekly sequencing: 3000 Gbase
Managing Growth
We have exponential growth in
storage and compute.
• Storage / compute doubles every 12 months.
     • 2009 ~7 PB raw

Gigabase of sequence ≠ Gigabyte of storage.
• 16 bytes per base for sequence data.
• Intermediate analysis typically needs 10x disk space of the raw data.

Moore's law will not save us.
• Transistor/disk density: Td=18 months
• Sequencing cost:         Td=12 months
• Sequencing output:       Td=3-6 months

[Figure: Disk Storage in Terabytes (axis 0 to 6000) by Year, 1994 to 2009]
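To make these numbers concrete, here is a rough sketch that turns a sequencing yield into disk space and compares growth rates. It assumes that Td is a doubling time, that the 10x intermediate factor applies to the 16-bytes-per-base sequence data, and that units are decimal; the function names are illustrative.

```python
BYTES_PER_BASE = 16        # from the slide: 16 bytes per base of sequence data
INTERMEDIATE_FACTOR = 10   # from the slide: analysis needs about 10x the raw data


def storage_bytes(gigabases: float) -> tuple:
    """Return (sequence_data_bytes, intermediate_bytes) for a yield in Gbase."""
    sequence = gigabases * 1e9 * BYTES_PER_BASE
    return sequence, sequence * INTERMEDIATE_FACTOR


def growth_factor(years: float, doubling_time_months: float) -> float:
    """Growth over `years` for a quantity with the given doubling time."""
    return 2 ** (years * 12 / doubling_time_months)


if __name__ == "__main__":
    seq, scratch = storage_bytes(3000)  # the 3000 Gbase weekly figure
    print(f"sequence data: {seq / 1e12:.0f} TB, intermediate: {scratch / 1e12:.0f} TB")
    for label, td in [("disk density", 18), ("sequencing output", 6)]:
        print(f"{label}: x{growth_factor(3, td):.0f} over 3 years")
```

With these assumptions, one week of output at 3000 Gbase is about 48 TB of sequence data plus about 480 TB of intermediate space. Over three years disk density grows 4x while sequencing output at Td=6 months grows 64x.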
What do you need to do sequencing?
[Diagram: LIMS System / Data Tracking spanning the top; boxes for Sample prep, Sequencer (with Integrated compute), analysis software (with HPC Resource), Data repository and External repository]
What IT do you need to do sequencing?
[Diagram: LIMS System / Data Tracking spanning the top; boxes for Sample prep, Sequencer (with Integrated compute), analysis software (with HPC Resource), Data repository and External repository; one region is marked “Part covered in the grant”]
This is really hard...
We have a whole division of HPC specialists, LIMS
developers, bioinformaticians.

What about smaller labs with 1 or 2 sequencers?
...and then change it.
Sequencing informatics is massively fluid.
• New chemistry.
• More sequencing machines.
• New analysis software.

Constant cycle of development and deployment.
How can cloud help?
What can we put on the Cloud?
[Diagram: LIMS System / Data Tracking spanning the top; boxes for Sample prep, Sequencer (with Integrated compute), analysis software (with HPC Resource), Data repository and External repository]
Does it Cloud?
How do we decide what to cloud?
Rule of thumb borrowed from HPC.
• Small data / High CPU work better in distributed environments.




[Spectrum: IO bound / large data at one end, CPU bound / small data at the other]
Sequencing Data
Data size per Genome, from structured data (databases) down to unstructured data (flat files):
• Tracking / LIMS (100s Kbytes)
• Individual features (3 MB)
• Variation data (1 GB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• ( Raw data (TB) )
Sequencing Data
Data size per Genome, from Cloud Friendly structured data (databases) down to Cloud Unfriendly unstructured data (flat files):
• Tracking / LIMS (100s Kbytes)
• Individual features (3 MB)
• Variation data (1 GB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• ( Raw data (TB) )
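One way to see why the larger items are cloud unfriendly is to estimate how long each would take to upload. The sketch below uses the per-genome sizes listed above and the 25 Mbytes/s Cambridge to EC2 Dublin rate quoted later in this talk. Decimal units are an assumption, and the two items without an exact size (Tracking / LIMS and raw data) are left out rather than guessed.

```python
# Per-genome data sizes from the slide, in bytes (decimal units assumed).
DATA_SIZES_BYTES = {
    "Individual features": 3e6,
    "Variation data": 1e9,
    "Alignments": 200e9,
    "Sequence + quality data": 500e9,
}

RATE_BYTES_PER_S = 25e6  # Cambridge -> EC2 Dublin rate reported in this talk


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float = RATE_BYTES_PER_S) -> float:
    """Idealised transfer time: size divided by a sustained rate."""
    return size_bytes / rate_bytes_per_s


if __name__ == "__main__":
    for name, size in DATA_SIZES_BYTES.items():
        seconds = transfer_seconds(size)
        print(f"{name:<26} {seconds:>10.1f} s  ({seconds / 3600:.2f} h)")
```

At that rate the structured items move in seconds, while alignments and sequence + quality data take hours per genome.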
Can we Cloudify Sequencing?
[Diagram: LIMS System / Data Tracking spanning the top; boxes for Sample prep, Sequencer (with Integrated compute), analysis software (with HPC Resource), Data repository and External repository]
What are the blockers?
HPC infrastructure is now available in the cloud.
• Good enough for 95% of sequencing.

Doing big data is hard:

1. You have to get the data there first.
2. You may not be allowed to put the data there.
Moving data is hard

Tools:
• Standard tools (FTP, ssh/rsync) are not suited to wide-area networks.
• WAN tools: gridFTP/FDT/Aspera.
Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link).
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1TB to Dublin.
• 23 hours to move 1 TB to East coast.
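The figures above can be reproduced with a short calculation. This is a minimal sketch assuming decimal units (1 TB = 10^12 bytes, 1 Mbyte = 10^6 bytes) and a sustained rate; the helper names are illustrative.

```python
SITE_LINK_BITS_PER_S = 2e9  # the 2 Gbit/s site link from the slide


def mbytes_to_mbits(mbytes_per_s: float) -> float:
    """Convert Mbytes/s to Mbits/s (8 bits per byte)."""
    return mbytes_per_s * 8


def hours_to_move(size_bytes: float, mbytes_per_s: float) -> float:
    """Hours needed to move `size_bytes` at a sustained rate in Mbytes/s."""
    return size_bytes / (mbytes_per_s * 1e6) / 3600


def link_utilisation(mbytes_per_s: float, link_bits_per_s: float = SITE_LINK_BITS_PER_S) -> float:
    """Fraction of the site link that the measured rate actually uses."""
    return mbytes_per_s * 1e6 * 8 / link_bits_per_s


if __name__ == "__main__":
    for site, rate in [("EC2 East coast", 12), ("EC2 Dublin", 25)]:
        print(
            f"{site}: {mbytes_to_mbits(rate):.0f} Mbits/s, "
            f"{hours_to_move(1e12, rate):.1f} h per TB, "
            f"{link_utilisation(rate):.1%} of the site link"
        )
```

The measured rates use roughly 5% and 10% of the 2 Gbit/s link, so the site link itself is not the limiting factor.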
What speed should we get?
• Once we leave JANET (UK academic network) finding out what the
  connectivity is and what we should expect is almost impossible.

Do you have fast enough disks at each end to keep the
network full?
Why not just ship disks?
• Logistical nightmare.
• Format issues, corruption, slow.
Networking
How do we improve data
transfers across the public
internet?
• CERN approach; don't.
• Dedicated networking has been
  put in between CERN and the T1
  centres who get all of the CERN
  data.

Can it work for cloud?
• Buy dedicated bandwidth to a
  provider.
   • Ties you in.
   • Should they pay?


We need good connectivity
to everywhere.
Data Security
Are you allowed to put data on
          the cloud?
 Default policy:

 “Our data is confidential/important/critical to our business.
 We must keep our data on our computers.”
What does “My System” mean?

Spectrum from “My System” to “Not my system”:
• Purchased computer in my data centre
• Leased computer in my data centre
• Purchased computer in a co-lo facility
• Traditionally outsourced IT service
• IaaS on a cloud provider
• SaaS on a cloud provider

Questions:
• Root / Admin access?
• VPN / inside or outside firewall?
• Encrypted / non-encrypted?
• Legal / IP agreement in place?
How confidential is the data?

Spectrum from Low Risk to High Risk:
• Publicly available genome data
• Anonymised datasets (eg individual genomes with no identifiers)
• Personally identifiable datasets
• Trade secret / patentable data
Reasons to be optimistic:
Most (all?) data security issues can be dealt with.
• But the devil is in the details.
• Data can be put on the cloud, if care is taken.
It is probably more secure there than in your own data centre.
• Can you match AWS data availability guarantees?
Are cloud providers different from any other organisation
you outsource to?
Outstanding Issues
Audit and compliance:
• If you need IP agreements above your provider's standard T&Cs, how do
    you push them through?


Geographical boundaries mean little in the cloud.
• Data can be replicated across national boundaries, without end user
    being aware.

Moving personally identifiable data outside of the EU is
potentially problematic.
• (Can be problematic within the EU; privacy laws are not as harmonised as
    you might think.)
•   More sequencing experiments are trying to link with phenotype data. (ie
    personally identifiable medical records).
Private Cloud to the rescue?
Sequencing increasingly takes place in large consortiums.
• Eg International Cancer Genome Consortium (http://www.icgc.org)
Can we do private clouds within the consortium?
Traditional Collaboration
[Diagram: three sequencing centres and a Sequencing Centre + DCC, each with its own IT]
Cloud Collaborations
[Diagram: four sequencing centres sharing two Private Cloud (IaaS / SaaS) blocks]
Private Cloud
Advantages:
• LIMS / analysis software easily shared with consortium.
     • Small organisations leverage expertise of big IT organisations.
•   Academia tends to be linked by fast research networks.
     • Moving data is easier.
•   Consortium will be signed up to data-access agreements.
     • Simplifies data governance.



Problems:
• Big change in funding model.
• Are big centres set up to provide private cloud services?
     • Selling services is hard if you are a charity.
•   Can we do it as well as the big internet companies?
Cloud data archives
Dark Archives
Storing data in an archive is not
particularly useful.
• You need to be able to access the
    data and do something useful with it.

Data in current archives is
“dark”.
• You can put/get data, but cannot
    compute across it.
•   Is data in an inaccessible archive
    really useful?
Example problem:
“We want to run our pipeline across 100TB of data
currently in EGA/SRA.”
We will need to de-stage the data to Sanger, and then run
the compute.
• Extra 0.5 PB of storage, 1000 cores of compute.
• 3 month lead time.
• ~$1.5M capex.
Cloud / Computable archives
Move the compute to the
data.
• Upload workload onto VMs.
• Put VMs on compute that is
    “attached” to the data.

Federated between
centres
• Grid software built on top of
    cloud components.
• Avoids scaling problems
    inherent in putting everything
    in one place.

[Diagram: VMs placed onto CPU pools that sit next to the Data at each centre]
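The idea of moving the compute to the data can be illustrated with a toy placement routine. This is only a sketch under stated assumptions: the catalogue, centre names, dataset names and the `Job` type are all hypothetical, and real grid or cloud middleware would do far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str     # hypothetical workload packaged as a VM
    dataset: str  # dataset the workload needs to read


# Hypothetical catalogue: which centre already holds which dataset.
DATASET_LOCATION = {"dataset-a": "centre-1", "dataset-b": "centre-2"}


def place_jobs(jobs, dataset_location):
    """Group jobs by the centre that already holds their input data."""
    placement = {}
    for job in jobs:
        centre = dataset_location.get(job.dataset)
        if centre is None:
            raise KeyError(f"no centre holds {job.dataset!r}")
        placement.setdefault(centre, []).append(job.name)
    return placement


if __name__ == "__main__":
    jobs = [Job("align", "dataset-a"), Job("call", "dataset-b"), Job("qc", "dataset-a")]
    for centre, names in place_jobs(jobs, DATASET_LOCATION).items():
        print(f"{centre}: run {', '.join(names)}")
```

Each job is sent to the centre that holds its data, so only the VM image travels over the network rather than the dataset.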
Acknowledgements
Sanger                EBI
•   Phil Butcher      Glenn Proctor
•   James Beal        Steve Keenan
•   Pete Clapham
•   Simon Kelley
•   Gen-Tao Chiang

• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken

Next generation genomics: Petascale data in the life sciences
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Clouds, Grids and Data
Clouds, Grids and DataClouds, Grids and Data
Clouds, Grids and Data
 
Cluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesCluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomes
 

Dernier

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Dernier (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Cloud Technical Challenges

  • 1. Cloud Technical Challenges Guy Coates Wellcome Trust Sanger Institute gmpc@sanger.ac.uk
  • 3. The Sanger Institute Funded by Wellcome Trust. • 2nd largest research charity in the world. • ~700 employees. • Based in Hinxton Genome Campus, Cambridge, UK. Large scale genomic research. • Sequenced 1/3 of the human genome (largest single contributor). • We have active cancer, malaria, pathogen and genomic variation / human health studies. All data is made publicly available. • Websites, ftp, direct database access, programmatic APIs.
  • 4. Lost in the clouds...
  • 7. Hype Cycle Awesome! Just works...
  • 8. Ensembl Ensembl is a system for genome annotation. Data visualisation / mining web services. • www.ensembl.org • Provides web / programmatic interfaces to genomic data. • 10k visitors / 126k page views per day. Compute Pipeline (HPTC Workload) • Take a raw genome and run it through a compute pipeline to find genes and other features of interest. • Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes. • Software is Open Source (Apache license). • Data is free for download. We have web services and HPTC workloads running on IaaS.
  • 9. Why Cloud? Web services • Was hosted in a single datacentre at the Genome Campus, UK. • 1 datacentre = Single point of failure. • Access slow if you were not in western Europe. Cloud Application • Build worldwide network of mirrors on IaaS. HPC • People want to run Ensembl HPC pipeline on their own data. • Requires skilled bioinformatician to get the software running and access to a HPC cluster. Cloud Application • Build HPC SaaS. • Users deploy ready-to-run Ensembl code on AWS, self-assembles into a HPC cluster and analyses their data.
  • 10. Hype Cycle Web services / Some HPC
  • 14. Economic Trends: The cost of sequencing halves every 12 months. • cf Moore's Law The Human genome project: • 13 years. • 23 labs. • $500 Million. A Human genome today: • 3 days. • 1 machine. • $10,000. • Large centres are now doing studies with 10,000s of genomes. Trend will continue: • Generation 3 sequencers are on their way. • $500 genome is probable within 5 years.
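The halving trend on the slide above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the $10,000 per-genome cost and the 12-month halving time quoted on the slide hold steady; the function name `projected_cost` is illustrative only and not from the talk.

```python
def projected_cost(cost_today: float, years: float,
                   halving_time_years: float = 1.0) -> float:
    """Cost after `years`, assuming it halves every `halving_time_years`."""
    return cost_today * 0.5 ** (years / halving_time_years)


if __name__ == "__main__":
    # Starting point taken from the slide: $10,000 per human genome today.
    for year in range(6):
        print(f"year {year}: ${projected_cost(10_000, year):,.2f}")
```

Under these assumptions the cost drops below $500 between year 4 and year 5, which is consistent with the slide's "$500 genome is probable within 5 years".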
  • 15. The scary graph. Peak yearly capillary sequencing: 30 Gbase. Current weekly sequencing: 3000 Gbase.
  • 16. Managing Growth We have exponential growth in storage and compute. • Storage / compute doubles every 12 months. • 2009 ~7 PB raw. Gigabase of sequence ≠ Gigabyte of storage. • 16 bytes per base for sequence data. • Intermediate analysis typically needs 10x disk space of the raw data. Moore's law will not save us. • Transistor/disk density: Td=18 months • Sequencing cost: Td=12 months • Sequencing output: Td=3-6 months [Chart: Disk Storage in Terabytes (0 to 6000) by Year, 1994 to 2009.]
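The storage figures and doubling times on the slide above can be worked through as follows. This is a rough sketch, not the speaker's own calculation: it assumes decimal units (1 Gbase = 10^9 bases, 1 TB = 10^12 bytes) and treats the 16 bytes-per-base figure as the baseline that the 10x intermediate multiplier applies to. The names `storage_bytes` and `growth_factor` are hypothetical.

```python
BYTES_PER_BASE = 16        # from the slide: 16 bytes per base of sequence data
INTERMEDIATE_FACTOR = 10   # from the slide: intermediates need ~10x the raw space
TB = 10**12                # decimal terabyte (assumption)


def storage_bytes(gigabases: int) -> tuple[int, int]:
    """Return (sequence_data_bytes, intermediate_bytes) for a given output."""
    sequence = gigabases * 10**9 * BYTES_PER_BASE
    return sequence, sequence * INTERMEDIATE_FACTOR


def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth over `years` for a given doubling time (Td)."""
    return 2 ** (years * 12 / doubling_months)


if __name__ == "__main__":
    seq, inter = storage_bytes(3000)  # 3000 Gbase: the weekly figure quoted earlier
    print(f"sequence data: {seq / TB:.0f} TB, intermediates: {inter / TB:.0f} TB")
    for label, td in [("transistor/disk density", 18),
                      ("sequencing cost", 12),
                      ("sequencing output (slow end)", 6),
                      ("sequencing output (fast end)", 3)]:
        print(f"{label}: x{growth_factor(3, td):.0f} over 3 years")
```

Over three years the 18-month doubling gives a factor of 4 while a 6-month doubling gives 64, which illustrates why the slide says Moore's law will not keep pace with sequencing output.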
  • 17. What do you need to do sequencing? [Diagram boxes: LIMS system / data tracking; sample prep; sequencer; data repository; software repository; integrated compute; HPC resource; external analysis.]
  • 18. What IT do you need to do sequencing? [Same diagram as the previous slide, with part of it labelled "Part covered in the grant".]
  • 19. This is really hard... We have a whole division of HPC specialists, LIMS developers and bioinformaticians. What about smaller labs with 1 or 2 sequencers?
  • 20. ...and then change it. Sequencing informatics is massively fluid. • New chemistry. • More sequencing machines. • New analysis software. Constant cycle of development and deployment.
  • 21. How can cloud help?
  • 22. What can we put on the Cloud? [Diagram boxes: LIMS system / data tracking; sample prep; sequencer; data repository; software repository; integrated compute; HPC resource; external analysis.]
  • 23. Does it Cloud? How do we decide what to cloud? Rule of thumb borrowed from HPC. • Small data / High CPU work better in distributed environments. [Diagram: spectrum from IO bound / large data to CPU bound / small data.]
  • 24. Sequencing Data: data size per genome, from structured data (databases) to unstructured data (flat files). • Tracking / LIMS (100s Kbytes) • Individual features (3 MB) • Variation data (1 GB) • Alignments (200 GB) • Sequence + quality data (500 GB) • (Raw data (TB))
  • 25. Sequencing Data: data size per genome, from structured data (databases, Cloud Friendly) to unstructured data (flat files, Cloud Unfriendly). • Tracking / LIMS (100s Kbytes) • Individual features (3 MB) • Variation data (1 GB) • Alignments (200 GB) • Sequence + quality data (500 GB) • (Raw data (TB))
  • 26. Can we Cloudify Sequencing? [Diagram boxes: LIMS system / data tracking; sample prep; sequencer; data repository; software repository; integrated compute; HPC resource; external analysis.]
  • 27. What are the blockers? HPC infrastructure is now available in the cloud. • Good enough for 95% of sequencing. Doing big data is hard: 1. You have to get the data there first. 2. You may not be allowed to put the data there.
  • 28. Moving data is hard Tools: • Commonly used tools (FTP, ssh/rsync) are not suited to wide-area networks. • WAN tools: gridFTP/FDT/Aspera. Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link). • Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s) • Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s) • 11 hours to move 1 TB to Dublin. • 23 hours to move 1 TB to East coast. What speed should we get? • Once we leave JANET (UK academic network), finding out what the connectivity is and what we should expect is almost impossible. Do you have fast enough disks at each end to keep the network full? Why not just ship disks? • Logistical nightmare. • Format issues, corruption, slow.
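The transfer times quoted on the slide above follow directly from the measured rates. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 Mbyte = 10^6 bytes) and a sustained rate with no protocol overhead; the helper names are illustrative only.

```python
TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move `size_bytes` at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600


def link_utilisation(rate_mbytes_per_s: float, link_gbit_per_s: float) -> float:
    """Fraction of the site link that a measured transfer rate actually uses."""
    return (rate_mbytes_per_s * 8) / (link_gbit_per_s * 1000)


if __name__ == "__main__":
    # Measured rates from the slide, over a 2 Gbit/s site link.
    for dest, rate in [("EC2 Dublin", 25), ("EC2 East coast", 12)]:
        hours = transfer_hours(1 * TB, rate * MB)
        used = link_utilisation(rate, 2.0)
        print(f"{dest}: {hours:.1f} h per TB, {used:.1%} of the site link")
```

Even the faster Dublin transfer uses only about a tenth of the 2 Gbit/s site link, which is the gap behind the slide's question "What speed should we get?".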
  • 29. Networking How do we improve data transfers across the public internet? • CERN approach; don't. • Dedicated networking has been put in between CERN and the T1 centres who get all of the CERN data. Can it work for cloud? • Buy dedicated bandwidth to a provider. • Ties you in. • Should they pay? We need good connectivity to everywhere.
  • 31. Are you allowed to put data on the cloud? Default policy: “Our data is confidential/important/critical to our business. We must keep our data on our computers.”
  • 32. What does “My System” mean? [Diagram: spectrum from “My system” to “Not my system”.] • Purchased computer in my data centre • Leased computer in my data centre • Purchased computer in a co-lo facility • Traditionally outsourced IT service • IaaS on a cloud provider • SaaS on a cloud provider. Root / Admin access? VPN / inside or outside firewall? Encrypted / non-encrypted? Legal / IP agreement in place?
  • 33. How confidential is the data? [Diagram: spectrum from Low Risk to High Risk.] • Publicly available datasets • Anonymised genome data (eg individual genomes with no identifiers) • Personally identifiable datasets • Trade secret / patentable data
  • 34. Reasons to be optimistic: Most (all?) data security issues can be dealt with. • But the devil is in the details. • Data can be put on the cloud, if care is taken. It is probably more secure there than in your own data centre. • Can you match AWS data availability guarantees? Are cloud providers different from any other organisation you outsource to?
  • 35. Outstanding Issues Audit and compliance: • If you need IP agreements above your provider's standard T&Cs, how do you push them through? Geographical boundaries mean little in the cloud. • Data can be replicated across national boundaries without the end user being aware. Moving personally identifiable data outside of the EU is potentially problematic. • (Can be problematic within the EU; privacy laws are not as harmonised as you might think.) • More sequencing experiments are trying to link with phenotype data (ie personally identifiable medical records).
  • 36. Private Cloud to the rescue? Sequencing increasingly takes place in large consortiums. • (Eg International Cancer Genome Consortium, http://www.icgc.org) Can we do private clouds within the consortium?
  • 37. Traditional Collaboration [Diagram: several sequencing centres, each with its own IT, linked to a sequencing centre + DCC that also has its own IT.]
  • 38. Cloud Collaborations [Diagram: several sequencing centres and a central sequencing centre connected through private cloud IaaS / SaaS.]
  • 39. Private Cloud Advantages: • LIMS / analysis software easily shared with consortium. • Small organisations leverage expertise of big IT organisations. • Academia tends to be linked by fast research networks. • Moving data is easier. • Consortium will be signed up to data-access agreements. • Simplifies data governance. Problems: • Big change in funding model. • Are big centres set up to provide private cloud services? • Selling services is hard if you are a charity. • Can we do it as well as the big internet companies?
  • 41. Dark Archives Storing data in an archive is not particularly useful. • You need to be able to access the data and do something useful with it. Data in current archives is “dark”. • You can put/get data, but cannot compute across it. • Is data in an inaccessible archive really useful?
  • 42. Example problem: “We want to run our pipeline across 100TB of data currently in EGA/SRA.” We will need to de-stage the data to Sanger, and then run the compute. • Extra 0.5 PB of storage, 1000 cores of compute. • 3 month lead time. • ~$1.5M capex.
  • 43. Cloud / Computable archives Move the compute to the data. • Upload workload onto VMs. • Put VMs on compute that is “attached” to the data. Federated between centres. • Grid software built on top of cloud components. • Avoids scaling problems inherent in putting everything in one place. [Diagram: VMs running on banks of CPUs attached to data stores at each centre.]
  • 44. Acknowledgements Sanger: • Phil Butcher • James Beal • Pete Clapham • Simon Kelley • Gen-Tao Chiang • Steve Searle • Jan-Hinnerk Vogel • Bronwen Aken. EBI: • Glenn Proctor • Steve Keenan.