The next-generation sequencing data deluge requires storage and compute services to be provisioned at an ever-increasing rate. Can Cloud (and last decade's buzzword, Grid) help us?
Talk given at the NHGRI Cloud computing workshop, 2010.
7. Past Collaborations. Diagram labels: Data; Sequencing Centre + DCC; Sequencing Centre (x4).
8. Future Collaborations. Collaborations are short term: 18 months to 3 years. Diagram labels: Sequencing Centre 1; Sequencing Centre 2A; Sequencing Centre 2B; Sequencing Centre 3; Federated access.
9. Genomics Data. Data size per genome: intensities / raw data (2 TB); alignments (200 GB); sequence + quality data (500 GB); variation data (1 GB); individual features (3 MB). Unstructured data (flat files) versus structured data (databases): DAS, BioMart etc.?
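The per-genome sizes on the Genomics Data slide lend themselves to a quick back-of-envelope calculation. The sketch below is only illustrative: it assumes decimal units (1 TB = 1000 GB), and the function names `total_gb` and `storage_tb` are hypothetical helpers, not part of the talk.

```python
# Per-genome data sizes quoted on the slide, in gigabytes (decimal units assumed).
SIZES_GB = {
    "intensities / raw data": 2000.0,
    "sequence + quality data": 500.0,
    "alignments": 200.0,
    "variation data": 1.0,
    "individual features": 0.003,
}

RAW = "intensities / raw data"


def total_gb(sizes, keep_raw=True):
    """Sum the per-genome footprint, optionally dropping the raw intensities."""
    return sum(size for name, size in sizes.items() if keep_raw or name != RAW)


def storage_tb(n_genomes, keep_raw=True):
    """Storage needed for n_genomes, in terabytes."""
    return n_genomes * total_gb(SIZES_GB, keep_raw) / 1000.0


if __name__ == "__main__":
    for n in (1, 100, 1000):
        print(n, "genomes:",
              round(storage_tb(n), 1), "TB with raw data,",
              round(storage_tb(n, keep_raw=False), 1), "TB without")
```

Under these assumptions, the raw intensities account for roughly three quarters of each genome's footprint, so whether they are kept is the largest single factor in a storage estimate.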
16. iRODS. ICAT: catalogue database. Rule Engine: implements policies. User interface: WebDAV, icommands, FUSE. iRODS servers: data on disk; data in database; data in S3.
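The division of labour on the iRODS slide (a catalogue that tracks where data lives, a rule engine that applies policy, and servers fronting different storage back ends) can be sketched as a toy model. This is not the iRODS API: `Catalogue`, `choose_resource` and `put` are hypothetical names, and the policy shown is an invented example.

```python
from dataclasses import dataclass, field


@dataclass
class Catalogue:
    """Toy stand-in for the ICAT: maps a logical path to (resource, physical location)."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_path, resource, physical):
        self.entries[logical_path] = (resource, physical)

    def resolve(self, logical_path):
        return self.entries[logical_path]


def choose_resource(logical_path, size_bytes):
    """Toy policy rule (invented for illustration): route by file type and size."""
    if logical_path.endswith(".bam") and size_bytes > 10**9:
        return "disk"
    if logical_path.endswith(".vcf"):
        return "database"
    return "s3"


def put(catalogue, logical_path, size_bytes):
    """Apply the policy, then record the placement in the catalogue."""
    resource = choose_resource(logical_path, size_bytes)
    physical = f"{resource}://{logical_path.lstrip('/')}"
    catalogue.register(logical_path, resource, physical)
    return resource


if __name__ == "__main__":
    cat = Catalogue()
    put(cat, "/seq/run1/sample.bam", 2 * 10**9)
    put(cat, "/seq/run1/calls.vcf", 5 * 10**6)
    print(cat.resolve("/seq/run1/sample.bam"))
    print(cat.resolve("/seq/run1/calls.vcf"))
```

The point of the sketch is that users only ever see the logical path; where the bytes end up is decided by policy and recorded in the catalogue.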
19. Allows user at institute A to seamlessly access data at institute B in a controlled manner.
47. Put VMs on compute that is “attached” to the data. Diagram labels: Data, CPU (x4); Data, CPU (x4); VM.
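One way to read "put VMs on compute that is attached to the data" is as a data-locality placement rule: prefer a node that already holds the dataset, and fall back to a remote node only when no local one has capacity. The sketch below assumes that reading; `place_vm` and the node names are hypothetical.

```python
def place_vm(dataset, data_location, free_slots):
    """Pick a node for a VM, preferring nodes that hold the dataset locally.

    data_location: dict mapping dataset name -> list of nodes holding a copy.
    free_slots:    dict mapping node name -> number of free VM slots (mutated).
    Returns (node, is_local), or (None, False) if the cluster is full.
    """
    local = [n for n in data_location.get(dataset, []) if free_slots.get(n, 0) > 0]
    if local:
        # Among the local nodes, pick the one with the most free slots.
        node, is_local = max(local, key=lambda n: free_slots[n]), True
    else:
        remote = [n for n, free in free_slots.items() if free > 0]
        if not remote:
            return None, False
        node, is_local = max(remote, key=lambda n: free_slots[n]), False
    free_slots[node] -= 1
    return node, is_local


if __name__ == "__main__":
    data_location = {"genomeA": ["node1", "node2"]}
    free_slots = {"node1": 0, "node2": 2, "node3": 4}
    print(place_vm("genomeA", data_location, free_slots))  # local placement
    print(place_vm("genomeB", data_location, free_slots))  # no local copy
```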
48. Proto-Example: SSAHA trace search. Trace database (~30 TB); hash table (320 GB). 1. Hash the database. 2. Distribute the hash across machines. 3. Run the query in parallel.
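The three steps of the trace search example (hash the database, distribute the hash, query in parallel) can be sketched in miniature. This is a simplified k-mer lookup, not the actual SSAHA implementation: the value of `K`, the CRC-based sharding, and the use of threads to stand in for separate machines are all assumptions made for illustration.

```python
import zlib
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

K = 4  # toy k-mer length, chosen only to keep the example small


def kmers(seq, k=K):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shard_of(kmer, n_shards):
    """Deterministically assign a k-mer to one shard ("machine")."""
    return zlib.crc32(kmer.encode()) % n_shards


def build_shards(traces, n_shards, k=K):
    """Steps 1 and 2: hash the trace database and split the table across shards."""
    shards = [defaultdict(set) for _ in range(n_shards)]
    for trace_id, seq in traces.items():
        for km in kmers(seq, k):
            shards[shard_of(km, n_shards)][km].add(trace_id)
    return shards


def query_shard(shard, query_kmers):
    """Count, per trace, how many query k-mers this shard knows about."""
    hits = Counter()
    for km in query_kmers:
        for trace_id in shard.get(km, ()):
            hits[trace_id] += 1
    return hits


def search(shards, query, k=K):
    """Step 3: send the query to every shard in parallel and merge the hit counts."""
    query_kmers = kmers(query, k)
    total = Counter()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for partial in pool.map(lambda s: query_shard(s, query_kmers), shards):
            total.update(partial)
    return total.most_common()


if __name__ == "__main__":
    traces = {"t1": "ACGTACGTGA", "t2": "TTTTGGGGCC"}
    shards = build_shards(traces, n_shards=3)
    print(search(shards, "ACGTAC"))
```

Because each k-mer lives on exactly one shard, the per-shard counts never overlap and can simply be summed. The design also means no single machine needs to hold the whole hash table in memory.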
56. Compute architecture. CPUs on a fat network with a POSIX global filesystem, a batch scheduler and a data-store VS CPUs on a thin network, each with local storage, using Hadoop/S3 and a data-store.