Keynote presentation at OGF 28.
The year 2000 saw the release of "The" human genome, the product of the combined sequencing effort of the whole planet. In 2010, single institutions are sequencing thousands of genomes a year, producing petabytes of data. Furthermore, many of the large-scale sequencing projects are based around international collaborations and consortia. The talk will explore how Grid and Cloud technologies are being used to share genomics data around the planet, revolutionizing life science research.
45. Past Collaborations Data Sequencing Centre + DCC Sequencing centre Sequencing centre Sequencing centre Sequencing centre
46. Future Collaborations Collaborations are short term: 18 months-3 years. Sequencing centre Sequencing centre Sequencing centre Sequencing centre Federated access
47. Genomics Data Unstructured data (flat files) Data size per Genome Structured data (databases) Clinical Researchers, non-informaticians Sequencing informatics specialists Intensities / raw data (2TB) Alignments (200 GB) Sequence + quality data (500 GB) Variation data (1GB) Individual features (3MB)
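The per-genome sizes on this slide can be turned into a rough storage estimate. The sketch below is only an illustration: it assumes the five tiers are additive and uses decimal units, and the genome count of 1000 is a hypothetical figure, not one taken from the talk.

```python
# Per-genome data sizes as listed on the slide, in gigabytes (decimal units).
SIZES_GB = {
    "intensities / raw data": 2000.0,   # 2 TB
    "sequence + quality data": 500.0,
    "alignments": 200.0,
    "variation data": 1.0,
    "individual features": 0.003,       # 3 MB
}


def total_per_genome_gb(sizes=SIZES_GB):
    """Sum of all tiers for one genome, assuming every tier is kept."""
    return sum(sizes.values())


def storage_tb(n_genomes, tiers=None):
    """Storage in terabytes for n_genomes, optionally restricted to some tiers."""
    tiers = list(SIZES_GB) if tiers is None else tiers
    per_genome_gb = sum(SIZES_GB[t] for t in tiers)
    return n_genomes * per_genome_gb / 1000.0


if __name__ == "__main__":
    n = 1000  # hypothetical number of genomes per year
    print(f"all tiers: {storage_tb(n):.1f} TB")
    print(f"variation data only: {storage_tb(n, ['variation data']):.1f} TB")
```

Under these assumptions, keeping every tier for a thousand genomes lands in the petabyte range, while the variation data alone stays around a terabyte. That gap is why the bulk and structured tiers are treated separately.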
52. Bulk Data Structured data (databases) Unstructured data (flat files) Data size per Genome Sequencing informatics specialists Intensities / raw data (2TB) Alignments (200 GB) Sequence + quality data (500 GB) Variation data (1GB) Individual features (3MB)
58. Compute farm analysis/QC pipeline Alignment/assembly suckers Data pull ... Final Repository (Oracle) 100TB / yr staging area 500 TB Seq 1 Seq 38
61. ... Data pull ... ? Compute farm analysis/QC pipeline assembly/alignment suckers Final Repository (Oracle) 100TB / yr staging area 500TB Seq 1 Seq 38 Compute Farm Compute farm disk Collaborators / 3rd party sequencing Unmanaged LIMS managed data
62. Accidents waiting to happen... From: <User A> (who left 12 months ago) I find the <project> directory is removed. The original directory is "/scratch/<User B> (who left 6 months ago)" ... where is it? If this problem cannot be solved, I am afraid that <project> cannot be released.
67. Produced by DICE (Data Intensive Cyber Environments) groups at U. North Carolina, Chapel Hill.
69. iRODS ICAT Catalogue database Rule Engine Implements policies iRODS Server Data on disk User interface WebDAV, icommands, FUSE iRODS Server Data in database
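The division of labour on this slide (a catalogue that records where data lives, a rule engine that enforces policy, and servers that hold the data) can be illustrated with a toy model. This is not the iRODS API: every class and function name below (`Catalogue`, `RuleEngine`, `Server`, `require_project`) is hypothetical, and the "project metadata is mandatory" policy is an assumed example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataObject:
    logical_path: str                 # name users see
    physical_path: str                # where the bytes actually live
    metadata: Dict[str, str] = field(default_factory=dict)


class Catalogue:
    """Toy stand-in for a catalogue database: logical path -> object record."""

    def __init__(self) -> None:
        self._objects: Dict[str, DataObject] = {}

    def register(self, obj: DataObject) -> None:
        self._objects[obj.logical_path] = obj

    def query(self, **attrs: str) -> List[DataObject]:
        """Return objects whose metadata matches every given attribute."""
        return [o for o in self._objects.values()
                if all(o.metadata.get(k) == v for k, v in attrs.items())]


class RuleEngine:
    """Runs every registered policy when data is stored."""

    def __init__(self) -> None:
        self._rules: List[Callable[[DataObject], None]] = []

    def add_rule(self, rule: Callable[[DataObject], None]) -> None:
        self._rules.append(rule)

    def on_put(self, obj: DataObject) -> None:
        for rule in self._rules:
            rule(obj)


def require_project(obj: DataObject) -> None:
    """Assumed example policy: refuse data that is not tied to a project."""
    if "project" not in obj.metadata:
        raise ValueError(f"{obj.logical_path}: missing 'project' metadata")


class Server:
    def __init__(self, catalogue: Catalogue, rules: RuleEngine) -> None:
        self.catalogue = catalogue
        self.rules = rules

    def put(self, logical_path: str, physical_path: str, **metadata: str) -> DataObject:
        obj = DataObject(logical_path, physical_path, dict(metadata))
        self.rules.on_put(obj)        # policies run before registration
        self.catalogue.register(obj)
        return obj
```

Because policies run before the catalogue is updated, an object that fails a rule never becomes visible to queries. That is one way to avoid the orphaned scratch directories described earlier in the talk.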
82. Structured Data Structured data (databases) Unstructured data (flat files) Data size per Genome Clinical Researchers, non-informaticians Intensities / raw data (2TB) Alignments (200 GB) Sequence + quality data (500 GB) Variation data (1GB) Individual features (3MB)
129. Gene Finding DNA HMM Prediction Alignment with known proteins Alignment with fragments recovered in vivo Alignment with other genes and other species
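As an illustration of the "DNA HMM Prediction" step, the sketch below runs the Viterbi algorithm on a two-state hidden Markov model. It is a toy, not the gene finder used in the talk: the two states, all probabilities, and the assumption that coding regions are GC-rich are invented for the example, and real gene finders use far richer models.

```python
import math


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Most likely state path for seq, computed in log space."""
    if not seq:
        return []
    # Best log-probability of any path ending in each state at position 0.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
               for s in states}]
    back = [{}]
    for t in range(1, len(seq)):
        scores.append({})
        back.append({})
        for s in states:
            # Pick the best predecessor state for s at position t.
            prev, best = max(
                ((p, scores[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[t][s] = best + math.log(emit_p[s][seq[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Illustrative parameters only (assumed, not from the talk).
STATES = ["noncoding", "coding"]
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "coding": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

if __name__ == "__main__":
    print(viterbi("AAAAGGGG", STATES, START, TRANS, EMIT))
```

The other evidence sources on the slide (alignments with known proteins, recovered fragments, and other species) would be combined with predictions like these; this sketch covers only the HMM part.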
146. IO Architecture VS CPU CPU CPU Fat Network POSIX Global filesystem CPU CPU CPU CPU Thin network Local storage Local storage Local storage Local storage Batch scheduler Hadoop/S3
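The right-hand design on this slide (thin network, local storage) depends on sending work to the node that already holds the data. The sketch below shows that idea as a minimal locality-aware task placement. The function `assign_tasks` and the block and node names are hypothetical; this is not how Hadoop or any particular batch scheduler implements placement.

```python
def assign_tasks(block_locations, nodes):
    """Place one task per data block, preferring nodes with a local copy.

    block_locations maps a block name to the list of nodes holding it.
    Returns {block: (node, is_local)}. Ties go to the least-loaded node,
    then to the alphabetically first node name.
    """
    load = {n: 0 for n in nodes}
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if n in load]
        # Fall back to any node (a remote read) if no holder is available.
        candidates = local or list(nodes)
        chosen = min(candidates, key=lambda n: (load[n], n))
        assignment[block] = (chosen, chosen in holders)
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"], "b4": []}
    for block, (node, is_local) in assign_tasks(blocks, ["n1", "n2", "n3"]).items():
        print(block, node, "local" if is_local else "remote")
```

In the left-hand design, a global POSIX filesystem behind a fat network makes every block equally reachable, so a batch scheduler can place tasks on load alone.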