Systems biology is becoming a data-intensive science due to the exponential growth of genomic and biological data. Large projects now produce petabytes of data that require new computational infrastructure to store, manage, and analyze. Cloud computing provides elastic resources that can scale to support the increasing data needs of systems biology. Case studies show how clouds are used for large-scale data integration and analysis, running combinatorial analysis over genomic marks, and enabling reanalysis of biological data through elastic virtual machines. The Open Cloud Consortium is working to provide open cloud resources for biological and biomedical research through testbeds and proposed bioclouds.
Is Systems Biology Ready for Data Intensive Science
1. Is Systems Biology Becoming a Data Intensive Science? Assuming So, Are You Ready? December 7, 2009. Robert Grossman, Laboratory for Advanced Computing, University of Illinois at Chicago.
2. Part 1: Biology as a Data Intensive Science. Two of the 14 high-throughput sequencers at the Ontario Institute for Cancer Research (OICR).
5. The Challenge is to Support Cubes of High Throughput Sequence Data. Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set. Cube axes: different developmental stages; different conditions; perturb the environment.
6. We Have a Problem … More and more of your colleagues produce so much data that they cannot easily manage and analyze it. Large projects build their own infrastructure. Everyone else is on their own.
14. Idea Dates Back to the 1960s. Diagram: applications running on guest operating systems (CMS, CMS, MVS) hosted by IBM VM/370 on an IBM mainframe. Native (full) virtualization; example: VMware ESX. Virtualization was first widely deployed with IBM VM/370.
15. What Do You Optimize? Goal: Minimize latency and control heat. Goal: Maximize data (with matching compute) and control cost.
22. Cistrack. Resource for cis-regulatory data. It is open source and based upon CUBioS. Currently used by the White Lab at University of Chicago for managing ModENCODE fly data. Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affymetrix, and Solexa platforms.
23. CUBioS. Diagram layers: Applications; Front Ends; CUBioS; Bowtie, TopHat, R pipelines, etc.; Ingestion (RNA-seq, ChIP-seq, DNA capture, etc.). Cistrack is an instance of CUBioS.
24. Chromatin Developmental Time-Course. Marks and factors profiled: H3K4me1 (enhancers); H3K4me3 (promoters & enhancers); H3K9Ac (activation); H3K9me3 (heterochromatin); H3K27Ac (activation); H3K27me3 (repression); PolII (transcription & promoters); CBP (HAT, enhancers); total RNA (expression). × 12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre). 8 antibodies used for ChIP experiments (Zirong Li and Bing Ren).
25. Cistrack Supports Multi-Dim. Cubes… Drosophila regulatory elements from Drosophila modENCODE. ChIP-chip data using Agilent 244K dual-color arrays. Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP. Each factor has been studied for 12 different time-points of Drosophila development.
26. … Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa. Cistrack integrates with large data clouds. Cistrack uses the Sector/Sphere large data cloud.
27. Hadoop vs Sector. Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.
28. Cistrack architecture components: Cistrack Web Portal & Widgets; Cistrack Elastic Cloud Services; Cistrack Database; Analysis Pipelines & Re-analysis Services; Cistrack Large Data Cloud Services; Ingestion Services.
30. Active Gene - Method. Gene activeness: label a transcript t as XYZ, where X=1 if H3K4me3 binds in [-1800, min(2200, TranscriptLength)]; Y=1 if Pol II binds in [-1800, min(2200, TranscriptLength)]; Z=1 if at least one exon is ≥30% covered by RNA and, in total, ≥10% is covered by RNA. Figure panels: H3K4me3 to TSS distance; Pol II to TSS distance. Source: Jia Chen et al. (ModENCODE)
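The XYZ labeling rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ModENCODE pipeline: the names `Transcript`, `label_transcript`, and `covered_bp` are hypothetical, peaks are assumed to be given as single positions relative to the TSS, RNA coverage is assumed to be a list of non-overlapping intervals in the same coordinates, and "in total ≥10%" is read as a fraction of total exon length.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), coordinates relative to the TSS


@dataclass
class Transcript:
    length: int            # transcript length in bp
    exons: List[Interval]  # exon intervals relative to the TSS


def covered_bp(interval: Interval, rna: List[Interval]) -> int:
    """Base pairs of `interval` overlapped by RNA intervals (assumed non-overlapping)."""
    start, end = interval
    return sum(max(0, min(end, r_end) - max(start, r_start)) for r_start, r_end in rna)


def label_transcript(t: Transcript, k4me3_peaks: List[int], pol2_peaks: List[int],
                     rna: List[Interval]) -> str:
    """Return the three-character XYZ label described on the slide."""
    lo, hi = -1800, min(2200, t.length)  # window from the slide
    x = int(any(lo <= p <= hi for p in k4me3_peaks))
    y = int(any(lo <= p <= hi for p in pol2_peaks))

    exon_len = [end - start for start, end in t.exons]
    exon_cov = [covered_bp(exon, rna) for exon in t.exons]
    # At least one exon is >= 30% covered ...
    one_exon_ok = any(c >= 0.30 * n for c, n in zip(exon_cov, exon_len) if n > 0)
    # ... and total coverage is >= 10% of total exon length (assumed denominator).
    total_ok = sum(exon_len) > 0 and sum(exon_cov) >= 0.10 * sum(exon_len)
    z = int(one_exon_ok and total_ok)
    return f"{x}{y}{z}"
```

For instance, a 1000 bp transcript with an H3K4me3 peak at -500, a Pol II peak at +1500 (outside the window, which ends at 1000), and RNA covering half of the first exon would be labeled `101` under these assumptions.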
31. Promoters: Use H3K4me3, PolII & RNA to Map Active Genes. Source: Jia Chen et al. (ModENCODE)
32. Active Genes (cont’d). Figure, panels A-C: Venn diagram of genes marked by PolII, H3K4me3, and RNA (region counts as extracted: 1418, 332, 753, 6104, 680, 482, 1350), and two plots against bp from TSS. Source: Jia Chen et al. (ModENCODE)
33. Interesting Combinatorial Combination of Marks. Figure: probes along the genome versus marks. Itemsets are formed by sliding a moving window along the genome. The Apriori algorithm generates interesting itemsets. Post-processing retains itemsets of biological relevance.
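A minimal sketch of the windowing and itemset-mining steps, assuming each probe is represented as the set of marks detected at it and that "interesting" means meeting a minimum support threshold. The function names, the window width, and the threshold are illustrative assumptions; the slide does not specify them, and the biological post-processing step is not shown.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def window_itemsets(probe_marks: List[Set[str]], width: int) -> List[FrozenSet[str]]:
    """Slide a window of `width` probes along the genome; each window yields one itemset."""
    return [frozenset().union(*probe_marks[i:i + width])
            for i in range(len(probe_marks) - width + 1)]


def apriori(transactions: List[FrozenSet[str]], min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every itemset whose support (fraction of windows containing it) >= min_support."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while candidates:
        support = {c: sum(1 for t in transactions if c <= t) / n for c in candidates}
        level = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent
```

With probes `[{"H3K4me3"}, {"PolII"}, {"H3K4me3", "PolII"}, set(), {"H3K27me3"}]` and a window of two probes, the pair {H3K4me3, PolII} appears in three of the four windows, so it survives a 0.5 support threshold while H3K27me3 does not.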
35. Cistrack Elastic Cloud A rack in Cistrack’s Elastic Cloud contains 128 Cores and 128 TB. Multiple racks form a data center. Virtual machines can run pipelines. Virtual machines have access to large data services. No need to move large datasets in and out of Amazon public cloud.
36. Use VMs to Support Reanalysis. Diagram: a cloud hosting several VMs. At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis, persist the results to a database, and persist the VM to a cloud.
37. Comparing Peak Calling Algorithms for ModENCODE We’re using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline. Also running the worm peak calling pipeline on the fly data.
38. Case Study 4: Ensembles of Trees on Clouds. Diagram: data feeding 100 tree models versus 10,000??? tree models. Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.
39. Ensembles of Trees for Clouds. Top-k ensembles: each node builds a single random tree with its local data; the central node picks the k best random trees to predict; lower cost with correspondingly lower accuracy; shuffling data can improve accuracy. Skeleton ensembles: the central node builds k skeletons of random trees; each local node fills in the skeletons; the central node merges all trees from the local nodes; greater cost, but more accurate.
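The top-k scheme can be illustrated with a small pure-Python sketch. It is not the implementation from the cited paper: the random-tree construction, the use of a held-out validation set to rank trees, the majority-vote prediction, and all function names are assumptions made for illustration, and the nodes are simulated as a list of local data partitions.

```python
import random
from collections import Counter


def majority(labels, default=0):
    """Most common label, or `default` when there are no labels."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def build_random_tree(X, y, depth, rng):
    """Random tree: split feature and threshold are drawn at random; leaves use local labels."""
    if depth == 0 or len(set(y)) <= 1:
        return ("leaf", majority(y))
    f = rng.randrange(len(X[0]))
    values = [row[f] for row in X]
    thr = rng.uniform(min(values), max(values))
    left = [i for i, row in enumerate(X) if row[f] <= thr]
    right = [i for i, row in enumerate(X) if row[f] > thr]
    if not left or not right:
        return ("leaf", majority(y))
    return ("node", f, thr,
            build_random_tree([X[i] for i in left], [y[i] for i in left], depth - 1, rng),
            build_random_tree([X[i] for i in right], [y[i] for i in right], depth - 1, rng))


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while tree[0] == "node":
        _, f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree[1]


def accuracy(tree, X, y):
    return sum(predict(tree, r) == t for r, t in zip(X, y)) / len(y)


def top_k_ensemble(partitions, validation, k, depth=3, seed=0):
    """Each partition (one per node) yields one random tree; keep the k most accurate."""
    rng = random.Random(seed)
    trees = [build_random_tree(X, y, depth, rng) for X, y in partitions]
    Xv, yv = validation  # assumed selection criterion: accuracy on held-out data
    trees.sort(key=lambda t: accuracy(t, Xv, yv), reverse=True)
    return trees[:k]


def ensemble_predict(trees, row):
    """Majority vote over the selected trees."""
    return majority([predict(t, row) for t in trees])
```

Only one tree per node crosses the network in this scheme, which is where the lower cost comes from; shuffling a fraction of rows between partitions before training would be a separate preprocessing step.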
40. Experimental Studies Performed experimental studies on 4 racks (104 nodes) of Open Cloud Testbed. Standard ensemble based models are more expensive than proposed approaches and can overfit. Skeleton ensembles are more accurate but more expensive to build. Shuffling improves accuracy of top-k algorithm. For KDDCup99 dataset top-k ensembles with shuffling 0.1% of data matches accuracy of skeleton method. For UCI Census income dataset, 20% shuffle required, which is more expensive than top-k ensemble. Without knowledge of uniformity of dataset, recommend skeleton ensembles.
48. Open Science Data Cloud: sky cloud; biocloud; additional projects in planning…
49. OCC Condominium Clouds. In a condominium cloud, you buy your own rack or bunch of racks. The racks are managed and operated by the condominium association, in this case the OCC. If your rack is 120 TB, you get the rights to approx. 40 TB of storage in the cloud. The rest is a shared resource.