Is Systems Biology Ready for Data Intensive Science

Is Systems Biology Becoming a Data Intensive Science? Assuming So, Are You Ready? December 7, 2009. Robert Grossman, Laboratory for Advanced Computing, University of Illinois at Chicago.

Part 1. Biology as a Data Intensive Science. Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).

Growth of Genomic Data [Figure: growth of GenBank (scale labels 10^5, 10^8, 10^10) against a timeline (1977, 1995, 2001, 2003, 2005) marking Sanger sequencing, microarray technology, 454 and Solexa sequencing, HGP, and ENCODE.]

Growth of Genomic Data [Figure: the same GenBank growth timeline, extended with the stages sequence species, sequence environment, and sequence individuals, and with the cloud milestones GFS, Hadoop, and AWS (years shown: 2003, 2006, 2008).]

The Challenge is to Support Cubes of High Throughput Sequence Data. Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set. Cube dimensions: different developmental stages, different conditions, perturbations of the environment.

We Have a Problem … More and more of your colleagues produce so much data that they cannot easily manage and analyze it. Large projects build their own infrastructure. Everyone else is on their own.

[Figure: timeline of scientific paradigms. Labels: 1609 (30x), 1670 (250x), 1976 (10x-100x), 2003 (10x-100x); experimental science, simulation science, data science.]

Point of View. To answer today’s biological questions: data, analytic algorithms & statistical models, analytic infrastructure.

What is a Cloud? Software as a Service

Is Anything Else a Cloud? Infrastructure as a Service, based upon scaling Virtual Machines (VMs)

Are There Other Types of Clouds? Web search & ad targeting. Large Data Cloud Services.

Idea Dates Back to the 1960s. [Diagram: applications running on CMS, CMS, and MVS guests, hosted by IBM VM/370 on an IBM mainframe.] Native (full) virtualization. Examples: VMware ESX. Virtualization was first widely deployed with IBM VM/370.

What Do You Optimize? Goal: Minimize latency and control heat. Goal: Maximize data (with matching compute) and control cost.

Elastic, Usage Based Pricing Is New. 120 computers in three racks for 1 hour costs the same as 1 computer in a rack for 120 hours.
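The pricing comparison on this slide is simple machine-hour arithmetic. The sketch below is only an illustration: the function name and the hourly rate are assumptions, not figures from the talk.

```python
def usage_cost(machines: int, hours: int, rate_per_machine_hour: float) -> float:
    """Cost under usage-based pricing: you pay for machine-hours consumed."""
    return machines * hours * rate_per_machine_hour


RATE = 0.10  # hypothetical price per machine-hour; not taken from the slides

# 1 computer in a rack for 120 hours
one_machine_long = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)

# 120 computers in three racks for 1 hour
many_machines_short = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)

# Both runs consume 120 machine-hours, so they cost the same.
print(one_machine_long == many_machines_short)
```

Under this model the only thing that changes between the two runs is the wall-clock time to a result, which is why elastic pricing favors wide, short jobs.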

Clouds can be used to manage surges in computing.

Case Study 1: Cistrack Large Data Cloud www.cistrack.org

Cistrack: a resource for cis-regulatory data. It is open source and based upon CUBioS. Currently used by the White Lab at University of Chicago for managing ModENCODE fly data. Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affy and Solexa platforms.

CUBioS [Diagram layers: Applications; Front Ends; CUBioS; pipelines (Bowtie, TopHat, R pipelines, etc.); Ingestion of RNA-seq, ChIP-seq, DNA capture, etc.] Cistrack is an instance of CUBioS.

Chromatin Developmental Time-Course
H3K4me1: enhancers
H3K4me3: promoters & enhancers
H3K9Ac: activation
H3K9me3: heterochromatin
H3K27Ac: activation
H3K27me3: repression
PolII: transcript. & promoters
CBP: HAT, enhancers
Total RNA: expression
X 12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre). 8 antibodies used for ChIP experiments (Zirong Li and Bing Ren).

Cistrack Supports Multi-Dim. Cubes… Drosophila regulatory elements from Drosophila modENCODE. ChIP-chip data using Agilent 244K dual-color arrays. Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP. Each factor has been studied for 12 different time-points of Drosophila development.
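One way to picture such a cube is as a mapping from (factor, time point) to the data sets stored in that cell. This is a minimal sketch and not Cistrack's actual schema: the function names, the time-point numbering, and the data set identifier are all hypothetical.

```python
from itertools import product

# The eight factors named on the slide.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]
# 12 developmental time points; numbering them 1..12 is an assumption.
TIME_POINTS = list(range(1, 13))

# Each cell holds references to the data sets for one (factor, time point).
cube = {cell: [] for cell in product(FACTORS, TIME_POINTS)}


def add_dataset(factor, time_point, dataset_id):
    """Attach a data set reference to one cell of the cube."""
    cube[(factor, time_point)].append(dataset_id)


def slice_by_factor(factor):
    """Return the developmental time course for a single factor."""
    return {t: cube[(factor, t)] for t in TIME_POINTS}


add_dataset("H3K4me3", 3, "chip-chip-example-001")  # hypothetical identifier
print(len(cube))  # 8 factors x 12 time points
```

Further dimensions (conditions, perturbations) would simply extend the key tuple.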

… Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa. Cistrack integrates with large data clouds. Cistrack uses the Sector/Sphere large data cloud.

Hadoop vs Sector. Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.

[Diagram components: Cistrack Web Portal & Widgets; Cistrack Elastic Cloud Services; Cistrack Database; Analysis Pipelines & Re-analysis Services; Cistrack Large Data Cloud Services; Ingestion Services.]

Case Study 2: Combinatorial Analysis of Marks

Active Gene: Method. Gene activeness: label a transcript t as XYZ, where
X = 1 if an H3K4Me3 binds in [-1800, min(2200, TranscriptLength)],
Y = 1 if a Pol II binds in [-1800, min(2200, TranscriptLength)],
Z = 1 if at least one exon is ≥30% covered by RNA, and in total ≥10% is covered by RNA.
[Figure panels: K4Me3 to TSS distance; Pol II to TSS distance.] Source: Jia Chen et al. (ModENCODE)
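The XYZ labeling rule can be written out as a short function. This is a sketch under stated assumptions rather than the ModENCODE pipeline itself: the `Transcript` fields are hypothetical, binding positions are assumed to be offsets in bp relative to the TSS, and "in total" is read as coverage over the whole transcript.

```python
from dataclasses import dataclass
from typing import List

UPSTREAM = -1800       # window start relative to the TSS (from the slide)
DOWNSTREAM_CAP = 2200  # window end is capped at this offset (from the slide)


@dataclass
class Transcript:
    length: int                     # transcript length in bp
    h3k4me3_offsets: List[int]      # H3K4me3 binding positions relative to TSS
    pol2_offsets: List[int]         # Pol II binding positions relative to TSS
    exon_rna_coverage: List[float]  # fraction of each exon covered by RNA
    total_rna_coverage: float       # assumed: fraction of the transcript covered


def binds_in_window(offsets: List[int], length: int) -> bool:
    """True if any binding site falls in [-1800, min(2200, length)]."""
    upper = min(DOWNSTREAM_CAP, length)
    return any(UPSTREAM <= offset <= upper for offset in offsets)


def activeness_label(t: Transcript) -> str:
    """Return the three-character XYZ label for one transcript."""
    x = int(binds_in_window(t.h3k4me3_offsets, t.length))
    y = int(binds_in_window(t.pol2_offsets, t.length))
    z = int(any(c >= 0.30 for c in t.exon_rna_coverage)
            and t.total_rna_coverage >= 0.10)
    return f"{x}{y}{z}"


example = Transcript(length=1000, h3k4me3_offsets=[-500], pol2_offsets=[1500],
                     exon_rna_coverage=[0.5, 0.0], total_rna_coverage=0.2)
print(activeness_label(example))  # Pol II at +1500 lies beyond min(2200, 1000)
```

In the example the window closes at +1000 because the transcript is shorter than 2200 bp, so the Pol II site is excluded and Y is 0.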

Promoters: Use H3K4me3, PolII & RNA to Map Active Genes. Source: Jia Chen et al. (ModENCODE)

Active Genes (cont’d) [Figure, panels A, B, C: sets labeled PolII, H3K4me3, and RNA with counts 1418, 332, 753, 6104, 680, 482, 1350; two plots with axis "bp from TSS".] Source: Jia Chen et al. (ModENCODE)

Interesting Combinatorial Combination of Marks. [Diagram: probes along the genome … by marks.] Item-sets are formed by sliding a window along the genome. The Apriori algorithm generates interesting itemsets. Post-processing retains itemsets of biological relevance.
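The first two steps on this slide can be sketched as follows: windows over consecutive probes become transactions, and a basic Apriori pass keeps the mark combinations that occur in enough windows. The window width, support threshold, and toy probe data are assumptions for illustration, and the biological post-processing step is not modeled.

```python
from itertools import combinations


def window_itemsets(marks_per_probe, width):
    """Union the marks seen in each window of `width` consecutive probes."""
    return [
        frozenset().union(*marks_per_probe[i:i + width])
        for i in range(len(marks_per_probe) - width + 1)
    ]


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support windows."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    singles = {frozenset([m]) for t in transactions for m in t}
    frequent = {c: n for c, n in count(singles).items() if n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(candidates).items()
                    if n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy data: the set of marks detected at each of five consecutive probes.
probes = [{"H3K4me3"}, {"H3K4me3", "PolII"}, {"PolII"}, set(), {"H3K27me3"}]
itemsets = apriori(window_itemsets(probes, width=2), min_support=2)
print(itemsets[frozenset({"H3K4me3", "PolII"})])
```

With a width of 2 the toy data yields four windows; the pair H3K4me3 and PolII appears in two of them and survives, while H3K27me3 appears in only one and is dropped.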

Case Study 3: Cistrack Elastic Cloud

Cistrack Elastic Cloud A rack in Cistrack’s Elastic Cloud contains 128 Cores and 128 TB. Multiple racks form a data center. Virtual machines can run pipelines. Virtual machines have access to large data services. No need to move large datasets in and out of Amazon public cloud.

Use VMs to Support Reanalysis. [Diagram: a cloud hosting several VMs.] At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis, persist the results to a database, and persist the VM to a cloud.

Comparing Peak Calling Algorithms for ModENCODE We’re using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline. Also running the worm peak calling pipeline on the fly data.

Case Study 4: Ensembles of Trees on Clouds. [Diagram: data feeding 100 tree models; 10,000??? tree models.] Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.

Ensembles of Trees for Clouds
Top-k ensembles: Each node builds a single random tree with local data. Central node picks k best random trees to predict. Lower cost with corresponding lower accuracy. Shuffling data can improve accuracy.
Skeleton ensembles: Central node builds k skeletons of random trees. Each local node fills in the skeletons. Central node merges all trees from local nodes. Greater cost, but more accurate.
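The top-k scheme can be sketched in a few lines. This is a simplified stand-in, not the method from the cited paper: the random tree construction, the use of held-out validation accuracy to rank trees, the majority vote, and the synthetic data are all assumptions made for illustration.

```python
import random
from collections import Counter


def build_random_tree(rows, depth, rng):
    """rows: non-empty list of (features, label). Splits are chosen at random."""
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority                      # leaf: majority label
    f = rng.randrange(len(rows[0][0]))       # random feature
    t = rng.choice(rows)[0][f]               # random threshold from the data
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    if not right:                            # split did not separate anything
        return majority
    return (f, t, build_random_tree(left, depth - 1, rng),
            build_random_tree(right, depth - 1, rng))


def predict(tree, x):
    while isinstance(tree, tuple):           # internal nodes are tuples
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree


def accuracy(tree, rows):
    return sum(predict(tree, x) == y for x, y in rows) / len(rows)


def top_k_ensemble(partitions, validation, k, depth=3, seed=0):
    rng = random.Random(seed)
    # Each node builds a single random tree from its local partition.
    trees = [build_random_tree(p, depth, rng) for p in partitions]
    # The central node keeps the k trees that score best on held-out data.
    trees.sort(key=lambda tree: accuracy(tree, validation), reverse=True)
    return trees[:k]


def ensemble_predict(trees, x):
    """Majority vote over the selected trees."""
    return Counter(predict(tree, x) for tree in trees).most_common(1)[0][0]


# Synthetic two-feature data with integer labels, split across five "nodes".
data_rng = random.Random(1)
points = [(data_rng.random(), data_rng.random()) for _ in range(250)]
rows = [(p, int(p[0] > 0.5)) for p in points]
validation, training = rows[:50], rows[50:]
partitions = [training[i::5] for i in range(5)]
trees = top_k_ensemble(partitions, validation, k=3)
print(len(trees), ensemble_predict(trees, (0.9, 0.1)))
```

Shuffling, as described on the slide, would correspond to exchanging a fraction of rows between partitions before the trees are built; it is left out here to keep the sketch short.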

Experimental Studies. Performed experimental studies on 4 racks (104 nodes) of the Open Cloud Testbed. Standard ensemble based models are more expensive than the proposed approaches and can overfit. Skeleton ensembles are more accurate but more expensive to build. Shuffling improves the accuracy of the top-k algorithm. For the KDDCup99 dataset, top-k ensembles with shuffling of 0.1% of the data match the accuracy of the skeleton method. For the UCI Census income dataset, a 20% shuffle is required, which is more expensive than top-k ensemble. Without knowledge of the uniformity of the dataset, we recommend skeleton ensembles.

[Figure panels: KDDCup99 dataset; Census income dataset.]

Part 5. Open Cloud Consortium Biocloud

Open Cloud Testbed. Networks: C-Wave, CENIC, Dragon. Phase 2: 9 racks, 250+ nodes, 1000+ cores, 10+ Gb/s.

Open Science Data Cloud: sky cloud, biocloud, additional projects in planning…
