Is Systems Biology Becoming a Data Intensive Science? Assuming So, Are You Ready? December 7, 2009. Robert Grossman, Laboratory for Advanced Computing, University of Illinois at Chicago
Part 1: Biology as a Data Intensive Science. Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
Growth of Genomic Data [Timeline figure. Labels: Sanger sequencing, microarray technology, 454 and Solexa sequencing; HGP, ENCODE; years 1977, 1995, 2001, 2003, 2005; GenBank scale marks 10^5, 10^8, 10^10.]
Growth of Genomic Data [Same figure with added labels: sequence species, sequence environment, sequence individuals; GFS, Hadoop, AWS; years 2003, 2006, 2008.]
The Challenge is to Support Cubes of High Throughput Sequence Data. Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, or other data set. Cube axes: different developmental stages; different conditions; perturb the environment.
We Have a Problem … More and more of your colleagues produce so much data that they cannot easily manage and analyze it. Large projects build their own infrastructure. Everyone else is on their own.
[Timeline figure. Labels: 1609 (30x), 1670 (250x), 1976 (10x-100x), 2003 (10x-100x); experimental science, simulation science, data science.]
Point of View. To answer today’s biological questions: data; analytic algorithms & statistical models; analytic infrastructure.
Part 2: What is a Cloud?
What is a Cloud? Software as a Service.
Is Anything Else a Cloud? Infrastructure as a Service, based upon scaling virtual machines (VMs).
Are There Other Types of Clouds? Web search & ad targeting. Large Data Cloud Services.
What is Virtualization?
Idea Dates Back to the 1960s. [Diagram: apps on guest operating systems (CMS, CMS, MVS) running on IBM VM/370 on an IBM mainframe.] Native (full) virtualization. Examples: VMware ESX. Virtualization was first widely deployed with IBM VM/370.
What Do You Optimize? Goal: Minimize latency and control heat. Goal: Maximize data (with matching compute) and control cost.
Scale Is New
Elastic, Usage Based Pricing Is New. 120 computers in three racks for 1 hour costs the same as 1 computer in a rack for 120 hours.
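The pricing claim on this slide is simple machine-hour arithmetic. The sketch below is illustrative only: the function name and the rate of 10 cents per machine-hour are hypothetical and do not come from the talk.

```python
def usage_cost_cents(machines: int, hours: int, cents_per_machine_hour: int) -> int:
    """Cost under usage-based pricing: you pay only for machine-hours consumed."""
    return machines * hours * cents_per_machine_hour


# Hypothetical rate, chosen only to make the arithmetic concrete.
RATE_CENTS = 10

wide_and_short = usage_cost_cents(machines=120, hours=1, cents_per_machine_hour=RATE_CENTS)
narrow_and_long = usage_cost_cents(machines=1, hours=120, cents_per_machine_hour=RATE_CENTS)

# Both runs consume 120 machine-hours, so they cost the same,
# but the first finishes in 1 hour instead of 120.
print(wide_and_short, narrow_and_long)
```

Under this model, cost depends only on total machine-hours, so an analysis that parallelizes well can finish sooner at no extra charge.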
Clouds can be used to manage surges in computing.
Clouds vs Grids
Part 3: Case Studies
Case Study 1: Cistrack Large Data Cloud (www.cistrack.org)
Cistrack: a resource for cis-regulatory data. It is open source and based upon CUBioS. Currently used by the White Lab at the University of Chicago for managing ModENCODE fly data. Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affy and Solexa platforms.
CUBioS. [Architecture diagram: applications and front ends; CUBioS pipelines (Bowtie, TopHat, R pipelines, etc.); ingestion of RNA-seq, ChIP-seq, DNA capture, etc.] Cistrack is an instance of CUBioS.
Chromatin Developmental Time-Course
H3K4me1: enhancers
H3K4me3: promoters & enhancers
H3K9Ac: activation
H3K9me3: heterochromatin
H3K27Ac: activation
H3K27me3: repression
PolII: transcript. & promoters
CBP: HAT, enhancers
Total RNA: expression
× 12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre). 8 antibodies used for ChIP experiments (Zirong Li and Bing Ren).
Cistrack Supports Multi-Dim. Cubes… Drosophila regulatory elements from Drosophila modENCODE. ChIP-chip data using Agilent 244K dual-color arrays. Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.  Each factor has been studied for 12 different time-points of Drosophila development.
… Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa. Cistrack integrates with large data clouds. Cistrack uses the Sector/Sphere large data cloud.
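The cube described on the two slides above (8 factors by 12 time points, with each cell holding one or more data sets) can be pictured as a small keyed structure. This is a minimal sketch and not Cistrack's actual data model: the function names, the integer time-point labels 1..12, and the dataset identifiers are all placeholders.

```python
from itertools import product

# The eight factors named on the slide: six histone modifications, PolII and CBP.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]
# Twelve developmental time points; the labels 1..12 are placeholders.
TIME_POINTS = list(range(1, 13))


def empty_cube():
    """One cell per (factor, time point); each cell holds a list of dataset IDs."""
    return {(f, t): [] for f, t in product(FACTORS, TIME_POINTS)}


def add_dataset(cube, factor, time_point, dataset_id):
    """Attach a dataset (e.g. one ChIP-seq run) to a cell of the cube."""
    cube[(factor, time_point)].append(dataset_id)


def slice_by_factor(cube, factor):
    """All cells for one factor, keyed by time point."""
    return {t: ids for (f, t), ids in cube.items() if f == factor}


cube = empty_cube()
# Hypothetical IDs: three ChIP-seq data sets in a single cell.
for run in ("run-a", "run-b", "run-c"):
    add_dataset(cube, "H3K4me3", 1, run)
print(len(cube), slice_by_factor(cube, "H3K4me3")[1])
```

Adding another axis (conditions, perturbations) would extend the key tuple rather than change the structure.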
Hadoop vs Sector. Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.
Cistrack architecture: Cistrack Web Portal & Widgets; Cistrack Elastic Cloud Services; Cistrack Database; Analysis Pipelines & Re-analysis Services; Cistrack Large Data Cloud Services; Ingestion Services.
Case Study 2: Combinatorial Analysis of Marks
Active Gene - Method
Gene activeness: label a transcript t as XYZ, where
X=1 if an H3K4me3 binds in [-1800, min(2200, TranscriptLength)]
Y=1 if a Pol II binds in [-1800, min(2200, TranscriptLength)]
Z=1 if at least one exon has ≥30% covered by RNA, and in total ≥10% covered by RNA.
[Plots: K4Me3 to TSS distance; Pol II to TSS distance.]
Source: Jia Chen et al. (ModENCODE)
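The XYZ labeling rule above can be sketched as a small function. This is an illustration under stated assumptions, not the ModENCODE pipeline: it assumes peak positions are given as base-pair offsets from the TSS (already oriented by strand), that "binds in" means the offset falls inside the window, and that coverage is supplied as fractions between 0 and 1. All names are hypothetical.

```python
def activity_label(h3k4me3_offsets, pol2_offsets, exon_coverages,
                   total_coverage, transcript_length):
    """Return the three-character XYZ label for one transcript.

    h3k4me3_offsets, pol2_offsets: peak positions in bp relative to the TSS
        (assumption: strand-corrected, negative means upstream).
    exon_coverages: fraction of each exon covered by RNA.
    total_coverage: fraction of the whole transcript covered by RNA.
    """
    lo, hi = -1800, min(2200, transcript_length)
    x = int(any(lo <= p <= hi for p in h3k4me3_offsets))
    y = int(any(lo <= p <= hi for p in pol2_offsets))
    z = int(any(c >= 0.30 for c in exon_coverages) and total_coverage >= 0.10)
    return f"{x}{y}{z}"


# H3K4me3 peak near the TSS, Pol II peak outside the window, RNA present.
print(activity_label([-500], [3000], [0.5, 0.0], 0.2, 5000))  # prints 101
```

For transcripts shorter than 2200 bp the window is clipped at the transcript length, so a peak at +2100 does not count for a 1000 bp transcript.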
Promoters: Use H3K4me3, PolII & RNA to Map Active Genes. Source: Jia Chen et al. (ModENCODE)
Active Genes (cont’d). [Figure panels A, B, C: Venn diagram of PolII, H3K4me3 and RNA with region counts 1418, 332, 753, 6104, 680, 482, 1350; two plots with x-axis "bp from TSS".] Source: Jia Chen et al. (ModENCODE)
Interesting Combinations of Marks. [Diagram: marks plotted against probes along the genome.] Item-sets are formed by sliding a moving window along the genome. The Apriori algorithm generates interesting itemsets. Post-processing retains itemsets of biological relevance.
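The window-then-Apriori idea above can be sketched in a few lines. This is a textbook Apriori over windowed item-sets, not the code used in the study: the window width, the support threshold, and the toy probe data are assumptions, and "interesting" is reduced here to "frequent".

```python
from itertools import combinations


def window_itemsets(marks_per_probe, width):
    """Slide a window of `width` probes along the genome and take the
    union of the marks seen in each window as one item-set."""
    return [frozenset().union(*marks_per_probe[i:i + width])
            for i in range(len(marks_per_probe) - width + 1)]


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent, k = {}, 1
    while candidates:
        # Count support of each candidate and keep the frequent ones.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step: merge frequent k-sets into (k+1)-candidates.
        # Prune step: every k-subset of a candidate must itself be frequent.
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                    frozenset(s) in level for s in combinations(union, k - 1)):
                candidates.add(union)
    return frequent


# Toy data: the marks observed at four consecutive probes (hypothetical).
probes = [{"H3K4me3", "PolII"}, {"PolII"}, {"H3K27me3"}, {"H3K4me3", "PolII"}]
windows = window_itemsets(probes, width=2)
for itemset, support in apriori(windows, min_support=0.6).items():
    print(sorted(itemset), round(support, 2))
```

The post-processing step mentioned on the slide (retaining biologically relevant itemsets) would be a filter applied to the returned dictionary; it is not sketched here because the talk does not say how relevance is judged.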
Case Study 3: Cistrack Elastic Cloud
Cistrack Elastic Cloud. A rack in Cistrack’s Elastic Cloud contains 128 cores and 128 TB. Multiple racks form a data center. Virtual machines can run pipelines. Virtual machines have access to large data services. No need to move large datasets in and out of the Amazon public cloud.
Use VMs to Support Reanalysis. [Diagram labels: Replace, Cloud, VM, VM, VM.] At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis and persist the results to a database and the VM to a cloud.
Comparing Peak Calling Algorithms for ModENCODE We’re using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline. Also running the worm peak calling pipeline on the fly data.
Case Study 4: Ensembles of Trees on Clouds. [Diagram labels: data; 100 tree models; 10,000??? tree models.] Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.
Ensembles of Trees for Clouds
Top-k ensembles: Each node builds a single random tree with local data. The central node picks the k best random trees to predict. Lower cost with correspondingly lower accuracy. Shuffling data can improve accuracy.
Skeleton ensembles: The central node builds k skeletons of random trees. Each local node fills in the skeletons. The central node merges all trees from local nodes. Greater cost, but more accurate.
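The top-k scheme described above can be sketched as follows. This is a rough single-process illustration, not the implementation from the cited paper: each "node" is just a data partition, the random tree picks its split feature and threshold at random, and the central node is assumed to rank trees by accuracy on a held-out validation set (the slide does not say how "best" is measured). Class labels are assumed to be simple values such as integers.

```python
import random
from collections import Counter


def majority(labels, default=0):
    """Most common label, or `default` for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def build_random_tree(rows, labels, depth, rng):
    """Random tree: split feature and threshold are chosen at random;
    only the leaf labels are fitted to the (local) data."""
    if depth == 0 or len(set(labels)) <= 1:
        return majority(labels)
    f = rng.randrange(len(rows[0]))
    t = rng.choice(rows)[f]
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    if not left or not right:
        return majority(labels)
    return (f, t,
            build_random_tree([rows[i] for i in left],
                              [labels[i] for i in left], depth - 1, rng),
            build_random_tree([rows[i] for i in right],
                              [labels[i] for i in right], depth - 1, rng))


def predict_tree(tree, row):
    """Walk internal nodes (tuples) until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def accuracy(tree, rows, labels):
    return sum(predict_tree(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


def top_k_ensemble(partitions, validation, k, depth=4, seed=0):
    """Each partition stands in for one node's local data. The 'central node'
    keeps the k trees with the best validation accuracy (an assumption)."""
    rng = random.Random(seed)
    trees = [build_random_tree(rows, labels, depth, rng)
             for rows, labels in partitions]
    v_rows, v_labels = validation
    trees.sort(key=lambda tr: accuracy(tr, v_rows, v_labels), reverse=True)
    return trees[:k]


def predict_ensemble(trees, row):
    """Majority vote over the selected trees."""
    return majority([predict_tree(tr, row) for tr in trees])
```

A call such as `top_k_ensemble(partitions, validation, k=3)` returns the three selected trees, and `predict_ensemble` votes among them. Shuffling, as mentioned on the slide, would correspond to exchanging a fraction of rows between partitions before the trees are built; the skeleton variant is not sketched here.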
Experimental Studies Performed experimental studies on 4 racks (104 nodes) of Open Cloud Testbed. Standard ensemble based models are more expensive than proposed approaches and can overfit. Skeleton ensembles are more accurate but more expensive to build. Shuffling improves accuracy of top-k algorithm. For KDDCup99 dataset top-k ensembles with shuffling 0.1% of data matches accuracy of skeleton method. For UCI Census income dataset, 20% shuffle required, which is more expensive than top-k ensemble. Without knowledge of uniformity of dataset, recommend skeleton ensembles.
[Figure panels: KDDCup99 dataset; Census income dataset.]
Part 5: Open Cloud Consortium Biocloud
Open Cloud Testbed (Phase 2): 9 racks, 250+ nodes, 1000+ cores, 10+ Gb/s. Networks: C-Wave, CENIC, Dragon, MREN. Software: Sector/Sphere, Thrift, KVM VMs, Eucalyptus VMs.
Open Science Data Cloud: sky cloud, biocloud, additional projects in planning…

The Transformation of Systems Biology Into A Large Data Science

  • 49. OCC Condominium Clouds. In a condominium cloud, you buy your own rack or bunch of racks. The racks are managed and operated by the condominium association, in this case the OCC. If your rack is 120 TB, you get the rights to approx. 40 TB of storage in the cloud. The rest is a shared resource.
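The condominium split can be written out as arithmetic. The slide gives only one data point (120 TB bought, about 40 TB reserved), so treating the one-third share as a fixed ratio for other rack sizes is an assumption made here for illustration; the function name is hypothetical.

```python
def condo_allocation(rack_tb, owner_fraction=40 / 120):
    """Split a contributed rack into the owner's reserved storage and the
    shared pool. The default ratio is inferred from the single example on
    the slide (120 TB -> approx. 40 TB) and may not hold in general."""
    owned = rack_tb * owner_fraction
    return owned, rack_tb - owned


owned, shared = condo_allocation(120)
print(round(owned), round(shared))  # 40 TB reserved, 80 TB shared
```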
  • 51. To Get Involved The Cistrack resource for transcriptional data: www.cistrack.org Sector/Sphere cloud: sector.sourceforge.net
  • 52. Thank You For more information: blog.rgrossman.com or www.rgrossman.com