Bioinformatics Data Pipelines built by CSIRO on AWS


Lynn Langit

Bioinformatics (genomics) data pipelines built by CSIRO Australia on the AWS Cloud


Cancer Genomics
Data Pipelines
Lynn & Samantha Langit
CSIRO Bioinformatics / Australia
June 2017 - Oslo

3 Billion data points per patient DNA sample
Up to 25% of the population could be sequenced by 2025

Two Perspectives
Bioinformatics
Research
• Insight
• Reproducibility
Cloud
Architecture
• Speed
• Low Cost
• Simplicity

Cloud Data Pipeline Pattern
Problem
• Define business problem
Data
• Quality
• Quantity
Candidate Technologies
• Ingest
• ETL
• Biz Analytics
• ML
• Visualization
Build MVPs
• Iterate
• Learn
Assemble Pipeline
• Validate each section
• Test at scale

Genomic Sequencing Results
CRISPR-Cas9 is a molecular engineering technology that enables researchers to edit genomes accurately.
• Pattern-matching unique sequences of DNA
• Huge demand for large-scale computation
• Time-critical dimension to compute
• NIH-approved for human health
• Could revolutionize cancer treatments
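The pattern-matching step mentioned above can be illustrated with a minimal sketch. It assumes the commonly used Cas9 convention of a 20-nucleotide target followed by an NGG PAM on the forward strand; the function name `find_cas9_targets` and the `guide_len` parameter are hypothetical, and the sketch does not score uniqueness or off-target risk, so it should not be read as how GT-Scan2 works.

```python
import re


def find_cas9_targets(dna, guide_len=20):
    """Return a list of (start, protospacer, pam) for forward-strand sites.

    Assumption: a candidate site is `guide_len` bases followed by an NGG PAM.
    """
    dna = dna.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]


if __name__ == "__main__":
    example = "A" * 20 + "TGG" + "CCCC"
    for start, protospacer, pam in find_cas9_targets(example):
        print(start, protospacer, pam)
```

Scanning a whole genome this way is a simple linear pass per sequence, which is one reason the workload splits well across many small, independent function invocations.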

Serverless Lambda Architecture Pattern
[Diagram: Lambda function 1, Lambda function 2, Lambda function 3; buckets with objects; DynamoDB; API Gateway; Users]
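As a rough illustration of how one of the components listed above might fit together, here is a sketch of a single Lambda function behind API Gateway that looks up a job result. The `lambda_handler(event, context)` signature and the `statusCode`/`body` response shape follow the standard Lambda proxy integration; the `job_id` parameter, the record fields, and the in-memory `FAKE_TABLE` (standing in for a DynamoDB table) are assumptions made for this example, not details from the talk.

```python
import json

# Hypothetical in-memory stand-in for a DynamoDB table of job results.
FAKE_TABLE = {"job-1": {"status": "done", "targets": 3}}


def _response(status, payload):
    """Build an API Gateway proxy-style response with a JSON body."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def lambda_handler(event, context):
    """Return the stored record for the requested job, or an error."""
    # queryStringParameters can be None when no query string is sent.
    params = event.get("queryStringParameters") or {}
    job_id = params.get("job_id")
    if job_id is None:
        return _response(400, {"error": "job_id is required"})
    item = FAKE_TABLE.get(job_id)
    if item is None:
        return _response(404, {"error": "unknown job"})
    return _response(200, item)
```

In a real deployment the dictionary lookup would be replaced by a call to the data store, and each function in the chain would stay small enough to do one step of the pipeline.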

CSIRO: Commonwealth Scientific and Industrial Research Organisation

Scale Genomic Analysis
GWAS = genome-wide association studies
• Analysis on large cohort data or imputed SNP array data
• Clustering on genomic profiles to stratify large-cohort genomic data
• Viewing datasets with millions of features

What is CSIRO’s solution?
• For scale at reasonable cost: use Apache Hadoop
• For scale at speed: use Apache Spark for Hadoop
• For usability in bioinformatics: create a domain-specific API (OSS library)
• For global use: leverage Cloud Pipeline Patterns

GWAS Analysis with Variant-Spark
[Diagram: Genomics Analysts; on-premises Hadoop cluster with Apache Spark; corporate data center]

80% faster than ADAM
90% faster than R
90% faster than Python

VariantSpark
Uses Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently
• Analyzes 3,000 samples with 80 million features in < 30 minutes
• Enables real-time diagnosis by finding similar patients
• Contributes to motor neuron disease (ALS) research in Australia
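To make the idea of ranking features with an ensemble of randomized trees more concrete, here is a small single-machine sketch. It deliberately simplifies: each "tree" is a depth-1 decision stump trained on a bootstrap sample and a random subset of features, and importance is the summed Gini impurity decrease. This is not VariantSpark's algorithm or API; the function names, the 0/1/2 genotype coding, and all parameters are assumptions chosen for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_stump(rows, labels, features):
    """Return (feature, threshold, impurity_decrease) of the best single split."""
    parent = gini(labels)
    n = len(labels)
    best = (None, None, 0.0)
    for f in features:
        for t in (0, 1):  # genotypes assumed coded 0, 1, 2; split on value <= t
            left = [y for x, y in zip(rows, labels) if x[f] <= t]
            right = [y for x, y in zip(rows, labels) if x[f] > t]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best[2]:
                best = (f, t, gain)
    return best


def stump_forest_importance(rows, labels, n_trees=200, mtry=None, seed=0):
    """Sum impurity decreases per feature over many randomized stumps."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    mtry = mtry or max(1, int(n_features ** 0.5))
    importance = [0.0] * n_features
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feats = rng.sample(range(n_features), mtry)     # random feature subset
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        f, _t, gain = best_stump(sample_rows, sample_labels, feats)
        if f is not None:
            importance[f] += gain
    return importance
```

Because every stump is built independently from its own sample, the loop over trees is the part that a framework such as Apache Spark can distribute across a cluster.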

Data Prep • Statistics • Probabilistic Algorithms • Data Viz • Machine Learning…

Spark ML Classification Algorithms
• Wide Random Forest
• Ensemble of Decision Trees
• Logistic Regression
[Diagram labels: variant-spark, other libraries]

Is it…
• usable?
• performant?
• extendable? (clean code)
• using the best language (Scala)?
• using the ‘best version’ of Spark?
• using a version of wide random forests that is understandable?

How best to Deploy Cloud Hadoop?
• IaaS
  - EC2 instances with Apache Hadoop, Apache Spark, more…
• PaaS
  - Elastic MapReduce (EMR) Hadoop cluster
• SaaS
  - Vendor-managed, e.g. Databricks w/Jupyter Notebooks

Solving Important Questions… Cancer Genomics?

GWAS Analysis with Variant-Spark
[Diagram: Genomics Analysts; EC2 Hadoop cluster with Apache Spark in an Availability Zone; 1000 Genomes GWAS input; EC2 Hadoop instances; Spot EC2 Hadoop worker instances]

Cloud Data Pipeline Pattern
Stages: Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline
Analyze GWAS -> S3/Hadoop (SaaS)
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook (SQL, R or Python)

Cloud Data Pipeline Pattern
Stages: Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline
1. Scan vcf -> S3/DynamoDB (Serverless)
• Ingest: S3
• ETL: Lambda
• Analyze: Lambda
• Viz: Lambda/API Gateway
2. Analyze GWAS -> S3/Hadoop (SaaS)
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook (SQL, R or Python)

Modern Big Data Pipelines
• Problem #1 - Scan
• Solution: Serverless Cloud Pipeline
• Problem #2 - Analyze
• Solution: SaaS Cloud ML Pipeline

Cancer Genomics
Data Pipelines
Lynn & Samantha Langit
CSIRO Bioinformatics & variant-spark
June 2017 - Oslo


1. Cancer Genomics Data Pipelines Lynn & Samantha Langit CSIRO Bioinformatics / Australia June 2017 - Oslo

2. 3 Billion data points per patient DNA sample Up to 25% of the population could be sequenced by 2025

3. Two Perspectives Bioinformatics Research • Insight • Reproducibility Cloud Architecture • Speed • Low Cost • Simplicity

4. Cloud Data Pipeline Pattern Problem • Define business problem Data • Quality • Quantity Candidate Technologies • Ingest • ETL • Biz Analytics • ML • Visualization Build MVPs • Iterate • Learn Assemble Pipeline • Validate each section • Test at scale

6. Genomic Sequencing Results CRISPR-Cas9 is a molecular engineering technology that enables researchers to edit genomes accurately. • Pattern-matching unique sequences of DNA • Huge demand for large-scale computation • Time-critical dimension to compute • NIH-approved for human health • Could revolutionize cancer treatments

7. Serverless Lambda Architecture Pattern Lambda function 1 Lambda function 2 Lambda function 3 buckets with objects DynamoDB API Gateway Users

8. CSIRO: Commonwealth Scientific and Industrial Research Organisation

9. GT-Scan2 Demo GT-Scan2

11. Scale Genomic Analysis GWAS = genome-wide association studies • Analysis on large cohort data or imputed SNP array data • Clustering on genomic profiles to stratify large-cohort genomic data • Viewing datasets with millions of features

12. Cloud Data Pipeline Pattern Problem • Define business problem Data • Quality • Quantity Candidate Technologies • Ingest • ETL • Biz Analytics • ML • Visualization Build MVPs • Iterate • Learn Assemble Pipeline • Validate each section • Test at scale

13. Genomics (ML) Pipeline Pattern

14. What is CSIRO’s solution? For Scale at reasonable cost Use Apache Hadoop For Scale at speed Use Apache Spark for Hadoop For Usability in bioinformatics Create a domain-specific API (OSS library) For global use Leverage Cloud Pipeline Patterns

15. GWAS Analysis with Variant-Spark On premise Hadoop Cluster with Apache Spark Genomics Analysts corporate data center

16. What is Apache Spark?

17. What is variant-spark? Demo

18. 80% faster than ADAM 90% faster than R 90% faster than Python

19. VariantSpark Uses Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently • Analyzes 3,000 samples with 80 million features in < 30 minutes • Enables real-time diagnosis by finding similar patients • Contributes to motor neuron disease (ALS) research in Australia

20. Data Prep Statistics Probabilistic Algorithms Data Viz Machine Learning…

21. Spark ML Classification Algorithms Wide Random Forest Ensemble of Decision Trees Logistic Regression variant-spark other libraries

22. OSS Library variant-spark for all

23. Is it… • usable? • performant? • extendable? (clean code) • using the best language (Scala)? • using the ‘best version’ of Spark? • using a version of wide random forests that is understandable?

24. How best to Deploy Cloud Hadoop? • IaaS: EC2 instances with Apache Hadoop, Apache Spark, more… • PaaS: Elastic MapReduce (EMR) Hadoop cluster • SaaS: Vendor-managed, e.g. Databricks w/Jupyter Notebooks

25. What is Databricks?

27. DEMO: Jupyter Notebooks

28. Variant-Spark and Databricks Demo

29. Solving Important Questions… Cancer Genomics?

30. DEMO: Who is a Hipster?

31. AWS EC2 Spot Instances

32. GWAS Analysis with Variant-Spark EC2 Hadoop Cluster with Apache Spark Genomics Analysts Availability Zone 1000 Genomes GWAS input Spot EC2 Hadoop worker instances EC2 Hadoop instances

33. Cloud Data Pipeline Pattern Problem Data Candidate Technologies Build MVPs Assemble Pipeline Analyze GWAS -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks DBFS Apache Spark Variant-Spark ML Notebook SQL, R or Python SaaS

35. Cloud Data Pipeline Pattern Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Scan vcf -> S3/DynamoDB Ingest ETL Analyze Viz S3 Lambda Lambda Lambda/API Gateway Serverless 2. Analyze GWAS -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks DBFS Apache Spark Variant-Spark ML Notebook SQL, R or Python SaaS

36. Modern Big Data Pipelines • Problem #1 - Scan • Solution: Serverless Cloud Pipeline • Problem #2 - Analyze • Solution: SaaS Cloud ML Pipeline

37. Cancer Genomics Data Pipelines Lynn & Samantha Langit CSIRO Bioinformatics & variant-spark June 2017 - Oslo

Editor's notes

http://www.nature.com/news/first-crispr-clinical-trial-gets-green-light-from-us-panel-1.20137
http://bioinformatics.csiro.au/ and https://www.csiro.au/en/Locations/NSW/North-Ryde
https://www.gt-scan.net/ and AMA with Dr. Bauer: https://www.reddit.com/r/science/comments/5fiicm/science_ama_series_im_denis_bauer_a_team_leader/
https://aws.amazon.com/blogs/aws/genome-engineering-applications-early-adopters-of-the-cloud/
https://github.com/csirobigdata/variant-spark
https://en.wikipedia.org/wiki/Random_forest and https://spark.apache.org/docs/1.6.2/ml-classification-regression.html
https://databricks.com/