4. Cloud Data Pipeline Pattern
Problem
• Define business problem
Data
• Quality
• Quantity
Candidate Technologies
• Ingest
• ETL
• Biz Analytics
• ML
• Visualization
Build MVPs
• Iterate
• Learn
Assemble Pipeline
• Validate each section
• Test at scale
6. Genomic Sequencing Results
CRISPR-Cas9 molecular engineering technology enables researchers to edit genomes accurately.
It…
Pattern-matching unique sequences of DNA
Huge demand for large-scale computation
Time-critical dimension to compute
NIH-approved for human health
Could revolutionize cancer treatments
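The "pattern-matching unique sequences of DNA" point can be illustrated with a minimal sketch. It assumes the commonly used SpCas9 convention of a 20-nucleotide target followed by an NGG PAM, scans the forward strand only, and treats "unique" as "occurs exactly once in the given sequence". The function names are hypothetical and this is not the algorithm of any particular tool.

```python
import re

# Assumption: SpCas9-style site = 20-nt protospacer followed by an NGG PAM.
# The lookahead lets overlapping candidate sites be reported as well.
TARGET = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")


def find_targets(dna):
    """Return (start, protospacer, pam) for each forward-strand candidate."""
    dna = dna.upper()
    return [(m.start(), m.group(1), m.group(2)) for m in TARGET.finditer(dna)]


def count_overlapping(dna, seq):
    """Count occurrences of seq in dna, including overlapping ones."""
    return len(re.findall("(?=" + re.escape(seq) + ")", dna))


def unique_targets(dna):
    """Keep candidates whose protospacer occurs exactly once in dna."""
    dna = dna.upper()
    return [hit for hit in find_targets(dna)
            if count_overlapping(dna, hit[1]) == 1]
```

Calling `unique_targets` on a whole genome this way would be far too slow, which is one way to see where the demand for large-scale computation comes from.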
11. Scale Genomic Analysis
GWAS = genome-wide association studies, here run on genome-wide sequencing data
Analysis on large cohort data or imputed SNP array data
Clustering on genomic profiles to stratify large-cohort genomic data
Viewing datasets with millions of features
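As a rough illustration of what a single association test in a GWAS involves, the sketch below computes a chi-square statistic on a 2x2 table of allele counts (cases vs. controls) and ranks variants by p-value. The function names and the input layout are assumptions for illustration. Real studies add quality control, covariates and multiple-testing correction.

```python
import math


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (statistic, p_value).
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the table carries no signal.
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def rank_variants(tables):
    """Sort {variant_id: (case_alt, case_ref, ctrl_alt, ctrl_ref)} by p-value."""
    scored = [(vid, *allelic_chi2(*counts)) for vid, counts in tables.items()]
    return sorted(scored, key=lambda item: item[2])
```

With millions of variants, this per-variant loop is exactly the part that benefits from being distributed across a cluster.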
14. What is CSIRO’s solution?
• For scale at reasonable cost: use Apache Hadoop
• For scale at speed: use Apache Spark for Hadoop
• For usability in bioinformatics: create a domain-specific API (OSS library)
• For global use: leverage Cloud Pipeline Patterns
15. GWAS Analysis with Variant-Spark
[Diagram: genomics analysts; on-premises Hadoop cluster with Apache Spark; corporate data center]
18. 80% faster than ADAM
90% faster than R
90% faster than Python
19. VariantSpark
Uses Apache Spark to massively parallelize the generation of
random forests to identify disease genes efficiently
Analyzes 3,000 samples with 80 million features in < 30 minutes
Enables real-time diagnosis by finding similar patients
Contributes to motor neuron disease (ALS) research in Australia
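To make the random-forest idea concrete, here is a toy, single-machine sketch in plain Python. It is not VariantSpark's implementation and does not use Spark. It assumes genotypes coded 0/1/2, grows depth-1 trees (stumps) on bootstrap samples with a random subset of features per tree, and counts how often each feature wins the split as a crude importance score. All names and parameters are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_feature(rows, labels, candidates):
    """Candidate feature whose zero-vs-nonzero split reduces impurity most."""
    parent, n = gini(labels), len(labels)
    best, best_gain = None, 0.0
    for j in candidates:
        left = [y for x, y in zip(rows, labels) if x[j] == 0]
        right = [y for x, y in zip(rows, labels) if x[j] != 0]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        if parent - child > best_gain:
            best, best_gain = j, parent - child
    return best


def stump_forest_votes(rows, labels, n_trees=100, mtry=10, seed=0):
    """Count, per feature, how many stumps chose it as their split."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    votes = [0] * n_features
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        candidates = rng.sample(range(n_features), mtry)  # random feature subset
        j = best_feature([rows[i] for i in idx],
                         [labels[i] for i in idx], candidates)
        if j is not None:
            votes[j] += 1
    return votes


def make_toy_cohort(n_samples=100, n_features=30, causal=7, seed=1):
    """Synthetic data: the label depends only on the `causal` feature."""
    rng = random.Random(seed)
    rows = [[rng.choice((0, 1, 2)) for _ in range(n_features)]
            for _ in range(n_samples)]
    labels = [1 if row[causal] > 0 else 0 for row in rows]
    return rows, labels
```

On the synthetic cohort, the feature that drives the label should collect the most votes. Each tree is independent of the others, which is what makes this kind of model a natural fit for parallel execution.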
23. Is it…
• usable?
• performant?
• extendable? (clean code)
• using the best language (Scala)?
• using the ‘best version’ of Spark?
• using a version of wide random forests that is understandable?
24. How best to Deploy Cloud Hadoop?
• IaaS
EC2 instances with Apache Hadoop, Apache Spark, more…
• PaaS
Amazon Elastic MapReduce (EMR) Hadoop cluster
• SaaS
Vendor-managed, e.g. Databricks with Jupyter-style notebooks
32. GWAS Analysis with Variant-Spark
[Diagram: genomics analysts; EC2 Hadoop cluster with Apache Spark; Availability Zone; 1000 Genomes GWAS input; EC2 Hadoop instances; Spot EC2 Hadoop worker instances]
33. Cloud Data Pipeline Pattern
Stages: Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline
Analyze GWAS -> S3/Hadoop (SaaS)
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook SQL, R or Python
35. Cloud Data Pipeline Pattern
Stages: Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline
1. Scan vcf -> S3/DynamoDB (Serverless)
• Ingest: S3
• ETL: Lambda
• Analyze: Lambda
• Viz: Lambda/API Gateway
2. Analyze GWAS -> S3/Hadoop (SaaS)
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook SQL, R or Python
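For the serverless "Scan vcf" pipeline, the ETL step might look roughly like the sketch below: a Lambda-style Python handler that parses VCF text into records. The event shape (`{"vcf_text": ...}`) and the function names are assumptions made for illustration. A real function would more likely read the object from S3 and write results to DynamoDB, which is omitted here.

```python
def parse_vcf_line(line):
    """Parse one VCF data line into a dict; return None for headers/blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        # VCF data lines have 8 fixed columns: CHROM..INFO.
        raise ValueError("expected at least 8 tab-separated columns")
    chrom, pos, variant_id, ref, alt = fields[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
    }


def handler(event, context=None):
    """Lambda-style entry point.

    Assumption: the event carries the file content under the hypothetical
    key "vcf_text" instead of an S3 bucket/key reference.
    """
    parsed = (parse_vcf_line(line) for line in event["vcf_text"].splitlines())
    variants = [v for v in parsed if v is not None]
    return {"count": len(variants), "variants": variants}
```

Keeping the parsing in a plain function makes the step easy to validate on its own before it is wired into the pipeline.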
36. Modern Big Data Pipelines
• Problem #1 - Scan
• Solution: Serverless Cloud Pipeline
• Problem #2 - Analyze
• Solution: SaaS Cloud ML Pipeline
http://bioinformatics.csiro.au/ and https://www.csiro.au/en/Locations/NSW/North-Ryde
https://www.gt-scan.net/ and AMA with Dr. Bauer: https://www.reddit.com/r/science/comments/5fiicm/science_ama_series_im_denis_bauer_a_team_leader/