SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
Analyzing 100’s of TB of
Genomic Data with ADAM
and Toil
Frank Austin Nothaft
@fnothaft, fnothaft@berkeley.edu
6/8/2016
Compulsory Moore’s Law Slide
The Sequencing Abstraction
● Sequencing is a Poisson substring sampling process
● For $1,000, we can sequence a 30x copy of your genome, but what is the
analysis cost?
The Alignment Process
The Sequence Re-assembly Process
End-to-end variant analysis
Broad Institute of MIT/Harvard
End-to-end variant analysis
Align and clean up
errors in raw reads
150 hrs/sample
Broad Institute of MIT/Harvard
End-to-end variant analysis
Generate diffs
between samples
1 week/100
samples
Broad Institute of MIT/Harvard
Genomics is built around flattened tools
Genomics is built around legacy file formats:
● E.g., SAM/BAM → alignment, VCF → variants,
BED/GTF/etc → features
● Manually curated text/binary flat files
These formats dictate:
● What accesses can be optimized (read a full row)
● What predicates can be evaluated (small number of
genomic loci)
● How we write genomic algorithms (sorted iterator over
genome)
Why avoid flat architectures?
Flat architectures tend to expose bad programming interfaces:
● What access patterns do our flat files lock us into?
● GATK: sorted iterator over the genome
What do flat architectures break?
1. Trivial: low level abstractions are not productive
2. Trivial: flat architectures create technical lock-in
3. Subtle: low level abstractions can introduce bugs
Green field genomics: start with a schema!
A schema provides a narrow waist
We can use our stack to accelerate common
queries!
● In genomics, we commonly have to find
observations that overlap in a coordinate
plane
● This coordinate plane is genomics specific,
and is known a priori
● We can use our knowledge of the coordinate
plane to implement a fast overlap join
ADAM/Spark Yields Horizontal Scalability
● 30–50x speedup over
traditional implementations
● Speedup extends to O
(200MB / node)
● 3x improvement in analysis
cost
This means we can support genomic EDA as well...
● Narrow waist in stack → can swap in/out stages
● If I want to run interactive queries against genomic loci, swap in an
RDD implementation that is optimized for point/range queries
Using ADAM+Spark+Toil
● On-going validation of ADAM on Toil against existing “best practice” tools
● Using Simons Genome Diversity Project:
○ 260 individuals from 127 ethnic groups: more than 100TB of genomic data
● At peak scale, we will be running on ~1,000 r3.8xlarge instances
Toil: Pipeline Architecture for Massive Workflows
● Work led by UCSC/BD2K Center
● Massively Scalable — tested on 32,000 cores
● Resumable after failure of any node
● Portable: installed with a single command
○ Amazon, OpenStack, Azure, HPC, and Google (soon)
● Simple: built entirely in Python
● Open-source → github.com/BD2KGenomics/toil
Develop workflows locally…
Deploy at scale without changing source code!
Toil is a DAG metascheduler
● Users decompose a workflow into jobs and data:
○ Job → python function or class
○ Toil runs jobs on a single node by delegating to a sub-scheduler
○ Files get written to persistent FileStore for communication between jobs running on different
nodes, use local cache to improve performance
● Supports many underlying schedulers/file stores
● When running in the cloud, jobs can be autoscaled
○ Happens transparently to the running jobs
Accommodating “clusters” in a job-oriented workflow
● Hint: a cluster is just a collection
of single node jobs!
○ However, the fate of all of the jobs
are tied!
● Service jobs persist for the
whole lifetime of the task that
spawned them
● Spark has a clean mapping to a
set of services
Vivian et al, 2016
ADAM Evaluation
● ADAM produces statistically
equivalent results to the GATK best
practices pipeline
● Read preprocessing is >30x faster
and 3x cheaper
● In the process of recalling the
Simons Genome Diversity Project
using ADAM
● We have a working pipeline using
both HG19 and GRCh38
0 20 40 60 80 100 120 140 160
Runtime (hours)
ADAM
GATK
Spark
HDFS
Spark
HDFS
Spark
HDFS
Spark
HDFS
Spark
HDFS
ADAM
ADAM Evaluation
● ADAM produces statistically
equivalent results to the GATK best
practices pipeline
● Our end-to-end pipeline is 3.5x faster
while also being 4x cheaper
● In the process of recalling the
Simons Genome Diversity Project
using ADAM
● We have a working pipeline using
both HG19 and GRCh38
0 20 40 60 80 100 120 140 160
Runtime (hours)
ADAM
GATK
Spark
HDFS
Spark
HDFS
Spark
HDFS
Spark
HDFS
Spark
HDFS
ADAM
GATK-HC
Check out the code!
ADAM: https://github.com/bigdatagenomics/adam
Check out a demo!
https://databricks.com/blog/2016/05/24/
genome-sequencing-in-a-nutshell.html
Run ADAM in Databricks CE!
http://goo.gl/xK8x7s
Acknowledgements
● UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric
Tu, Alyssa Morrow, Niranjan Kumar, Ananth Pallaseni, Michael Heuer, Justin Paschall, Taner
Dagdelen, Anthony D. Joseph, Dave Patterson
● Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher
● GenomeBridge: Carl Yeksigian
● Cloudera: Uri Laserson, Tom White
● Microsoft Research: Ravi Pandya, Bill Bolosky
● UC Santa Cruz: Benedict Paten, David Haussler, Hannes Schmidt, Beau Norgeot, Audrey
Musselman-Brown, John Vivian
● And many other open source contributors, especially Neil Ferguson, Andy Petrella, Xavier Tordior,
Deborah Siegel, Denny Lee
● Total of 52 contributors to ADAM/BDG from >12 institutions, 23 contributors to Toil from 5
institutions

Contenu connexe

Tendances

Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Spark Summit
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
François Garillot
 

Tendances (20)

Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
 
Lambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandLambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie Strickland
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Spark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf Jagerman
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
 
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souzaSpark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
 

En vedette

ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
fnothaft
 
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisTAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
robertstevens65
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
Matt Massie
 

En vedette (20)

Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Solving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized GenomicsSolving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized Genomics
 
Strata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationStrata-Hadoop 2015 Presentation
Strata-Hadoop 2015 Presentation
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAM
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
"Big Data" Bioinformatics
"Big Data" Bioinformatics"Big Data" Bioinformatics
"Big Data" Bioinformatics
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisTAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
 
AWS Webcast - An Introduction to High Performance Computing on AWS
AWS Webcast - An Introduction to High Performance Computing on AWSAWS Webcast - An Introduction to High Performance Computing on AWS
AWS Webcast - An Introduction to High Performance Computing on AWS
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Coaching poster
Coaching posterCoaching poster
Coaching poster
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...
 
The need to redefine genomic data sharing - moving towards Open Science Oct ...
The need to redefine genomic data sharing - moving towards Open Science  Oct ...The need to redefine genomic data sharing - moving towards Open Science  Oct ...
The need to redefine genomic data sharing - moving towards Open Science Oct ...
 
Building cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and DockerBuilding cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and Docker
 
Limits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in BioinformaticsLimits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in Bioinformatics
 

Similaire à Processing 70Tb Of Genomics Data With ADAM And Toil

Adam bosc-071114
Adam bosc-071114Adam bosc-071114
Adam bosc-071114
fnothaft
 

Similaire à Processing 70Tb Of Genomics Data With ADAM And Toil (20)

Adam bosc-071114
Adam bosc-071114Adam bosc-071114
Adam bosc-071114
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and Benchmarking
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Accelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of GenomicsAccelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of Genomics
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud Experiences
 
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical ResearchBruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
 
2017 nov reflow sbtb
2017 nov reflow sbtb2017 nov reflow sbtb
2017 nov reflow sbtb
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
 
Artificial Intelligence Database Performance Tuning
Artificial Intelligence Database Performance TuningArtificial Intelligence Database Performance Tuning
Artificial Intelligence Database Performance Tuning
 

Plus de Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Plus de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Dernier

Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 

Dernier (20)

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 

Processing 70Tb Of Genomics Data With ADAM And Toil

  • 1. Analyzing 100’s of TB of Genomic Data with ADAM and Toil Frank Austin Nothaft @fnothaft, fnothaft@berkeley.edu 6/8/2016
  • 3. The Sequencing Abstraction ● Sequencing is a Poisson substring sampling process ● For $1,000, we can sequence a 30x copy of your genome, but what is the analysis cost?
  • 6. End-to-end variant analysis Broad Institute of MIT/Harvard
  • 7. End-to-end variant analysis Align and clean up errors in raw reads 150 hrs/sample Broad Institute of MIT/Harvard
  • 8. End-to-end variant analysis Generate diffs between samples 1 week/100 samples Broad Institute of MIT/Harvard
  • 9. Genomics is built around flattened tools Genomics is built around legacy file formats: ● E.g., SAM/BAM → alignment, VCF → variants, BED/GTF/etc → features ● Manually curated text/binary flat files These formats dictate: ● What accesses can be optimized (read a full row) ● What predicates can be evaluated (small number of genomic loci) ● How we write genomic algorithms (sorted iterator over genome)
  • 10. Why avoid flat architectures? Flat architectures tend to expose bad programming interfaces: ● What access patterns do our flat files lock us into? ● GATK: sorted iterator over the genome What do flat architectures break? 1. Trivial: low level abstractions are not productive 2. Trivial: flat architectures create technical lock-in 3. Subtle: low level abstractions can introduce bugs
  • 11. Green field genomics: start with a schema!
  • 12. A schema provides a narrow waist
  • 13. We can use our stack to accelerate common queries! ● In genomics, we commonly have to find observations that overlap in a coordinate plane ● This coordinate plane is genomics specific, and is known a priori ● We can use our knowledge of the coordinate plane to implement a fast overlap join
  • 14. ADAM/Spark Yields Horizontal Scalability ● 30–50x speedup over traditional implementations ● Speedup extends to O (200MB / node) ● 3x improvement in analysis cost
  • 15. This means we can support genomic EDA as well... ● Narrow waist in stack → can swap in/out stages ● If I want to run interactive queries against genomic loci, swap in an RDD implementation that is optimized for point/range queries
  • 16. Using ADAM+Spark+Toil ● On-going validation of ADAM on Toil against existing “best practice” tools ● Using Simons Genome Diversity Project: ○ 260 individuals from 127 ethnic groups: more than 100TB of genomic data ● At peak scale, we will be running on ~1,000 r3.8xlarge instances
  • 17. Toil: Pipeline Architecture for Massive Workflows ● Work led by UCSC/BD2K Center ● Massively Scalable — tested on 32,000 cores ● Resumable after failure of any node ● Portable: installed with a single command ○ Amazon, OpenStack, Azure, HPC, and Google (soon) ● Simple: built entirely in Python ● Open-source → github.com/BD2KGenomics/toil Develop workflows locally… Deploy at scale without changing source code!
  • 18. Toil is a DAG metascheduler ● Users decompose a workflow into jobs and data: ○ Job → python function or class ○ Toil runs jobs on a single node by delegating to a sub-scheduler ○ Files get written to persistent FileStore for communication between jobs running on different nodes, use local cache to improve performance ● Supports many underlying schedulers/file stores ● When running in the cloud, jobs can be autoscaled ○ Happens transparently to the running jobs
  • 19. Accommodating “clusters” in a job-oriented workflow ● Hint: a cluster is just a collection of single node jobs! ○ However, the fate of all of the jobs are tied! ● Service jobs persist for the whole lifetime of the task that spawned them ● Spark has a clean mapping to a set of services Vivian et al, 2016
  • 20. ADAM Evaluation ● ADAM produces statistically equivalent results to the GATK best practices pipeline ● Read preprocessing is >30x faster and 3x cheaper ● In the process of recalling the Simons Genome Diversity Project using ADAM ● We have a working pipeline using both HG19 and GRCh38 0 20 40 60 80 100 120 140 160 Runtime (hours) ADAM GATK Spark HDFS Spark HDFS Spark HDFS Spark HDFS Spark HDFS ADAM
  • 21. ADAM Evaluation ● ADAM produces statistically equivalent results to the GATK best practices pipeline ● Our end-to-end pipeline is 3.5x faster while also being 4x cheaper ● In the process of recalling the Simons Genome Diversity Project using ADAM ● We have a working pipeline using both HG19 and GRCh38 0 20 40 60 80 100 120 140 160 Runtime (hours) ADAM GATK Spark HDFS Spark HDFS Spark HDFS Spark HDFS Spark HDFS ADAM GATK-HC
  • 22. Check out the code! ADAM: https://github.com/bigdatagenomics/adam Check out a demo! https://databricks.com/blog/2016/05/24/ genome-sequencing-in-a-nutshell.html Run ADAM in Databricks CE! http://goo.gl/xK8x7s
  • 23. Acknowledgements ● UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Alyssa Morrow, Niranjan Kumar, Ananth Pallaseni, Michael Heuer, Justin Paschall, Taner Dagdelen, Anthony D. Joseph, Dave Patterson ● Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher ● GenomeBridge: Carl Yeksigian ● Cloudera: Uri Laserson, Tom White ● Microsoft Research: Ravi Pandya, Bill Bolosky ● UC Santa Cruz: Benedict Paten, David Haussler, Hannes Schmidt, Beau Norgeot, Audrey Musselman-Brown, John Vivian ● And many other open source contributors, especially Neil Ferguson, Andy Petrella, Xavier Tordior, Deborah Siegel, Denny Lee ● Total of 52 contributors to ADAM/BDG from >12 institutions, 23 contributors to Toil from 5 institutions