SlideShare une entreprise Scribd logo
1  sur  18
Télécharger pour lire hors ligne
ADAM: Fast, Scalable
Genome Analysis
Frank Austin Nothaft	

AMPLab, University of California, Berkeley, @fnothaft	

	

with: Matt Massie,André Schumacher,Timothy Danford, Chris Hartl, Jey
Kottalam,Arun Aruha, Neal Sidhwaney, Michael Linderman, Jeff
Hammerbacher,Anthony Joseph, and Dave Patterson	

	

https://github.com/bigdatagenomics	

http://www.bdgenomics.org
Problem
• Whole genome files are large	

• Biological systems are complex	

• Population analysis requires petabytes of
data	

• Analysis time is often a matter of life and
death
Whole Genome

Data Sizes
Input
Pipeline	

Stage
Output
SNAP
1GB Fasta	

150GB Fastq
Alignment 250GB BAM
ADAM 250GB BAM
Pre-
processing
200GB
ADAM
Avocado
200GB
ADAM
Variant
Calling
10MB ADAM
Variants found at about 1 in 1,000 loci
Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities	

Text printed on 5 long spools
• How can he reconstruct the text?	

– 5 copies x 138, 656 words / 5 words per fragment = 138k fragments	

– The short fragments from every copy are mixed together	

– Some fragments are identical
It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, itbest of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, itbest of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, …
Slide credit to Michael Schatz http://schatzlab.cshl.edu/
What is ADAM?
• File formats: columnar file format that
allows efficient parallel access to genomes	

• API: interface for transforming, analyzing,
and querying genomic data	

• CLI: a handy toolkit for quickly processing
genomes
Design Goals
• Develop processing pipeline that enables
efficient, scalable use of cluster/cloud	

• Provide data format that has efficient
parallel/distributed access across platforms	

• Enhance semantics of data and allow more
flexible data access patterns
Implementation Overview
• 25K lines of Scala code	

• 100% Apache-licensed open-source	

• 18 contributors from 6 institutions	

• Working towards a production quality release late 2014
ADAM Stack
Physical
File/Block
Record/Split
‣Commodity Hardware	

‣Cloud Systems - Amazon, GCE, Azure
‣Hadoop Distributed Filesystem	

‣Local Filesystem
‣Schema-driven records w/ Apache Avro	

‣Store and retrieve records using Parquet	

‣Read BAM Files using Hadoop-BAM
In-Memory
RDD
‣Transform records using Apache Spark	

‣Query with SQL using Shark	

‣Graph processing with GraphX	

‣Machine learning using MLBase
• OSS Created by Twitter and Cloudera, based on
Google Dremel	

• Columnar File Format:	

• Limits I/O to only data that is needed	

• Compresses very well - ADAM files are 5-25%
smaller than BAM files without loss of data	

• Fast scans - load only columns you need, e.g.
scan a read flag on a whole genome, high-
coverage file in less than a minute
Parquet
chrom20 TCGA 4M
chrom20 GAAT 4M1D
chrom20 CCGAT 5M
chrom20 chrom20 chrom20 TCGA GAAT CCGAT 4M 4M1D 5M
Column Oriented
chrom20 TCGA 4M chrom20 GAAT 4M1D chrom20 CCGAT 5M
Row Oriented
Predicate
Projection
Read Data
Cloud Optimizations
• Working on optimizations for loading
Parquet directly from S3	

• Building tools for changing cluster size as
spot prices fluctuate	

• Will separate code out for broader
community use
Scaling Genomics: BQSR
• DNA sequencers read 2% of sequence
incorrectly	

• Per base, estimate L(base is correct)	

• However, these estimates are poor, because
sequencers miss correlated errors
Empirical Error Rate
Spark BQSR
Implementation
• Broadcast 3 GB table of variants, used for
masking	

• Break reads down to bases and map bases
to covariates	

• Calculate empirical values per covariate	

• Broadcast observation, apply across reads
Future Work
• Pushing hard towards production release	

• Plan to release Python (possibly R) bindings	

• Work on interoperability with Global
Alliance for Genomic Health API (http://
genomicsandhealth.org/)
Call for contributions
• As an open source project, we welcome
contributions	

• We maintain a list of open enhancements at
our Github issue tracker	

• Enhancements tagged with “Pick me up!” don’t require a genomics background	

• Github: https://www.github.com/bdgenomics 	

• We’re also looking for two full time
engineers… see Matt Massie!
Acknowledgements
• UC Berkeley: Matt Massie,André Schumacher, Jey
Kottalam, Christos Kozanitis	

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael
Linderman, Jeff Hammerbacher	

• GenomeBridge: Timothy Danford, CarlYeksigian	

• The Broad Institute: Chris Hartl	

• Cloudera: Uri Laserson	

• Microsoft Research: Jeremy Elson, Ravi Pandya	

• And other open source contributors!
Acknowledgements
This research is supported in part by NSF CISE
Expeditions Award CCF-1139158, LBNL Award
7076018, and DARPA XData Award
FA8750-12-2-0331, and gifts from Amazon Web
Services, Google, SAP, The Thomas and Stacey
Siebel Foundation,Apple, Inc., C3Energy, Cisco,
Cloudera, EMC, Ericsson, Facebook, GameOnTalis,
Guavus, HP, Huawei, Intel, Microsoft, NetApp,
Pivotal, Splunk,Virdata,VMware,WANdisco and
Yahoo!.

Contenu connexe

Tendances

Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Timothy Danford
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMfnothaft
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMfnothaft
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at ScaleAndy Petrella
 
Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014StampedeCon
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
 
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...Amazon Web Services
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleDatabricks
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Spark Summit
 
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...Spark Summit
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIAllen Day, PhD
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...Sri Ambati
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLDatabricks
 
AWS Customer Presentation- University of Maryland
AWS Customer Presentation- University of MarylandAWS Customer Presentation- University of Maryland
AWS Customer Presentation- University of MarylandAmazon Web Services
 
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...Sri Ambati
 
CrateDB 101: Geospatial data
CrateDB 101: Geospatial dataCrateDB 101: Geospatial data
CrateDB 101: Geospatial dataClaus Matzinger
 

Tendances (20)

Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
 
Spark Summit East 2015
Spark Summit East 2015Spark Summit East 2015
Spark Summit East 2015
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
 
Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at Scale
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
 
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQL
 
AWS Customer Presentation- University of Maryland
AWS Customer Presentation- University of MarylandAWS Customer Presentation- University of Maryland
AWS Customer Presentation- University of Maryland
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
 
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
 
CrateDB 101: Geospatial data
CrateDB 101: Geospatial dataCrateDB 101: Geospatial data
CrateDB 101: Geospatial data
 

En vedette

Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMMatt Massie
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteMatt Massie
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
 
Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1MapR Technologies
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data scienceAndy Petrella
 
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Spark Summit
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't SpecialAllen Day, PhD
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 20162016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016Amazon Web Services
 
NEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGNEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGAayushi Pal
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Spark Summit
 
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedHail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedSpark Summit
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 

En vedette (14)

Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAM
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data science
 
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 20162016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
 
NEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGNEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCING
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedHail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 

Similaire à ADAM—Spark Summit, 2014

Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-datac.titus.brown
 
2015 09 emc lsug
2015 09 emc lsug2015 09 emc lsug
2015 09 emc lsugChris Dwan
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinityPeterMorrell4
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithmsc.titus.brown
 
Stamps.pptx
Stamps.pptxStamps.pptx
Stamps.pptxaaaa bbb
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Handsfnothaft
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analysesfnothaft
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011c.titus.brown
 
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...datascience_at
 
Big data solutions for advanced marketing analytics
Big data solutions for advanced marketing analyticsBig data solutions for advanced marketing analytics
Big data solutions for advanced marketing analyticsNatalino Busa
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 
Scott Edmunds: Data Dissemination in the era of "Big-Data"
Scott Edmunds: Data Dissemination in the era of "Big-Data"Scott Edmunds: Data Dissemination in the era of "Big-Data"
Scott Edmunds: Data Dissemination in the era of "Big-Data"GigaScience, BGI Hong Kong
 

Similaire à ADAM—Spark Summit, 2014 (20)

Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2015 09 emc lsug
2015 09 emc lsug2015 09 emc lsug
2015 09 emc lsug
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinity
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
Stamps.pptx
Stamps.pptxStamps.pptx
Stamps.pptx
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Hands
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analyses
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
 
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...
 
ChipSeq Data Analysis
ChipSeq Data AnalysisChipSeq Data Analysis
ChipSeq Data Analysis
 
Big data solutions for advanced marketing analytics
Big data solutions for advanced marketing analyticsBig data solutions for advanced marketing analytics
Big data solutions for advanced marketing analytics
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Scott Edmunds: Data Dissemination in the era of "Big-Data"
Scott Edmunds: Data Dissemination in the era of "Big-Data"Scott Edmunds: Data Dissemination in the era of "Big-Data"
Scott Edmunds: Data Dissemination in the era of "Big-Data"
 

Dernier

Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 

Dernier (20)

Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 

ADAM—Spark Summit, 2014

  • 1. ADAM: Fast, Scalable Genome Analysis Frank Austin Nothaft AMPLab, University of California, Berkeley, @fnothaft with: Matt Massie,André Schumacher,Timothy Danford, Chris Hartl, Jey Kottalam,Arun Aruha, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher,Anthony Joseph, and Dave Patterson https://github.com/bigdatagenomics http://www.bdgenomics.org
  • 2. Problem • Whole genome files are large • Biological systems are complex • Population analysis requires petabytes of data • Analysis time is often a matter of life and death
  • 3. Whole Genome
 Data Sizes Input Pipeline Stage Output SNAP 1GB Fasta 150GB Fastq Alignment 250GB BAM ADAM 250GB BAM Pre- processing 200GB ADAM Avocado 200GB ADAM Variant Calling 10MB ADAM Variants found at about 1 in 1,000 loci
  • 4. Shredded Book Analogy Dickens accidentally shreds the first printing of A Tale of Two Cities Text printed on 5 long spools • How can he reconstruct the text? – 5 copies x 138, 656 words / 5 words per fragment = 138k fragments – The short fragments from every copy are mixed together – Some fragments are identical It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, … It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness, It was the the worst of times, itbest of times, it was was the age of wisdom, it was the age of foolishness, … It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, … It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, … It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness, It was the the worst of times, itbest of times, it was was the age of wisdom, it was the age of foolishness, … It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, … It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, … Slide credit to Michael Schatz http://schatzlab.cshl.edu/
  • 5. What is ADAM? • File formats: columnar file format that allows efficient parallel access to genomes • API: interface for transforming, analyzing, and querying genomic data • CLI: a handy toolkit for quickly processing genomes
  • 6. Design Goals • Develop processing pipeline that enables efficient, scalable use of cluster/cloud • Provide data format that has efficient parallel/distributed access across platforms • Enhance semantics of data and allow more flexible data access patterns
  • 7. Implementation Overview • 25K lines of Scala code • 100% Apache-licensed open-source • 18 contributors from 6 institutions • Working towards a production quality release late 2014
  • 8. ADAM Stack Physical File/Block Record/Split ‣Commodity Hardware ‣Cloud Systems - Amazon, GCE, Azure ‣Hadoop Distributed Filesystem ‣Local Filesystem ‣Schema-driven records w/ Apache Avro ‣Store and retrieve records using Parquet ‣Read BAM Files using Hadoop-BAM In-Memory RDD ‣Transform records using Apache Spark ‣Query with SQL using Shark ‣Graph processing with GraphX ‣Machine learning using MLBase
  • 9. • OSS Created by Twitter and Cloudera, based on Google Dremel • Columnar File Format: • Limits I/O to only data that is needed • Compresses very well - ADAM files are 5-25% smaller than BAM files without loss of data • Fast scans - load only columns you need, e.g. scan a read flag on a whole genome, high- coverage file in less than a minute Parquet
  • 10. chrom20 TCGA 4M chrom20 GAAT 4M1D chrom20 CCGAT 5M chrom20 chrom20 chrom20 TCGA GAAT CCGAT 4M 4M1D 5M Column Oriented chrom20 TCGA 4M chrom20 GAAT 4M1D chrom20 CCGAT 5M Row Oriented Predicate Projection Read Data
  • 11. Cloud Optimizations • Working on optimizations for loading Parquet directly from S3 • Building tools for changing cluster size as spot prices fluctuate • Will separate code out for broader community use
  • 12. Scaling Genomics: BQSR • DNA sequencers read 2% of sequence incorrectly • Per base, estimate L(base is correct) • However, these estimates are poor, because sequencers miss correlated errors
  • 14. Spark BQSR Implementation • Broadcast 3 GB table of variants, used for masking • Break reads down to bases and map bases to covariates • Calculate empirical values per covariate • Broadcast observation, apply across reads
  • 15. Future Work • Pushing hard towards production release • Plan to release Python (possibly R) bindings • Work on interoperability with Global Alliance for Genomic Health API (http:// genomicsandhealth.org/)
  • 16. Call for contributions • As an open source project, we welcome contributions • We maintain a list of open enhancements at our Github issue tracker • Enhancements tagged with “Pick me up!” don’t require a genomics background • Github: https://www.github.com/bdgenomics • We’re also looking for two full time engineers… see Matt Massie!
  • 17. Acknowledgements • UC Berkeley: Matt Massie,André Schumacher, Jey Kottalam, Christos Kozanitis • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher • GenomeBridge: Timothy Danford, CarlYeksigian • The Broad Institute: Chris Hartl • Cloudera: Uri Laserson • Microsoft Research: Jeremy Elson, Ravi Pandya • And other open source contributors!
  • 18. Acknowledgements This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation,Apple, Inc., C3Energy, Cisco, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk,Virdata,VMware,WANdisco and Yahoo!.