Rethinking Data-Intensive Science Using Scalable Analytics Systems

Frank Austin Nothaft
UC Berkeley AMP/ASPIRE Lab, @fnothaft
With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja,
Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson

Scientific revolutions are
driven by data acquisition
revolutions

Genome Sequencing
Source: NIH National Human Genome Research Institute
2014: ~230,000 genomes sequenced
15-250GB/genome = ~30TB/day
= ~10PB/year
Human Genome Project: ~10GB
1000 Genomes: 15TB
TCGA: 3PB
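
The throughput figures above can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 PB = 1000 TB = 1,000,000 GB) and a 365-day year; neither assumption is stated on the slide. It derives the yearly volume from the daily rate, then the average per-genome size that volume implies, which should fall inside the quoted 15-250GB range.

```python
# Back-of-the-envelope check of the slide's figures.
# Assumptions (not from the slide): decimal units and a 365-day year.
TB_PER_DAY = 30              # from the slide
GENOMES_PER_YEAR = 230_000   # from the slide (2014)

# Yearly volume implied by the daily rate.
pb_per_year = TB_PER_DAY * 365 / 1000

# Average size per genome implied by the yearly volume.
gb_per_genome = pb_per_year * 1_000_000 / GENOMES_PER_YEAR

print(f"{pb_per_year:.2f} PB/year")
print(f"{gb_per_genome:.1f} GB/genome on average")
```

The yearly total comes out slightly above the rounded ~10PB on the slide, and the implied average sits toward the low end of the per-genome range.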

Sequencing advances line up well
with scalable analytics software
Source: NIH National Human Genome Research Institute
Google
MapReduce
Hadoop MR
Spark
Parquet

Mapping scientific systems to
commodity analytics systems
• Contemporary scientific systems are custom-built
• As a result, functionality that commodity systems already provide gets rebuilt
• We have an opportunity to rethink the abstractions that
scientific systems use:
• Migrate from a flat architecture to a stacked
architecture
• Expose higher level programming primitives
• Use commodity tools wherever possible

Common Traits of Legacy Data-Intensive
Scientific Systems
1. Computation is workflow/pipeline oriented
2. Processing system has monolithic/flat architecture
3. Data is stored in flat files

Genomics Pipelines
Source: The Broad Institute of MIT/Harvard

Flat File Formats
• Scientific data is typically stored in application
specific file formats:
• Genomic reads: SAM/BAM, CRAM
• Genomic variants: VCF/BCF, MAF
• Genomic features: BED, NarrowPeak, GTF
• Centralized metadata makes it difficult to parallelize
applications

Flat Architectures
• APIs present very barebones abstractions:
• GATK: Sorted iterator over the genome
• Why are flat architectures bad?
1. Trivial: low level abstractions are not productive
2. Trivial: flat architectures create technical lock-in
3. Subtle: low level abstractions can introduce bugs

The perils of flattening…
• The trivial:
• You can improve performance by pushing data
access order into your data layout
• But now, you can’t easily compose pipeline stages
that have different access orders
• The obscure:
• If you access data via a sorted iterator, will you
incorrectly implement your algorithm?
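
One way the sorted-iterator pitfall can show up is sketched below. This is an illustrative example, not one taken from the talk: the read names, positions, and both function names are hypothetical. A stage that wants to pair mates by read name is fed records in position order. Because `itertools.groupby` only merges adjacent records, the naive version silently splits each pair, while the correct version first imposes the access order it actually needs.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical reads as (read_name, start) tuples, stored sorted by start position.
reads_by_position = [
    ("readA", 100),
    ("readB", 150),
    ("readA", 300),  # mate of the first readA record
    ("readB", 320),  # mate of the first readB record
]


def pair_mates_naive(reads):
    """Buggy: groupby only merges adjacent records, so position order splits mates."""
    return [(name, [start for _, start in group])
            for name, group in groupby(reads, key=itemgetter(0))]


def pair_mates(reads):
    """Correct: sort by read name (the order this stage needs) before grouping."""
    by_name = sorted(reads, key=itemgetter(0))
    return [(name, [start for _, start in group])
            for name, group in groupby(by_name, key=itemgetter(0))]


print(pair_mates_naive(reads_by_position))
print(pair_mates(reads_by_position))
```

The naive version runs without error, which is what makes this class of bug easy to miss.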

First, define a schema
record AlignmentRecord {
union { null, Contig } contig = null;
union { null, long } start = null;
union { null, long } end = null;
union { null, int } mapq = null;
union { null, string } readName = null;
union { null, string } sequence = null;
union { null, string } mateReference = null;
union { null, long } mateAlignmentStart = null;
union { null, string } cigar = null;
union { null, string } qual = null;
union { null, string } recordGroupName = null;
union { int, null } basesTrimmedFromStart = 0;
union { int, null } basesTrimmedFromEnd = 0;
union { boolean, null } readPaired = false;
union { boolean, null } properPair = false;
union { boolean, null } readMapped = false;
union { boolean, null } mateMapped = false;
union { boolean, null } firstOfPair = false;
union { boolean, null } secondOfPair = false;
union { boolean, null } failedVendorQualityChecks = false;
union { boolean, null } duplicateRead = false;
union { boolean, null } readNegativeStrand = false;
union { boolean, null } mateNegativeStrand = false;
union { boolean, null } primaryAlignment = false;
union { boolean, null } secondaryAlignment = false;
union { boolean, null } supplementaryAlignment = false;
union { null, string } mismatchingPositions = null;
union { null, string } origQual = null;
union { null, string } attributes = null;
union { null, string } recordGroupSequencingCenter = null;
union { null, string } recordGroupDescription = null;
union { null, long } recordGroupRunDateEpoch = null;
union { null, string } recordGroupFlowOrder = null;
union { null, string } recordGroupKeySequence = null;
union { null, string } recordGroupLibrary = null;
union { null, int } recordGroupPredictedMedianInsertSize = null;
union { null, string } recordGroupPlatform = null;
union { null, string } recordGroupPlatformUnit = null;
union { null, string } recordGroupSample = null;
union { null, Contig } mateContig = null;
}
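
To show how application code might see such a schema, here is a minimal Python sketch that mirrors a handful of the fields above as a dataclass. It is an illustration only, not generated Avro bindings: the snake_case field names are a Python convention, and `contig_name` is a simplification standing in for the nested `Contig` record. Nullable unions with a `null` default become `Optional[...] = None`. The flag fields keep their concrete defaults, matching the unions that list the non-null branch first (Avro requires a union's default to match its first branch).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlignmentRecord:
    # union { null, T } x = null  ->  Optional[T] = None
    contig_name: Optional[str] = None  # simplification of the Contig record
    start: Optional[int] = None
    end: Optional[int] = None
    mapq: Optional[int] = None
    read_name: Optional[str] = None
    sequence: Optional[str] = None
    cigar: Optional[str] = None
    # union { boolean, null } x = false  ->  Optional[bool] = False
    read_mapped: Optional[bool] = False
    read_negative_strand: Optional[bool] = False


# An unmapped read only needs the fields that are actually known.
unmapped = AlignmentRecord(read_name="read1", sequence="ACGT")
print(unmapped)
```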
[Figure: stack architecture, top to bottom: Application (Transformations), Presentation (Enriched Models), Evidence Access (MapReduce/DBMS), Schema (Data Models), Materialized Data (Columnar Storage), Data Distribution (Parallel FS), Physical Storage (Attached Storage)]

A schema provides a
narrow waist

Accelerate common
access patterns
• In genomics, we commonly
have to find observations that
overlap in a coordinate plane
• This coordinate plane is
genomics specific, and is
known a priori
• We can use our knowledge of
the coordinate plane to
implement a fast overlap join
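
A minimal sketch of an overlap join that exploits the known coordinate system follows. It uses fixed-size binning, which is one common technique and is not claimed to be the approach used in the talk. The `Region` tuple layout, the half-open interval convention, the function names, and the default `bin_size` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed layout: (contig, start, end) with half-open interval [start, end).
Region = Tuple[str, int, int]


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions are on the same contig and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def bins_for(region: Region, bin_size: int) -> Iterable[Tuple[str, int]]:
    """Yield every (contig, bin index) that the region touches."""
    contig, start, end = region
    for b in range(start // bin_size, (end - 1) // bin_size + 1):
        yield (contig, b)


def overlap_join(left: List[Region], right: List[Region],
                 bin_size: int = 1000) -> List[Tuple[Region, Region]]:
    """Return all (left, right) pairs that overlap, comparing only within shared bins."""
    index: Dict[Tuple[str, int], List[Tuple[int, Region]]] = defaultdict(list)
    for j, r in enumerate(right):
        for key in bins_for(r, bin_size):
            index[key].append((j, r))

    out = []
    for l in left:
        matched = set()  # right-side indices already emitted for this left region
        for key in bins_for(l, bin_size):
            for j, r in index.get(key, ()):
                if j not in matched and overlaps(l, r):
                    matched.add(j)
                    out.append((l, r))
    return out


left = [("chr1", 100, 200), ("chr1", 950, 1050), ("chr2", 0, 10)]
right = [("chr1", 150, 160), ("chr1", 1000, 2500), ("chr1", 200, 300), ("chr2", 5, 6)]
for pair in overlap_join(left, right):
    print(pair)
```

Each region is compared only against regions that share a bin, so the join avoids the all-pairs comparison of a naive nested loop. In a distributed setting the bin key could also serve as a partitioning key, though that is beyond this sketch.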

Pick appropriate storage
• When accessing scientific
datasets, we frequently slice and
dice the dataset:
• Algorithms may touch
subsets of columns
• We don’t always touch the
whole dataset
• This is a good match for
columnar storage
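
The match between column-subset access and columnar layout can be illustrated with a toy in-memory example. This is a sketch of the idea only, not of Parquet or any real storage engine; the record fields and function names are hypothetical, and all rows are assumed to share the same keys. Once records are pivoted into one list per field, an analysis that needs a single field reads that column alone and never touches the bulky sequence data.

```python
from typing import Any, Dict, List, Sequence


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per field (assumes uniform keys)."""
    fields = rows[0].keys() if rows else ()
    return {field: [row[field] for row in rows] for field in fields}


def project(columns: Dict[str, List[Any]], fields: Sequence[str]) -> Dict[str, List[Any]]:
    """Select a subset of columns without touching the others."""
    return {field: columns[field] for field in fields}


def mean_mapq(columns: Dict[str, List[Any]]) -> float:
    """Average mapping quality, skipping null values; reads only the mapq column."""
    values = [q for q in columns["mapq"] if q is not None]
    return sum(values) / len(values)


rows = [
    {"readName": "r1", "sequence": "ACGT" * 25, "mapq": 60},
    {"readName": "r2", "sequence": "TTGA" * 25, "mapq": 30},
    {"readName": "r3", "sequence": "GGCC" * 25, "mapq": None},
]
columns = to_columns(rows)
print(mean_mapq(project(columns, ["mapq"])))
```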

Is introducing a new data
model really a good idea?
Source: XKCD, http://xkcd.com/927/

A subtle point:
Proper stack design can simplify
backwards compatibility
To support legacy data formats, you define a way to
serialize/deserialize the schema into/from the
legacy flat file format
[Figure: two stacks with the same Schema (Data Models) and Data Distribution layers; the Materialized Data layer is a Legacy File Format in one and Columnar Storage in the other]
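
As a rough sketch of what serializing and deserializing the schema to a legacy format could look like, the functions below convert between a single SAM text line and a small record dictionary. This is a deliberately reduced illustration, not ADAM's converter. It handles only the 11 mandatory SAM columns, treats every read as single-end (mate columns are written as `*`, `0`, `0`), and reconstructs only the unmapped (0x4) and reverse-strand (0x10) flag bits. The `contigName` key stands in for the schema's `Contig` record, and the 0-based `start` is an assumption about the schema's coordinate convention (SAM's POS column is 1-based).

```python
from typing import Any, Dict


def from_sam(line: str) -> Dict[str, Any]:
    """Deserialize one SAM line into a record dict (mandatory columns only)."""
    (qname, flag, rname, pos, mapq, cigar,
     _rnext, _pnext, _tlen, seq, qual) = line.rstrip("\n").split("\t")[:11]
    flag_bits = int(flag)
    return {
        "readName": qname,
        "contigName": None if rname == "*" else rname,
        # Assumption: the schema stores 0-based starts; SAM POS is 1-based, 0 = unset.
        "start": None if pos == "0" else int(pos) - 1,
        "mapq": int(mapq),
        "cigar": None if cigar == "*" else cigar,
        "sequence": seq,
        "qual": qual,
        "readMapped": not (flag_bits & 0x4),
        "readNegativeStrand": bool(flag_bits & 0x10),
    }


def to_sam(rec: Dict[str, Any]) -> str:
    """Serialize a record dict back to a SAM line (single-end simplification)."""
    flag_bits = (0 if rec["readMapped"] else 0x4) | (0x10 if rec["readNegativeStrand"] else 0)
    fields = [
        rec["readName"],
        str(flag_bits),
        rec["contigName"] or "*",
        str(0 if rec["start"] is None else rec["start"] + 1),
        str(rec["mapq"]),
        rec["cigar"] or "*",
        "*", "0", "0",  # RNEXT, PNEXT, TLEN: mate information is not modeled here
        rec["sequence"],
        rec["qual"],
    ]
    return "\t".join(fields)


mapped_line = "r1\t16\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII"
record = from_sam(mapped_line)
print(record)
print(to_sam(record) == mapped_line)
```

Because applications only see the record, the same analysis code can run whether the bytes came from a legacy file or from columnar storage.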

Serialized this way, the legacy file format is a view of the schema.

A well designed stack
simplifies application design
• Application (Transformations): users define analyses via transformations, e.g. variant calling & analysis, RNA-seq analysis
• Presentation (Enriched Models): enriched models provide convenient methods on common models, e.g. Enriched Read/Variant
• Evidence Access (MapReduce/DBMS): the evidence access layer efficiently executes transformations; Spark, Spark-SQL, Hadoop
• Schema (Data Models): schemas define the logical structure of basic genomic objects; Avro schema for reads, variants, and genotypes
• Materialized Data (Columnar Storage): common interfaces map logical schema to bytes on disk; load data from Parquet and legacy formats
• Data Distribution (Parallel FS): parallel file system layer coordinates distribution of data; HDFS, Tachyon, HPC file systems, S3
• Physical Storage (Attached Storage): decoupling storage enables performance/cost tradeoff; disk, SSD, block store, memory cache

How does this perform
on real scientific data?

ADAM performs genomic
preprocessing
Source: The Broad Institute of MIT/Harvard

ADAM’s Performance
• Achieve linear scalability out
to 128 nodes for most tasks
• Up to 3x improvement over
current tools on a single node
Analysis run using Amazon EC2, single node was i2.8xlarge, cluster was r3.2xlarge
Scripts available at https://www.github.com/bigdatagenomics/bdg-services.git

Astronomy Pipelines
Source: The LSST Project

Astronomy Image
Co-addition Performance
• Scales out to 16 nodes
• ~3x improvement over extant
tool on a single node
Analysis run using Amazon EC2, cluster was c3.8xlarge (HPC optimized)

Conclusions
• There is a huge increase in the amount of scientific
data being processed
• Although scientific processing pipelines tend to be
custom solutions, we can replace these pipelines
with general, DBMS-backed solutions
• When we move to a general solution, we can gain
performance without losing correctness

Acknowledgements
• ADAM (https://www.github.com/bigdatagenomics/adam):
• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey
Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony
Joseph, Dave Patterson
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman,
Jeff Hammerbacher
• GenomeBridge: Carl Yeksigian
• Cloudera: Uri Laserson
• Microsoft Research: Ravi Pandya
• UC Santa Cruz: Benedict Paten, David Haussler
• KIRA (https://www.github.com/BIDS/Kira):
• UC Berkeley: Zhao Zhang, Mike Franklin, Evan Sparks, Kyle Barbary,
Oliver Zahn, Saul Perlmutter
• PoC code at https://github.com/zhaozhang/SparkMontage
