3. How?
Sequence cell-free DNA from blood, at high depth. (Oversample.)
Analyze these data to look for cancers:
bioinformatics, statistics, machine learning.
Up to 1TB of input data (raw sequence reads) per sample.
4. Bioinformatics
Build software tools to analyze sequencing data.
Interdisciplinary: data structures, algorithms, biology, statistics,
mathematics.
Classic example: sequence alignment.
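The alignment problem can be sketched with a minimal Needleman-Wunsch dynamic program (a toy illustration of the idea, not any production aligner; the scoring values are arbitrary choices):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via Needleman-Wunsch dynamic programming."""
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # substitute
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    return dp[len(a)][len(b)]
```

Real aligners (BWA, Bowtie, etc.) use indexed heuristics, but this is the dynamic-programming core they approximate.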
6. A (simple) bioinformatics workflow
[Workflow diagram: FASTQ1 … FASTQN inputs are each aligned (Align) into BAM files, which are merged (Merge); the merged output fans out through parallel Dupmark → Filter → Bincounts stages, one branch ending in QC against an external reference; the bin counts then feed two models (Model 1, Model 2), each running its own Filter → Infer → Report stages.]
8. Computing infrastructure
HPC is typical: very expensive filer, expensive compute nodes, data shared
through network file system.
Or just single-node computation on large machines.
Workqueue systems: Sun Grid Engine is popular.
This model is not cloud friendly: in the cloud, storage lives in blobstores (S3, GCS, etc.), compute is elastic with on-demand capacity, and you can’t assume much about your environment.
9. Workflow frameworks
Most bioinformatics computing is done through a workflow framework.
There’s a cottage industry of these.
Most of them are low level: they require the user to construct an explicit
graph of execution nodes and dependencies, and they assume nothing about
the data model.
They are cumbersome, difficult to compose, and tie the hands of the implementor.
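The "explicit graph" style looks roughly like this (a generic sketch with invented names, not any particular framework's API): every node and every edge is wired by hand, and composing two workflows means manually rewiring edges.

```python
class Node:
    """One execution node in a hand-built workflow graph."""
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, list(deps)

def run(sink):
    """Execute a graph by resolving dependencies depth-first."""
    done = {}
    def visit(node):
        if node.name not in done:
            args = [visit(d) for d in node.deps]
            done[node.name] = node.fn(*args)
        return done[node.name]
    return visit(sink)

# Every step and every dependency is spelled out explicitly:
a = Node("extract", lambda: [3, 1, 2])
b = Node("sort", lambda xs: sorted(xs), deps=[a])
c = Node("head", lambda xs: xs[0], deps=[b])
```

Note that the framework knows nothing about what flows along the edges; that is the "no data model" problem the slide refers to.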
10. Example: Apache Airflow
“Hello world” example from Apache Airflow: it invokes two commands in bash.
Airflow is only involved in “orchestration”; distribution must be solved at
another layer (e.g., by invoking Hadoop).
11. What is Reflow?
Basic idea: get rid of the notion of a “workflow”; these are just programs!
Program them directly, applying the ideas of functional and data flow
programming languages. Lazy evaluation makes it easier to reason about
performance (yes, really), and simpler to compose. Type safety is useful.
Define a data model (referential transparency) that gives the runtime a lot of
leverage.
Combine into a single, vertically-integrated system that transparently
parallelizes and distributes work across private, elastic clusters.
(“Serverless”, “cloud native”.)
12. Desiderata: DSL
A simple, statically typed, functional language with Go-like syntax. (Quick
familiarity.) Just enough power for “workflows”, no more.
Compound data structures and composition (structs, lists, maps,
comprehensions.)
Referentially transparent, lazily evaluated. (Don’t perform unnecessary
work.)
Module system for reusability and testing.
Self-documenting.
13. Desiderata: runtime
Interpret programs directly “on the cloud” — and elastically.
Distribute tools via Docker images.
Cache all costly reductions: top-down, and then bottom-up evaluation.
Use work stealing to distribute evaluation.
Authenticate the user once; bootstrap credentials to other resources.
Portable: multiple cloud providers, on-premise, laptop.
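Referential transparency makes the caching above straightforward: a costly reduction can be keyed by a digest of the tool image, the command, and the digests of its inputs. This is a sketch of the idea; Reflow's actual key scheme may differ.

```python
import hashlib

def digest(data: bytes) -> str:
    """Content digest of an input object."""
    return hashlib.sha256(data).hexdigest()

def cache_key(image: str, cmd: str, input_digests: list) -> str:
    """Key a reduction by everything that determines its output:
    the Docker image, the command, and the inputs' content digests."""
    h = hashlib.sha256()
    for part in [image, cmd, *input_digests]:
        h.update(part.encode())
        h.update(b"\x00")  # unambiguous field separator
    return h.hexdigest()
```

Because the key covers everything that can affect the result, a hit in the cache (e.g., in S3) can safely stand in for re-running the tool.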
17. Evaluation
Evaluation is parallelized where there are no data dependencies.
All costly reductions are memoized (e.g., to S3), and can be reused.
Lazy evaluation enhances reasoning and composability.
The net effect is incremental evaluation! We always compute the smallest
difference our semantics permit.
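Lazy evaluation means a value is computed at most once, and only if something demands it. A toy illustration with memoized thunks (not Reflow's evaluator, just the principle):

```python
class Lazy:
    """A thunk: computed at most once, and only on demand."""
    def __init__(self, fn):
        self.fn, self.done = fn, False
    def force(self):
        if not self.done:
            self.value, self.done = self.fn(), True
        return self.value

evaluated = []
def step(name, result):
    # Record which steps actually ran.
    return Lazy(lambda: (evaluated.append(name), result)[1])

cheap = step("cheap", 1)
costly = step("costly", 2)            # defined, but never demanded
out = cheap.force() + cheap.force()   # second force reuses the memo
```

Only "cheap" ever runs, and only once; "costly" is never evaluated because nothing demands it.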
18. Example
func cleanup(data file) file = exec(…)
func analyze(data file) file = exec(…)
func merge(data [file]) file = exec(…)
val samples [file] = […]
val cleaned = [cleanup(s) | s <- samples]
val analyzed = [analyze(s) | s <- cleaned]
val merged = merge(analyzed)
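The incremental behavior of a pipeline like this can be mimicked in Python with a memo cache keyed on inputs (a sketch with placeholder stage functions, not Reflow's semantics): change one sample, and only that sample's stages plus the final merge rerun.

```python
calls, cache = [], {}

def memo(name, fn, *args):
    """Memoize by stage name + inputs; record actual executions."""
    key = (name, args)
    if key not in cache:
        calls.append(name)
        cache[key] = fn(*args)
    return cache[key]

# Placeholder stages standing in for the execs above:
def cleanup(x): return memo("cleanup", lambda v: v.strip(), x)
def analyze(x): return memo("analyze", lambda v: v.upper(), x)
def merge(xs):  return memo("merge", lambda vs: ",".join(vs), tuple(xs))

samples = [" a ", " b "]
r1 = merge([analyze(cleanup(s)) for s in samples])   # 5 executions
samples = [" a ", " c "]                             # change one sample
r2 = merge([analyze(cleanup(s)) for s in samples])   # only 3 more run
```

The second run reuses the cached work for the unchanged sample; only the changed sample's cleanup and analyze, plus the new merge, execute.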
21. Work stealing
[Diagram: a primary alloc and a stealer alloc, each with its own repository and dockerd; the primary holds tasks 1 and 2. The evaluator (1) leases a task to the stealer alloc; the stealer (2) transfers the task’s dependent objects, (3) performs the work, (4) transfers results to the cache repository and back, and (5) returns the task.]
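The protocol can be sketched as a lease/return loop (a simplified, single-process simulation with invented names; real Reflow allocs run remotely, with keep-alives and object transfer):

```python
import collections

class Primary:
    """Holds pending tasks; stealers lease work and return results."""
    def __init__(self, tasks):
        self.pending = collections.deque(tasks)
        self.leased, self.results = {}, {}
    def lease(self, worker):
        if not self.pending:
            return None
        task = self.pending.popleft()   # 1. lease task to the worker
        self.leased[task] = worker
        return task
    def give_back(self, task, result):
        del self.leased[task]           # 5. return the completed task
        self.results[task] = result

def steal_loop(primary, worker, perform):
    """A stealer drains the primary's queue, one lease at a time."""
    while (task := primary.lease(worker)) is not None:
        result = perform(task)          # 2-4. transfer deps, work, results
        primary.give_back(task, result)

p = Primary(["t1", "t2", "t3"])
steal_loop(p, "stealer-1", perform=lambda t: t.upper())
```

Leases matter because a stealer can die mid-task; an expired lease simply puts the task back in the pending queue, which is the "just restart" fault-tolerance story on the next slide.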
22. Runtime
Take advantage of the computing semantics to simplify.
Fault tolerance: just restart.
The evaluator is completely stateless: state is recomputed from program + cache.
Apply the end-to-end principle: the evaluator maintains keep-alives to allocs and
restarts on failure.
Result: a simple, robust, vertically integrated compute stack (~30K LOC).
23. Computing model
Because of lazy evaluation, hierarchies can be introspected cheaply.
Because of caching, any change (e.g., parts of pipeline, set of samples,
etc.) is incrementally computed: subcomputations are reused automatically.
Since we’re just computing values (e.g., a tree of sample analyses), Reflow
can also replace the superstructures around scientific computation (e.g.,
management of what’s been run where, their statuses, etc.)
Because of referential transparency, the runtime is given wide latitude in
cache management, data movement, tradeoffs of compute vs. storage costs,
etc.
24. Versioning and reproducibility
Hermetic and narrow environments help reproducibility.
We can now use ordinary source control to maintain versions.
% reflow ls -l root.rf/…
…
% git checkout v1.1 -- root.rf
% reflow ls -l root.rf/…
# different results
25. Status
Open sourced in 10/26! https://github.com/grailbio/reflow
Inside GRAIL, we use Reflow for all bioinformatics computation and for lots of
ad-hoc computing (building models, running one-off workflows, exploratory
analyses, etc.)
Still has a few kinks, but the model works well.
We have definitely untied the hands of the implementor.