3. How?
Sequence cell-free DNA from blood, at high depth. (Oversample.)
Analyze these data to look for cancers:
bioinformatics, statistics, machine learning.
Up to 1TB of input data (raw sequence reads) per sample.
4. Bioinformatics
Build software tools to analyze sequencing data.
Interdisciplinary: data structures, algorithms, biology, statistics,
mathematics.
Classic example: sequence alignment.
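The alignment problem can be sketched with a minimal Needleman-Wunsch dynamic program (a toy illustration of the idea, not any production aligner; the scoring values are arbitrary choices):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via Needleman-Wunsch dynamic programming."""
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # substitute
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    return dp[len(a)][len(b)]
```

Real aligners (BWA, Bowtie, etc.) use indexed heuristics, but this is the dynamic-programming core they approximate.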
6. A (simple) bioinformatics workflow
[Workflow diagram: FASTQ1 … FASTQN inputs are each aligned (Align) into BAM files, which are merged (Merge); the merged output fans out through parallel Dupmark → Filter → Bincounts stages, one branch ending in QC against an external reference; the bin counts then feed two models (Model 1, Model 2), each running its own Filter → Infer → Report stages.]
8. Computing infrastructure
HPC is typical: very expensive filer, expensive compute nodes, data shared
through network file system.
Or just single-node computation on large machines.
Workqueue systems: Sun Grid Engine is popular.
This model is not cloud friendly: in the cloud, storage lives in blobstores (S3, GCS, etc.), compute is elastic with on-demand capacity, and you can’t assume much about your environment.
9. Workflow frameworks
Most bioinformatics computing is done through a workflow framework.
There’s a cottage industry of these.
Most of them are low level: they require the user to construct an explicit
graph of execution nodes and dependencies, and they assume nothing about
the data model.
They are cumbersome, difficult to compose, and tie the hands of the implementor.
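The "explicit graph" style looks roughly like this (a generic sketch with invented names, not any particular framework's API): every node and every edge is wired by hand, and composing two workflows means manually rewiring edges.

```python
class Node:
    """One execution node in a hand-built workflow graph."""
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, list(deps)

def run(sink):
    """Execute a graph by resolving dependencies depth-first."""
    done = {}
    def visit(node):
        if node.name not in done:
            args = [visit(d) for d in node.deps]
            done[node.name] = node.fn(*args)
        return done[node.name]
    return visit(sink)

# Every step and every dependency is spelled out explicitly:
a = Node("extract", lambda: [3, 1, 2])
b = Node("sort", lambda xs: sorted(xs), deps=[a])
c = Node("head", lambda xs: xs[0], deps=[b])
```

Note that the framework knows nothing about what flows along the edges; that is the "no data model" problem the slide refers to.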
10. Example: Apache Airflow
“Hello world” example from Apache Airflow: it invokes two commands in bash.
Airflow is only involved in “orchestration”; distribution must be solved at
another layer (e.g., by invoking Hadoop).
11. What is Reflow?
Basic idea: get rid of the notion of a “workflow”; these are just programs!
Program them directly, applying the ideas of functional and data flow
programming languages. Lazy evaluation makes it easier to reason about
performance (yes, really), and simpler to compose. Type safety is useful.
Define a data model (referential transparency) that gives the runtime a lot of
leverage.
Combine into a single, vertically-integrated system that transparently
parallelizes and distributes work across private, elastic clusters.
(“Serverless”, “cloud native”.)
12. Desiderata: DSL
A simple, statically typed, functional language with Go-like syntax. (Quick
familiarity.) Just enough power for “workflows”, no more.
Compound data structures and composition (structs, lists, maps,
comprehensions.)
Referentially transparent, lazily evaluated. (Don’t perform unnecessary
work.)
Module system for reusability and testing.
Self-documenting.
13. Desiderata: runtime
Interpret programs directly “on the cloud” — and elastically.
Distribute tools via Docker images.
Cache all costly reductions: top-down, and then bottom-up evaluation.
Use work stealing to distribute evaluation.
Authenticate the user once; bootstrap credentials to other resources.
Portable: multiple cloud providers, on-premise, laptop.
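Referential transparency makes the caching above straightforward: a costly reduction can be keyed by a digest of the tool image, the command, and the digests of its inputs. This is a sketch of the idea; Reflow's actual key scheme may differ.

```python
import hashlib

def digest(data: bytes) -> str:
    """Content digest of an input object."""
    return hashlib.sha256(data).hexdigest()

def cache_key(image: str, cmd: str, input_digests: list) -> str:
    """Key a reduction by everything that determines its output:
    the Docker image, the command, and the inputs' content digests."""
    h = hashlib.sha256()
    for part in [image, cmd, *input_digests]:
        h.update(part.encode())
        h.update(b"\x00")  # unambiguous field separator
    return h.hexdigest()
```

Because the key covers everything that can affect the result, a hit in the cache (e.g., in S3) can safely stand in for re-running the tool.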
17. Evaluation
Evaluation is parallelized where there are no data dependencies.
All costly reductions are memoized (e.g., to S3), and can be reused.
Lazy evaluation enhances reasoning and composability.
The net effect is incremental evaluation! We always compute the smallest
difference our semantics permit.
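Lazy evaluation means a value is computed at most once, and only if something demands it. A toy illustration with memoized thunks (not Reflow's evaluator, just the principle):

```python
class Lazy:
    """A thunk: computed at most once, and only on demand."""
    def __init__(self, fn):
        self.fn, self.done = fn, False
    def force(self):
        if not self.done:
            self.value, self.done = self.fn(), True
        return self.value

evaluated = []
def step(name, result):
    # Record which steps actually ran.
    return Lazy(lambda: (evaluated.append(name), result)[1])

cheap = step("cheap", 1)
costly = step("costly", 2)            # defined, but never demanded
out = cheap.force() + cheap.force()   # second force reuses the memo
```

Only "cheap" ever runs, and only once; "costly" is never evaluated because nothing demands it.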
18. Example
func cleanup(data file) file = exec(…)
func analyze(data file) file = exec(…)
func merge(data [file]) file = exec(…)
val samples [file] = […]
val cleaned = [cleanup(s) | s <- samples]
val analyzed = [analyze(s) | s <- cleaned]
val merged = merge(analyzed)
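The incremental behavior of a pipeline like this can be mimicked in Python with a memo cache keyed on inputs (a sketch with placeholder stage functions, not Reflow's semantics): change one sample, and only that sample's stages plus the final merge rerun.

```python
calls, cache = [], {}

def memo(name, fn, *args):
    """Memoize by stage name + inputs; record actual executions."""
    key = (name, args)
    if key not in cache:
        calls.append(name)
        cache[key] = fn(*args)
    return cache[key]

# Placeholder stages standing in for the execs above:
def cleanup(x): return memo("cleanup", lambda v: v.strip(), x)
def analyze(x): return memo("analyze", lambda v: v.upper(), x)
def merge(xs):  return memo("merge", lambda vs: ",".join(vs), tuple(xs))

samples = [" a ", " b "]
r1 = merge([analyze(cleanup(s)) for s in samples])   # 5 executions
samples = [" a ", " c "]                             # change one sample
r2 = merge([analyze(cleanup(s)) for s in samples])   # only 3 more run
```

The second run reuses the cached work for the unchanged sample; only the changed sample's cleanup and analyze, plus the new merge, execute.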
21. Work stealing
[Diagram: a primary alloc and a stealer alloc, each with its own repository and dockerd; the primary holds tasks 1 and 2. The evaluator (1) leases a task to the stealer alloc; the stealer (2) transfers the task’s dependent objects, (3) performs the work, (4) transfers results to the cache repository and back, and (5) returns the task.]
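The protocol can be sketched as a lease/return loop (a simplified, single-process simulation with invented names; real Reflow allocs run remotely, with keep-alives and object transfer):

```python
import collections

class Primary:
    """Holds pending tasks; stealers lease work and return results."""
    def __init__(self, tasks):
        self.pending = collections.deque(tasks)
        self.leased, self.results = {}, {}
    def lease(self, worker):
        if not self.pending:
            return None
        task = self.pending.popleft()   # 1. lease task to the worker
        self.leased[task] = worker
        return task
    def give_back(self, task, result):
        del self.leased[task]           # 5. return the completed task
        self.results[task] = result

def steal_loop(primary, worker, perform):
    """A stealer drains the primary's queue, one lease at a time."""
    while (task := primary.lease(worker)) is not None:
        result = perform(task)          # 2-4. transfer deps, work, results
        primary.give_back(task, result)

p = Primary(["t1", "t2", "t3"])
steal_loop(p, "stealer-1", perform=lambda t: t.upper())
```

Leases matter because a stealer can die mid-task; an expired lease simply puts the task back in the pending queue, which is the "just restart" fault-tolerance story on the next slide.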
22. Runtime
Take advantage of the computing semantics to simplify.
Fault tolerance: just restart.
The evaluator is completely stateless: state is recomputed from program + cache.
Apply the end-to-end principle: the evaluator maintains keep-alives to allocs and
restarts on failure.
Result: a simple, robust, vertically integrated compute stack (~30K LOC).
23. Computing model
Because of lazy evaluation, hierarchies can be introspected cheaply.
Because of caching, any change (e.g., parts of pipeline, set of samples,
etc.) is incrementally computed: subcomputations are reused automatically.
Since we’re just computing values (e.g., a tree of sample analyses), Reflow
can also replace the superstructures around scientific computation (e.g.,
management of what’s been run where, their statuses, etc.)
Because of referential transparency, the runtime is given wide latitude in
cache management, data movement, tradeoffs of compute vs. storage costs,
etc.
24. Versioning and reproducibility
Hermetic and narrow environments help reproducibility.
We can now use ordinary source control to maintain versions.
% reflow ls -l root.rf/…
…
% git checkout v1.1 -- root.rf
% reflow ls -l root.rf/…
# different results
25. Status
Open sourced in 10/26! https://github.com/grailbio/reflow
Inside GRAIL, we use Reflow for all bioinformatics computation and for lots of
ad-hoc computing (building models, running one-off workflows, exploratory
analyses, etc.)
Still has a few kinks, but the model works well.
We have definitely untied the hands of the implementor.