Cloud Science
Marius Eriksen, GRAIL
Scale by the Bay, November 2017
How?
Sequence cell-free DNA from blood, at high depth. (Oversample.)
Analyze these data to look for cancers: bioinformatics, statistics, machine learning.
Up to 1TB of input data (raw sequence reads) per sample.
Bioinformatics
Build software tools to analyze sequencing data.
Interdisciplinary: data structures, algorithms, biology, statistics,
mathematics.
Classic example: sequence alignment.
A (simple) bioinformatics workflow
[Figure: workflow diagram with nodes FASTQ 1 … FASTQ N, Align, BAM, Merge, Dupmark, Filter, Bincounts, and QC.]
A (simple) bioinformatics workflow
[Figure: the same workflow diagram (FASTQ 1 … FASTQ N, Align, BAM, Merge, Dupmark, Filter, Bincounts, QC) with added nodes External reference, Model 1, Model 2, and two Filter, Infer, Report chains.]
Cross-sample analysis
[Figure: three copies of the per-sample workflow diagram (FASTQ 1 … FASTQ N, Align, BAM, Merge, Dupmark, Filter, Bincounts, QC, External reference, Model 1, Model 2, Filter, Infer, Report), together with nodes Classifier 1, Classifier 2, and Classifier 3.]
Computing infrastructure
HPC is typical: very expensive filer, expensive compute nodes, data shared
through network file system.
Or just single-node computation on large machines.
Workqueue systems: Sun Grid Engine is popular.
None of this is cloud friendly. In the cloud, storage is through blob stores
(S3, GCS, etc.), compute is elastic with on-demand capacity, and you can’t
assume much about your environment.
Workflow frameworks
Most bioinformatics computing is done through a workflow framework.
There’s a cottage industry of these.
Most of them are low level: they require the user to construct an explicit
graph of execution nodes + dependencies; don’t assume anything about data
model.
Cumbersome, difficult to compose, and ties the hands of the implementor.
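To make the low-level style concrete, here is a minimal sketch in Python using the standard library's graphlib. The step names come from the workflow diagram earlier in the talk; the graph layout and the run_step helper are hypothetical, and this is not the API of any particular workflow framework.

```python
from graphlib import TopologicalSorter

# Hypothetical explicit graph: each node maps to the set of nodes it depends on.
graph = {
    "align_1": set(),
    "align_2": set(),
    "merge": {"align_1", "align_2"},
    "dupmark": {"merge"},
    "filter": {"dupmark"},
    "bincounts": {"filter"},
}

def run_step(name: str) -> str:
    # Stand-in for invoking a real tool; here it only records the step name.
    return f"ran {name}"

def run_all(graph: dict[str, set[str]]) -> list[str]:
    # The framework only orders the nodes; it knows nothing about the data
    # flowing between them.
    order = TopologicalSorter(graph).static_order()
    return [run_step(name) for name in order]

log = run_all(graph)
```

The user has to spell out every node and edge by hand, and the graph carries no information about what each step consumes or produces.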
Example: Apache Airflow
“Hello world” example from Apache Airflow.
Invokes two commands in bash.
Is only involved in “orchestration”: distribution must be solved at another
layer (e.g., by invoking Hadoop).
What is Reflow?
Basic idea: get rid of the notion of a “workflow” — these are just programs!
Program them directly, applying the ideas of functional and data flow
programming languages. Lazy evaluation makes it easier to reason about
performance (yes, really), and simpler to compose. Type safety is useful.
Define data model (referential transparency) that gives the runtime a lot of
leverage.
Combine into a single, vertically-integrated system that transparently
parallelizes and distributes work across private, elastic clusters.
(“Serverless”, “cloud native”.)
Desiderata: DSL
A simple, statically typed, functional language with Go-like syntax. (Quick
familiarity.) Just enough power for “workflows”, no more.
Compound data structures and composition (structs, lists, maps,
comprehensions.)
Referentially transparent, lazily evaluated. (Don’t perform unnecessary
work.)
Module system for reusability and testing.
Self documenting.
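A minimal sketch of what "lazily evaluated" buys, written in Python rather than the Reflow DSL. The Lazy class and the expensive function are hypothetical illustrations, not Reflow's implementation: a value is declared up front but computed only when something demands it, and at most once.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")

class Lazy(Generic[T]):
    """A value that is computed at most once, and only if it is demanded."""

    def __init__(self, compute: Callable[[], T]) -> None:
        self._compute = compute
        self._done = False
        self._value = None

    def force(self) -> T:
        if not self._done:
            self._value = self._compute()
            self._done = True
        return self._value

calls = []

def expensive(name: str) -> str:
    calls.append(name)  # record that work was actually performed
    return name.upper()

# Both values are declared, but nothing runs yet.
report = Lazy(lambda: expensive("report"))
unused = Lazy(lambda: expensive("unused"))

result = report.force()  # only "report" is computed
report.force()           # a second demand reuses the stored value
```

Because unused is never forced, its computation never runs, which is the "don't perform unnecessary work" property.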
Desiderata: runtime
Interpret programs directly “on the cloud” — and elastically.
Distribute tools via Docker images.
Cache all costly reductions: top-down, and then bottom-up evaluation.
Use work stealing to distribute evaluation.
Authenticate the user once; bootstrap credentials to other resources.
Portable: multiple cloud providers, on-premises, laptop.
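One way to read "top-down, and then bottom-up" is sketched below in Python. It assumes each node's cache key can be derived from its operation name and its inputs' keys without running anything; the Node class, the evaluate function, and the in-memory dict standing in for the cache are all hypothetical, not Reflow's actual data structures.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Node:
    op: str                       # name of the operation (assumed to identify fn)
    fn: Callable[..., str]        # pure function of the dependency values
    deps: tuple["Node", ...] = ()

    def digest(self) -> str:
        # The key depends only on the operation and its inputs' keys, so it
        # can be computed without running anything.
        h = hashlib.sha256(self.op.encode())
        for d in self.deps:
            h.update(d.digest().encode())
        return h.hexdigest()

def evaluate(node: Node, cache: dict[str, str], ran: list[str]) -> str:
    key = node.digest()
    if key in cache:              # top-down: a hit here skips the whole subtree
        return cache[key]
    args = [evaluate(d, cache, ran) for d in node.deps]  # bottom-up on a miss
    value = node.fn(*args)
    ran.append(node.op)
    cache[key] = value
    return value

reads = Node("reads", lambda: "ACGT")
aligned = Node("align", lambda r: r.lower(), (reads,))
report = Node("report", lambda a: f"report({a})", (aligned,))

cache: dict[str, str] = {}
first_run: list[str] = []
second_run: list[str] = []
out1 = evaluate(report, cache, first_run)   # runs reads, align, report
out2 = evaluate(report, cache, second_run)  # root is cached; nothing runs
```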
Hello world.
(Demo)
Code walkthrough.
(Demo)
Evaluation
Evaluation is parallelized where there are no data dependencies.
All costly reductions are memoized (e.g., to S3), and can be reused.
Lazy evaluation enhances reasoning and composability.
The net effect is incremental evaluation! We always compute the smallest
difference our semantics permit.
Example
func cleanup(data file) file = exec(…)
func analyze(data file) file = exec(…)
func merge(data [file]) file = exec(…)

val samples [file] = […]

val cleaned = [cleanup(s) | s <- samples]
val analyzed = [analyze(s) | s <- cleaned]
val merged = merge(analyzed)
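For readers who do not know the DSL, the same shape can be sketched in Python. The bodies of cleanup, analyze, and merge are placeholders for the elided exec(…) calls, and the sample strings are made up; the point is only where parallelism is available, not how Reflow schedules it.

```python
from concurrent.futures import ThreadPoolExecutor

def cleanup(data: str) -> str:
    # Placeholder for exec(…): strip whitespace.
    return data.strip()

def analyze(data: str) -> str:
    # Placeholder for exec(…): report the length of the cleaned data.
    return f"{data}:{len(data)}"

def merge(data: list[str]) -> str:
    # Placeholder for exec(…): join the per-sample results.
    return ",".join(data)

def per_sample(sample: str) -> str:
    # analyze depends on cleanup for the same sample, so these run in order.
    return analyze(cleanup(sample))

samples = [" aa ", "bbb ", " c"]

# No sample depends on another, so the per-sample chains can run concurrently.
with ThreadPoolExecutor() as pool:
    analyzed = list(pool.map(per_sample, samples))  # map preserves input order

# merge depends on every analyzed value, so it runs last.
merged = merge(analyzed)
```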
Runtime
[Figure: runtime architecture. Evaluator (e.g., your laptop); Cache repository (e.g., S3); Assoc (e.g., DynamoDB); Worker 1 and Worker 2, each with dockerd, alloc 1, a repository, exec 1, and exec 2.]
Work stealing
[Figure: the evaluator holds a todo list (task 1 through task 4). A primary alloc and a stealer alloc each have a repository and dockerd; task 1 and task 2 are shown with the primary alloc, task 3 and task 4 with the stealer alloc.]
Work stealing
[Figure: evaluator, cache repository, primary alloc (repository, dockerd, task 1, task 2), and stealer alloc (repository, dockerd), with numbered steps:]
1. lease task
2. transfer dependent objects
3. perform work
4. transfer results
5. return task
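The lease protocol above can be sketched in Python as a simplified pull model: allocs take tasks from the evaluator's todo list one at a time. The Alloc class, the evaluate function, and the round-robin loop (which stands in for real concurrency) are hypothetical; this is not a full work-stealing scheduler and not Reflow's implementation.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Alloc:
    name: str
    done: list[str] = field(default_factory=list)

    def run(self, task: str, cache: dict[str, str]) -> None:
        # Steps 2-4 from the slide, collapsed: fetch inputs, do the work,
        # write the result to the shared cache.
        cache[task] = f"result of {task} on {self.name}"
        self.done.append(task)

def evaluate(tasks: list[str], allocs: list[Alloc]) -> dict[str, str]:
    todo = deque(tasks)           # the evaluator's todo list
    cache: dict[str, str] = {}    # stand-in for the cache repository
    while todo:
        for alloc in allocs:
            if not todo:
                break
            task = todo.popleft() # step 1: the alloc leases a task
            alloc.run(task, cache)
            # step 5: the task is returned to the evaluator as complete
    return cache

primary = Alloc("primary")
stealer = Alloc("stealer")
results = evaluate(["task 1", "task 2", "task 3", "task 4"], [primary, stealer])
```

Which alloc ends up with which task depends on timing in a real system; the round-robin split here is only an artifact of the simulation.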
Runtime
Take advantage of computing semantics to simplify.
Fault tolerance: just restart.
Evaluator is completely stateless: state is computed from program+cache.
Apply end-to-end principle: evaluator maintains keep alive to allocs; restart
on failure.
Result: simple, robust, vertically integrated compute stack. (~30K LOC)
Computing model
Because of lazy evaluation, hierarchies can be introspected cheaply.
Because of caching, any change (e.g., parts of pipeline, set of samples,
etc.) is incrementally computed: subcomputations are reused automatically.
Since we’re just computing values (e.g., a tree of sample analyses), Reflow
can also replace the superstructures around scientific computation (e.g.,
management of what’s been run where, their statuses, etc.)
Because of referential transparency, the runtime is given wide latitude in
cache management, data movement, tradeoffs of compute vs. storage costs,
etc.
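A small Python sketch of the incremental behavior described above, assuming results are cached under a key derived from the tool name and the content of its input. The memo_exec and pipeline helpers, the tool names, and the sample names are hypothetical.

```python
import hashlib

cache: dict[str, str] = {}
executed: list[str] = []

def memo_exec(tool: str, data: str) -> str:
    # Key on the tool name and the content of its input, not on file names.
    key = hashlib.sha256(f"{tool}\0{data}".encode()).hexdigest()
    if key not in cache:
        executed.append(f"{tool}({data})")
        cache[key] = f"{tool}<{data}>"   # stand-in for running the tool
    return cache[key]

def pipeline(samples: list[str]) -> list[str]:
    return [memo_exec("analyze", memo_exec("cleanup", s)) for s in samples]

pipeline(["s1", "s2"])
before = len(executed)            # 4 executions: 2 tools × 2 samples
pipeline(["s1", "s2", "s3"])      # only s3 is new
new_work = executed[before:]
```

Adding a sample re-runs nothing for s1 and s2; only the two steps for s3 execute.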
Versioning and reproducibility
Hermetic and narrow environments help reproducibility.
We can now use ordinary source control to maintain versions.
% reflow ls -l root.rf/…
…
% git checkout v1.1 -- root.rf
% reflow ls -l root.rf/…
# different results
Status
Open sourced on 10/26! https://github.com/grailbio/reflow
Inside of GRAIL, we use Reflow for all bioinformatics computation and for lots
of ad-hoc computing (e.g., building models, running one-off workflows,
exploratory analyses, etc.)
Still has a few kinks, but the model works well.
We have definitely untied the hands of the implementor.
Thanks.
Marius Eriksen
marius@grail.com
@marius
github.com/grailbio/reflow

2017 nov reflow sbtb

  • 1. Cloud Science Marius Eriksen, GRAIL Scale by the Bay, November 2017
  • 2.
  • 3. How? Sequence cell-free DNA from blood, at high depth. (Oversample.) Analyze these data to look for cancers: bioinformatics, statistics, machine learning. Up to 1TB of input data (raw sequence reads) per sample.
  • 4. Bioinformatics Build software tools to analyze sequencing data. Interdisciplinary: data structures, algorithms, biology, statistics, mathematics. Classic example: sequence alignment.
  • 5. A (simple) bioinformatics workflow [Diagram: FASTQ 1 … FASTQ N, Align, BAM, Merge, then four Dupmark, Filter, Bincounts chains, and QC]
  • 6. A (simple) bioinformatics workflow [Diagram: FASTQ 1 … FASTQ N, Align, BAM, Merge, four Dupmark, Filter, Bincounts chains, and QC, plus an external reference, Model 1 and Model 2, and two Filter, Infer, Report chains]
  • 7. Cross-sample analysis [Diagram: three copies of the per-sample workflow (FASTQ 1 … FASTQ N, Align, BAM, Merge, Dupmark, Filter, Bincounts, QC, external reference, Model 1, Model 2, Filter, Infer, Report), with Classifier 1, Classifier 2, Classifier 3]
  • 8. Computing infrastructure HPC is typical: very expensive filer, expensive compute nodes, data shared through a network file system. Or just single-node computation on large machines. Workqueue systems: Sun Grid Engine is popular. None of this is cloud friendly. In the cloud, storage goes through blobstores (S3, GCS, etc.), compute is elastic with on-demand capacity, and you can’t assume much about your environment.
  • 9. Workflow frameworks Most bioinformatics computing is done through a workflow framework. There’s a cottage industry of these. Most of them are low level: they require the user to construct an explicit graph of execution nodes + dependencies, and they assume nothing about the data model. The result is cumbersome, difficult to compose, and ties the hands of the implementor.
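To make "explicit graph of execution nodes + dependencies" concrete, here is a minimal sketch in Python using only the standard library. It is not the API of any particular workflow framework; the step names (`align`, `merge`, `report`) and the `run` helper are hypothetical, chosen only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical three-step pipeline; each step just records that it ran.
def align(log): log.append("align")
def merge(log): log.append("merge")
def report(log): log.append("report")

# The user spells out every node and every edge by hand:
# each key depends on the steps listed in its set.
graph = {
    "align": set(),
    "merge": {"align"},
    "report": {"merge"},
}
steps = {"align": align, "merge": merge, "report": report}

def run(graph, steps):
    """Run each step after its dependencies, in topological order."""
    log = []
    for name in TopologicalSorter(graph).static_order():
        steps[name](log)
    return log

order = run(graph, steps)
```

Note that the graph says nothing about what data flows between steps, which is the point the slide makes: the dependency structure is maintained separately from the computation itself.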
  • 10. Example: Apache Airflow “Hello world” example from Apache Airflow. Invokes two commands in bash. Is only involved in “orchestration”: distribution must be solved at another layer (e.g., by invoking Hadoop).
  • 11. What is Reflow? Basic idea: get rid of the notion of a “workflow”. These are just programs! Program them directly, applying the ideas of functional and data flow programming languages. Lazy evaluation makes it easier to reason about performance (yes, really), and simpler to compose. Type safety is useful. Define a data model (referential transparency) that gives the runtime a lot of leverage. Combine into a single, vertically integrated system that transparently parallelizes and distributes work across private, elastic clusters. (“Serverless”, “cloud native”.)
  • 12. Desiderata: DSL A simple, statically typed, functional language with Go-like syntax. (Quick familiarity.) Just enough power for “workflows”, no more. Compound data structures and composition (structs, lists, maps, comprehensions.) Referentially transparent, lazily evaluated. (Don’t perform unnecessary work.) Module system for reusability and testing. Self documenting.
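The "don't perform unnecessary work" property of lazy evaluation can be sketched in Python, which is eager by default, with explicit thunks. This is only an illustration of the idea, not how Reflow implements it; the `Lazy` class, `expensive`, and `report` are hypothetical names.

```python
class Lazy:
    """A value computed at most once, and only when something asks for it."""

    def __init__(self, fn):
        self._fn = fn
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._fn()
            self._done = True
        return self._value

calls = []

def expensive(name):
    calls.append(name)  # record that real work happened
    return f"{name}-result"

# Two candidate analyses are described, but nothing runs yet.
qc = Lazy(lambda: expensive("qc"))
full = Lazy(lambda: expensive("full"))

def report(use_full):
    # Only the branch that is actually needed gets forced.
    return full.force() if use_full else qc.force()

result = report(use_full=False)
```

After the call, `calls` contains only `"qc"`: the `full` analysis was defined but never executed, and forcing `qc` again would not rerun it.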
  • 13. Desiderata: runtime Interpret programs directly “on the cloud”, and elastically. Distribute tools via Docker images. Cache all costly reductions: top-down, and then bottom-up evaluation. Use work stealing to distribute evaluation. Authenticate the user once; bootstrap credentials to other resources. Portable: multiple cloud providers, on-premises, laptop.
  • 15.
  • 17. Evaluation Evaluation is parallelized where there are no data dependencies. All costly reductions are memoized (e.g., to S3), and can be reused. Lazy evaluation enhances reasoning and composability. The net effect is incremental evaluation! We always compute the smallest difference our semantics permit.
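A rough Python sketch of how memoizing each step yields incremental evaluation. The assumptions are loud ones: an in-memory dict stands in for the cache (e.g., S3), cache keys are derived from the operation name and the input values themselves, and `digest`, `memo`, and `pipeline` are hypothetical names. It shows only per-step memoization, not the top-down lookup mentioned earlier in the talk.

```python
import hashlib

cache = {}  # stands in for a blobstore-backed cache such as S3
runs = []   # records which steps actually executed

def digest(*parts):
    """Stable key derived from the operation name and its inputs."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p.encode())
        h.update(b"\0")
    return h.hexdigest()

def memo(op, fn, *inputs):
    """Run fn(*inputs) unless an identical (op, inputs) call is cached."""
    key = digest(op, *inputs)
    if key not in cache:
        runs.append(op + ":" + ",".join(inputs))
        cache[key] = fn(*inputs)
    return cache[key]

def pipeline(samples):
    cleaned = [memo("cleanup", lambda s: s.strip().upper(), s) for s in samples]
    return memo("merge", lambda *xs: "+".join(xs), *cleaned)

first = pipeline(["acgt ", "ttga"])
runs_after_first = list(runs)

# Add one sample: only its cleanup and the final merge need to run.
second = pipeline(["acgt ", "ttga", "ggcc"])
new_runs = runs[len(runs_after_first):]
```

The second run reuses both earlier `cleanup` results, so `new_runs` holds just the cleanup of the new sample and the new merge.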
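  • 18. Example
 // Each function wraps an external tool via exec; the bodies are elided on the slide.
 func cleanup(data file) file = exec(…)
 func analyze(data file) file = exec(…)
 func merge(data [file]) file = exec(…)
 
 val samples [file] = […]
 
 // Comprehensions apply each step per sample; merge consumes the whole list.
 val cleaned = [cleanup(s) | s <- samples]
 val analyzed = [analyze(s) | s <- cleaned]
 val merged = merge(analyzed)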
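The dependency structure of this example can be mimicked in Python to show where parallelism is available. This is a sketch under assumptions: the function bodies are toy stand-ins for the `exec(…)` calls elided on the slide, and a thread pool plays the part of the runtime that Reflow provides transparently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the exec(…) bodies elided on the slide.
def cleanup(data: str) -> str:
    return data.strip()

def analyze(data: str) -> int:
    return len(data)

def merge(data: list) -> int:
    return sum(data)

samples = [" ACGT", "TT ", "GATTACA"]

with ThreadPoolExecutor() as pool:
    # Each sample's cleanup -> analyze chain is independent of the others,
    # so the chains can run concurrently; merge needs all of them.
    chains = [pool.submit(lambda s=s: analyze(cleanup(s))) for s in samples]
    analyzed = [c.result() for c in chains]

merged = merge(analyzed)
```

Here the parallelism has to be requested by hand. In the Reflow version the same structure follows from the data dependencies alone.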
  • 19. Runtime [Diagram: evaluator (e.g., your laptop); cache repository (e.g., S3); assoc (e.g., DynamoDB); Worker 1 and Worker 2, each labeled with dockerd, alloc 1, repository, exec 1, exec 2]
  • 20. Work stealing [Diagram labels: evaluator with a todo list (task 1 to task 4); primary alloc (repository, dockerd); stealer alloc (repository, dockerd)]
  • 21. Work stealing [Diagram: evaluator; cache repository; primary alloc (repository, dockerd, task 1, task 2); stealer alloc (repository, dockerd)] Steps: 1. lease task; 2. transfer dependent objects; 3. perform work; 4. transfer results (shown on two arrows); 5. return task.
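The five steps on this slide can be walked through in a small single-process Python simulation. Everything here is an assumption made for illustration: the `Alloc` class, the task dictionary layout, and in particular the destinations in step 4 (the slide shows two "transfer results" arrows, and this sketch guesses they go to the cache repository and to the primary alloc).

```python
from collections import deque

class Alloc:
    """Hypothetical alloc: a worker with its own local object repository."""

    def __init__(self, name):
        self.name = name
        self.repository = {}

def steal(todo, primary, stealer, cache):
    """One round of the five steps on the slide, for a single task."""
    task = todo.popleft()                     # 1. lease task
    for dep in task["deps"]:                  # 2. transfer dependent objects
        stealer.repository[dep] = primary.repository[dep]
    inputs = [stealer.repository[d] for d in task["deps"]]
    result = task["fn"](*inputs)              # 3. perform work
    cache[task["name"]] = result              # 4. transfer results (assumed targets)
    primary.repository[task["name"]] = result
    task["done_by"] = stealer.name            # 5. return task
    return task

primary, stealer = Alloc("primary"), Alloc("stealer")
primary.repository["reads"] = "ACGTACGT"
cache = {}
todo = deque([{"name": "count", "deps": ["reads"], "fn": len}])

done = steal(todo, primary, stealer, cache)
```

A real runtime would also handle lease expiry and failures; the sketch covers only the happy path for one task.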
  • 22. Runtime Take advantage of computing semantics to simplify. Fault tolerance: just restart. The evaluator is completely stateless: its state is computed from program + cache. Apply the end-to-end principle: the evaluator maintains keep-alives to allocs; restart on failure. Result: a simple, robust, vertically integrated compute stack. (~30K LOC)
  • 23. Computing model Because of lazy evaluation, hierarchies can be introspected cheaply. Because of caching, any change (e.g., parts of pipeline, set of samples, etc.) is incrementally computed: subcomputations are reused automatically. Since we’re just computing values (e.g., a tree of sample analyses), Reflow can also replace the superstructures around scientific computation (e.g., management of what’s been run where, their statuses, etc.) Because of referential transparency, the runtime is given wide latitude in cache management, data movement, tradeoffs of compute vs. storage costs, etc.
  • 24. Versioning and reproducibility Hermetic and narrow environments help reproducibility. We can now use ordinary source control to maintain versions.
 % reflow ls -l root.rf/…
 …
 % git checkout v1.1 -- root.rf
 % reflow ls -l root.rf/…
 # different results
  • 25. Status Open sourced on 10/26! https://github.com/grailbio/reflow Inside of GRAIL, we use Reflow for all bioinformatics computation and for lots of ad-hoc computing (e.g., building models, running one-off workflows, exploratory analyses, etc.). It still has a few kinks, but the model works well. We have definitely untied the hands of the implementor.