Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Weaving Dataflows with Silk
Taro L. Saito
Treasure Data, Inc.
leo@xerial.org

September 6th, 2014
ScalaMatsuri @ Tokyo
xerial.org/silk

About Me

Treasure Data Console

Processing Job Table

Functional Style Writing

Need an Optimization?

Procedural Style Writing
l Describes How to Process Data.
xerial.org/silk7

Declarative Style Writing
l Less programming
l System decides how to optimize the code
l Hash joins, bloom filters and various optimization techniques are
now available.
xerial.org/silk8

Weaving Silk
[Diagram: Silk[A] (operation DAG) -> Weave (Execute) -> Result (Silk Product). Available weavers: in-memory weaver, cluster weaver (Spark?), MapReduce weaver, your own weaver (using TD?)]
• Making data processing code independent from the execution method!
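The separation between the operation DAG and its execution can be sketched as follows. This is a minimal illustration, not Silk's actual API: `FromSeq`, `FilterOp`, the `Weaver` trait and `InMemoryWeaver` are hypothetical names, and only the in-memory case is shown.

```scala
object WeaverSketch {
  // A tiny operation DAG: each node says what to compute, not how.
  sealed trait Silk[A]
  final case class FromSeq[A](data: Seq[A]) extends Silk[A]
  final case class MapOp[A, B](input: Silk[A], f: A => B) extends Silk[B]
  final case class FilterOp[A](input: Silk[A], p: A => Boolean) extends Silk[A]

  // A weaver decides how a Silk is executed (hypothetical interface).
  trait Weaver {
    def weave[A](silk: Silk[A]): Seq[A]
  }

  // Simplest possible weaver: evaluate the whole DAG in the local JVM.
  object InMemoryWeaver extends Weaver {
    def weave[A](silk: Silk[A]): Seq[A] = silk match {
      case FromSeq(data)      => data
      case MapOp(input, f)    => weave(input).map(f)
      case FilterOp(input, p) => weave(input).filter(p)
    }
  }

  // The dataflow itself never mentions a weaver.
  val evens: Silk[Int] =
    FilterOp(MapOp(FromSeq(Seq(1, 2, 3, 4)), (x: Int) => x + 1), (x: Int) => x % 2 == 0)
}
```

A cluster or MapReduce weaver would implement the same `weave` method, so `evens` could be handed to any of them unchanged.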

Cluster Weaver: Logical Plan to Physical Plan on Cluster
l Physical plan on cluster
xerial.org/silk10
I1
I2
I3
P1
P2
P3
P1
P2
P3
P1
P2
P3
S1
S2
S3
S1
S2
S3
S1
S2
S3
R1
S1
S1
S1
S2
S2
S2
S3
S3
S3
P1
P1
P1
P2
P2
P2
P3
P3
P3
R2
R3
Partition
(hashing)
serializationshuffledeserializationmerge sort
Silk[people]
Scatter
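The partition, shuffle and merge-sort stages of such a physical plan can be sketched on a single machine as below. This is only an illustration under simplifying assumptions: keys are plain `Int` values, serialization and network transfer are skipped, and the names `partitionOf`, `merge` and `shuffle` are hypothetical.

```scala
object ShuffleSketch {
  // "Partition (hashing)": choose a target partition from the key's hash.
  def partitionOf(key: Any, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Merge two sorted runs into one sorted run (the merge step of merge sort).
  def merge[A](xs: List[A], ys: List[A])(implicit ord: Ordering[A]): List[A] = {
    @annotation.tailrec
    def loop(l: List[A], r: List[A], acc: List[A]): List[A] = (l, r) match {
      case (Nil, _) => acc.reverse ::: r
      case (_, Nil) => acc.reverse ::: l
      case (a :: at, b :: bt) =>
        if (ord.lteq(a, b)) loop(at, r, a :: acc) else loop(l, bt, b :: acc)
    }
    loop(xs, ys, Nil)
  }

  // For each partition p: take the p-th bucket of every input block, sort it
  // locally, then merge the sorted runs into one result block.
  def shuffle(inputs: Seq[Seq[Int]], numPartitions: Int): Seq[List[Int]] =
    (0 until numPartitions).map { p =>
      val runs = inputs.map(block => block.filter(x => partitionOf(x, numPartitions) == p).sorted.toList)
      runs.foldLeft(List.empty[Int])(merge(_, _))
    }
}
```

With three input blocks and three partitions, `shuffle` returns three sorted result blocks, one per partition.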

DAG-based Data Processing Engines
l Spark
l Creates a task schedule for distributed processing
l Summingbird
l Integrates stream and batch data processing
l e.g. Running Scalding and Storm at the same time
l Apache Tez
l Creates a dag schedule for optimizing MapReduce pipelines
l GNU Makefile
l Describes a pipeline of UNIX commands
Why do we need another framework?
xerial.org/silk11

Challenge: Isolate Code Writing and Its Execution
[Diagram: Silk[A] (operation DAG) -> Weave -> weaver -> Result, Result, Result]
• Why can't we run the program until we finish writing it?
• How can we depart from the compile-then-run paradigm?

l W
xerial.org/silk13

Genome Science is A Big Data Science
l By sequencing, we can find 3 millions of SNPs for each person
l To find the cause of disease (one or a few SNPs), we need to sequence as many samples as possible for
narrowing down the candidate SNPs

l Input: FASTQ file(s) 500GB (50x coverage, 200 million entries)
l DNA Sequencer (Illumina, PacBio, etc.)
l f: An alignment program
l Output: Alignment results 750GB (sequence + alignment data)
l Total storage space required: 1.2TB
Output
f
Input
University of Tokyo Genome Browser (UTGB)
xerial.org/silk14

Human Genome Data Processing Workflows in Silk
l c”(UNIX Command)”
xerial.org/silk15

Human Genome Data Processing Workflows
l Makefile: The result ($@) is stored into a file
l Silk: The result is stored in variable
l Computation of each command may take 1 or more hours
xerial.org/silk16
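The c"(UNIX Command)" notation can be built with Scala's standard `StringContext` extension mechanism. The sketch below is an assumption about how it could look, not Silk's implementation: `CommandOp` and `CommandHelper` are hypothetical names, and the node only records the command line instead of running it.

```scala
object CommandSketch {
  // Hypothetical stand-in for a command node: it only records the command line.
  final case class CommandOp(commandLine: String)

  // Adds a c"..." interpolator, so a UNIX command becomes a value bound to a variable.
  implicit class CommandHelper(val sc: StringContext) {
    def c(args: Any*): CommandOp = CommandOp(sc.s(args: _*))
  }

  val input = "sample.fastq"
  // The result is referred to by the variable `count`, not by an output file name.
  val count: CommandOp = c"wc -l $input"
}
```

Because the command is an ordinary value, later steps can depend on `count` the same way a Makefile rule depends on a target file.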

SBT: A Good Hint
l SBT
l Supports incremental
compilation and testing
l sbt ~∼test-‐‑‒only
l Monitor source code change
l Running specific tests
l sbt ~∼test-‐‑‒quick
l Running failed tests only

A
fB
C
g
D
E
F
G
l How do we compute the not-‐‑‒yet started part of a Scala
program?
l We need to know:
l A-‐‑‒B and D-‐‑‒E are running
l If B is finished, we can start B-‐‑‒C
xerial.org/silk17
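The bookkeeping described above (which edges are running, which can start) can be sketched as a small function. The names `Edge` and `startable` are hypothetical, and only the three edges named on the slide are modeled.

```scala
object ReadySketch {
  // An edge such as A-B means "compute `to` from `from`".
  final case class Edge(from: String, to: String)

  // Edges that may start now: the input is finished, the output is not,
  // and the edge is not already running.
  def startable(edges: Seq[Edge], finished: Set[String], running: Set[Edge]): Seq[Edge] =
    edges.filter(e => finished(e.from) && !finished(e.to) && !running(e))

  val ab = Edge("A", "B")
  val bc = Edge("B", "C")
  val de = Edge("D", "E")
  val edges = Seq(ab, bc, de)
}
```

While A-B and D-E are running nothing else can start; once B is finished, B-C becomes startable.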

Writing A Dataflow
l Apply function f to the input A, then produce the output B
l This step may take more than 1 hours in big data analysis

18
A
B
f
val B = A.map(f)

xerial.org/silk
a
Program v1

Distribution and Recovery
l Resume only B2 = A2.map(f)
xerial.org/silk19
A0
A1
A2
B1
B2
f
B0
Failure!
A
B
f
a
Program v1
Retry
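Resuming only the failed partition can be sketched with `scala.util.Try`. This is an illustration under the assumption that partitions are independent; `runAll` and `retryFailed` are hypothetical helpers, not Silk functions.

```scala
object RecoverySketch {
  import scala.util.{Failure, Try}

  // Apply f to every partition independently and keep each outcome.
  def runAll[A, B](parts: Map[Int, Seq[A]], f: A => B): Map[Int, Try[Seq[B]]] =
    parts.map { case (i, data) => i -> Try(data.map(f)) }

  // Recompute only the partitions whose previous attempt failed;
  // successful partitions are kept as they are.
  def retryFailed[A, B](
      parts: Map[Int, Seq[A]],
      previous: Map[Int, Try[Seq[B]]],
      f: A => B
  ): Map[Int, Try[Seq[B]]] =
    previous.map {
      case (i, Failure(_)) => i -> Try(parts(i).map(f))
      case ok              => ok
    }
}
```

If partition 2 failed in the first run, `retryFailed` evaluates only A2.map(f) and leaves B0 and B1 untouched.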

Extending Dataflows
• While running program v1, we may want to add more code (program v2)
• We need to know that variable B is now being processed
[Diagram: Program v1: A -> f -> B; Program v2 adds B -> g -> C]

Labeling Program with Variable Names
• Storing intermediate results using variable names
  • variable names := program markers
• But we lose the variable names after compilation
• Solution: Extract variable names from the AST at compile time
  • Using Scala Macros (since Scala 2.10)
[Diagram: Program v1: A -> f -> B; Program v2 adds B -> g -> C]
val B = A.map(f)
val C = B.map(g)

Scala Program (AST) to DAG Schedule (Logical Plan)
• Translate a program into a set of Silk operation objects
  • val B = MapOp(input:A, output:"B", function:f)
  • val C = MapOp(input:B, output:"C", function:g)
• Operations in Silk form a DAG
  • val C = MapOp(input:MapOp(input:A, output:"B", function:f), output:"C", function:g)
[Diagram: Program v1: A -> f -> B; Program v2 adds B -> g -> C]
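The MapOp notation above is slide shorthand. A compilable approximation is sketched below, assuming a hypothetical `Source` node for the input and a `lineage` helper that reads the labels back out of the DAG; the field names follow the slide (input, output, function).

```scala
object DagSketch {
  sealed trait Silk[A] { def output: String }
  final case class Source[A](output: String, data: Seq[A]) extends Silk[A]
  final case class MapOp[In, Out](input: Silk[In], output: String, function: In => Out)
      extends Silk[Out]

  // Walk from the final operation back to the source, listing labels in execution order.
  def lineage(silk: Silk[_]): List[String] = silk match {
    case Source(name, _)       => List(name)
    case MapOp(input, name, _) => lineage(input) :+ name
  }

  val A: Silk[Int]    = Source("A", Seq(1, 2, 3))
  val B: Silk[Int]    = MapOp(A, "B", (x: Int) => x * 10)
  val C: Silk[String] = MapOp(B, "C", (x: Int) => s"v$x")
}
```

Here the output labels are written by hand; the slides describe a macro that fills them in from the variable names.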

Using Scala Macros
l Produce operation objects with Scala Macros
l map(f:A=B) produces MapOp[A, B](…)
l Why do we need to use Macro here?
l To extract FContext (target variable name, enclosing method, class,
etc.) from AST.
xerial.org/silk23
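To make the role of FContext concrete, the sketch below shows the kind of record a macro could capture at the call site. The field names and the `label` format are assumptions for illustration, and the context is passed by hand here because a real Scala 2 macro must live in a separately compiled module.

```scala
object FContextSketch {
  // Hypothetical shape of the call-site context; the field names are assumptions.
  final case class FContext(
      enclosingClass: String,
      enclosingMethod: Option[String],
      targetVariable: Option[String]
  ) {
    // One possible label for naming intermediate data (an assumed format).
    def label: String =
      (Seq(enclosingClass) ++ enclosingMethod ++ targetVariable).mkString(".")
  }

  final case class MapOp[A, B](context: FContext, function: A => B)

  // Without a macro the caller must spell out what the compiler already knows.
  val B: MapOp[Int, Int] =
    MapOp(FContext("MyWorkflow", Some("run"), Some("B")), (x: Int) => x + 1)
}
```

A macro-based `map` would construct the same `FContext` automatically from the AST, so user code stays as short as val B = A.map(f).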

l s
xerial.org/silk24

Extract target variable name and enclosing method

Finding Target Variable

• Translate a program into a set of Silk operation objects
  • val B = MapOp(input:A, output:"B", function:f)
  • val C = MapOp(input:B, output:"C", function:g)
• Silk uses these variable names to store the intermediate data
[Diagram: Program v1: A -> f -> B; Program v2 adds B -> g -> C]
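Storing intermediate data under variable names is what lets program v2 reuse the result of program v1. A minimal sketch, assuming an in-memory store and a hypothetical `CachingWeaver` (Silk's real storage is not shown in the slides):

```scala
object StoreSketch {
  import scala.collection.mutable

  sealed trait Silk[A] { def output: String }
  final case class Source[A](output: String, data: Seq[A]) extends Silk[A]
  final case class MapOp[In, Out](input: Silk[In], output: String, function: In => Out)
      extends Silk[Out]

  // A weaver that remembers finished results under their variable names.
  final class CachingWeaver {
    private val store = mutable.Map.empty[String, Seq[Any]]
    // Names that were actually computed (not served from the store), oldest first.
    var evaluated: List[String] = Nil

    def weave[A](silk: Silk[A]): Seq[A] = store.get(silk.output) match {
      case Some(done) => done.asInstanceOf[Seq[A]] // already finished: reuse it
      case None =>
        val result: Seq[A] = silk match {
          case Source(_, data)     => data
          case MapOp(input, _, fn) => weave(input).map(fn)
        }
        store(silk.output) = result
        evaluated = evaluated :+ silk.output
        result
    }
  }
}
```

Weaving B first and then C computes B only once; the second call finds "B" in the store and only evaluates g.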

l Silk defines various types of operations
xerial.org/silk28

Object-Oriented Dataflow Programming
l Reusing and overriding dataflows
xerial.org/silk29
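Reusing and overriding dataflows can be illustrated with ordinary Scala inheritance. The sketch uses plain `Seq` values instead of Silk operations, and the class and step names are hypothetical.

```scala
object WorkflowSketch {
  // A workflow is a class whose members are dataflow steps.
  // Steps are `def`s so that subclasses can override them.
  class Pipeline {
    def input: Seq[Int]   = Seq(1, 2, 3)
    def cleaned: Seq[Int] = input.filter(_ > 1)
    def result: Int       = cleaned.sum
  }

  // Reuse the whole pipeline but replace a single step.
  class StrictPipeline extends Pipeline {
    override def cleaned: Seq[Int] = input.filter(_ > 2)
  }
}
```

`result` in `StrictPipeline` picks up the overridden `cleaned` step without any other change to the dataflow.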

Summary
[Diagram: Silk[A] (operation DAG) -> Weave -> weaver (cluster weaver) -> Result, Result, Result]
• Declarative-style coding is necessary for creating a DAG schedule
• DAG schedules are labeled with variable names using Scala Macros
• Weaver: An abstraction of how to execute the code
• The weaver manages the running and finished parts of the code

http://xerial.org/silk

[Architecture diagram]
Silk[A]: weaving Silk materializes objects
• SilkSingle[A] -> weave -> A (single object)
• SilkSeq[A] -> weave -> Seq[A] (sequence of objects)
• Targets: local machine, cluster
User program builds workflows
• read file, toSilk
• map, reduce, join, groupBy
• UNIX commands
• etc.
Static optimization -> DAG Schedule
• Register ClassBox
• Submit schedule
Local ClassBox: classpaths, local jar files
Silk Master
• Dispatches tasks to clients
• Manages master resource table
• Authorizes resource allocation
• Automatic recovery by leader election in ZK
ZooKeeper: ensemble mode (at least 3 ZK instances)
Tables: Resource Table (CPU, memory), Node Table, Slice Table, Task Status, ClassBox Table
Silk Client (two shown): Task Scheduler, Task Executor, Resource Monitor, Data Server
Responsibilities listed in the diagram:
• Leader election
• Collects locations of slices and ClassBox jars
• Watches active nodes
• Watches available resources
• Submits tasks
• Run-time optimization
• Resource allocation
• Monitoring resource usage
• Launches Web UI
• Manages assigned task status
• Object serialization/deserialization
• Serves slice data

Integrating Varieties of Data Sources
l WormTSS: http://wormtss.utgenome.org/
l Integrating various data sources
xerial.org/silk34

Varieties of Data Analysis
Using R, JFreeChart, etc.
We need an automated pipeline to redo the entire analysis in order to answer the paper review within a month.

Makefile
l Describes dependencies of commands through files
l Good: We can resume and update the data flow processing
l Bad: Makefile of WormTSS analysis exceeds 1,000 lines
36

Splitting Data Analysis Into Command Modules
l Added a new command as we needed a new analysis and data processing
l The result:
l hundreds of commands!
l # of files limits the parallelism
37
