Silk is a framework for building dataflows in Scala. In Silk, users write data-processing code with collection operators (e.g., map, filter, reduce, join). Silk uses Scala macros to construct a DAG of dataflows whose nodes are annotated with the variable names used in the program. By using these variable names as markers in the DAG, Silk can interrupt and resume dataflows and query intermediate data. By separating the description of a dataflow from its computation, Silk lets us switch executors, called weavers, between in-memory and cluster computing without modifying the code. In this talk, we will show how Silk helps you run data-processing pipelines as you write the code.
7. Procedural Style Writing
Weaving Dataflows with Silk
• Describes how to process data.
xerial.org/silk
8. Declarative Style Writing
• Less programming
• The system decides how to optimize the code
• Hash joins, Bloom filters, and various other optimization techniques become available.
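To make the contrast between the two styles concrete, here is a small sketch that uses plain Scala collections rather than Silk's API. The `Person` record, the sample data, and both method names are made up for illustration.

```scala
object StyleComparison {
  // Hypothetical record type, used only for this illustration.
  final case class Person(name: String, age: Int)

  val people: Seq[Person] =
    Seq(Person("Ann", 34), Person("Bob", 17), Person("Cy", 52))

  // Procedural style: the code spells out how to process the data,
  // step by step, with an explicit loop and mutable state.
  def adultNamesProcedural(input: Seq[Person]): Seq[String] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[String]
    var i = 0
    while (i < input.length) {
      if (input(i).age >= 18) out += input(i).name
      i += 1
    }
    out.toList
  }

  // Declarative style: the code states what to compute with collection
  // operators, which leaves an engine free to choose how to run them.
  def adultNamesDeclarative(input: Seq[Person]): Seq[String] =
    input.filter(_.age >= 18).map(_.name)
}
```

Both methods return the same names; only the declarative version exposes its structure (a filter followed by a map) to whatever system executes it.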
9. Weaving Silk
• Making data-processing code independent of the execution method!
[Figure: Silk[A] (operation DAG) → Weave (Execute) → Silk Product (Result). Weavers shown: in-memory weaver, cluster weaver (Spark?), MapReduce weaver, your own weaver (using TD?)]
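The separation between a dataflow description and its execution can be sketched as follows. This is a minimal illustration, not Silk's implementation: the `Weaver` trait, its `runMap`/`runFilter` methods, the `Source`, `MapOp`, and `FilterOp` nodes, and `ChunkedWeaver` (a toy stand-in for a cluster weaver) are all assumptions made for the example.

```scala
object WeaverSketch {
  // How each operation is executed. Swapping the weaver changes the
  // execution method without touching the dataflow description.
  trait Weaver {
    def runMap[A, B](input: Seq[A], f: A => B): Seq[B]
    def runFilter[A](input: Seq[A], p: A => Boolean): Seq[A]
  }

  // A Silk[A] value only describes a computation (a node of the operation DAG).
  sealed trait Silk[A] {
    def map[B](f: A => B): Silk[B] = MapOp(this, f)
    def filter(p: A => Boolean): Silk[A] = FilterOp(this, p)
    def weave(w: Weaver): Seq[A]
  }
  final case class Source[A](data: Seq[A]) extends Silk[A] {
    def weave(w: Weaver): Seq[A] = data
  }
  final case class MapOp[A, B](input: Silk[A], f: A => B) extends Silk[B] {
    def weave(w: Weaver): Seq[B] = w.runMap(input.weave(w), f)
  }
  final case class FilterOp[A](input: Silk[A], p: A => Boolean) extends Silk[A] {
    def weave(w: Weaver): Seq[A] = w.runFilter(input.weave(w), p)
  }

  // Runs every operation directly on local collections.
  object InMemoryWeaver extends Weaver {
    def runMap[A, B](input: Seq[A], f: A => B): Seq[B] = input.map(f)
    def runFilter[A](input: Seq[A], p: A => Boolean): Seq[A] = input.filter(p)
  }

  // Toy stand-in for a cluster weaver: processes the input in fixed-size
  // chunks, as separate workers might, then concatenates the partial results.
  final class ChunkedWeaver(chunkSize: Int) extends Weaver {
    def runMap[A, B](input: Seq[A], f: A => B): Seq[B] =
      input.grouped(chunkSize).flatMap(_.map(f)).toList
    def runFilter[A](input: Seq[A], p: A => Boolean): Seq[A] =
      input.grouped(chunkSize).flatMap(_.filter(p)).toList
  }

  // The dataflow is written once and contains no execution logic.
  val flow: Silk[Int] = Source(Seq(1, 2, 3, 4, 5)).map(_ * 10).filter(_ > 20)
}
```

Calling `WeaverSketch.flow.weave(WeaverSketch.InMemoryWeaver)` and `WeaverSketch.flow.weave(new WeaverSketch.ChunkedWeaver(2))` runs the same description with two different execution strategies.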
10. Cluster Weaver: Logical Plan to Physical Plan on Cluster
• Physical plan on cluster
[Figure: physical plan for Silk[people]. Labels: Scatter; inputs I1 to I3; Partition (hashing) into P1 to P3; S1 to S3; serialization, shuffle, deserialization, merge sort; results R1 to R3]
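One possible reading of this physical plan, sketched on local collections: scatter the input into slices, hash-partition each slice, sort each partition locally, then shuffle so that the same partition from every slice meets at one destination and is merged. Serialization and the network transfer are omitted, and every name here (`scatter`, `partition`, `merge`, `run`) is an assumption for illustration, not Silk's API.

```scala
object PhysicalPlanSketch {
  // Scatter: split the input into roughly equal slices (I1, I2, ...).
  def scatter[A](data: Seq[A], slices: Int): Seq[Seq[A]] = {
    val size = math.max(1, math.ceil(data.size.toDouble / slices).toInt)
    data.grouped(size).toList
  }

  // Partition (hashing): assign each record of one slice to a partition.
  def partition[A](slice: Seq[A], partitions: Int): Map[Int, Seq[A]] =
    slice.groupBy(a => math.floorMod(a.hashCode, partitions))

  // Merge two sorted runs into one sorted run.
  def merge[A](xs: List[A], ys: List[A])(implicit ord: Ordering[A]): List[A] = {
    @annotation.tailrec
    def loop(l: List[A], r: List[A], acc: List[A]): List[A] = (l, r) match {
      case (Nil, _) => acc.reverse ::: r
      case (_, Nil) => acc.reverse ::: l
      case (a :: lt, b :: rt) =>
        if (ord.lteq(a, b)) loop(lt, r, a :: acc) else loop(l, rt, b :: acc)
    }
    loop(xs, ys, Nil)
  }

  def run[A: Ordering](data: Seq[A], slices: Int, partitions: Int): Seq[List[A]] = {
    // Each slice partitions its records and sorts every partition locally.
    val sortedRuns: Seq[Map[Int, List[A]]] =
      scatter(data, slices).map { s =>
        partition(s, partitions).map { case (p, recs) => p -> recs.sorted.toList }
      }
    // Shuffle: partition p of every slice goes to the same destination,
    // where the sorted runs are merged into one result.
    (0 until partitions).map { p =>
      sortedRuns
        .map(_.getOrElse(p, Nil))
        .foldLeft(List.empty[A])((acc, sortedRun) => merge(acc, sortedRun))
    }
  }

  val result: Seq[List[Int]] = run(Seq(5, 3, 8, 1, 4, 7, 2, 6), slices = 3, partitions = 3)
}
```

Sorting before the shuffle means each destination only has to merge already-sorted runs, which is the usual reason a merge sort appears at the end of such a plan.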
11. DAG-based Data Processing Engines
• Spark
  • Creates a task schedule for distributed processing
• Summingbird
  • Integrates stream and batch data processing
  • e.g., running Scalding and Storm at the same time
• Apache Tez
  • Creates a DAG schedule for optimizing MapReduce pipelines
• GNU Makefile
  • Describes a pipeline of UNIX commands
Why do we need another framework?
12. Challenge: Isolate Code Writing and Its Execution
• Why can't we run the program until we finish writing it?
• How can we depart from the compile-then-run paradigm?
14. Genome Science is A Big Data Science
• By sequencing, we can find 3 million SNPs for each person
• To find the cause of a disease (one or a few SNPs), we need to sequence as many samples as possible to narrow down the candidate SNPs
• Input: FASTQ file(s), 500 GB (50x coverage, 200 million entries)
  • DNA sequencer (Illumina, PacBio, etc.)
• f: an alignment program
• Output: alignment results, 750 GB (sequence + alignment data)
• Total storage space required: 1.2 TB
[Figure: Input → f → Output]
University of Tokyo Genome Browser (UTGB)
15. Human Genome Data Processing Workflows in Silk
• c"(UNIX command)"
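A `c"..."` form like this can be built with Scala's custom string interpolators. The sketch below is an assumption about how such an interpolator might look, not Silk's actual definition: `CommandOp`, `CommandInterpolator`, and the file name are hypothetical.

```scala
object CommandSketch {
  // Hypothetical operation node that wraps a UNIX command line.
  final case class CommandOp(commandLine: String)

  // Adds a `c` string interpolator, so that c"..." builds a CommandOp
  // describing the command instead of running it immediately.
  implicit class CommandInterpolator(val sc: StringContext) {
    def c(args: Any*): CommandOp = CommandOp(sc.s(args: _*))
  }

  // Hypothetical reference genome file name, spliced into the command.
  val ref = "hg19.fa"
  val index: CommandOp = c"bwa index $ref"
}
```

Because the interpolator only returns a description, the command can become a node in the dataflow and be executed later by whichever weaver is chosen.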
16. Human Genome Data Processing Workflows
• Makefile: the result ($@) is stored in a file
• Silk: the result is stored in a variable
• Each command may take an hour or more to compute
17. SBT: A Good Hint
• SBT
  • Supports incremental compilation and testing
  • sbt ~test-only
    • Monitors source code changes
    • Runs specific tests
  • sbt ~test-quick
    • Runs only failed tests
[Figure: dataflow DAG over nodes A to G with functions f and g]
• How do we compute the not-yet-started part of a Scala program?
• We need to know:
  • A-B and D-E are running
  • Once B is finished, we can start B-C
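The bookkeeping described above can be sketched as a status table plus a readiness check. The `Task` and `Status` types and the `ready` function are assumptions for illustration, and the example graph only approximates the one in the figure.

```scala
object ScheduleSketch {
  sealed trait Status
  case object Pending extends Status
  case object Running extends Status
  case object Finished extends Status

  // A task computes `output` from `inputs`, e.g. Task("B", Set("A")) for A-B.
  final case class Task(output: String, inputs: Set[String])

  // Tasks that can start now: still pending, and every input is either
  // source data (produced by no task) or the output of a finished task.
  def ready(tasks: Seq[Task], status: Map[String, Status]): Seq[Task] = {
    val produced = tasks.map(_.output).toSet
    tasks.filter { t =>
      status.getOrElse(t.output, Pending) == Pending &&
        t.inputs.forall(in => !produced(in) || status.get(in).contains(Finished))
    }
  }

  // Example graph: A-B, B-C, D-E, E-F.
  val tasks: Seq[Task] = Seq(
    Task("B", Set("A")),
    Task("C", Set("B")),
    Task("E", Set("D")),
    Task("F", Set("E"))
  )

  // While A-B and D-E are running, nothing else can start.
  val whileRunning: Seq[Task] = ready(tasks, Map("B" -> Running, "E" -> Running))
  // Once B is finished, B-C becomes ready.
  val afterB: Seq[Task] = ready(tasks, Map("B" -> Finished, "E" -> Running))
}
```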
18. Writing A Dataflow
• Apply function f to the input A to produce the output B
• In big data analysis, this step may take more than an hour
val B = A.map(f)
[Figure: Program v1, a dataflow A → f → B]
19. Distribution and Recovery
• Resume only B2 = A2.map(f)
[Figure: Program v1 (A → f → B) split into partitions A0 to A2 and B0 to B2; labels "Failure!" and "Retry"]
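Partition-level recovery of this kind can be sketched with `scala.util.Try`: each partition records its own success or failure, and a retry touches only the failed ones. The function names, the `flaky` function that fails once, and the sample data are all made up for the example.

```scala
object RecoverySketch {
  import scala.util.{Failure, Try}

  // Apply f to every partition independently and record the outcome of each.
  def runAll[A, B](partitions: Map[Int, Seq[A]], f: A => B): Map[Int, Try[Seq[B]]] =
    partitions.map { case (i, part) => i -> Try(part.map(f)) }

  // Re-run only the partitions whose previous attempt failed.
  def retryFailed[A, B](
      partitions: Map[Int, Seq[A]],
      previous: Map[Int, Try[Seq[B]]],
      f: A => B
  ): Map[Int, Try[Seq[B]]] =
    previous.map {
      case (i, Failure(_)) => i -> Try(partitions(i).map(f))
      case done            => done
    }

  // Example function that fails once on the value 30, simulating a lost node.
  private var failuresLeft = 1
  var calls = 0
  val flaky: Int => Int = { x =>
    calls += 1
    if (x == 30 && failuresLeft > 0) {
      failuresLeft -= 1
      throw new RuntimeException("node lost")
    }
    x + 1
  }

  // A0, A1, A2
  val input: Map[Int, Seq[Int]] = Map(0 -> Seq(10), 1 -> Seq(20), 2 -> Seq(30))
  val firstAttempt: Map[Int, Try[Seq[Int]]] = runAll(input, flaky)
  val callsAfterFirst: Int = calls
  // Only B2 = A2.map(f) is recomputed here.
  val secondAttempt: Map[Int, Try[Seq[Int]]] = retryFailed(input, firstAttempt, flaky)
  val callsAfterRetry: Int = calls
}
```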
20. Extending Dataflows
• While program v1 is running, we may want to add more code (program v2)
• We need to know that variable B is currently being processed
[Figure: Program v1: A → f → B; Program v2 extends it with B → g → C]
21. Labeling Program with Variable Names
• Store intermediate results using variable names
  • variable names := program markers
• But variable names are lost after compilation
• Solution: extract variable names from the AST at compile time
  • Using Scala macros (available since Scala 2.10)
val B = A.map(f)
val C = B.map(g)
[Figure: Program v1: A → f → B; Program v2: B → g → C]
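The idea of using variable names as markers for stored results can be sketched without macros by passing the name explicitly. In Silk the name would be captured at compile time; here `ResultStore` and `named` are hypothetical helpers that only illustrate the lookup.

```scala
object NamedResultsSketch {
  import scala.collection.mutable

  // Intermediate results keyed by variable name. The name is passed by hand
  // in this sketch; a macro could supply it automatically.
  final class ResultStore {
    private val results = mutable.Map.empty[String, Any]
    var evaluations = 0

    // Compute the value only if nothing is stored under this name yet.
    def named[T](name: String)(compute: => T): T =
      results.getOrElseUpdate(name, { evaluations += 1; compute }).asInstanceOf[T]

    def contains(name: String): Boolean = results.contains(name)
  }

  val f: Int => Int = _ + 1
  val g: Int => Int = _ * 2
  val A: Seq[Int] = Seq(1, 2, 3)

  val store = new ResultStore

  // Program v1
  val B: Seq[Int] = store.named("B")(A.map(f))

  // Program v2 declares B again and adds C. B is looked up by name
  // instead of being recomputed; only C is new work.
  val B2: Seq[Int] = store.named("B")(A.map(f))
  val C: Seq[Int] = store.named("C")(B2.map(g))
}
```

The unchecked cast in `named` is acceptable for a sketch but relies on each name always being bound to the same type.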
22. Scala Program (AST) to DAG Schedule (Logical Plan)
• Translate a program into a set of Silk operation objects
  • val B = MapOp(input = A, output = "B", function = f)
  • val C = MapOp(input = B, output = "C", function = g)
• Operations in Silk form a DAG
  • val C = MapOp(input = MapOp(input = A, output = "B", function = f), output = "C", function = g)
[Figure: A → f → B → g → C]
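The operation objects above can be modeled as ordinary case classes. This is a simplified sketch: the `Op` trait, the `Source` node, and the `schedule` method are assumptions added for illustration, and only the `MapOp` field names follow the slide.

```scala
object OperationDagSketch {
  // A node of the logical plan. `output` is the variable name it is bound to.
  sealed trait Op[A] {
    def output: String
    // Variable names from the sources down to this node, in execution order.
    def schedule: List[String]
  }

  final case class Source[A](output: String, data: Seq[A]) extends Op[A] {
    def schedule: List[String] = List(output)
  }

  final case class MapOp[A, B](input: Op[A], output: String, function: A => B)
      extends Op[B] {
    def schedule: List[String] = input.schedule :+ output
  }

  val f: Int => Int = _ + 1
  val g: Int => String = n => "#" + n

  val A = Source("A", Seq(1, 2, 3))
  val B = MapOp(input = A, output = "B", function = f)
  // C holds a reference to B, which holds a reference to A: a nested DAG.
  val C = MapOp(input = B, output = "C", function = g)
}
```

Here `OperationDagSketch.C.schedule` lists the variable names in the order a weaver would have to evaluate them.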
23. Using Scala Macros
• Produce operation objects with Scala macros
  • map(f: A => B) produces MapOp[A, B](…)
• Why do we need to use a macro here?
  • To extract the FContext (target variable name, enclosing method, class, etc.) from the AST
27. Weaving Dataflows with Silk
• Translate a program into a set of Silk operation objects
  • val B = MapOp(input = A, output = "B", function = f)
  • val C = MapOp(input = B, output = "C", function = g)
• Silk uses these variable names to store the intermediate data
28. Weaving Dataflows with Silk
• Silk defines various types of operations
30. Summary
• Declarative-style coding is necessary for creating a DAG schedule
• DAG schedules are labeled with variable names using Scala macros
• Weaver: an abstraction of how to execute the code
• The weaver manages the running and finished parts of the code
33. Silk[A]
Weaving Silk materializes objects
Resource Table (CPU, memory)
User program builds workflows
Static optimization
DAG Schedule
• read file, toSilk
• map, reduce, join, groupBy
• UNIX commands
• etc.
• Register ClassBox
• Submit schedule
Silk Master
dispatch
Silk Client
ZooKeeper
Node Table
Slice Table
Task Scheduler
Task Status
Task Executor
Resource Monitor
Silk Client
Task Scheduler
Task Executor
Resource Monitor
ensemble mode (at least 3 ZK instances)
• Leader election
• Collects locations of slices and ClassBox jars
• Watches active nodes
• Watches available resources
• Submits tasks
• Run-time optimization
• Resource allocation
• Monitoring resource usage
• Launches Web UI
• Manages assigned task status
• Object serialization/deserialization
• Serves slice data
Local ClassBox
classpaths local jar files
ClassBox Table
weave
• Dispatches tasks to clients
• Manages master resource table
• Authorizes resource allocation
• Automatic recovery by leader election in ZK
Data Server
Data Server
Silk[A]: SilkSingle[A] or SilkSeq[A]
• weave on SilkSingle[A] → A (a single object)
• weave on SilkSeq[A] → Seq[A] (a sequence of objects)
[Figure labels: Local machine, Cluster]
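The relationship between the two kinds of Silk values and what weaving returns can be sketched as below. These definitions are simplified stand-ins that reuse the names on the slide; the `compute` thunks, `map`, `reduce`, and the overloaded `weave` are assumptions, not Silk's actual signatures.

```scala
object SilkTypesSketch {
  sealed trait Silk[A]

  // Describes a computation whose result is a sequence of objects.
  final case class SilkSeq[A](compute: () => Seq[A]) extends Silk[A] {
    def map[B](f: A => B): SilkSeq[B] = SilkSeq(() => compute().map(f))
    // Reducing a sequence yields a description of a single object.
    def reduce(op: (A, A) => A): SilkSingle[A] =
      SilkSingle(() => compute().reduce(op))
  }

  // Describes a computation whose result is a single object.
  final case class SilkSingle[A](compute: () => A) extends Silk[A]

  // Weaving materializes the description:
  // SilkSeq[A] becomes Seq[A], and SilkSingle[A] becomes A.
  def weave[A](silk: SilkSeq[A]): Seq[A] = silk.compute()
  def weave[A](silk: SilkSingle[A]): A = silk.compute()

  val numbers: SilkSeq[Int] = SilkSeq(() => Seq(1, 2, 3))
  val doubled: SilkSeq[Int] = numbers.map(_ * 2)
  val total: SilkSingle[Int] = doubled.reduce(_ + _)
}
```

Nothing is computed when `doubled` and `total` are defined; the work happens only when `weave` is called on them.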
34. Integrating Varieties of Data Sources
• WormTSS: http://wormtss.utgenome.org/
• Integrating various data sources
35. Varieties of Data Analysis
Using R, JFreeChart, etc.
We need an automated pipeline that can redo the entire analysis, so that we can answer paper reviews within a month.
36. Makefile
• Describes dependencies between commands through files
• Good: we can resume and update the dataflow processing
• Bad: the Makefile for the WormTSS analysis exceeds 1,000 lines
37. Splitting Data Analysis Into Command Modules
• We added a new command each time we needed a new analysis or data-processing step
• The result:
  • Hundreds of commands!
  • The number of files limits the parallelism