7. Data Processing from 10,000 Feet
[Figure: a three-layer stack. A Data Processing Application runs on a Data Processing Framework (Spark, Flink, Hadoop MR, Dryad, Tez, ...), which runs on a Resource Environment.]
It is hard to add new application optimization features
to existing frameworks.
8. Dynamic Optimization
Dynamic skew handling
Optimizing job execution based on its characteristics
Adapting execution to resource elasticity
9. Onyx
Key observation: current data processing frameworks
are not flexible and extensible.
=> Onyx: A new flexible and extensible data processing
system
12. IR (Intermediate Representation) DAG
: Program-agnostic DAG with Annotations
Vertex Labels
Type: Operator/Loop
Placement: GPUNode/ReservedNode/TransientNode/Any
Parallelism
Edge Labels
Type: 1:1/Broadcast/Shuffle
Mode: Push/Pull
Storage: Memory/Disk/RemoteDisk
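The vertex and edge labels above can be sketched as a small annotated DAG. This is only an illustration in Python, not Onyx's actual (Java) classes: the names `Vertex`, `Edge`, `IRDag`, `add_vertex`, `connect`, and `incoming` are assumptions made for this sketch, and the default label values are arbitrary.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    type: str = "Operator"     # Operator / Loop
    placement: str = "Any"     # GPUNode / ReservedNode / TransientNode / Any
    parallelism: int = 1

@dataclass
class Edge:
    src: str
    dst: str
    type: str = "1:1"          # 1:1 / Broadcast / Shuffle
    mode: str = "Pull"         # Push / Pull
    storage: str = "Memory"    # Memory / Disk / RemoteDisk

@dataclass
class IRDag:
    """Hypothetical container: the program's structure plus its annotations."""
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex
        return self

    def connect(self, src, dst, **labels):
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be added before connecting")
        self.edges.append(Edge(src, dst, **labels))
        return self

    def incoming(self, name):
        return [e for e in self.edges if e.dst == name]

# A two-vertex MapReduce DAG whose shuffle edge is pulled from disk.
dag = (IRDag()
       .add_vertex(Vertex("Map", parallelism=100))
       .add_vertex(Vertex("Reduce", parallelism=50))
       .connect("Map", "Reduce", type="Shuffle", mode="Pull", storage="Disk"))
```

Keeping the labels as plain data separate from the operators is what lets later passes rewrite them without touching the application logic.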
13. MapReduce IR DAG Example
[Figure: two Map → Reduce IR DAGs that differ only in their edge annotations.]
● Classical MapReduce: Shuffle, Pull, Disk
● Small-scale MapReduce: Shuffle, Push, Memory
14. Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-time annotation pass examples
● Parallelism pass
● Executor placement pass
● Data flow model pass
● Stage partitioning pass
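As a rough illustration of the annotation passes listed above, each pass can be written as a function that takes a DAG and returns it with more labels filled in. The dictionary layout, the function names, the default values, and the stage rule (a new stage starts at every vertex with an incoming shuffle edge) are all assumptions of this sketch, not Onyx's implementation.

```python
def parallelism_pass(dag, default=4):
    """Annotation pass: give every vertex without a parallelism label a default."""
    for labels in dag["vertices"].values():
        labels.setdefault("parallelism", default)
    return dag

def data_flow_model_pass(dag):
    """Annotation pass: label edges with a mode and storage (hypothetical defaults)."""
    for edge in dag["edges"]:
        edge.setdefault("mode", "Pull")
        edge.setdefault("storage", "Disk" if edge["type"] == "Shuffle" else "Memory")
    return dag

def stage_partitioning_pass(dag):
    """Annotation pass: bump the stage index at each vertex fed by a shuffle edge.

    Simplification: assumes dag["order"] lists the vertices of a linear chain
    in topological order.
    """
    stage = 0
    for name in dag["order"]:
        if any(e["dst"] == name and e["type"] == "Shuffle" for e in dag["edges"]):
            stage += 1
        dag["vertices"][name]["stage"] = stage
    return dag

def run_passes(dag, passes):
    """Apply a series of passes in order and return the optimized DAG."""
    for apply_pass in passes:
        dag = apply_pass(dag)
    return dag

dag = {
    "order": ["Read", "Map", "Reduce"],
    "vertices": {"Read": {}, "Map": {}, "Reduce": {"parallelism": 2}},
    "edges": [
        {"src": "Read", "dst": "Map", "type": "1:1"},
        {"src": "Map", "dst": "Reduce", "type": "Shuffle"},
    ],
}
optimized = run_passes(
    dag, [parallelism_pass, data_flow_model_pass, stage_partitioning_pass])
```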
15. Compiler Passes
Compile-time reshaping pass examples
● Loop extraction pass
● Loop fusion pass (loop optimization)
● Common subexpression elimination pass
● Data skew reshaping pass
Runtime pass example
● Data skew runtime pass
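Unlike an annotation pass, a reshaping pass changes the structure of the DAG. The sketch below shows one way a data skew reshaping pass could do that, by splitting every shuffle edge with a vertex that collects metrics. The slides do not say how Onyx's pass reshapes the DAG, so the `MetricCollection` vertex, its 1:1 input edge, and the tuple layout are assumptions.

```python
def skew_reshaping_pass(vertices, edges):
    """Reshaping pass (hypothetical): insert a metric-collection vertex on
    every Shuffle edge so data sizes can be observed before the shuffle."""
    new_vertices = list(vertices)
    new_edges = []
    for src, dst, edge_type in edges:
        if edge_type == "Shuffle":
            collector = f"MetricCollection({src}->{dst})"
            new_vertices.append(collector)
            new_edges.append((src, collector, "1:1"))
            new_edges.append((collector, dst, "Shuffle"))
        else:
            new_edges.append((src, dst, edge_type))
    return new_vertices, new_edges

vertices, edges = skew_reshaping_pass(
    ["Map", "Reduce"], [("Map", "Reduce", "Shuffle")])
```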
16. Compiler to Runtime
Optimized IR DAG (figure):
● Map vertex. Type: “Map” Operator; Placement: “Reserved” Node; Parallelism: 100; Stage: Map Stage Index
● Edge from Map to Reduce: Shuffle, Pull, Disk
● Reduce vertex. Type: “Reduce” Operator; Placement: “Reserved” Node; Parallelism: 50; Stage: Reduce Stage Index
17. Compiler to Runtime
Execution Plan (figure):
● Map stage: “Map” tasks × 100
● Reduce stage: “Reduce” tasks × 50
● I/O channels for intermediate data flow between tasks
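The conversion from annotated stages to tasks and I/O channels can be sketched as follows. The helper names `expand_stage` and `channels` are hypothetical, and the assumption that a shuffle edge opens one channel from every source task to every destination task is ours, not something stated on the slide.

```python
from itertools import product

def expand_stage(name, parallelism):
    """Turn one stage into `parallelism` tasks."""
    return [f"{name}-{i}" for i in range(parallelism)]

def channels(src_tasks, dst_tasks, edge_type):
    """Enumerate I/O channels between the tasks of two connected stages."""
    if edge_type == "1:1":
        if len(src_tasks) != len(dst_tasks):
            raise ValueError("1:1 edges need equal parallelism")
        return list(zip(src_tasks, dst_tasks))
    # Shuffle and Broadcast both link every source task to every destination
    # task; they differ in which data travels over each channel.
    return list(product(src_tasks, dst_tasks))

map_tasks = expand_stage("Map", 100)
reduce_tasks = expand_stage("Reduce", 50)
io_channels = channels(map_tasks, reduce_tasks, "Shuffle")
```

With a parallelism of 100 and 50, this all-to-all assumption gives 100 × 50 = 5,000 channels for the single shuffle edge.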
23. Onyx Implementation
● Programming Models:
○ Apache Beam applications supported
○ Spark applications coming up shortly
● Implemented on Apache REEF
○ which uses YARN or Mesos for resource management
● Implemented using Java 8
○ makes good use of lambdas and streams
27. Job Execution Demo
We will show how:
1. Job execution can be controlled flexibly, and
2. Job execution properties can be extended, using:
a. Annotation Pass
b. Policy
3. An iterative part of a job can be represented using:
a. LoopExtraction Pass (a Reshaping Pass)
4. The status of a running job can be monitored using:
a. a Web UI
Demo applications: MapReduce, ALS
28. MapReduce
We will show two executions of MapReduce using different
settings:
● Intermediate data is saved on disk, and pulled by the reducers
● Intermediate data is saved in memory, and pushed to the reducers
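The two settings differ only in how the shuffle edge is annotated; the MapReduce program itself stays the same. A minimal sketch of that idea, assuming a dictionary-based DAG and a hypothetical `shuffle_annotation_pass` helper (neither is Onyx's API):

```python
import copy

# The application DAG, written once and without execution settings.
MAPREDUCE = {
    "vertices": ["Map", "Reduce"],
    "edges": [{"src": "Map", "dst": "Reduce", "type": "Shuffle"}],
}

def shuffle_annotation_pass(mode, storage):
    """Build an annotation pass that labels every Shuffle edge."""
    def apply(dag):
        annotated = copy.deepcopy(dag)  # leave the input DAG untouched
        for edge in annotated["edges"]:
            if edge["type"] == "Shuffle":
                edge.update(mode=mode, storage=storage)
        return annotated
    return apply

disk_pull = shuffle_annotation_pass("Pull", "Disk")(MAPREDUCE)
memory_push = shuffle_annotation_pass("Push", "Memory")(MAPREDUCE)
```

Swapping one pass for the other changes where intermediate data lives and who initiates the transfer, without editing the Map or Reduce code.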
29. Demo
Map data on disk, pulled
[Figure: Map Stage → Reduce Stage, edge annotated Shuffle, Pull, Disk]
30. Demo
Map data in memory, pushed
[Figure: Map Stage → Reduce Stage, edge annotated Shuffle, Push, Memory]
34. Alternating Least Squares Example
● Alternating Least Squares (ALS) is an ML algorithm commonly used in
recommendation systems.
● Most ML algorithms are iterative processes
=> ALS is one of them!
35. Alternating Least Squares Example
Naively…
[Figure: (Read input data) → Iteration 1 → Iteration 2 → ... → Iteration N → (Write output). Each iteration is a set of operators that executes the ALS algorithm.]
But what if we want to decide this “N” according to some condition
(e.g., model convergence in ML)?
36. Alternating Least Squares Example
Something special we have for the ALS example: Loops!
[Figure: (Read input data) → LoopVertex with termination condition → (Write output). The LoopVertex stands for the chain Iteration 1 → Iteration 2 → ... → Iteration N.]
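A loop vertex with a termination condition can be sketched as a body that is re-run until a predicate on the old and new state holds, so N is decided at run time instead of being fixed in the DAG. The class below is a hypothetical illustration, not Onyx's `LoopVertex`. To keep it short, the loop body is a stand-in (Newton's iteration for the square root of 2) rather than ALS.

```python
class LoopVertex:
    """Hypothetical loop vertex: repeat `body` until `should_stop` is true."""

    def __init__(self, body, should_stop, max_iterations=100):
        self.body = body
        self.should_stop = should_stop
        self.max_iterations = max_iterations  # safety bound if it never converges

    def run(self, state):
        for iteration in range(1, self.max_iterations + 1):
            new_state = self.body(state)
            if self.should_stop(state, new_state):
                return new_state, iteration
            state = new_state
        return state, self.max_iterations

# Stand-in body: one Newton step toward sqrt(2).
loop = LoopVertex(
    body=lambda x: 0.5 * (x + 2.0 / x),
    should_stop=lambda old, new: abs(new - old) < 1e-9,  # "model convergence"
)
result, iterations = loop.run(1.0)
```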
41. Dynamic Optimization
Will show how Onyx achieves dynamic optimization using:
1. Reshaping Pass
=> for metric collection
2. Runtime Pass
=> for generating a dynamically optimized plan
42. Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
[Figure: Onyx Compiler and Onyx Runtime. In the compiler, AnnotationPass(es) are applied to the IR DAG.]
43. Dynamic Data Partitioning Example
[Figure: Onyx Compiler and Onyx Runtime. In the compiler, a ReshapingPass is applied to the IR DAG.]
47. Dynamic Data Partitioning Example
[Figure: Onyx Compiler and Onyx Runtime. Execution plan conversion turns the optimized IR DAG into an execution plan made of stages.]
48. Dynamic Data Partitioning Example
[Figure: Onyx Compiler and Onyx Runtime. Execution plan conversion yields the execution plan (stages).]
49. Dynamic Data Partitioning Example
[Figure: Onyx Compiler and Onyx Runtime. The execution plan (stages) is executed.]
50. Dynamic Data Partitioning Example
[Figure: Onyx Compiler and Onyx Runtime. While the job is executing, a data size metric is collected.]
51. Dynamic Data Partitioning Example
[Figure: Onyx Compiler and Onyx Runtime. RuntimePass(es) produce a new IR DAG.]
52. Dynamic Data Partitioning Example
[Figure: Onyx Compiler and Onyx Runtime. The new execution plan (stages) is executed.]
53. Lessons Learned
1. Dynamic Optimization: extensible to any job
a. A Reshaping Pass to define when a customizable metric should be
received from the Runtime
b. A Runtime Pass to define how to change the DAG using the received
metric
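To make the second half concrete, here is a sketch of what a runtime pass for data skew might compute from a received data size metric. The slides do not describe the algorithm Onyx uses. The greedy "largest partition to the least-loaded reducer" rule below is a swapped-in, well-known heuristic, and `skew_runtime_pass` and the metric layout (hash partition id mapped to bytes) are assumptions.

```python
import heapq

def skew_runtime_pass(partition_sizes, num_reducers):
    """Assign hash partitions to reducers, largest first, always choosing
    the reducer with the smallest load so far (greedy balancing)."""
    heap = [(0, reducer) for reducer in range(num_reducers)]  # (load, reducer id)
    assignment = {reducer: [] for reducer in range(num_reducers)}
    by_size = sorted(partition_sizes.items(), key=lambda kv: -kv[1])
    for partition, size in by_size:
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(partition)
        heapq.heappush(heap, (load + size, reducer))
    loads = {r: sum(partition_sizes[p] for p in parts)
             for r, parts in assignment.items()}
    return assignment, loads

# Hypothetical metric reported by the runtime: partition 0 is heavily skewed.
metric = {0: 900, 1: 100, 2: 120, 3: 80, 4: 300, 5: 300}
assignment, loads = skew_runtime_pass(metric, num_reducers=3)
```

With this metric, the skewed partition gets a reducer to itself and the remaining five partitions are spread over the other two reducers.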
54. Lessons Learned
2. Extend the various options for execution properties by
a. Implementing new Compile-Time Passes (Annotation + Reshaping)
b. Adding new implementations of the interfaces of the configurable
components for Runtime
55. Lessons Learned
3. Flexibly control the execution properties by:
a. Pre-defined/newly implemented Compile-Time Passes
b. Using Composite Passes
c. Using Policies
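One way to picture the relationship between these three options: a composite pass bundles several passes into one, and a policy is a named, ordered list of passes (composite or not) applied at compile time. The `composite`, `set_label`, and `Policy` names below are hypothetical and only sketch that relationship.

```python
def composite(*passes):
    """Bundle several passes into a single pass that applies them in order."""
    def apply(dag):
        for apply_pass in passes:
            dag = apply_pass(dag)
        return dag
    return apply

def set_label(target, key, value):
    """Tiny annotation pass factory: set one label on all vertices or edges."""
    def apply(dag):
        return {**dag, target: [{**item, key: value} for item in dag[target]]}
    return apply

class Policy:
    """A named, ordered list of compile-time passes."""
    def __init__(self, name, passes):
        self.name = name
        self.passes = list(passes)

    def compile(self, dag):
        return composite(*self.passes)(dag)

# A composite pass that fixes the whole data flow model of every edge.
push_in_memory = composite(
    set_label("edges", "mode", "Push"),
    set_label("edges", "storage", "Memory"),
)
policy = Policy("small-scale", [set_label("vertices", "parallelism", 8), push_in_memory])
plan = policy.compile({
    "vertices": [{"name": "Map"}, {"name": "Reduce"}],
    "edges": [{"src": "Map", "dst": "Reduce"}],
})
```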
57. Harnessing Transient Resources with Onyx
Pado (EuroSys 2017), a special data processing engine for harnessing
transient resources, as a simple policy on Onyx, a flexible and extensible
data processing system.
79. Operator Placement Example with the Transient Resource Policy
Multinomial Logistic Regression (MLR): a machine learning application for
classifying inputs, such as tumors as malignant or benign, or ad clicks as
profitable or not.
Gradients are used to update the regression
model, which is used for prediction.
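An operator placement pass for such a job could look like the sketch below. The rule used here (vertices fed by a shuffle edge go to reserved nodes, everything else to transient nodes) is a hypothetical rule chosen for illustration and is not claimed to be the rule the transient resource policy applies. The operator names are also made up.

```python
def transient_placement_pass(vertices, edges):
    """Placement pass (hypothetical rule): put vertices with an incoming
    Shuffle edge on reserved nodes and all other vertices on transient nodes."""
    placement = {}
    for vertex in vertices:
        fed_by_shuffle = any(
            dst == vertex and edge_type == "Shuffle" for _, dst, edge_type in edges)
        placement[vertex] = "ReservedNode" if fed_by_shuffle else "TransientNode"
    return placement

# A made-up three-operator MLR iteration.
placement = transient_placement_pass(
    ["ReadData", "ComputeGradient", "AggregateGradient"],
    [("ReadData", "ComputeGradient", "1:1"),
     ("ComputeGradient", "AggregateGradient", "Shuffle")],
)
```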
101. Containers
● Amazon EC2 instances (with local SSDs) as containers
● 40 Transient Containers, 5 Reserved Containers
● All containers used for computation
102. Workloads
● Alternating Least Squares
Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta
Information, v. 1.0. https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
● Multinomial Logistic Regression
Synthetic data
● MapReduce
Page view statistics for Wikimedia projects.
https://dumps.wikimedia.org/other/pagecounts-raw
104. Summary
● Introduces a new data processing system that is flexible
and extensible
○ Compiler that represents various execution policies
○ Runtime that is modular and reconfigurable
● Adapts data processing seamlessly for new deployment
and application requirements
105.
We are working on creating an Apache incubator
project. We look forward to contributions from many
developers!
We are hiring software developers!
Contact: onyx@spl.snu.ac.kr
Software platform lab site: http://spl.snu.ac.kr
106. Onyx:
A Flexible and Extensible
Data Processing System
전병곤, 김주연, 송원욱
Software Platform Lab
Joint work with 양영석, 이산하, 서장호, 어정윤, 이계원, 엄태건, 이우연,
이윤성, 정주성, 하현민, 정은지, 김수정, 유경인, 신동진