4. Running example: PageRank
This program is not efficient. Which parts?
map function shuffles
whole graph structure
in every iteration
scores are computed
even if the nodes have
converged
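The two inefficiencies above can be seen in a minimal single-machine sketch of a naive PageRank loop. This is an illustration only, not the MapReduce program from the slide; the function name `naive_pagerank`, the damping factor, and the fixed iteration count are assumptions.

```python
def naive_pagerank(edges, damping=0.85, iterations=20):
    """edges: list of (src, dest) pairs. Returns a dict node -> score."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Redundancy 1: the whole (unchanging) edge list is re-read
        # in every iteration, like the map phase re-shuffling the graph.
        contrib = {n: 0.0 for n in nodes}
        for src, dest in edges:
            contrib[dest] += score[src] / out_degree[src]
        # Redundancy 2: every node is recomputed, even if its score
        # stopped changing several iterations ago.
        score = {
            n: (1 - damping) / len(nodes) + damping * contrib[n]
            for n in nodes
        }
    return score
```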
5. Issues for iterative analysis
How to optimize the program?
Reusing the intermediate (shuffled) data
Skip computing the scores of converged nodes
Possible but difficult to manually remove
the above redundant computations
In fact, Spark, HaLoop, and REX require
programmers to remove them manually
Our goal: Automatically remove redundant
computations for iterative queries
6. Overview
OptIQ is a new optimization framework for
iterative queries with convergence property
Declarative high-level language; programmers
are freed from the burden of removing redundancy
OptIQ integrates traditional optimization
techniques in database and compiler areas
Two techniques for removing redundancy
view materialization for invariant views
incrementalization for variant views
We implement OptIQ on Hive and Spark
7. Iterative query language
SQL extended with iteration
Syntax
Behavior
initialize: statements before iteration
update table is updated by step query repeatedly
until convergence
return: statements after iteration
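The initialize / iterate / return behavior can be sketched as a small driver loop. This is a hedged illustration of the control flow only, not OptIQ's syntax or implementation; `run_iterative_query`, its callback parameters, and `max_iterations` are hypothetical names.

```python
def run_iterative_query(initialize, step, converged, finalize,
                        max_iterations=100):
    """Generic driver for an iterative query.

    initialize(): builds the initial update table.
    step(table):  the step query; returns the next update table.
    converged(old, new): the termination condition.
    finalize(table): statements run after the iteration.
    """
    table = initialize()
    for _ in range(max_iterations):  # safety bound (an assumption)
        new_table = step(table)
        done = converged(table, new_table)
        table = new_table
        if done:
            break
    return finalize(table)


# Usage: halve a value until it changes by less than 1.0 per iteration.
result = run_iterative_query(
    initialize=lambda: {"x": 16.0},
    step=lambda t: {k: v / 2 for k, v in t.items()},
    converged=lambda old, new: all(abs(new[k] - old[k]) < 1.0 for k in old),
    finalize=lambda t: t,
)
```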
12. View materialization
Purpose is to reuse unmodified attributes
of update table during iterations
Procedure
1. Decompose update table into variant and
invariant tables by conservative analysis
2. Materialize sub-query in step query that only
accesses invariant table
3. Rewrite step query to use the materialized view
(query processing using views)
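As a rough illustration of the procedure above, the sketch below materializes one invariant sub-query of a PageRank-style step once, outside the loop, and rewrites the step to read from it. Treating the edge list joined with out-degree as the invariant sub-query is an assumption made for this example; `materialize_invariant_view` and `step` are hypothetical names.

```python
def materialize_invariant_view(edges):
    """Computed once: (src, dest, 1/out_degree(src)) for every edge.

    This only touches the invariant attributes (src, dest), so the
    result never changes across iterations.
    """
    out_degree = {}
    for src, _ in edges:
        out_degree[src] = out_degree.get(src, 0) + 1
    return [(src, dest, 1.0 / out_degree[src]) for src, dest in edges]


def step(view, score, damping=0.85):
    """Rewritten step query: reads the materialized view plus the
    variant scores, instead of re-deriving out-degrees each time."""
    n = len(score)
    contrib = {node: 0.0 for node in score}
    for src, dest, weight in view:
        contrib[dest] += score[src] * weight
    return {node: (1 - damping) / n + damping * contrib[node]
            for node in score}
```

The view is built before the first iteration and passed unchanged to every call of `step`.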
13. Table decomposition
distinguish modified from unmodified attributes
unmodified attribute: src, dest in Graph
modified attribute: score in Graph
decompose update table
Graph’ = select VT.src, IT.dest, VT.score
from VT, IT
where VT.src = IT.src
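The decomposition can be sketched in a few lines. This example assumes, as the join on `src` suggests, that `score` is determined by `src` alone; the function names `decompose` and `recompose` are hypothetical.

```python
def decompose(graph):
    """graph: list of (src, dest, score) rows.

    Returns (IT, VT):
      IT: invariant table, list of (src, dest), never modified;
      VT: variant table, dict src -> score, rewritten each iteration.
    """
    invariant = [(src, dest) for src, dest, _ in graph]
    variant = {}
    for src, _, score in graph:
        variant[src] = score  # assumes one score per src
    return invariant, variant


def recompose(invariant, variant):
    """Join IT and VT on src to rebuild the Graph rows."""
    return [(src, dest, variant[src]) for src, dest in invariant]
```

Only `VT` has to be rewritten between iterations; `IT` can be stored once and reused.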
18. Automatic incrementalization
Not all records are updated in iterations.
Purpose is to reuse unmodified tuples in
variant views.
Procedure
1. Detect delta table between iterations before
starting 1st iteration.
2. Derive incremental queries. Both input and
output are delta tables.
3. Execute queries in incremental mode as much
as possible.
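To make the idea concrete, here is a minimal sketch for one simple case: a step query that sums weighted scores per destination. Because the sum is linear in the scores, only the changed scores need to be propagated. This is an illustration under that assumption, not OptIQ's derivation rules; all function names are hypothetical.

```python
def index_by_src(view):
    """Group (src, dest, weight) rows by src for fast delta lookup."""
    by_src = {}
    for src, dest, weight in view:
        by_src.setdefault(src, []).append((dest, weight))
    return by_src


def full_contrib(view, score):
    """Non-incremental evaluation: scans every row of the view."""
    contrib = {}
    for src, dest, weight in view:
        contrib[dest] = contrib.get(dest, 0.0) + score[src] * weight
    return contrib


def incremental_contrib(view_by_src, old_contrib, delta_score):
    """Incremental evaluation: input is a delta (src -> score change),
    and only rows whose src appears in the delta are touched."""
    contrib = dict(old_contrib)
    for src, delta in delta_score.items():
        for dest, weight in view_by_src.get(src, ()):
            contrib[dest] = contrib.get(dest, 0.0) + delta * weight
    return contrib
```

When few scores change, the incremental version visits only the outgoing edges of the changed nodes rather than the whole view.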
19. Delta table detection
Delta table is detected easily, since we
have already identified variant views.
ΔT = T’ – T
Update operations for update tables
insertion
deletion
update
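A delta between two versions of a keyed table can be split into the three operation kinds listed above. The sketch below assumes tables are represented as dictionaries from key to value; `detect_delta` is a hypothetical helper, not part of OptIQ.

```python
def detect_delta(old, new):
    """Compare two versions of a keyed table (dict key -> value).

    Returns (inserted, deleted, updated):
      inserted: rows whose key exists only in `new`;
      deleted:  rows whose key exists only in `old`;
      updated:  key -> (old value, new value) where the value changed.
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserted, deleted, updated
```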
20. Deriving incremental queries
There is a large body of literature on incremental
query evaluation [9, 13, 19]
We focus on incremental query
evaluation for update operations, since
they are frequent in iterative queries.
21. Deriving incremental queries
Query:
where step query q, update table T, delta table ΔT,
terminate condition φ
Suppose q is distributed:
We obtain incremental query:
where ψ is an optional filter
25. Additional rules for group-by
insertion/deletion rules for group-by
sum, count: insertion and deletion
max, min: only for insertion (not distributive for deletion)
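The asymmetry between sum/count and max/min can be illustrated with a small sketch. Groups are assumed to be stored as dictionaries; `apply_sum_count_delta` and `apply_max_insert` are hypothetical names.

```python
def apply_sum_count_delta(groups, inserted=(), deleted=()):
    """groups: dict key -> (sum, count). Deltas are (key, value) pairs.

    sum and count can absorb both insertions and deletions, because
    each row's contribution can be added or subtracted independently.
    """
    result = dict(groups)
    for key, value in inserted:
        s, c = result.get(key, (0, 0))
        result[key] = (s + value, c + 1)
    for key, value in deleted:
        s, c = result[key]
        if c == 1:
            del result[key]  # last row of the group is gone
        else:
            result[key] = (s - value, c - 1)
    return result


def apply_max_insert(groups, inserted):
    """groups: dict key -> current max. Insertions only.

    A deletion cannot be handled this way: if the deleted value is the
    current max, the new max is unknown without rescanning the group.
    """
    result = dict(groups)
    for key, value in inserted:
        result[key] = max(result[key], value) if key in result else value
    return result
```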
26. MapReduce implementation
We extend Hive for OptIQ
Iterative query processing
convergence is tested by joining old and new
update tables
View materialization
partition invariant views by group-by/join keys
for efficient group-by/join operations
Incrementalization
apply incrementalization as much as possible
delta table is kept on DFS
Putting MR design patterns together
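The convergence test by joining old and new update tables can be sketched as a key-wise comparison. The numeric tolerance and the name `has_converged` are assumptions for illustration; the actual test in the Hive extension may differ.

```python
def has_converged(old, new, tolerance=1e-6):
    """Join old and new update tables on their key (dict key -> score)
    and check that no row changed by more than `tolerance`."""
    if old.keys() != new.keys():
        return False  # rows were inserted or deleted
    return all(abs(new[k] - old[k]) <= tolerance for k in old)
```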
27. Experiments
Purpose
How effective is OptIQ for real analyses?
How much error is introduced by
incrementalization?
Is OptIQ applicable to both MapReduce and Spark?
Environment: 11 computers
Workload
Datasets: graph (Wikipedia, web graph),
multidimensional data (US Census, mnist8m)
Analysis: PageRank, RWR, k-means clustering
32. Related work
Iterative MapReduce runtime system
Twister: iterative MR computation
Iterative MapReduce programming models
HaLoop: manual view caching
iMapReduce:
Spark: in-memory cluster computing for iterative
applications, manual optimization for map-side join
Pregel: Bulk synchronous parallel model
GraphLab: Distributed graph computation model
PEGASUS: matrix multiplication model on MapReduce
33. Related work cont.
Declarative MapReduce programming
HiveQL and Pig : SQL on MapReduce
HadoopDB: Integration of RDBMS and MapReduce
MRQL: iterative query language, algebraic/MR-level
optimization; map fusion, join/group-by fusion
Query optimization in MapReduce
Comet: algebraic-level (shared selection, grouping,
time-spanned views) and MR-level sharing (shared
scan, shared shuffle)
YSmart: sharing among group-by and joins
REX: explicit incremental computation
34. Conclusion
OptIQ is an optimization framework for iterative
queries with convergence property
Two techniques for removing redundancy
view materialization for invariant views
incrementalization for variant views
We implement OptIQ on Hive and Spark
OptIQ improves performance by up to
five times
35. Future work
Apply OptIQ to other analyses: NMF, affinity
propagation, logistic regression
adaptive and incremental evaluation techniques
for matrix computation, such as PageRank, NMF,
centrality computation