SlideShare a Scribd company logo
1 of 35
Download to read offline
Makoto Onizuka, Hiroyuki Kato, Soichiro
Hidaka, Keisuke Nakano, Zhenjiang Hu
1
Demand for Big Data Analysis
 Big Data Analysis
 Cyber space: Click log, query log
 Real space: shopping log, sensing data
 Machine learning
 Algorithm: classification, clustering
 Data type: relation, vector, graph, time series
 Distributed computing framework
 Interface: MPI, MapReduce, BSP (bulk
synchronous parallel)
2
Iterative analysis examples
 Clustering
 Partitioning: k-means, EM-algorithm, affinity
propagation
 Hierarchical clustering: Ward's method,
BIRCH
 Matrix factorization
 Graph mining
 PageRank, Random walk with restarts
3
Running example: PageRank
This program is not efficient. Which parts?
4
map function shuffles
whole graph structure
in every iteration
scores are computed
even if the nodes have
converged
Issues for iterative analysis
 How to optimize the program?
 Reusing the intermediate (shuffled) data
 Skip computing the scores of converted nodes
 Possible but difficult to manually remove
the above redundant computations
 Actually, Spark, HaLoop, REX force
programmers to manually remove them
 Our goal: Automatically remove redundant
computations for iterative queries
5
Overview
 OptIQ is a new optimization framework for
iterative queries with convergence property
 Declarative high level language; programmers
are freed from burden of removing redundancy
 OptIQ Integrates traditional optimization
techniques in database and compiler areas
 Two techniques for removing redundancy
 view materialization for invariant views
 incrementalization for variant views
 We implement on Hive and Spark
6
Iterative query language
 SQL extended with iteration
 Syntax
 Behavior
 initialize: statements before iteration
 update table is updated by step query repeatedly
until convergence
 return: statements after iteration
7
Example: PageRank
8
Example: k-means
9
Query Optimization
 Goal: remove redundant computations
 Question: What is redundant computation?
 Operations on unmodified attributes of tuples
 Operations on attributes of unmodified tuples
 OptIQ reuses partial results of step queries
 View materialization reuses operations on
unmodified attributes
 incrementalization reuses operations on unmodified
tuples
10
Query Optimization cont.
11
View materialization
 Purpose is to reuse unmodified attributes
of update table during iterations
 Procedure
1. Decompose update table into variant and
invariant tables by conservative analysis
2. Materialize sub-query in step query that only
accesses invariant table
3. Rewrite step query to use materialized view,
query processing using view
12
Table decomposition
 discriminate modified/unmodified attributes
 unmodified attribute: src, dest in Graph
 modified attribute: score in Graph
 decompose update table
 Graph’ = select src, IT.dest, VT.score
from VT, IT
where VT.src = IT.src 13
Example: PageRank
 Table decomposition
 Remove Graph’ table from query
 discriminate
14
simplification
Subquery lifting
 construct read-only (invariant) views
accessed by step queries
 extract loop-invariant computations by
using unmodified attributes
 Procedure
1. Constant let statement lifting (to initialize
clause)
2. Invariant subquery lifting (to initialize clause)
3. Common subquery elimination with query
rewrite, unnesting, identity query elimination
15
Example: PageRank
16
Invariant subquery lifting
Identity query elimination, VT = Score
Example: k-means
17
table decomposition
query elimination (for VT)
simplification
Automatic incrementalization
 Not all records are updated in iterations.
Purpose is to reuse unmodified tuples in
variant views.
 Procedure
1. Detect delta table between iterations before
starting 1st iteration.
2. Derive incremental queries. Both input and
output are delta tables.
3. Execute queries in incremental mode as much
as possible.
18
Delta table detection
 Delta table is detected easily, since we
have already identified variant views.
 ΔT = T’ – T,
 Update operations for update tables
 insertion
 deletion
 update
19
Deriving incremental queries
 Many literatures for incremental query
evaluations [9,13, 19]
 We focus on incremental query
evaluation for update operations, since
they are frequent in iterative queries.
20
Deriving incremental queries
 Query:
where step query q, update table T, delta table ΔT,
terminate condition φ
 Suppose q is distributed:
We obtain incremental query:
where ψ is an optional filter
21
Distribution rules
 Rules for relational operators
 selection
 projection
 join
 group-by
22
Example: PageRank
 Remember the query after lifting
 In algebraic form:
23
Example: PageRank
 This is re-written to:
24
Additional rules for group-by
 insertion/deletion rules for group-by
 sum, count: insertion and deletion
 max, min: only for insertion (not distributive for deletion)
25
MapReduce implementation
 We extend Hive for OptIQ
 Iterative query processing
 convergence is tested by joining old and new
update tables
 View materialization
 partition invariant views by group-by/join keys
for efficient group-by/join operations
 Incrementalization
 apply incrementalization as much as possible
 delta table is kept on DFS
 Putting MR design patterns together
26
Experiments
 Purpose
 How effective OptIQ is for real analysis?
 How much errors occur caused by
incrementalization?
 OptIQ is applicable for MapReduce and Spark?
 Environment: 11 computers
 Workload
 Datasets: graph (wikipedia, web graph),
multidimensional data (US cencus, mnist8m)
 Analysis: PageRank, RWR, k-means clustering
27
PageRank: performance
28
PageRank: convergence
29
k-means: performance
30
k-means: convergence
31
Related work
 Iterative MapReduce runtime system
 Twister: iterative MR computation
 Iterative mapReduce programming models
 HaLoop: manual view caching
 iMapReuce:
 Spark: in-memory cluster computing for iterative
applications, manual optimization for map-side join
 Pregel: Bulk synchronous parallel model
 GraphLab: Distributed graph computation model
 PEGAUS: matrix multiplication model on MapReduce
32
Related work cont.
 Declaratiave MapReduce programming
 HiveQL and Pig : SQL on MapReduce
 HadoopDB: Integration of RBMS and MapReduce
 MRQL: iterative query language, algebraic/MR-level
optimization; map fusion, join/group-by fusion
 Query optimization in MapReduce
 Comet: algebraic-level (shared selection, grouping,
time-spanned views) and MR-level sharing (shared
scan, shared shuffle)
 Ysmart: sharing among group-by and joins
 REX: explicit incremental computation
33
Conclusion
 OptIQ is optimization for iterative queries
with convergence property
 Two techniques for removing redundancy
 view materialization for invariant views
 incrementalization for variant views
 We implement on Hive and Spark
 OptIQ improves the performance up to five
times faster
34
Future work
 Apply OptIQ to another analysis: NMF, affinity
propagation, logistic regression
 adaptive and incremental evaluation techniques
for matrix computation, such as PageRank, NMF,
centrality computation
35

More Related Content

What's hot

SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...Kyong-Ha Lee
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...NECST Lab @ Politecnico di Milano
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Mapreduce: Theory and implementation
Mapreduce: Theory and implementationMapreduce: Theory and implementation
Mapreduce: Theory and implementationSri Prasanna
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinityShashwat Shriparv
 
Map reduce in Hadoop
Map reduce in HadoopMap reduce in Hadoop
Map reduce in Hadoopishan0019
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsZubair Nabi
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmNilaNila16
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceLeonidas Akritidis
 
Object Detection & Machine Learning Paper
 Object Detection & Machine Learning Paper Object Detection & Machine Learning Paper
Object Detection & Machine Learning PaperJoseph Mogannam
 
Map reduce (from Google)
Map reduce (from Google)Map reduce (from Google)
Map reduce (from Google)Sri Prasanna
 
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex PerrierAlexis Perrier
 
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONSADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONSijcses
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operationSubhas Kumar Ghosh
 
Graph Matching
Graph MatchingGraph Matching
Graph Matchinggraphitech
 

What's hot (19)

SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Mapreduce
MapreduceMapreduce
Mapreduce
 
Mapreduce: Theory and implementation
Mapreduce: Theory and implementationMapreduce: Theory and implementation
Mapreduce: Theory and implementation
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
 
Map reduce in Hadoop
Map reduce in HadoopMap reduce in Hadoop
Map reduce in Hadoop
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
 
Object Detection & Machine Learning Paper
 Object Detection & Machine Learning Paper Object Detection & Machine Learning Paper
Object Detection & Machine Learning Paper
 
Map reduce (from Google)
Map reduce (from Google)Map reduce (from Google)
Map reduce (from Google)
 
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONSADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
 
Graph Matching
Graph MatchingGraph Matching
Graph Matching
 

Viewers also liked

06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clusteringSubhas Kumar Ghosh
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringIJRES Journal
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticALTIC Altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clusteringmobius.cn
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Varad Meru
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...Victor Giannakouris
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Titus Damaiyanti
 

Viewers also liked (16)

06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
 
MachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_SparkMachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_Spark
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Incremental clustering in search engines
Incremental clustering in search enginesIncremental clustering in search engines
Incremental clustering in search engines
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
 

Similar to Optimization for iterative queries on Mapreduce

Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspectiveপল্লব রায়
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentationNoha Elprince
 
HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)
HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)
HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)Subhajit Sahu
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit
 
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to PracticeWorkflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to PracticeFrederic Desprez
 
Ed Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual AnalysisEd Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual AnalysisVolha Banadyseva
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_TrushitaTrushita Redij
 
On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)Yu Liu
 
Scalable frequent itemset mining using heterogeneous computing par apriori a...
Scalable frequent itemset mining using heterogeneous computing  par apriori a...Scalable frequent itemset mining using heterogeneous computing  par apriori a...
Scalable frequent itemset mining using heterogeneous computing par apriori a...ijdpsjournal
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingIRJET Journal
 
Download It
Download ItDownload It
Download Itbutest
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Databricks
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Dimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxDimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxSivam Chinna
 

Similar to Optimization for iterative queries on Mapreduce (20)

Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspective
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
T180304125129
T180304125129T180304125129
T180304125129
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)
HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)
HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
 
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to PracticeWorkflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
 
Ed Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual AnalysisEd Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual Analysis
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)
 
Scalable frequent itemset mining using heterogeneous computing par apriori a...
Scalable frequent itemset mining using heterogeneous computing  par apriori a...Scalable frequent itemset mining using heterogeneous computing  par apriori a...
Scalable frequent itemset mining using heterogeneous computing par apriori a...
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
 
Download It
Download ItDownload It
Download It
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Dimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxDimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptx
 
E05312426
E05312426E05312426
E05312426
 

Recently uploaded

MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 

Recently uploaded (20)

MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 

Optimization for iterative queries on Mapreduce

  • 1. Makoto Onizuka, Hiroyuki Kato, Soichiro Hidaka, Keisuke Nakano, Zhenjiang Hu 1
  • 2. Demand for Big Data Analysis  Big Data Analysis  Cyber space: Click log, query log  Real space: shopping log, sensing data  Machine learning  Algorithm: classification, clustering  Data type: relation, vector, graph, time series  Distributed computing framework  Interface: MPI, MapReduce, BSP (bulk synchronous parallel) 2
  • 3. Iterative analysis examples  Clustering  Partitioning: k-means, EM-algorithm, affinity propagation  Hierarchical clustering: Ward's method, BIRCH  Matrix factorization  Graph mining  PageRank, Random walk with restarts 3
  • 4. Running example: PageRank This program is not efficient. Which parts? 4 map function shuffles whole graph structure in every iteration scores are computed even if the nodes have converged
  • 5. Issues for iterative analysis  How to optimize the program?  Reusing the intermediate (shuffled) data  Skip computing the scores of converted nodes  Possible but difficult to manually remove the above redundant computations  Actually, Spark, HaLoop, REX force programmers to manually remove them  Our goal: Automatically remove redundant computations for iterative queries 5
  • 6. Overview  OptIQ is a new optimization framework for iterative queries with convergence property  Declarative high level language; programmers are freed from burden of removing redundancy  OptIQ Integrates traditional optimization techniques in database and compiler areas  Two techniques for removing redundancy  view materialization for invariant views  incrementalization for variant views  We implement on Hive and Spark 6
  • 7. Iterative query language  SQL extended with iteration  Syntax  Behavior  initialize: statements before iteration  update table is updated by step query repeatedly until convergence  return: statements after iteration 7
  • 10. Query Optimization  Goal: remove redundant computations  Question: What is redundant computation?  Operations on unmodified attributes of tuples  Operations on attributes of unmodified tuples  OptIQ reuses partial results of step queries  View materialization reuses operations on unmodified attributes  incrementalization reuses operations on unmodified tuples 10
  • 12. View materialization  Purpose is to reuse unmodified attributes of update table during iterations  Procedure 1. Decompose update table into variant and invariant tables by conservative analysis 2. Materialize sub-query in step query that only accesses invariant table 3. Rewrite step query to use materialized view, query processing using view 12
  • 13. Table decomposition  discriminate modified/unmodified attributes  unmodified attribute: src, dest in Graph  modified attribute: score in Graph  decompose update table  Graph’ = select src, IT.dest, VT.score from VT, IT where VT.src = IT.src 13
  • 14. Example: PageRank  Table decomposition  Remove Graph’ table from query  discriminate 14 simplification
  • 15. Subquery lifting  construct read-only (invariant) views accessed by step queries  extract loop-invariant computations by using unmodified attributes  Procedure 1. Constant let statement lifting (to initialize clause) 2. Invariant subquery lifting (to initialize clause) 3. Common subquery elimination with query rewrite, unnesting, identity query elimination 15
  • 16. Example: PageRank 16 Invariant subquery lifting Identity query elimination, VT = Score
  • 17. Example: k-means 17 table decomposition query elimination (for VT) simplification
  • 18. Automatic incrementalization  Not all records are updated in iterations. Purpose is to reuse unmodified tuples in variant views.  Procedure 1. Detect delta table between iterations before starting 1st iteration. 2. Derive incremental queries. Both input and output are delta tables. 3. Execute queries in incremental mode as much as possible. 18
  • 19. Delta table detection  Delta table is detected easily, since we have already identified variant views.  ΔT = T’ – T,  Update operations for update tables  insertion  deletion  update 19
  • 20. Deriving incremental queries  Many literatures for incremental query evaluations [9,13, 19]  We focus on incremental query evaluation for update operations, since they are frequent in iterative queries. 20
  • 21. Deriving incremental queries  Query: where step query q, update table T, delta table ΔT, terminate condition φ  Suppose q is distributed: We obtain incremental query: where ψ is an optional filter 21
  • 22. Distribution rules  Rules for relational operators  selection  projection  join  group-by 22
  • 23. Example: PageRank  Remember the query after lifting  In algebraic form: 23
  • 24. Example: PageRank  This is re-written to: 24
  • 25. Additional rules for group-by  insertion/deletion rules for group-by  sum, count: insertion and deletion  max, min: only for insertion (not distributive for deletion) 25
  • 26. MapReduce implementation  We extend Hive for OptIQ  Iterative query processing  convergence is tested by joining old and new update tables  View materialization  partition invariant views by group-by/join keys for efficient group-by/join operations  Incrementalization  apply incrementalization as much as possible  delta table is kept on DFS  Putting MR design patterns together 26
  • 27. Experiments  Purpose  How effective OptIQ is for real analysis?  How much errors occur caused by incrementalization?  OptIQ is applicable for MapReduce and Spark?  Environment: 11 computers  Workload  Datasets: graph (wikipedia, web graph), multidimensional data (US cencus, mnist8m)  Analysis: PageRank, RWR, k-means clustering 27
  • 32. Related work  Iterative MapReduce runtime system  Twister: iterative MR computation  Iterative mapReduce programming models  HaLoop: manual view caching  iMapReuce:  Spark: in-memory cluster computing for iterative applications, manual optimization for map-side join  Pregel: Bulk synchronous parallel model  GraphLab: Distributed graph computation model  PEGAUS: matrix multiplication model on MapReduce 32
  • 33. Related work cont.  Declaratiave MapReduce programming  HiveQL and Pig : SQL on MapReduce  HadoopDB: Integration of RBMS and MapReduce  MRQL: iterative query language, algebraic/MR-level optimization; map fusion, join/group-by fusion  Query optimization in MapReduce  Comet: algebraic-level (shared selection, grouping, time-spanned views) and MR-level sharing (shared scan, shared shuffle)  Ysmart: sharing among group-by and joins  REX: explicit incremental computation 33
  • 34. Conclusion  OptIQ is optimization for iterative queries with convergence property  Two techniques for removing redundancy  view materialization for invariant views  incrementalization for variant views  We implement on Hive and Spark  OptIQ improves the performance up to five times faster 34
  • 35. Future work  Apply OptIQ to another analysis: NMF, affinity propagation, logistic regression  adaptive and incremental evaluation techniques for matrix computation, such as PageRank, NMF, centrality computation 35