Generate-Test-and-Aggregate is a class of algorithms from which efficient MapReduce programs can be derived automatically.
MapReduce is a useful and popular programming model for large-scale parallel processing. However, for many complex problems it is not easy to develop efficient parallel algorithms that fit the MapReduce paradigm well.
The generator-based parallelization approach was introduced to simplify parallel programming through automatic program generation and optimization. Efficient parallel algorithms are generated from users' naive but correct programs by generators that exploit optimization theorems from the field of skeletal parallel programming. The resulting algorithms are in a form that is well suited to implementation with MapReduce.
With this approach, a large class of generate-and-test-like computations can be programmed and computed efficiently over MapReduce. A novel programming interface and framework can therefore be built on top of MapReduce to address the difficulties of programmability and efficiency. In this paper we introduce a framework with such a programming interface for MapReduce. With this framework, users can concentrate on writing naive but correct programs. We show that many generate-and-test-like computations can be implemented easily and efficiently with this framework over MapReduce.
Implementing Generate-Test-and-Aggregate Algorithms on Hadoop
Background
Motivation and Objective
Design and implementation
Performance test
Conclusion and future work
Yu Liu¹, Sebastian Fischer², Kento Emoto³, and Zhenjiang Hu⁴
¹ The Graduate University for Advanced Studies
²,⁴ National Institute of Informatics
³ University of Tokyo
September 28, 2011
MapReduce
GTA algorithm
Parallelization of GTA algorithm
MapReduce
Computation in three phases: map, shuffle and reduce
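The three phases can be illustrated with a small in-memory sketch. This is not Hadoop code: the class `ThreePhases`, its method names, and the word-count task are assumptions chosen only to show how data flows through map, shuffle, and reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the three MapReduce phases (hypothetical names, no Hadoop API).
class ThreePhases {
    // Map phase: turn each input record into (key, value) pairs.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values that share the same key.
    static Map<String, List<Integer>> shufflePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce phase: collapse each group of values into one result per key.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) {
                sum += v;
            }
            counts.put(g.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        return reducePhase(shufflePhase(mapPhase(lines)));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

On a real cluster the map and reduce steps run on many nodes and the shuffle moves data between them; the sketch only mirrors the data flow.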
Programming with MapReduce
Programmers need to implement the following classes (Hadoop)
The main difficulties of MapReduce programming:
Nontrivial problems are usually difficult to solve in a divide-and-conquer fashion.
Efficient parallel algorithms are difficult to obtain.
Generate Test and Aggregate Algorithm
A Generate-Test-and-Aggregate (GTA for short) algorithm consists of three steps:
generate produces all possible solution candidates.
test filters the candidates, keeping only the valid ones.
aggregate computes a summary of the valid candidates.
GTA is a useful and common strategy for a large class of problems.
An Example: Knapsack Problem
Fill a knapsack with items, each with a certain value and weight, such that
the total value of the packed items is maximal while respecting the weight
limit of the knapsack.
A knapsack program (GTA algorithm):
knapsack = maxvalue ◦ filter ◦ sublists
For example, suppose there are 3 items: (1kg, $1), (1kg, $2), (2kg, $2)
sublists [(1kg, $1), (1kg, $2), (2kg, $2)]
= [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
[(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)]
Suppose the capacity of the knapsack is 2 kg
filter [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
[(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)]
= [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)]
maxvalue [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)]
= $3
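The naive program knapsack = maxvalue ◦ filter ◦ sublists can be written out directly. The following is a minimal sketch, assuming items are encoded as int pairs {weight, value}; the class name `NaiveKnapsack` is hypothetical and this is not the library's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Naive GTA knapsack: generate all sublists, filter by weight, take the maximum value.
// Items are encoded as int[]{weight, value} (an assumption made for this sketch).
class NaiveKnapsack {
    // generate: all 2^n sublists of the item list.
    static List<List<int[]>> sublists(List<int[]> items) {
        List<List<int[]>> result = new ArrayList<>();
        result.add(new ArrayList<int[]>());
        for (int[] item : items) {
            int size = result.size();
            for (int i = 0; i < size; i++) {
                List<int[]> extended = new ArrayList<>(result.get(i));
                extended.add(item);
                result.add(extended);
            }
        }
        return result;
    }

    // test: keep the candidates whose total weight fits the capacity.
    static List<List<int[]>> filter(List<List<int[]>> candidates, int capacity) {
        List<List<int[]>> valid = new ArrayList<>();
        for (List<int[]> candidate : candidates) {
            int weight = 0;
            for (int[] item : candidate) {
                weight += item[0];
            }
            if (weight <= capacity) {
                valid.add(candidate);
            }
        }
        return valid;
    }

    // aggregate: the maximum total value over all remaining candidates.
    static int maxvalue(List<List<int[]>> candidates) {
        int best = Integer.MIN_VALUE;
        for (List<int[]> candidate : candidates) {
            int value = 0;
            for (int[] item : candidate) {
                value += item[1];
            }
            best = Math.max(best, value);
        }
        return best;
    }

    static int knapsack(List<int[]> items, int capacity) {
        return maxvalue(filter(sublists(items), capacity));
    }

    public static void main(String[] args) {
        List<int[]> items = Arrays.asList(new int[]{1, 1}, new int[]{1, 2}, new int[]{2, 2});
        System.out.println(knapsack(items, 2)); // prints 3
    }
}
```

For the three items of the example, `sublists` yields 8 candidates, `filter` keeps 5 of them, and `maxvalue` returns 3.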
This program is simple but inefficient because it generates an
exponential number (2^n) of intermediate candidates for n items.
Theorems for Generating Efficient Parallel GTA Programs
Efficient parallel programs can be derived from users' naive but correct
programs written in terms of a generate, a test, and an aggregate
function [Emoto et al., 2011]:
aggregate ◦ test ◦ generate ⇒ list homomorphism
List homomorphisms are a class of recursive functions that match the
divide-and-conquer paradigm very well [Bird, 87; Cole, 95].
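To make the target of the derivation concrete, here is a hand-written sketch of a list homomorphism for the knapsack example. It is an illustration of the kind of program the fusion is meant to produce, not the output of the library. The class `FusedKnapsack`, the table representation, and the encoding of items as int pairs {weight, value} with non-negative weights and values are all assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Knapsack as a list homomorphism: each sublist of items is summarized by a table
// t[w] = best total value over selections of total weight exactly w (w <= capacity).
class FusedKnapsack {
    static final long NONE = Long.MIN_VALUE / 4; // marks "no selection with this weight"

    // h [ ]: only the empty selection, weight 0 and value 0.
    static long[] identity(int capacity) {
        long[] table = new long[capacity + 1];
        Arrays.fill(table, NONE);
        table[0] = 0;
        return table;
    }

    // h [a]: either skip the item or take it (if it fits at all).
    static long[] element(int[] item, int capacity) {
        long[] table = identity(capacity);
        if (item[0] <= capacity) {
            table[item[0]] = Math.max(table[item[0]], item[1]);
        }
        return table;
    }

    // h (x ++ y) = h x (.) h y: combine tables by adding weights and values (max-plus convolution).
    static long[] combine(long[] left, long[] right) {
        long[] out = new long[left.length];
        Arrays.fill(out, NONE);
        for (int i = 0; i < left.length; i++) {
            if (left[i] == NONE) continue;
            for (int j = 0; i + j < right.length; j++) {
                if (right[j] == NONE) continue;
                out[i + j] = Math.max(out[i + j], left[i] + right[j]);
            }
        }
        return out;
    }

    // Final step: the best value over all admissible weights.
    static long postprocess(long[] table) {
        long best = NONE;
        for (long v : table) {
            best = Math.max(best, v);
        }
        return best;
    }

    // Divide-and-conquer evaluation; the two halves are independent and could run in parallel.
    static long[] hom(List<int[]> items, int lo, int hi, int capacity) {
        if (hi - lo == 0) return identity(capacity);
        if (hi - lo == 1) return element(items.get(lo), capacity);
        int mid = (lo + hi) >>> 1;
        return combine(hom(items, lo, mid, capacity), hom(items, mid, hi, capacity));
    }

    static long knapsack(List<int[]> items, int capacity) {
        return postprocess(hom(items, 0, items.size(), capacity));
    }

    public static void main(String[] args) {
        List<int[]> items = Arrays.asList(new int[]{1, 1}, new int[]{1, 2}, new int[]{2, 2});
        System.out.println(knapsack(items, 2)); // prints 3
    }
}
```

In this sketch each intermediate value has only capacity + 1 entries, so the work grows with the number of items and the capacity rather than with the 2^n candidates of the naive program.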
Emoto's theorem holds under the following assumptions:
aggregate is a semiring homomorphism.
test is a list homomorphism.
generate is polymorphic over semiring structures.
Motivation and Objective
Emoto's fusion theorem shows a possible way to
systematically implement efficient parallel programs with GTA
algorithms
We need to evaluate this approach by
implementing a practical library, which should
have an easy-to-use programming interface that helps users design
GTA algorithms
be able to generate efficient parallel programs on MapReduce
(Hadoop)
System Overview
Implementation on Hadoop
MapReducer is an interface for list homomorphisms
h [ ] = id⊕
h [a] = f a
h (x ++ y) = h x ⊕ h y
public interface MapReducer<Elem, Val, Res> {
    public Val identity();                     // id⊕
    public Val element(Elem elem);             // f a
    public Val combine(Val left, Val right);   // ⊕
    public Res postprocess(Val val);           // converts the accumulated Val into the final Res
}
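As a minimal sketch of how such an interface might be used, the block below implements summation as a list homomorphism and evaluates it by splitting the list in halves. The interface is repeated so that the example is self-contained; the classes `SumExample` and `Sum` and the helper `run` are hypothetical and not part of the library.

```java
import java.util.Arrays;
import java.util.List;

class SumExample {
    // Same shape as the interface on the slide, nested here to keep the sketch self-contained.
    interface MapReducer<Elem, Val, Res> {
        Val identity();
        Val element(Elem elem);
        Val combine(Val left, Val right);
        Res postprocess(Val val);
    }

    // sum [ ] = 0, sum [a] = a, sum (x ++ y) = sum x + sum y
    static class Sum implements MapReducer<Integer, Long, Long> {
        public Long identity() { return 0L; }
        public Long element(Integer elem) { return elem.longValue(); }
        public Long combine(Long left, Long right) { return left + right; }
        public Long postprocess(Long val) { return val; }
    }

    // Hypothetical sequential evaluator: applies the three homomorphism equations recursively.
    static <E, V, R> R run(MapReducer<E, V, R> h, List<E> xs) {
        return h.postprocess(fold(h, xs));
    }

    private static <E, V> V fold(MapReducer<E, V, ?> h, List<E> xs) {
        if (xs.isEmpty()) return h.identity();
        if (xs.size() == 1) return h.element(xs.get(0));
        int mid = xs.size() / 2;
        // The two halves do not depend on each other, which is what makes parallel evaluation possible.
        return h.combine(fold(h, xs.subList(0, mid)), fold(h, xs.subList(mid, xs.size())));
    }

    public static void main(String[] args) {
        System.out.println(run(new Sum(), Arrays.asList(1, 2, 3, 4, 5))); // prints 15
    }
}
```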
Aggregator defines a semiring homomorphism
(A, ⊕, ⊗) → (S, ⊕′, ⊗′)
public interface Aggregator<A, S> {
    public S zero();                     // identity of plus
    public S one();                      // identity of times
    public S singleton(A a);             // image of a single element
    public S plus(S left, S right);      // ⊕′
    public S times(S left, S right);     // ⊗′
}
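One possible Aggregator for the knapsack example is the max-plus semiring, where plus takes the maximum and times adds values. The sketch below is an assumption about how maxvalue could be expressed, not the library's implementation; the names `MaxSumExample`, `MaxSum`, and `NEG_INF` are hypothetical, and elements are plain item values.

```java
class MaxSumExample {
    // Same shape as the interface on the slide, nested here to keep the sketch self-contained.
    interface Aggregator<A, S> {
        S zero();
        S one();
        S singleton(A a);
        S plus(S left, S right);
        S times(S left, S right);
    }

    // Max-plus semiring: plus = max, times = +, with NEG_INF standing in for minus infinity.
    static class MaxSum implements Aggregator<Integer, Long> {
        static final long NEG_INF = Long.MIN_VALUE;

        public Long zero() { return NEG_INF; }   // no candidate at all
        public Long one() { return 0L; }         // the empty candidate has value 0
        public Long singleton(Integer value) { return value.longValue(); }
        public Long plus(Long left, Long right) { return Math.max(left, right); }
        public Long times(Long left, Long right) {
            // NEG_INF must stay a zero of times: zero * a = a * zero = zero.
            return (left == NEG_INF || right == NEG_INF) ? NEG_INF : left + right;
        }
    }

    public static void main(String[] args) {
        MaxSum s = new MaxSum();
        // Best of two candidates: the list [1, 2] (value 1 + 2) and the list [2] (value 2).
        System.out.println(s.plus(s.times(s.singleton(1), s.singleton(2)), s.singleton(2))); // prints 3
    }
}
```

Here plus chooses between alternative candidates and times extends a candidate with further elements, which mirrors how the candidates themselves are built from union and concatenation.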
Test is almost a list homomorphism; it extends MapReducer
public interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}
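A weight-limit test for the knapsack example might look like the sketch below. The key is the total weight capped at capacity + 1, so only finitely many keys can occur. The interfaces are repeated to keep the example self-contained; `WeightTestExample`, `WeightLimit`, and the encoding of items as int pairs {weight, value} with non-negative weights are assumptions.

```java
class WeightTestExample {
    // Same shapes as the interfaces on the slides, nested here to keep the sketch self-contained.
    interface MapReducer<Elem, Val, Res> {
        Val identity();
        Val element(Elem elem);
        Val combine(Val left, Val right);
        Res postprocess(Val val);
    }

    interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}

    // Accepts a list of items iff its total weight is at most the capacity.
    static class WeightLimit implements Test<int[], Integer> {
        private final int capacity;

        WeightLimit(int capacity) { this.capacity = capacity; }

        public Integer identity() { return 0; }
        // Weights above the capacity all collapse to the same key, capacity + 1.
        public Integer element(int[] item) { return Math.min(item[0], capacity + 1); }
        // Capped addition is associative for non-negative weights.
        public Integer combine(Integer left, Integer right) {
            return Math.min(left + right, capacity + 1);
        }
        public Boolean postprocess(Integer key) { return key <= capacity; }
    }

    public static void main(String[] args) {
        WeightLimit test = new WeightLimit(2);
        Integer key = test.combine(test.element(new int[]{1, 1}), test.element(new int[]{1, 2}));
        System.out.println(test.postprocess(key)); // prints true: total weight 2 fits capacity 2
    }
}
```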
Generator implements MapReducer
polymorphic over semirings: the constructor takes an Aggregator
filter embedding: the embed function returns a new Generator
public abstract class Generator<Elem, Single, Val, Res>
        implements MapReducer<Elem, Val, Res> {
    // The constructor takes an instance of Aggregator.
    public Generator(Aggregator<Single, Val> aggregator) { ... }

    // Takes an instance of Test and returns a new instance of Generator.
    public <Key> Generator<Elem, Single, WritableMap<Key, Val>, Res>
            embed(final Test<Single, Key> test) {
        final Generator<Elem, Single, Val, Res> base = this;
        return new Generator<Elem, Single, WritableMap<Key, Val>, Res>(
                new Aggregator<Single, WritableMap<Key, Val>>() { ... });
    }

    public Val process(List<Elem> list) { ... }
    ...
}
1. Users write their own Generator, Test, and Aggregator by extending or implementing the ones provided by the library. (Our library provides commonly used Generators and Aggregators.)
2. An instance of Generator, which is also an efficient list homomorphism, is created at run time on each worker node.
3. This list homomorphism instance can be executed by Hadoop in parallel.
Java Code
Let’s have a look at the actual implementation of GTA Knapsack...
Performance Evaluation
Environment: hardware
We configured clusters with 2, 4, 8, 16, and 32 nodes (virtual
machines). Each computing/data node has one CPU (VM, Xeon
E5530 @ 2.4 GHz, 1 core) and 3 GB of memory.
Test data
10^2 × 2^20 (≈ 10^8) knapsack items (3.2 GB)
Each item's weight is between 0 and 10, and the capacity of the
knapsack is 100.
Evaluation on Hadoop
The Knapsack program scales well as the number of cluster nodes increases
Conclusion
The implementation of the GTA library on Hadoop can
hide the technical details of MapReduce (Hadoop)
do parallelization and optimization automatically
generate MapReduce programs that scale well
make coding, testing, and code reuse much simpler
Future Work
Optimizing the current framework for better performance
Extending the current framework
Other approaches to systematic parallel programming
Thanks
Questions?
The project is hosted on
http://screwdriver.googlecode.com
Appendix: Computation on Semirings
Definition (Semiring)
Given a set S and two binary operations ⊕ and ⊗, the triple (S, ⊕, ⊗) is called a
semiring if and only if
(S, ⊕) is a commutative monoid with identity element id⊕
(S, ⊗) is a monoid with identity element id⊗
⊗ is associative and distributes over ⊕
id⊕ is a zero of ⊗: id⊕ ⊗ a = a ⊗ id⊕ = id⊕
(Int, +, ×) is a semiring; (Int ∪ {−∞}, max, +) is another semiring
Definition (Semiring homomorphism)
Given two semirings (S, ⊕, ⊗) and (S′, ⊕′, ⊗′), a function hom : S → S′ is a semiring
homomorphism from (S, ⊕, ⊗) to (S′, ⊕′, ⊗′) iff it is a monoid homomorphism from
(S, ⊕) to (S′, ⊕′) and also a monoid homomorphism from (S, ⊗) to (S′, ⊗′).
Theorem (Filter-Embedding Fusion)
Given a set A, a finite monoid (M, ⊙), a monoid homomorphism hom from ([A], ++)
to (M, ⊙), a semiring (S, ⊕, ⊗), a semiring homomorphism aggregate from
(⟅[A]⟆, ⊎, ×++) to (S, ⊕, ⊗), a function ok : M → Bool, and a polymorphic semiring
generator generate, the following equation holds:
aggregate ◦ filter (ok ◦ hom) ◦ generate_{⊎, ×++} (λx → ⟅[x]⟆)
= postprocess_M ok ◦ generate_{⊕_M, ⊗_M} (λx → aggregate_M ⟅[x]⟆)
The result of fusion is an efficient algorithm in the form of a list
homomorphism.
List Homomorphism
List homomorphisms [Bird, 87; Cole, 95] are a class of recursive
functions.
Definition of List Homomorphism
A function h is a list homomorphism if there is an associative operator ⊙
such that, for any lists x and y,
h (x ++ y) = h(x) ⊙ h(y),
where ++ is list concatenation, h [a] = f a, and h(x) ⊙ id⊙ = h(x), with id⊙ the identity element of ⊙.
Instance of a list homomorphism
sum [a] = a
sum (x ++ y) = sum x + sum y.
A list homomorphism can be automatically parallelized with
MapReduce [Yu et al., EuroPar11].
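The reason a list homomorphism parallelizes is that the list can be split anywhere and the partial results merged with the associative operator. The sketch below shows this with Java's parallel streams rather than MapReduce, which is a deliberate substitution to keep the example small; the class `ParallelSum` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Parallel evaluation of the list homomorphism sum, using Java parallel streams (not MapReduce).
class ParallelSum {
    static long sum(List<Integer> xs) {
        return xs.parallelStream()
                 .map(Integer::longValue)   // h [a] = f a
                 .reduce(0L, Long::sum);    // identity 0 and associative operator +
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1, 2, 3, 4))); // prints 10
    }
}
```

The runtime is free to split the list into chunks in any way; associativity of the operator and the identity element guarantee the same result as a sequential fold.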
Evaluation on Hadoop
We tested 3.2 GB of data on clusters of {2, 4, 8, 16, 32} nodes and 32
GB of data on clusters of {32, 64} nodes

             2 nodes  4 nodes  8 nodes  16 nodes  32 nodes  64 nodes
time (sec.)     1602      882      482       317       961       511
speedup            –   × 1.82   × 1.83    × 1.52         –    × 1.88

Each speedup is relative to the column to its left; – marks a baseline.
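The speedup figures can be reproduced from the reported times as the ratio of the time on the smaller cluster to the time on the next larger one. The sketch below assumes that the 32-node and 64-node times belong to the 32 GB run, which is an interpretation of the table rather than something the slide states per column; the class `SpeedupTable` is hypothetical.

```java
import java.util.Locale;

class SpeedupTable {
    // Speedup when doubling the cluster: time before / time after.
    static double speedup(double secondsBefore, double secondsAfter) {
        return secondsBefore / secondsAfter;
    }

    public static void main(String[] args) {
        // Times in seconds as reported on the slide.
        int[] smallRun = {1602, 882, 482, 317}; // 2, 4, 8, 16 nodes
        int[] largeRun = {961, 511};            // 32, 64 nodes (assumed to be the 32 GB data)
        for (int i = 1; i < smallRun.length; i++) {
            System.out.println(String.format(Locale.ROOT, "x %.2f", speedup(smallRun[i - 1], smallRun[i])));
        }
        System.out.println(String.format(Locale.ROOT, "x %.2f", speedup(largeRun[0], largeRun[1])));
    }
}
```

A ratio close to 2 when the node count doubles indicates near-linear scaling; the step from 8 to 16 nodes falls noticeably short of that.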