Topic 7: Shortcomings in the MapReduce Paradigm
Zubair Nabi
zubair.nabi@itu.edu.pk
April 19, 2013
Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
Hadoop everywhere!
Users
Adobe: Several areas from social services to unstructured data storage and processing
eBay: 532-node cluster storing 5.3PB of data
Facebook: Used for reporting/analytics; one cluster with 1100 nodes (12PB) and another with 300 nodes (3PB)
LinkedIn: 3 clusters with collectively 4000 nodes
Twitter: To store and process Tweets and log files
Yahoo!: Multiple clusters with collectively 40000 nodes; the largest cluster has 4500 nodes!
Source: http://wiki.apache.org/hadoop/PoweredBy
But all is not well
Over the years, Hadoop has become a one-size-fits-all solution to data-intensive computing
As early as 2008, David DeWitt and Michael Stonebraker asserted that MapReduce was a "major step backwards" for data-intensive computing
They opined:
MapReduce is a major step backwards in database access because it negates schema and is too low-level
It has a sub-optimal implementation: it uses brute force instead of indexing, does not handle skew, and pulls data instead of pushing it
It is just rehashing old database concepts
It is missing most DBMS functionality, such as updates, transactions, etc.
It is incompatible with DBMS tools, such as human visualization, data replication from one DBMS to another, etc.
Skew
Introduction
Due to the uneven distribution of intermediate key/value pairs, some reduce workers end up doing more work
Such reducers become "stragglers"
A large number of real-world applications follow long-tailed (Zipf-like) distributions
Wordcount and skew
Text corpora have a Zipfian skew, i.e. a very small number of words account for most occurrences
For instance, of the 242,758 words in the dataset used to generate the figure, the 10, 100, and 1000 most frequent words account for 22%, 43%, and 64% of the entire set
Such skewed intermediate results lead to an uneven distribution of workload across reduce workers
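The resulting imbalance can be sketched in a few lines of Python. The vocabulary, harmonic weights, and ten-reducer setup below are invented for illustration (they are not the lecture's dataset); the point is that hash partitioning spreads keys evenly but not the values attached to them:

```python
import collections
import random

random.seed(42)

# Hypothetical Zipf-like corpus: the k-th most frequent word is drawn
# with weight proportional to 1/k (harmonic weights).
vocab = [f"word{k}" for k in range(1, 1001)]
weights = [1.0 / k for k in range(1, 1001)]
corpus = random.choices(vocab, weights=weights, k=100_000)

# Wordcount map phase: emit (word, 1) and route each key to one of
# 10 reduce workers with a default-style hash partitioner.
NUM_REDUCERS = 10
load = collections.Counter()
for word in corpus:
    load[hash(word) % NUM_REDUCERS] += 1

# The few head words concentrate most pairs on whichever reducers
# happen to own them.
counts = collections.Counter(corpus)
head = sum(c for _, c in counts.most_common(10))
print(f"top-10 words cover {head / len(corpus):.0%} of all pairs")
print(f"reducer load: max={max(load.values())}, min={min(load.values())}")
```

With harmonic weights over 1000 words, the ten head words alone account for roughly 40% of all intermediate pairs, so the reducer that owns the most frequent word is guaranteed far more work than the average.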
Page rank and skew
Even Google's implementation of its core PageRank algorithm is plagued by the skew problem
Google uses PageRank to calculate a webpage's relevance for a given search query
Map: Emit the outlinks for each page
Reduce: Calculate rank per page
The skew in intermediate data exists due to the huge disparity in the number of incoming links across pages on the Internet
The scale of the problem is evident when we consider that Google currently indexes more than 25 billion webpages with skewed links
For instance, Facebook has 49,376,609 incoming links (at the time of writing) while the personal webpage of the presenter has only 4 :)
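A toy version of the two-phase structure described above makes the skew visible. The five-page link graph is invented (production graphs are of course billions of pages), with a "hub" page that attracts most inlinks, so the reducer responsible for it receives most of the intermediate values:

```python
import collections

# Hypothetical link graph: page -> outlinks. "hub" draws most inlinks,
# mirroring the skewed in-degree distribution of the real web.
graph = {
    "hub": ["a"],
    "a":   ["hub"],
    "b":   ["hub"],
    "c":   ["hub", "a"],
    "d":   ["hub"],
}
rank = {page: 1.0 / len(graph) for page in graph}

def map_phase(page, outlinks):
    # Map: emit each page's rank share along its outlinks.
    for target in outlinks:
        yield target, rank[page] / len(outlinks)

def reduce_phase(pairs, damping=0.85):
    # Reduce: sum contributions per page. The reducer owning "hub"
    # receives far more values than the others -- that is the skew.
    contribs = collections.defaultdict(float)
    for target, share in pairs:
        contribs[target] += share
    n = len(graph)
    return {p: (1 - damping) / n + damping * contribs.get(p, 0.0)
            for p in graph}

pairs = [kv for page, out in graph.items() for kv in map_phase(page, out)]
new_rank = reduce_phase(pairs)
print(sorted(new_rank, key=new_rank.get, reverse=True)[0])  # prints "hub"
```

After one iteration the hub page dominates the ranking, and the list of (target, share) pairs shows four of six intermediate values landing on a single key.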
Zipf distributions are everywhere
They also appear in inverted indexing, publish/subscribe systems, fraud detection, and various clustering algorithms
P2P systems have Zipf distributions too, both in terms of users and content
So do web caching schemes, as well as email and social networks
Heterogeneous Environment
Introduction
In the MapReduce model, tasks which take exceptionally long are labelled "stragglers"
The framework launches a speculative copy of each straggler on another machine, expecting it to finish quickly
Without this, the overall job completion time is dictated by the slowest straggler
On Google clusters, speculative execution can reduce job completion time by 44%
Hadoop's assumptions regarding speculation
1 All nodes are equal, i.e. they can perform work at more or less the same rate
2 Tasks make progress at a constant rate throughout their lifetime
3 There is no cost to launching a speculative copy on an otherwise idle slot/node
4 The progress score of a task captures the fraction of its total work that it has done. Specifically, the shuffle, merge, and reduce logic phases each take roughly 1/3 of the total time
5 As tasks finish in waves, a task with a low progress score is most likely a straggler
6 Tasks within the same phase require roughly the same amount of work
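For concreteness, the detection rule these assumptions feed can be sketched as follows. This mirrors the native Hadoop heuristic as described by Zaharia et al. in reference 2 (speculate a task whose progress score falls more than 0.2 below the average, provided it has run for at least a minute); the task data is made up:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    progress: float   # progress score in [0, 1]
    runtime_s: float  # seconds since launch

def speculation_candidates(tasks, threshold=0.2, min_runtime_s=60.0):
    # A task is a candidate when its score is more than `threshold`
    # below the average and it has run long enough to be judged.
    avg = sum(t.progress for t in tasks) / len(tasks)
    return [t.name for t in tasks
            if avg - t.progress > threshold and t.runtime_s >= min_runtime_s]

tasks = [
    Task("t1", 0.95, 120), Task("t2", 0.90, 120),
    Task("t3", 0.92, 120), Task("t4", 0.35, 120),  # the straggler
]
print(speculation_candidates(tasks))  # ['t4']
```

The slides that follow show how each of the six assumptions behind this rule breaks down in practice.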
Assumptions 1 and 2
1 All nodes are equal, i.e. they can perform work at more or less the same rate
2 Tasks make progress at a constant rate throughout their lifetime
Both break down in heterogeneous environments, which consist of multiple generations of hardware
Assumption 3
3 There is no cost to launching a speculative copy on an otherwise idle slot/node
Breaks down due to shared resources
Assumption 4
4 The progress score of a task captures the fraction of its total work that it has done. Specifically, the shuffle, merge, and reduce logic phases each take roughly 1/3 of the total time
Breaks down due to the fact that in reduce tasks the shuffle phase takes the longest time, as opposed to the other two
Assumption 5
5 As tasks finish in waves, a task with a low progress score is most likely a straggler
Breaks down because task completion is spread out across time due to uneven workload
Assumption 6
6 Tasks within the same phase require roughly the same amount of work
Breaks down due to data skew
Low-level Programming Interface
Introduction
The one-input, two-stage data flow is extremely rigid for ad-hoc analysis of large datasets
Hacks need to be put into place for different data flows, such as joins or multiple stages
Custom code has to be written for common DB operations, such as projection and filtering
The opaque nature of map and reduce functions makes it impossible to perform optimizations, such as operator reordering
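To make the low-level-ness concrete, here is roughly what a one-line relational query, say SELECT name FROM people WHERE age > 30, costs when handwritten against the raw two-phase interface. The records and field names are invented, and the in-memory grouping stands in for the framework's shuffle:

```python
def map_fn(record):
    # Projection and filtering are custom user code inside map().
    name, age = record["name"], record["age"]
    if age > 30:
        yield name, None

def reduce_fn(key, values):
    # Identity reduce: the model forces a second phase even when
    # the query does not need one.
    yield key

people = [{"name": "ada", "age": 36}, {"name": "bob", "age": 25},
          {"name": "eve", "age": 41}]

# Stand-in for the shuffle: group intermediate pairs by key.
intermediate = {}
for rec in people:
    for k, v in map_fn(rec):
        intermediate.setdefault(k, []).append(v)

result = sorted(k for key, vals in intermediate.items()
                for k in reduce_fn(key, vals))
print(result)  # ['ada', 'eve']
```

Because the framework sees map_fn and reduce_fn only as opaque functions, it cannot tell that this is a selection followed by a projection, so optimizations like pushing the filter into the scan are out of reach.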
Strictly Batch-processing
Introduction
In the case of MapReduce, the entire output of a map or a reduce task needs to be materialized to local storage before the next stage can commence
This simplifies fault-tolerance
Reducers have to pull their input instead of the mappers pushing it
This negates pipelining, result estimation, and continuous queries (stream processing)
Single-input/single output and Two-phase
Introduction
1 Not all applications can be broken down into just two phases, such as complex SQL-like queries
2 Tasks take in just one input and produce one output
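The single-input restriction is usually worked around with a tagging hack: both datasets are pushed through one map phase, each record labelled with its source relation, and the reduce phase reassembles the join. The relations and fields below are invented:

```python
# Two hypothetical relations that we want to join on user id.
orders = [("u1", "book"), ("u2", "pen")]
users  = [("u1", "Alice"), ("u2", "Bob")]

def map_fn():
    # Tag every record with its originating relation so the reducer
    # can tell the two inputs apart.
    for uid, item in orders:
        yield uid, ("order", item)
    for uid, name in users:
        yield uid, ("user", name)

def reduce_fn(uid, tagged):
    # Reassemble the join from the tagged values grouped per key.
    names = [v for tag, v in tagged if tag == "user"]
    items = [v for tag, v in tagged if tag == "order"]
    for name in names:
        for item in items:
            yield uid, name, item

# Stand-in for the shuffle: group tagged pairs by key.
groups = {}
for k, v in map_fn():
    groups.setdefault(k, []).append(v)

joined = sorted(t for uid in groups for t in reduce_fn(uid, groups[uid]))
print(joined)  # [('u1', 'Alice', 'book'), ('u2', 'Bob', 'pen')]
```

This works, but the join logic is smeared across user code rather than expressed declaratively, which is exactly the rigidity the slide complains about.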
Iterative and Recursive Applications
Introduction
1 Hadoop is widely employed for iterative computations
2 For machine learning applications, the Apache Mahout library is used atop Hadoop
3 Mahout uses an external driver program to submit multiple jobs to Hadoop and perform a convergence test
4 No fault-tolerance across iterations, and the overhead of repeated job submission
5 Loop-invariant data is materialized to storage on every iteration
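The driver pattern can be sketched as follows. Here run_job is a stand-in for a full Hadoop job submission (its arithmetic is a placeholder fixed-point step, not Mahout code); the structural point is that the loop and the convergence test live outside the framework, so every iteration pays full job-launch cost:

```python
def run_job(x):
    # Stand-in for one submitted MapReduce job: a single fixed-point
    # step (Newton's iteration for sqrt(2), purely illustrative).
    return (x + 2.0 / x) / 2.0

def driver(x0, tol=1e-9, max_iters=50):
    # External driver: resubmit a job per iteration, then test
    # convergence between jobs -- exactly the Mahout-style pattern.
    x, jobs = x0, 0
    for _ in range(max_iters):
        new_x = run_job(x)   # full job submission each time around
        jobs += 1
        if abs(new_x - x) < tol:
            break
        x = new_x
    return new_x, jobs

value, jobs = driver(1.0)
print(round(value, 6), jobs)
```

In a real deployment each run_job call would also re-read the loop-invariant input from HDFS, which is the materialization overhead the slide points out.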
Incremental Computation
Introduction
1 Most workloads processed by MapReduce are incremental by nature, i.e. MapReduce jobs often run repeatedly with small changes in their input
2 For instance, most iterations of PageRank run with very small modifications
3 Unfortunately, even with a small change in input, MapReduce re-performs the entire computation
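A minimal sketch of the memoization idea behind systems such as Incoop (reference 7): cache each map task's output keyed by a hash of its input split, so an unchanged split is not recomputed on the next run. The chunking and cache below are invented for illustration:

```python
import collections
import hashlib

cache = {}
map_calls = 0

def map_chunk(chunk):
    # Memoized map task: keyed by a content hash of the input split,
    # so only changed splits are recomputed.
    global map_calls
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in cache:
        map_calls += 1
        cache[key] = collections.Counter(chunk.split())
    return cache[key]

def wordcount(chunks):
    total = collections.Counter()
    for chunk in chunks:
        total.update(map_chunk(chunk))
    return total

run1 = wordcount(["a b a", "c d"])
run2 = wordcount(["a b a", "c d e"])  # only the second split changed
print(map_calls)  # 3: two splits computed in run 1, one recomputed in run 2
```

Plain MapReduce would have executed all four map tasks; the memoized version reruns only the one split that changed between the two runs.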
References
1 MapReduce: A major step backwards: http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
2 Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08). USENIX Association, Berkeley, CA, USA, 29-42.
3 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08). ACM, New York, NY, USA, 1099-1110.
4 Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI '10). USENIX Association, Berkeley, CA, USA.
5 Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys '07). ACM, New York, NY, USA, 59-72.
6 Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11). USENIX Association, Berkeley, CA, USA.
7 Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC '11). ACM, New York, NY, USA.