5. Hadoop and MapReduce MapReduce
MapReduce
Bring the computation to the data – data is split in blocks across the cluster
MAP
One task per block
Hadoop filesystem (HDFS): 64 MB blocks by default
Stores locally key-value pairs
e.g., for word count: [(apple, 15), (peach, 7), . . .]
REDUCE
# of tasks set by the programmer
Mapper output is partitioned by key and pulled from “mappers”
The REDUCE function operates on all values for a single key
e.g., (peach, [7, 42, 13, . . .])
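For concreteness, below is the canonical word-count job written against the standard Hadoop MapReduce API; it is the textbook example, not code from this work. The MAP function emits (word, 1) pairs from its block, and the REDUCE function sums all values for one key.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // MAP: one task per HDFS block; emits (word, 1) for every word it reads.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // local key-value pairs, e.g. (apple, 1)
          }
        }
      }

      // REDUCE: receives all values for one key, e.g. (peach, [7, 42, 13, ...]),
      // and emits the total count for that word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }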
6. Hadoop and MapReduce Problem Statement
The Problem With Scheduling
Current Workloads
Huge job size variance
Running time: seconds to hours
I/O: KBs to TBs
[Chen et al., VLDB ’12; Ren et al., CMU TR ’12]
Consequence
Interactive jobs are delayed by long ones
In smaller clusters long queues exacerbate the problem
12. HFSP Implementation HFSP In General
HFSP In A Nutshell
Job Size Estimation
Naive estimation at first
After the first s “training” tasks have run, we make a better estimation
s = 5 by default
On t task slots, we give priority to training tasks
bounding the training slots to t avoids starving “old” jobs
“shortcut” for very small jobs
Scheduling Policy
We treat MAP and REDUCE phases as separate jobs
A virtual cluster outputs a per-job simulated completion time
Preempt running tasks of jobs that complete later in the virtual cluster
13. HFSP Implementation Size Estimation
Job Size Estimation (1)
Initial Estimation
ξ · k · l
k: # of tasks
l: average size of past MAP/REDUCE tasks
ξ ∈ [1, ∞]: aggressivity for scheduling jobs in training phase
ξ = 1 (default): tend to schedule training jobs right away
they may have to be preempted
ξ = ∞: wait for training to end before deciding
may require more “waves”
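A minimal sketch of this initial estimate; class and field names are illustrative, not taken from HFSP's code base:

    // Initial size estimate for a job still in training: xi * k * l.
    // k is the job's number of tasks, l the average size of past MAP
    // (or REDUCE) tasks, xi >= 1 the aggressivity parameter.
    final class InitialEstimator {
      private final double xi;             // aggressivity; 1 schedules training jobs right away
      private double avgPastTaskSize = 0;  // l: running average over previously observed tasks
      private long observedTasks = 0;

      InitialEstimator(double xi) { this.xi = xi; }

      // Called whenever a MAP (or REDUCE) task finishes anywhere in the cluster.
      void recordFinishedTask(double taskSize) {
        observedTasks++;
        avgPastTaskSize += (taskSize - avgPastTaskSize) / observedTasks;
      }

      // Initial estimate for a newly submitted job with k tasks.
      double estimate(int k) {
        return xi * k * avgPastTaskSize;
      }
    }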
15. HFSP Implementation Size Estimation
Job Size Estimation (2)
MAP Phase
From the size of the s samples, generate an empirical CDF
(Least-squares) fit to a parametric distribution
Predicted job size: k times the expected value of the fitted distribution (a sketch follows below)
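A minimal sketch of this estimator; it assumes, purely for illustration, an exponential as the parametric distribution and a grid-search least-squares fit, which may differ from HFSP's actual choices:

    import java.util.Arrays;

    final class MapPhaseEstimator {

      // sampleTaskSizes: measured sizes of the s sample MAP tasks
      // k: total number of MAP tasks of the job
      static double predictJobSize(double[] sampleTaskSizes, int k) {
        double[] x = sampleTaskSizes.clone();
        Arrays.sort(x);
        int s = x.length;

        // Least-squares fit of the exponential CDF 1 - exp(-t / mean) to the
        // empirical CDF F(x[i]) = (i + 1) / s, using a simple grid search.
        double lo = Math.max(x[0], 1e-9);
        double hi = 2.0 * Math.max(x[s - 1], lo);
        double step = Math.max((hi - lo) / 200.0, 1e-9);
        double bestMean = lo, bestErr = Double.MAX_VALUE;
        for (double mean = lo; mean <= hi; mean += step) {
          double err = 0.0;
          for (int i = 0; i < s; i++) {
            double diff = (i + 1.0) / s - (1.0 - Math.exp(-x[i] / mean));
            err += diff * diff;
          }
          if (err < bestErr) { bestErr = err; bestMean = mean; }
        }

        // The expected value of the fitted exponential is its mean, so the
        // predicted job size is k times that value.
        return k * bestMean;
      }
    }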
Data Locality
Experimentally, we find that it is not an issue
For the s sample jobs, there are plenty of unprocessed blocks around
We use delay scheduling [Zaharia et al., EuroSys ’10]
16. HFSP Implementation Size Estimation
Job Size Estimation (3)
REDUCE Phase
Shuffle time: getting data to the reducer
the time between scheduling a REDUCE task and executing a REDUCE function for the first time
average of sample shuffle sizes, weighted by data size
Execution time
we set a timeout ∆ (default 60s)
if the timeout is hit, the estimated execution time is ∆ / p, where the progress p is the fraction of data processed
Compute the estimated REDUCE time as before (same procedure as for the MAP phase)
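A minimal sketch of the shuffle and execution estimates described above (illustrative names, fixed default ∆ = 60 s):

    final class ReducePhaseEstimator {
      static final double DELTA_SECONDS = 60.0;  // timeout Delta, default 60 s

      // Shuffle-time estimate: average of the sampled shuffle times,
      // weighted by the amount of data each sample shuffled.
      static double estimateShuffleTime(double[] shuffleTimes, double[] shuffleBytes) {
        double weighted = 0.0, totalBytes = 0.0;
        for (int i = 0; i < shuffleTimes.length; i++) {
          weighted += shuffleTimes[i] * shuffleBytes[i];
          totalBytes += shuffleBytes[i];
        }
        return totalBytes > 0 ? weighted / totalBytes : 0.0;
      }

      // Execution-time estimate for a sample task still running when the
      // timeout expires: extrapolate from its progress p (fraction of data processed).
      static double estimateExecutionTime(double p) {
        return DELTA_SECONDS / p;
      }
    }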
17. HFSP Implementation Virtual Cluster
Virtual Cluster
Estimated job size is in a “serialized” single-machine format
Simulates a processor-sharing cluster to compute completion time, based on
number of tasks per job
available task slots in the real cluster
Simulation is updated when
new jobs arrive
tasks complete
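A minimal sketch of such a simulation; it assumes an equal split of slots among active jobs, capped by each job's number of tasks, and runs a snapshot to completion (the names and the simplified sharing rule are illustrative, not HFSP's actual component):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    final class VirtualCluster {

      static final class VJob {
        double remainingSize;        // estimated serialized work left
        final int tasks;             // cap on how many slots the job can use
        double completionTime = -1;  // simulated completion time (output)
        VJob(double size, int tasks) { this.remainingSize = size; this.tasks = tasks; }
      }

      // Fills in completionTime for every job, given the real cluster's task slots.
      static void simulate(List<VJob> jobs, int slots) {
        double now = 0.0;
        List<VJob> active = new ArrayList<>(jobs);
        while (!active.isEmpty()) {
          // Processor sharing: every active job gets an equal share of the slots,
          // never more than its own number of tasks.
          double share = (double) slots / active.size();
          // Advance virtual time to the next job completion under this allocation.
          double dt = Double.MAX_VALUE;
          for (VJob j : active) {
            dt = Math.min(dt, j.remainingSize / Math.min(share, j.tasks));
          }
          now += dt;
          Iterator<VJob> it = active.iterator();
          while (it.hasNext()) {
            VJob j = it.next();
            j.remainingSize -= Math.min(share, j.tasks) * dt;
            if (j.remainingSize <= 1e-9) { j.completionTime = now; it.remove(); }
          }
        }
      }
    }

In HFSP the simulation is re-run from the current state whenever a job arrives or a task completes, so the per-job completion times always reflect the latest estimates.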
19. HFSP Implementation Preemption
Job Preemption
Supported in Hadoop
KILL running tasks
wastes work
WAIT for them to finish
may take long
Our Choice
MAP tasks: WAIT
generally small
For REDUCE tasks, we implemented SUSPEND and RESUME
avoids the drawbacks of both WAIT and KILL
23. HFSP Implementation Preemption
Job Preemption: SUSPEND and RESUME
Our Solution
We delegate to the OS: SIGSTOP and SIGCONT
The OS will swap tasks if and when memory is needed
no risk of thrashing: swapped data is loaded only when resuming
Configurable maximum number of suspended tasks
if reached, switch to WAIT
hard limit on memory allocated to suspended tasks
If not all running tasks should be preempted, suspend the youngest
likely to finish later
may have smaller memory footprint
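A minimal sketch of OS-delegated suspension and resumption of a task's child process; shelling out to kill(1) is an assumption of this sketch, not necessarily how HFSP delivers the signals:

    import java.io.IOException;

    final class TaskPreemption {

      // Suspend the task's JVM: the OS may later swap it out, but only if memory is needed.
      static void suspend(long pid) throws IOException, InterruptedException {
        signal(pid, "-STOP");
      }

      // Resume the task's JVM: swapped pages are loaded back only at this point.
      static void resume(long pid) throws IOException, InterruptedException {
        signal(pid, "-CONT");
      }

      private static void signal(long pid, String sig)
          throws IOException, InterruptedException {
        new ProcessBuilder("kill", sig, Long.toString(pid)).inheritIO().start().waitFor();
      }
    }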
25. Experiments Setup and Traces
Experimental Setup
Platform
100 m1.xlarge Amazon EC2 instances
4 x 2 GHz cores, 1.6 TB storage, 15 GB RAM each
Workloads
Generated with the SWIM workload generator [Chen et al., MASCOTS ’11]
Synthesized from Facebook traces [Chen et al., VLDB ’12]
FB2009: 100 jobs, most are small; 22-minute submission schedule
FB2010: 93 jobs, small jobs filtered out; 1-hour submission schedule
Configuration
We compare to Hadoop’s FAIR scheduler
similar to a processor-sharing discipline
Delay scheduling enabled both for FAIR and HFSP
26. Experiments Results
FB2009
[Figure: CDFs of sojourn time (minutes) for small, medium, and large jobs, HFSP vs. FAIR; y-axis: fraction of completed jobs]
The FIFO scheduler would mostly fall outside of the graph
Small jobs (few tasks) are not problematic in either case
they are allocated enough tasks
Medium and large jobs instead require a significant amount of the cluster resources
“focusing” all resources of the cluster pays off
27. Experiments Results
FB2010
[Figure: CDFs of MAP time, REDUCE time, and aggregate sojourn time (minutes), HFSP vs. FAIR; y-axis: fraction of completed jobs]
Larger jobs, longer queues, more pressure on the scheduler
Median MAP sojourn time is more than halved
Main reason: fewer “waves”, because cluster resources are focused
On aggregate, when the first job completes with FAIR, 20% of jobs are done with HFSP.
28. Experiments Results
Cluster Size
[Figure: average sojourn time (minutes) vs. number of cluster nodes (10 to 100), HFSP vs. FAIR]
Experiment done with Mumak, the official Hadoop emulator, and FB2009
For smaller clusters, scheduling makes a bigger difference
29. Experiments Results
Robustness to Estimation Errors
[Figure: average sojourn time (seconds) vs. error parameter α (0.1 to 1), compared against FAIR and against HFSP with α = 0]
Experimental settings as before: FB2009 and Mumak again
For a job size estimation of θ, we introduce an error and pick a value uniformly in [(1 − α) θ, (1 + α) θ]
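A minimal sketch of this error model (illustrative names):

    import java.util.Random;

    final class EstimationError {
      private static final Random RNG = new Random();

      // Replace the true size theta with a value drawn uniformly at random
      // from [(1 - alpha) * theta, (1 + alpha) * theta].
      static double perturb(double theta, double alpha) {
        double low = (1 - alpha) * theta;
        double high = (1 + alpha) * theta;
        return low + RNG.nextDouble() * (high - low);
      }
    }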
30. Experiments Results
Preemption: Costs
Question
Could the costs associated with swapping make SUSPEND not worth it?
Measurements
Linux can read and write swap close to maximum disk speed
100 MB/s for us
Worst-Case Analysis
In the FB2010 experiment, 10% of REDUCE tasks are suspended
The JVM heap space for REDUCE tasks is 1 GB
as advised in the Hadoop docs
Therefore, a SUSPEND/RESUME induces swapping for at most 20 s
1 GB written to swap and later read back at 100 MB/s ≈ 10 s + 10 s
one order of magnitude less than the average size of preempted tasks
31. Experiments Conclusions
Take-Home Messages
Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters
Even simple approximate means for size estimation are sufficient, as HFSP is robust with respect to errors
OS delegation to POSIX SIGSTOP and SIGCONT signals is an efficient way to perform preemption in Hadoop
HFSP is available as free software at http://bitbucket.org/bigfootproject/hfsp
Paper at http://arxiv.org/abs/1302.2749