5. Hadoop and MapReduce MapReduce
MapReduce
Bring the computation to the data – data is split in blocks across the cluster
MAP
One task per block
Hadoop filesystem (HDFS): 64 MB blocks by default
Stores locally key-value pairs
e.g., for word count: [(apple, 15), (peach, 7), . . .]
REDUCE
# of tasks set by the programmer
Mapper output is partitioned by key and pulled from “mappers”
The REDUCE function operates on all values for a single key
e.g., (peach, [7, 42, 13, . . .])
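For concreteness, below is the canonical word-count job written against the standard Hadoop MapReduce API; it is the textbook example, not code from this work. The MAP function emits (word, 1) pairs from its block, and the REDUCE function sums all values for one key.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // MAP: one task per HDFS block; emits (word, 1) for every word it reads.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // local key-value pairs, e.g. (apple, 1)
          }
        }
      }

      // REDUCE: receives all values for one key, e.g. (peach, [7, 42, 13, ...]),
      // and emits the total count for that word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }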
6. Hadoop and MapReduce Problem Statement
The Problem With Scheduling
Current Workloads
Huge job size variance
Running time: seconds to hours
I/O: KBs to TBs
[Chen et al., VLDB ’12; Ren et al., CMU TR ’12]
Consequence
Interactive jobs are delayed by long ones
In smaller clusters long queues exacerbate the problem
12. HFSP Implementation HFSP In General
HFSP In A Nutshell
Job Size Estimation
Naive estimation at first
After the first s “training” tasks have run, we make a better estimation
s = 5 by default
On t task slots, we give priority to training tasks
bounding the training slots to t avoids starving “old” jobs
“shortcut” for very small jobs
Scheduling Policy
We treat MAP and REDUCE phases as separate jobs
A virtual cluster outputs a per-job simulated completion time
Preempt running tasks of jobs that complete later in the virtual cluster
13. HFSP Implementation Size Estimation
Job Size Estimation (1)
Initial Estimation
ξ · k · l
k: # of tasks
l: average size of past MAP/REDUCE tasks
ξ ∈ [1, ∞]: aggressivity for scheduling jobs in training phase
ξ = 1 (default): tend to schedule training jobs right away
they may have to be preempted
ξ = ∞: wait for training to end before deciding
may require more “waves”
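A minimal sketch of this initial estimate; class and field names are illustrative, not taken from HFSP's code base:

    // Initial size estimate for a job still in training: xi * k * l.
    // k is the job's number of tasks, l the average size of past MAP
    // (or REDUCE) tasks, xi >= 1 the aggressivity parameter.
    final class InitialEstimator {
      private final double xi;             // aggressivity; 1 schedules training jobs right away
      private double avgPastTaskSize = 0;  // l: running average over previously observed tasks
      private long observedTasks = 0;

      InitialEstimator(double xi) { this.xi = xi; }

      // Called whenever a MAP (or REDUCE) task finishes anywhere in the cluster.
      void recordFinishedTask(double taskSize) {
        observedTasks++;
        avgPastTaskSize += (taskSize - avgPastTaskSize) / observedTasks;
      }

      // Initial estimate for a newly submitted job with k tasks.
      double estimate(int k) {
        return xi * k * avgPastTaskSize;
      }
    }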
15. HFSP Implementation Size Estimation
Job Size Estimation (2)
MAP Phase
From the size of the s samples, generate an empirical CDF
(Least-squares) fit to a parametric distribution
Predicted job size: k times the expected value of the fitted distribution (a sketch follows below)
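A minimal sketch of this estimator; it assumes, purely for illustration, an exponential as the parametric distribution and a grid-search least-squares fit, which may differ from HFSP's actual choices:

    import java.util.Arrays;

    final class MapPhaseEstimator {

      // sampleTaskSizes: measured sizes of the s sample MAP tasks
      // k: total number of MAP tasks of the job
      static double predictJobSize(double[] sampleTaskSizes, int k) {
        double[] x = sampleTaskSizes.clone();
        Arrays.sort(x);
        int s = x.length;

        // Least-squares fit of the exponential CDF 1 - exp(-t / mean) to the
        // empirical CDF F(x[i]) = (i + 1) / s, using a simple grid search.
        double lo = Math.max(x[0], 1e-9);
        double hi = 2.0 * Math.max(x[s - 1], lo);
        double step = Math.max((hi - lo) / 200.0, 1e-9);
        double bestMean = lo, bestErr = Double.MAX_VALUE;
        for (double mean = lo; mean <= hi; mean += step) {
          double err = 0.0;
          for (int i = 0; i < s; i++) {
            double diff = (i + 1.0) / s - (1.0 - Math.exp(-x[i] / mean));
            err += diff * diff;
          }
          if (err < bestErr) { bestErr = err; bestMean = mean; }
        }

        // The expected value of the fitted exponential is its mean, so the
        // predicted job size is k times that value.
        return k * bestMean;
      }
    }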
Data Locality
Experimentally, we find that it is not an issue
For the s sample jobs, there are plenty of unprocessed blocks around
We use delay scheduling [Zaharia et al., EuroSys ’10]
16. HFSP Implementation Size Estimation
Job Size Estimation (3)
REDUCE Phase
Shuffle time: getting data to the reducer
the time between scheduling a REDUCE task and executing a REDUCE function for the first time
average of sample shuffle sizes, weighted by data size
Execution time
we set a timeout ∆ (default 60s)
if the timeout is hit, the estimated execution time is ∆ / p, where the progress p is the fraction of data processed
Compute the estimated REDUCE time as before (same procedure as for the MAP phase)
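A minimal sketch of the shuffle and execution estimates described above (illustrative names, fixed default ∆ = 60 s):

    final class ReducePhaseEstimator {
      static final double DELTA_SECONDS = 60.0;  // timeout Delta, default 60 s

      // Shuffle-time estimate: average of the sampled shuffle times,
      // weighted by the amount of data each sample shuffled.
      static double estimateShuffleTime(double[] shuffleTimes, double[] shuffleBytes) {
        double weighted = 0.0, totalBytes = 0.0;
        for (int i = 0; i < shuffleTimes.length; i++) {
          weighted += shuffleTimes[i] * shuffleBytes[i];
          totalBytes += shuffleBytes[i];
        }
        return totalBytes > 0 ? weighted / totalBytes : 0.0;
      }

      // Execution-time estimate for a sample task still running when the
      // timeout expires: extrapolate from its progress p (fraction of data processed).
      static double estimateExecutionTime(double p) {
        return DELTA_SECONDS / p;
      }
    }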
17. HFSP Implementation Virtual Cluster
Virtual Cluster
Estimated job size is in a “serialized” single-machine format
Simulates a processor-sharing cluster to compute completion time, based on
number of tasks per job
available task slots in the real cluster
Simulation is updated when
new jobs arrive
tasks complete
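A minimal sketch of such a simulation; it assumes an equal split of slots among active jobs, capped by each job's number of tasks, and runs a snapshot to completion (the names and the simplified sharing rule are illustrative, not HFSP's actual component):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    final class VirtualCluster {

      static final class VJob {
        double remainingSize;        // estimated serialized work left
        final int tasks;             // cap on how many slots the job can use
        double completionTime = -1;  // simulated completion time (output)
        VJob(double size, int tasks) { this.remainingSize = size; this.tasks = tasks; }
      }

      // Fills in completionTime for every job, given the real cluster's task slots.
      static void simulate(List<VJob> jobs, int slots) {
        double now = 0.0;
        List<VJob> active = new ArrayList<>(jobs);
        while (!active.isEmpty()) {
          // Processor sharing: every active job gets an equal share of the slots,
          // never more than its own number of tasks.
          double share = (double) slots / active.size();
          // Advance virtual time to the next job completion under this allocation.
          double dt = Double.MAX_VALUE;
          for (VJob j : active) {
            dt = Math.min(dt, j.remainingSize / Math.min(share, j.tasks));
          }
          now += dt;
          Iterator<VJob> it = active.iterator();
          while (it.hasNext()) {
            VJob j = it.next();
            j.remainingSize -= Math.min(share, j.tasks) * dt;
            if (j.remainingSize <= 1e-9) { j.completionTime = now; it.remove(); }
          }
        }
      }
    }

In HFSP the simulation is re-run from the current state whenever a job arrives or a task completes, so the per-job completion times always reflect the latest estimates.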
19. HFSP Implementation Preemption
Job Preemption
Supported in Hadoop
KILL running tasks
wastes work
WAIT for them to finish
may take long
Our Choice
MAP tasks: WAIT
generally small
For REDUCE tasks, we implemented SUSPEND and RESUME
avoids the drawbacks of both WAIT and KILL
23. HFSP Implementation Preemption
Job Preemption: SUSPEND and RESUME
Our Solution
We delegate to the OS: SIGSTOP and SIGCONT
The OS will swap tasks if and when memory is needed
no risk of thrashing: swapped data is loaded only when resuming
Configurable maximum number of suspended tasks
if reached, switch to WAIT
hard limit on memory allocated to suspended tasks
If not all running tasks should be preempted, suspend the youngest
likely to finish later
may have smaller memory footprint
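A minimal sketch of OS-delegated suspension and resumption of a task's child process; shelling out to kill(1) is an assumption of this sketch, not necessarily how HFSP delivers the signals:

    import java.io.IOException;

    final class TaskPreemption {

      // Suspend the task's JVM: the OS may later swap it out, but only if memory is needed.
      static void suspend(long pid) throws IOException, InterruptedException {
        signal(pid, "-STOP");
      }

      // Resume the task's JVM: swapped pages are loaded back only at this point.
      static void resume(long pid) throws IOException, InterruptedException {
        signal(pid, "-CONT");
      }

      private static void signal(long pid, String sig)
          throws IOException, InterruptedException {
        new ProcessBuilder("kill", sig, Long.toString(pid)).inheritIO().start().waitFor();
      }
    }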
25. Experiments Setup and Traces
Experimental Setup
Platform
100 m1.xlarge Amazon EC2 instances
4 x 2 GHz cores, 1.6 TB storage, 15 GB RAM each
Workloads
Generated with the SWIM workload generator [Chen et al., MASCOTS ’11]
Synthesized from Facebook traces [Chen et al., VLDB ’12]
FB2009: 100 jobs, most are small; 22-minute submission schedule
FB2010: 93 jobs, small jobs filtered out; 1-hour submission schedule
Configuration
We compare to Hadoop’s FAIR scheduler
similar to a processor-sharing discipline
Delay scheduling enabled both for FAIR and HFSP
26. Experiments Results
FB2009
[Figure: CDFs of sojourn time (minutes) for small, medium, and large jobs, HFSP vs. FAIR; y-axis: fraction of completed jobs]
The FIFO scheduler would mostly fall outside of the graph
Small jobs (few tasks) are not problematic in either case
they are allocated enough tasks
Medium and large jobs instead require a significant amount of the cluster resources
“focusing” all resources of the cluster pays off
27. Experiments Results
FB2010
[Figure: CDFs of MAP time, REDUCE time, and aggregate sojourn time (minutes), HFSP vs. FAIR; y-axis: fraction of completed jobs]
Larger jobs, longer queues, more pressure on the scheduler
Median MAP sojourn time is more than halved
Main reason: fewer “waves”, because cluster resources are focused
On aggregate, when the first job completes with FAIR, 20% of jobs are done with HFSP.
28. Experiments Results
Cluster Size
[Figure: average sojourn time (minutes) vs. number of cluster nodes (10 to 100), HFSP vs. FAIR]
Experiment done with Mumak, the official Hadoop emulator, and FB2009
For smaller clusters, scheduling makes a bigger difference
29. Experiments Results
Robustness to Estimation Errors
[Figure: average sojourn time (seconds) vs. error parameter α (0.1 to 1), compared against FAIR and against HFSP with α = 0]
Experimental settings as before: FB2009 and Mumak again
For a job size estimation of θ, we introduce an error and pick a value uniformly in [(1 − α) θ, (1 + α) θ]
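A minimal sketch of this error model (illustrative names):

    import java.util.Random;

    final class EstimationError {
      private static final Random RNG = new Random();

      // Replace the true size theta with a value drawn uniformly at random
      // from [(1 - alpha) * theta, (1 + alpha) * theta].
      static double perturb(double theta, double alpha) {
        double low = (1 - alpha) * theta;
        double high = (1 + alpha) * theta;
        return low + RNG.nextDouble() * (high - low);
      }
    }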
30. Experiments Results
Preemption: Costs
Question
Could the costs associated with swapping make SUSPEND not worth it?
Measurements
Linux can read and write swap close to maximum disk speed
100 MB/s for us
Worst-Case Analysis
In the FB2010 experiment, 10% of REDUCE tasks are suspended
The JVM heap space for REDUCE tasks is 1 GB
as advised in the Hadoop docs
Therefore, a SUSPEND/RESUME induces swapping for at most 20 s
1 GB written to swap and later read back at 100 MB/s ≈ 10 s + 10 s
one order of magnitude less than the average size of preempted tasks
31. Experiments Conclusions
Take-Home Messages
Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters
Even simple approximate means for size estimation are sufficient, as HFSP is robust with respect to errors
OS delegation to POSIX SIGSTOP and SIGCONT signals is an efficient way to perform preemption in Hadoop
HFSP is available as free software at http://bitbucket.org/bigfootproject/hfsp
Paper at http://arxiv.org/abs/1302.2749