The document discusses scheduling MapReduce jobs in HPC clusters. It presents a MapReduce Job Adaptor that allows MapReduce jobs to run alongside regular HPC jobs in a shared HPC cluster managed by a resource management system. The adaptor estimates completion times for MapReduce jobs to minimize average turnaround times and better utilize unused cluster resources. An evaluation of the adaptor in a simulated cluster shows it reduces turnaround times for MapReduce-only and mixed workloads compared to a naive approach.
Scheduling MapReduce Jobs in HPC Clusters
1. Scheduling MapReduce Jobs in HPC Clusters
Marcelo Neves, Tiago Ferreto, Cesar De Rose
marcelo.neves@acad.pucrs.br
Faculty of Informatics, PUCRS
Porto Alegre, Brazil
August 30, 2012
3. Introduction
• MapReduce (MR)
  – A parallel programming model
  – Simplicity, efficiency and high scalability
  – It has become a de facto standard for large-scale data analysis
• MR has also attracted the attention of the HPC community
  – Simpler approach to address the parallelization problem
  – Highly visible cases where MR has been successfully used by companies like Google, Facebook and Yahoo!
4. HPC Clusters and MapReduce
• HPC Clusters
  – Shared among multiple users/organizations
  – Resource Management System (RMS), such as PBS/Torque
  – Applications are submitted as batch jobs
  – Users have to explicitly allocate the resources, specifying the number of nodes and amount of time (see the sketch below)
• MR Implementations (e.g. Hadoop)
  – Have their own complete job management system
  – Users do not have to explicitly allocate resources
  – Require a dedicated cluster
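For concreteness, explicit allocation under PBS/Torque looks roughly like the sketch below; the node count, walltime, and script name are illustrative assumptions, not values from the slides.

```python
# Hypothetical sketch: on a PBS/Torque-managed HPC cluster, the user must
# explicitly request nodes and wall-clock time for every batch job.
nodes = 4                   # number of nodes (assumed value)
ppn = 2                     # cores per node (assumed value)
walltime = "01:00:00"       # wall-clock limit (assumed value)
script = "my_batch_job.sh"  # placeholder batch script

qsub_cmd = f"qsub -l nodes={nodes}:ppn={ppn} -l walltime={walltime} {script}"
print(qsub_cmd)  # qsub -l nodes=4:ppn=2 -l walltime=01:00:00 my_batch_job.sh
```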
5. Problem
• Two distinct clusters are required
How to run MapReduce jobs in an existing HPC cluster along with regular HPC jobs?
6. Current solutions
• Hadoop on Demand (HOD) and MyHadoop
  – Create on-demand MR installations as RMS jobs
  – Not transparent: users still must specify the number of nodes and the amount of time to be allocated
• MESOS
  – Shares a cluster among multiple different frameworks
  – Creates another level of resource management
  – Management is taken away from the cluster's RMS
7. MapReduce Job Adaptor
[Architecture diagram] HPC users submit HPC jobs (# of nodes, time) directly to the Resource Management System, which schedules them on the cluster. MR users submit MR jobs (# of map tasks, # of reduce tasks, job profile) to the MR Job Adaptor, which translates each one into an HPC job (# of nodes, time) and submits it to the same RMS.
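A minimal sketch of the translation implied by the diagram, assuming hypothetical names (MRJob, HPCJobRequest, adapt); the slides do not give the adaptor's actual API, and the completion-time estimator is left as a parameter (see the model on slide 9).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MRJob:
    """What an MR user submits to the adaptor (field names are assumptions)."""
    num_map_tasks: int
    num_reduce_tasks: int
    job_profile: dict  # performance invariants, e.g. average/max task durations

@dataclass
class HPCJobRequest:
    """What the RMS expects from any user: an explicit (# of nodes, time) request."""
    num_nodes: int
    walltime_s: float

def adapt(job: MRJob, num_nodes: int,
          estimate_time: Callable[[MRJob, int], float]) -> HPCJobRequest:
    """Translate an MR job into an RMS-friendly (# of nodes, time) request.

    `estimate_time(job, num_nodes)` should return an estimated completion
    time in seconds, e.g. the upper bound from the completion-time model.
    """
    return HPCJobRequest(num_nodes=num_nodes,
                         walltime_s=estimate_time(job, num_nodes))
```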
8. MapReduce Job Adaptor
• The adaptor has three main goals:
  – Facilitate the execution of MR jobs in HPC clusters
  – Minimize the average turnaround time of the jobs
  – Exploit unused resources in the cluster (the result of the various shapes of HPC job requests)
9. Completion time estimation
• MR performance model by Verma et al.¹
  – Job profile with performance invariants
  – Estimate upper/lower bounds of job completion time (see the sketch below)
• N_M^J = number of map tasks
• N_R^J = number of reduce tasks
• S_M^J = number of map slots
• S_R^J = number of reduce slots

1. Verma et al.: ARIA: automatic resource inference and allocation for MapReduce environments (2011)
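The slides list only the model's variables; as a hedged illustration, the per-phase lower/upper makespan bounds used by the cited ARIA-style model can be computed as below. The profile key names are assumptions, and the shuffle phase is folded into the reduce durations for brevity.

```python
def phase_bounds(n_tasks: int, n_slots: int, avg_dur: float, max_dur: float):
    """Makespan bounds for n_tasks processed greedily on n_slots slots:
    the lower bound assumes perfect balance, the upper bound uses the
    classic (n - 1) * avg / slots + max online-scheduling bound."""
    if n_tasks == 0 or n_slots == 0:
        return 0.0, 0.0
    low = n_tasks * avg_dur / n_slots
    up = (n_tasks - 1) * avg_dur / n_slots + max_dur
    return low, up

def job_completion_bounds(profile: dict, n_map: int, n_reduce: int,
                          map_slots: int, reduce_slots: int):
    """Lower/upper bounds (T_low, T_up) on MR job completion time.

    `profile` holds the performance invariants; the key names are assumed,
    and shuffle is rolled into the reduce task durations."""
    m_low, m_up = phase_bounds(n_map, map_slots,
                               profile["avg_map"], profile["max_map"])
    r_low, r_up = phase_bounds(n_reduce, reduce_slots,
                               profile["avg_reduce"], profile["max_reduce"])
    return m_low + r_low, m_up + r_up

# Example with made-up numbers: 100 maps and 10 reduces on 16 map / 8 reduce slots.
profile = {"avg_map": 20.0, "max_map": 35.0, "avg_reduce": 60.0, "max_reduce": 90.0}
print(job_completion_bounds(profile, n_map=100, n_reduce=10,
                            map_slots=16, reduce_slots=8))
```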
11. Evaluation
• Simulated environment (using the SimGrid toolkit)
  – Cluster composed of 128 nodes with 2 cores each
  – RMS based on the Conservative Backfilling (CBF) algorithm (a sketch follows this list)
  – Stream of job submissions
• HPC workload
  – Synthetic workload based on the model by Lublin et al.¹
  – Real-world HPC traces from the Parallel Workloads Archive (SDSC SP2)
• MR workload
  – Synthetic workload derived from Facebook workloads described by Zaharia et al.²

1. Lublin et al.: The workload on parallel supercomputers: Modeling the characteristics of rigid jobs (2003)
2. Zaharia et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling (2010)
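For context on the simulated RMS, here is a compact, illustrative sketch of conservative backfilling; it is a toy reimplementation for exposition, not the SimGrid-based simulator used in the evaluation. Every job gets a reservation at the earliest feasible time, and later jobs may only be backfilled into gaps that do not delay any existing reservation.

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    start: float
    end: float
    nodes: int

class ConservativeBackfill:
    """Toy conservative backfilling: every job receives a reservation at the
    earliest time that does not delay any previously reserved job."""

    def __init__(self, total_nodes: int):
        self.total_nodes = total_nodes
        self.reservations = []  # existing (immutable) reservations

    def _used_at(self, t: float) -> int:
        return sum(r.nodes for r in self.reservations if r.start <= t < r.end)

    def _fits(self, start: float, runtime: float, nodes: int) -> bool:
        # Peak usage over [start, start + runtime) occurs at `start` or at the
        # start of some overlapping reservation, so checking those suffices.
        points = [start] + [r.start for r in self.reservations
                            if start < r.start < start + runtime]
        return all(self._used_at(p) + nodes <= self.total_nodes for p in points)

    def submit(self, arrival: float, runtime: float, nodes: int) -> float:
        """Reserve the earliest feasible start time and return it."""
        candidates = sorted({arrival} | {r.end for r in self.reservations
                                         if r.end > arrival})
        for start in candidates:
            if self._fits(start, runtime, nodes):
                self.reservations.append(Reservation(start, start + runtime, nodes))
                return start
        raise RuntimeError("job requests more nodes than the cluster has")

# Example: a 128-node cluster, as in the simulated environment.
rms = ConservativeBackfill(total_nodes=128)
print(rms.submit(arrival=0, runtime=3600, nodes=100))  # starts at 0
print(rms.submit(arrival=10, runtime=600, nodes=64))   # waits for nodes: 3600
print(rms.submit(arrival=20, runtime=600, nodes=20))   # backfilled at 20
```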
12. Turnaround Time and System Utilization
• Workload:
  – HPC: "peak hour" of Lublin's model
  – MR: one hour of Facebook-like job submissions
• The adaptor obtained shorter turnaround times and better cluster utilization in all cases
  – MR-only: turnaround was reduced by ≈ 40%
  – HPC+MR: overall turnaround was reduced by ≈ 15%
  – HPC+MR: turnaround of MR jobs was reduced by ≈ 73%
13. Influence of the Job Size
• Shorter turnaround regardless of the job size
• Better results for bins with smaller jobs
[Figure: average turnaround time (minutes) per bin (1-9), Naive vs. Adaptor]

Job sizes in the Facebook workload (based on Zaharia et al.); a sampling sketch follows the table:

Bin   # Map Tasks   # Reduce Tasks   % Jobs at Facebook
1     1             0                39%
2     2             0                16%
3     10            3                14%
4     50            0                9%
5     100           0                6%
6     200           50               6%
7     400           0                4%
8     800           180              4%
9     2400          0                3%
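A small sketch of how a Facebook-like synthetic MR workload can be drawn from the bin distribution above; the function name, seed, and exponential inter-arrival assumption are illustrative additions, not details from the slides.

```python
import random

# (map tasks, reduce tasks, fraction of jobs) per bin, from the table above.
FACEBOOK_BINS = [
    (1, 0, 0.39), (2, 0, 0.16), (10, 3, 0.14), (50, 0, 0.09), (100, 0, 0.06),
    (200, 50, 0.06), (400, 0, 0.04), (800, 180, 0.04), (2400, 0, 0.03),
]

def sample_mr_jobs(n_jobs: int, mean_interarrival_s: float = 30.0, seed: int = 42):
    """Draw n_jobs (arrival_time, map_tasks, reduce_tasks) tuples: bin sizes
    follow the Facebook distribution, arrivals follow an assumed Poisson
    process with the given mean inter-arrival time."""
    rng = random.Random(seed)
    weights = [p for _, _, p in FACEBOOK_BINS]
    t, jobs = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(1.0 / mean_interarrival_s)
        maps, reduces, _ = rng.choices(FACEBOOK_BINS, weights=weights)[0]
        jobs.append((t, maps, reduces))
    return jobs

print(sample_mr_jobs(3))
```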
14. Influence of System Load
[Figures: average turnaround time (minutes) vs. mean HPC job inter-arrival time (seconds) and vs. mean MR job inter-arrival time (seconds), Adaptor vs. Naive]
15. Real-world Workload
• Workload:
  – HPC: a day-long trace from SDSC SP2
  – MR: 1000 Facebook-like MR jobs
[Figure: turnaround time results, annotated with ≈ 54% and ≈ 80%]
• The adaptor's algorithm performed better in all cases
16. Conclusion
• Although MR has gained attention from the HPC community, there is still the question of how to run MR jobs along with regular HPC jobs in an HPC cluster
• MR Job Adaptor
  – Allows transparent MR job submission on HPC clusters
  – Minimizes the average turnaround time of the jobs
  – Improves overall utilization by exploiting unused resources in the cluster