The document discusses scheduling MapReduce jobs in HPC clusters. It presents a MapReduce Job Adaptor that allows MapReduce jobs to run alongside regular HPC jobs in a shared HPC cluster managed by a resource management system. The adaptor estimates completion times for MapReduce jobs to minimize average turnaround times and better utilize unused cluster resources. An evaluation of the adaptor in a simulated cluster shows it reduces turnaround times for MapReduce-only and mixed workloads compared to a naive approach.
Scheduling MapReduce Jobs in HPC Clusters
1. Scheduling MapReduce Jobs in HPC Clusters
Marcelo Neves, Tiago Ferreto, Cesar De Rose
marcelo.neves@acad.pucrs.br
Faculty of Informatics, PUCRS
Porto Alegre, Brazil
August 30, 2012
3. Introduction
• MapReduce (MR)
  – A parallel programming model
  – Simplicity, efficiency and high scalability
  – It has become a de facto standard for large-scale data analysis
• MR has also attracted the attention of the HPC community
  – Simpler approach to address the parallelization problem
  – Highly visible cases where MR has been successfully used by companies like Google, Facebook and Yahoo!
4. HPC Clusters and MapReduce
• HPC Clusters
  – Shared among multiple users/organizations
  – Resource Management System (RMS), such as PBS/Torque
  – Applications are submitted as batch jobs
  – Users have to explicitly allocate the resources, specifying the number of nodes and amount of time (see the sketch below)
• MR Implementations (e.g. Hadoop)
  – Have their own complete job management system
  – Users do not have to explicitly allocate resources
  – Require a dedicated cluster
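For concreteness, explicit allocation under PBS/Torque looks roughly like the sketch below; the node count, walltime, and script name are illustrative assumptions, not values from the slides.

```python
# Hypothetical sketch: on a PBS/Torque-managed HPC cluster, the user must
# explicitly request nodes and wall-clock time for every batch job.
nodes = 4                   # number of nodes (assumed value)
ppn = 2                     # cores per node (assumed value)
walltime = "01:00:00"       # wall-clock limit (assumed value)
script = "my_batch_job.sh"  # placeholder batch script

qsub_cmd = f"qsub -l nodes={nodes}:ppn={ppn} -l walltime={walltime} {script}"
print(qsub_cmd)  # qsub -l nodes=4:ppn=2 -l walltime=01:00:00 my_batch_job.sh
```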
5. Problem
• Two distinct clusters are required
How to run MapReduce jobs in an existing HPC cluster along with regular HPC jobs?
6. Current solutions
• Hadoop on Demand (HOD) and MyHadoop
  – Create on-demand MR installations as RMS jobs
  – Not transparent: users still must specify the number of nodes and the amount of time to be allocated
• MESOS
  – Shares a cluster among multiple different frameworks
  – Creates another level of resource management
  – Management is taken away from the cluster's RMS
7. MapReduce Job Adaptor
[Architecture diagram] HPC users submit HPC jobs (# of nodes, time) directly to the Resource Management System, which schedules them on the cluster. MR users submit MR jobs (# of map tasks, # of reduce tasks, job profile) to the MR Job Adaptor, which translates each one into an HPC job (# of nodes, time) and submits it to the same RMS.
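A minimal sketch of the translation implied by the diagram, assuming hypothetical names (MRJob, HPCJobRequest, adapt); the slides do not give the adaptor's actual API, and the completion-time estimator is left as a parameter (see the model on slide 9).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MRJob:
    """What an MR user submits to the adaptor (field names are assumptions)."""
    num_map_tasks: int
    num_reduce_tasks: int
    job_profile: dict  # performance invariants, e.g. average/max task durations

@dataclass
class HPCJobRequest:
    """What the RMS expects from any user: an explicit (# of nodes, time) request."""
    num_nodes: int
    walltime_s: float

def adapt(job: MRJob, num_nodes: int,
          estimate_time: Callable[[MRJob, int], float]) -> HPCJobRequest:
    """Translate an MR job into an RMS-friendly (# of nodes, time) request.

    `estimate_time(job, num_nodes)` should return an estimated completion
    time in seconds, e.g. the upper bound from the completion-time model.
    """
    return HPCJobRequest(num_nodes=num_nodes,
                         walltime_s=estimate_time(job, num_nodes))
```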
8. MapReduce Job Adaptor
• The adaptor has three main goals:
  – Facilitate the execution of MR jobs in HPC clusters
  – Minimize the average turnaround time of the jobs
  – Exploit unused resources in the cluster (the result of the various shapes of HPC job requests)
9. Completion time estimation
• MR performance model by Verma et al.¹
  – Job profile with performance invariants
  – Estimate upper/lower bounds of job completion time (see the sketch below)
• N_M^J = number of map tasks
• N_R^J = number of reduce tasks
• S_M^J = number of map slots
• S_R^J = number of reduce slots

1. Verma et al.: ARIA: automatic resource inference and allocation for MapReduce environments (2011)
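The slides list only the model's variables; as a hedged illustration, the per-phase lower/upper makespan bounds used by the cited ARIA-style model can be computed as below. The profile key names are assumptions, and the shuffle phase is folded into the reduce durations for brevity.

```python
def phase_bounds(n_tasks: int, n_slots: int, avg_dur: float, max_dur: float):
    """Makespan bounds for n_tasks processed greedily on n_slots slots:
    the lower bound assumes perfect balance, the upper bound uses the
    classic (n - 1) * avg / slots + max online-scheduling bound."""
    if n_tasks == 0 or n_slots == 0:
        return 0.0, 0.0
    low = n_tasks * avg_dur / n_slots
    up = (n_tasks - 1) * avg_dur / n_slots + max_dur
    return low, up

def job_completion_bounds(profile: dict, n_map: int, n_reduce: int,
                          map_slots: int, reduce_slots: int):
    """Lower/upper bounds (T_low, T_up) on MR job completion time.

    `profile` holds the performance invariants; the key names are assumed,
    and shuffle is rolled into the reduce task durations."""
    m_low, m_up = phase_bounds(n_map, map_slots,
                               profile["avg_map"], profile["max_map"])
    r_low, r_up = phase_bounds(n_reduce, reduce_slots,
                               profile["avg_reduce"], profile["max_reduce"])
    return m_low + r_low, m_up + r_up

# Example with made-up numbers: 100 maps and 10 reduces on 16 map / 8 reduce slots.
profile = {"avg_map": 20.0, "max_map": 35.0, "avg_reduce": 60.0, "max_reduce": 90.0}
print(job_completion_bounds(profile, n_map=100, n_reduce=10,
                            map_slots=16, reduce_slots=8))
```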
11. Evaluation
• Simulated environment (using the SimGrid toolkit)
  – Cluster composed of 128 nodes with 2 cores each
  – RMS based on the Conservative Backfilling (CBF) algorithm (a sketch follows this list)
  – Stream of job submissions
• HPC workload
  – Synthetic workload based on the model by Lublin et al.¹
  – Real-world HPC traces from the Parallel Workloads Archive (SDSC SP2)
• MR workload
  – Synthetic workload derived from Facebook workloads described by Zaharia et al.²

1. Lublin et al.: The workload on parallel supercomputers: Modeling the characteristics of rigid jobs (2003)
2. Zaharia et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling (2010)
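For context on the simulated RMS, here is a compact, illustrative sketch of conservative backfilling; it is a toy reimplementation for exposition, not the SimGrid-based simulator used in the evaluation. Every job gets a reservation at the earliest feasible time, and later jobs may only be backfilled into gaps that do not delay any existing reservation.

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    start: float
    end: float
    nodes: int

class ConservativeBackfill:
    """Toy conservative backfilling: every job receives a reservation at the
    earliest time that does not delay any previously reserved job."""

    def __init__(self, total_nodes: int):
        self.total_nodes = total_nodes
        self.reservations = []  # existing (immutable) reservations

    def _used_at(self, t: float) -> int:
        return sum(r.nodes for r in self.reservations if r.start <= t < r.end)

    def _fits(self, start: float, runtime: float, nodes: int) -> bool:
        # Peak usage over [start, start + runtime) occurs at `start` or at the
        # start of some overlapping reservation, so checking those suffices.
        points = [start] + [r.start for r in self.reservations
                            if start < r.start < start + runtime]
        return all(self._used_at(p) + nodes <= self.total_nodes for p in points)

    def submit(self, arrival: float, runtime: float, nodes: int) -> float:
        """Reserve the earliest feasible start time and return it."""
        candidates = sorted({arrival} | {r.end for r in self.reservations
                                         if r.end > arrival})
        for start in candidates:
            if self._fits(start, runtime, nodes):
                self.reservations.append(Reservation(start, start + runtime, nodes))
                return start
        raise RuntimeError("job requests more nodes than the cluster has")

# Example: a 128-node cluster, as in the simulated environment.
rms = ConservativeBackfill(total_nodes=128)
print(rms.submit(arrival=0, runtime=3600, nodes=100))  # starts at 0
print(rms.submit(arrival=10, runtime=600, nodes=64))   # waits for nodes: 3600
print(rms.submit(arrival=20, runtime=600, nodes=20))   # backfilled at 20
```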
12. Turnaround Time and System Utilization
• Workload:
  – HPC: "peak hour" of Lublin's model
  – MR: one hour of Facebook-like job submissions
• The adaptor obtained shorter turnaround times and better cluster utilization in all cases
  – MR-only: turnaround was reduced by ≈ 40%
  – HPC+MR: overall turnaround was reduced by ≈ 15%
  – HPC+MR: turnaround of MR jobs was reduced by ≈ 73%
13. Influence of the Job Size
• Shorter turnaround regardless of the job size
• Better results for bins with smaller jobs
[Figure: average turnaround time (minutes) per bin (1-9), Naive vs. Adaptor]

Job sizes in the Facebook workload (based on Zaharia et al.); a sampling sketch follows the table:

Bin   # Map Tasks   # Reduce Tasks   % Jobs at Facebook
1     1             0                39%
2     2             0                16%
3     10            3                14%
4     50            0                9%
5     100           0                6%
6     200           50               6%
7     400           0                4%
8     800           180              4%
9     2400          0                3%
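A small sketch of how a Facebook-like synthetic MR workload can be drawn from the bin distribution above; the function name, seed, and exponential inter-arrival assumption are illustrative additions, not details from the slides.

```python
import random

# (map tasks, reduce tasks, fraction of jobs) per bin, from the table above.
FACEBOOK_BINS = [
    (1, 0, 0.39), (2, 0, 0.16), (10, 3, 0.14), (50, 0, 0.09), (100, 0, 0.06),
    (200, 50, 0.06), (400, 0, 0.04), (800, 180, 0.04), (2400, 0, 0.03),
]

def sample_mr_jobs(n_jobs: int, mean_interarrival_s: float = 30.0, seed: int = 42):
    """Draw n_jobs (arrival_time, map_tasks, reduce_tasks) tuples: bin sizes
    follow the Facebook distribution, arrivals follow an assumed Poisson
    process with the given mean inter-arrival time."""
    rng = random.Random(seed)
    weights = [p for _, _, p in FACEBOOK_BINS]
    t, jobs = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(1.0 / mean_interarrival_s)
        maps, reduces, _ = rng.choices(FACEBOOK_BINS, weights=weights)[0]
        jobs.append((t, maps, reduces))
    return jobs

print(sample_mr_jobs(3))
```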
14. Influence of System Load
[Figures: average turnaround time (minutes) vs. mean HPC job inter-arrival time (seconds) and vs. mean MR job inter-arrival time (seconds), Adaptor vs. Naive]
15. Real-world Workload
• Workload:
  – HPC: a day-long trace from SDSC SP2
  – MR: 1000 Facebook-like MR jobs
[Figure: turnaround time results, annotated with ≈ 54% and ≈ 80%]
• The adaptor's algorithm performed better in all cases
16. Conclusion
• Although MR has gained attention from the HPC community, there is still the question of how to run MR jobs along with regular HPC jobs in an HPC cluster
• MR Job Adaptor
  – Allows transparent MR job submission on HPC clusters
  – Minimizes the average turnaround time of the jobs
  – Improves overall utilization by exploiting unused resources in the cluster