Presentation CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman at the AMD Developer Summit (APU13) November 11-13, 2013.
1. CHARACTERIZING APU PERFORMANCE IN HADOOPCL ON HETEROGENEOUS DISTRIBUTED PLATFORMS
MAX GROSSMAN, MAURICIO BRETERNITZ, AND VIVEK SARKAR
RICE UNIVERSITY & AMD
2. MOTIVATION
! Cloud offers elasticity, lowered startup costs, and a unified platform for all
! Generally see worse and less predictable performance
‒ Noisy neighbor
! Economies of scale => the cloud is here to stay
“I don’t care where my code runs, as long as it finishes… someday” – Bob the Cloud User
2 | PRESENTATION TITLE | NOVEMBER 21, 2013 | CONFIDENTIAL
3. STATE-OF-THE-ART
! Hadoop
‒ Java programming language
‒ JDK libraries
‒ Arbitrary data types
‒ Reliability
‒ Simple MapReduce distributed programming model
! Abstractions built on Hadoop
‒ H2O from 0xdata
‒ Mahout machine learning framework
4. PROBLEMS
1. Poor computational performance
‒ JVM execution; short-lived tasks imply poor JIT and a high startup cost for creating child processes
2. Poor I/O performance
‒ Serialization and deserialization of arbitrary data types
3. Manual tweaking of intertwined tunables
‒ In an unstable cloud environment, you never have it right
4. Scheduling execution & communication with a holistic view of the platform
" A
small
sampling
of
Hadoop
tunables…
5. A POTENTIAL SOLUTION
! OpenCL
‒ SIMD programming model
‒ Multi-architecture and multi-vendor support
‒ APIs for launching compute and copy tasks
! An expert programmer could:
1. Translate all application code to OpenCL kernels
2. Compile OpenCL kernels and API calls into a native library
3. Call the native library from Java via JNI
4. Spend a lot of time debugging performance and correctness
! Still not good enough!
[Diagram: a Host application launching work on a Device via clEnqueueNDRange()]
6. [Diagram: the HadoopCL stack. Hadoop provides reliability and the distributed platform; APARAPI translates bytecode to OpenCL kernels; OpenCL provides multi-architecture execution in native threads]
! Hardware-aware platform manager
! Machine-learning, multi-device scheduler based on device occupancy and past kernel performance
! Architecture-aware optimizing compiler
! Hadoop-like API
7. HADOOPCL ARCHITECTURE

class PiMapper extends DoubleDoubleBoolIntHadoopCLMapper {
    public void map(double x, double y) {
        if (x * x + y * y > 0.25) {
            write(false, 1);
        } else {
            write(true, 1);
        }
    }
}

job.waitForCompletion(true);
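For context, the Monte Carlo computation that PiMapper encodes can be sketched in plain Java; the class below is illustrative and not part of the HadoopCL API:

```java
import java.util.Random;

public class PiEstimate {
    // Sample points in the [0, 0.5] x [0, 0.5] square and test
    // x*x + y*y > 0.25, exactly the predicate PiMapper uses.
    static double estimate(int samples, long seed) {
        Random rng = new Random(seed);
        int inside = 0;
        for (int s = 0; s < samples; s++) {
            double x = rng.nextDouble() * 0.5;
            double y = rng.nextDouble() * 0.5;
            if (!(x * x + y * y > 0.25)) {
                inside++; // point fell inside the quarter circle of radius 0.5
            }
        }
        // area(quarter circle, r=0.5) / area(square) = pi/4
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println(PiEstimate.estimate(1_000_000, 42L));
    }
}
```

The fraction of true/false counts the mapper emits is what lets a reducer recover this pi/4 ratio.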
[Diagram: Java source compiled by javac to .class bytecode]
! The HadoopCL programming model supports
‒ Java syntax
‒ MapReduce abstractions
‒ Dynamic memory allocation
‒ A variety of data types (primitives, sparse vectors, tuples, etc.), and can be extended to more
‒ Constant globals accessible from anywhere
! HadoopCL does not support
‒ Arbitrary inputs and outputs
‒ Massive data elements (i.e. sparse vectors larger than device memory)
‒ Object references
8. HADOOPCL ARCHITECTURE

$ hadoop jar Pi.jar input output

[Diagram: a NameNode + JobTracker coordinating multiple DataNodes]
[Diagram: within a Hadoop DataNode, the TaskTracker and the HadoopCL ML Device Scheduler dispatch Map or Reduce Tasks to multiple HadoopCL Child processes]
9. HADOOPCL ARCHITECTURE
‒ Data is buffered in chunks for processing on the OpenCL device
! HadoopCL explicitly manages buffers to prevent large GC overheads
! The Kernel Executor handles
‒ Auto-generation and optimization of OpenCL kernels from JVM bytecode
‒ Transfer of inputs and outputs to the device
‒ Asynchronous launch of OpenCL kernels
! Each Child JVM encloses a data-driven pipeline of communication and computation tasks
[Diagram: inside a HadoopCL Child, an Input Collector fills Input Buffers from the Input Buffer Manager; the Kernel Executor queues, launches, and retries kernels on the OpenCL Device; Output Buffers pass results to the Output Buffer Manager and are then released]
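The buffer-recycling pipeline described above can be sketched with standard Java concurrency primitives; all class and method names below are illustrative, not the actual HadoopCL types:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BufferPipeline {
    // A fixed pool of reusable buffers avoids per-chunk allocation,
    // which is what keeps GC overhead low in the real system.
    private final BlockingQueue<double[]> freeBuffers;
    private final BlockingQueue<double[]> filledBuffers;

    public BufferPipeline(int nBuffers, int chunkSize) {
        freeBuffers = new ArrayBlockingQueue<>(nBuffers);
        filledBuffers = new ArrayBlockingQueue<>(nBuffers);
        for (int i = 0; i < nBuffers; i++) {
            freeBuffers.add(new double[chunkSize]);
        }
    }

    // Input-collector side: take a free buffer, fill it, queue it for launch.
    public double[] acquireForFill() throws InterruptedException {
        return freeBuffers.take();
    }
    public void submitFilled(double[] buf) { filledBuffers.add(buf); }

    // Kernel-executor side: take a filled buffer, process it, release it
    // back to the free pool so it can be reused for the next chunk.
    public double[] takeForLaunch() throws InterruptedException {
        return filledBuffers.take();
    }
    public void release(double[] buf) { freeBuffers.add(buf); }
}
```

The blocking queues give the back-pressure the diagram's "Input Buffer Queue" and "Output Buffer Queue" stages imply: a slow device stalls the collector instead of growing the heap.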
10. TOPICS IN HADOOPCL
! Extending APARAPI with architecture- and data-aware compiler optimizations
1. A number of HadoopCL-specific functions are auto-generated from APARAPI at runtime
2. When GPU execution is detected and a vector data type is in use, the HadoopCL runtime auto-strides input vectors before copying to the device
‒ APARAPI must emit strided code to match the data layout; this fails in certain cases
Strided (GPU):

double MahoutKMeansMapper__dot(...) {
    double agg = 0.0;
    for (int i = 0; i < length1; i++) {
        int currentIndex = index1[(i) * this->nPairs];
        int j = 0;
        for (; j < length2 && currentIndex != index2[j]; j++)
            ;
        if (j != length2)
            agg = agg + (val1[(i) * this->nPairs] * val2[j]);
    }
    return (agg);
}
Unstrided:

double MahoutKMeansMapper__dot(...) {
    double agg = 0.0;
    for (int i = 0; i < length1; i++) {
        int currentIndex = index1[i];
        int j = 0;
        for (; j < length2 && currentIndex != index2[j]; j++)
            ;
        if (j != length2)
            agg = agg + (val1[i] * val2[j]);
    }
    return (agg);
}
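The auto-striding transform itself can be illustrated in plain Java. Assuming the layout the strided kernel accesses imply (element i of vector p lands at index i * nPairs + p, so consecutive GPU work-items read adjacent memory), a dense-vector helper might look like this; the helper is hypothetical, and real HadoopCL inputs are sparse vectors of varying length, which is where the generated strided indexing can fail to match:

```java
public class Striding {
    // Interleave nPairs equal-length vectors so that element i of
    // vector p lands at index i * nPairs + p (struct-of-arrays layout).
    static int[] stride(int[][] vectors) {
        int nPairs = vectors.length;
        int len = vectors[0].length;
        int[] out = new int[nPairs * len];
        for (int p = 0; p < nPairs; p++) {
            for (int i = 0; i < len; i++) {
                out[i * nPairs + p] = vectors[p][i];
            }
        }
        return out;
    }
}
```

With this layout, work-item p's read of its element i touches out[i * nPairs + p], so a warp of adjacent work-items issues one coalesced memory transaction per i.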
11. TOPICS IN HADOOPCL
! Enabling OpenCL dynamic memory allocation through restartable kernels
‒ Note: mappers and reducers have no side effects until they commit (i.e. write())
[Diagram: OpenCL device heap, tracked by free, nWrites, nInputs, and writeOffsetLookup]

Mapper.java:

public void map(int key, double val) {
    int[] outputVec = new int[10];
    ...
    write(key, outputVec);
}
Mapper.cl:

__kernel void map(int key, double val) {
    int oldOffset = atomic_add(free, 10);
    if (oldOffset + 10 >= heapSize) {
        nWrites[inputIndex] = -1;
        return;
    }
    ...
    writeOffsetLookup[inputIndex] = oldOffset;
    nWrites[inputIndex] = nWrites[inputIndex] + 1;
}
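On the host side, the restart protocol this kernel implies can be sketched as follows: any input whose nWrites slot is -1 overflowed the heap, produced no output (no side effects before commit), and must be re-run after the committed outputs are drained. The helper below is a simplified sketch, not the actual HadoopCL retry path:

```java
import java.util.ArrayList;
import java.util.List;

public class RestartableLaunch {
    // Returns the input indices that must be retried: the kernel set
    // nWrites[i] = -1 when its allocation would overflow the heap,
    // and (by the no-side-effects rule) wrote nothing for that input.
    static List<Integer> collectRetries(int[] nWrites) {
        List<Integer> retry = new ArrayList<>();
        for (int i = 0; i < nWrites.length; i++) {
            if (nWrites[i] == -1) {
                retry.add(i);
            }
        }
        return retry;
    }
}
```

The executor would then drain the committed outputs, reset free to 0, and relaunch the kernel over just the retry set, repeating until the set is empty.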
12. TOPICS IN HADOOPCL
! Auto-scheduling OpenCL kernels across execution platforms through machine learning
‒ The HadoopCL TaskTracker is responsible for
1. Assigning each Task an execution platform (GPU, CPU, or JVM)
2. Recording execution time for each task, along with the kernel executed and the average device occupancy during that task's execution
! Device assignment is based on programmer hints and/or recorded data from previous runs
‒ Data is recorded in files to be used across Jobs
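A minimal, non-ML version of this device choice might simply average recorded execution times per (kernel, device) pair and pick the fastest, falling back to the JVM when there is no history; HadoopCL's actual scheduler also folds in device occupancy. All names below are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class DeviceScheduler {
    // Running sum and count of observed execution times,
    // keyed by "kernel:device".
    private final Map<String, double[]> stats = new HashMap<>();

    public void record(String kernel, String device, double timeMs) {
        double[] s = stats.computeIfAbsent(kernel + ":" + device,
                                           k -> new double[2]);
        s[0] += timeMs; // sum of times
        s[1] += 1;      // number of samples
    }

    // Pick the device with the lowest mean recorded time for this
    // kernel; devices with no history default to the JVM fallback.
    public String choose(String kernel, String[] devices) {
        String best = "JVM";
        double bestMean = Double.MAX_VALUE;
        for (String d : devices) {
            double[] s = stats.get(kernel + ":" + d);
            if (s == null || s[1] == 0) continue;
            double mean = s[0] / s[1];
            if (mean < bestMean) {
                bestMean = mean;
                best = d;
            }
        }
        return best;
    }
}
```

Persisting stats to a file between Jobs, as the slide describes, would let a new Job start from the previous Job's history instead of cold.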
13. EVALUATION
! Mahout KMeans
‒ Mahout provides Hadoop MapReduce implementations of a variety of ML algorithms
‒ KMeans iteratively searches for K clusters
! HadoopCL KMeans port
‒ The mapper is trivial: for each point, it iterates through all clusters and outputs the closest
‒ The reducer is more complex
‒ Both OpenCL and Java versions were implemented, as HadoopCL allows the programmer to force JVM execution
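The "trivial" mapper logic, finding the closest of the K cluster centers for each point, looks roughly like this in plain Java (illustrative; not the actual HadoopCL mapper class):

```java
public class NearestCluster {
    // Squared Euclidean distance between a point and a cluster center.
    static double dist2(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += diff * diff;
        }
        return d;
    }

    // Index of the closest center. A mapper would then emit
    // (closestIndex, point) so the reducer can recompute each center.
    static int closest(double[] point, double[][] centers) {
        int best = 0;
        double bestD = dist2(point, centers[0]);
        for (int k = 1; k < centers.length; k++) {
            double d = dist2(point, centers[k]);
            if (d < bestD) {
                bestD = d;
                best = k;
            }
        }
        return best;
    }
}
```

In the Mahout data sets the points are sparse vectors, so the inner distance computation becomes the sparse dot product shown on slide 10.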
14. EVALUATION
! Evaluated on a 10-node AMD APU cluster
! Two datasets with varying parameters tested
‒ Wiki data set
‒ ASF e-mail archives data set
‒ Varied K, the number of clusters
‒ Varied the type of pruning done on the input data (prune all but the N most frequent tokens vs. prune each vector to be at most length M)
‒ Varied the amount of pruning done (i.e. the values of N and M)
‒ Enabled and disabled HadoopCL features to observe the impact on performance
15. EVALUATION
! Graphs here
16. CONCLUSION
! HadoopCL offers the flexibility, reliability, and programmability of Hadoop, accelerated by native, heterogeneous OpenCL threads
! Using HadoopCL is a tradeoff: lose parts of the Java language, but gain improved performance
! Evaluation of KMeans with real-world data sets shows that HadoopCL is flexible and efficient enough to improve the performance of real-world applications
! Future work: target HSA instead of OpenCL
Max Grossman, jmg3@rice.edu