Presentation CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman at the AMD Developer Summit (APU13) November 11-13, 2013.
1. CHARACTERIZING APU PERFORMANCE IN HADOOPCL ON HETEROGENEOUS DISTRIBUTED PLATFORMS
MAX GROSSMAN, MAURICIO BRETERNITZ, AND VIVEK SARKAR
RICE UNIVERSITY & AMD
2. MOTIVATION
! Cloud offers elasticity, lowered startup costs, and a unified platform for all
! Generally see worse and less predictable performance
‒ Noisy neighbor
! Economies of scale => the cloud is here to stay
“I don’t care where my code runs, as long as it finishes… someday” – Bob the Cloud User
2 | PRESENTATION TITLE | NOVEMBER 21, 2013 | CONFIDENTIAL
3. STATE-OF-THE-ART
! Hadoop
‒ Java programming language
‒ JDK libraries
‒ Arbitrary data types
‒ Reliability
‒ Simple MapReduce distributed programming model
! Abstractions built on Hadoop
‒ H2O from 0xdata
‒ Mahout machine learning framework
4. PROBLEMS
1. Poor computational performance
‒ JVM execution; short-lived tasks imply poor JIT and a high startup cost for creating child processes
2. Poor I/O performance
‒ Serialization and deserialization of arbitrary data types
3. Manual tweaking of intertwined tunables
‒ In an unstable cloud environment, you never have it right
4. Scheduling execution & communication with a holistic view of the platform
" A
small
sampling
of
Hadoop
tunables…
5. A POTENTIAL SOLUTION
! OpenCL
‒ SIMD programming model
‒ Multi-architecture and multi-vendor support
‒ APIs for launching compute and copy tasks
! An expert programmer could:
1. Translate all application code to OpenCL kernels
2. Compile OpenCL kernels and API calls into a native library
3. Call the native library from Java via JNI
4. Spend a lot of time debugging performance and correctness
! Still not good enough!
[Diagram: a Host application launching work on a Device via clEnqueueNDRange()]
6. [Diagram: the HadoopCL stack. Hadoop provides reliability and the distributed platform; APARAPI translates bytecode to OpenCL kernels; OpenCL provides multi-architecture execution in native threads]
! Hardware-aware platform manager
! Machine-learning, multi-device scheduler based on device occupancy and past kernel performance
! Architecture-aware optimizing compiler
! Hadoop-like API
7. HADOOPCL ARCHITECTURE

class PiMapper extends DoubleDoubleBoolIntHadoopCLMapper {
    public void map(double x, double y) {
        if (x * x + y * y > 0.25) {
            write(false, 1);
        } else {
            write(true, 1);
        }
    }
}

job.waitForCompletion(true);
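For context, the Monte Carlo computation that PiMapper encodes can be sketched in plain Java; the class below is illustrative and not part of the HadoopCL API:

```java
import java.util.Random;

public class PiEstimate {
    // Sample points in the [0, 0.5] x [0, 0.5] square and test
    // x*x + y*y > 0.25, exactly the predicate PiMapper uses.
    static double estimate(int samples, long seed) {
        Random rng = new Random(seed);
        int inside = 0;
        for (int s = 0; s < samples; s++) {
            double x = rng.nextDouble() * 0.5;
            double y = rng.nextDouble() * 0.5;
            if (!(x * x + y * y > 0.25)) {
                inside++; // point fell inside the quarter circle of radius 0.5
            }
        }
        // area(quarter circle, r=0.5) / area(square) = pi/4
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println(PiEstimate.estimate(1_000_000, 42L));
    }
}
```

The fraction of true/false counts the mapper emits is what lets a reducer recover this pi/4 ratio.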
[Diagram: Java source compiled by javac to .class bytecode]
! The HadoopCL programming model supports
‒ Java syntax
‒ MapReduce abstractions
‒ Dynamic memory allocation
‒ A variety of data types (primitives, sparse vectors, tuples, etc.), and can be extended to more
‒ Constant globals accessible from anywhere
! HadoopCL does not support
‒ Arbitrary inputs and outputs
‒ Massive data elements (i.e. sparse vectors larger than device memory)
‒ Object references
8. HADOOPCL ARCHITECTURE

$ hadoop jar Pi.jar input output

[Diagram: a NameNode + JobTracker coordinating multiple DataNodes]
[Diagram: within a Hadoop DataNode, the TaskTracker and the HadoopCL ML Device Scheduler dispatch Map or Reduce Tasks to multiple HadoopCL Child processes]
9. HADOOPCL ARCHITECTURE
‒ Data is buffered in chunks for processing on the OpenCL device
! HadoopCL explicitly manages buffers to prevent large GC overheads
! The Kernel Executor handles
‒ Auto-generation and optimization of OpenCL kernels from JVM bytecode
‒ Transfer of inputs and outputs to the device
‒ Asynchronous launch of OpenCL kernels
! Each Child JVM encloses a data-driven pipeline of communication and computation tasks
[Diagram: inside a HadoopCL Child, an Input Collector fills Input Buffers from the Input Buffer Manager; the Kernel Executor queues, launches, and retries kernels on the OpenCL Device; Output Buffers pass results to the Output Buffer Manager and are then released]
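The buffer-recycling pipeline described above can be sketched with standard Java concurrency primitives; all class and method names below are illustrative, not the actual HadoopCL types:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BufferPipeline {
    // A fixed pool of reusable buffers avoids per-chunk allocation,
    // which is what keeps GC overhead low in the real system.
    private final BlockingQueue<double[]> freeBuffers;
    private final BlockingQueue<double[]> filledBuffers;

    public BufferPipeline(int nBuffers, int chunkSize) {
        freeBuffers = new ArrayBlockingQueue<>(nBuffers);
        filledBuffers = new ArrayBlockingQueue<>(nBuffers);
        for (int i = 0; i < nBuffers; i++) {
            freeBuffers.add(new double[chunkSize]);
        }
    }

    // Input-collector side: take a free buffer, fill it, queue it for launch.
    public double[] acquireForFill() throws InterruptedException {
        return freeBuffers.take();
    }
    public void submitFilled(double[] buf) { filledBuffers.add(buf); }

    // Kernel-executor side: take a filled buffer, process it, release it
    // back to the free pool so it can be reused for the next chunk.
    public double[] takeForLaunch() throws InterruptedException {
        return filledBuffers.take();
    }
    public void release(double[] buf) { freeBuffers.add(buf); }
}
```

The blocking queues give the back-pressure the diagram's "Input Buffer Queue" and "Output Buffer Queue" stages imply: a slow device stalls the collector instead of growing the heap.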
10. TOPICS IN HADOOPCL
! Extending APARAPI with architecture- and data-aware compiler optimizations
1. A number of HadoopCL-specific functions are auto-generated from APARAPI at runtime
2. When GPU execution is detected and a vector data type is in use, the HadoopCL runtime auto-strides input vectors before copying to the device
‒ APARAPI must emit strided code to match the data layout; this fails in certain cases
Strided (GPU):

double MahoutKMeansMapper__dot(...) {
    double agg = 0.0;
    for (int i = 0; i < length1; i++) {
        int currentIndex = index1[(i) * this->nPairs];
        int j = 0;
        for (; j < length2 && currentIndex != index2[j]; j++)
            ;
        if (j != length2)
            agg = agg + (val1[(i) * this->nPairs] * val2[j]);
    }
    return (agg);
}
Unstrided:

double MahoutKMeansMapper__dot(...) {
    double agg = 0.0;
    for (int i = 0; i < length1; i++) {
        int currentIndex = index1[i];
        int j = 0;
        for (; j < length2 && currentIndex != index2[j]; j++)
            ;
        if (j != length2)
            agg = agg + (val1[i] * val2[j]);
    }
    return (agg);
}
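The auto-striding transform itself can be illustrated in plain Java. Assuming the layout the strided kernel accesses imply (element i of vector p lands at index i * nPairs + p, so consecutive GPU work-items read adjacent memory), a dense-vector helper might look like this; the helper is hypothetical, and real HadoopCL inputs are sparse vectors of varying length, which is where the generated strided indexing can fail to match:

```java
public class Striding {
    // Interleave nPairs equal-length vectors so that element i of
    // vector p lands at index i * nPairs + p (struct-of-arrays layout).
    static int[] stride(int[][] vectors) {
        int nPairs = vectors.length;
        int len = vectors[0].length;
        int[] out = new int[nPairs * len];
        for (int p = 0; p < nPairs; p++) {
            for (int i = 0; i < len; i++) {
                out[i * nPairs + p] = vectors[p][i];
            }
        }
        return out;
    }
}
```

With this layout, work-item p's read of its element i touches out[i * nPairs + p], so a warp of adjacent work-items issues one coalesced memory transaction per i.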
11. TOPICS IN HADOOPCL
! Enabling OpenCL dynamic memory allocation through restartable kernels
‒ Note: mappers and reducers have no side effects until they commit (i.e. write())
[Diagram: OpenCL device heap, tracked by free, nWrites, nInputs, and writeOffsetLookup]

Mapper.java:

public void map(int key, double val) {
    int[] outputVec = new int[10];
    ...
    write(key, outputVec);
}
Mapper.cl:

__kernel void map(int key, double val) {
    int oldOffset = atomic_add(free, 10);
    if (oldOffset + 10 >= heapSize) {
        nWrites[inputIndex] = -1;
        return;
    }
    ...
    writeOffsetLookup[inputIndex] = oldOffset;
    nWrites[inputIndex] = nWrites[inputIndex] + 1;
}
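On the host side, the restart protocol this kernel implies can be sketched as follows: any input whose nWrites slot is -1 overflowed the heap, produced no output (no side effects before commit), and must be re-run after the committed outputs are drained. The helper below is a simplified sketch, not the actual HadoopCL retry path:

```java
import java.util.ArrayList;
import java.util.List;

public class RestartableLaunch {
    // Returns the input indices that must be retried: the kernel set
    // nWrites[i] = -1 when its allocation would overflow the heap,
    // and (by the no-side-effects rule) wrote nothing for that input.
    static List<Integer> collectRetries(int[] nWrites) {
        List<Integer> retry = new ArrayList<>();
        for (int i = 0; i < nWrites.length; i++) {
            if (nWrites[i] == -1) {
                retry.add(i);
            }
        }
        return retry;
    }
}
```

The executor would then drain the committed outputs, reset free to 0, and relaunch the kernel over just the retry set, repeating until the set is empty.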
12. TOPICS IN HADOOPCL
! Auto-scheduling OpenCL kernels across execution platforms through machine learning
‒ The HadoopCL TaskTracker is responsible for
1. Assigning each Task an execution platform (GPU, CPU, or JVM)
2. Recording execution time for each task, along with the kernel executed and the average device occupancy during that task's execution
! Device assignment is based on programmer hints and/or recorded data from previous runs
‒ Data is recorded in files to be used across Jobs
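A minimal, non-ML version of this device choice might simply average recorded execution times per (kernel, device) pair and pick the fastest, falling back to the JVM when there is no history; HadoopCL's actual scheduler also folds in device occupancy. All names below are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class DeviceScheduler {
    // Running sum and count of observed execution times,
    // keyed by "kernel:device".
    private final Map<String, double[]> stats = new HashMap<>();

    public void record(String kernel, String device, double timeMs) {
        double[] s = stats.computeIfAbsent(kernel + ":" + device,
                                           k -> new double[2]);
        s[0] += timeMs; // sum of times
        s[1] += 1;      // number of samples
    }

    // Pick the device with the lowest mean recorded time for this
    // kernel; devices with no history default to the JVM fallback.
    public String choose(String kernel, String[] devices) {
        String best = "JVM";
        double bestMean = Double.MAX_VALUE;
        for (String d : devices) {
            double[] s = stats.get(kernel + ":" + d);
            if (s == null || s[1] == 0) continue;
            double mean = s[0] / s[1];
            if (mean < bestMean) {
                bestMean = mean;
                best = d;
            }
        }
        return best;
    }
}
```

Persisting stats to a file between Jobs, as the slide describes, would let a new Job start from the previous Job's history instead of cold.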
13. EVALUATION
! Mahout KMeans
‒ Mahout provides Hadoop MapReduce implementations of a variety of ML algorithms
‒ KMeans iteratively searches for K clusters
! HadoopCL KMeans port
‒ The mapper is trivial: for each point, it iterates through all clusters and outputs the closest
‒ The reducer is more complex
‒ Both OpenCL and Java versions were implemented, as HadoopCL allows the programmer to force JVM execution
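The "trivial" mapper logic, finding the closest of the K cluster centers for each point, looks roughly like this in plain Java (illustrative; not the actual HadoopCL mapper class):

```java
public class NearestCluster {
    // Squared Euclidean distance between a point and a cluster center.
    static double dist2(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += diff * diff;
        }
        return d;
    }

    // Index of the closest center. A mapper would then emit
    // (closestIndex, point) so the reducer can recompute each center.
    static int closest(double[] point, double[][] centers) {
        int best = 0;
        double bestD = dist2(point, centers[0]);
        for (int k = 1; k < centers.length; k++) {
            double d = dist2(point, centers[k]);
            if (d < bestD) {
                bestD = d;
                best = k;
            }
        }
        return best;
    }
}
```

In the Mahout data sets the points are sparse vectors, so the inner distance computation becomes the sparse dot product shown on slide 10.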
14. EVALUATION
! Evaluated on a 10-node AMD APU cluster
! Two datasets with varying parameters tested
‒ Wiki data set
‒ ASF e-mail archives data set
‒ Varied K, the number of clusters
‒ Varied the type of pruning done on the input data (prune all but the N most frequent tokens vs. prune each vector to be at most length M)
‒ Varied the amount of pruning done (i.e. the values of N and M)
‒ Enabled and disabled HadoopCL features to observe the impact on performance
15. EVALUATION
! Graphs here
16. CONCLUSION
! HadoopCL offers the flexibility, reliability, and programmability of Hadoop, accelerated by native, heterogeneous OpenCL threads
! Using HadoopCL is a tradeoff: lose parts of the Java language, but gain improved performance
! Evaluation of KMeans with real-world data sets shows that HadoopCL is flexible and efficient enough to improve the performance of real-world applications
! Future work: target HSA instead of OpenCL
Max Grossman, jmg3@rice.edu