1. Impala Resource Management: A Brief Overview
© Cloudera, Inc. All rights reserved.
Matthew Jacobs | @mattjacobs
November 2015
Relevant through Impala 2.2/CDH5.4
2. Impala Resource Management: Overview
• Problem: how to best utilize cluster resources
State of the world as of Impala 2.2/CDH5.4:
• Within Impala
  • READY FOR USE: Built-in Admission Control (introduced in Impala 1.3/CDH 5.0)
• Between Impala and the rest of the world
  • READY FOR USE: “Static Partitioning” from Cloudera Manager
  • NOT READY: Integration with YARN
    • Experimental integration shipped in Impala 1.3/CDH 5.0
    • Some known issues exist, do not use it today! More on this later…
    • We’re actively working on this, stay tuned!
3. Talk Overview
This is a very brief overview! Many details we can’t cover in 20 min :(
• How to be successful today (including with Impala 2.3/CDH5.5)
• Overview of Impala on YARN
  • Architecture
  • Why you can’t use it yet
  • How it might look when you can
4. “Resource Management” Today
• Use one or both of:
  • Static Partitioning with Cloudera Manager (also called “Static Resource Pools”)
  • Impala’s built-in Admission Control
• Static Partitioning: dedicate resources for Impala, HBase, YARN, etc.
  • Easy to use and works well. Set up by Cloudera Manager, uses cgroups
  • E.g. Impala gets 100GB/30% CPU, HBase gets 50GB/20% CPU, etc.
• Admission Control: throttle Impala queries
  • Set a limit on the max # queries or max memory used by those queries
  • E.g. queue queries once more than 20 queries are running concurrently, or queue once more than 100GB is used
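The throttling rule in the example above can be sketched in a few lines. This is a simplified illustration, not Impala's actual admission logic: the names `PoolLimits` and `admit_or_queue` are hypothetical, and the limits are the 20-query and 100GB figures from the slide.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class PoolLimits:
    """Hypothetical limits for one admission-control pool (values from the slide)."""
    max_running_queries: int = 20
    max_memory_bytes: int = 100 * GB


def admit_or_queue(running_queries: int, memory_in_use: int,
                   query_memory: int, limits: PoolLimits) -> str:
    """Decide whether a newly submitted query runs now or waits in the queue."""
    # Queue if admitting this query would exceed the concurrency limit.
    if running_queries + 1 > limits.max_running_queries:
        return "queue"
    # Queue if admitting this query would exceed the pool memory limit.
    if memory_in_use + query_memory > limits.max_memory_bytes:
        return "queue"
    return "admit"


limits = PoolLimits()
print(admit_or_queue(19, 50 * GB, 5 * GB, limits))  # room for one more query
print(admit_or_queue(20, 50 * GB, 5 * GB, limits))  # concurrency limit reached
print(admit_or_queue(3, 98 * GB, 5 * GB, limits))   # memory limit would be exceeded
```

Either limit on its own is enough to send a query to the queue, which matches the "max # queries or max memory" wording above.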
5. When to Use AC? Static Partitioning?
With Static Partitioning, with Admission Control:
• Using Impala with other systems (e.g. Hive, Spark) and need to guarantee each gets resources
• Heavy Impala workload, need to make sure queries aren’t stepping on each other
With Static Partitioning, without Admission Control:
• Using Impala with other systems and need to guarantee each gets resources
• Light to moderate Impala workload, not using all available resources yet
Without Static Partitioning, with Admission Control:
• Impala-only cluster, or other systems have very light, non-competing workloads
• Heavy Impala workload, need to make sure queries aren’t stepping on each other
Without Static Partitioning, without Admission Control:
• Enough cluster resources are available for all workloads to consume as much as necessary
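Read as a decision rule, the grid above reduces to two independent questions. The sketch below encodes that reading; the function name `recommend` and its parameter names are hypothetical, and real deployments will have more nuance than two booleans.

```python
def recommend(other_systems_need_guarantees: bool,
              heavy_impala_workload: bool) -> dict:
    """Map the two questions from the grid to the two mechanisms.

    Assumption: Static Partitioning is driven by sharing the cluster with
    other systems, and Admission Control by a heavy Impala workload.
    """
    return {
        "static_partitioning": other_systems_need_guarantees,
        "admission_control": heavy_impala_workload,
    }


# Impala shares the cluster with Hive/Spark and runs a heavy workload.
print(recommend(other_systems_need_guarantees=True, heavy_impala_workload=True))
# Impala-only cluster with plenty of headroom.
print(recommend(other_systems_need_guarantees=False, heavy_impala_workload=False))
```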
6. (Aside: A Plethora of Mem Limits)
• Process (impalad) memory limit
  • Max memory the process can use across all queries. When a query consumes memory such that the process hits this limit, the query is killed
  • Set with the “--mem_limit” impalad command-line argument, or “Impala Daemon Memory Limit” in CM. The value is specified in terms of single-impalad memory.
• Pool (admission control) memory limit
  • Max memory the queries in a pool/queue can use. The value is used only to admit queries, not enforced once queries are admitted. The value is specified as the cluster-wide limit, i.e. aggregate limit across all impalads.
  • http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_admission.html
• Query (query option) memory limit
  • Max memory a query can use; if a query uses more than this, it may have to be killed (if it can’t spill).
  • Set via the “set mem_limit=Xg” query option. Can set a default query option via impalad command-line arguments (see the next slide).
  • The value is specified in terms of single-impalad memory, e.g. Xg per node
  • http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_mem_limit.html
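Because the pool limit is cluster-wide while the query limit is per node, it helps to convert between the two before comparing them. The arithmetic below is a rough sketch under a loudly stated assumption: every query runs on every impalad and is accounted at its full per-node limit. The helper names are hypothetical.

```python
GB = 1024 ** 3


def cluster_wide_footprint(per_node_mem_limit: int, num_impalads: int) -> int:
    """Aggregate memory one query may use if it runs on all impalads (assumption)."""
    return per_node_mem_limit * num_impalads


def queries_that_fit(pool_limit: int, per_node_mem_limit: int,
                     num_impalads: int) -> int:
    """How many such queries fit under a cluster-wide pool memory limit."""
    return pool_limit // cluster_wide_footprint(per_node_mem_limit, num_impalads)


# 10 impalads, "set mem_limit=5g" per query, 100GB pool limit.
footprint = cluster_wide_footprint(5 * GB, 10)
print(footprint // GB)                         # 50 (GB across the cluster)
print(queries_that_fit(100 * GB, 5 * GB, 10))  # 2
```

So a per-node limit that looks small (5g) can consume half of a 100GB pool on a 10-node cluster.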
7. Important! AC with Mem Limits is Tricky
• Admission based on pool memory limits will use:
  • The query memory limit if it is set (set MEM_LIMIT=Xg;)
  • Otherwise falls back to an estimate from planning, this is usually wrong!
• Do not use pool memory limits unless you also set query memory limits
• Consider setting a default value for the ‘mem_limit’ query option
  • Set via the ‘--default_query_options’ impalad argument
  • E.g. --default_query_options='mem_limit=5g'
  • Can still override the default with the ‘set mem_limit=X;’ query option.
• Picking a good memory limit is hard, use CM’s charts to help understand your workload
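The fallback order described above can be sketched as two small functions. This is an illustration of the rule as stated on the slide, not Impala's code: the function names are hypothetical, values are plain byte counts, and the per-node versus cluster-wide distinction is ignored for brevity.

```python
from typing import Optional

GB = 1024 ** 3


def effective_mem_limit(default_options: dict, session_options: dict) -> Optional[int]:
    """A session-level 'set mem_limit=...' overrides the impalad-wide default."""
    return session_options.get("mem_limit", default_options.get("mem_limit"))


def memory_for_admission(mem_limit: Optional[int], planner_estimate: int) -> int:
    """Use the query memory limit if set, else fall back to the planner estimate."""
    return mem_limit if mem_limit is not None else planner_estimate


defaults = {"mem_limit": 5 * GB}   # stands in for --default_query_options='mem_limit=5g'
estimate = 40 * GB                 # hypothetical (and possibly wrong) planner estimate

# No session override: the default limit is what admission sees.
print(memory_for_admission(effective_mem_limit(defaults, {}), estimate) // GB)
# No default and no override: admission falls back to the estimate.
print(memory_for_admission(effective_mem_limit({}, {}), estimate) // GB)
```

With a default in place, admission never has to rely on the planner estimate, which is the point of the recommendation above.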
8. “Resource Management” Today, Summary
• Today: Use Admission Control and Static Partitioning
• We skipped over a lot of details, see the docs for more information
  • Impala Admission Control: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_admission.html
  • “Static Partitioning” in Cloudera Manager (also called “Static Service Pools”): http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_service_pools.html
• Ask us questions on impala-user@cloudera.org
9. Impala on YARN
• YARN is a “resource negotiator” that helps share cluster resources within Hadoop
• Works well for MapReduce and similar batch-oriented processing engines
• Doesn’t work well for services/frameworks that need:
  • Long running processes
  • Gang scheduling
  • Very low-latency scheduling requirements
• Doesn’t work so well for Impala
  • (And also HBase, MPI, Presto, custom apps, etc.)
10. Llama to the Rescue
• Llama = Long Lived Application MAster
• On github: http://cloudera.github.io/llama/index.html
• An interface between:
  • YARN’s ApplicationMaster (AM) model (batch jobs where tasks are each a process, coordinated by an AM)
  • Impala’s low-latency, in-process query model
• Llama provides:
  • Gang-scheduling
  • “Container” caching (to reduce resource acquisition cost)
16. Gang scheduling
• YARN returns resources in a trickle, as they become available
• For MR this is perfect, as tasks are mostly independent (and checkpoint to disk)
• For low-latency queries, we require all resources to be available at once so that query tasks can stream results to one another
• Llama buffers resources between YARN and Impala to make resource requests appear atomic and indivisible
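The buffering idea can be illustrated with a toy model: grants trickle in one at a time, and nothing is handed to the query until the whole set is present. `GangBuffer` and the container ids below are hypothetical and say nothing about Llama's real interfaces.

```python
from typing import List, Optional


class GangBuffer:
    """Toy buffer that holds trickled-in grants until a full gang is available."""

    def __init__(self, needed: int) -> None:
        self.needed = needed          # number of containers the query requires
        self.held: List[str] = []     # grants received so far

    def on_grant(self, container_id: str) -> Optional[List[str]]:
        """Record one grant; return the full gang once complete, else None."""
        self.held.append(container_id)
        if len(self.held) == self.needed:
            gang, self.held = self.held, []
            return gang
        return None


buf = GangBuffer(needed=3)
print(buf.on_grant("c1"))  # None: still waiting
print(buf.on_grant("c2"))  # None: still waiting
print(buf.on_grant("c3"))  # ['c1', 'c2', 'c3']: released all at once
```

From the caller's side the request looks atomic: it sees either nothing or the complete set.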
17. Resource caching
• Every container requires YARN to make an expensive resource allocation decision
• We ask Llama to cache resources between requests
• Containers stay in their queue in Llama, until YARN forcefully reclaims them
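A minimal sketch of the caching behaviour, assuming a generic "expensive allocate" callback in place of YARN. `ContainerCache`, `fake_allocate`, and the container names are hypothetical and only illustrate reuse and forced reclamation.

```python
from typing import Callable, List


class ContainerCache:
    """Toy cache: reuse idle containers instead of asking for a new allocation."""

    def __init__(self, allocate: Callable[[], str]) -> None:
        self._allocate = allocate     # stands in for an expensive YARN allocation
        self._idle: List[str] = []    # containers kept between requests

    def acquire(self) -> str:
        """Hand out a cached container if one is idle, else allocate a new one."""
        if self._idle:
            return self._idle.pop()
        return self._allocate()

    def release(self, container: str) -> None:
        """Keep the container for later requests rather than returning it."""
        self._idle.append(container)

    def reclaim(self, container: str) -> None:
        """Model the resource manager forcefully taking a cached container back."""
        if container in self._idle:
            self._idle.remove(container)


allocations = []

def fake_allocate() -> str:
    allocations.append(1)
    return f"container-{len(allocations)}"

cache = ContainerCache(fake_allocate)
first = cache.acquire()
cache.release(first)
second = cache.acquire()                  # served from the cache
print(first == second, len(allocations))  # True 1
```

Only the first `acquire` pays the allocation cost; later requests reuse the cached container until it is reclaimed.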
18. Impala on YARN: Current Status
• Experimental integration was shipped in Impala 1.4 / CDH 5.0
• Not ready for use yet!
  • A number of known bugs, see umbrella JIRA IMPALA-2370 to track
  • Some (but not all) important fixes in upcoming Impala 2.3 / CDH 5.5 release
  • Ongoing scale and performance testing work needed to provide guidance
• In a future release (post-Impala 2.3), we will be able to recommend usage for some workloads, w/ guidance