This document discusses high availability for HDFS and provides details on NameNode HA design. It begins with an overview of HDFS availability and reliability. It then discusses the initial goals for NameNode HA, which were to support an active and standby NameNode configuration with manual or automatic failover. The document also outlines some high-level use cases and provides a high-level overview of the NameNode HA design.
1. HDFS High Availability
Suresh Srinivas - Hortonworks
Aaron T. Myers - Cloudera
2. Overview
• Part 1 – Suresh Srinivas (Hortonworks)
 − HDFS Availability and Reliability – what is the record?
 − HA Use Cases
 − HA Design
• Part 2 – Aaron T. Myers (Cloudera)
 − NN HA Design Details
  ✓ Automatic failure detection and NN failover
  ✓ Client-NN connection failover
 − Operations and Admin of HA
 − Future Work
3. Availability, Reliability and Maintainability
Reliability = MTBF / (1 + MTBF)
• Probability a system performs its functions without failure for a desired period of time
Maintainability = 1 / (1 + MTTR)
• Probability that a failed system can be restored within a given timeframe
Availability = MTTF / MTBF
• Probability that a system is up when requested for use
• Depends on both Reliability and Maintainability
Mean Time To Failure (MTTF): Average time a system operates before failing
Mean Time To Repair/Restore (MTTR): Average time to repair a failed system
Mean Time Between Failures (MTBF): Average time between successive failures = MTTR + MTTF
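A quick worked example with illustrative numbers (not from the deck): if a NameNode runs on average 720 hours between failures (MTTF) and takes 0.5 hours to restore (MTTR), then MTBF = MTTF + MTTR = 720.5 hours and Availability = MTTF / MTBF = 720 / 720.5 ≈ 99.93%. Cutting MTTR to 3 minutes (0.05 hours) raises availability to 720 / 720.05 ≈ 99.993% without changing reliability at all, which is exactly the lever that HA failover pulls.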
4. Current HDFS Availability & Data Integrity
• Simple design for Higher Reliability
 − Storage: Rely on native file system on the OS rather than use raw disk
 − Single NameNode master
  ✓ Entire file system state is in memory
 − DataNodes simply store and deliver blocks
  ✓ All sophisticated recovery mechanisms in NN
• Fault Tolerance
 − Design assumes disks, nodes and racks fail
 − Multiple replicas of blocks
  ✓ Active monitoring and replication
  ✓ DNs actively monitor for block deletion and corruption
 − Restart/migrate the NameNode on failure
  ✓ Persistent state: multiple copies + checkpoints
  ✓ Functions as Cold Standby
 − Restart/replace the DNs on failure
 − DNs tolerate individual disk failures
5. How Well Did HDFS Work?
• Data Reliability
 − Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009
 − 7-9's of reliability
 − Related bugs fixed in 0.20 and 0.21
• NameNode Availability
 − 18-month study: 22 failures on 25 clusters - 0.58 failures per year per cluster
 − Only 8 would have benefited from HA failover!! (0.23 failures per cluster year)
 − NN is very reliable
  ✓ Resilient against overload caused by misbehaving apps
• Maintainability
 − Large clusters see failure of one DataNode/day and more frequent disk failures
 − Maintenance once in 3 months to repair or replace DataNodes
6. Why NameNode HA?
• NameNode is highly reliable (high MTTF)
 − But Availability is not the same as Reliability
• NameNode MTTR depends on
 − Restarting the NameNode daemon on failure
  ✓ Operator restart – (failure detection + manual restore) time
  ✓ Automatic restart – 1-2 minutes
 − NameNode startup time
  ✓ Small/medium cluster – 1-2 minutes
  ✓ Very large cluster – 5-15 minutes
• Affects applications that have real-time requirements
• For higher HDFS Availability
 − Need a redundant NameNode to eliminate the SPOF
 − Need automatic failover to reduce MTTR and improve Maintainability
 − Need a Hot standby to reduce MTTR for very large clusters
  ✓ Cold standby is sufficient for small clusters
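To put the MTTR numbers above in perspective, a rough illustrative calculation using only the figures quoted in this deck: at 0.58 NameNode failures per cluster-year, a 5-15 minute restart on a very large cluster contributes only about 3 to 9 minutes of downtime per year, yet each of those minutes is a full-cluster outage, and the figure excludes failure detection and operator response time, which usually dominate with a cold standby. Automatic failover to a hot standby attacks exactly those two terms.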
7. NameNode HA – Initial Goals
• Support for Active and a single Standby
 − Active and Standby with manual failover
  ✓ Standby could be cold/warm/hot
  ✓ Addresses downtime during upgrades – main cause of unavailability
 − Active and Standby with automatic failover
  ✓ Hot standby
  ✓ Addresses downtime during upgrades and other failures
• Backward compatible configuration
• Standby performs checkpointing
 − Secondary NameNode not needed
• Management and monitoring tools
• Design philosophy – choose data integrity over service availability
8. High Level Use Cases
• Planned downtime
 − Upgrades
 − Config changes
 − Main reason for downtime
• Unplanned downtime
 − Hardware failure
 − Server unresponsive
 − Software failures
 − Occurs infrequently
Supported failures
• Single hardware failure
 − Double hardware failure not supported
• Some software failures
 − Same software failure affects both active and standby
9. High Level Design
• Service monitoring and leader election outside the NN
 − Similar to industry standard HA frameworks
• Parallel Block reports to both Active and Standby NN
• Shared or non-shared NN file system state
• Fencing of shared resources/data
 − DataNodes
 − Shared NN state (if any)
• Client failover
 − Client side failover (based on configuration or ZooKeeper)
 − IP Failover
10. Design Considerations
• Sharing state between Active and Hot Standby
 − File system state and Block locations
• Automatic Failover
 − Monitoring the Active NN and performing failover on failure
• Making a NameNode active during startup
 − Reliable mechanism for choosing only one NN as active and the other as standby (see the sketch below)
• Prevent data corruption on split brain
 − Shared Resource Fencing
  ✓ DataNodes and shared storage for NN metadata
 − NameNode Fencing
  ✓ When the shared resource cannot be fenced
• Client failover
 − Clients connect to the new Active NN during failover
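As an illustration of the "only one NN active" consideration, the sketch below shows the basic ZooKeeper ephemeral-znode election pattern the design builds on; it is a minimal, self-contained example, not the actual HDFS implementation (that lives in the failover controller described later), and the connect string, znode path and payload are illustrative placeholders.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ActiveElectionSketch {
      public static void main(String[] args) throws Exception {
        // Illustrative connect string; a real deployment lists 3 or 5 ZK servers.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, event -> { });
        Thread.sleep(1000);  // crude wait for the session; a real controller watches for SyncConnected
        try {
          // Whoever creates the ephemeral znode first becomes active; if its ZK
          // session dies, the znode vanishes and the other contender can win.
          zk.create("/active-lock-demo", "nn1".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          System.out.println("Won election: transition to ACTIVE");
        } catch (KeeperException.NodeExistsException e) {
          System.out.println("Lost election: remain STANDBY");
        } finally {
          zk.close();
        }
      }
    }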
11. Failover Control Outside NN
• Similar to industry standard HA frameworks
• HA daemon outside the NameNode
 − Simpler to build
 − Immune to NN failures
• Daemon manages resources
 − Resources – OS, HW, Network etc.
 − NameNode is just another resource
• Performs
 − Active NN election during startup
 − Automatic Failover
 − Fencing
  ✓ Shared resources
  ✓ NameNode
[Figure: a Failover Controller, backed by ZooKeeper, applies actions (start, stop, failover, monitor, …) to its resources, including shared resources]
12. Architecture
[Figure: three ZK nodes provide leader election; a Failover Controller beside each NameNode monitors its health and issues commands, including fencing; the Active NN writes edit logs that the Standby NN reads; DataNodes send Block Reports to both the Active and Standby NN]
13. First Phase – Hot Standby
[Figure: Active and Standby NNs exchange edit logs through shared NFS storage, which itself needs to be HA; failover is manual; DataNodes send Block Reports to both NNs; fencing protects the shared state]
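To make the Active/Standby pair in this phase addressable, both NNs and all clients describe the pair as one logical nameservice. Below is a minimal configuration sketch using the Hadoop Configuration API; in practice these keys live in hdfs-site.xml, and the nameservice ID, NN IDs and host names are illustrative placeholders, not values from the slides.

    import org.apache.hadoop.conf.Configuration;

    public class NameservicePairSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // One logical nameservice backed by two NameNodes.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        conf.set("dfs.namenode.http-address.mycluster.nn1", "nn1.example.com:50070");
        conf.set("dfs.namenode.http-address.mycluster.nn2", "nn2.example.com:50070");
        System.out.println("Nameservice: " + conf.get("dfs.nameservices"));
      }
    }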
15. Client Failover Design Details
• Smart clients (client side failover)
 − Users use one logical URI, the client selects the correct NN to connect to
 − Clients know which operations are idempotent, therefore safe to retry on a failover
 − Clients have configurable failover/retry strategies
• Current implementation
 − Client configured with the addresses of all NNs
• Other implementations in the future (more later)
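A minimal sketch of what the smart-client approach looks like from an application's point of view, assuming the nameservice definition from the earlier sketch (or an equivalent hdfs-site.xml) is present: the application addresses the logical URI, and a failover proxy provider picks the active NN and retries idempotent operations against the other NN after a failover. The nameservice ID and path are illustrative placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HaClientSketch {
      public static void main(String[] args) throws Exception {
        // Assumes dfs.nameservices, dfs.ha.namenodes.*, dfs.namenode.rpc-address.*
        // are already configured (see the earlier sketch).
        Configuration conf = new Configuration();
        // Tell the client how to resolve the logical nameservice to the active NN.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Applications address the logical URI, never a specific NameNode host.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println("Root exists: " + fs.exists(new Path("/")));
      }
    }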
17. Automatic Failover Design Details
• Automatic failover requires ZooKeeper
 − Not required for manual failover
 − ZK makes it easy to:
  ✓ Detect failure of the active NN
  ✓ Determine which NN should become the Active NN
• On both NN machines, run another daemon
 − ZKFailoverController (ZooKeeper Failover Controller)
• Each ZKFC is responsible for:
 − Health monitoring of its associated NameNode
 − ZK session management / ZK-based leader election
• See HDFS-2185 and HADOOP-8206 for more details
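The settings that wire the ZKFCs to ZooKeeper are small; a minimal sketch follows. The ZooKeeper host names are illustrative, and in a real deployment these keys are set in hdfs-site.xml and core-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;

    public class AutoFailoverConfigSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Turn on ZKFC-driven automatic failover for the HA nameservice.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        // ZooKeeper quorum the ZKFCs use for failure detection and leader election.
        conf.set("ha.zookeeper.quorum",
            "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        System.out.println(conf.get("ha.zookeeper.quorum"));
      }
    }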
19. Ops/Admin: Shared Storage
• To share NN state, need shared storage
 − Needs to be HA itself to avoid just shifting the SPOF
 − Many come with IP fencing options
 − Recommended mount options:
  ✓ tcp,soft,intr,timeo=60,retrans=10
• Still configure local edits dirs, but the shared dir is special
• Work is currently underway to do away with the shared storage requirement (more later)
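A minimal sketch of the directory layout this implies: each NN keeps its usual local metadata directories, and both NNs point at the same NFS-mounted shared edits directory. The paths are illustrative placeholders.

    import org.apache.hadoop.conf.Configuration;

    public class SharedEditsConfigSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Local image/edits directories, as on a non-HA NameNode.
        conf.set("dfs.namenode.name.dir", "/data/1/dfs/nn,/data/2/dfs/nn");
        // The "special" shared directory: an HA NFS mount visible to both NNs.
        conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/filer/ha-edits");
        System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
      }
    }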
20. Ops/Admin: NN fencing
• Critical for correctness that only one NN is active at a time
• Out of the box
 − RPC to the active NN to tell it to go to standby (graceful failover)
 − SSH to the active NN and `kill -9` the NN
• Pluggable options
 − Many filers have protocols for IP-based fencing options
 − Many PDUs have protocols for IP-based plug-pulling (STONITH)
  ✓ Nuke the node from orbit. It's the only way to be sure.
• Configure extra options if available to you
 − Will be tried in order during a failover event
 − Escalate the aggressiveness of the method
 − Fencing is critical for correctness of NN metadata
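A minimal sketch of how that escalation is typically expressed: fencing methods are listed in order, the built-in sshfence first and a site-specific shell script (for example one that talks to a filer or PDU) as the fallback. The script path and key file are illustrative placeholders.

    import org.apache.hadoop.conf.Configuration;

    public class FencingConfigSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Newline-separated list, tried in order during a failover event.
        conf.set("dfs.ha.fencing.methods",
            "sshfence\n" +                                // SSH in and kill the NN process
            "shell(/usr/local/bin/fence-old-active.sh)"); // escalate: site-specific script
        conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
        System.out.println(conf.get("dfs.ha.fencing.methods"));
      }
    }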
21. Ops/Admin: Automatic Failover
• Deploy ZK as usual (3 or 5 nodes) or reuse an existing ZK
 − ZK daemons have light resource requirements
 − OK to collocate 1 on each NN; many collocate the 3rd on the YARN RM
 − Advisable to configure ZK daemons with dedicated disks for isolation
 − Fine to use the same ZK quorum as for HBase, etc.
• Fencing methods still required
 − The ZKFC that wins the election is responsible for performing fencing
 − Fencing script(s) must be configured and work from the NNs
• Admin commands which manually initiate failovers still work
 − But rather than coordinating the failover themselves, they use the ZKFCs
22. Ops/Admin: Monitoring
• New NN metrics
 − Size of pending DN message queues
 − Seconds since the standby NN last read from the shared edit log
 − DN block report lag
 − All measurements of standby NN lag – monitor/alert on all of these
• Monitor the shared storage solution
 − Volumes fill up, disks go bad, etc.
 − Should configure a paranoid edit log retention policy (default is 2)
• Canary-based monitoring of HDFS a good idea
 − Pinging both NNs not sufficient
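A minimal sketch of such a canary, assuming the HA client configuration is on the classpath: it goes through the logical URI (so it exercises client failover rather than a single NN) and performs a real read so the DataNode path is covered too. The URI and canary file path are illustrative placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCanary {
      public static void main(String[] args) {
        boolean ok = false;
        long start = System.currentTimeMillis();
        try {
          FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), new Configuration());
          try (FSDataInputStream in = fs.open(new Path("/monitoring/canary.txt"))) {
            in.read();  // a real read, not just an RPC ping
          }
          ok = true;
        } catch (Exception e) {
          System.err.println("canary failed: " + e);
        }
        System.out.println("canary ok=" + ok
            + " latencyMs=" + (System.currentTimeMillis() - start));
        System.exit(ok ? 0 : 1);
      }
    }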
23. Ops/Admin: Hardware
• Active/Standby NNs should be on separate racks
• Shared storage system should be on a separate rack
• Active/Standby NNs should have close to the same hardware
 − Same amount of RAM – need to store the same things
 − Same # of processors – need to serve the same number of clients
• All the same recommendations still apply for the NN
 − ECC memory, 48GB
 − Several separate disks for NN metadata directories
 − Redundant disks for OS drives, probably RAID 5 or mirroring
 − Redundant power
24. Future Work
• Other options to share NN metadata
 − Journal daemons with the list of active JDs stored in ZK (HDFS-3092)
 − Journal daemons with quorum writes (HDFS-3077)
• More advanced client failover/load shedding
 − Serve stale reads from the standby NN
 − Speculative RPC
 − Non-RPC clients (IP failover, DNS failover, proxy, etc.)
 − Less client-side configuration (ZK, custom DNS records, HDFS-3043)
• Even Higher HA
 − Multiple standby NNs
25. Q&A
• HA design: HDFS-1623
 − First released in Hadoop 2.0.0-alpha
• Auto failover design: HDFS-3042 / HDFS-2185
 − First released in Hadoop 2.0.1-alpha
• Community effort