Resilience at exascale
1. Resilience at Exascale
Marc Snir
Director, Mathematics and Computer Science Division, Argonne National Laboratory
Professor, Dept. of Computer Science, UIUC
2. The Problem
• “Resilience is the black swan of the exascale program”
• DOE & DoD commissioned several reports:
  – Inter-Agency Workshop on HPC Resilience at Extreme Scale, http://institute.lanl.gov/resilience/docs/Inter-AgencyResilienceReport.pdf (Feb 2012)
  – U.S. Department of Energy Fault Management Workshop, http://shadow.dyndns.info/publications/geist12department.pdf (June 2012)
  – ICIS workshop on “addressing failures in exascale computing” (report forthcoming)
• This talk:
  – Discuss the little we understand
  – Discuss how we could learn more
  – Seek your feedback on the report’s content
4. Failures Today
• Latest large systematic study (Schroeder & Gibson) published in 2005 – quite obsolete
• Failures per node per year: 2-20
• Root cause of failures:
  – Hardware (30%-60%), SW (5%-25%), Unknown (20%-30%)
• Mean Time to Repair: 3-24 hours
• Current anecdotal numbers:
  – Application crash: > 1 day
  – Global system crash: > 1 week
  – Application checkpoint/restart: 15-20 minutes (checkpoint often “free”)
  – System restart: > 1 hour
  – Soft HW errors suspected
5. Recommendation
• Study current failure types and rates in a systematic manner:
  – Using error logs at major DOE labs
  – Pushing for a common ontology
• Spend time hunting for soft HW errors
• Study causes of errors (?)
  – Most HW faults (>99%) have no effect (processors are hugely inefficient?)
  – The common effect of soft HW bit flips is SW failure
  – HW faults can (often?) be due to design bugs
  – Vendors are cagey
• Time for a consortium effort?
6. Current Error-Handling
• Application: global checkpoint & global restart
• System:
  – Repair persistent state (file system, databases)
  – Clean-slate restart for everything else
• Quick analysis:
  – Assume failures have a Poisson distribution; normalize the checkpoint time to C = 1; recover+restart time = R; MTBF = M
  – The optimal checkpoint interval τ satisfies e^(−τ/M) = (M − τ + 1)/M
  – System utilization is U = (M − R − τ + 1)/M
8. Utilization as Function of MTBF and R
• E.g.:
  – C = 15 mins (= 1)
  – R = 1 hour (= 4)
  – MTBF = 25 hours (= 100)
• U ≈ 80%
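The equation for τ on slide 6 has no closed form, but its left side minus its right side is increasing in τ, so the root is unique and bisection finds it. Below is a minimal numerical sketch (mine, not from the talk) that reproduces the utilization figure above; time is normalized so that C = 1.

```python
import math

def optimal_tau(M):
    """Solve e^(-tau/M) = (M - tau + 1)/M for tau by bisection.

    f(tau) = e^(-tau/M) - (M - tau + 1)/M is increasing
    (f' = (1 - e^(-tau/M))/M >= 0), so the root is unique.
    """
    f = lambda tau: math.exp(-tau / M) - (M - tau + 1) / M
    lo, hi = 1e-6, float(M)          # f(lo) < 0 and f(M) > 0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def utilization(M, R):
    """U = (M - R - tau + 1)/M at the optimal checkpoint interval."""
    tau = optimal_tau(M)
    return (M - R - tau + 1) / M

# Slide 8's example: C = 15 min (= 1), R = 1 hour (= 4), MTBF = 25 hours (= 100)
print(utilization(M=100, R=4))   # ~0.82, i.e. the U ≈ 80% quoted above
```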
9. Projecting Ahead
• “Comfort zone”:
  – Checkpoint time < 1% of MTBF
  – Repair time < 5% of MTBF
• Assume MTBF = 1 hour:
  – Checkpoint time ≈ 0.5 minute
  – Repair time ≈ 3 minutes
• Is this doable?
  – Yes, if done in memory
  – E.g., using “RAID” techniques
10. Exascale Design Point
Systems                    | 2012 (BG/Q Computer)    | 2020-2024           | Difference (Today & 2019)
System peak                | 20 Pflop/s              | 1 Eflop/s           | O(100)
Power                      | 8.6 MW                  | ~20 MW              |
System memory              | 1.6 PB (16*96*1024)     | 32-64 PB            | O(10)
Node performance           | 205 GF/s (16*1.6GHz*8)  | 1.2 or 15 TF/s      | O(10) – O(100)
Node memory BW             | 42.6 GB/s               | 2-4 TB/s            | O(1000)
Node concurrency           | 64 threads              | O(1k) or 10k        | O(100) – O(1000)
Total node interconnect BW | 20 GB/s                 | 200-400 GB/s        | O(10)
System size (nodes)        | 98,304 (96*1024)        | O(100,000) or O(1M) | O(100) – O(1000)
Total concurrency          | 5.97 M                  | O(billion)          | O(1,000)
MTTI                       | 4 days                  | O(<1 day)           | −O(10)

Both price and power envelopes may be too aggressive!
11. Time to checkpoint
• Assume “RAID 5” across memories; checkpoint size ~50% of memory size
  – Checkpoint time ≈ time to transfer 50% of memory to another node: a few seconds!
  – Memory overhead ~50%
  – Energy overhead small with NVRAM
  – (But need many write cycles)
  – We are in the comfort zone (re checkpoint)
• How about recovery?
  – Time to restart the application: same as time to checkpoint
  – Problem: time to recover if the system failed
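A back-of-envelope check (my arithmetic, not the talk's; the per-node memory figure is an assumption derived from the slide-10 table) that this scheme lands inside slide 9's comfort zone:

```python
GB = 10**9

node_memory     = 512 * GB           # assumed: 32-64 PB over O(100,000) nodes
interconnect_bw = 200 * GB           # low end of the 200-400 GB/s table entry
checkpoint_size = 0.5 * node_memory  # checkpoint ~50% of memory, as above

checkpoint_time = checkpoint_size / interconnect_bw  # transfer to another node
mtbf = 3600.0                                        # slide 9: MTBF = 1 hour

print(f"checkpoint time ~ {checkpoint_time:.1f} s")  # ~1.3 s: “a few seconds”
print(f"comfort zone bound: {0.01 * mtbf:.0f} s")    # 1% of MTBF = 36 s: easily met
```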
12. Key Assumptions:
1. Errors are detected (before the erroneous checkpoint is committed)
2. System failures are rare (<< 1 per day), or recovery from them is very fast (minutes)
13. Hardware Will Fail More Frequently
• More, smaller transistors
• Near-threshold logic
☛ More frequent bit upsets due to radiation
☛ More frequent multiple-bit upsets due to radiation
☛ Larger manufacturing variation
☛ Faster aging
14. Hardware Error Detection: Assumptions
Summary of assumptions on the components of a 45nm node, and estimates of their scaling to 11nm:

                                                | 45nm                | 11nm
Cores                                           | 8                   | 128
Scattered latches per core                      | 200,000             | 200,000
Scattered latches in uncore (relative to cores) | 1.25/√ncores = 0.44 | 1.25/√ncores = 0.11
FIT per latch                                   | 10^-1               | 10^-1
Arrays per core (MB)                            | 1                   | 1
FIT per SRAM cell                               | 10^-4               | 10^-4
Logic FIT / latch FIT                           | 0.1-0.5             | 0.1-0.5
DRAM FIT (per node)                             | 50                  | 50
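To see what these assumptions add up to, here is a back-of-envelope tally (mine, using the table's values as extracted; it ignores all masking and detection, so it yields a raw fault rate, not a visible error rate):

```python
# 1 FIT = 1 failure per 1e9 hours of operation.
cores            = 128                 # 11nm column of the table
latches_per_core = 200_000
fit_per_latch    = 1e-1
uncore_factor    = 1.11                # uncore adds 0.11x the core latches
sram_cells       = cores * 8 * 2**20   # 1 MB of arrays per core, in bits
fit_per_sram     = 1e-4
logic_vs_latch   = 0.5                 # upper end of the 0.1-0.5 ratio
dram_fit         = 50

latch_fit = cores * latches_per_core * fit_per_latch * uncore_factor
node_fit  = latch_fit * (1 + logic_vs_latch) + sram_cells * fit_per_sram + dram_fit

system_rate = node_fit / 1e9 * 100_000   # O(100,000) nodes (slide 10)
print(f"~{system_rate:.0f} raw faults/hour system-wide")   # ~440
```

The two-to-three orders of magnitude between this raw rate and slide 16's “up to one undetected error per hour” are consistent with slide 5's note that most HW faults (>99%) have no visible effect.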
16. Summary of (Rough) Analysis
• If no new technology is deployed, we can have up to one undetected error per hour
• With additional circuitry, we could get down to one undetected error per 100-1,000 hours (a week to months)
  – The cost could be as high as 20% additional circuits and 25% additional power
  – The main problems are in combinatorial logic and latches
  – The price could be significantly higher (small market for high-availability servers)
• Need software error detection as an option!
17. Application-Level Data Error Detection
• Check the checkpoint for correctness before it is committed:
  – Look for outliers (assumes smooth fields) – handles high-order bit errors; see the sketch below
  – Use robust solvers – handles low-order bit errors
    • May not work for discrete problems, particle simulations, discontinuous phenomena, etc.
    • May not work if the outlier is rapidly smoothed (error propagation)
  – Check for global invariants (e.g., energy preservation)
  – Ignore the error
  – Duplicate the computation of critical variables
• Can we reduce checkpoint size (or avoid checkpoints altogether)?
  – Tradeoff: how much is saved vs. how often the data is checked (big/small transactions)
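A minimal sketch of the outlier check (an illustration of mine, not the talk's method; the 1D stencil and the rejection threshold are assumptions):

```python
import numpy as np

def checkpoint_looks_sane(field: np.ndarray, threshold: float = 1e3) -> bool:
    """Reject a checkpoint if any interior point deviates grossly from its
    neighbors; on a smooth field, a high-order bit flip shows up as a spike."""
    neighbor_avg = 0.5 * (field[:-2] + field[2:])   # 1D three-point stencil
    scale = np.abs(neighbor_avg) + 1e-30            # avoid division by zero
    deviation = np.abs(field[1:-1] - neighbor_avg) / scale
    return bool(np.all(deviation < threshold))

u = np.sin(np.linspace(0.0, 1.0, 1000))   # a smooth field
assert checkpoint_looks_sane(u)
u[500] *= 2.0**40                         # simulate a flipped exponent bit
assert not checkpoint_looks_sane(u)       # checkpoint rejected, not committed
```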
18. How About Control Errors?
• Code is corrupted, jumps to a wrong address, etc.
  – Rarer (there is more data state than control state)
  – More likely to cause a fatal exception
  – Easier to protect against
19. How About System Failures (Due to Hardware)?
• Kernel data corrupted:
  – Page tables, routing tables, file system metadata, etc.
• Hard to understand and diagnose
  – Breaks abstraction layers
• Makes sense to avoid such failures:
  – Duplicate system computations that affect kernel state (the price can be tolerated for HPC); see the sketch below
  – Use transactional methods
  – Harden kernel data:
    • Highly reliable DRAM
    • Remotely accessible NVRAM
  – Predict and avoid failures
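An illustrative sketch of duplicate-and-compare combined with a transactional commit (the names and the retry policy are my assumptions, not an actual kernel API; the update must be deterministic for the comparison to make sense):

```python
import copy

class MismatchError(RuntimeError):
    """Raised when replicated computations of kernel state disagree."""

def committed_update(state: dict, update, retries: int = 3) -> dict:
    """Run `update` twice on copies of `state`; commit only on agreement,
    so a transient fault in either run cannot corrupt the committed state."""
    for _ in range(retries):
        a = update(copy.deepcopy(state))   # replica 1
        b = update(copy.deepcopy(state))   # replica 2
        if a == b:
            return a                       # transactional commit
    raise MismatchError("replicas disagree; kernel state left untouched")

page_table = {0x1000: "frame 7"}
page_table = committed_update(page_table, lambda s: {**s, 0x2000: "frame 9"})
```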
20. Failure Prediction from Event Logs
Use a combination of signal analysis (to identify outliers) and data mining (to find correlations).
[Figure: methodology overview of the hybrid approach – Gainaru, Cappello, Snir, Kramer (SC12)]
21. Can Hardware Failure Be Predicted?
Prediction method | Precision | Recall | Seq used    | Predicted failures
ELSA hybrid       | 91.2%     | 45.8%  | 62 (96.8%)  | 603
ELSA signal       | 88.1%     | 40.5%  | 117 (92.8%) | 534
Data mining       | 91.9%     | 15.7%  | 39 (95.1%)  | 207

• Precision is the fraction of failure predictions that turn out to be correct.
• Recall is the fraction of failures that are predicted.

Waste improvement in checkpointing strategies:

C    | Precision | Recall | MTTF (whole system) | Waste gain
1min | 92        | 20     | one day             | 9.13%
1min | 92        | 36     | one day             | 17.33%
10s  | 92        | 36     | one day             | 12.09%
10s  | 92        | 45     | one day             | 15.63%
1min | 92        | 50     | 5h                  | 21.74%
10s  | 92        | 65     | 5h                  | 24.78%

Migrating processes when node failure is predicted can significantly improve utilization.
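How precision and recall turn into a waste gain can be seen with a first-order model (a Young/Daly-style simplification of mine, not the paper's analysis): failures the predictor catches are removed from the rollback budget, shrinking the effective failure rate to (1−r)/M, while every prediction, true or false, costs one process migration.

```python
import math

def waste(C, M, r=0.0, p=1.0, migrate=60.0):
    """Fraction of machine time lost; C, M, migrate in seconds."""
    rollback  = math.sqrt(2 * C * (1 - r) / M)  # Young/Daly checkpoint overhead
    proactive = (r / p) * migrate / M           # predictions fire r/(p*M) per second
    return rollback + proactive

C, M = 60.0, 24 * 3600.0                        # C = 1 min, system MTTF = one day
w0 = waste(C, M)                                # no prediction
w1 = waste(C, M, r=0.36, p=0.92)                # precision 92%, recall 36%
print(f"waste gain ~ {100 * (w0 - w1) / w0:.0f}%")   # ~19%
```

With C = 1 min, MTTF = one day, precision 92% and recall 36%, this crude model gives a gain of roughly 19%, in the same ballpark as the 17.33% row of the table above.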
22. How About Software Bugs?
• Parallel code (transformational, mostly deterministic) vs. concurrent code (event-driven, very nondeterministic):
  – Hard to understand concurrent code (one cannot comprehend more than ~10 concurrent actors)
  – Hard to avoid performance bugs (e.g., overcommitted resources causing time-outs)
  – Hard to test for performance bugs
• Concurrency performance bugs (e.g., in parallel file systems) are a major source of failures on current supercomputers
  – The problem will worsen as we scale up – performance bugs become more frequent
• Need to become better at avoiding performance bugs (learn from control theory and real-time systems)
  – Make system code more deterministic
23. System Recovery
• Local failures (e.g., a node kernel crashed) are not a major problem – the node can be replaced
• Global failures (e.g., the global file system crashed) are the hardest problem – they need to be avoided
• Issue: fault containment
  – A local hardware failure corrupts global state
  – A localized hardware error causes global performance bugs
24. Quis Custodiet Ipsos Custodes?
• Who watches the watchmen?
• Need a robust (scalable, fault-tolerant) infrastructure for error reporting and recovery orchestration
  – The current approach of out-of-band monitoring and control is too restricted
25. Summary
• Resilience is a major problem in the exascale era
• Major to-dos:
  – Need to understand much better the failures in current systems and the future trends
  – Need to develop a good application-level error detection methodology; error correction is desirable, but less critical
  – Need to significantly reduce system errors and system recovery time
  – Need to develop a robust infrastructure for error handling