Resilience at Exascale


Marc Snir
Director, Mathematics and Computer Science Division
Argonne National Laboratory

Professor, Dept. of Computer Science, UIUC
The Problem
•  “Resilience is the black swan of the exascale program”
•  DOE & DoD commissioned several reports
   –  Inter-Agency Workshop on HPC Resilience at Extreme Scale
      http://institute.lanl.gov/resilience/docs/Inter-AgencyResilienceReport.pdf (Feb 2012)
   –  U.S. Department of Energy Fault Management Workshop
      http://shadow.dyndns.info/publications/geist12department.pdf (June 2012)
   –  ICIS workshop on “addressing failures in exascale computing” (report forthcoming)
•  This talk:
   –  Discusses the little we understand
   –  Discusses how we could learn more
   –  Seeks your feedback on the report’s content

2
WHERE WE ARE TODAY


3
Failures Today
•  Latest large systematic study (Schroeder & Gibson), published in 2005 – quite obsolete
•  Failures per node per year: 2-20
•  Root cause of failures
   –  Hardware (30%-60%), SW (5%-25%), Unknown (20%-30%)
•  Mean Time to Repair: 3-24 hours

•  Current anecdotal numbers:
   –  Application crash: > 1 day
   –  Global system crash: > 1 week
   –  Application checkpoint/restart: 15-20 minutes (checkpoint often “free”)
   –  System restart: > 1 hour
   –  Soft HW errors suspected

4
Recommendation

•  Study current failure types and rates in a systematic manner
   –  Using error logs at major DOE labs
   –  Pushing for a common ontology
•  Spend time hunting for soft HW errors
•  Study causes of errors (?)
   –  Most HW faults (>99%) have no effect (processors are hugely inefficient?)
   –  The common effect of soft HW bit flips is SW failure
   –  HW faults can (often?) be due to design bugs
   –  Vendors are cagey
•  Time for a consortium effort?

5
Current Error-Handling

•  Application: global checkpoint & global restart
•  System:
   –  Repair persistent state (file systems, databases)
   –  Clean-slate restart for everything else
•  Quick analysis (a numerical sketch follows this slide):
   –  Assume failures have a Poisson distribution
   –  Checkpoint time C = 1; recover+restart time = R; MTBF = M
   –  The optimal checkpoint interval τ satisfies e^(−τ/M) = (M − τ + 1)/M
   –  System utilization U is U = (M − R − τ + 1)/M

6
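A minimal sketch (my own code, not from the slides) of the checkpoint/restart model above: it solves the slide's optimality condition numerically and reproduces the utilization figures quoted on the next two slides. The bisection solver and variable names are mine.

```python
# Checkpoint/restart utilization model from the slide above. Time is measured
# in units of the checkpoint time C, failures are Poisson with MTBF M, and R
# is the recover+restart time. Sketch only; assumes the slide's equations.
import math

def optimal_interval(M, iters=100):
    """Solve e^(-tau/M) = (M - tau + 1)/M for tau by bisection on (0, M)."""
    f = lambda tau: math.exp(-tau / M) - (M - tau + 1) / M
    lo, hi = 1e-9, float(M)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def utilization(M, R):
    """U = (M - R - tau + 1)/M at the optimal checkpoint interval tau."""
    return (M - R - optimal_interval(M) + 1) / M

# The examples on slides 7 and 8 (C = 15 min, MTBF = 25 h, so M = 100):
print(utilization(M=100, R=1))   # ~0.85  ("U ≈ 85%" with R = C = 15 min)
print(utilization(M=100, R=4))   # ~0.82  ("U ≈ 80%" with R = 1 hour)
```

The computed values match the slides to within rounding.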
  
Utilization, assuming R=1


•  E.g.
   –  C = R = 15 mins (= 1)
   –  MTBF = 25 hours (= 100)
   –  U ≈ 85%

7
Utilization as Function of MTBF and R

•  E.g.
   –  C = 15 mins (= 1)
   –  R = 1 hour (= 4)
   –  MTBF = 25 hours (= 100)
•  U ≈ 80%

8
Projecting Ahead

•  “Comfort zone”:
   –  Checkpoint time < 1% of MTBF
   –  Repair time < 5% of MTBF
•  Assume MTBF = 1 hour (a numerical check follows this slide)
   –  Checkpoint time ≈ 0.5 minute
   –  Repair time ≈ 3 minutes
•  Is this doable?
   –  Yes, if done in memory
   –  E.g., using “RAID” techniques

9
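A quick check of these comfort-zone numbers against the model of slide 6 (my arithmetic; the slides do not carry out this calculation). With the 0.5-minute checkpoint as the time unit, MTBF = 1 hour gives M = 120 and a 3-minute repair gives R = 6.

```python
# Comfort-zone numbers plugged into the slide-6 model (assumes the same model).
import math
M, R = 120, 6   # MTBF = 1 h and R = 3 min, in units of the 0.5-min checkpoint
tau = min(range(1, M), key=lambda t: abs(math.exp(-t / M) - (M - t + 1) / M))
print(tau, (M - R - tau + 1) / M)   # tau ≈ 16, U ≈ 0.82 even inside the comfort zone
```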
  
Exascale Design Point

Systems                      2012 (BG/Q)               2020-2024 Computer       Difference (Today & 2019)

System peak                  20 Pflop/s                1 Eflop/s                O(100)
Power                        8.6 MW                    ~20 MW
System memory                1.6 PB (16*96*1024)       32 - 64 PB               O(10)
Node performance             205 GF/s (16*1.6GHz*8)    1.2 or 15 TF/s           O(10) - O(100)
Node memory BW               42.6 GB/s                 2 - 4 TB/s               O(100)
Node concurrency             64 threads                O(1k) or 10k             O(100) - O(1000)
Total node interconnect BW   20 GB/s                   200 - 400 GB/s           O(10)
System size (nodes)          98,304 (96*1024)          O(100,000) or O(1M)      O(100) - O(1000)
Total concurrency            5.97 M                    O(billion)               O(1,000)
MTTI                         4 days                    O(<1 day)                O(10) decrease

Both price and power envelopes may be too aggressive!

10
Time to checkpoint
•  Assume “RAID 5” across memories, checkpoint size ~50% of memory size
   –  Checkpoint time ≈ time to transfer 50% of memory to another node: a few seconds! (arithmetic sketch below)
   –  Memory overhead ~50%
   –  Energy overhead small with NVRAM
   –  (But need many write cycles)
   –  We are in the comfort zone (re checkpoint)
•  How about recovery?
•  Time to restart the application is the same as the time to checkpoint
•  Problem: time to recover if the system failed

11
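A back-of-the-envelope check of the “few seconds” claim, using the exascale design-point numbers from slide 10 (my arithmetic; parity computation and network contention are ignored).

```python
# In-memory "RAID 5"-style checkpoint: each node streams its share of the
# checkpoint (~50% of memory) over its interconnect links.
system_memory = 64e15        # 64 PB (upper end of the design point)
nodes = 1e5                  # O(100,000) nodes
link_bw = 200e9              # 200 GB/s per node (lower end)

per_node_bytes = 0.5 * system_memory / nodes      # 320 GB per node
print(per_node_bytes / link_bw)                   # 1.6 s: "a few seconds" holds
# With 400 GB/s links or O(1M) thinner nodes, the time drops well under a second.
```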
  
Key Assumptions:

	
  
1.  Errors are detected (before the erroneous checkpoint is committed)

2.  System failures are rare (much less than one per day) or recovery from them is very fast (minutes)

12
Hardware Will Fail More Frequently

•  More, smaller transistors
•  Near-threshold logic

☛ More frequent bit upsets due to radiation
☛ More frequent multiple-bit upsets due to radiation
☛ Larger manufacturing variation
☛ Faster aging

13
Hardware Error Detection: Assumptions

Summary of assumptions on the components of a 45nm node and estimates of scaling to 11nm:

                                                 45nm                   11nm
Cores                                            8                      128
Scattered latches per core                       200,000                200,000
Scattered latches in uncore, relative to cores   1.25/√ncores = 0.44    1.25/√ncores = 0.11
FIT per latch                                    10^-1                  10^-1
Arrays per core (MB)                             1                      1
FIT per SRAM cell                                10^-4                  10^-4
Logic FIT / latch FIT                            0.1 - 0.5              0.1 - 0.5
DRAM FIT (per node)                              50                     50

14
Hardware Error Detection: Analysis
(DCE = detected and corrected errors, DUE = detected but uncorrectable errors, UE = undetected errors; rates in FIT per node, 1 FIT = 1 failure per 10^9 hours)

Array interleaving and SECDED (baseline)
                       DCE [FIT]                 DUE [FIT]                   UE [FIT]
                       45nm         11nm         45nm       11nm             45nm      11nm
Arrays                 5000         100000       50         20000            1         1000
Scattered latches      200          4000         N/A        N/A              20        400
Combinational logic    20           400          N/A        N/A              0         4
DRAM                   50           50           0.5        0.5              0.005     0.005
Total                  1000 - 5000  100000       10 - 100   5000 - 20000     10 - 50   500 - 5000

Array interleaving and >SECDED (stronger codes; 11nm overhead: ~1% area and <5% power)
                       DCE [FIT]                 DUE [FIT]                   UE [FIT]
                       45nm         11nm         45nm       11nm             45nm      11nm
Arrays                 5000         100000       50         1000             1         5
Scattered latches      200          4000         N/A        N/A              20        400
Combinational logic    20           400          N/A        N/A              0.2       5
DRAM                   50           50           0.5        0.5              0.005     0.005
Total                  1500 - 6500  100000       10 - 50    500 - 5000       10 - 50   100 - 500

Array interleaving and >SECDED + latch parity (45nm overhead ~10%; 11nm overhead: ~20% area and ~25% power)
                       DCE [FIT]                 DUE [FIT]                   UE [FIT]
                       45nm         11nm         45nm       11nm             45nm      11nm
Arrays                 5000         100000       50         1000             1         5
Scattered latches      200          4000         20         400              0.01      0.5
Combinational logic    20           400          N/A        N/A              0.2       5
DRAM                   0            0            0.1        0.0              0.100     0.001
Total                  1500 - 6500  100000       25 - 100   2000 - 10000     1         5 - 20

15
Summary of (Rough) Analysis

•  If no new technology is deployed, we can have up to one undetected error per hour (rough FIT arithmetic below)
•  With additional circuitry we could get down to one undetected error per 100-1,000 hours (weeks to months)
   –  The cost could be as high as 20% additional circuits and 25% additional power
   –  The main problems are in combinational logic and latches
   –  The price could be significantly higher (small market for high-availability servers)

•  Need software error detection as an option!

16
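A rough conversion (my arithmetic, not on the slides) from the per-node UE totals of slide 15 to system-level rates; 1 FIT is one failure per 10^9 device-hours, and the node count is taken from the design point on slide 10.

```python
# Undetected-error (UE) rate for the whole system, from per-node FIT totals.
def ue_per_hour(fit_per_node, nodes):
    return fit_per_node * nodes / 1e9          # 1 FIT = 1 failure / 10^9 hours

nodes = 1e5                                     # O(100,000)-node system
print(ue_per_hour(5000, nodes))   # 0.5/hour: baseline SECDED, upper 11nm total
print(ue_per_hour(20, nodes))     # 0.002/hour: +latch parity, i.e. one per ~500 h
# With O(1M) nodes the baseline reaches several undetected errors per hour,
# consistent with "up to one undetected error per hour"; the hardened design
# lands in the "one per 100-1,000 hours" range quoted above.
```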
  
Application-Level Data Error Detection
•  Check the checkpoint for correctness before it is committed (illustrative sketch below)
   –  Look for outliers (assume smooth fields) – handles high-order bit errors
   –  Use robust solvers – handles low-order bit errors
•  May not work for discrete problems, particle simulations, discontinuous phenomena, etc.
•  May not work if the outlier is rapidly smoothed (error propagation)
   –  Check for global invariants (e.g., energy preservation)
•  Ignore the error
•  Duplicate the computation of critical variables

•  Can we reduce checkpoint size (or avoid checkpoints altogether)?
   –  Tradeoff: how much is saved vs. how often data is checked (big/small transactions)

17
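An illustrative sketch of the “look for outliers” idea above (my code; the slides do not prescribe an algorithm): flag checkpoint values that are inconsistent with their neighbors under a smooth-field assumption. The grid, field, and threshold are made up for the example.

```python
# A flipped high-order bit makes a grid value wildly inconsistent with its
# neighbors; a simple local-consistency test can flag it before the checkpoint
# is committed. Sketch only.
import numpy as np

def flag_outliers(field, threshold=10.0):
    """Flag points that differ from the median of their 4 neighbors by much
    more than the field's robust spread (median absolute deviation)."""
    neighbors = np.stack([np.roll(field, 1, 0), np.roll(field, -1, 0),
                          np.roll(field, 1, 1), np.roll(field, -1, 1)])
    deviation = np.abs(field - np.median(neighbors, axis=0))
    scale = np.median(np.abs(field - np.median(field))) + 1e-30
    return deviation > threshold * scale

# A smooth field with one value corrupted in a high-order bit:
x, y = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
field = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)
field[20, 30] *= 2.0 ** 40                     # simulated high-order bit flip
print(np.argwhere(flag_outliers(field)))       # -> [[20 30]]
# Low-order bit flips stay below the threshold; per the slide, those are
# absorbed by robust solvers rather than caught by outlier checks.
```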
  
How About Control Errors?

•  Code is corrupted, jump to a wrong address, etc.
   –  Rarer (more data state than control state)
   –  More likely to cause a fatal exception
   –  Easier to protect against

18
How About System Failures (Due to Hardware)?

•  Kernel data corrupted
   –  Page tables, routing tables, file system metadata, etc.
•  Hard to understand and diagnose
   –  Breaks abstraction layers
•  Makes sense to avoid such failures
   –  Duplicate system computations that affect kernel state (the price can be tolerated for HPC; pattern sketched below)
   –  Use transactional methods
   –  Harden kernel data
      •  Highly reliable DRAM
      •  Remotely accessible NVRAM
   –  Predict and avoid failures

19
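A minimal sketch of the “duplicate system computations” plus “transactional methods” idea (entirely my own illustration; the helper names are hypothetical): run the computation twice and commit to kernel state only on agreement, so a transient fault in one run cannot silently corrupt the kernel.

```python
# Sketch only: redundant execution guarding a kernel-critical update.
def duplicated_update(compute, commit, *args):
    first = compute(*args)
    second = compute(*args)       # independent, redundant execution
    if first != second:
        # Disagreement suggests a transient fault: do not commit; let the
        # caller retry, so kernel state is never touched by a suspect result.
        raise RuntimeError("redundant executions diverged")
    commit(first)                 # transaction-style commit of the agreed value

# Hypothetical use: duplicated_update(compute_route, routing_table.install, dest)
```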
  
Failure Prediction from Event Logs
Use a combination of signal analysis (to identify outliers) and data mining (to find correlations).

[Figure: methodology overview of the hybrid approach – Gainaru, Cappello, Snir, Kramer (SC12)]

20
Can Hardware Failure Be Predicted?

Prediction method   Precision   Recall   Seq used       Pred failures
ELSA hybrid         91.2%       45.8%    62 (96.8%)     603
ELSA signal         88.1%       40.5%    117 (92.8%)    534
Data mining         91.9%       15.7%    39 (95.1%)     207

The metrics used for evaluating prediction performance are precision and recall (worked example below):
•  Precision is the fraction of failure predictions that turn out to be correct.
•  Recall is the fraction of failures that are predicted.

Percentage waste improvement in checkpointing strategies:
C      Precision   Recall   MTTF for the whole system   Waste gain
1min   92          20       one day                     9.13%
1min   92          36       one day                     17.33%
10s    92          36       one day                     12.09%
10s    92          45       one day                     15.63%
1min   92          50       5h                          21.74%
10s    92          65       5h                          24.78%

Migrating processes when node failure is predicted can significantly improve utilization.

21
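A small worked example (my own derivation from the ELSA hybrid row; the slides only define the metrics) showing how precision and recall connect the prediction counts to the number of actual failures.

```python
# ELSA hybrid row: 603 failure predictions, 91.2% precision, 45.8% recall.
predicted, precision, recall = 603, 0.912, 0.458
true_positives = precision * predicted       # ~550 failures correctly predicted
actual_failures = true_positives / recall    # ~1200 failures in the logged period
print(round(true_positives), round(actual_failures))
# Predicted failures can be handled cheaply (e.g., by migrating processes off
# the node), which is where the "waste gain" in the second table comes from.
```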
  
How About Software Bugs?
•  Parallel code (transformational, mostly deterministic) vs. concurrent code (event-driven, very nondeterministic)
   –  Concurrent code is hard to understand (we cannot comprehend more than ~10 concurrent actors)
   –  Hard to avoid performance bugs (e.g., overcommitted resources causing time-outs)
   –  Hard to test for performance bugs
•  Concurrency performance bugs (e.g., in parallel file systems) are a major source of failures on current supercomputers
   –  The problem will worsen as we scale up – performance bugs become more frequent
•  We need to become better at avoiding performance bugs (learn from control theory and real-time systems)
   –  Make system code more deterministic

22
System Recovery

•  Local failures (e.g., a node kernel crashed) are not a major problem -- we can replace the node
•  Global failures (e.g., the global file system crashed) are the hardest problem -- we need to avoid them

•  Issue: fault containment
   –  A local hardware failure corrupts global state
   –  A localized hardware error causes global performance bugs

23
Quis Custodiet Ipsos Custodes?

•  Who watches the watchmen?

•  Need a robust (scalable, fault-tolerant) infrastructure for error reporting and recovery orchestration
   –  The current approach of out-of-band monitoring and control is too restricted

24
Summary

•  Resilience is a major problem in the exascale era
•  Major to-dos:
   –  Need to understand much better the failures in current systems and future trends
   –  Need to develop a good application error detection methodology; error correction is desirable, but less critical
   –  Need to significantly reduce system errors and reduce system recovery time
   –  Need to develop a robust infrastructure for error handling

25
The End



26
