A new generic and rigorous approach to the tolerance of data corruptions. Presentation of the paper "Practical Hardening of Crash-Tolerant Systems" published at USENIX ATC 2012. See video at http://bit.ly/LNc5mc
3. Crashes
o Assumptions
o A server (process) suddenly stops
o Until then, only correct steps
[Figure: timeline of correct steps ending in a crash]
4. Data corruptions
o What if there are data corruptions?
o The state of a process may be corrupted
o The process may make incorrect steps before stopping
[Figure: timeline with data corruptions occurring before the process stops]
5. Data corruptions
o Corrupted state and incorrect steps before stopping are NOT COVERED by the crash model!
6. Sources of data corruptions
o Commodity disks are known to be unreliable
o Faulty firmware, bad sectors etc.
o RAM: ECC errors are frequent
o Production machines only see detected errors; coverage not known
o Interconnects and CPUs also fail
o Faulty drivers or bit flips
7. A horror story
An 8-hour system-wide outage due to a single hardware fault
8. What happened?
o Quoted from the Amazon service health dashboard
o “A handful of messages had a single bit corrupted”
o “The message was still intelligible, but the system state information was incorrect”
o “We used MD5 checksums throughout the system (but not) for this particular internal state information”
o “(The corruption) spread throughout the system causing the symptoms described above”
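The quoted incident hinges on one message field travelling without a checksum. As a minimal sketch of the kind of check that was missing, the snippet below attaches an MD5 digest to a message and verifies it on receipt. The message contents and the helper names (`checksum`, `flip_bit`, `is_intact`) are hypothetical and are not taken from the Amazon report.

```python
import hashlib


def checksum(payload: bytes) -> str:
    """Return the MD5 digest of a message payload as a hex string."""
    return hashlib.md5(payload).hexdigest()


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Return a copy of the payload with exactly one bit inverted."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def is_intact(payload: bytes, expected_digest: str) -> bool:
    """Check a received payload against the digest computed by the sender."""
    return checksum(payload) == expected_digest


# Hypothetical state message; the real message format is not known.
message = b"server-42 state=AVAILABLE"
sent_digest = checksum(message)
received = flip_bit(message, 3)  # a single corrupted bit in transit
```

A receiver that calls `is_intact(received, sent_digest)` before acting on the message can drop the corrupted copy instead of spreading it.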
10. Common practice
o Manual placement of ad-hoc error detection checks
o Application knowledge
o Time consuming
o Hard to structure without fault model
o No error isolation guarantee
11. Research: Byzantine faults
o Byzantine model
o Faulty nodes controlled by an adversary
o Worst-case model
[Figure: timeline with a Byzantine fault]
12. Byzantine fault model
o Black-box model of faulty processes: adversarial
o Hardening for error isolation [Nysiad NSDI 2008]
o Based on state machine replication
o Replication and performance costs
[Figure: a client and replicated servers running agreement on requests]
13. Byzantine faults
o Byzantine hardening covers attacks and bugs…
o … assuming, e.g., design diversity of replicas
o Impractical in most systems: no real adoption
Fault type        Countermeasure
Attacks           Security
Bugs              V&V
Data corruptions  ASC hardening
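14. A new approach to error isolation
[Figure: Process i receives min, runs event handling over state variables u and v, and sends mout; Process j receives that message as min and runs event handling over state variables x and y]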
1. General model of process behavior
2. Arbitrary State Corruption (ASC) fault model
3. Guarantee error isolation through hardening
15. A new approach to error isolation
[Figure: Process i and Process j exchanging messages, each running event handling over its own state]
1. General model of process behavior
2. Arbitrary State Corruption (ASC) fault model
3. Guarantee error isolation through hardening
with M. Correia, D. Ferro and F. Junqueira
2012 USENIX Annual Technical Conference
17. Process model
1) Event dispatching: an input message min arrives
2) Event handling: the handler reads and updates the state
    Upon receive message <REQ, r> do
        if v > 5 then
            u = r + v + 5;
        else
            u = r + v;
        v = u;
        send <WRITE, v> to process p
3) Message sending: output messages mout are sent
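The three steps of the process model can be sketched in runnable form. This is only an illustration of the slide's pseudocode: the `State` class, the tuple encoding of messages, and the `dispatch` function are assumptions made for the example and are not part of the paper or the PASC library.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Process state; v is the single state variable used on the slide."""
    v: int = 0


def handle_req(state: State, r: int):
    """Event handler for <REQ, r>: update the state, return output messages."""
    if state.v > 5:
        u = r + state.v + 5
    else:
        u = r + state.v
    state.v = u
    # Output message <WRITE, v> addressed to process p.
    return [("WRITE", state.v, "p")]


def dispatch(state: State, message):
    """Event dispatching: route an input message to its handler."""
    kind, payload = message
    if kind == "REQ":
        return handle_req(state, payload)
    return []  # unknown messages produce no output in this sketch


state = State(v=5)
first = dispatch(state, ("REQ", 2))   # v is 5, so the else branch runs
second = dispatch(state, ("REQ", 2))  # v is now 7, so the then branch runs
```

Keeping the handler a function of the state and the input message only is what later allows a runtime to wrap it without changing it.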
18. ASC fault model
o An Arbitrary State Corruption can make a process
o Crash
o Assign an arbitrary value to any variable
o Start the execution from an arbitrary instruction
Variable  Before fault  After fault
v         5             12
z         10            7
PC        20            320
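To make the fault model concrete, the following sketch injects one arbitrary-value fault into a dictionary of variables. It is a simplification under stated assumptions: the program counter is treated as just another entry named `PC`, only a single variable is corrupted per call, crashes are not modeled, and `inject_asc_fault` is a hypothetical helper, not a PASC API.

```python
import random


def inject_asc_fault(variables: dict, rng: random.Random):
    """Return a corrupted copy of the variables and the name of the victim.

    One variable (possibly the program counter) receives an arbitrary
    32-bit value. The arbitrary value may happen to equal the old one.
    """
    corrupted = dict(variables)
    target = rng.choice(sorted(corrupted))
    corrupted[target] = rng.randrange(2 ** 32)
    return corrupted, target


# Values before the fault, as shown on the slide.
before = {"v": 5, "z": 10, "PC": 20}
after, target = inject_asc_fault(before, random.Random(0))
```

Corrupting the `PC` entry stands in for "start the execution from an arbitrary instruction"; a real injector would have to redirect control flow.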
19. Fault frequency
o One fault for every processed input message
[Figure: the <REQ, r> event handler from the process model slide, with dispatching of min, event handling over the state, and sending of mout]
20. Fault diversity
o A corrupted variable is different from its replica
          Before fault         After fault
Variable  Original  Replica    Original  Replica
v         5         5          12        5
z         10        10         7         41
PC        20        -          320       -
o Only holds immediately after the fault
o Can be invalidated if instructions modify the variable
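Fault diversity is what makes a comparison between original and replica useful. The sketch below uses the values from the slide and a hypothetical helper, `divergent_variables`, to list the variables whose two copies disagree.

```python
def divergent_variables(original: dict, replica: dict) -> list:
    """Return the names of variables whose original and replica differ."""
    return sorted(name for name in replica if original[name] != replica[name])


# Before the fault both copies agree.
original = {"v": 5, "z": 10}
replica = {"v": 5, "z": 10}
clean = divergent_variables(original, replica)

# After the fault (values from the slide): v is corrupted in the original,
# z is corrupted in both copies, but to different values.
original.update(v=12, z=7)
replica.update(z=41)
faulty = divergent_variables(original, replica)
```

Here `clean` is empty and `faulty` names both `v` and `z`, so a check performed immediately after the fault would catch it.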
21. Error propagation
o Fault diversity does not hold
o Hardening preserves diversity
[Figure: original and replica copies of variables u and v, illustrating fault diversity and how it can be lost]
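A small example shows how ordinary instructions can destroy diversity and how a replicated update can keep it. Both functions are illustrative assumptions: `naive_update` stands for unhardened code that derives both copies from one value, and `replicated_update` is one simple way to preserve diversity, not necessarily the mechanism used in the paper.

```python
def naive_update(original: dict, replica: dict, r: int) -> None:
    """Compute the new value once and store it in both copies."""
    u = r + original["v"]  # a corrupted original["v"] flows into u
    original["v"] = u
    replica["v"] = u       # the replica now holds the same wrong value


def replicated_update(original: dict, replica: dict, r: int) -> None:
    """Compute the new value separately from each copy's own inputs."""
    original["v"] = r + original["v"]
    replica["v"] = r + replica["v"]


# v was corrupted from 5 to 12 in the original only.
orig_a, repl_a = {"v": 12}, {"v": 5}
naive_update(orig_a, repl_a, 1)

orig_b, repl_b = {"v": 12}, {"v": 5}
replicated_update(orig_b, repl_b, 1)
```

After `naive_update` the two copies agree on a wrong value, so a later comparison finds nothing. After `replicated_update` they still differ and the error remains detectable.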
23. From ASC to crashes
o Transparent: to the hardened process
o Local: no process replication on multiple machines
o Untrusted: can have faults while executing hardening
[Figure: event handling over state variables u and v, receiving min and sending mout, wrapped by the hardening runtime]
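The idea of turning corruptions into crashes can be sketched as a wrapper that runs an unmodified event handler on the state and on a local replica, then stops the process when the two disagree. This is a deliberately simplified illustration and not the PASC algorithm: it assumes the wrapper itself executes correctly, which the "untrusted" property above explicitly does not assume, and all names (`CorruptionDetected`, `hardened_step`) are hypothetical.

```python
import copy


class CorruptionDetected(Exception):
    """Raised to turn a detected corruption into a crash."""


def handle_req(state: dict, r: int):
    """Unmodified event handler from the process model slide."""
    u = r + state["v"] + 5 if state["v"] > 5 else r + state["v"]
    state["v"] = u
    return [("WRITE", u, "p")]


def hardened_step(state: dict, replica: dict, handler, payload):
    """Run the handler on both copies and compare states and outputs."""
    if state != replica:
        raise CorruptionDetected("state and replica differ before handling")
    out_original = handler(state, payload)
    out_replica = handler(replica, payload)
    if state != replica or out_original != out_replica:
        raise CorruptionDetected("state or output diverged during handling")
    return out_original


state = {"v": 5}
replica = copy.deepcopy(state)  # local replica, same machine
outputs = hardened_step(state, replica, handle_req, 2)

state["v"] = 12  # simulate a corruption of the original state
try:
    hardened_step(state, replica, handle_req, 2)
    crashed = False
except CorruptionDetected:
    crashed = True
```

The handler is passed in unchanged, which mirrors the "transparent" property, and the replica lives in the same process, which mirrors "local".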
24. PASC library
[Figure: user-defined event handlers EH1, EH2 and EH3 run on top of the transparent PASC runtime, which keeps the process state and the replica state and performs the PASC checks]
github.com/yahoo/pasc
30. Scalability
[Figure: maximum throughput (kops/sec, 0 to 100) versus number of servers (1, 3, 5, 7) for PASC sKV and unprotected sKV]
o SimpleKV: eventually consistent store, no replication
o Scales similarly with hardening
o No server “wasted” for replication
31. PASC fault coverage
o Injected random bit flips in Paxos
o Code corruptions: bytecode and binary code
o State corruptions: pointers and primitive values
             Code corruptions     State corruptions
             Unprot.   PASC       Unprot.   PASC
Undetected   3         0          93        0
Detected     -         1          -         330
Crash        1640      1663       2301      2066
Not manif.   1213      1193       2843      2841
Total        2856      2856       5237      5237
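The state-corruption columns can be turned into rates with a few lines of arithmetic. The counts below are copied from the table; treating the "-" entry for the unprotected system as zero detected errors is an assumption of this sketch, and the dictionary keys are made up for the example.

```python
# State-corruption injections, counts copied from the table.
state_runs = {
    "Unprot.": {"undetected": 93, "detected": 0,
                "crash": 2301, "not_manifested": 2843},
    "PASC": {"undetected": 0, "detected": 330,
             "crash": 2066, "not_manifested": 2841},
}


def undetected_rate(counts: dict) -> float:
    """Fraction of injected faults that led to an undetected corruption."""
    return counts["undetected"] / sum(counts.values())


rates = {name: undetected_rate(counts) for name, counts in state_runs.items()}
```

With these numbers, 93 of 5237 injections (about 1.8%) go undetected without protection, and none do with PASC.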
32. Wrap up
o Hardware data corruptions are a real danger
o Proposed new systematic approach
o BFT not realistic
o Ad-hoc approaches are not systematic
o Hardening algorithm for error isolation
o Local: does not require replication
o Efficient: PASC-Paxos has up to 70% more throughput than PBFT
o High fault coverage
33. Directions
o Systematic protection of Yahoo! infrastructure against data corruptions
o ASC just scratched the surface – some todos
o Reduce memory footprint
o Support for external memory (disks/SSDs)
o Hardening of legacy code
o Theoretical foundations
User Impact: >10 Million users unable to use a given service. Revenue Impact: >$100K. Brand Impact: Outage requires press release. Top Tier Revenue Property Impact (see list below)
search.yahoo.com sponsored text ads are not displaying in the North placement. Sponsored Ads are instead being moved to the east placement. There was a limit for the number of different data dictionary match types that QP can handle (720 types). The DD built and pushed the night of included an additional 400 types, slowly incrementing over the course of the months, and finally exceeding the l