2. CAP conjecture [reminder]
• Can only have two of:
– Consistency
– Availability
– Partition-tolerance
• Examples
– Databases, 2PC, centralized algo (C & A)
– Distributed databases, majority protocols (C & P)
– DNS, Bayou (A & P)
3. CAP theorem
• Formalization by Gilbert & Lynch
• What does impossible mean?
– There exists an execution which violates one of CAP
– Not possible to guarantee that an algorithm has all three at all times
• Shard data with different CAP tradeoffs
• Detect partitions and weaken consistency
4. Partition-tolerance & availability
• What is partition-tolerance?
– Consistency and Availability are provided by algo
– Partitions are external events (scheduler/oracle)
• Partition-tolerance is really a failure model
• Partition-tolerance is equivalent to tolerating message omissions
• In the CAP theorem
– Proof rests on partitions that never heal
– Datacenters can guarantee recovery of partitions!
• Can guarantee that conflict resolution eventually happens
5. How do we ensure consistency
• Main technique to be consistent
– Quorum principle
– Example: Majority quorums
• Always write to and read from a majority of nodes
• At least one node knows most recent value
[Figure: 9 nodes, majority(9) = 5; WRITE(v) and READ v each contact a majority]
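The majority-quorum idea above can be sketched in a few lines. This is a minimal, single-threaded illustration, not a full protocol: the `Replica` class, the integer `version` counter, and the `read`/`write` helpers are hypothetical names introduced here, and failures and concurrent writers are not modeled.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    version: int = 0       # logical timestamp of the stored value (assumption)
    value: object = None


def majority(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def write(replicas, value):
    # Contact any majority, learn the highest version it has seen,
    # then install the new value with a strictly larger version.
    quorum = random.sample(replicas, majority(len(replicas)))
    new_version = max(r.version for r in quorum) + 1
    for r in quorum:
        r.version, r.value = new_version, value


def read(replicas):
    # Any two majorities overlap in at least one node, so the
    # highest-versioned reply carries the most recent write.
    quorum = random.sample(replicas, majority(len(replicas)))
    return max(quorum, key=lambda r: r.version).value


replicas = [Replica() for _ in range(9)]
write(replicas, "v")
print(majority(9), read(replicas))
```

With 9 replicas every operation touches 5 of them, so a read always meets at least one replica updated by the latest write.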
6. Quorum Principle
• Majority Quorum
– Pro: tolerates up to ⌈N/2⌉ − 1 crashes
– Con: have to read/write ⌊N/2⌋ + 1 values
• Read/write quorums (Dynamo, ZooKeeper, Chain Repl)
– Read R nodes, write W nodes, s.t. R + W > N (W > N/2)
– Pro: adjust performance of reads/writes
– Con: availability can suffer
• Maekawa quorum
– Arrange nodes in an M×M grid
– Write to row+col, read cols (always overlap)
– Pro: only need to read/write O(√N) nodes
– Con: tolerates at most O(√N) crashes (reconfiguration)
[Figure: 3×3 grid of nodes P1 to P9]
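The two overlap claims on this slide can be checked by brute force for small systems. The sketch below is illustrative only; the function names and the (row, col) node addressing are assumptions made here, not part of any of the systems named above.

```python
from itertools import combinations


def rw_quorums_always_intersect(n, r, w):
    """Brute force: does every R-subset of n nodes meet every W-subset?"""
    nodes = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(nodes, r)
               for wq in combinations(nodes, w))


def grid_write_quorum(m, row, col):
    """One full row plus one full column of an m x m grid (2m - 1 nodes)."""
    return {(row, c) for c in range(m)} | {(r, col) for r in range(m)}


def grid_read_quorum(m, col):
    """One full column of an m x m grid (m nodes)."""
    return {(r, col) for r in range(m)}


print(rw_quorums_always_intersect(5, 2, 4))   # R + W > N
print(rw_quorums_always_intersect(5, 2, 3))   # R + W = N
print(len(grid_write_quorum(4, 0, 0)))        # quorum size for N = 16
```

A write quorum's row crosses every column, which is why column reads always see the latest write while touching only M of the M×M nodes.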
7. Probabilistic Quorums
• Quorum size α√N (α > 1) intersects with probability at least 1 − exp(−α²)
– Example: N=16 nodes, quorum size 7, intersects 95%, tolerates 9 failures
– Maekawa: N=16 nodes, quorum size 7, intersects 100%, tolerates 4 failures
– Pro: Small quorums, high fault-tolerance
– Con: Could fail to intersect, N usually large
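The 95% figure in the example follows from the bound with α = 7/√16 = 1.75. The sketch below evaluates that bound and, for comparison, the exact intersection probability under the assumption that both quorums are drawn independently and uniformly at random from all subsets of the given size. The function names are introduced here for illustration.

```python
import math


def intersection_lower_bound(n, q):
    """Bound 1 - exp(-alpha^2) for quorum size q = alpha * sqrt(n)."""
    alpha = q / math.sqrt(n)
    return 1 - math.exp(-alpha ** 2)


def exact_intersection_probability(n, q):
    """Exact value when two q-subsets of n nodes are chosen uniformly
    and independently (an assumption about the selection strategy)."""
    return 1 - math.comb(n - q, q) / math.comb(n, q)


print(round(intersection_lower_bound(16, 7), 3))
print(round(exact_intersection_probability(16, 7), 4))
```

Under the uniform-selection assumption the exact probability is higher than the bound, so the 95% should be read as a guarantee rather than the typical case.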
8. Quorums and CAP
• With quorums we can get
– C & P: partition can make quorum unavailable
– C & A: no-partition ensures availability and atomicity
• Decision faced when failing to get a quorum [Brewer'11]
– Sacrifice availability by waiting for merger
– Sacrifice atomicity by ignoring the quorum
• Can we get CAP for weaker consistency?
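The decision described on this slide can be made concrete as a policy switch in a quorum read. This is a hedged sketch: `quorum_read`, the `on_partition` parameter, and the `QuorumUnavailable` exception are hypothetical names, and replies are modeled simply as (version, value) pairs from the replicas that answered.

```python
class QuorumUnavailable(Exception):
    """Raised when too few replicas answered to form a majority."""


def quorum_read(replies, n, on_partition="wait"):
    """replies: list of (version, value) pairs from reachable replicas.
    n: total number of replicas in the system."""
    if not replies:
        raise QuorumUnavailable("no replica reachable")
    newest = max(replies, key=lambda reply: reply[0])[1]
    if len(replies) >= n // 2 + 1:
        return newest                     # quorum reached: safe to answer
    if on_partition == "wait":
        # Sacrifice availability: refuse until the partition heals.
        raise QuorumUnavailable(f"only {len(replies)} of {n} replicas reachable")
    # Sacrifice atomicity: answer from the minority, possibly stale.
    return newest


print(quorum_read([(1, "a"), (2, "b"), (2, "b")], n=5))
print(quorum_read([(1, "a")], n=5, on_partition="ignore"))
```

The second call returns "a" even though a newer value may exist on the unreachable side, which is exactly the atomicity being given up.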
9. What does atomicity really mean?
[Figure: timelines for P1, P2, P3 with two reads (R) and writes W(5), W(6); each operation spans from invocation to response]
• Linearization Points
– Read ops appear as if they happened instantaneously at all nodes at some time between invocation and response
– Write ops appear as if they happened instantaneously at all nodes at some time between invocation and response
10. Definition of Atomicity
• Linearization Points
– Read ops appear as if they happened instantaneously at all nodes at some time between invocation and response
– Write ops appear as if they happened instantaneously at all nodes at some time between invocation and response
[Figure: timelines for P1, P2, P3 with operations R:6, R:5, W(5), W(6); execution labeled "atomic"]
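For small histories, the linearization-point definition can be tested by brute force: try every total order of the operations, keep those that respect real time, and see whether any of them is a legal register history. The sketch below assumes a single register; the invocation and response times and the assignment of operations to processes are invented for illustration, since the figure's exact timings are not given.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Op:
    proc: str
    kind: str      # "W" or "R"
    value: int
    inv: float     # invocation time (hypothetical)
    resp: float    # response time (hypothetical)


def is_register_legal(order, initial=None):
    """Every read must return the most recently written value."""
    current = initial
    for op in order:
        if op.kind == "W":
            current = op.value
        elif op.value != current:
            return False
    return True


def respects_real_time(order):
    """If a responded before b was invoked, a must come before b."""
    pos = {op: i for i, op in enumerate(order)}
    return all(pos[a] < pos[b]
               for a in order for b in order if a.resp < b.inv)


def is_atomic(history):
    return any(respects_real_time(p) and is_register_legal(p)
               for p in permutations(history))


writes = [Op("P3", "W", 5, 0, 2), Op("P3", "W", 6, 3, 6)]
# R:5 overlaps W(6); R:6 starts after everything else has finished.
atomic_history = writes + [Op("P2", "R", 5, 4, 5), Op("P1", "R", 6, 7, 8)]
# Same writes, but R:6 completes before R:5 is even invoked.
swapped_history = writes + [Op("P2", "R", 6, 4, 5), Op("P1", "R", 5, 7, 8)]

print(is_atomic(atomic_history), is_atomic(swapped_history))
```

The search is factorial in the number of operations, so this is only a teaching aid for histories of a handful of operations.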
12. Atomicity too strong?
[Figure: timelines for P1, P2, P3 with operations R:5, R:6, W(5), W(6); execution labeled "not atomic"]
• Linearization points too strong?
– Why not just have R:5 appear atomically right after W(5)?
– Lamport: "If P2's operator phones P1 and tells her 'I just read 6'" (P1's later read of 5 would then contradict what she was told)
13. Atomicity too strong?
[Figure: timelines for P1, P2, P3 with operations R:5, R:6, W(5), W(6); execution labeled "not atomic" but "sequentially consistent"]
• Sequential consistency
– Weaker than atomicity
– Sequential consistency removes this "real-time" requirement
– Any global ordering is OK as long as it respects each process's local ordering
– Does Gilbert's proof fall apart for sequential consistency?
• Causal memory
– Weaker than sequential
– No need to have a global view, each process has a different view
– Local: reads/writes return immediately to the caller
– CAP theorem does not apply to causal memory
[Figure: two timelines, P1 and P2, with operations W(0) R:1 and W(1) R:0; execution labeled "causally consistent"]
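Sequential consistency drops the real-time constraint and keeps only each process's program order, so a checker just searches for some interleaving of the per-process sequences that is a legal register history. The sketch below assumes a single shared register and assumes the process assignments P1: R:5, P2: R:6, P3: W(5) W(6) for the first figure and P1: W(0) R:1, P2: W(1) R:0 for the second; the function name is introduced here.

```python
def is_sequentially_consistent(programs, initial=None):
    """programs: dict mapping process name to its list of
    ("W" | "R", value) operations in program order."""
    procs = list(programs)

    def search(positions, current):
        if all(positions[p] == len(programs[p]) for p in procs):
            return True                       # every operation was placed
        for p in procs:
            i = positions[p]
            if i == len(programs[p]):
                continue
            kind, value = programs[p][i]
            if kind == "R" and value != current:
                continue                      # this read cannot go next
            advanced = dict(positions)
            advanced[p] = i + 1
            if search(advanced, value if kind == "W" else current):
                return True
        return False

    return search({p: 0 for p in procs}, initial)


reads_and_writes = {"P1": [("R", 5)], "P2": [("R", 6)],
                    "P3": [("W", 5), ("W", 6)]}
crossed_reads = {"P1": [("W", 0), ("R", 1)],
                 "P2": [("W", 1), ("R", 0)]}

print(is_sequentially_consistent(reads_and_writes))
print(is_sequentially_consistent(crossed_reads))
```

The first history admits the order W(5), R:5, W(6), R:6. The second admits no single global order under these assumptions, which is why it can only be accepted by a model where each process has its own view, such as causal memory.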
14. Going really weak
• Eventual consistency
– When the network is not partitioned, all nodes eventually have the same value
– I.e. don't be "consistent" at all times, but only after partitions heal!
• Based on a powerful technique: gossiping
– Periodically exchange "logs" with one random node
– Exchange must be constant-sized packets
– Set reconciliation, Merkle trees, etc.
– Use (clock, node_id) to break ties of events in log
• Properties of gossiping
– All nodes will have the same value in O(log N) time
– No positive-feedback cycles that congest the network
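A toy simulation shows how random pairwise exchange plus a (clock, node_id) tie-break makes all nodes converge. This is a simplification under stated assumptions: each node holds a single (clock, node_id, value) entry rather than a log, exchanges are push-pull and never lost, and the function names are hypothetical.

```python
import random


def merge(a, b):
    # Tuples compare element by element: the higher clock wins,
    # and node_id breaks ties between equal clocks.
    return max(a, b)


def gossip_round(states, rng):
    """Each node exchanges state with one random other node (needs n >= 2)."""
    n = len(states)
    for i in range(n):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1                     # skip self
        states[i] = states[j] = merge(states[i], states[j])


def rounds_until_converged(states, rng, max_rounds=1000):
    rounds = 0
    while len(set(states)) > 1:
        if rounds == max_rounds:
            return None
        gossip_round(states, rng)
        rounds += 1
    return rounds


rng = random.Random(42)
# 64 nodes wrote concurrently with the same clock; node_id decides.
states = [(1, node_id, f"v{node_id}") for node_id in range(64)]
print(rounds_until_converged(states, rng), states[0])
```

Each round sends one fixed-size entry per node, so traffic per node stays constant while the winning entry spreads to everyone.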
15. BASE
• Catch-all for any consistency model C' that enables C'-A-P
– Eventual consistency
– PRAM consistency
– Causal consistency
• Main ingredients
– Stale data
– Soft state (regenerable state)
– Approximate answers
16. Summary
• No need to ensure CAP at all times
– Switch between algorithms or satisfy subset at different times
• Weaken consistency model
– Choose weaker consistency:
• Causal memory (relatively strong) works around CAP
– Only be consistent when network isn’t partitioned:
• Eventual consistency (very weak) works around CAP
• Weaken partition-tolerance
– Some environments never partition, e.g. datacenters
– Tolerate unavailability in small quorums
– Some environments have recovery guarantees (partitions heal within X hours); perform conflict resolution
17. Related Work (ignored in talk)
• PRAM consistency (Pipelined RAM)
– Weaker than causal and non-blocking
• Eventual Linearizability (PODC’10)
– Becomes atomic after quiescent periods
• Gossiping & set reconciliation
– Lots of related work
Editor's notes
Failed ops appear as completed at every node, XOR never occurred at any node