SlideShare a Scribd company logo
1 of 54
Download to read offline
Self-healing data
An introduction to consistency models, Quorums and CRDTs

Uwe Friedrichsen, codecentric AG, 2014-2015
@ufried
Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
Why NoSQL?



•  Scalability
•  Easier schema evolution
•  Availability on unreliable OTS hardware
•  It‘s more fun ...
Why NoSQL?



•  Scalability
•  Easier schema evolution
•  Availability on unreliable OTS hardware
•  It‘s more fun ...
Challenges



•  Giving up ACID transactions
•  (Temporal) anomalies and inconsistencies

short-term due to replication or long-term due to partitioning


It might happen. Thus, it will happen!
C
Consistency
A
Availability
P
Partition Tolerance
Strict Consistency

ACID / 2PC
Strong Consistency

Quorum R&W / Paxos
Eventual Consistency

CRDT / Gossip / Hinted Handoff
Strict Consistency (CA)

•  Great programming model

no anomalies or inconsistencies need to be considered
•  Does not scale well

best for single node databases

„We know ACID – It works!“
„We know 2PC – It sucks!“

Use for moderate data amounts
And what if I need more data?






•  Distributed datastore
•  Partition tolerance is a must
•  Need to give up strict consistency (CP or AP)
Strong Consistency (CP)

•  Majority based consistency model

can tolerate up to N nodes failing out of 2N+1 nodes
•  Good programming model

Single-copy consistency
•  Trades consistency for availability

in case of partitioning

Paxos (for sequential consistency)
Quorum-based reads & writes
Quorum-based reads & writes


•  N – number of nodes (replicas)
•  R – number of reads delivering the same result required
•  W – number of confirmed writes required

W > N / 2
R + W > N
Example 1

•  3 replica nodes

This is: N = 3
•  W > N / 2 à W > 3 / 2 à W > 1,5

Let‘s pick: W = 2 (can tolerate failure of 1 node)
•  R + W > N à R + 2 > 3 à R > 1

Let‘s pick: R = 2 (can tolerate failure of 1 node)
Client
Node 1
 Node 3
Node 2
W
 W
W
Client
Node 1
 Node 3
Node 2
Client
Node 1
 Node 3
Node 2
✔
 ✖
✔
W = 2
 ✔
Client
Node 1
 Node 3
Node 2
R
 R
R
Client
Node 1
 Node 3
Node 2
✔
 ✔
 ✔
Client
Node 1
 Node 3
Node 2
2x / 1x
 ✔
R = 2
 ✔
Client
Node 1
 Node 3
Node 2
R
 R
R
Client
Node 1
 Node 3
Node 2
✔
 ✔
✖
Client
Node 1
 Node 3
Node 2
1x / 1x
 ✖
R = 1
 ✖
Example 2

•  3 replica nodes

This is: N = 3
•  W > N / 2 à W > 3 / 2 à W > 1,5

Let‘s pick: W = 3 (strict consistency – does not tolerate any failure)
•  R + W > N à R + 3 > 3 à R > 0

Let‘s pick: R = 1 (single read from any node sufficient)
Example 3

•  5 replica nodes

This is: N = 5
•  W > N / 2 à W > 5 / 2 à W > 2,5

Let‘s pick: W = 3 (can tolerate failure of 2 nodes)
•  R + W > N à R + 3 > 5 à R > 2

Let‘s pick: R = 3 (can tolerate failure of 2 nodes)
Limitations of QB R&W

•  Requires fixed set of replica nodes

Dynamic adding and removal of nodes not possible
•  Client-centric consistency

Consistency only perceived on client, not on nodes
•  Requires additional measures on nodes

Nodes need to implement at least eventual consistency

Use if client-perceived strong consistency is

sufficient and simplicity trumps formal precision
And what if I need more availability?






•  Need to give up strong consistency (CP)
•  Relax required consistency properties even more
•  Leads to eventual consistency (AP)
Eventual Consistency (AP)

•  Gives up some consistency guarantees

no sequential consistency, anomalies become visible
•  Maximum availability possible

can tolerate up to N-1 nodes failing out of N nodes
•  Challenging programming model

anomalies usually need to be resolved explicitly

Gossip / Hinted Handoffs
CRDT
Conflict-free Replicated Data Types


•  Eventually consistent, self-stabilizing data structures
•  Designed for maximum availability
•  Tolerates up to N-1 out of N nodes failing

State-based CRDT: Convergent Replicated Data Type (CvRDT)
Operation-based CRDT: Commutative Replicated Data Type (CmRDT)
A bit of theory first ...
Convergent Replicated Data Type

State-based CRDT – CvRDT

•  All replicas (usually) connected
•  Exchange state between replicas, calculate new state on target replica
•  State transfer at least once over eventually-reliable channels
•  Set of possible states form a Semilattice
•  Partially ordered set of elements where all subsets have a Least Upper Bound (LUB)
•  All state changes advance upwards with respect to the partial order
Commutative Replicated Data Type

Operation-based CRDT - CmRDT

•  All replicas (usually) connected
•  Exchange update operations between replicas, apply on target replica
•  Reliable broadcast with ordering guarantee for non-concurrent
updates
•  Concurrent updates must be commutative
That‘s been enough theory ...
Counter
Op-based Counter

Data
Integer i
Init
i ≔ 0
Query
return i
Operations
increment(): i ≔ i + 1
decrement(): i ≔ i - 1
State-based G-Counter (grow only)
(Naïve approach)

Data
Integer i
Init
i ≔ 0
Query
return i
Update
increment(): i ≔ i + 1
Merge( j)
i ≔ max(i, j)
State-based G-Counter (grow only)
(Naïve approach)
R1
R3
R2
i = 1
U
i = 1
U
i = 0
i = 0
i = 0
I
I
I
M
i = 1
 i = 1
M
State-based G-Counter (grow only)
(Vector-based approach)

Data
Integer V[] / one element per replica set
Init
V ≔ [0, 0, ... , 0]
Query
return ∑i V[i]
Update
increment(): V[i] ≔ V[i] + 1 / i is replica set number
Merge(V‘)
∀i ∈ [0, n-1] : V[i] ≔ max(V[i], V‘[i])
State-based G-Counter (grow only)
(Vector-based approach)
R1
R3
R2
V = [1, 0, 0]
U
V = [0, 0, 0]
I
I
I
V = [0, 0, 0]
V = [0, 0, 0]
U
V = [0, 0, 1]
M
V = [1, 0, 0]
M
V = [1, 0, 1]
State-based PN-Counter (pos./neg.)


•  Simple vector approach as with G-Counter does not work
•  Violates monotonicity requirement of semilattice
•  Need to use two vectors
•  Vector P to track incements
•  Vector N to track decrements
•  Query result is ∑i P[i] – N[i]
State-based PN-Counter (pos./neg.)

Data
Integer P[], N[] / one element per replica set
Init
P ≔ [0, 0, ... , 0], N ≔ [0, 0, ... , 0]
Query
Return ∑i P[i] – N[i]
Update
increment(): P[i] ≔ P[i] + 1 / i is replica set number
decrement(): N[i] ≔ N[i] + 1 / i is replica set number
Merge(P‘, N‘)
∀i ∈ [0, n-1] : P[i] ≔ max(P[i], P‘[i])
∀i ∈ [0, n-1] : N[i] ≔ max(N[i], N‘[i])
Non-negative Counter

Problem: How to check a global invariant with local information only?

•  Approach 1: Only dec when local state is > 0
•  Concurrent decs could still lead to negative value

•  Approach 2: Externalize negative values as 0
•  inc(negative value) == noop(), violates counter semantics

•  Approach 3: Local invariant – only allow dec if P[i] - N[i] > 0
•  Works, but may be too strong limitation

•  Approach 4: Synchronize
•  Works, but violates assumptions and prerequisites of CRDTs
Sets
Op-based Set
(Naïve approach)

Data
Set S
Init
S ≔ {}
Query(e)
return e ∈ S
Operations
add(e) : S ≔ S ∪ {e}
remove(e): S ≔ S  {e}
Op-based Set
(Naïve approach)
R1
R3
R2
S = {e}
add(e)
S = {}
S = {}
S = {}
I
I
I
S = {e}
S = {}
rmv(e)
add(e)
add(e)
 add(e)
 rmv(e)
add(e)
S = {e}
S = {e}
 S = {e}
 S = {}
State-based G-Set (grow only)

Data
Set S
Init
S ≔ {}
Query(e)
return e ∈ S
Update
add(e): S ≔ S ∪ {e}
Merge(S‘)
S = S ∪ S‘
State-based 2P-Set (two-phase)

Data
Set A, R / A: added, R: removed
Init
A ≔ {}, R ≔ {}
Query(e)
return e ∈ A ∧ e ∉ R
Update
add(e): A ≔ A ∪ {e}
remove(e): (pre query(e)) R ≔ R ∪ {e}
Merge(A‘, R‘)
A ≔ A ∪ A‘, R ≔ R ∪ R‘
Op-based OR-Set (observed-remove)

Data
Set S / Set of pairs { (element e, unique tag u), ... }
Init
S ≔ {}
Query(e)
return ∃u : (e, u) ∈ S
Operations
add(e): S ≔ S ∪ { (e, u) } / u is generated unique tag
remove(e):
pre query(e)
R ≔ { (e, u) | ∃u : (e, u) ∈ S } /at source („prepare“)
S ≔ S  R /downstream („execute“)
Op-based OR-Set (observed-remove)
R1
R3
R2
S = {ea}
add(e)
S = {}
S = {}
S = {}
I
I
I
S = {eb}
S = {}
rmv(e)
add(e)
add(eb)
 add(ea)
 rmv(ea)
add(eb)
S = {eb}
S = {eb}
 S = {ea, eb}
 S = {eb}
More datatypes

•  Register
•  Dictionary (Map)
•  Tree
•  Graph
•  Array
•  List

plus more representations for each datatype
Garbage collection

•  Sets could grow infinitely in worst case
•  Garbage collection possible, but a bit tricky
•  Only remove updates that can be deleted safely
and were received by all replicas
•  Usually implemented using vector clocks
•  Can tolerate up to n-1 crashes
•  Live only while no replicas are crashed
•  Can induce surprising behavior sometimes
•  Sometimes stronger consensus is needed (Paxos, ...)
Limitations of CRDTs

•  Very weak consistency guarantees

Strives for „quiescent consistency“
•  Eventually consistent

Not suitable for high-volume ID generator or alike
•  Not easy to understand and model
•  Not all data structures representable

Use if availability is extremely important
Further reading

1.  Shapiro et al., Conflict-free Replicated
Data Types, Inria Research report, 2011
2.  Shapiro et al., A comprehensive study
of Convergent and Commutative
Replicated Data Types,

Inria Research report, 2011
3.  Basho Tag Archives: CRDT,

https://basho.com/tag/crdt/
4.  Leslie Lamport,

Time, clocks, and the ordering of
events in a distributed system,

Communications of the ACM (1978)
Wrap-up

•  CAP requires rethinking consistency
•  Strict Consistency

ACID / 2PC
•  Strong Consistency

Quorum-based R&W, Paxos
•  Eventual Consistency

CRDT, Gossip, Hinted Handoffs

Pick your consistency model based on your

consistency and availability requirements
The real world is not ACID
Thus, it is perfectly fine to go for a relaxed consistency model
@ufried
Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
Self healing data

More Related Content

What's hot

What's hot (20)

Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Production
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystem
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
Introduction to Docker - 2017
Introduction to Docker - 2017Introduction to Docker - 2017
Introduction to Docker - 2017
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWave
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Devops Strategy Roadmap Lifecycle Ppt Powerpoint Presentation Slides Complete...
Devops Strategy Roadmap Lifecycle Ppt Powerpoint Presentation Slides Complete...Devops Strategy Roadmap Lifecycle Ppt Powerpoint Presentation Slides Complete...
Devops Strategy Roadmap Lifecycle Ppt Powerpoint Presentation Slides Complete...
 
Logging using ELK Stack for Microservices
Logging using ELK Stack for MicroservicesLogging using ELK Stack for Microservices
Logging using ELK Stack for Microservices
 
Mastering DevOps With Oracle
Mastering DevOps With OracleMastering DevOps With Oracle
Mastering DevOps With Oracle
 
Resilience reloaded - more resilience patterns
Resilience reloaded - more resilience patternsResilience reloaded - more resilience patterns
Resilience reloaded - more resilience patterns
 
Kafka High Availability in multi data center setup with floating Observers wi...
Kafka High Availability in multi data center setup with floating Observers wi...Kafka High Availability in multi data center setup with floating Observers wi...
Kafka High Availability in multi data center setup with floating Observers wi...
 
App Dynamics
App DynamicsApp Dynamics
App Dynamics
 
Introduction to GCP BigQuery and DataPrep
Introduction to GCP BigQuery and DataPrepIntroduction to GCP BigQuery and DataPrep
Introduction to GCP BigQuery and DataPrep
 
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at UberDisaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
 
Monitoring Flink with Prometheus
Monitoring Flink with PrometheusMonitoring Flink with Prometheus
Monitoring Flink with Prometheus
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)
 
Arquitetura reativa, a solução para os microserviços?
Arquitetura reativa,  a solução para os microserviços?Arquitetura reativa,  a solução para os microserviços?
Arquitetura reativa, a solução para os microserviços?
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at Alibaba
 
Migrating Java JBoss EAP Applications to Kubernetes With S2I
Migrating Java JBoss EAP Applications to Kubernetes With S2IMigrating Java JBoss EAP Applications to Kubernetes With S2I
Migrating Java JBoss EAP Applications to Kubernetes With S2I
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 

Viewers also liked

Fault tolerance made easy
Fault tolerance made easyFault tolerance made easy
Fault tolerance made easy
Uwe Friedrichsen
 
The promises and perils of microservices
The promises and perils of microservicesThe promises and perils of microservices
The promises and perils of microservices
Uwe Friedrichsen
 
Resilient Functional Service Design
Resilient Functional Service DesignResilient Functional Service Design
Resilient Functional Service Design
Uwe Friedrichsen
 
DevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextDevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader context
Uwe Friedrichsen
 
Modern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITModern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of IT
Uwe Friedrichsen
 

Viewers also liked (20)

Resilience with Hystrix
Resilience with HystrixResilience with Hystrix
Resilience with Hystrix
 
Devops for Developers
Devops for DevelopersDevops for Developers
Devops for Developers
 
How to survive in a BASE world
How to survive in a BASE worldHow to survive in a BASE world
How to survive in a BASE world
 
Dr. Hectic and Mr. Hype - surviving the economic darwinism
Dr. Hectic and Mr. Hype - surviving the economic darwinismDr. Hectic and Mr. Hype - surviving the economic darwinism
Dr. Hectic and Mr. Hype - surviving the economic darwinism
 
Fault tolerance made easy
Fault tolerance made easyFault tolerance made easy
Fault tolerance made easy
 
Patterns of resilience
Patterns of resiliencePatterns of resilience
Patterns of resilience
 
The promises and perils of microservices
The promises and perils of microservicesThe promises and perils of microservices
The promises and perils of microservices
 
No stress with state
No stress with stateNo stress with state
No stress with state
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack risk
 
Conway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective ITConway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective IT
 
Why resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudesWhy resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudes
 
Fantastic Elastic
Fantastic ElasticFantastic Elastic
Fantastic Elastic
 
How we sleep well at night using Hystrix at Finn.no
How we sleep well at night using Hystrix at Finn.noHow we sleep well at night using Hystrix at Finn.no
How we sleep well at night using Hystrix at Finn.no
 
Towards complex adaptive architectures
Towards complex adaptive architecturesTowards complex adaptive architectures
Towards complex adaptive architectures
 
Production-ready Software
Production-ready SoftwareProduction-ready Software
Production-ready Software
 
Resilient Functional Service Design
Resilient Functional Service DesignResilient Functional Service Design
Resilient Functional Service Design
 
Watch your communication
Watch your communicationWatch your communication
Watch your communication
 
DevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextDevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader context
 
Modern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITModern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of IT
 
The Next Generation (of) IT
The Next Generation (of) ITThe Next Generation (of) IT
The Next Generation (of) IT
 

Similar to Self healing data

Predective analytcis v0.1 AS
Predective analytcis v0.1 ASPredective analytcis v0.1 AS
Predective analytcis v0.1 AS
Ankur Sansanwal
 
VRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRP
Victor Pillac
 
Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues lists
James Wong
 

Similar to Self healing data (20)

Uwe Friedrichsen - CRDT und mehr - über extreme Verfügbarkeit und selbstheile...
Uwe Friedrichsen - CRDT und mehr - über extreme Verfügbarkeit und selbstheile...Uwe Friedrichsen - CRDT und mehr - über extreme Verfügbarkeit und selbstheile...
Uwe Friedrichsen - CRDT und mehr - über extreme Verfügbarkeit und selbstheile...
 
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
 
Dynamo: Not Just For Datastores
Dynamo: Not Just For DatastoresDynamo: Not Just For Datastores
Dynamo: Not Just For Datastores
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
 
Functional Operations - Susan Potter
Functional Operations - Susan PotterFunctional Operations - Susan Potter
Functional Operations - Susan Potter
 
Parallel Computing with R
Parallel Computing with RParallel Computing with R
Parallel Computing with R
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
Predective analytcis v0.1 AS
Predective analytcis v0.1 ASPredective analytcis v0.1 AS
Predective analytcis v0.1 AS
 
Keynote: Machine Learning for Design Automation at DAC 2018
Keynote:  Machine Learning for Design Automation at DAC 2018Keynote:  Machine Learning for Design Automation at DAC 2018
Keynote: Machine Learning for Design Automation at DAC 2018
 
VRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRP
 
1015 track2 abbott
1015 track2 abbott1015 track2 abbott
1015 track2 abbott
 
1030 track2 abbott
1030 track2 abbott1030 track2 abbott
1030 track2 abbott
 
MLSD18. Unsupervised Learning
MLSD18. Unsupervised LearningMLSD18. Unsupervised Learning
MLSD18. Unsupervised Learning
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
Parallelising Dynamic Programming
Parallelising Dynamic ProgrammingParallelising Dynamic Programming
Parallelising Dynamic Programming
 
مدخل إلى تعلم الآلة
مدخل إلى تعلم الآلةمدخل إلى تعلم الآلة
مدخل إلى تعلم الآلة
 
Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues lists
 

More from Uwe Friedrichsen

Timeless design in a cloud-native world
Timeless design in a cloud-native worldTimeless design in a cloud-native world
Timeless design in a cloud-native world
Uwe Friedrichsen
 
Deep learning - a primer
Deep learning - a primerDeep learning - a primer
Deep learning - a primer
Uwe Friedrichsen
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explained
Uwe Friedrichsen
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software design
Uwe Friedrichsen
 
Excavating the knowledge of our ancestors
Excavating the knowledge of our ancestorsExcavating the knowledge of our ancestors
Excavating the knowledge of our ancestors
Uwe Friedrichsen
 
The truth about "You build it, you run it!"
The truth about "You build it, you run it!"The truth about "You build it, you run it!"
The truth about "You build it, you run it!"
Uwe Friedrichsen
 

More from Uwe Friedrichsen (10)

Timeless design in a cloud-native world
Timeless design in a cloud-native worldTimeless design in a cloud-native world
Timeless design in a cloud-native world
 
Deep learning - a primer
Deep learning - a primerDeep learning - a primer
Deep learning - a primer
 
Life after microservices
Life after microservicesLife after microservices
Life after microservices
 
The hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developerThe hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developer
 
Digitization solutions - A new breed of software
Digitization solutions - A new breed of softwareDigitization solutions - A new breed of software
Digitization solutions - A new breed of software
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explained
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software design
 
Excavating the knowledge of our ancestors
Excavating the knowledge of our ancestorsExcavating the knowledge of our ancestors
Excavating the knowledge of our ancestors
 
The truth about "You build it, you run it!"
The truth about "You build it, you run it!"The truth about "You build it, you run it!"
The truth about "You build it, you run it!"
 
Life, IT and everything
Life, IT and everythingLife, IT and everything
Life, IT and everything
 

Recently uploaded

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Self healing data

  • 1. Self-healing data An introduction to consistency models, Quorums and CRDTs Uwe Friedrichsen, codecentric AG, 2014-2015
  • 2. @ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
  • 3. Why NoSQL? •  Scalability •  Easier schema evolution •  Availability on unreliable OTS hardware •  It‘s more fun ...
  • 4. Why NoSQL? •  Scalability •  Easier schema evolution •  Availability on unreliable OTS hardware •  It‘s more fun ...
  • 5. Challenges •  Giving up ACID transactions •  (Temporal) anomalies and inconsistencies
 short-term due to replication or long-term due to partitioning It might happen. Thus, it will happen!
  • 6. C Consistency A Availability P Partition Tolerance Strict Consistency ACID / 2PC Strong Consistency Quorum R&W / Paxos Eventual Consistency CRDT / Gossip / Hinted Handoff
  • 7. Strict Consistency (CA) •  Great programming model
 no anomalies or inconsistencies need to be considered •  Does not scale well
 best for single node databases „We know ACID – It works!“ „We know 2PC – It sucks!“ Use for moderate data amounts
  • 8. And what if I need more data? •  Distributed datastore •  Partition tolerance is a must •  Need to give up strict consistency (CP or AP)
  • 9. Strong Consistency (CP) •  Majority based consistency model
 can tolerate up to N nodes failing out of 2N+1 nodes •  Good programming model
 Single-copy consistency •  Trades consistency for availability
 in case of partitioning Paxos (for sequential consistency) Quorum-based reads & writes
  • 10. Quorum-based reads & writes •  N – number of nodes (replicas) •  R – number of reads delivering the same result required •  W – number of confirmed writes required W > N / 2 R + W > N
  • 11. Example 1 •  3 replica nodes
 This is: N = 3 •  W > N / 2 à W > 3 / 2 à W > 1,5
 Let‘s pick: W = 2 (can tolerate failure of 1 node) •  R + W > N à R + 2 > 3 à R > 1
 Let‘s pick: R = 2 (can tolerate failure of 1 node)
  • 12. Client Node 1 Node 3 Node 2
  • 13. W W W Client Node 1 Node 3 Node 2
  • 14. Client Node 1 Node 3 Node 2 ✔ ✖ ✔ W = 2 ✔
  • 15. Client Node 1 Node 3 Node 2
  • 16. R R R Client Node 1 Node 3 Node 2 ✔ ✔ ✔
  • 17. Client Node 1 Node 3 Node 2 2x / 1x ✔ R = 2 ✔
  • 18. Client Node 1 Node 3 Node 2
  • 19. R R R Client Node 1 Node 3 Node 2 ✔ ✔ ✖
  • 20. Client Node 1 Node 3 Node 2 1x / 1x ✖ R = 1 ✖
  • 21. Example 2 •  3 replica nodes
 This is: N = 3 •  W > N / 2 à W > 3 / 2 à W > 1,5
 Let‘s pick: W = 3 (strict consistency – does not tolerate any failure) •  R + W > N à R + 3 > 3 à R > 0
 Let‘s pick: R = 1 (single read from any node sufficient)
  • 22. Example 3 •  5 replica nodes
 This is: N = 5 •  W > N / 2 à W > 5 / 2 à W > 2,5
 Let‘s pick: W = 3 (can tolerate failure of 2 nodes) •  R + W > N à R + 3 > 5 à R > 2
 Let‘s pick: R = 3 (can tolerate failure of 2 nodes)
  • 23. Limitations of QB R&W •  Requires fixed set of replica nodes
 Dynamic adding and removal of nodes not possible •  Client-centric consistency
 Consistency only perceived on client, not on nodes •  Requires additional measures on nodes
 Nodes need to implement at least eventual consistency Use if client-perceived strong consistency is
 sufficient and simplicity trumps formal precision
  • 24. And what if I need more availability? •  Need to give up strong consistency (CP) •  Relax required consistency properties even more •  Leads to eventual consistency (AP)
  • 25. Eventual Consistency (AP) •  Gives up some consistency guarantees
 no sequential consistency, anomalies become visible •  Maximum availability possible
 can tolerate up to N-1 nodes failing out of N nodes •  Challenging programming model
 anomalies usually need to be resolved explicitly Gossip / Hinted Handoffs CRDT
  • 26. Conflict-free Replicated Data Types •  Eventually consistent, self-stabilizing data structures •  Designed for maximum availability •  Tolerates up to N-1 out of N nodes failing State-based CRDT: Convergent Replicated Data Type (CvRDT) Operation-based CRDT: Commutative Replicated Data Type (CmRDT)
  • 27. A bit of theory first ...
  • 28. Convergent Replicated Data Type State-based CRDT – CvRDT •  All replicas (usually) connected •  Exchange state between replicas, calculate new state on target replica •  State transfer at least once over eventually-reliable channels •  Set of possible states form a Semilattice •  Partially ordered set of elements where all subsets have a Least Upper Bound (LUB) •  All state changes advance upwards with respect to the partial order
  • 29. Commutative Replicated Data Type Operation-based CRDT - CmRDT •  All replicas (usually) connected •  Exchange update operations between replicas, apply on target replica •  Reliable broadcast with ordering guarantee for non-concurrent updates •  Concurrent updates must be commutative
  • 30. That‘s been enough theory ...
  • 32. Op-based Counter Data Integer i Init i ≔ 0 Query return i Operations increment(): i ≔ i + 1 decrement(): i ≔ i - 1
  • 33. State-based G-Counter (grow only) (Naïve approach) Data Integer i Init i ≔ 0 Query return i Update increment(): i ≔ i + 1 Merge( j) i ≔ max(i, j)
  • 34. State-based G-Counter (grow only) (Naïve approach) R1 R3 R2 i = 1 U i = 1 U i = 0 i = 0 i = 0 I I I M i = 1 i = 1 M
  • 35. State-based G-Counter (grow only) (Vector-based approach) Data Integer V[] / one element per replica set Init V ≔ [0, 0, ... , 0] Query return ∑i V[i] Update increment(): V[i] ≔ V[i] + 1 / i is replica set number Merge(V‘) ∀i ∈ [0, n-1] : V[i] ≔ max(V[i], V‘[i])
  • 36. State-based G-Counter (grow only) (Vector-based approach) R1 R3 R2 V = [1, 0, 0] U V = [0, 0, 0] I I I V = [0, 0, 0] V = [0, 0, 0] U V = [0, 0, 1] M V = [1, 0, 0] M V = [1, 0, 1]
  • 37. State-based PN-Counter (pos./neg.) •  Simple vector approach as with G-Counter does not work •  Violates monotonicity requirement of semilattice •  Need to use two vectors •  Vector P to track incements •  Vector N to track decrements •  Query result is ∑i P[i] – N[i]
  • 38. State-based PN-Counter (pos./neg.) Data Integer P[], N[] / one element per replica set Init P ≔ [0, 0, ... , 0], N ≔ [0, 0, ... , 0] Query Return ∑i P[i] – N[i] Update increment(): P[i] ≔ P[i] + 1 / i is replica set number decrement(): N[i] ≔ N[i] + 1 / i is replica set number Merge(P‘, N‘) ∀i ∈ [0, n-1] : P[i] ≔ max(P[i], P‘[i]) ∀i ∈ [0, n-1] : N[i] ≔ max(N[i], N‘[i])
  • 39. Non-negative Counter Problem: How to check a global invariant with local information only? •  Approach 1: Only dec when local state is > 0 •  Concurrent decs could still lead to negative value •  Approach 2: Externalize negative values as 0 •  inc(negative value) == noop(), violates counter semantics •  Approach 3: Local invariant – only allow dec if P[i] - N[i] > 0 •  Works, but may be too strong limitation •  Approach 4: Synchronize •  Works, but violates assumptions and prerequisites of CRDTs
  • 40. Sets
  • 41. Op-based Set (Naïve approach) Data Set S Init S ≔ {} Query(e) return e ∈ S Operations add(e) : S ≔ S ∪ {e} remove(e): S ≔ S {e}
  • 42. Op-based Set (Naïve approach) R1 R3 R2 S = {e} add(e) S = {} S = {} S = {} I I I S = {e} S = {} rmv(e) add(e) add(e) add(e) rmv(e) add(e) S = {e} S = {e} S = {e} S = {}
  • 43. State-based G-Set (grow only) Data Set S Init S ≔ {} Query(e) return e ∈ S Update add(e): S ≔ S ∪ {e} Merge(S‘) S = S ∪ S‘
  • 44. State-based 2P-Set (two-phase) Data Set A, R / A: added, R: removed Init A ≔ {}, R ≔ {} Query(e) return e ∈ A ∧ e ∉ R Update add(e): A ≔ A ∪ {e} remove(e): (pre query(e)) R ≔ R ∪ {e} Merge(A‘, R‘) A ≔ A ∪ A‘, R ≔ R ∪ R‘
  • 45. Op-based OR-Set (observed-remove) Data Set S / Set of pairs { (element e, unique tag u), ... } Init S ≔ {} Query(e) return ∃u : (e, u) ∈ S Operations add(e): S ≔ S ∪ { (e, u) } / u is generated unique tag remove(e): pre query(e) R ≔ { (e, u) | ∃u : (e, u) ∈ S } /at source („prepare“) S ≔ S R /downstream („execute“)
  • 46. Op-based OR-Set (observed-remove) R1 R3 R2 S = {ea} add(e) S = {} S = {} S = {} I I I S = {eb} S = {} rmv(e) add(e) add(eb) add(ea) rmv(ea) add(eb) S = {eb} S = {eb} S = {ea, eb} S = {eb}
  • 47. More datatypes •  Register •  Dictionary (Map) •  Tree •  Graph •  Array •  List plus more representations for each datatype
  • 48. Garbage collection •  Sets could grow infinitely in worst case •  Garbage collection possible, but a bit tricky •  Only remove updates that can be deleted safely and were received by all replicas •  Usually implemented using vector clocks •  Can tolerate up to n-1 crashes •  Live only while no replicas are crashed •  Can induce surprising behavior sometimes •  Sometimes stronger consensus is needed (Paxos, ...)
  • 49. Limitations of CRDTs •  Very weak consistency guarantees
 Strives for „quiescent consistency“ •  Eventually consistent
 Not suitable for high-volume ID generator or alike •  Not easy to understand and model •  Not all data structures representable Use if availability is extremely important
  • 50. Further reading 1.  Shapiro et al., Conflict-free Replicated Data Types, Inria Research report, 2011 2.  Shapiro et al., A comprehensive study of Convergent and Commutative Replicated Data Types,
 Inria Research report, 2011 3.  Basho Tag Archives: CRDT,
 https://basho.com/tag/crdt/ 4.  Leslie Lamport,
 Time, clocks, and the ordering of events in a distributed system,
 Communications of the ACM (1978)
  • 51. Wrap-up •  CAP requires rethinking consistency •  Strict Consistency
 ACID / 2PC •  Strong Consistency
 Quorum-based R&W, Paxos •  Eventual Consistency
 CRDT, Gossip, Hinted Handoffs Pick your consistency model based on your
 consistency and availability requirements
  • 52. The real world is not ACID Thus, it is perfectly fine to go for a relaxed consistency model
  • 53. @ufried Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com