Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - Aleksandr Volochnev, Developer Advocate at DataStax

10 Little Servers
A Story of No Downtime

© DataStax, All Rights Reserved.
“Anything that can go wrong will go wrong”
Murphy is here, watching you.

Disaster-Tolerant Design Principles
Analyzing Cassandra Architecture

Step I: [Data] Replication
● Single copy is doomed
● It’s a question of time
● Replicate it!
● Inconsistency (Say goodbye to ACID)
● Consistency level control
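
The trade-off behind "consistency level control" comes down to simple arithmetic: a read is guaranteed to see the latest acknowledged write when the read and write replica sets must overlap. A minimal sketch, assuming a replication factor RF and hypothetical helper names `quorum` and `overlaps` (they are not part of any driver API):

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set shares at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(f"RF={rf}: QUORUM={q}")
    print("QUORUM read + QUORUM write overlap:", overlaps(q, q, rf))
    print("ONE read + ONE write overlap:", overlaps(1, 1, rf))
```

With RF=3, QUORUM reads and writes overlap, while ONE/ONE does not; the latter trades consistency for latency and availability.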

Step II: Replica Distribution
● What stays together is doomed
● It’s a question of time
● Distribute it
● Network delay
● Work with local_dc
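
One way to picture "work with local_dc": the client prefers replicas in its own data center and only needs a majority of those local replicas to answer, so cross-DC network delay stays off the request path. The sketch below is a toy model; `Endpoint`, `order_replicas` and `local_quorum` are hypothetical names, and the host and data center names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Endpoint:
    host: str
    dc: str


def order_replicas(replicas: List[Endpoint], local_dc: str) -> List[Endpoint]:
    """Local-DC replicas first; the stable sort keeps the original order otherwise."""
    return sorted(replicas, key=lambda r: r.dc != local_dc)


def local_quorum(replicas: List[Endpoint], local_dc: str) -> int:
    """Majority of the replicas that live in the local data center."""
    local = sum(1 for r in replicas if r.dc == local_dc)
    return local // 2 + 1


if __name__ == "__main__":
    replicas = [Endpoint(f"berlin-{i}", "berlin") for i in range(3)]
    replicas += [Endpoint(f"dublin-{i}", "dublin") for i in range(3)]
    print([r.host for r in order_replicas(replicas, "dublin")])
    print("local quorum:", local_quorum(replicas, "dublin"))
```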

Step III: Infrastructure Diversification
● Single platform is doomed
● Guess what? It’s a question of time.
● Diversify it
● Configuration discrepancies
● Platform-agnostic solution

Step IV: Durable Design
● Every unique node is a bottleneck…
● And Single Point of Failure
● No SPoF, everything is disposable
● Decentralization over Federalisation
● “Cattle over Pets”
● Collaboration is harder
● Paxos Consensus Protocol
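
To make the Paxos bullet concrete, here is a toy single-decree Paxos round. It is an in-process sketch only: there is no networking, no failures and no retries, and the names `Acceptor` and `propose` are illustrative, not Cassandra's implementation. Proposal numbers are assumed to start at 1 and to be unique per proposer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = 0                      # highest proposal number promised
    accepted_n: int = 0                    # number of the accepted proposal, 0 if none
    accepted_value: Optional[str] = None

    def prepare(self, n: int) -> Tuple[bool, int, Optional[str]]:
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, 0, None

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one round; return the chosen value, or None if the round lost."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p[0]]
    if len(promises) < majority:
        return None
    # Adopt the value of the highest-numbered proposal already accepted, if any.
    prior = max(promises, key=lambda p: p[1])
    if prior[2] is not None:
        value = prior[2]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

Note that no node is special here: any proposer can start a round, and once a value is chosen, later rounds can only confirm it.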

Step V: Horizontal Scaling
● Up-Scaling is Ooops-Scaling
● Expensive and not efficient
● Commodity Hardware
● Scale Out!
● Fleet Management
● Configuration Management
● Infrastructure Automation (IaC)

Step VI: Self-Aware Cluster Topology
● Situation changes quickly
● No manual management possible
● Schema-aware cluster
● Gossiping
● Early failure detection
● Coordination
● Query optimisation
● Schema-aware client
● Client-side routing
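
Gossiping and early failure detection can be sketched in a few lines: every node keeps a heartbeat counter per known node, merges views with a peer by taking the higher counter, and suspects nodes whose counter stops advancing. This is a deliberately simplified model with a fixed ring-neighbour exchange; real failure detectors are more sophisticated, and the names `merge`, `gossip_round` and `suspects` are hypothetical.

```python
from typing import Dict, Iterable, List

Heartbeats = Dict[str, int]


def merge(local: Heartbeats, remote: Heartbeats) -> Heartbeats:
    """Combine two views, keeping the highest heartbeat seen for each node."""
    merged = dict(local)
    for node, beat in remote.items():
        if beat > merged.get(node, -1):
            merged[node] = beat
    return merged


def gossip_round(views: Dict[str, Heartbeats], alive: Iterable[str]) -> None:
    """Live nodes bump their own heartbeat, then swap state with a neighbour."""
    names = sorted(alive)
    for name in names:
        views[name][name] = views[name].get(name, 0) + 1
    for i, name in enumerate(names):
        peer = names[(i + 1) % len(names)]
        combined = merge(views[name], views[peer])
        views[name] = dict(combined)
        views[peer] = dict(combined)


def suspects(before: Heartbeats, after: Heartbeats) -> List[str]:
    """Nodes whose heartbeat did not advance between two snapshots."""
    return sorted(n for n, beat in after.items() if beat <= before.get(n, -1))


if __name__ == "__main__":
    views = {name: {} for name in ("a", "b", "c")}
    gossip_round(views, ["a", "b", "c"])
    snapshot = dict(views["a"])
    gossip_round(views, ["a", "b"])          # node "c" has stopped responding
    print("suspected:", suspects(snapshot, views["a"]))
```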

Step VII: Failure Detection & Recovery
● Errors happen all the time
● Proper error handling is often missing
● Recovery is usually post-factum
● Every part is ready
● The node processing a request acts as the coordinator
● Parallel Async Dispatching
● Fail on write? Proactive Hinted handoff.
● Fail on read? Wait for next response &
decrease weight of a suspicious node.
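
The write path above can be sketched as follows: the coordinator sends the write to every replica, counts acknowledgements, and keeps a hint for each replica that was down so the write can be replayed later. This toy version dispatches sequentially rather than in parallel, and `StorageNode`, `Coordinator` and `replay_hints` are illustrative names, not a real API.

```python
class StorageNode:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # (replica, key, value) for writes that could not be delivered

    def write(self, key, value, required_acks):
        """Write to all replicas; succeed if enough of them acknowledge."""
        acks = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                acks += 1
            except ConnectionError:
                self.hints.append((replica, key, value))
        return acks >= required_acks

    def replay_hints(self):
        """Retry stored hints; keep the ones whose target is still down."""
        remaining = []
        for replica, key, value in self.hints:
            try:
                replica.write(key, value)
            except ConnectionError:
                remaining.append((replica, key, value))
        self.hints = remaining


if __name__ == "__main__":
    nodes = [StorageNode(f"node-{i}") for i in range(3)]
    nodes[2].up = False
    coordinator = Coordinator(nodes)
    print("write ok:", coordinator.write("user:1", "alice", required_acks=2))
    nodes[2].up = True
    coordinator.replay_hints()
    print("node-2 data:", nodes[2].data)
```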

Step VIII: Operational Simplicity
“Lack of laziness is the developer’s worst curse”
● Manual operations are error-prone, not transparent and time-wasting.
● All repeatable operations should be automated and traceable
● Partitioning automation
● Emergency rebalance automation
● Bootstrap automation
● Decommission automation
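
Partitioning, bootstrap and decommission can be automated because data placement is computed rather than configured. A minimal consistent-hashing sketch, assuming SHA-256 as a stand-in hash and hypothetical names `token` and `Ring`: adding a node moves only the keys that the new node takes over, and removing it puts them back.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (SHA-256 is only a stand-in hash)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []  # sorted list of (token, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Bootstrap: place the node's virtual nodes on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self._tokens, (token(f"{node}#{i}"), node))

    def remove(self, node):
        """Decommission: drop all of the node's virtual nodes."""
        self._tokens = [t for t in self._tokens if t[1] != node]

    def owner(self, key):
        """First virtual node clockwise from the key's token."""
        index = bisect.bisect_right(self._tokens, (token(key), "")) % len(self._tokens)
        return self._tokens[index][1]


if __name__ == "__main__":
    ring = Ring(["n1", "n2", "n3"])
    keys = [f"key-{i}" for i in range(1000)]
    before = {k: ring.owner(k) for k in keys}
    ring.add("n4")
    moved = [k for k in keys if ring.owner(k) != before[k]]
    print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```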

Step IX: Background Self-Healing
● Failures sneak in anyway
● Because of Murphy, blame him!
● Repair-on-Read
● On-demand repair
● NodeSync (DSE)
● Scheduled repairs (v4)
● Automated Background Process (unless you have 5000 perfect ops people) (no, you don’t)
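
Repair-on-read can be illustrated with a last-write-wins sketch: the read collects the answers of all replicas, picks the one with the newest timestamp, and writes it back to any replica that was stale or missing the key. Replicas are modelled as plain dictionaries, and `read_with_repair` is a hypothetical helper, not a driver call.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, str]  # (write timestamp, value)


def read_with_repair(replicas: List[Dict[str, Cell]], key: str) -> Optional[str]:
    """Return the newest value and push it to replicas that are behind."""
    answers = [replica.get(key) for replica in replicas]
    present = [answer for answer in answers if answer is not None]
    if not present:
        return None
    newest = max(present, key=lambda cell: cell[0])
    for replica, answer in zip(replicas, answers):
        if answer != newest:
            replica[key] = newest  # repair the stale or missing copy
    return newest[1]


if __name__ == "__main__":
    replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
    print(read_with_repair(replicas, "k"))
    print(replicas)
```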

Step X: Continuous Improvement
● Debugging of a distributed system is DEADLY HARD
● No, seriously. I mean that.
● Think ahead, make logs great again ©
● Transient unique transaction ID
● Continuous monitoring
● Post-Mortem & Root Cause Analysis
● Goal is MTTR=0
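
A unique transaction ID only helps debugging if every log line carries it. One possible approach in Python, using only the standard library (`contextvars` plus a `logging.Filter`): the ID is set once per request and stamped on each record automatically. The logger name, format and the `handle_request` function are assumptions made for this sketch.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being processed; "-" outside any request.
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id.get()  # expose the ID to the formatter
        return True


def build_logger(stream):
    logger = logging.getLogger("ten_little_servers")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(logging.Formatter("%(request_id)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def handle_request(logger, payload):
    """Process one request; every log line inside shares the same ID."""
    token = request_id.set(uuid.uuid4().hex)
    try:
        logger.info("received %s", payload)
        logger.info("stored %s", payload)
        return request_id.get()
    finally:
        request_id.reset(token)


if __name__ == "__main__":
    stream = io.StringIO()
    handle_request(build_logger(stream), "order-1")
    print(stream.getvalue(), end="")
```

In a distributed setup the same ID would also have to travel with the request between services, for example in a header, so that logs from different nodes can be joined.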

Real Life?
Let me show you the numbers

Netflix

Apple

Conclusion
TL;DR

Know your Principles
All Together
• Replicate Data
• Distribute Replicas
• Diversify Infrastructure
• Have no Single Point of Failure
• Scale Out
• Develop to be Self-Sufficient
• Design to Recover Quickly
• Simplify Management
• Automate Recovery
• Monitoring & Post-Mortem

Know the Principle
In Two Words
Expect Failure
Praise Failure
Design to Fail

Thank you! Questions?
Aleks Volochnev
Developer Advocate at DataStax
@HadesArchitect
After many years in software development as a developer,
technical lead, DevOps engineer and architect, Aleks now focuses
on distributed applications and cloud architecture. As a developer
advocate at DataStax, he shares his knowledge and expertise in
microservices, disaster-tolerant systems and hybrid platforms.
Ask me about Cassandra Day in your city!
