10 Little Servers
A Story of No Downtime
© DataStax, All Rights Reserved.
“Anything that can go wrong will go wrong”
Murphy is here, watching you.
Disaster-Tolerant Design Principles
Analyzing Cassandra Architecture
Step I: [Data] Replication
● Single copy is doomed
● It’s a question of time
● Replicate it!
● Inconsistency (Say goodbye to ACID)
● Consistency level control
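Consistency level control comes down to a little arithmetic. The sketch below is an illustration, not part of the talk: it assumes a replication factor of 3 and uses the common rule of thumb that a read is guaranteed to see the latest acknowledged write when read replicas + write replicas > replication factor. The function names are hypothetical.

```python
def quorum(rf: int) -> int:
    """Smallest majority of rf replicas."""
    return rf // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when every read set must intersect every write set."""
    return read_replicas + write_replicas > rf


RF = 3  # assumed replication factor for this illustration

print(quorum(RF))                            # 2
print(overlaps(quorum(RF), quorum(RF), RF))  # True: QUORUM reads + QUORUM writes
print(overlaps(1, 1, RF))                    # False: ONE + ONE may read stale data
```

Lower levels trade this overlap for latency and availability, which is the "say goodbye to ACID" part of the slide.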
Step II: Replica Distribution
● What stays together is doomed
● It’s a question of time
● Distribute it
● Network delay
● Work with local_dc
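Why working with the local data center helps against network delay can be shown by counting acknowledgements. This is a rough sketch under assumptions: the data center names and replication factors are made up, and only the three Cassandra consistency levels QUORUM, LOCAL_QUORUM and EACH_QUORUM are modelled.

```python
def quorum(rf: int) -> int:
    """Smallest majority of rf replicas."""
    return rf // 2 + 1


def acks_required(level: str, rf_by_dc: dict, local_dc: str) -> dict:
    """Acknowledgements a coordinator waits for, grouped by where they may come from."""
    if level == "LOCAL_QUORUM":
        # Only replicas in the coordinator's own data center are awaited.
        return {local_dc: quorum(rf_by_dc[local_dc])}
    if level == "EACH_QUORUM":
        # A majority is awaited in every data center.
        return {dc: quorum(rf) for dc, rf in rf_by_dc.items()}
    if level == "QUORUM":
        # A majority of all replicas, wherever they live.
        return {"any": quorum(sum(rf_by_dc.values()))}
    raise ValueError(f"level not modelled in this sketch: {level}")


# Hypothetical two-data-center keyspace with three replicas in each.
rf_by_dc = {"dc_berlin": 3, "dc_amsterdam": 3}

for level in ("LOCAL_QUORUM", "QUORUM", "EACH_QUORUM"):
    print(level, acks_required(level, rf_by_dc, "dc_berlin"))
```

With these numbers LOCAL_QUORUM waits for 2 local replicas, while QUORUM needs 4 of 6, so at least one acknowledgement has to cross the link between data centers.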
Step III: Infrastructure Diversification
● Single platform is doomed
● Guess what? It’s a question of time.
● Diversify it
● Configuration discrepancies
● Platform-agnostic solution
Step IV: Durable Design
● Every unique node is a bottleneck…
● And Single Point of Failure
● No SPoF, everything is disposable
● Decentralization over Federalisation
● “Cattle over Pets”
● Collaboration is harder
● Paxos Consensus Protocol
Step V: Horizontal Scaling
● Up-Scaling is Ooops-Scaling
● Expensive and not efficient
● Commodity Hardware
● Scale Out!
● Fleet Management
● Configuration Management
● Infrastructure Automation (IaC)
Step VI: Self-Aware Cluster Topology
● Situation changes quickly
● No manual management possible
● Schema-aware cluster
● Gossiping
● Early failure detection
● Coordination
● Query optimisation
● Schema-aware client
● Client-side routing
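Client-side routing can be sketched with a small token ring. This is a simplified model, not the driver's actual code: MD5 stands in for the partitioner hash (Cassandra's default partitioner uses Murmur3), and the class, node names and `vnodes` parameter are illustrative assumptions.

```python
import bisect
import hashlib


def token(key: str) -> int:
    # Stand-in hash for the sketch; not the hash Cassandra uses by default.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each node owns several tokens so that load spreads evenly.
        self._ring = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key, rf=3):
        """Walk clockwise from the key's token, collecting rf distinct nodes."""
        start = bisect.bisect_right(self._tokens, token(key))
        found = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == rf:
                break
        return found


ring = Ring(["node1", "node2", "node3", "node4", "node5"])
print(ring.replicas("user:42"))  # three distinct nodes; the first is the primary replica
```

A client that knows the ring can send each request straight to a node that owns the data, saving a hop through an arbitrary coordinator.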
Step VII: Failure Detection & Recovery
● Errors happen all the time
● Proper error handling is often missing
● Recovery is usually post-factum
● Every part is ready
● The node processing a request is its coordinator
● Parallel Async Dispatching
● Fail on write? Proactive hinted handoff.
● Fail on read? Wait for the next response & decrease the weight of the suspicious node.
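The hinted handoff idea can be sketched in a few lines. This is a toy model under assumptions: the `Replica` and `Coordinator` classes are hypothetical, writes are dispatched sequentially rather than in parallel, and hints live in memory instead of on disk.

```python
class Replica:
    def __init__(self, name):
        self.name, self.up, self.data = name, True, {}

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # (replica, key, value) kept for unreachable replicas

    def write(self, key, value, required_acks):
        acks = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                acks += 1
            except ConnectionError:
                # Remember the write so it can be delivered later.
                self.hints.append((replica, key, value))
        if acks < required_acks:
            raise RuntimeError(f"only {acks} of {required_acks} acks received")
        return acks

    def replay_hints(self):
        """Retry stored hints; keep those whose target is still down."""
        remaining = []
        for replica, key, value in self.hints:
            try:
                replica.write(key, value)
            except ConnectionError:
                remaining.append((replica, key, value))
        self.hints = remaining


a, b, c = Replica("a"), Replica("b"), Replica("c")
coordinator = Coordinator([a, b, c])
b.up = False
coordinator.write("k", "v", required_acks=2)  # succeeds with two acks, one hint stored
b.up = True
coordinator.replay_hints()                    # b catches up without operator action
```

The write succeeds as long as the consistency level is met, and the missed replica is repaired as soon as it returns.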
Step VIII: Operational Simplicity
“Lack of laziness is the developer’s worst curse”
● Manual operations are error-prone, opaque and time-consuming.
● All repeatable operations should be automated and traceable
● Partitioning automation
● Emergency rebalance automation
● Bootstrap automation
● Decommission automation
Step IX: Background Self-Healing
● Failures sneak in anyway
● Because of Murphy, blame him!
● Repair-on-Read
● On-demand repair
● NodeSync (DSE)
● Scheduled repairs (v4)
● Automated Background Process (unless you have 5000 perfect ops ppl) (no, you don’t)
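Repair-on-read can be illustrated with a last-write-wins merge. The sketch assumes each replica is a plain dict mapping a key to a `(timestamp, value)` pair; the function name is hypothetical, and real read repair compares digests and works per column rather than per key.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and push it to stale replicas."""
    answers = [(replica.get(key), replica) for replica in replicas]
    present = [answer for answer, _ in answers if answer is not None]
    if not present:
        return None
    newest = max(present, key=lambda answer: answer[0])  # highest timestamp wins
    for answer, replica in answers:
        if answer != newest:
            replica[key] = newest  # repair the stale or missing copy
    return newest[1]


r1 = {"k": (1, "old")}
r2 = {"k": (2, "new")}
r3 = {}
print(read_with_repair([r1, r2, r3], "k"))  # new
print(r1["k"], r3["k"])                     # (2, 'new') (2, 'new')
```

Reads therefore heal the data they touch; data that is rarely read still needs the scheduled or background repair listed above.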
Step X: Continuous Improvement
● Debugging a distributed system is DEADLY HARD
● No, seriously. I mean that.
● Think ahead, make logs great again ©
● Transient unique transaction ID
● Continuous monitoring
● Post-Mortem & Root Cause Analysis
● Goal is MTTR=0
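One way to get a unique transaction ID into every log line is sketched below with Python's standard `logging` and `contextvars` modules. The logger name, log format and `handle_request` function are assumptions for illustration; in a real system the ID would also travel with the request to every other service it touches.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being processed ("-" when idle).
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id.get()  # attach the ID to every record
        return True


def build_logger(stream):
    logger = logging.getLogger("ten-little-servers")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(levelname)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.handlers = [handler]
    return logger


def handle_request(logger):
    rid = uuid.uuid4().hex
    token = request_id.set(rid)
    try:
        logger.info("request received")
        logger.info("replica contacted")
    finally:
        request_id.reset(token)  # do not leak the ID into the next request
    return rid


buffer = io.StringIO()
rid = handle_request(build_logger(buffer))
print(buffer.getvalue())  # both lines start with the same ID
```

Searching the logs of all nodes for one ID then reconstructs the full path of a single request.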
Real Life?
Let me show you the numbers
Netflix
Apple
Conclusion
TL;DR
Know your Principles: All Together
● Replicate Data
● Distribute Replicas
● Diversify Infrastructure
● Have no Single Point of Failure
● Scale Out
● Develop to be Self-Sufficient
● Design to Recover Quickly
● Simplify Management
● Automate Recovery
● Monitoring & Post-Mortem
Know the Principle: In Two Words
Expect Failure
Praise Failure
Design to Fail
Thank you! Questions?
Aleks Volochnev
Developer Advocate at DataStax
@HadesArchitect
After many years in software development as a developer,
technical lead, DevOps engineer and architect, Aleks now focuses
on distributed applications and cloud architecture. As a
developer advocate at DataStax, he shares his knowledge
and expertise in microservices, disaster-tolerant
systems and hybrid platforms.
Ask me about Cassandra Day in your city!

Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - Aleksandr Volochnev, Developer Advocate at DataStax
