ABOUT NETFLIX
NETFLIX
ACTIVE - ACTIVE
WHAT IS ACTIVE-ACTIVE
Also called dual active, the phrase describes a network
of independent processing nodes, each with access to a
replicated database, so that every node can serve the
same application. In an active-active system, all
requests are load balanced across all available
processing capacity. When a node fails, another node in
the network takes its place.
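The definition above can be sketched as a toy round-robin balancer. This is only an illustration under stated assumptions: the `ActiveActivePool` class and the node names are hypothetical and do not come from the talk.

```python
from itertools import cycle


class ActiveActivePool:
    """Toy round-robin balancer: every healthy node serves traffic."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        """Record a node failure so that routing skips it."""
        self.healthy.discard(node)

    def route(self):
        """Return the next healthy node in round-robin order."""
        if not self.healthy:
            raise RuntimeError("no healthy nodes left")
        # One full turn of the ring is enough to reach a healthy node.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes left")


pool = ActiveActivePool(["us-east-1", "us-west-2"])
before_failure = [pool.route() for _ in range(4)]  # both nodes take traffic
pool.mark_down("us-east-1")
after_failure = [pool.route() for _ in range(3)]   # the survivor takes over
```

Both nodes serve requests until one is marked down, after which the remaining node absorbs all traffic. That is the behaviour the definition describes, in miniature.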
DOES AN INSTANCE FAIL?
• It can, plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey
DOES A ZONE FAIL?
• Rarely, but it has happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla
DOES A REGION FAIL?
• Full region – unlikely, very rare
• Individual services can fail region-wide
• Most likely cause: a region-wide configuration issue
• Test with Chaos Kong
EVERYTHING FAILS… EVENTUALLY
• Keep your services running by embracing isolation and
redundancy
• Construct a highly agile and highly available service
from ephemeral and assumed broken components
ISOLATION
• Changes in one region should not affect others
• Regional outage should not affect others
• Network partitioning between regions should not affect
functionality / operations
REDUNDANCY
• Make more than one (of pretty much everything)
• Specifically, distribute services across Availability
Zones and regions
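As a rough illustration of why spreading capacity across zones and regions matters, the sketch below knocks out one instance, one zone, or one region from a made-up fleet and counts what survives. The fleet shape (2 regions x 3 zones x 2 instances) and all names are assumptions chosen for the example.

```python
# Hypothetical fleet: 2 regions x 3 zones x 2 instances per zone.
FLEET = [
    (region, zone, f"{region}{zone}-i{n}")
    for region in ("us-east-1", "us-west-2")
    for zone in ("a", "b", "c")
    for n in range(2)
]


def survivors(fleet, region=None, zone=None, instance=None):
    """Return the members still running after the given scope fails.

    Pass only `instance` for an instance failure, `region` and `zone`
    for a zone failure, or only `region` for a region failure.
    """
    def failed(member):
        r, z, i = member
        if instance is not None:
            return i == instance
        if zone is not None:
            return r == region and z == zone
        return r == region

    return [m for m in fleet if not failed(m)]


after_instance = survivors(FLEET, instance="us-east-1a-i0")
after_zone = survivors(FLEET, region="us-east-1", zone="a")
after_region = survivors(FLEET, region="us-east-1")
```

With this layout, even a full region failure leaves half of the fleet running in the other region.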
HISTORY: X-MAS EVE 2012
• Netflix multi-hour outage
• US-East-1 regional Elastic Load Balancing issue
• “...data was deleted by a maintenance process
that was inadvertently run against the
production ELB state data”
ACTIVE-ACTIVE ARCHITECTURE
THE PROCESS
IDENTIFYING CLUSTERS FOR AA
SNITCH CHANGES
• Ec2Snitch: uses private IPs
• Ec2MultiRegionSnitch: uses public IPs
PRIAM.MULTIREGION.ENABLE =TRUE
tcp 7101-7101 [ ] [10.190.21.36/32, 10.232.200.17/32, 10.33.573.26/32,
10.20.151.165/32, 10.226.99.46/32, 10.244.143.193/32]
tcp 7103-7103 [ ] [54.196.221.136/32, 54.202.200.217/32, 54.203.57.226/32,
54.205.151.165/32, 54.226.99.46/32, 54.244.143.193/32]
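The two rules above differ in the kind of address they list: 10.x entries for port 7101 and 54.x entries for port 7103. A small sketch with Python's standard `ipaddress` module makes the private/public split explicit. The `classify` helper is hypothetical, and the entries are copied verbatim from the rules, including one that does not parse.

```python
import ipaddress

# Entries copied verbatim from the two rules above.
RULE_7101 = [
    "10.190.21.36/32", "10.232.200.17/32", "10.33.573.26/32",
    "10.20.151.165/32", "10.226.99.46/32", "10.244.143.193/32",
]
RULE_7103 = [
    "54.196.221.136/32", "54.202.200.217/32", "54.203.57.226/32",
    "54.205.151.165/32", "54.226.99.46/32", "54.244.143.193/32",
]


def classify(cidrs):
    """Split entries into private, public, and unparseable lists."""
    result = {"private": [], "public": [], "invalid": []}
    for cidr in cidrs:
        try:
            network = ipaddress.ip_network(cidr)
        except ValueError:
            # For example, an octet above 255.
            result["invalid"].append(cidr)
            continue
        result["private" if network.is_private else "public"].append(cidr)
    return result


by_7101 = classify(RULE_7101)
by_7103 = classify(RULE_7103)
```

The entry 10.33.573.26/32 ends up in the "invalid" list because its third octet is out of range. The slide does not say what the intended address was, so it is left as printed.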
SPIN UP NODES IN NEW REGION
us-east-1 us-west-2
APP
UPDATE KEYSPACE
Update keyspace <keyspace> with placement_strategy =
'NetworkTopologyStrategy'
and strategy_options = {us-east : 3, us-west-2 : 3};
(us-east : 3 is the existing region and its replication factor;
us-west-2 : 3 is the new region and its replication factor.)
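The statement above is written in the older cassandra-cli syntax. On clusters administered through CQL, the equivalent change is an ALTER KEYSPACE statement. The helper below is a sketch that only builds that statement as text; the function name and the keyspace name `my_keyspace` are placeholders, not names from the talk.

```python
def alter_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CQL ALTER KEYSPACE statement for NetworkTopologyStrategy.

    `replicas_per_dc` maps datacenter name -> replication factor.
    """
    options = {"class": "NetworkTopologyStrategy", **replicas_per_dc}
    body = ", ".join(f"'{key}': {value!r}" for key, value in options.items())
    return f"ALTER KEYSPACE {keyspace} WITH replication = {{{body}}};"


statement = alter_keyspace_cql("my_keyspace", {"us-east": 3, "us-west-2": 3})
print(statement)
```

The datacenter names have to match what the snitch reports for each region, which is why the existing region appears as us-east rather than us-east-1 in the options.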
REBUILD NEW REGION
Run "nodetool rebuild us-east-1" on every us-west-2 node
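A dry-run sketch of the rebuild step: it only assembles one `nodetool rebuild` command line per new node and does not execute anything. The host names are hypothetical. The source datacenter argument is taken from the slide as written; in practice it must match the datacenter name the cluster itself reports (for example in `nodetool status`), which may be us-east as in the keyspace options.

```python
def rebuild_commands(new_region_hosts, source_dc):
    """Return one `nodetool rebuild` argument list per node in the new region."""
    return [
        ["nodetool", "-h", host, "rebuild", source_dc]
        for host in new_region_hosts
    ]


# Hypothetical host names for the us-west-2 nodes.
hosts = ["cass-usw2-a-1", "cass-usw2-b-1", "cass-usw2-c-1"]
commands = rebuild_commands(hosts, "us-east-1")
for argv in commands:
    print(" ".join(argv))
```

Keeping the commands as argument lists makes it straightforward to hand them to a job runner later, once the datacenter name has been checked.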
RUN NODETOOL REPAIR
VALIDATION
BENCHMARKING GLOBAL CASSANDRA
WRITE INTENSIVE TEST OF CROSS-REGION REPLICATION CAPACITY
• 16 x hi1.4xlarge SSD nodes per zone = 96 total
• 192 TB of SSD in six locations, up and running
Cassandra in 20 minutes
• Cassandra replicas in Zones A, B and C of two regions:
US-West-2 (Oregon) and US-East-1 (Virginia)
• Loads: test load and validation load
• 1 million writes at CL.ONE (wait for one replica to ack)
• 1 million reads after 500 ms at CL.ONE with no data loss
• Interzone traffic within each region
• Interregional traffic: up to 9 Gbit/s, 83 ms
• 18 TB of backups from S3
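The capacity figures above are consistent with one another, as a quick check shows. The constants are taken from the slide; the per-node figure is derived from them and is not stated in the talk.

```python
NODES_PER_ZONE = 16
ZONES_PER_REGION = 3   # Zones A, B and C
REGIONS = 2            # US-West-2 and US-East-1
TOTAL_SSD_TB = 192

locations = ZONES_PER_REGION * REGIONS         # zones across both regions
total_nodes = NODES_PER_ZONE * locations       # nodes across all zones
ssd_per_node_tb = TOTAL_SSD_TB / total_nodes   # derived, not from the slide

print(locations, total_nodes, ssd_per_node_tb)
```

This gives six locations, 96 nodes, and 2 TB of SSD per node.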
TEST FOR THUNDERING HERD
TEST FOR RETRIES
FAILURE
RETRY
KEY METRICS USED
• 99th / 95th percentile read latency (Client & C*)
• Dropped Metrics on C*
• Exceptions on C*
• Heap Usage on C*
• CPU Usage (Client & C*)
• Threads Pending on C*
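For readers less familiar with percentile latencies, here is a minimal sketch using the nearest-rank method. The talk does not say how its percentiles were computed, so the method is an assumption, and the sample latencies are invented.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented data: 100 read latencies of 1 ms, 2 ms, ..., 100 ms.
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Tracking the 95th and 99th percentiles rather than the mean exposes the slow tail of requests, which is where cross-region effects tend to appear first.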
CONFIGURATION FOR TEST
• 24 C* nodes on SSDs
• 220 client instances
• 70+ JMeter instances
C* IOPS
TOTAL READ IOPS
TOTAL WRITE IOPS
95th LATENCY
99th LATENCY
CHECK FOR CEILING
NETWORK PARTITION
us-east-1 us-west-2
TAKEAWAYS
REPAIRS AFTER EXTENSION ARE PAINFUL !!
TIME TO REPAIR DEPENDS ON
• Number of regions
• Number of replicas
• Data size
• Amount of entropy
ADJUST GC_GRACE AFTER EXTENSION
• Column family setting
• Defined in seconds
• Default is 10 days (864,000 seconds)
• Tweak gc_grace settings to accommodate the time
taken to repair
• BEWARE of deleted columns: if a repair takes longer
than gc_grace, deleted data can reappear
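A small sketch of the arithmetic behind this advice. The 10-day default is from the slide; the `gc_grace_for_repair` helper and its two-day safety margin are assumptions made for illustration, not a recommendation from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DEFAULT_GC_GRACE_SECONDS = 10 * SECONDS_PER_DAY  # the 10-day default


def gc_grace_for_repair(repair_days, safety_margin_days=2):
    """Hypothetical rule of thumb: cover the repair time plus a margin,
    and never drop below the default."""
    needed = (repair_days + safety_margin_days) * SECONDS_PER_DAY
    return max(DEFAULT_GC_GRACE_SECONDS, needed)


short_repair = gc_grace_for_repair(3)    # the default already covers it
long_repair = gc_grace_for_repair(14)    # must be raised above the default
```

The value is expressed in seconds because that is the unit the column family setting uses.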
RUNBOOK
PLAN FOR CAPACITY
CONSISTENCY LEVEL
• Check the client for consistency level setting
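• In a multi-region cluster, QUORUM is not the same as
LOCAL_QUORUM: QUORUM counts replicas in every region,
LOCAL_QUORUM only those in the local region
• Recommended consistency levels:
LOCAL_ONE (CASSANDRA-6202) for reads
and LOCAL_QUORUM for writes
• For region resiliency, avoid ALL or
QUORUM calls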
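The difference between the two levels follows from the quorum formula, floor(replicas / 2) + 1, applied either to all replicas or only to the local region's replicas. The sketch below uses the replication factors from the keyspace update (3 per region); the helper names are hypothetical.

```python
def quorum(replicas):
    """Number of replicas that must respond for a quorum."""
    return replicas // 2 + 1


REPLICAS = {"us-east": 3, "us-west-2": 3}

global_quorum = quorum(sum(REPLICAS.values()))  # QUORUM: counts all regions
local_quorum = quorum(REPLICAS["us-east"])      # LOCAL_QUORUM: one region


def quorum_survives_region_loss(replicas, lost_dc):
    """True if a global QUORUM is still reachable after losing one region."""
    remaining = sum(n for dc, n in replicas.items() if dc != lost_dc)
    return remaining >= quorum(sum(replicas.values()))


quorum_ok = quorum_survives_region_loss(REPLICAS, "us-west-2")
local_ok = REPLICAS["us-east"] >= local_quorum
```

With 3 + 3 replicas, QUORUM needs 4 acknowledgements, so every request must cross regions and cannot succeed if one region is lost. LOCAL_QUORUM needs only 2, all from the local region.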
HOW DO WE KNOW IT WORKS?
CREATE CHAOS!!
Benchmark …
Time Consuming
But worth it!
Active Active - C* Behind the Scenes at Netflix
