Testing a complex system like Scylla is a challenge on its own. There are many environments, workloads, and problems. Simple problems become increasingly worse at scale. In this talk, we will explore the testing method that we employ in our QA lab and our plans to make it even better in years to come.
Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constantly Try to Bring Scylla to its Knees
1. PRESENTATION TITLE ON ONE LINE
AND ON TWO LINES
First and last name
Position, company
Cry in the dojo, laugh in the
battlefield: how we constantly
try to bring Scylla to its knees so
you don't have to.
QA Manager, Scylla
Roy Dahan
Roy Dahan
Roy has over 10 years of experience testing
large-scale distributed systems, with a focus on
storage/data systems, and managing small to large
teams responsible for all testing aspects using a
highly automated approach.
Our Goal
▪ Achieve the Highest Levels of System Stability & Availability
▪ Maintain Data Integrity
▪ Prevent Performance Degradation Over Time
▪ Increase User Confidence
All of the above, even when BAD THINGS happen on
“Production-like Environments”
How We Test Scylla
Scylla Testing:
▪ Unit: scylla-unittest
▪ Functional: dtest
▪ Compatibility: dtest, Driver Tests
▪ Integration: Janus-Graph Tests, Titan-test, Spark
▪ Scale / Performance: S-C-T
▪ Stress / Load: S-C-T, Cassandra Stress
▪ System / Longevity: S-C-T, Jepsen
Distributed Tests (dtest)
▪ Functional “Black Box” Tests
▪ Verifies our Compatibility with Cassandra
▪ Enhanced & Extended to Catch Scylla Regressions
▪ Around 10% (208) of the Reported Issues on the Scylla Project
Reference a dtest (Detected/Reproduced by dtest)
▪ About 675 Tests Run Regularly as Part of the “Regression Suite”
Scylla-Cluster-Tests (SCT)
▪ Automation Library and Test Collection for Scylla & Cassandra
Clusters
▪ Supports Multiple Backends such as: AWS / GCE / OpenStack /
Libvirt
▪ Tests are Based on Chaos Engineering Principles:
o Build a Hypothesis around Steady State Behavior
o Vary Real-world Events
o Automate Experiments to Run Continuously
▪ Around 4% (105) of the Reported Issues on the Scylla Project
Reference an SCT Test (Detected/Reproduced by an SCT Test)
SCT Longevity Testing
Test Setup (Our Defaults):
▪ Cluster of N Scylla DB nodes (N=6)
▪ Set of X Loader Nodes (X=2)
▪ Scylla Monitoring Server
[Diagram: clients connected to the cluster of nodes]
SCT Longevity Testing
Test Setup - Example on GCE:
SCT Longevity Testing
The Test Flow:
▪ Client-Side Loaders Run Workloads
(a Set of Cassandra-Stress Loads Run on the Loaders: Write,
Mixed, Counters, User Profiles)
▪ For X Hours / Days / Weeks
▪ A “Nemesis” from the Predefined List is
Randomly Selected
o Some Nemeses Disrupt Nodes in the
Cluster
o Others Run Standard Cluster
Operations
Current Nemesis types:
StopStartService
StopWaitStartService
Drainer
Decommission
CorruptThenRepair
CorruptThenRebuild
NoCorruptRepair
Refresh
MajorCompaction
ModifyTableProperties
Enospc
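The random selection step in the test flow above can be sketched roughly as follows. This is only an illustration and not the SCT implementation: the `SketchNemesis` class, its three stub disruptions, the `run_chaos` helper, and the `interval_seconds` parameter are all hypothetical names. The only ideas taken from the slides are that a nemesis is picked at random from a predefined list and that disruption methods are named with a `disrupt_` prefix.

```python
import random


class SketchNemesis:
    """Hypothetical stand-in for a nemesis collection; not the SCT class."""

    def __init__(self, nodes, seed=None):
        self.nodes = list(nodes)
        self.rng = random.Random(seed)
        self.log = []  # (disruption name, target node) tuples

    # Stub disruptions: a real nemesis would act on the node here.
    def disrupt_stop_start_service(self, node):
        self.log.append(("StopStartService", node))

    def disrupt_major_compaction(self, node):
        self.log.append(("MajorCompaction", node))

    def disrupt_decommission(self, node):
        self.log.append(("Decommission", node))

    def disruptions(self):
        # The "predefined list": every method named disrupt_*.
        return sorted(n for n in dir(self) if n.startswith("disrupt_"))

    def run_once(self):
        # Pick one disruption and one target node at random, then run it.
        name = self.rng.choice(self.disruptions())
        node = self.rng.choice(self.nodes)
        getattr(self, name)(node)
        return name, node


def run_chaos(nemesis, rounds, sleep=lambda seconds: None, interval_seconds=300):
    """Run `rounds` random disruptions, pausing between them.

    `sleep` defaults to a no-op so the sketch runs instantly;
    pass time.sleep to actually wait.
    """
    for _ in range(rounds):
        nemesis.run_once()
        sleep(interval_seconds)
    return nemesis.log


if __name__ == "__main__":
    chaos = SketchNemesis(["node1", "node2", "node3"], seed=1)
    for entry in run_chaos(chaos, rounds=5):
        print(entry)
```

Discovering disruptions by method name means a new nemesis only needs a new `disrupt_*` method to join the random rotation.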
SCT Longevity Testing
Test Fixture Example:
test_duration: 5760
stress_cmd:
["cassandra-stress write cl=QUORUM duration=5760m -schema 'replication(factor=3)
compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000
-pop seq=1..100000000 -log interval=5",
"cassandra-stress counter_write cl=QUORUM duration=5760m -schema 'replication(factor=3)
compaction(strategy=DateTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000
-pop seq=1..1000000",
"cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)'
cl=QUORUM duration=5760m -port jmx=6868 -mode cql3 native -rate threads=100"]
n_db_nodes: 6
n_loaders: 2
n_monitor_nodes: 1
nemesis_class_name: 'ChaosMonkey'
nemesis_interval: 5
failure_post_behavior: keep
space_node_threshold: 644245094
ip_ssh_connections: 'private'
experimental: 'true'
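A quick sanity check of the numbers in this fixture, as a minimal sketch. It assumes that `test_duration` is expressed in minutes (which matches the `duration=5760m` in every stress command) and that `space_node_threshold` is a byte count; neither unit is stated on the slide. The `fixture` dictionary below is a hand-copied, shortened version of the values above, and the helper names are hypothetical.

```python
import re

# Values copied from the fixture above; stress commands shortened for brevity.
fixture = {
    "test_duration": 5760,
    "stress_cmd": [
        "cassandra-stress write cl=QUORUM duration=5760m -rate threads=1000",
        "cassandra-stress counter_write cl=QUORUM duration=5760m -rate threads=1000",
        "cassandra-stress user profile=/tmp/cs_mv_profile.yaml cl=QUORUM duration=5760m",
    ],
    "space_node_threshold": 644245094,
}

GIB = 1024 ** 3


def stress_durations_minutes(commands):
    """Extract the 'duration=<N>m' value from each cassandra-stress command."""
    durations = []
    for command in commands:
        match = re.search(r"duration=(\d+)m\b", command)
        if match:
            durations.append(int(match.group(1)))
    return durations


def summarize(fixture):
    minutes = fixture["test_duration"]
    durations = stress_durations_minutes(fixture["stress_cmd"])
    return {
        # Assumption: test_duration is in minutes.
        "days": minutes / (60 * 24),
        # Every stress command should run for the whole test.
        "stress_matches_test": all(d == minutes for d in durations),
        # Assumption: space_node_threshold is in bytes.
        "threshold_gib": round(fixture["space_node_threshold"] / GIB, 2),
    }


summary = summarize(fixture)
print(summary)
# {'days': 4.0, 'stress_matches_test': True, 'threshold_gib': 0.6}
```

Under those assumptions, the fixture describes a four-day run in which every load lasts as long as the test itself.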
SCT Longevity Testing
Nemesis Code Examples:
import time

# Excerpt: methods of an SCT nemesis class (class definition not shown).

def disrupt_destroy_data_then_repair(self):
    self._set_current_disruption('CorruptThenRepair %s' % self.target_node)
    # Delete a set of sstables from the data directory
    self._destroy_data()
    # Try to save the node
    self.repair_nodetool_repair()

def disrupt_stop_wait_start_scylla_server(self, sleep_time=300):
    self._set_current_disruption('StopWaitStartService %s' % self.target_node)
    # Stop the Scylla service and wait until the node is down
    self.target_node.remoter.run('sudo systemctl stop scylla-server.service')
    self.target_node.wait_db_down()
    # Keep the node down for sleep_time seconds before restarting it
    self.log.info("Sleep for %s seconds", sleep_time)
    time.sleep(sleep_time)
    self.target_node.remoter.run('sudo systemctl start scylla-server.service')
    self.target_node.wait_db_up()
SCT Longevity Testing
Test Verification & Analysis:
▪ Application Load (cassandra-stress) Doesn’t Stop
▪ Auto Detection of:
• Coredumps
• Errors
• Exceptions
• Operations failures (repair, add node, refresh, compaction, etc.)
▪ Auto Detection of Performance Degradations (unexpected lower throughput
/ higher latencies due to operations)
▪ Compare Nemesis Execution Durations Across Builds to Detect Possible
Regressions
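The last check above, comparing nemesis execution durations across builds, could look something like the sketch below. It is an illustration under stated assumptions rather than the SCT analysis code: the function name, the use of the median, and the 25% `tolerance` are all hypothetical choices, and the durations in the usage example are made up.

```python
from statistics import median


def find_duration_regressions(baseline, candidate, tolerance=0.25):
    """Flag nemeses whose median duration grew by more than `tolerance`.

    baseline / candidate: dict mapping a nemesis name to a list of
    execution durations in seconds, one list per build.
    Returns {nemesis name: (baseline median, candidate median)}.
    """
    regressions = {}
    for name, new_runs in candidate.items():
        old_runs = baseline.get(name)
        if not old_runs or not new_runs:
            continue  # nothing to compare against
        old, new = median(old_runs), median(new_runs)
        if new > old * (1 + tolerance):
            regressions[name] = (old, new)
    return regressions


if __name__ == "__main__":
    # Made-up durations, for illustration only.
    previous_build = {"Decommission": [600, 620, 610], "MajorCompaction": [300, 310]}
    current_build = {"Decommission": [900, 950, 920], "MajorCompaction": [305, 300]}
    print(find_duration_regressions(previous_build, current_build))
```

A median is used here so that one unusually slow run does not trigger a false alarm; other statistics would be equally plausible.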
SCT Longevity Testing
Longevity monitoring example:
“Total Requests Served” (op/s) correlated with Nemesis executions.
SCT Longevity Testing
Longevity monitoring example:
“Requests Rate Served” (op/s per instance) correlated with Nemesis executions.
SCT Longevity Testing
Longevity monitoring example:
“CPU utilization” (% per instance) correlated with Nemesis executions.
SCT Longevity Testing
Nemesis Execution Analysis:
Auto-analysis and reports based on test
statistics stored automatically in ElasticSearch
Example of Issue detected by Longevity
Example of Nemesis Added due to Issue
import random
import string

# Excerpt: methods of an SCT nemesis class (class definition not shown).

def disrupt_modify_table_comment(self):
    self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
    # Set a random 24-letter table comment
    comment = ''.join(random.choice(string.ascii_letters) for _ in range(24))
    cmd = "ALTER TABLE keyspace1.standard1 with comment = '{}';".format(comment)
    self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address),
                                 verbose=True)

def disrupt_modify_table_gc_grace_time(self):
    self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
    # Pick a random gc_grace_seconds in [216000, 864000)
    gc_grace_seconds = random.randrange(216000, 864000)
    # Parentheses keep the two-line string literal in a single statement
    cmd = ("ALTER TABLE keyspace1.standard1 with comment = 'gc_grace_seconds changed' AND"
           " gc_grace_seconds = {};".format(gc_grace_seconds))
    self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address),
                                 verbose=True)
Multi DC Longevity - The plot thickens
Test Setup (Our Defaults):
▪ Cluster of N Scylla DB nodes (N=15)
▪ Across M “Data Centers” (M=3)
▪ Set of X Loaders nodes. (X=3)
▪ Scylla Monitoring Server.
▪ Set of Cassandra-Stress commands
running on the loaders (Write,
Mixed, Counters, User Profiles).
The tc utility is used to impose random network delays,
packet drops, and packet reordering between Data Centers.
[Diagram: three data centers (DC1, DC2, DC3), each with its own client]
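The slides do not show the actual tc invocations, so the following is only a sketch of how random delay, loss, and reordering might be expressed with the `netem` queueing discipline. The helper names, the interface name `eth0`, and the upper bounds are assumptions. Note also that a root `netem` qdisc affects all traffic leaving the interface; limiting it to traffic between data centers would need additional tc filters, which are not shown here.

```python
import random


def build_netem_command(interface, rng, max_delay_ms=100,
                        max_loss_pct=1.0, max_reorder_pct=25.0):
    """Build a tc/netem command string with randomly chosen impairments.

    All bounds are hypothetical defaults, not values from the talk.
    netem requires a delay for reordering, so delay is at least 1 ms.
    """
    delay = rng.randint(1, max_delay_ms)
    loss = round(rng.uniform(0.0, max_loss_pct), 2)
    reorder = round(rng.uniform(0.0, max_reorder_pct), 2)
    return (
        "sudo tc qdisc add dev {iface} root netem "
        "delay {delay}ms loss {loss}% reorder {reorder}%"
    ).format(iface=interface, delay=delay, loss=loss, reorder=reorder)


def clear_netem_command(interface):
    """Build the command that removes the netem qdisc again."""
    return "sudo tc qdisc del dev {} root netem".format(interface)


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so a run can be reproduced
    print(build_netem_command("eth0", rng))
    print(clear_netem_command("eth0"))
```

The functions only build command strings; in a test framework they would be passed to whatever remote runner executes commands on the node.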
Performance Regression
▪ Set of Predefined Workloads & Setups
○ Write
○ Read
○ Mixed
○ Customer Workloads
▪ Storing Results (Op/s, Throughput, Latency) in ElasticSearch
▪ Master Daily Regression Suite - Automatically Compare Results
with a Previous Build & “Best” Build
▪ Release Regression Suite - Automatically Compare Results with
Previous Releases (including RCs)
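The comparison against a previous build and a “best” build can be illustrated with a small sketch. The function name, the result keys (`ops`, `latency_99_ms`), and the 5% `max_change` threshold are hypothetical, and the numbers in the usage example are invented; the talk states only that results are stored in ElasticSearch and compared automatically.

```python
def compare_build(current, previous, best, max_change=0.05):
    """Compare one build's results with the previous and the best build.

    Each argument is a dict with:
      'ops'           - operations per second (higher is better)
      'latency_99_ms' - 99th percentile latency in ms (lower is better)
    A regression is flagged when throughput drops, or latency rises,
    by more than `max_change` (a fraction, 0.05 = 5%).
    """
    report = {}
    for label, reference in (("previous", previous), ("best", best)):
        ops_change = (current["ops"] - reference["ops"]) / reference["ops"]
        latency_change = (
            (current["latency_99_ms"] - reference["latency_99_ms"])
            / reference["latency_99_ms"]
        )
        report[label] = {
            "ops_change": ops_change,
            "latency_change": latency_change,
            "regression": ops_change < -max_change or latency_change > max_change,
        }
    return report


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    current = {"ops": 90000, "latency_99_ms": 11.0}
    previous = {"ops": 100000, "latency_99_ms": 10.0}
    best = {"ops": 120000, "latency_99_ms": 8.0}
    for label, result in compare_build(current, previous, best).items():
        print(label, result)
```

Comparing against the best build as well as the previous one helps catch slow drifts, where each build is only slightly worse than the one before it.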
Performance Regression
Test-Write - Total Op rate (op/s) by Release:
Performance Regression
Test-Write - 99th Percentile Latency (ms) by Release:
Large Scale Tests
▪ Clusters of Hundreds of Nodes
▪ Datasets of Tens of TB
▪ Multi-Core Scylla Nodes
▪ Many sstables
Sample: a 101-node Scylla cluster running on AWS.
On QA Roadmap
Longevity:
▪ Embed CharybdeFS (fault injection FS) in Longevity
▪ Extend workload types
▪ Two+ Nemesis in Parallel
▪ Adding more “Sudden Death” Types of Nemesis
▪ Enable “sstables integrity checker”
Load & Scale:
▪ XXL Cluster Sizes (1000+ nodes)
▪ Extend Load Testing to More Server Dimensions (Network, Disk)
On QA Roadmap
Performance:
▪ Add more “Real World Workloads” to Daily Regressions
▪ Performance Impact Per Operation (e.g. repair, majorCompaction)
▪ Collecting Latency Histograms for Various Load Types
3rd Party Integration:
▪ Spark & Titan Integration Suites
▪ Java & Golang Driver Integration Suites
Tools & Infrastructure:
▪ Enhance auto analysis based on Statistics in ElasticSearch
▪ Running SCT using an Existing Env
THANK YOU
Roy@scylladb.com
Please stay in touch
Any questions?