Cry in the dojo, laugh in the battlefield: how we constantly try to bring Scylla to its knees so you don't have to.
Roy Dahan
QA Manager, Scylla
Roy Dahan
Roy has over 10 years of experience testing large-scale distributed systems, with a focus on storage/data systems, and managing small to large teams responsible for all aspects of testing using a highly automated approach.
Our Goal
▪ Achieving the Highest Levels of System Stability & Availability
▪ Maintaining Data Integrity
▪ Preventing Performance Degradations Over Time
▪ Increasing User Confidence
All of the above, even when BAD THINGS happen, on “Production-like Environments”
How We Test Scylla
Scylla Testing:
▪ Unit: scylla-unittest
▪ Functional: dtest
▪ Compatibility: dtest, Driver Tests
▪ Integration: Janus-Graph Tests, Titan-test, Spark
▪ Scale / Performance: S-C-T
▪ Stress / Load: S-C-T, Cassandra Stress
▪ System / Longevity: S-C-T, Jepsen
Distributed Tests (dtest)
▪ Functional “Black Box” Tests
▪ Verifies our Compatibility with Cassandra
▪ Enhanced & Extended to Catch Scylla Regressions
▪ Around 10% (208) of the Reported Issues on the Scylla Project Reference a dtest (Detected/Reproduced by dtest)
▪ About 675 Tests Run Regularly as Part of the “Regression Suite”
Scylla-Cluster-Tests (SCT)
▪ Automation Library and Test Collection for Scylla & Cassandra Clusters
▪ Supports Multiple Backends such as: AWS / GCE / OpenStack / Libvirt
▪ Tests are Based on Chaos Engineering Principles:
o Build a Hypothesis around Steady State Behavior
o Vary Real-world Events
o Automate Experiments to Run Continuously
▪ Around 4% (105) of the Reported Issues on the Scylla Project Reference an SCT Test (Detected/Reproduced by an SCT Test)
SCT Longevity Testing
Test Setup (Our Defaults):
▪ Cluster of N Scylla DB Nodes (N=6)
▪ Set of X Loader Nodes (X=2)
▪ Scylla Monitoring Server
[Diagram: clients and the cluster of nodes]
SCT Longevity Testing
Test Setup - Example on GCE: [figure]
SCT Longevity Testing
The Test Flow:
▪ Client-Side Loaders Run Workloads (a set of cassandra-stress loads runs on the loaders: Write, Mixed, Counters, User Profiles)
▪ The Load Runs for X Hours / Days / Weeks
▪ A “Nemesis” Is Randomly Selected from the Predefined List
o Some Nemeses Disrupt Nodes in the Cluster
o Others Run Standard Cluster Operations
Current Nemesis types:
StopStartService
StopWaitStartService
Drainer
Decommission
CorruptThenRepair
CorruptThenRebuild
NoCorruptRepair
Refresh
MajorCompaction
ModifyTableProperties
Enospc
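The random selection described in the test flow can be sketched roughly as follows. This is an illustration only, not SCT's actual scheduler: `plan_nemesis_schedule` is a hypothetical helper that plans which disruption would start at which minute, and it assumes every disruption finishes before the next interval begins.

```python
import random

# Nemesis names as listed on the slide.
NEMESIS_TYPES = [
    "StopStartService", "StopWaitStartService", "Drainer", "Decommission",
    "CorruptThenRepair", "CorruptThenRebuild", "NoCorruptRepair", "Refresh",
    "MajorCompaction", "ModifyTableProperties", "Enospc",
]


def plan_nemesis_schedule(duration_minutes, interval_minutes, rng=None):
    """Return (start_minute, nemesis_name) pairs for one test run.

    Hypothetical helper: it only plans the run and executes nothing.
    """
    rng = rng or random.Random()
    schedule = []
    minute = interval_minutes
    while minute < duration_minutes:
        # Each slot gets an independent, uniformly random nemesis.
        schedule.append((minute, rng.choice(NEMESIS_TYPES)))
        minute += interval_minutes
    return schedule


plan = plan_nemesis_schedule(duration_minutes=60, interval_minutes=5,
                             rng=random.Random(42))
for start, name in plan[:3]:
    print("minute %d: %s" % (start, name))
```

Passing a seeded `random.Random` makes a run reproducible, which may help when a failure has to be replayed with the same sequence of disruptions.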
SCT Longevity Testing
Test Fixture Example:
test_duration: 5760
stress_cmd:
["cassandra-stress write cl=QUORUM duration=5760m -schema 'replication(factor=3)
compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000
-pop seq=1..100000000 -log interval=5",
"cassandra-stress counter_write cl=QUORUM duration=5760m -schema 'replication(factor=3)
compaction(strategy=DateTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000
-pop seq=1..1000000",
"cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)'
cl=QUORUM duration=5760m -port jmx=6868 -mode cql3 native -rate threads=100"]
n_db_nodes: 6
n_loaders: 2
n_monitor_nodes: 1
nemesis_class_name: 'ChaosMonkey'
nemesis_interval: 5
failure_post_behavior: keep
space_node_threshold: 644245094
ip_ssh_connections: 'private'
experimental: 'true'
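In this fixture the `duration=5760m` of every stress command matches `test_duration: 5760`, which corresponds to 4 days if `test_duration` is expressed in minutes (an assumption, though it agrees with the stress durations). A small consistency check along those lines might look like the sketch below. The helper names are hypothetical, and the commands are shortened to the options that matter here.

```python
import re

# Values copied from the fixture on the slide (commands shortened to the
# options that matter for this check).
test_duration = 5760  # assumed to be minutes
stress_cmds = [
    "cassandra-stress write cl=QUORUM duration=5760m -rate threads=1000",
    "cassandra-stress counter_write cl=QUORUM duration=5760m -rate threads=1000",
    "cassandra-stress user profile=/tmp/cs_mv_profile.yaml cl=QUORUM duration=5760m -rate threads=100",
]

# cassandra-stress accepts durations in seconds, minutes or hours.
DURATION_RE = re.compile(r"\bduration=(\d+)([smh])\b")
MINUTES_PER_UNIT = {"s": 1 / 60, "m": 1, "h": 60}


def stress_duration_minutes(cmd):
    """Extract the duration= option of a cassandra-stress command, in minutes."""
    match = DURATION_RE.search(cmd)
    if match is None:
        raise ValueError("no duration= option in: %s" % cmd)
    value, unit = match.groups()
    return int(value) * MINUTES_PER_UNIT[unit]


def mismatched_commands(cmds, expected_minutes):
    """Return the commands whose duration differs from the test duration."""
    return [c for c in cmds if stress_duration_minutes(c) != expected_minutes]


print("test length in days:", test_duration / (60 * 24))
print("mismatched commands:", mismatched_commands(stress_cmds, test_duration))
```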
SCT Longevity Testing
Nemesis Code Examples:

    def disrupt_destroy_data_then_repair(self):
        self._set_current_disruption('CorruptThenRepair %s' % self.target_node)
        # Delete a set of sstables from the data directory
        self._destroy_data()
        # Try to save the node
        self.repair_nodetool_repair()

    def disrupt_stop_wait_start_scylla_server(self, sleep_time=300):
        self._set_current_disruption('StopWaitStartService %s' % self.target_node)
        self.target_node.remoter.run('sudo systemctl stop scylla-server.service')
        self.target_node.wait_db_down()
        self.log.info("Sleep for %s seconds", sleep_time)
        time.sleep(sleep_time)
        self.target_node.remoter.run('sudo systemctl start scylla-server.service')
        self.target_node.wait_db_up()
SCT Longevity Testing
Test Verification & Analysis:
▪ Application Load (cassandra-stress) Doesn’t Stop
▪ Auto Detection of:
• Coredumps
• Errors
• Exceptions
• Operation Failures (repair, add node, refresh, compaction, etc.)
▪ Auto Detection of Performance Degradations (unexpectedly lower throughput / higher latencies due to operations)
▪ Compare Nemesis Execution Durations Across Builds to Detect Possible Regressions
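The auto-detection of coredumps, errors and exceptions could be approximated by scanning node logs for known patterns. The sketch below shows the idea only. The regular expressions and the sample log lines are invented for illustration and are not taken from SCT or from real Scylla output.

```python
import re

# Hypothetical patterns: the slide names the categories, not the regexes.
PATTERNS = {
    "coredump": re.compile(r"core dump|coredump", re.IGNORECASE),
    "exception": re.compile(r"exception", re.IGNORECASE),
    "error": re.compile(r"\berror\b", re.IGNORECASE),
}


def classify_line(line):
    """Return the first matching category for a log line, or None."""
    for category, pattern in PATTERNS.items():
        if pattern.search(line):
            return category
    return None


def scan_log(lines):
    """Count suspicious lines per category."""
    counts = {category: 0 for category in PATTERNS}
    for line in lines:
        category = classify_line(line)
        if category is not None:
            counts[category] += 1
    return counts


# Invented log lines, for illustration only.
sample = [
    "INFO  compaction finished",
    "ERROR repair failed on range 12",
    "WARN  runtime exception caught while streaming",
    "node-1: process dumped core (coredump written)",
]
print(scan_log(sample))
```

A line is counted once, under the first category that matches, so the order of `PATTERNS` decides how ambiguous lines are classified.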
SCT Longevity Testing
Longevity monitoring examples (each graph is correlated with Nemesis executions):
▪ “Total Requests Served” (op/s)
▪ “Requests Rate Served” (op/s per instance)
▪ “CPU utilization” (% per instance)
SCT Longevity Testing
Test Summary Output - Nemesis Execution:
50GB DataSet Test: (Nemesis every 5 minutes, 4 days)
-----------------------------------------------
| Nemesis Type         | Count | Avg Time (s) |
-----------------------------------------------
| CorruptThenRebuild   |   103 |        93.79 |
| Decommission         |   111 |       231.89 |
| Drainer              |   109 |        48.27 |
| CorruptThenRepair    |   113 |       285.71 |
| Refresh              |    95 |         7.72 |
| NoCorruptRepair      |    97 |       331.73 |
| StopStartService     |   133 |        26.92 |
| MajorCompaction      |   134 |        20.63 |
| ModifyTable          |   197 |         1.50 |
| Enospc               |   114 |        26.33 |
| StopWaitStartService |    98 |        66.30 |
-----------------------------------------------
1TB DataSet Test: (Nemesis every 30 minutes, 6 days)
-----------------------------------------------
| Nemesis Type         | Count | Avg Time (s) |
-----------------------------------------------
| CorruptThenRebuild   |     2 |       732.50 |
| Decommission         |     7 |      2913.86 |
| Drainer              |     6 |       213.00 |
| CorruptThenRepair    |     5 |      4942.60 |
| Refresh              |     6 |        10.50 |
| NoCorruptRepair      |     3 |      2835.33 |
| StopStartService     |     2 |       195.00 |
| MajorCompaction      |     3 |       663.33 |
| ModifyTable          |     6 |         4.67 |
| Enospc               |     6 |       221.00 |
| StopWaitStartService |     6 |       492.17 |
-----------------------------------------------
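A summary in this pipe-delimited layout is easy to post-process. As a hedged sketch, the hypothetical helpers below parse a few rows copied from the 1TB summary and estimate the total time each nemesis type consumed (count times average duration).

```python
# A few rows copied from the 1TB summary above.
SUMMARY = """
--------------------------------------------
| Nemesis Type | Count | Avg Time(s) |
| Decommission | 7 | 2913.86 |
| CorruptThenRepair | 5 | 4942.60 |
| Refresh | 6 | 10.50 |
--------------------------------------------
"""


def parse_summary(text):
    """Parse '| name | count | avg |' rows into (name, count, avg) tuples."""
    rows = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 3 or not cells[1].isdigit():
            continue  # skip borders, blank lines and the header row
        rows.append((cells[0], int(cells[1]), float(cells[2])))
    return rows


def total_seconds(rows):
    """Total time spent per nemesis type: count * average duration."""
    return {name: count * avg for name, count, avg in rows}


rows = parse_summary(SUMMARY)
for name, seconds in sorted(total_seconds(rows).items(), key=lambda kv: -kv[1]):
    print("%-20s %10.1f s" % (name, seconds))
```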
SCT Longevity Testing
Nemesis Execution Analysis:
Auto-analysis and reports based on test
statistics stored automatically in ElasticSearch
Example of Issue detected by Longevity
Example of Nemesis Added due to Issue
    def disrupt_modify_table_comment(self):
        self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
        # Set a random 24-letter table comment
        comment = ''.join(random.choice(string.ascii_letters) for _ in range(24))
        cmd = "ALTER TABLE keyspace1.standard1 with comment = '{}';".format(comment)
        self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address),
                                     verbose=True)

    def disrupt_modify_table_gc_grace_time(self):
        self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
        # Pick a random value in [216000, 864000) seconds
        gc_grace_seconds = random.randrange(216000, 864000)
        cmd = ("ALTER TABLE keyspace1.standard1 with comment = 'gc_grace_seconds changed' AND"
               " gc_grace_seconds = {};".format(gc_grace_seconds))
        self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address),
                                     verbose=True)
Multi DC Longevity - The plot thickens
Test Setup (Our Defaults):
▪ Cluster of N Scylla DB Nodes (N=15)
▪ Across M “Data Centers” (M=3)
▪ Set of X Loader Nodes (X=3)
▪ Scylla Monitoring Server
▪ Set of cassandra-stress Commands Running on the Loaders (Write, Mixed, Counters, User Profiles)
The tc utility is used to impose random network delays, packet drops and packet reordering between Data Centers.
[Diagram: three data centers (DC1, DC2, DC3), each with its own client]
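The tc-based disruption between Data Centers mentioned above can be illustrated with the netem queueing discipline. The sketch below only builds command strings and does not execute them. The interface name and the value ranges are assumptions, and attaching netem at the root of an interface affects all of its traffic, so a real setup would likely need filters to target inter-DC traffic only.

```python
import random


def build_netem_command(interface, rng=None):
    """Build a 'tc ... netem' command with random delay, loss and reordering.

    The value ranges are illustrative assumptions, not the ones SCT uses.
    The command string is only returned, never executed here.
    """
    rng = rng or random.Random()
    delay_ms = rng.randint(10, 200)
    jitter_ms = rng.randint(1, 10)
    loss_pct = rng.randint(1, 5)
    reorder_pct = rng.randint(5, 25)
    # netem needs a delay to be set for reordering to have any effect.
    return (
        "tc qdisc add dev {dev} root netem "
        "delay {delay}ms {jitter}ms loss {loss}% reorder {reorder}%"
    ).format(dev=interface, delay=delay_ms, jitter=jitter_ms,
             loss=loss_pct, reorder=reorder_pct)


def build_netem_cleanup(interface):
    """Command that removes the root qdisc again."""
    return "tc qdisc del dev {dev} root".format(dev=interface)


print(build_netem_command("eth0", random.Random(7)))
print(build_netem_cleanup("eth0"))
```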
Performance Regression
▪ Set of Predefined Workloads & Setups
○ Write
○ Read
○ Mixed
○ Customer Workloads
▪ Storing Results (Op/s, Throughput, Latency) in ElasticSearch
▪ Master Daily Regression Suite - Automatically Compare Results with the Previous Build & the “Best” Build
▪ Release Regression Suite - Automatically Compare Results with Previous Releases (including RCs)
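The automatic comparison against a previous build and the “Best” build might be sketched as follows. The metric names, the 5% tolerance and all numbers are invented for illustration. The slides do not state which thresholds the real suite applies.

```python
def find_regressions(current, baseline, tolerance=0.05):
    """Compare one build's metrics against a baseline build.

    Higher is better for 'op_rate'; lower is better for 'latency_99_ms'.
    Returns a list of human-readable findings; empty means no regression.
    """
    findings = []
    if current["op_rate"] < baseline["op_rate"] * (1 - tolerance):
        findings.append("op rate dropped: %d -> %d op/s"
                        % (baseline["op_rate"], current["op_rate"]))
    if current["latency_99_ms"] > baseline["latency_99_ms"] * (1 + tolerance):
        findings.append("p99 latency rose: %.1f -> %.1f ms"
                        % (baseline["latency_99_ms"], current["latency_99_ms"]))
    return findings


# Invented numbers, for illustration only.
previous = {"op_rate": 100000, "latency_99_ms": 4.0}
best = {"op_rate": 110000, "latency_99_ms": 3.5}
current = {"op_rate": 101000, "latency_99_ms": 4.1}

print("vs previous:", find_regressions(current, previous))
print("vs best:", find_regressions(current, best))
```

Comparing against the best build as well as the previous one helps catch slow drifts, where each build is within tolerance of its predecessor but the series as a whole has degraded.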
Performance Regression
Test-Write - Total Op Rate (op/s) by Release: [graph]
Test-Write - 99th Percentile Latency (ms) by Release: [graph]
Large Scale Tests
▪ Clusters of 100s of Nodes
▪ Datasets of 10s of TB
▪ Multi-Core Scylla Nodes
▪ Many sstables
Sample of a 101-node Scylla cluster running on AWS. [figure]
On QA Roadmap
Longevity:
▪ Embed CharybdeFS (fault injection FS) in Longevity
▪ Extend Workload Types
▪ Two or More Nemeses in Parallel
▪ Add More “Sudden Death” Types of Nemesis
▪ Enable “sstables integrity checker”
Load & Scale:
▪ XXL Cluster Sizes (1000+ nodes)
▪ Enhance Load Testing to More Server Dimensions (Network, Disk)
On QA Roadmap
Performance:
▪ Add more “Real World Workloads” to Daily Regressions
▪ Performance Impact Per Operation (e.g. repair, majorCompaction)
▪ Collecting Latency Histograms for Various Load Types
3rd Party Integration:
▪ Spark & Titan Integration Suites
▪ Java & Golang Driver Integration Suites
Tools & Infrastructure:
▪ Enhance Auto Analysis Based on Statistics in ElasticSearch
▪ Run SCT on an Existing Environment
THANK YOU
Roy@scylladb.com
Please stay in touch
Any questions?

 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101
 



Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constantly Try to Bring Scylla to its Knees

  • 1. Cry in the dojo, laugh in the battlefield: how we constantly try to bring Scylla to its knees so you don't have to. Roy Dahan, QA Manager, Scylla
  • 2. Roy Dahan: Roy has over 10 years of experience testing large-scale distributed systems, with a focus on storage/data systems, and managing small to large teams responsible for all testing aspects using a highly automated approach.
  • 3. Our Goal
      ▪ Achieving the Highest Levels of System Stability & Availability
      ▪ Maintaining Data Integrity
      ▪ Preventing Performance Degradations Over Time
      ▪ Increasing User Confidence
      All of the above, even when BAD THINGS happen on “Production-like Environments”
  • 4. How We Test Scylla
      ▪ Unit: scylla-unittest
      ▪ Functional: dtest
      ▪ Compatibility: dtest, Driver Tests
      ▪ Integration: Janus-Graph Tests, Titan-test, Spark
      ▪ Scale / Performance: S-C-T
      ▪ Stress / Load: S-C-T, Cassandra Stress
      ▪ System / Longevity: S-C-T, Jepsen
  • 5. Distributed Tests (dtest)
      ▪ Functional “Black Box” Tests
      ▪ Verify Our Compatibility with Cassandra
      ▪ Enhanced & Extended to Catch Scylla Regressions
      ▪ Around 10% (208) of the Reported Issues on the Scylla Project Reference a dtest (Detected/Reproduced by dtest)
      ▪ About 675 Tests Run Regularly as Part of the “Regression Suite”
  • 6. Scylla-Cluster-Tests (SCT)
      ▪ Automation Library and Test Collection for Scylla & Cassandra Clusters
      ▪ Supports Multiple Backends such as: AWS / GCE / OpenStack / Libvirt
      ▪ Tests are Based on Chaos Engineering Principles:
          o Build a Hypothesis around Steady State Behavior
          o Vary Real-world Events
          o Automate Experiments to Run Continuously
      ▪ Around 4% (105) of the Reported Issues on the Scylla Project Reference an SCT Test (Detected/Reproduced by an SCT Test)
  • 7. SCT Longevity Testing: Test Setup (Our Defaults)
      ▪ Cluster of N Scylla DB Nodes (N=6)
      ▪ Set of X Loader Nodes (X=2)
      ▪ Scylla Monitoring Server
      [Diagram: clients and the cluster of nodes]
  • 8. SCT Longevity Testing: Test Setup - Example on GCE
  • 9. SCT Longevity Testing: The Test Flow
      ▪ Client-Side Loaders Run Workloads: a Set of Cassandra-Stress Loads (Write, Mixed, Counters, User Profiles)
      ▪ For X Hours / Days / Weeks
      ▪ A “Nemesis” Is Randomly Selected from the Predefined List
          o Some Nemeses Disrupt Nodes in the Cluster
          o Some Run Standard Cluster Operations
      Current Nemesis types: StopStartService, StopWaitStartService, Drainer, Decommission, CorruptThenRepair, CorruptThenRebuild, NoCorruptRepair, Refresh, MajorCompaction, ModifyTableProperties, Enospc
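The test flow on slide 9 (pick a random nemesis from a predefined list at a fixed interval while the load keeps running) can be sketched roughly as below. This is only an illustration under stated assumptions: `plan_nemesis_schedule` and its parameters are hypothetical names, not part of SCT, and the sketch ignores the time each nemesis takes to execute.

```python
import random

# Nemesis names as listed on the slide. The selection logic below is an
# illustrative assumption, not SCT's actual implementation.
NEMESIS_TYPES = [
    "StopStartService", "StopWaitStartService", "Drainer", "Decommission",
    "CorruptThenRepair", "CorruptThenRebuild", "NoCorruptRepair", "Refresh",
    "MajorCompaction", "ModifyTableProperties", "Enospc",
]


def plan_nemesis_schedule(duration_min, interval_min, seed=None):
    """Return a list of (start_minute, nemesis_name) pairs.

    One nemesis is drawn uniformly at random every `interval_min` minutes
    until `duration_min` is reached (hypothetical helper).
    """
    rng = random.Random(seed)
    schedule = []
    minute = interval_min
    while minute < duration_min:
        schedule.append((minute, rng.choice(NEMESIS_TYPES)))
        minute += interval_min
    return schedule


# Plan one hour of testing with a nemesis every 5 minutes.
schedule = plan_nemesis_schedule(duration_min=60, interval_min=5, seed=42)
for minute, name in schedule:
    print("t+%3d min: %s" % (minute, name))
```

Passing a seed makes a given random sequence reproducible, which can help when a failure needs to be replayed.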
  • 10. SCT Longevity Testing: Test Fixture Example

      test_duration: 5760
      stress_cmd: ["cassandra-stress write cl=QUORUM duration=5760m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000 -pop seq=1..100000000 -log interval=5",
                   "cassandra-stress counter_write cl=QUORUM duration=5760m -schema 'replication(factor=3) compaction(strategy=DateTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000 -pop seq=1..1000000",
                   "cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=5760m -port jmx=6868 -mode cql3 native -rate threads=100"]
      n_db_nodes: 6
      n_loaders: 2
      n_monitor_nodes: 1
      nemesis_class_name: 'ChaosMonkey'
      nemesis_interval: 5
      failure_post_behavior: keep
      space_node_threshold: 644245094
      ip_ssh_connections: 'private'
      experimental: 'true'
  • 12. SCT Longevity Testing: Nemesis Code Examples

      import time

      def disrupt_destroy_data_then_repair(self):
          self._set_current_disruption('CorruptThenRepair %s' % self.target_node)
          # Delete set of sstables from data directory
          self._destroy_data()
          # Try to save the node
          self.repair_nodetool_repair()

      def disrupt_stop_wait_start_scylla_server(self, sleep_time=300):
          self._set_current_disruption('StopWaitStartService %s' % self.target_node)
          # Stop the service and wait until the node is reported down
          self.target_node.remoter.run('sudo systemctl stop scylla-server.service')
          self.target_node.wait_db_down()
          self.log.info("Sleep for %s seconds", sleep_time)
          time.sleep(sleep_time)
          # Start the service again and wait until the node is back up
          self.target_node.remoter.run('sudo systemctl start scylla-server.service')
          self.target_node.wait_db_up()
  • 13. SCT Longevity Testing: Test Verification & Analysis
      ▪ Application Load (cassandra-stress) Doesn’t Stop
      ▪ Auto Detection of:
          • Coredumps
          • Errors
          • Exceptions
          • Operation Failures (repair, add node, refresh, compaction, etc.)
      ▪ Auto Detection of Performance Degradations (unexpected lower throughput / higher latencies due to operations)
      ▪ Compare Nemesis Execution Durations Across Builds to Detect Possible Regressions
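The last point on slide 13, comparing nemesis execution durations across builds, could look something like the sketch below. The function name, the 20% tolerance, and all duration values are assumptions made up for illustration; the deck does not state which threshold or comparison rule SCT uses.

```python
def find_duration_regressions(baseline, candidate, tolerance=0.20):
    """Return {nemesis: (baseline_avg, candidate_avg)} for slowdowns.

    A nemesis is flagged when its average duration in the candidate build
    exceeds the baseline average by more than `tolerance` (a fraction).
    The threshold is an assumed value, not one taken from the slides.
    """
    regressions = {}
    for name, base_avg in baseline.items():
        new_avg = candidate.get(name)
        if new_avg is None or base_avg <= 0:
            continue  # nothing to compare against
        if new_avg > base_avg * (1.0 + tolerance):
            regressions[name] = (base_avg, new_avg)
    return regressions


# Hypothetical average durations in seconds for two builds.
baseline_build = {"Decommission": 230.0, "Drainer": 48.0, "Refresh": 8.0}
candidate_build = {"Decommission": 310.0, "Drainer": 50.0, "Refresh": 7.5}

flagged = find_duration_regressions(baseline_build, candidate_build)
for name, (old, new) in sorted(flagged.items()):
    print("%s: %.1fs -> %.1fs" % (name, old, new))
```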
  • 14. SCT Longevity Testing: Longevity monitoring example: “Total Requests Served” (op/s) correlated with Nemesis executions.
  • 15. SCT Longevity Testing: Longevity monitoring example: “Requests Rate Served” (op/s per instance) correlated with Nemesis executions.
  • 16. SCT Longevity Testing: Longevity monitoring example: “CPU utilization” (% per instance) correlated with Nemesis executions.
  • 17. SCT Longevity Testing: Test Summary Output - Nemesis Execution

      50GB DataSet Test (Nemesis every 5 minutes, 4 days):
      ----------------------------------------------
      | Nemesis Type         | Count | Avg Time(s) |
      ----------------------------------------------
      | CorruptThenRebuild   |   103 |       93.79 |
      | Decommission         |   111 |      231.89 |
      | Drainer              |   109 |       48.27 |
      | CorruptThenRepair    |   113 |      285.71 |
      | Refresh              |    95 |        7.72 |
      | NoCorruptRepair      |    97 |      331.73 |
      | StopStartService     |   133 |       26.92 |
      | MajorCompaction      |   134 |       20.63 |
      | ModifyTable          |   197 |        1.50 |
      | Enospc               |   114 |       26.33 |
      | StopWaitStartService |    98 |       66.30 |
      ----------------------------------------------

      1TB DataSet Test (Nemesis every 30 minutes, 6 days):
      ----------------------------------------------
      | Nemesis Type         | Count | Avg Time(s) |
      ----------------------------------------------
      | CorruptThenRebuild   |     2 |      732.50 |
      | Decommission         |     7 |     2913.86 |
      | Drainer              |     6 |      213.00 |
      | CorruptThenRepair    |     5 |     4942.60 |
      | Refresh              |     6 |       10.50 |
      | NoCorruptRepair      |     3 |     2835.33 |
      | StopStartService     |     2 |      195.00 |
      | MajorCompaction      |     3 |      663.33 |
      | ModifyTable          |     6 |        4.67 |
      | Enospc               |     6 |      221.00 |
      | StopWaitStartService |     6 |      492.17 |
      ----------------------------------------------
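A summary like the one on slide 17 (count and average time per nemesis type) can be derived from individual execution records. The sketch below assumes each record is a `(nemesis_name, duration_seconds)` pair; that record shape, the function names, and the sample values are hypothetical and not taken from SCT.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate (name, seconds) records into {name: (count, average)}."""
    totals = defaultdict(lambda: [0, 0.0])  # name -> [count, total seconds]
    for name, seconds in records:
        totals[name][0] += 1
        totals[name][1] += seconds
    return {name: (count, total / count)
            for name, (count, total) in totals.items()}


def format_summary(summary):
    """Render the summary as a plain-text table, one row per nemesis."""
    lines = ["| %-20s | %5s | %11s |" % ("Nemesis Type", "Count", "Avg Time(s)")]
    for name in sorted(summary):
        count, avg = summary[name]
        lines.append("| %-20s | %5d | %11.2f |" % (name, count, avg))
    return "\n".join(lines)


# Made-up execution records for illustration only.
records = [("Drainer", 40.0), ("Drainer", 50.0), ("Refresh", 8.0)]
summary = summarize(records)
print(format_summary(summary))
```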
  • 18. SCT Longevity Testing: Nemesis Execution Analysis: Auto-analysis and reports based on test statistics stored automatically in ElasticSearch
  • 19. Example of Issue Detected by Longevity
  • 20. Example of Nemesis Added due to Issue
  • 21. Example of Nemesis Added due to Issue

      import random
      import string

      def disrupt_modify_table_comment(self):
          self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
          # Set the table comment to a random 24-letter string
          comment = ''.join(random.choice(string.ascii_letters) for i in xrange(24))
          cmd = "ALTER TABLE keyspace1.standard1 with comment = '{}';".format(comment)
          self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address), verbose=True)

      def disrupt_modify_table_gc_grace_time(self):
          self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
          # Pick a random gc_grace_seconds value in [216000, 864000)
          gc_grace_seconds = random.choice(xrange(216000, 864000))
          cmd = ("ALTER TABLE keyspace1.standard1 with comment = 'gc_grace_seconds changed' AND"
                 " gc_grace_seconds = {};".format(gc_grace_seconds))
          self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address), verbose=True)
  • 22. Multi DC Longevity - The plot thickens
      Test Setup (Our Defaults):
      ▪ Cluster of N Scylla DB Nodes (N=15)
      ▪ Across M “Data Centers” (M=3)
      ▪ Set of X Loader Nodes (X=3)
      ▪ Scylla Monitoring Server
      ▪ Set of Cassandra-Stress Commands Running on the Loaders (Write, Mixed, Counters, User Profiles)
      The tc utility is used to impose random network delays, packet drops and packet reordering between Data Centers.
      [Diagram: DC1, DC2 and DC3, each with a client]
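Slide 22 only says that `tc` is used to impose random delays, drops and reordering; it does not show the commands. One plausible way to do this is with the `netem` queueing discipline, sketched below. The interface name, the value ranges, and the helper name are assumptions, and the command as written would affect all traffic leaving the interface rather than only inter-DC traffic, so treat it as a starting point rather than the deck's actual setup.

```python
import random


def build_netem_command(interface, rng):
    """Build a `tc ... netem` command line with randomized impairments.

    Value ranges are arbitrary illustrative choices. netem needs a delay
    to be configured for `reorder` to take effect, so both are set together.
    """
    delay_ms = rng.randint(20, 200)             # base one-way delay
    jitter_ms = rng.randint(1, 20)              # delay variation
    loss_pct = round(rng.uniform(0.1, 2.0), 2)  # random packet loss
    reorder_pct = rng.randint(1, 25)            # share of reordered packets
    return (
        "tc qdisc add dev {iface} root netem "
        "delay {delay}ms {jitter}ms loss {loss}% reorder {reorder}%"
    ).format(iface=interface, delay=delay_ms, jitter=jitter_ms,
             loss=loss_pct, reorder=reorder_pct)


# "eth0" is a placeholder interface name.
command = build_netem_command("eth0", random.Random(7))
print(command)
```

The generated string would typically be run with root privileges on each node, and removed afterwards with `tc qdisc del dev eth0 root`.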
  • 23. Performance Regression
      ▪ Set of Predefined Workloads & Setups
          ○ Write
          ○ Read
          ○ Mixed
          ○ Customers' Workloads
      ▪ Storing Results (Op/s, Throughput, Latency) in ElasticSearch
      ▪ Master Daily Regression Suite - Automatically Compare Results with a Previous Build & “Best” Build
      ▪ Release Regression Suite - Automatically Compare Results with Previous Releases (including RCs)
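The daily comparison against a previous build and a “best” build described on slide 23 might be sketched as follows. The 5% threshold, the function name, and the throughput numbers are assumptions for illustration; the slides do not specify how the comparison is scored.

```python
def compare_build(current, previous, best, max_drop=0.05):
    """Compare a throughput result (op/s) with two reference builds.

    Returns, for each reference, the relative change and whether the drop
    exceeds `max_drop` (an assumed threshold, expressed as a fraction).
    """
    report = {}
    for label, reference in (("previous", previous), ("best", best)):
        change = (current - reference) / float(reference)
        report[label] = {"change": change, "regression": change < -max_drop}
    return report


# Hypothetical op/s results for a write workload.
report = compare_build(current=95000, previous=98000, best=102000)
for label in ("previous", "best"):
    entry = report[label]
    print("%-8s %+.1f%% regression=%s"
          % (label, entry["change"] * 100, entry["regression"]))
```

For latency metrics the sign flips, since a higher value is the regression; a real comparison would need to handle each metric's direction.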
  • 24. Performance Regression: Test-Write - Total Op Rate (op/s) by Release
  • 25. Performance Regression: Test-Write - 99th Percentile Latency (ms) by Release
  • 26. Large Scale Tests
      ▪ Clusters of 100s of Nodes
      ▪ 10s of TB DataSets
      ▪ Multi-Core Scylla Nodes
      ▪ Many sstables
      Sample: a 101-node Scylla cluster running on AWS.
  • 27. On QA Roadmap
      Longevity:
      ▪ Embed CharybdeFS (fault injection FS) in Longevity
      ▪ Extend Workload Types
      ▪ Two+ Nemesis in Parallel
      ▪ Adding More “Sudden Death” Types of Nemesis
      ▪ Enable “sstables integrity checker”
      Load & Scale:
      ▪ XXL Cluster Sizes (1000+ nodes)
      ▪ Enhance Load Testing to More Server Dimensions (Network, Disk)
  • 28. On QA Roadmap
      Performance:
      ▪ Add More “Real World Workloads” to Daily Regressions
      ▪ Performance Impact Per Operation (e.g. repair, majorCompaction)
      ▪ Collecting Latency Histograms for Various Load Types
      3rd Party Integration:
      ▪ Spark & Titan Integration Suites
      ▪ Java & Golang Driver Integration Suites
      Tools & Infrastructure:
      ▪ Enhance Auto Analysis Based on Statistics in ElasticSearch
      ▪ Running SCT Using an Existing Env
  • 29. THANK YOU. Any questions? Please stay in touch: Roy@scylladb.com