ScyllaDB is a distributed database designed to scale horizontally and vertically — in theory. What about in practice? ScyllaDB’s Benny Halevy, Director, Software Engineering, will take you through the process and results of benchmarking our NoSQL database at the petabyte level, showing how you can use advanced features like workload prioritization to control priorities of transactional (read-write) and analytic (read-only) queries on the same cluster with smooth and predictable performance.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
2. Benny Halevy
■ Leading the storage software development team at ScyllaDB.
■ Benny has been working on operating systems and distributed
file systems for over 20 years.
■ Most recently, Benny led software development for GSI
Technology, and previously co-founded Tonian (later acquired by
Primary Data) and led it as CTO.
■ Before Tonian, Benny was the lead architect in Panasas of the
pNFS protocol.
Dir. Software Eng. ScyllaDB
3. Background and Motivation
As more applications are hosted on public and private
clouds and increasingly larger datasets are collected
and analyzed, there is a need to support petabyte-scale
applications.
+ Billions of users × entities generate petabytes of data.
+ Rapid data collection.
+ Online Transaction Processing (OLTP)
+ Combined with analytics (OLAP)
4. Application Modeling
To model a petabyte scale application we chose to run two concurrent workloads:
+ Large user data dataset (containing per-user data)
+ Read mostly
+ Regularly updated
+ Used by analytics applications
+ Smaller, yet real-time oriented application dataset
+ E.g. online bidding for ad-placement (OLTP)
+ Requires low-latency to meet real-time deadlines and maximize algorithms’ efficiency.
5. Back of the Envelope Sizing
+ 1 Billion users
+ 10,000 records per user
+ 100 bytes per record
+ ➞ 1 PetaByte of storage
+ 10 Million auctions
+ 1,000 records per auction
+ 1,000 bytes per record
+ ➞ Several Terabytes of storage
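The sizing above can be sanity-checked with a few lines of arithmetic (plain Python; all figures are from the slide):

```python
# Back-of-the-envelope sizing, using the figures from the slide.
users = 1_000_000_000            # 1 billion users
records_per_user = 10_000
bytes_per_user_record = 100
user_total = users * records_per_user * bytes_per_user_record
print(user_total / 1e15, "PB of user data")        # 1.0 PB

auctions = 10_000_000            # 10 million auctions
records_per_auction = 1_000
bytes_per_auction_record = 1_000
auction_total = auctions * records_per_auction * bytes_per_auction_record
print(auction_total / 1e12, "TB of auction data")  # 10.0 TB
```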
7. What were our Goals?
+ Construct a PB-scale Scylla cluster.
+ Load the database with data: on the order of 1 PB of user data and 1 TB of application data.
+ Run concurrent workloads against the user and application datasets, measuring throughput and latency:
■ 5M tps user workload (read-only and 80/20 R/W, high throughput)
■ 200K tps application workload (50/50 R/W, low latency)
+ Demonstrate the use of workload prioritization.
8. Bill of Materials
+ Scylla cluster: 20 x i3en.metal AWS instances, each having:
+ 96 vCPUs
+ 768 GiB RAM
+ 60 TB NVMe disk space
+ 100 Gbps network bandwidth
+ Load Generators: 50 x c5n.9xlarge AWS instances, each having:
+ 36 vCPUs
+ 96 GiB RAM
+ 50 Gbps network bandwidth
9. Software Used
+ Scylla Enterprise: version 2021.1.6
+ Cassandra-Stress: running over the Scylla shard-aware Java driver
+ Workload generator
+ Scylla-monitoring stack for metrics collection and presentation
+ Using Prometheus and Grafana
10. User Workload
The user keyspace was constructed as a key/value dataset.
+ 500 billion keys
+ Variable size values with a mean size of 600 bytes
+ Representing 1 PB of uncompressed text data with 3.33x compression ratio
+ LZ4 Compression
+ Replication Factor (RF) of 2
+ Consistency Level (CL): ONE
+ Keys were randomly selected in a uniform distribution.
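The talk does not show the actual schema, but a keyspace along these lines would match the description above (keyspace, table, and column names are hypothetical):

```cql
-- Hypothetical schema matching the user dataset description.
CREATE KEYSPACE user_ks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'replication_factor': 2};

CREATE TABLE user_ks.user_data (
  key   blob PRIMARY KEY,
  value blob   -- variable-size values, ~600 bytes mean
) WITH compression = {'sstable_compression': 'LZ4Compressor'};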
11. User Workload
The read-only query workload was generated using cassandra-stress.
+ Each of the 50 load generators used a normal distribution to draw random keys
out of its assigned 1/50 range of the keys.
+ threads=1000 fixed=100_000/s
+ Total of 5M read tps
+ Workload ran for 3 hours with 5 minutes warm-up time.
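A cassandra-stress invocation along these lines would generate such a workload per loader; this is a sketch only: the host name and key range are assumptions, as the real command line was not shown in the talk.

```shell
# One of 50 load generators; each draws keys from its own 1/50 slice
# of the keyspace (the range below is illustrative).
cassandra-stress read cl=ONE duration=180m \
  -pop 'dist=GAUSSIAN(1..10000000000)' \
  -rate threads=1000 fixed=100000/s \
  -node scylla-node-1
```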
12. User Workload
The 80/20 read/write query workload was generated using cassandra-stress as well.
+ Each of the 50 load generators used a normal distribution to draw random keys
out of its assigned 1/50 range of the keys.
+ threads=1000 fixed=100_000/s
+ Total of
■ 4M read tps
■ 1M write tps
+ Workload ran for 3 hours with 5 minutes warm-up time.
13. Application Workload
The application keyspace was constructed as a key/value dataset.
+ 6 billion keys
+ Fixed size values of 250 bytes
+ Representing about 3 TB of uncompressed binary data with 2x compression ratio
+ LZ4 Compression
+ Replication Factor (RF) of 2
+ Consistency Level (CL): QUORUM
+ Keys were randomly selected in a uniform distribution.
14. Application Workload
The 50/50 read/write query workload was generated using cassandra-stress.
+ Each of the 50 load generators used a normal distribution to draw random keys
out of its assigned 1/50 range of the keys.
+ threads=1000 fixed=4_000/s
+ Total of:
■ 100K read tps
■ 100K write tps
+ Workload ran for 3 hours with 5 minutes warm-up time.
20. Storage demands during ingestion
+ Today’s disks are able to handle multi-GB/s workloads
22. Storage demands during ingestion
+ 900 MB/s of commitlog writes per instance: 7.5M inserts/sec × 3000 bytes × RF(2) / 50 nodes.
+ Around 6 GB/s per instance of compaction I/O.
+ Overall: 20 nodes × 6 GB/s → 120 GB/s!
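The bandwidth figures above can be reproduced directly (plain Python; all numbers are taken from the slide as shown):

```python
# Reproduce the slide's ingestion-bandwidth arithmetic.
inserts_per_sec = 7_500_000   # cluster-wide insert rate during load
bytes_per_insert = 3_000      # bytes written per insert (slide's figure)
rf = 2                        # replication factor
nodes = 50                    # divisor used on the slide

commitlog_per_node = inserts_per_sec * bytes_per_insert * rf / nodes
print(commitlog_per_node / 1e6, "MB/s commitlog writes per instance")  # 900.0

compaction_per_node = 6e9     # ~6 GB/s of compaction I/O per instance
cluster_nodes = 20
print(cluster_nodes * compaction_per_node / 1e9, "GB/s cluster-wide")  # 120.0
```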
23. Incremental Compaction in Action
+ ICS creates and deletes equal-sized sstables that dramatically reduce
temporary space amplification during compaction.
24. Scaling system throughput
+ How much can the Scylla petabyte cluster be loaded while still providing
single-digit-millisecond 99th-percentile latency?
25. Concurrent workloads: R/W + Read-only
Workload: Application: 280K 50/50 R/W; User: 7M read-only
(throughput in transactions/second; latency in milliseconds)
+ Application write latency: 0.821 P50, 2.232 P99
+ Application read latency: 1.433 P50, 6.832 P99
+ User read latency: 0.885 P50, 6.350 P99
+ 7M user read ops/sec + 280K application 50/50 R/W ops/sec
+ Stable high throughput with <10 ms 99% latency.
26. Cache Efficiency
+ Note that the cache hit rate is only a little over 1% due to random
key/value reads.
+ Potential for BYPASS CACHE to further improve read throughput.
+ Previous tests showed that BYPASS CACHE may improve performance by 70%,
and an all-cached setup can be as much as 4x faster.
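BYPASS CACHE is a per-query CQL option in Scylla that skips the row cache for reads whose hit rate is too low to benefit from it; a sketch (keyspace and table names are hypothetical):

```cql
-- Read directly from sstables, without populating the row cache.
SELECT value FROM user_ks.user_data WHERE key = 0x00 BYPASS CACHE;
```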
27. Concurrent workloads: R/W + Read-only
Workload: Application: 200K 50/50 R/W; User: 5M read-only
(throughput in transactions/second; latency in milliseconds)
+ Application write latency: 0.632 P50, 1.398 P99
+ Application read latency: 1.046 P50, 2.279 P99
+ User read latency: 0.680 P50, 1.932 P99
+ 5M user read ops/sec + 200K application 50/50 R/W ops/sec
+ High throughput with low, ~2 ms application 99% latency.
30. Concurrent workloads: R/W + 80/20, with Workload Prioritization
(throughput in transactions/second; latency in milliseconds)
+ As the 80/20 user workload interfered with the application latency,
let's reduce its relative priority to better share the system resources.
Workload: Application: 200K R/W; User: 5M 80/20 R/W
Shares: Application 1000 before → 1000 after; User 1000 before → 500 after
+ Application write latency: 0.682 P50 / 2.454 P99 before → 0.354 P50 / 1.184 P99 after
+ User write latency: 0.326 P50 / 1.252 P99 before → 0.440 P50 / 3.244 P99 after
+ Application read latency: 1.195 P50 / 4.555 P99 before → 0.855 P50 / 3.731 P99 after
+ User read latency: 0.744 P50 / 3.709 P99 before → 1.043 P50 / 6.455 P99 after
31. Service Levels in Action
+ Each service level has its own per-shard queue for consuming CPU and I/O.
+ Application workload (200K ops/sec)
+ User workload (5M ops/sec)
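The shares shown on the previous slides are configured through CQL service levels attached to the roles running each workload; the service level and role names below are illustrative, and the exact syntax may vary by ScyllaDB Enterprise version:

```cql
-- Keep the latency-sensitive application workload at 1000 shares
-- and halve the bulk user workload to 500 shares.
CREATE SERVICE LEVEL application_sl WITH shares = 1000;
CREATE SERVICE LEVEL user_sl WITH shares = 500;
ATTACH SERVICE LEVEL application_sl TO application_role;
ATTACH SERVICE LEVEL user_sl TO user_role;
```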
33. Main Challenges
As expected, setting up and testing a petabyte-scale database was not trivial.
That said, it didn’t take any unreasonable effort.
+ Provisioning: it took time to find an AWS availability zone with enough on-demand instances
of the needed kind.
+ Hardware tuning:
+ Interrupt-handling CPUs had to be manually assigned to maximize throughput (a fix will be merged into our
out-of-the-box machine images).
+ cpupower governor set to "performance".
+ Benchmarking framework:
+ Cassandra-stress was not built for this scale (e.g., the default population distribution is too small).
+ The data collection library had issues with a large number of parallel machines.
34. Scylla Configuration
For running the benchmark, we used the following non-default configuration:
■ Node level
• 4 IRQ-serving CPUs (rather than 2 by default) - for handling high network throughput.
• mount -o discard (now default in OSS head-of-line)
■ scylla.yaml:
• compaction_static_shares: 100 - optimized for append-mostly workload (*)
• Head-of-line has improvements for compaction backlog controller
• compaction_enforce_min_threshold: true
■ Schema: compaction = {
'class': 'IncrementalCompactionStrategy',
'sstable_size_in_mb': 10000,
'space_amplification_goal': 1.25
} AND compression = {'sstable_compression': 'LZ4Compressor'};
35. Future Work
+ A whitepaper is coming up, expanding on this benchmark.
+ BYPASS CACHE:
+ Show the benefits of using the Scylla BYPASS CACHE CQL query option to optimize utilization for, e.g.,
random small-read workloads.
+ All-cached workload
+ Demonstrate maximum performance when the whole dataset fits in cache.
+ Bear in mind that the i3en.metal instances have 768 GB of memory each!