ScyllaDB is a distributed database designed to scale horizontally and vertically — in theory. What about in practice? ScyllaDB’s Benny Halevy, Director, Software Engineering, will take you through the process and results of benchmarking our NoSQL database at the petabyte level, showing how you can use advanced features like workload prioritization to control priorities of transactional (read-write) and analytic (read-only) queries on the same cluster with smooth and predictable performance.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
2. Benny Halevy
■ Leading the storage software development team at ScyllaDB.
■ Benny has been working on operating systems and distributed
file systems for over 20 years.
■ Most recently, Benny led software development for GSI
Technology, and previously co-founded Tonian (later acquired by
Primary Data) and led it as CTO.
■ Before Tonian, Benny was the lead architect in Panasas of the
pNFS protocol.
Dir. Software Eng. ScyllaDB
3. Background and Motivation
As more applications are hosted on public and private
clouds and increasingly larger datasets are collected
and analyzed, there is a need to support petabyte-scale
applications.
+ Billions of users × entities generate petabytes of data.
+ Rapid data collection.
+ Online Transaction Processing (OLTP)
+ Combined with analytics (OLAP)
4. Application Modeling
To model a petabyte scale application we chose to run two concurrent workloads:
+ Large user data dataset (containing per-user data)
+ Read mostly
+ Regularly updated
+ Used by analytics applications
+ Smaller, yet real-time oriented application dataset
+ E.g. online bidding for ad-placement (OLTP)
+ Requires low-latency to meet real-time deadlines and maximize algorithms’ efficiency.
5. Back of the Envelope Sizing
+ 1 Billion users
+ 10,000 records per user
+ 100 bytes per record
+ ➞ 1 PetaByte of storage
+ 10 Million auctions
+ 1,000 records per auction
+ 1,000 bytes per record
+ ➞ Several Terabytes of storage
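The sizing above can be sanity-checked with a few lines of arithmetic (plain Python; all figures are from the slide):

```python
# Back-of-the-envelope sizing, using the figures from the slide.
users = 1_000_000_000            # 1 billion users
records_per_user = 10_000
bytes_per_user_record = 100
user_total = users * records_per_user * bytes_per_user_record
print(user_total / 1e15, "PB of user data")        # 1.0 PB

auctions = 10_000_000            # 10 million auctions
records_per_auction = 1_000
bytes_per_auction_record = 1_000
auction_total = auctions * records_per_auction * bytes_per_auction_record
print(auction_total / 1e12, "TB of auction data")  # 10.0 TB
```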
7. What were our Goals?
+ Construct a PB-scale Scylla cluster.
+ Load the database with data: on the order of 1 PB of user data and 1 TB of application data.
+ Run concurrent workloads against the user and application datasets, measuring throughput and latency:
■ 5M tps user workload (read-only and 80/20 R/W, high throughput)
■ 200K tps application workload (50/50 R/W, low latency)
+ Demonstrate the use of workload prioritization.
8. Bill of Materials
+ Scylla cluster: 20 x i3en.metal AWS instances, each having:
+ 96 vCPUs
+ 768 GiB RAM
+ 60 TB NVMe disk space
+ 100 Gbps network bandwidth
+ Load Generators: 50 x c5n.9xlarge AWS instances, each having:
+ 36 vCPUs
+ 96 GiB RAM
+ 50 Gbps network bandwidth
9. Software Used
+ Scylla Enterprise: version 2021.1.6
+ Cassandra-Stress: running over the Scylla shard-aware Java driver
+ Workload generator
+ Scylla-monitoring stack for metrics collection and presentation
+ Using Prometheus and Grafana
10. User Workload
The user keyspace was constructed as a key/value dataset.
+ 500 billion keys
+ Variable size values with a mean size of 600 bytes
+ Representing 1 PB of uncompressed text data with 3.33x compression ratio
+ LZ4 Compression
+ Replication Factor (RF) of 2
+ Consistency Level (CL): ONE
+ Keys were randomly selected in a uniform distribution.
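The talk does not show the actual schema, but a keyspace along these lines would match the description above (keyspace, table, and column names are hypothetical):

```cql
-- Hypothetical schema matching the user dataset description.
CREATE KEYSPACE user_ks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'replication_factor': 2};

CREATE TABLE user_ks.user_data (
  key   blob PRIMARY KEY,
  value blob   -- variable-size values, ~600 bytes mean
) WITH compression = {'sstable_compression': 'LZ4Compressor'};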
11. User Workload
The read-only query workload was generated using cassandra-stress.
+ Each of the 50 load generators used a normal distribution to draw random keys
out of its assigned 1/50 range of the keys.
+ threads=1000 fixed=100_000/s
+ Total of 5M read tps
+ Workload ran for 3 hours with 5 minutes warm-up time.
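A cassandra-stress invocation along these lines would generate such a workload per loader; this is a sketch only: the host name and key range are assumptions, as the real command line was not shown in the talk.

```shell
# One of 50 load generators; each draws keys from its own 1/50 slice
# of the keyspace (the range below is illustrative).
cassandra-stress read cl=ONE duration=180m \
  -pop 'dist=GAUSSIAN(1..10000000000)' \
  -rate threads=1000 fixed=100000/s \
  -node scylla-node-1
```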
12. User Workload
The 80/20 read/write query workload was generated using cassandra-stress as well.
+ Each of the 50 load generators used a normal distribution to draw random keys
out of its assigned 1/50 range of the keys.
+ threads=1000 fixed=100_000/s
+ Total of
■ 4M read tps
■ 1M write tps
+ Workload ran for 3 hours with 5 minutes warm-up time.
13. Application Workload
The application keyspace was constructed as a key/value dataset.
+ 6 billion keys
+ Fixed size values of 250 bytes
+ Representing about 3 TB of uncompressed binary data with 2x compression ratio
+ LZ4 Compression
+ Replication Factor (RF) of 2
+ Consistency Level (CL): QUORUM
+ Keys were randomly selected in a uniform distribution.
14. Application Workload
The 50/50 read/write query workload was generated using cassandra-stress.
+ Each of the 50 load generators used a normal distribution to draw random keys
out of its assigned 1/50 range of the keys.
+ threads=1000 fixed=4_000/s
+ Total of:
■ 100K read tps
■ 100K write tps
+ Workload ran for 3 hours with 5 minutes warm-up time.
20. Storage demands during ingestion
+ Today’s disks are able to handle multi-GB/s workloads
22. Storage demands during ingestion
+ 900 MB/s of commitlog writes per instance: 7.5M inserts/sec × 3000 bytes × RF(2) / 50 nodes.
+ Around 6 GB/s per instance of compaction I/O.
+ Overall: 20 nodes × 6 GB/s → 120 GB/s!
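The bandwidth figures above can be reproduced directly (plain Python; all numbers are taken from the slide as shown):

```python
# Reproduce the slide's ingestion-bandwidth arithmetic.
inserts_per_sec = 7_500_000   # cluster-wide insert rate during load
bytes_per_insert = 3_000      # bytes written per insert (slide's figure)
rf = 2                        # replication factor
nodes = 50                    # divisor used on the slide

commitlog_per_node = inserts_per_sec * bytes_per_insert * rf / nodes
print(commitlog_per_node / 1e6, "MB/s commitlog writes per instance")  # 900.0

compaction_per_node = 6e9     # ~6 GB/s of compaction I/O per instance
cluster_nodes = 20
print(cluster_nodes * compaction_per_node / 1e9, "GB/s cluster-wide")  # 120.0
```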
23. Incremental Compaction in Action
+ ICS creates and deletes equal-sized sstables that dramatically reduce
temporary space amplification during compaction.
24. Scaling system throughput
+ How much can the Scylla petabyte cluster be loaded while still providing
single-digit-millisecond 99th-percentile latency?
25. Concurrent workloads: R/W + Read-only
Workload: Application: 280K 50/50 R/W; User: 7M read-only
(throughput in transactions/second; latency in milliseconds)
+ Application write latency: 0.821 P50, 2.232 P99
+ Application read latency: 1.433 P50, 6.832 P99
+ User read latency: 0.885 P50, 6.350 P99
+ 7M user read ops/sec + 280K application 50/50 R/W ops/sec
+ Stable high throughput with <10 ms 99% latency.
26. Cache Efficiency
+ Note that the cache hit rate is only a little over 1% due to random
key/value reads.
+ Potential for BYPASS CACHE to further improve read throughput.
+ Previous tests showed that BYPASS CACHE may improve performance by 70%,
and an all-cached setup can be as much as 4x faster.
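BYPASS CACHE is a per-query CQL option in Scylla that skips the row cache for reads whose hit rate is too low to benefit from it; a sketch (keyspace and table names are hypothetical):

```cql
-- Read directly from sstables, without populating the row cache.
SELECT value FROM user_ks.user_data WHERE key = 0x00 BYPASS CACHE;
```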
27. Concurrent workloads: R/W + Read-only
Workload: Application: 200K 50/50 R/W; User: 5M read-only
(throughput in transactions/second; latency in milliseconds)
+ Application write latency: 0.632 P50, 1.398 P99
+ Application read latency: 1.046 P50, 2.279 P99
+ User read latency: 0.680 P50, 1.932 P99
+ 5M user read ops/sec + 200K application 50/50 R/W ops/sec
+ High throughput with low, ~2 ms application 99% latency.
30. Concurrent workloads: R/W + 80/20, with Workload Prioritization
(throughput in transactions/second; latency in milliseconds)
+ As the 80/20 user workload interfered with the application latency,
let's reduce its relative priority to better share the system resources.
Workload: Application: 200K R/W; User: 5M 80/20 R/W
Shares: Application 1000 before → 1000 after; User 1000 before → 500 after
+ Application write latency: 0.682 P50 / 2.454 P99 before → 0.354 P50 / 1.184 P99 after
+ User write latency: 0.326 P50 / 1.252 P99 before → 0.440 P50 / 3.244 P99 after
+ Application read latency: 1.195 P50 / 4.555 P99 before → 0.855 P50 / 3.731 P99 after
+ User read latency: 0.744 P50 / 3.709 P99 before → 1.043 P50 / 6.455 P99 after
31. Service Levels in Action
+ Each service level has its own per-shard queue for consuming CPU and I/O.
+ Application workload (200K ops/sec)
+ User workload (5M ops/sec)
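The shares shown on the previous slides are configured through CQL service levels attached to the roles running each workload; the service level and role names below are illustrative, and the exact syntax may vary by ScyllaDB Enterprise version:

```cql
-- Keep the latency-sensitive application workload at 1000 shares
-- and halve the bulk user workload to 500 shares.
CREATE SERVICE LEVEL application_sl WITH shares = 1000;
CREATE SERVICE LEVEL user_sl WITH shares = 500;
ATTACH SERVICE LEVEL application_sl TO application_role;
ATTACH SERVICE LEVEL user_sl TO user_role;
```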
33. Main Challenges
As expected, setting up and testing a petabyte-scale database was not trivial.
That said, it didn’t take any unreasonable effort.
+ Provisioning: it took time to find an AWS availability zone with enough on-demand instances
of the needed kind.
+ Hardware tuning:
+ Interrupt-handling CPUs had to be manually assigned to maximize throughput (a fix will be merged into our
out-of-the-box machine images).
+ cpupower governor set to "performance".
+ Benchmarking framework:
+ Cassandra-stress was not built for this scale (e.g., the default population distribution is too small).
+ The data collection library had issues with a large number of parallel machines.
34. Scylla Configuration
For running the benchmark, we used the following non-default configuration:
■ Node level
• 4 IRQ-serving CPUs (rather than 2 by default) - for handling high network throughput.
• mount -o discard (now default in OSS head-of-line)
■ scylla.yaml:
• compaction_static_shares: 100 - optimized for append-mostly workload (*)
• Head-of-line has improvements for compaction backlog controller
• compaction_enforce_min_threshold: true
■ Schema: compaction = {
'class': 'IncrementalCompactionStrategy',
'sstable_size_in_mb': 10000,
'space_amplification_goal': 1.25
} AND compression = {'sstable_compression': 'LZ4Compressor'};
35. Future Work
+ A whitepaper is coming up, expanding on this benchmark.
+ BYPASS CACHE:
+ Show the benefits of using the Scylla BYPASS CACHE CQL query option to optimize utilization for, e.g.,
random small-read workloads.
+ All-cached workload
+ Demonstrate maximum performance when the whole dataset fits in cache.
+ Bear in mind that the i3en.metal instances have 768 GB of memory each!