This document describes a key-key-value (KKV) store that Discord built on ScyllaDB to store entities identified by composite primary keys. The store mitigates performance problems from tombstones and large partitions by using application tombstones and automatic partition splitting. It provides APIs for creating, getting, deleting, and range-scanning entities under a parent identifier. When a partition grows too large, the store doubles the shard count and copies data to the new shards in a resilient, resumable process. This generic datastore has supported over 20 use cases at Discord with no production incidents attributed to ScyllaDB.
1. Key-Key-Value Store: Generic NoSQL Datastore with Tombstone Reduction and Automatic Partition Splitting
Stephen Ma, Senior Software Engineer
2. Sining (Stephen) Ma
■ Sr. Software Engineer working on the persistence infrastructure team at Discord
■ Building and maintaining services on Discord’s ScyllaDB clusters
3. Agenda
● Context: Why did we want to build a Key-Key-Value (KKV) store?
● Features: What features does the KKV store provide?
● Implementation: What methods does the KKV store provide, and what were the implementation challenges?
5. Why?
● We want to support storing and querying entities identified by composite primary keys, where the composite key consists of a parent ID and an entity ID (a schema sketch follows this list).
a. For example, an emoji in a server is identified by a server ID and an emoji ID.
b. Data is stored as a binary blob.
c. Range scans of entity IDs within one parent ID are supported.
● A very easy and fast way to onboard new use cases.
a. No new table creation or database schema migration when iterating on the data model.
● Hide ScyllaDB-specific pitfalls from developers.
a. Mitigate the potential performance costs associated with tombstones.
b. Avoid storing too many entities in one ScyllaDB partition.
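The talk does not show the table definition, but a minimal sketch of a schema that would support this access pattern is below. All names are illustrative, and the shard number implied by the resharding slides later is omitted here for clarity.

// Hypothetical CQL schema for a KKV-style table, embedded as a Rust constant
// since the service itself is written in Rust. This is a sketch of the access
// pattern, not Discord's actual schema.
const CREATE_KKV_TABLE: &str = r#"
CREATE TABLE IF NOT EXISTS kkv.entities (
    namespace  text,    -- one namespace per use case
    parent_id  bigint,  -- e.g. a server ID
    entity_id  bigint,  -- e.g. an emoji ID
    data       blob,    -- opaque binary payload (protobuf, serialized by clients)
    PRIMARY KEY ((namespace, parent_id), entity_id)
)
"#;

Clustering on entity_id within the (namespace, parent_id) partition is what makes range scans of entity IDs under one parent cheap.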
6. Why ScyllaDB and not Bigtable?
● Bigtable pros:
a. Keys are stored as strings, so a key can hold an arbitrary number of parts with flexible data types.
■ e.g. 822219052402606132#857417264214442014
■ Bigtable supports scanning rows given a key prefix.
b. A GCP service, so administration is simple and operational work is easy.
● Bigtable cons:
a. GCP Cloud Bigtable is one zone per cluster.
b. Cloud Bigtable provides read-after-write consistency only if clients use single-cluster routing, but single-cluster routing does not provide auto-failover.
8. Features
● All entities of one use case are stored under one namespace.
● The stored data payload of an entity is in protobuf binary format.
● There is an upper limit on the data size per entity.
● No support for partially updating a subset of fields in an entity.
● No guarantees of synchronized reads/writes on an entity; guard your code accordingly.
9. Implementation
APIs provided by the KKV store & internal implementation
10. APIs Provided by the KKV Store
● Create a new entity or update an existing one.
● Get an entity by a parent ID and entity ID.
● Delete an entity given a parent ID and entity ID.
● Range-scan multiple entities under a parent ID (an API sketch follows this list).
a. Specify a start entity ID and/or an end entity ID.
b. Limit the number of entities returned.
c. Return unsorted entities,
d. or return sorted entities.
● Binary data serialization and deserialization are done by the clients.
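Since the service is written in Rust, here is a sketch of what this client API surface could look like. The trait, types, and signatures are assumptions for illustration, not Discord's actual interface (requires Rust 1.75+ for async fns in traits).

/// Options for a range scan under one parent ID.
pub struct ScanOptions {
    pub start_entity_id: Option<u64>, // lower bound on entity IDs, if any
    pub end_entity_id: Option<u64>,   // upper bound on entity IDs, if any
    pub limit: Option<usize>,         // cap on the number of entities returned
    pub sorted: bool,                 // sorted vs. unsorted results
}

pub trait KkvStore {
    type Error;

    /// Create a new entity or overwrite an existing one (no partial updates).
    async fn upsert(
        &self,
        parent_id: u64,
        entity_id: u64,
        data: Vec<u8>, // clients serialize the protobuf payload themselves
    ) -> Result<(), Self::Error>;

    /// Get an entity by parent ID and entity ID.
    async fn get(
        &self,
        parent_id: u64,
        entity_id: u64,
    ) -> Result<Option<Vec<u8>>, Self::Error>;

    /// Delete an entity given a parent ID and entity ID.
    async fn delete(&self, parent_id: u64, entity_id: u64) -> Result<(), Self::Error>;

    /// Range-scan entities under a parent ID, returning (entity_id, data) pairs.
    async fn scan(
        &self,
        parent_id: u64,
        opts: ScanOptions,
    ) -> Result<Vec<(u64, Vec<u8>)>, Self::Error>;
}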
11. Mitigating the Performance Impact of Tombstones with Application Tombstones
● Excessive tombstones can increase read latency or cause read failures in ScyllaDB, especially for range scan queries.
● When an entity is deleted, ScyllaDB doesn’t purge it but writes a tombstone.
● If a tombstone is not written to all 3 replicas, deleted data can be resurrected after the tombstones are removed as part of compaction.
● The application-tombstone design allows setting gc_grace_seconds to 0 (one possible mechanism is sketched below).
gc_grace_seconds = 0 !!
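The slides don't show the mechanism, but one way to read the application-tombstone idea is sketched below: a delete is issued as an ordinary write of a deleted marker rather than a CQL DELETE (which would create a database tombstone), and reads filter marked rows out. Because deletes become plain writes, a replica that misses one is fixed by normal write repair rather than tombstone garbage collection, which is what makes gc_grace_seconds = 0 safe. The statements and column names are invented.

// Hypothetical application-tombstone statements. Note: the payload is
// overwritten with an empty blob (0x) rather than set to null, because
// writing null in CQL would itself create a cell tombstone.
const APP_DELETE: &str = "UPDATE kkv.entities \
     SET deleted = true, data = 0x \
     WHERE namespace = ? AND parent_id = ? AND entity_id = ?";

const GET: &str = "SELECT data, deleted FROM kkv.entities \
     WHERE namespace = ? AND parent_id = ? AND entity_id = ?";

/// Reads treat a row carrying the application tombstone as absent.
fn visible(row: Option<(Vec<u8>, bool)>) -> Option<Vec<u8>> {
    match row {
        Some((data, false)) => Some(data), // live entity
        _ => None,                         // missing or app-tombstoned
    }
}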
13. Partition Resharding
If all entities under one parent ID are stored in one shard, we can end up with large partitions. Large partitions can cause hot-spotting and query timeouts.
Solution:
● Each shard has a limit on the maximum number of entities it may hold.
● When more than that number of entities are stored in one shard, send a resharding message to Google Pub/Sub.
● The resharding process doubles the total shard count and then copies all entities from the old shards into the new shards (the shard mapping is sketched below).
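The slides don't show the shard-addressing scheme, so the sketch below is one plausible reading: a hash of the entity ID picks the shard, and each (parent ID, shard) pair becomes its own ScyllaDB partition. The hash choice and the limit constant are invented.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical per-shard entity limit; the talk doesn't give the number.
const MAX_ENTITIES_PER_SHARD: u64 = 100_000;

/// Map an entity to a shard, so one parent's entities are spread across
/// `shard_count` partitions instead of a single large one.
fn shard_for(entity_id: u64, shard_count: u64) -> u64 {
    let mut h = DefaultHasher::new();
    entity_id.hash(&mut h);
    h.finish() % shard_count
}

/// Decide whether a shard has overflowed and a resharding message should be
/// published (the talk uses Google Pub/Sub for this).
fn needs_resharding(entity_count: u64) -> bool {
    entity_count > MAX_ENTITIES_PER_SHARD
}

/// Doubling keeps shard counts a power of two; after the resharding job
/// copies the data, every entity is re-addressed with the new count.
fn next_shard_count(current: u64) -> u64 {
    current * 2
}

One consequence of hash-based placement would be that a range scan has to fan out to every shard of the parent and merge the results, which fits the sorted/unsorted scan options above.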
14. Resharding Workflow
● Store the resharding state metadata in ScyllaDB between each step (see the state-machine sketch below).
● Updating the partition info table must grab a distributed lock.
● Copying or deleting data in a shard enables client speculative retry, allowing efficient retries on data copy or cleanup failures.
Workflow: Validate → Copy data from old partitions to new partitions → Update partition info → Clean up data in old partitions → Complete
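The five steps map naturally onto a persisted state machine; below is a minimal sketch with assumed names. Persisting the state in ScyllaDB between transitions is what lets a crashed or retried resharding job resume from the last completed step instead of starting over.

/// The resharding steps from this slide as a state machine; names are
/// illustrative. The current state is written to a metadata table in
/// ScyllaDB before each transition.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ReshardingState {
    Validate,            // confirm the shard still needs resharding
    CopyData,            // copy entities from old partitions to new ones
    UpdatePartitionInfo, // publish the new shard count (under the distributed lock)
    CleanupOldData,      // delete entities left behind in the old partitions
    Complete,
}

impl ReshardingState {
    /// Advance one step; a failed step is retried from the persisted state.
    fn next(self) -> Self {
        use ReshardingState::*;
        match self {
            Validate => CopyData,
            CopyData => UpdatePartitionInfo,
            UpdatePartitionInfo => CleanupOldData,
            CleanupOldData | Complete => Complete,
        }
    }
}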
15. After Shipping …
● Written in Rust
● Serving 20+ use cases in production, and the number keeps growing
● A new use case takes one day to implement and ship to production
No production incident in the KKV store service has been caused by ScyllaDB
16. Thank You
Stay in Touch
Stephen Ma
siningma
www.linkedin.com/in/siningma/