The document introduces the Scylla Operator for Kubernetes, which provides a management layer for Scylla on Kubernetes. It addresses some limitations of using StatefulSets alone to run Scylla, such as safe scale down operations and tracking member identity. The operator implements the controller pattern with custom resources to deploy and manage Scylla clusters on Kubernetes. It handles tasks like cluster creation and scale up/down while addressing issues like local storage failures.
3. Problem Statement
Scylla:
● Great database
● Requires operational expertise
Kubernetes:
● Great workload management platform
Can we leverage Kubernetes to write a great management layer for Scylla?
4. Pod
[Diagram: kubectl apply -f saves a new Pod to the API Server on the Master, which stores it in etcd. The Scheduler, running next to Various Controllers, schedules the new Pod onto Node 4. Nodes 1 to 4 each run a kubelet and host Pods such as nginx, MySQL and tomcat.]
5. StatefulSet
Deploys and scales stateful software.
Provides guarantees for:
■ Pod uniqueness
● At most 1 of each Pod exists at any given time
■ Pod ordering
● Rolling Update and Deployment
■ Persistent network and storage identity
● DNS record and own Persistent Volume
6. StatefulSet Controller
[Diagram: kubectl apply -f saves a Headless Service and a StatefulSet with spec.replicas: 3 to the API Server. The StatefulSet Controller creates Pods scylla-0, scylla-1 and scylla-2 in order, each with its own DNS name (scylla-0.scylla.default.svc.cluster.local and so on), while status.replicas and status.readyReplicas climb from 0 to 3.]
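The ordering and naming guarantees shown on this slide can be sketched as a toy simulation. This is not the real StatefulSet Controller; the function names `pod_dns_name` and `scale_up_steps` are made up for illustration, and readiness is assumed to happen instantly.

```python
def pod_dns_name(statefulset: str, ordinal: int, service: str,
                 namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Stable DNS name of a StatefulSet Pod behind a headless Service."""
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"


def scale_up_steps(statefulset: str, service: str, replicas: int):
    """Yield (ready_replicas, dns_name) as Pods are created in ordinal order."""
    for ordinal in range(replicas):
        # With the default ordered policy, Pod N is created only after
        # Pods 0..N-1 are running and ready (assumed instant here).
        yield ordinal + 1, pod_dns_name(statefulset, ordinal, service)


steps = list(scale_up_steps("scylla", "scylla", replicas=3))
for ready, name in steps:
    print(f"readyReplicas={ready} {name}")
```

Because the name depends only on the StatefulSet name and the ordinal, a Pod that is rescheduled keeps the same DNS record.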
11. StatefulSet: Confined to 1 Rack
[Diagram: a Scylla Cluster contains Datacenters, a Datacenter contains Racks, and a Rack contains Members. One StatefulSet maps to one Rack, and each of its Pods is one Member.]
Multiple Racks?
Multiple Datacenters?
12. Safe Scale Down
● Want to leave
○ nodetool decommission
● Stream data
● Leave
[Diagram: Scylla Ring with six Members at ring positions 0, 44, 88, 132, 176 and 220. member-0 to member-4 are Up; member-5 changes from Up to Leaving.]
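Why the decommission step matters can be shown with a toy token ring. This is a sketch under strong assumptions: replication factor 1, one token per Member, and a hypothetical `Ring` class that has nothing to do with Scylla's real implementation. `decommission` streams rows before the Member leaves; `kill` is what happens when the Pod is simply deleted.

```python
import bisect


class Ring:
    """Toy token ring with replication factor 1 (an assumption for brevity)."""

    def __init__(self, tokens):
        self.tokens = dict(tokens)  # token -> member name
        self.data = {member: {} for member in self.tokens.values()}

    def owner(self, key_token):
        """First member whose token is >= key_token, wrapping around."""
        ordered = sorted(self.tokens)
        i = bisect.bisect_left(ordered, key_token)
        return self.tokens[ordered[i % len(ordered)]]

    def put(self, key_token, value):
        self.data[self.owner(key_token)][key_token] = value

    def get(self, key_token):
        return self.data[self.owner(key_token)].get(key_token)

    def _drop_tokens(self, member):
        self.tokens = {t: m for t, m in self.tokens.items() if m != member}

    def decommission(self, member):
        """Safe removal: hand rows to their new owners, then leave."""
        rows = self.data[member]
        self._drop_tokens(member)      # ownership as it looks without the member
        for key_token, value in rows.items():
            self.put(key_token, value)  # stream data
        del self.data[member]           # leave

    def kill(self, member):
        """Unsafe removal: the member's rows vanish with it."""
        self._drop_tokens(member)
        del self.data[member]


safe = Ring({i * 44: f"member-{i}" for i in range(6)})
safe.put(200, "row")
safe.decommission("member-5")
print(safe.owner(200), safe.get(200))
```

With `kill` instead of `decommission`, the same lookup finds nothing, which is the data-loss case a management layer has to prevent.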
13. StatefulSet: Unsafe Scale Down
[Diagram: kubectl apply -f saves the StatefulSet with spec.replicas lowered from 3 to 2. The StatefulSet Controller removes Pod scylla-2 (scylla-2.scylla.default.svc.cluster.local), and status.replicas and status.readyReplicas drop to 2. scylla-0 and scylla-1 keep running.]
Scale Down?
● Data not streamed!
● Scylla Ring: scylla-0 Up, scylla-1 Up, scylla-2 changes from Up to Down
● Potential Data Loss!
14. StatefulSet: Cannot track Member identity
[Diagram: Pods scylla-0, scylla-1 and scylla-2 managed by the StatefulSet Controller. A Node fails and a Member shows up as Joining.]
Node Fail, then Member Joining: Replace Member? Add new Member?
Must know Member identity beforehand!
15. Vanilla Solution: StatefulSet
Problems with:
■ Seeds
■ Multi-zone deployment
■ Scale Down
■ Loss of Persistence
■ Backups/Restores
■ Extensibility
What if we could create management software in the image of Kubernetes Controllers?
18. [Diagram: the Controller watches the Cluster Custom Resource and writes one StatefulSet per Rack (Rack 1, Datacenter 1; Rack 1, Datacenter 2; ... Rack N, Datacenter M). Each Pod runs a Sidecar that uses JMX/HTTP, and each Member has its own Service (Static IP). Communication goes through Labels / Annotations.]
20. Sidecar
CRD + Controller + Sidecar
Sidecar needed to:
■ Set up config files
■ Install plugins at startup
■ Backup and Restore functionality
■ Future extensibility
[Diagram: a Member Pod with a Sidecar that uses JMX/HTTP.]
21. An Alternative to DNS Records
Much Requested Feature:
■ What if we could have static IPs?
Services already have a static IP, called ClusterIP.
Solution: ClusterIP Service per Pod
Drawbacks?
■ Performance: iptables can handle a few hundred Members, IPVS can handle thousands with no problem.
■ ClusterIP CIDR Depletion: Usually a /12 IP Block, so plenty of addresses.
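A ClusterIP Service per Pod could look roughly like the manifest built below. This is a sketch, not the operator's actual output: the helper `member_service` is hypothetical, the single CQL port is a simplification, and the 10.96.0.0/12 range is assumed only because the slides use 10.96.0.x addresses. The selector relies on the `statefulset.kubernetes.io/pod-name` label that Kubernetes sets on StatefulSet Pods.

```python
import ipaddress


def member_service(pod_name: str, namespace: str = "default") -> dict:
    """Manifest for a ClusterIP Service that selects exactly one StatefulSet Pod."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": pod_name, "namespace": namespace},
        "spec": {
            "type": "ClusterIP",
            # Label set by the StatefulSet controller on each Pod it creates.
            "selector": {"statefulset.kubernetes.io/pod-name": pod_name},
            # 9042 is the CQL native transport port; a real Member needs more ports.
            "ports": [{"name": "cql", "port": 9042}],
        },
    }


# Assumed Service CIDR; a /12 block leaves 20 bits for addresses.
service_cidr = ipaddress.ip_network("10.96.0.0/12")
print(member_service("scylla-0")["spec"]["selector"])
print(service_cidr.num_addresses)
```

A /12 block holds 2^20 addresses, so one Service per Member is unlikely to exhaust the range.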
23. Cluster Creation & Scale Up
[Diagram: kubectl apply saves a new Scylla Cluster object to the API Server. Spec: eu-west1-b: 1 Members, eu-west1-c: 2 Members. Status starts at 0 Members and 0 ReadyMembers for both racks. The Scylla Operator writes one StatefulSet per rack and raises replicas step by step (eu-west1-b from 0 to 1, eu-west1-c from 0 to 1 to 2). Pods scylla-eu-west1-b-0 (10.96.0.1), scylla-eu-west1-c-0 (10.96.0.3) and scylla-eu-west1-c-1 (10.96.0.4) each get a Member Service, and the Status counts rise until they match the Spec.]
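The step-by-step scale up on this slide can be sketched as follows. The rule "add one Member at a time and wait until every existing Member is ready" is an assumption read off the diagram, and `RackStatus`, `next_scale_up` and `run_until_converged` are hypothetical names, not the operator's code.

```python
from dataclasses import dataclass


@dataclass
class RackStatus:
    members: int = 0
    ready_members: int = 0


def statefulset_name(cluster: str, rack: str) -> str:
    """One StatefulSet per rack, named after cluster and rack."""
    return f"{cluster}-{rack}"


def next_scale_up(spec: dict, status: dict):
    """Return (rack, new_replicas) for one scale-up step, or None."""
    # Assumed rule: never add a Member while another one is still joining.
    if any(s.ready_members < s.members for s in status.values()):
        return None
    for rack, desired in spec.items():
        current = status[rack].members
        if current < desired:
            return rack, current + 1
    return None


def run_until_converged(cluster: str, spec: dict):
    """Apply scale-up steps until the status matches the spec."""
    status = {rack: RackStatus() for rack in spec}
    created = []
    while (step := next_scale_up(spec, status)) is not None:
        rack, replicas = step
        created.append(f"{statefulset_name(cluster, rack)}-{replicas - 1}")
        # Pretend the new Pod joined the ring and became ready.
        status[rack] = RackStatus(members=replicas, ready_members=replicas)
    return created, status


created, status = run_until_converged(
    "scylla", {"eu-west1-b": 1, "eu-west1-c": 2})
print(created)
```

The Pod names fall out of the StatefulSet naming scheme, which is why the slide shows scylla-eu-west1-b-0 and scylla-eu-west1-c-1.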
24. Scale Down
[Diagram: kubectl apply saves a Cluster change that scales down eu-west1-c from 2 Members to 1. The Scylla Operator sees "Cluster changed" and marks the Member Service of scylla-eu-west1-c-1 (10.96.0.4) with decommissioned: false. The Sidecar in that Pod runs nodetool decommission, the Member streams its data and shows as Leaving in the Scylla Ring, while scylla-eu-west1-b-0 and scylla-eu-west1-c-0 stay Up. When the Member Service reads decommissioned: true, the eu-west1-c StatefulSet goes from replicas: 2 to replicas: 1.]
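The handshake between Operator and Sidecar on this slide can be sketched as two small functions that talk only through a label on the Member Service. The label key `decommissioned` is taken from the slide; everything else (function names, the string values, clearing the labels after the scale down) is an assumption for illustration.

```python
def reconcile_scale_down(desired: int, replicas: int, service_labels: dict):
    """One Operator pass for a single rack.

    service_labels are the labels on the Member Service of the
    highest-ordinal Pod. Returns (new_replicas, new_labels).
    """
    if replicas <= desired:
        return replicas, service_labels
    flag = service_labels.get("decommissioned")
    if flag is None:
        # Ask the Sidecar to start decommissioning this Member.
        return replicas, {**service_labels, "decommissioned": "false"}
    if flag == "true":
        # Data has been streamed, so shrinking the StatefulSet is safe.
        # The new last Member's Service carries no flag yet.
        return replicas - 1, {}
    return replicas, service_labels  # decommission still in progress


def sidecar_step(service_labels: dict) -> dict:
    """One Sidecar pass: react to the Operator's request."""
    if service_labels.get("decommissioned") == "false":
        # A real Sidecar would run `nodetool decommission` here and wait.
        return {**service_labels, "decommissioned": "true"}
    return service_labels


replicas, labels = 2, {}
replicas, labels = reconcile_scale_down(1, replicas, labels)  # request
labels = sidecar_step(labels)                                 # decommission
replicas, labels = reconcile_scale_down(1, replicas, labels)  # shrink
print(replicas, labels)
```

The StatefulSet is only shrunk in the pass after the flag flips to true, so the Pod is never deleted before its data has been streamed.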
25. Local Storage vs Network Attached
Local NVMe SSD
■ Fast
■ Ephemeral
Network Attached Storage (AWS EBS, Google Persistent Disk)
■ Slow
■ Fault-tolerant
Scylla handles replication => Use Local Storage!
Kubernetes v1.10: Local Persistent Volumes in Beta
26. Local Storage Failure Scenarios
■ Disk Misbehaves (block errors, deteriorating performance)
● Pod still runs
● Unhandled by K8s
■ Disk Fails (mount point disappears)
● Pod fails to start
● Unhandled by K8s
■ Node Fails (with the disk on it)
● Pod fails to be scheduled
● Unhandled by K8s
Common in the Cloud!
27. Node Fail
[Diagram: member-0 (10.96.0.1), member-1 (10.96.0.3) and member-2 (10.96.0.4) each run on a Node with a local disk mounted at /mnt/ssd1, and each has a Member Service. Node 3, which hosts member-1, fails. An Admin or Fencing Software deletes Node 3. The Scylla Operator sees "StatefulSet changed" and recreates the PVC. member-1 starts again on another Node with an Empty Disk and keeps its Member Service at 10.96.0.3.]
30. Take away
Kubernetes helps to manage Scylla, but has some limitations:
■ CPU Pinning
● Huge performance gains.
● Must be enabled in the kubelet.
● Many managed solutions don’t enable it.
■ Local Storage
● Supported but still needs improvement.
● Some vendors don’t offer high storage machines for K8s.
■ Multi-Region Clusters
● Still an unsolved problem.
“Cost of Containerization” by Moreno Garcia:
https://www.scylladb.com/2018/08/09/cost-containerization-scylla/
31. Future Work
Scylla Operator
■ Repairs with Scylla Manager
■ Multi-Region Clusters
● Very early support in Kubernetes
● LoadBalancer per Pod is a possible workaround
■ Backups and Restores
■ File your own issue:
● https://github.com/scylladb/scylla-operator
Kubernetes
■ Better Support for Local Storage
● Monitoring, scheduling
32. Thank you. Stay in touch.
Any questions?
Yannis Zarkadas
yanniszark@arrikto.com
@yanniszark
Editor's notes
Overview of distributed nature of Scylla
Overview: each member stores a different portion of the data
Intro to Kubernetes:
Smallest unit of processing: Pod
Declarative nature: the user declares the desired state, and Kubernetes works to satisfy it
Kubernetes’ solution for running DBs: StatefulSet
Example of how the StatefulSet works
Controller pattern that appears everywhere in K8s:
1. Observe desired state
2. Calculate actual state
3. Diff and take action
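The three steps above can be sketched as a single function. This is a generic illustration of the controller pattern, not code from Kubernetes or the operator; `reconcile` and the action tuples are made-up names.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Diff desired state against actual state and return actions to take."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


desired = {"scylla-0": {"cpu": 2}, "scylla-1": {"cpu": 2}}
actual = {"scylla-0": {"cpu": 1}, "scylla-9": {"cpu": 2}}
print(reconcile(desired, actual))
```

A controller runs this kind of diff repeatedly, so once the actions have been applied the next pass returns an empty list.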
What is missing to enable us to build our own controller? Custom Objects.
CRDs enable us to store custom objects in etcd.
Operator pattern. Controller acts as a human operator would.
Examples of how our design addresses each of the StatefulSet’s shortcomings.