3. [Architecture diagram: DMaaS spanning on-premises, Google, and Packet.net; Analytics, Alerting, Compliance, Policies, Advisory, and Chatbot on top of a Declarative Data Plane API]
4. Resistance Is Futile
• K8s is based on the original Google Borg paper
• Containers are the “unit” of management
• Mostly web-based applications
• Typically the apps were stateless — if you agree there is such a thing
• In its most simplistic form, k8s is a control loop
• Converge to the desired state based on declarative intent provided by the DevOps persona (see the sketch after this list)
• Abstract away underlying compute cluster details and decouple apps from infrastructure: avoid lock-in
• Have developers focus on application deployment and not worry about the environment it runs in
• HW independent (commodity)
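To make the control-loop idea concrete, here is a minimal sketch of declarative intent: a Deployment manifest that states a desired replica count, which Kubernetes continuously reconciles against observed state. The names are illustrative; the image is reused from the later slides.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                # illustrative name
spec:
  replicas: 3              # desired state: k8s converges the observed count to 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: k8s.gcr.io/test-webserver   # image borrowed from the examples below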
6. Persistency in Volatile Environments
• Container storage is ephemeral; data is only stored during the lifetime of the container(s)
• This either means that temporary data has no value or that it can be regenerated
• Sharing data between containers is also a challenge — need to persist
• In the case of serverless — the intermediate state between tasks is ephemeral
• The problem then: containers need persistent volumes in order to run stateful workloads
• While doing so: abstract away the underlying storage details and decouple the data from the underlying infra: avoid lock-in
• The “bar” has been set in terms of expectations by the cloud providers, e.g. GCE PD, EBS
• Volumes available at multiple DCs and/or regions and replicated
7. Data Loss Is Almost Guaranteed
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-pd
      name: test-volume
  volumes:
  - name: test-volume
    hostPath:
      # directory location on host
      path: /data
Unless…
8. Use a “Cloud” Disk
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-pd
      name: test-volume
  volumes:
  - name: test-volume
    # This GCE PD must already exist!
    gcePersistentDisk:
      pdName: my-data-disk
      fsType: ext4
9. Evaluation and Progress
• In both cases we tie ourselves to a particular node — that defeats the agility found natively in k8s, and it fails to abstract away details
• We are cherry-picking pets from our herd
• Anti-pattern — easy to say and hard to avoid in some cases
• The second example allows the PV to be mounted on different nodes, but requires volumes to be created prior to launching the workload
• Good — not great
• More abstraction through community efforts around Persistent Volumes (PV), Persistent Volume Claims (PVC) and CSI
• The Container Storage Interface (CSI) handles vendor-specific needs before, for example, mounting the volume
• Avoid a wildfire of “volume plugins” or “drivers” in the k8s main repo
11. Summary So Far
• Register a set of “mountable” things to the cluster (PV)
• Take ownership of a “mountable” thing in the cluster (PVC)
• Refer in the application to the PVC (sketched below)
• Dynamic provisioning: create PVs ad hoc when a claim refers to something that does not exist yet
• Removes the need to preallocate them (is that a good thing?)
• The attaching and detaching of volumes to nodes is standardised by means of CSI, a gRPC interface that handles the details of creating, attaching, staging, destroying, etc.
• Vendor-specific implementations are hidden from the users
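A minimal sketch of that flow, assuming a dynamically provisioned claim (class name and size are illustrative): the claim asks for storage, and the Pod refers only to the claim, never to the backing volume or the vendor.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim               # illustrative name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard     # assumed class; triggers dynamic provisioning
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-pd
      name: test-volume
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: test-claim      # the app only knows the claim, not the vendor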
12. The Basics — Follow the Workload
[Diagram: a Pod references a PVC; the volume follows the workload from Node to Node]
13. Problem Solved?
• How does a developer configure the PV such that it has exactly the features required for that particular workload?
• Number of replicas, compression, snapshots and clones (opt in/out); see the sketch after this list
• How do we abstract away differences between storage vendors when moving to/from a private or public cloud?
• Differences in replication approaches — usually not interchangeable
• Abstract away access protocol and feature mismatch
• Provide a cloud-native storage “look and feel” on premises?
• Don’t throw away our million-dollar existing storage infra
• GKE On-Prem, AWS Outposts — if you are not going to the cloud, it will come to you; resistance is futile
• Make data as agile as the applications it serves
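One common way to express those per-workload features is as parameters on a StorageClass, which the claim then selects. The parameter names below are purely hypothetical and vendor-specific, not a real driver’s API; they only illustrate the pattern.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-replicated            # hypothetical class name
provisioner: example.vendor/csi    # hypothetical CSI driver name
parameters:
  replicaCount: "3"                # hypothetical parameter: number of replicas
  compression: "on"                # hypothetical parameter: opt in to compression
  snapshots: "enabled"             # hypothetical parameter: opt in to snapshots/clones
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-replicated   # the workload selects its features via the class
  resources:
    requests:
      storage: 50Gi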
14. Data Gravity
• As data grows — it has the tendency to pull applications towards it (gravity)
• Everything revolves around the sun, and it dominates the planets
• Latency, throughput, IO blender
• If the sun goes supernova — all your apps circling it will be gone instantly
• Some solutions involve replicating the sun towards some other location in the “space-time continuum”
• It works — but it exacerbates the problem
17. Cloud Native Architecture?
• Applications have changed, and somebody forgot to tell storage
• Cloud native applications are distributed systems themselves
• May use a variety of protocols to achieve consensus (Paxos, Gossip, etc.)
• Is a distributed storage system still needed?
• Designed to fail and expected to fail
• Across racks, DCs, regions and providers, physical or virtual
• Scalability batteries included
• HAProxy, Envoy, NGINX
• Datasets of individual containers are relatively small in terms of IO and size
• Prefer having a collection of small stars over a big sun?
• The rise of cloud native languages such as Ballerina, Metaparticle, etc.
18. HW / Storage Trends
• Hardware trends force a change in the way we do things
• 40GbE and 100GbE are ramping up, RDMA capable
• NVMe and NVMe-oF (transport — works on any device)
• Increasing core counts — concurrency primitives built into languages
• Storage limitations bubble up in SW design (infra as code)
• “Don’t do this because of that” — “don’t run X while I run my backup”
• Friction between teams creates “shadow IT” — the (storage) problems start when we move back from the dark side of the moon into the sun
• “We simply use DAS — as there is nothing faster than that”
• Small stars, that would work — but no “enterprise features”?
• “They have to figure that out for themselves”
• Seems like storage is an agility anti-pattern?
20. The Persona Changed
• Deliver fast and frequently
• Infrastructure as code, declarative intent, GitOps, ChatOps
• K8s as the unified cross-cloud control plane (control loop)
• So what about storage? It has not changed at all
21. The Idea
[Diagram: manifests express intent; stateless containers (Container 1, 2, 3) are backed by stateful Data Containers, all running on any server, in any cloud]
22. Design Constraints
• Built on top of the substrate of Kubernetes
• That was a bet we made ~2 years ago that turned out to be right
• Not yet another distributed storage system; small is the new big
• Not to be confused with not scalable
• One on top of the other, an operational nightmare?
• Per workload: using declarative intent defined by the persona
• Runs in containers for containers — so it needs to run in user space
• Make volumes omnipresent — compute follows the storage?
• Where is the value? Compute or the data that feeds the compute?
• Not a clustered storage instance, but rather a cluster of storage instances
29. User Space and Performance
• NVMe as a transport is a game changer not just for its speed potential, but also due to its relentless break away from the SCSI layer (1978)
• A lot of similarities with InfiniBand technology, found in HPC for many years (1999, as a result of a merger)
31. HW Changes Enforce A Change
• With these low-latency devices, CPUs are becoming the bottleneck
• Post Spectre/Meltdown, syscalls have become more expensive than ever
35. K8S as a Control Loop
[Diagram: the primary control loop (k8s): YAML intent is submitted to the K8s master (OP, Sched, API servers); the kubelet converges actual state towards desired state (+/-)]
36. Extending the K8S Control Loop
[Diagram: YAML intent drives the primary loop (k8s, kubelet/k8s++), while a secondary loop (MOAC) adapts and reconciles storage state (Adapt, RefMO) alongside it]
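A common way to add such a secondary loop is to define a custom resource that an out-of-tree controller (here, something like MOAC) watches and reconciles next to the primary loop. The group, kind and fields below are hypothetical and only illustrate the pattern.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: volumeintents.storage.example.io   # hypothetical <plural>.<group>
spec:
  group: storage.example.io                # hypothetical API group
  scope: Namespaced
  names:
    kind: VolumeIntent
    plural: volumeintents
    singular: volumeintent
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:        # hypothetical field: desired replica count
                type: integer
              protocol:        # hypothetical field: e.g. nvmf, iscsi
                type: string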
37. Raising the Bar — Automated Error Correction
[Diagram: CAS test pipeline: FIO jobs replay the block IO patterns of various apps while kubectl scales workloads up and down; logs and telemetry feed a regression DB and AI/ML to learn how failures impact apps, all behind the Declarative Data Plane API]