2. What even is a Kubernete?
● In Greek: Κυβερνήτης
● Translated to English: Commander
● Some say Helmsman…?
3. Why Kubernetes?
● Previous tool (homegrown)
○ Built images (AMIs) based on composable “layers”.
○ Orchestrated AWS primitives (ASGs, LCs, ELBs).
○ Slow to build; the base image was hard to maintain.
○ Little documentation.
○ Maintained primarily by one person.
○ Became a large collection of Jenkins wrapper scripts over time.
● Kubernetes
○ Leverage a growing, active community for support.
○ Leverage the shared knowledge and expertise of many.
○ Good documentation!?
○ Enable hiring people who already know it.
○ Build upon and use well-known CI and deployment patterns.
4. A few downsides to Kubernetes
● The rate of change is a bit challenging.
● Many related projects come and go, keeping current is hard.
● Cloud provider specifics are often trickier than they first look.
● Only the Kubernetes core has been load-tested for scale. Finding out which
other pieces don’t scale is super “fun”.
5. Initial requirements
● Must be HA.
● Shared cluster across multiple teams.
○ Authorization based on groups managed elsewhere.
○ Network policy support for defense in depth.
● Low-latency networking.
● AWS IAM integration for applications.
● High level of instrumentation & introspection.
6. So simple in theory… in reality it was far more complicated than this
8. Management & disaster recovery
● Chronological order
○ kube-up.sh
○ kube-aws (CloudFormation)
○ Terraform
○ Troposphere
○ CoreOS Tectonic (Terraform+)
○ Kops <- what we’re still using today
● Kops + Terraform (network level)
○ Infrastructure as code
○ Kops has cluster introspection, rolling-updates are possible
○ Lesson learned: managing upgrades on a per-instance-group level is safest.
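That per-instance-group workflow can be sketched with the standard kops CLI; the cluster and instance-group names below are placeholders, and flags may vary slightly between kops versions:

```shell
# Edit one instance group at a time (e.g. bump instance type or image).
kops edit ig nodes-general --name example.k8s.local

# Apply the change to cloud resources.
kops update cluster --name example.k8s.local --yes

# Roll only that instance group, leaving the rest of the cluster untouched.
kops rolling-update cluster --name example.k8s.local \
  --instance-group nodes-general --yes
```

Rolling one group at a time limits the blast radius of a bad upgrade to a single pool of nodes.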
9. Networking
So many options! A wide variety of use cases.
● CNI (Container Network Interface) was still a new standard.
● We didn’t need Layer 2 features.
● Performance (especially low latency) was important.
● NetworkPolicy support.
● We chose Calico.
○ A bit daunting but closest to standard networking.
○ Met our requirements.
○ Fast moving target.
● How do we make cluster debugging and connectivity easier?
○ VPN vs. Bastions
○ DNS for cluster internals while on VPN?
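The "defense in depth" that Calico's NetworkPolicy support enables looks roughly like this default-deny manifest; the namespace name is illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a        # hypothetical team namespace
spec:
  podSelector: {}          # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed, so all inbound traffic is denied
```

With a policy like this in place, each team then explicitly allows only the traffic its services actually need.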
10. PaaS? & user tools
● Should we go all the way to a PaaS?
○ Deis Workflow? Nope.
○ OpenShift? Nope.
○ Cloud Foundry? Nope.
○ Knative? Maybe…
● User tools
○ Helm
○ Kustomize
○ Jsonnet
11. Problems we encountered
● DNS
● DNS
● DNS
● Resource requests & limits
● Workload isolation
● Etcd v2
● “Bad” nodes
12. DNS issues as you scale
● Autoscaling
○ Early kube-dns didn’t autoscale.
● Too many queries!
○ 1 DNS query turns into 10, every time.
● Node DNS cache
○ A lightweight CoreDNS build that runs on every node and forwards cluster queries to the central CoreDNS.
○ We co-presented a talk about it at KubeCon EU.
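The query amplification comes from the resolv.conf Kubernetes injects into pods; the exact search domains vary per cluster, but the shape is typically:

```
nameserver 10.96.0.10    # cluster DNS service IP (typical default)
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
```

Because `ndots:5` treats any name with fewer than five dots as relative, a lookup for an external name like `example.com` is first tried against every search domain, and each candidate name is queried for both A and AAAA records, so one application lookup fans out into many real queries before the absolute name is finally tried.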
13. Workload isolation & resources
Resource requests & limits
● Requires a lot of training.
● Kubernetes admins start feeling like resource cops.
● Good metrics and alerting are crucial.
Workload Isolation
● Helps with the above, but is a heavy-handed solution.
● Limits the efficiency gains from bin-packing.
● Required for safety and reliability in some cases.
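The requests/limits model that needs all that training is just a few lines per container; the names, image, and numbers here are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # hypothetical workload
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests:          # what the scheduler reserves; drives bin-packing
          cpu: 250m
          memory: 256Mi
        limits:            # hard caps enforced at runtime (CPU throttled, OOM-kill on memory)
          cpu: "1"
          memory: 512Mi
```

Getting users to pick sane values for these four numbers is where the "resource cop" feeling comes from.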
14. Etcd v2 & “Bad” nodes
Etcd version 2
● We got stuck on v2 because of Kops & Calico.
● Kubernetes has removed support for etcd v2 as of v1.13.
“Bad” nodes
● A deliberately vague term covering a large number of problems.
● We use node-problem-detector with custom monitors to catch a handful of
these.
● We regularly add new detection cases.
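A custom monitor for node-problem-detector is a JSON config pointing at a check script; this is only the general shape, and the condition names and script path are illustrative:

```json
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 1
  },
  "source": "example-custom-plugin-monitor",
  "conditions": [
    {
      "type": "ExampleProblem",
      "reason": "ExampleHealthy",
      "message": "example check is passing"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "ExampleProblem",
      "reason": "ExampleCheckFailed",
      "path": "./config/plugin/check_example.sh"
    }
  ]
}
```

When the script exits non-zero, node-problem-detector sets the node condition, which monitoring and remediation tooling can then act on.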
15. Links & Questions?
Kubernetes Failure Stories:
https://github.com/hjacobs/kubernetes-failure-stories
Node-local-dns cache talk:
https://static.sched.com/hosted_files/kccnceu19/4b/KubeCon-Europe-2019-nodelocaldns.pdf
An early comparison of CNI providers:
https://docs.google.com/spreadsheets/d/1polIS2pvjOxCZ7hpXbra68CluwOZybsP1IYfr-HrAXc/edit#gid=0