Video of the talk: https://www.youtube.com/watch?v=MTHj0_NdeeM
When you run Kubernetes in production and at scale, you encounter many issues, both on the infrastructure side and in user space. Some of these issues come with time, increased usage, cluster size, and the number of workloads; some might only appear once you go global and into regions with vastly different technology landscapes, like China.
This talk goes into detail on learnings from concurrently operating 100+ clusters for big enterprises in production, on different clouds and in data centers around the globe. Over the years we have worked through hundreds of postmortems and want to share both operations and development best practices that can help avoid the issues we ran into. A big focus of this talk is arriving at a hardened and reliable cluster setup and handling multi-tenancy in clusters that are used by a multitude of teams.
9. Postmortem Philosophy
“The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.”
- Google SRE book
@puja108 9
10. Single Product
1. Gather Issues
2. Fix in Code
3. Roll out continuously
4. Profit 😉
11. Postmortem Practice
- Issue Template
- High Priority
- Assigned to a cross-functional team
18. Customer Load Test goes bad? You take the blame!
- “Must be Calico, kube-proxy, the Ingress Controller!”
- Turns out EC2 network saturation was the bottleneck
- Solution: More workers!
20. Postmortem Hotspots
- Old versions
- Ingress (~15%)
- Networking & DNS
- Resource Pressure
- Multi-tenancy
21. Old versions
- Issues might have been solved already
- CVEs
- Test upgrades extensively
- Automate upgrades (or have a process)
22. Ingress
- NGINX IC: Newer versions are less prone to misconfiguration
- Separate controllers
- Load- and failover-testing
- Last resort: Service of type LoadBalancer
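Running separate ingress controllers (e.g. one per tenant or traffic class) can be expressed with ingress classes, so each controller only serves the Ingress objects addressed to it. A minimal sketch, with hypothetical class, host, and service names:

```yaml
# Sketch: an Ingress pinned to one specific controller via its class.
# "nginx-internal", the namespace, host, and service names are illustrative;
# a second controller deployment would watch a different class.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: team-a-app
  namespace: team-a
spec:
  ingressClassName: nginx-internal   # only the "internal" controller picks this up
  rules:
  - host: app.team-a.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app
            port:
              number: 80
```

Splitting traffic across controllers this way keeps one misbehaving tenant or a bad config reload from taking down ingress for everyone.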
23. Networking & DNS
- Monitor network health
- Monitor DNS latency
- Check for known issues
- Apply best practices
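One of the known issues alluded to here (and linked on the last slide) is multi-second in-cluster DNS lookups, driven by the default `ndots:5` search-path expansion and a conntrack race with parallel A/AAAA queries. A commonly cited mitigation, sketched as a hedged example with an illustrative pod:

```yaml
# Sketch: per-pod DNS tuning to mitigate slow lookups.
# Values are illustrative; test against your own resolver setup.
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-pod   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
  dnsConfig:
    options:
    - name: ndots
      value: "2"                   # avoid needless search-domain expansion
    - name: single-request-reopen  # glibc resolver option; serializes A/AAAA
                                   # queries to sidestep the conntrack race
```

Whether `single-request-reopen` helps depends on the libc in the container image (it is a glibc option; musl-based images behave differently), so measure DNS latency before and after.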
24. Resource Pressure
- Resource Management!
- Include buffers (lots of them)
- Protect K8s and critical addons (priority)
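Protecting critical addons by priority can be sketched with a `PriorityClass` plus explicit resource requests and limits; all names and values below are illustrative:

```yaml
# Sketch: give cluster-critical addons a high priority so they are
# scheduled first and evicted last under resource pressure.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-addons
value: 1000000
globalDefault: false
description: "For cluster-critical addons only."
---
apiVersion: v1
kind: Pod
metadata:
  name: addon-example   # illustrative; normally part of a Deployment/DaemonSet
  namespace: kube-system
spec:
  priorityClassName: critical-addons
  containers:
  - name: addon
    image: coredns/coredns
    resources:
      requests:          # reserved share the scheduler accounts for
        cpu: 100m
        memory: 70Mi
      limits:            # limits == requests yields Guaranteed QoS,
        cpu: 100m        # the last class to be evicted under pressure
        memory: 70Mi
```

The buffers mentioned above then come on top: keep node allocatable headroom so system daemons and these addons never compete with tenant workloads.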
25. Multi-tenancy
- Separate and isolate namespaces with RBAC
- No cluster-admins!
- Separate clusters if possible
- Automate with CI/CD
- Minimize manual ops
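Namespace isolation with RBAC, as advocated above, can be sketched with a namespace-scoped `Role` and `RoleBinding`; the team name and resource list are illustrative and would be narrowed to what each team actually needs:

```yaml
# Sketch: confine a team to its own namespace instead of cluster-admin.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-edit
  namespace: team-a
rules:
- apiGroups: ["", "apps", "networking.k8s.io"]
  resources: ["pods", "services", "configmaps", "secrets", "deployments", "ingresses"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-edit
  namespace: team-a
subjects:
- kind: Group
  name: team-a          # as asserted by your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role            # Role, not ClusterRole: stays namespace-scoped
  name: team-a-edit
  apiGroup: rbac.authorization.k8s.io
```

Because the binding references a `Role` rather than a `ClusterRole`, nothing here grants access outside `team-a`, which is the point of the "no cluster-admins" rule.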
26. Best Practices
- Preemptive Monitoring & Alerting are key!
- Logging (and Tracing) help debugging
- Fix issues fast
- Educate users
- Have a postmortem process
- Train recovery
27. Stand on the Shoulders of Giants!
- Kubernetes the very hard way - Datadog
- Scaling Kubernetes to 2,500 Nodes - OpenAI
- 5-15s DNS lookups on Kubernetes? - BitMEX
- Scaling CoreDNS in Kubernetes Clusters
- Inside Kubernetes Resource Management (QoS) - Michael Gasch
- List of Kubernetes Best Practice talks/blogs
- Kubernetes Office Hours