Talk given on 2019-05-21 at KubeCon Barcelona: https://kccnceu19.sched.com/event/MPcM/kubernetes-failure-stories-and-how-to-crash-your-clusters-henning-jacobs-zalando-se
Bootstrapping a Kubernetes cluster is easy; rolling it out to nearly 200 engineering teams and operating it at scale is a challenge. In this talk, we present our approach to Kubernetes provisioning on AWS, operations, and developer experience for Zalando's growing developer base. We will walk you through our horror stories of operating 100+ clusters and share the insights we gained from incidents, failures, user reports, and general observations. Our failure stories are drawn from both recent and past incidents, so the talk will be up to date with our latest experiences.
Most of our learnings apply to other Kubernetes infrastructures (EKS, GKE, etc.) as well. This talk strives to reduce the audience's unknown unknowns about running Kubernetes in production.