Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
A Million Ways
to Crash Your Cluster
CONTAINER CAMP UK
HENNING JACOBS
@try_except_
2018-09-07
4
ZALANDO AT A GLANCE
~ 4.5billion EUR
revenue 2017
> 200
million
visits
per
month
> 15.000
employees in
Europe
> 70%
of v...
5
SCALE
95Clusters
378Accounts
INCIDENTS ARE FINE
7
INCIDENT #1: CUSTOMER IMPACT
8
INCIDENT #1: IAM RETURNING 404
9
INCIDENT #1: NUMBER OF PODS
10
LIFE OF A REQUEST (INGRESS)
DNS
my-app.example.org
ALB
aws-1234-lb.eu-central-1.elb.amazonaws.com
SERVICE
10.3.0.216
DE...
11
INCIDENT #1: INNOCENT MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
name: "foobar"
labels:
application: "...
12
INCIDENT #1: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
name: "foobar"
labels:
application: "foo...
13
INCIDENT #1: LESSONS LEARNED
• ALB routes traffic to ALL hosts if all hosts report “unhealthy”
• Fix Skipper Ingress to...
14
INCIDENT #2: CLUSTER DOWN
15
INCIDENT #2: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
16
INCIDENT #2: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> ...
17
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
https://www.outcome-eng.com/human-error-never-root-cause/
19
INCIDENT #2: LESSONS LEARNED
• Disaster Recovery Plan?
• Backup etcd to S3
• Monitor the snapshots
20
INCIDENT #3: LATENCY SPIKES
...
Kubernetes worker and master nodes sporadically fail to connect to etcd
causing timeout...
21
INCIDENT #3: STOP THE BLEEDING
#!/bin/bash
SLEEPTIME=60
while true; do
echo "sleep for $SLEEPTIME seconds"
sleep $SLEEP...
22
INCIDENT #3: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems durin...
23
INCIDENT #3: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernete...
24
INCIDENT #4: IMPACT
25
INCIDENT #4: CLUSTER DOWN?
26
INCIDENT #4: THE TRIGGER
https://www.outcome-eng.com/human-error-never-root-cause/
28
CLUSTER UPGRADE
FLOW
29
CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager
30
CLUSTER CHANNELS
github.com/zalando-incubator/kubernetes-on-aws
Channel Description Clusters
dev Development and playgr...
31
E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws
32
RUNNING E2E TESTS (BEFORE)
Control plane
nodenode
branch: dev
Create Cluster Run e2e tests Delete Cluster
Testing dev t...
33
RUNNING E2E TESTS (NOW)
Control plane
nodenode
Control plane
nodenode
branch: alpha (base) branch: dev (head)
Create Cl...
34
INCIDENT #4: LESSONS LEARNED
• Automated end-to-end tests are pretty good, but not enough
• Test the diff/migration aut...
35
INCIDENT #5: IMPACT
[4:59 PM] Marc: There is a error during build - forbidden: image policy webhook backend denied
one ...
36
INCIDENT #5: IMPACT
37
INCIDENT #5: A VERY INNOCENT PULL REQUEST
38
INCIDENT #5: WHAT HAPPENED
• Deployment caused rebuild with the latest stable Go version
• Library for signature verifi...
39
A MILLION WAYS TO CRASH YOUR CLUSTER?
• Switch to latest Docker to fix issues with Docker daemon freezing
• Redesign of...
40
MORE TOPICS
• Graceful Pod shutdown and
race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS...
41
RACE CONDITIONS..
• Switch to the latest Docker version available to fix the issues with Docker daemon freezing
• Redes...
42
TIMEOUTS TO API SERVER..
github.com/zalando-incubator/kubernetes-on-aws
WELCOME TO
CLOUD NATIVE!
44
45
OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zala...
46
KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report
https://github.com/hjacobs/kube-ops-view
48
OTHER TALKS
• Nordstrom: 101 Ways to Crash Your Cluster - KubeCon 2017
• Monzo: Anatomy of a Production Kubernetes Outa...
QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Container Camp UK
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Container Camp UK
Prochain SlideShare
Chargement dans…5
×

Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Container Camp UK

2 382 vues

Publié le

Bootstrapping a Kubernetes cluster is easy, rolling it out to nearly 200 engineering teams and operating it at scale is a challenge. In this talk, we are presenting our approach to Kubernetes provisioning on AWS, operations and developer experience for our growing Zalando developer base. We will walk you through our horror stories of operating 80+ clusters and share the insights we gained from incidents, failures, user reports and general observations. Most of our learnings apply to other Kubernetes infrastructures (EKS, GKE, ..) as well. This talk strives to reduce the audience’s unknown unknowns about running Kubernetes in production.

https://2018.container.camp/uk/schedule/running-kubernetes-in-production-a-million-ways-to-crash-your-cluster/

Publié dans : Technologie
  • Soyez le premier à commenter

Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Container Camp UK

  1. 1. A Million Ways to Crash Your Cluster CONTAINER CAMP UK HENNING JACOBS @try_except_ 2018-09-07
  2. 2. 4 ZALANDO AT A GLANCE ~ 4.5billion EUR revenue 2017 > 200 million visits per month > 15.000 employees in Europe > 70% of visits via mobile devices > 23 million active customers > 300.000 product choices ~ 2.000 brands 15 countries
  3. 3. 5 SCALE 95Clusters 378Accounts
  4. 4. INCIDENTS ARE FINE
  5. 5. 7 INCIDENT #1: CUSTOMER IMPACT
  6. 6. 8 INCIDENT #1: IAM RETURNING 404
  7. 7. 9 INCIDENT #1: NUMBER OF PODS
  8. 8. 10 LIFE OF A REQUEST (INGRESS) DNS my-app.example.org ALB aws-1234-lb.eu-central-1.elb.amazonaws.com SERVICE 10.3.0.216 DEPLOYMENT POD 10.2.0.1 POD 10.2.1.1 POD 10.2.2.1 POD 10.2.3.1 SKIPPER 172.31.1.1:9999 SKIPPER 172.31.2.1:9999 SKIPPER 172.31.3.1:9999 SKIPPER 172.31.4.1:9999 ALIAS Record
  9. 9. 11 INCIDENT #1: INNOCENT MANIFEST apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" labels: application: "foobar" spec: schedule: "*/15 9-19 * * Mon-Fri" jobTemplate: spec: template: metadata: labels: application: "foobar" spec: restartPolicy: Never concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 containers: ...
  10. 10. 12 INCIDENT #1: FIXED CRON JOB apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" labels: application: "foobar" spec: schedule: "7 8-18 * * Mon-Fri" concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 jobTemplate: spec: activeDeadlineSeconds: 120 template: metadata: labels: application: "foobar" spec: restartPolicy: Never containers: ...
  11. 11. 13 INCIDENT #1: LESSONS LEARNED • ALB routes traffic to ALL hosts if all hosts report “unhealthy” • Fix Skipper Ingress to stay “healthy” during API server problems • Fix Skipper Ingress to retain last known set of routes • Use quota for number of pods apiVersion: v1 kind: ResourceQuota metadata: name: compute-resources spec: hard: pods: "1500"
  12. 12. 14 INCIDENT #2: CLUSTER DOWN
  13. 13. 15 INCIDENT #2: MANUAL OPERATION % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
  14. 14. 16 INCIDENT #2: RTFM % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix help: etcdctl del [options] <key> [range_end]
  15. 15. 17 Junior Engineers are Features, not Bugs https://www.youtube.com/watch?v=cQta4G3ge44
  16. 16. https://www.outcome-eng.com/human-error-never-root-cause/
  17. 17. 19 INCIDENT #2: LESSONS LEARNED • Disaster Recovery Plan? • Backup etcd to S3 • Monitor the snapshots
  18. 18. 20 INCIDENT #3: LATENCY SPIKES ... Kubernetes worker and master nodes sporadically fail to connect to etcd causing timeouts in the APIserver and disconnects in the pod network. ...
  19. 19. 21 INCIDENT #3: STOP THE BLEEDING #!/bin/bash SLEEPTIME=60 while true; do echo "sleep for $SLEEPTIME seconds" sleep $SLEEPTIME timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null if [ $? -eq 0 ]; then echo "all fine, no need to restart etcd member" continue else echo "restarting etcd-member" systemctl restart etcd-member fi done
  20. 20. 22 INCIDENT #3: CONFIRMATION FROM AWS [...] We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...]
  21. 21. 23 INCIDENT #3: LESSONS LEARNED • It's never the AWS infrastructure until it is • Treat t2 instances with care • Kubernetes components are not necessarily "cloud native" Cloud Native? Declarative, dynamic, resilient, and scalable
  22. 22. 24 INCIDENT #4: IMPACT
  23. 23. 25 INCIDENT #4: CLUSTER DOWN?
  24. 24. 26 INCIDENT #4: THE TRIGGER
  25. 25. https://www.outcome-eng.com/human-error-never-root-cause/
  26. 26. 28 CLUSTER UPGRADE FLOW
  27. 27. 29 CLUSTER LIFECYCLE MANAGER (CLM) github.com/zalando-incubator/cluster-lifecycle-manager
  28. 28. 30 CLUSTER CHANNELS github.com/zalando-incubator/kubernetes-on-aws Channel Description Clusters dev Development and playground clusters. 3 alpha Main infrastructure cluster (important to us). 1 beta Product clusters for the rest of the organization (prod/test). 90+
  29. 29. 31 E2E TESTS ON EVERY PR github.com/zalando-incubator/kubernetes-on-aws
  30. 30. 32 RUNNING E2E TESTS (BEFORE) Control plane nodenode branch: dev Create Cluster Run e2e tests Delete Cluster Testing dev to alpha upgrade Control plane Control plane
  31. 31. 33 RUNNING E2E TESTS (NOW) Control plane nodenode Control plane nodenode branch: alpha (base) branch: dev (head) Create Cluster Update Cluster Run e2e tests Delete Cluster Testing dev to alpha upgrade Control plane Control plane
  32. 32. 34 INCIDENT #4: LESSONS LEARNED • Automated end-to-end tests are pretty good, but not enough • Test the diff/migration automatically • Bootstrap new cluster with the previous configuration • Apply new configuration • Run end-to-end & conformance tests
  33. 33. 35 INCIDENT #5: IMPACT [4:59 PM] Marc: There is a error during build - forbidden: image policy webhook backend denied one or more images: X-Trusted header "false" for image pierone../ci/cdp-builder:234 .. [5:01 PM] Alice: Now it does not start the build step at all [5:02 PM] John: +1 [5:02 PM] John: Failed to create builder pod: … [5:02 PM] Pedro: +1 [5:04 PM] Damien: +1 [5:19 PM] Anton: We're currently having issues pulling images from our Docker registry which results in many problems… ...
  34. 34. 36 INCIDENT #5: IMPACT
  35. 35. 37 INCIDENT #5: A VERY INNOCENT PULL REQUEST
  36. 36. 38 INCIDENT #5: WHAT HAPPENED • Deployment caused rebuild with the latest stable Go version • Library for signature verification was incompatible with Go 1.10, causing all verification checks to fail during runtime. • Lack of unit/smoke tests and alerting for one component • "Near miss": outage could have had large impact
  37. 37. 39 A MILLION WAYS TO CRASH YOUR CLUSTER? • Switch to latest Docker to fix issues with Docker daemon freezing • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to node-local dnsmasq+CoreDNS • Disabling CPU throttling (CFS quota) to avoid latency issues • Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts • 502's during cluster updates: race condition during network setup
  38. 38. 40 MORE TOPICS • Graceful Pod shutdown and race conditions (endpoints, Ingress) • Incompatible Kubernetes changes • CoreOS ContainerLinux "stable" won't boot • Kubernetes EBS volume handling • Docker
  39. 39. 41 RACE CONDITIONS.. • Switch to the latest Docker version available to fix the issues with Docker daemon freezing • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS • Disabling CPU throttling (CFS quota) to avoid latency issues • Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts • 502's during cluster updates: race condition • github.com/zalando-incubator/kubernetes-on-aws
  40. 40. 42 TIMEOUTS TO API SERVER.. github.com/zalando-incubator/kubernetes-on-aws
  41. 41. WELCOME TO CLOUD NATIVE!
  42. 42. 44
  43. 43. 45 OPEN SOURCE Kubernetes on AWS github.com/zalando-incubator/kubernetes-on-aws AWS ALB Ingress controller github.com/zalando-incubator/kube-ingress-aws-controller Skipper HTTP Router & Ingress controller github.com/zalando/skipper External DNS github.com/kubernetes-incubator/external-dns Postgres Operator github.com/zalando-incubator/postgres-operator Kubernetes Resource Report github.com/hjacobs/kube-resource-report Kubernetes Downscaler github.com/hjacobs/kube-downscaler
  44. 44. 46 KUBERNETES RESOURCE REPORT github.com/hjacobs/kube-resource-report
  45. 45. https://github.com/hjacobs/kube-ops-view
  46. 46. 48 OTHER TALKS • Nordstrom: 101 Ways to Crash Your Cluster - KubeCon 2017 • Monzo: Anatomy of a Production Kubernetes Outage - KubeCon 2018 • Inside Kubernetes Resource Management (QoS) - KubeCon 2018 We need more failure talks!
  47. 47. QUESTIONS? HENNING JACOBS HEAD OF DEVELOPER PRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k

×