Developer Experience at Zalando - CNCF End User SIG-DX

Presentation given on 2019-04-18 in the regular CNCF End User Developer Experience SIG call (Zoom).


  1. CNCF END USER SIG-DX, 2019-04-18. HENNING JACOBS, @try_except_. Developer Experience at Zalando
  2. EUROPE’S LEADING ONLINE FASHION PLATFORM
  3. ZALANDO AT A GLANCE
      ~ 5.4 billion EUR revenue 2018
      > 250 million visits per month
      > 15,000 employees in Europe
      > 79% of visits via mobile devices
      > 26 million active customers
      > 300,000 product choices
      ~ 2,000 brands
      17 countries
  4. Platform: > 1100 developers, > 200 development teams
  5. YOU BUILD IT, YOU RUN IT
      "The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer."
      - A Conversation with Werner Vogels, ACM Queue, 2006
  6. ON-CALL: YOU OWN IT, YOU RUN IT
      "When things are broken, we want people with the best context trying to fix things."
      - Blake Scrivener, Netflix SRE Manager
  7. KUBERNETES @ ZALANDO: 114 clusters; ~1400 nodes; since Oct 2016; node autoscaling; from v1.4 to v1.12; default deployment target
  8. DEVELOPERS USING KUBERNETES
  9. DEVELOPER JOURNEY: Consistent story that models all aspects of SW dev
  10. Developer Journey
  11. Developer Journey: Correctness, Compliance, GDPR, Security, Cost Efficiency, 24x7 On Call, Governance, Resilience, Capacity, ...
  12. DEVELOPER PRODUCTIVITY: Code, Build, Test, Deploy, Operate, Setup; Cloud Native Application Runtime
  13. PLAN & SETUP
  14. Plan: Stories, Rules of Play, Tech Radar
  15. Setup: Application Bootstrapping
  16. BUILD & TEST
  17. CONTINUOUS DELIVERY PLATFORM: BUILD (diagram labels: CDP, Git code push)
  18. DEPLOY
  19. Deploy: Kubernetes
  20. DEPLOYMENT CONFIGURATION
          ├── deploy/apply
          │   ├── deployment.yaml
          │   ├── credentials.yaml   # Zalando IAM
          │   ├── ingress.yaml
          │   └── service.yaml
          └── delivery.yaml          # Zalando CI/CD
  21. INGRESS.YAML
          kind: Ingress
          metadata:
            name: "..."
          spec:
            rules:
              # DNS name your application should be exposed on
              - host: "myapp.foo.example.org"
                http:
                  paths:
                    - backend:
                        serviceName: "myapp"
                        servicePort: 80
  22. TEMPLATING: MUSTACHE
          kind: Ingress
          metadata:
            name: "..."
          spec:
            rules:
              # DNS name your application should be exposed on
              - host: "{{{APPLICATION}}}.example.org"
                http:
                  paths:
                    - backend:
                        serviceName: "{{{APPLICATION}}}"
                        servicePort: 80
  23. CONTINUOUS DELIVERY PLATFORM
  24. CDP: DEPLOY, "glorified kubectl apply"
  25. CDP: OPTIONAL APPROVAL
  26. STACKSET: TRAFFIC SWITCHING. github.com/zalando-incubator/stackset-controller
  27. STACKSET CRD
          apiVersion: zalando.org/v1
          kind: StackSet
          ...
          spec:
            ingress:
              hosts: ["foo.example.org"]
              backendPort: 8080
            stackLifecycle:
              scaledownTTLSeconds: 1800
              limit: 5
            stackTemplate:
              spec:
                podTemplate:
                  ...
      github.com/zalando-incubator/stackset-controller
  28. TRAFFIC SWITCHING STEPS IN CDP. github.com/zalando-incubator/stackset-controller
  29. EMERGENCY ACCESS SERVICE
      Get emergency access by referencing existing Incident ticket:
          zkubectl cluster-access request --emergency -i INC REASON
      Get privileged production access via 4-eyes:
          zkubectl cluster-access request REASON
          zkubectl cluster-access approve USERNAME
  30. INTEGRATIONS
  31. CLOUD FORMATION VIA CI/CD
          ├── deploy/apply
          │   ├── deployment.yaml    # Kubernetes
          │   ├── cf-iam-role.yaml   # AWS IAM Role
          │   ├── cf-rds.yaml        # AWS RDS Database
          │   ├── kube-ingress.yaml
          │   ├── kube-secret.yaml
          │   └── kube-service.yaml
          └── delivery.yaml          # CI/CD config
      "Infrastructure as Code"
  32. ZALANDO IAM/OAUTH VIA CRD
          apiVersion: zalando.org/v1
          kind: PlatformCredentialsSet
          ..
          spec:
            application: my-app
            tokens:
              read-only:
                privileges:
                  - com.zalando::foobar.read
            clients:
              employee:
                grant: authorization-code
                realm: users
                redirectUri: https://example.org/auth/callback
  33. POSTGRES OPERATOR: Application to manage PostgreSQL clusters on Kubernetes. >700 clusters running on Kubernetes. github.com/zalando/postgres-operator
  34. Elasticsearch in Kubernetes: 2,500 vCPUs, 1 TB RAM. github.com/zalando-incubator/es-operator/
  35. SUMMARY
      • Application Bootstrapping
      • Git as source of truth and UI
      • 4-eyes principle for master/production
      • Extensible Kubernetes API as primary interface
      • OAuth/IAM credentials
      • PostgreSQL
      • CloudFormation for proprietary AWS services
  36. DELIVERY PERFORMANCE METRICS
      • Lead Time
      • Release Frequency
      • Time to Restore Service
      • Change Fail Rate
      https://srcco.de/posts/accelerate-software-delivery-performance.html
  37. CONTAINERS. From "Accelerate: The Science of Lean Software and DevOps"
  38. DELIVERY PERFORMANCE METRICS
      • Lead Time ≙ Commit to Prod
      • Release Frequency ≙ Deploys/week/dev
      • Time to Restore Service ≙ MTRS from incidents
      • Change Fail Rate ≙ n/a
  39. “.. means establishing empathy with internal consumers (read: developers) and collaborating with them on the design. Platform product managers establish roadmaps and ensure the platform delivers value to the business and enhances the developer experience.” - ThoughtWorks Technology Radar
  40. DEVELOPER SATISFACTION
  41. DOCUMENTATION
      "Documentation is hard to find"
      "Documentation is not comprehensive enough"
      "Remove unnecessary complexity and obstacles."
      "Get the documentation up to date and prepare use cases"
      "More and more clear documentation"
      "More detailed docs, example repos with more complicated deployments."
  42. DOCUMENTATION
      • Restructure following https://www.divio.com/en/blog/documentation/
          ○ Concepts
          ○ How Tos
          ○ Tutorials
          ○ Reference
      • Global Search
      • Weekly Health Check: Support → Documentation
  43. NEWSLETTER: "You can now.."
      • You can now benefit from the most recent Kubernetes 1.12 features, e.g. ..
      • You can now analyse your Kotlin project with SonarQube and upload your Scala code coverage report to SonarQube
  44. SIGNAL: ISSUE UPVOTES
  45. TESTIMONIALS
      “So, thank you, Team Automata, for listening to our community, taking our upvotes in consideration when developing new solutions and building every day 'the first CI that doesn't suck'.” - a user, October 2018
  46. MONITORING
  47. ZMON DASHBOARD. github.com/zalando/zmon
  48. GRAFANA APPLICATION DASHBOARD
  49. KUBERNETES RESOURCE REPORT. github.com/hjacobs/kube-resource-report
  50. RESOURCE REPORT: TEAMS. Sorting teams by Slack Costs. github.com/hjacobs/kube-resource-report
  51. RESOURCE REPORT: APPLICATIONS. "Slack"
  52. RESOURCE REPORT: CLUSTERS. "Slack". github.com/hjacobs/kube-resource-report
  53. UNDER THE HOOD
  54. ZALANDO: DECISION
      1. Forbid Memory Overcommit
          • Implement mutating admission webhook
          • Set requests = limits
      2. Disable CPU CFS Quota in all clusters
          • --cpu-cfs-quota=false
  55. KUBERNETES CLUSTER SETUP (diagram labels: Master, Config, Worker, EC2 Instances, CloudFormation Stacks). github.com/zalando-incubator/kubernetes-on-aws
  56. CLUSTER PROVISIONING: CLUSTER LIFECYCLE MANAGER (CLM) (diagram labels: ADMIN, create, apply manifests, provision resources, create CF stack, CLUSTER REGISTRY, CLM API, ... ... ..., CloudFormation API). github.com/zalando-incubator/cluster-lifecycle-manager github.com/zalando-incubator/kubernetes-on-aws
  57. INGRESS. https://github.com/zalando-incubator/kube-ingress-aws-controller
  58. VPA FOR PROMETHEUS
          apiVersion: poc.autoscaling.k8s.io/v1alpha1
          kind: VerticalPodAutoscaler
          metadata:
            name: prometheus-vpa
            namespace: kube-system
          spec:
            selector:
              matchLabels:
                application: prometheus
            updatePolicy:
              updateMode: Auto
      (chart label: CPU/memory)
  59. VERTICAL POD AUTOSCALER: limit/requests adapted by VPA
  60. HORIZONTAL POD AUTOSCALING (CUSTOM METRICS): Queue Length, Prometheus Query, Ingress Req/s, ZMON Check. github.com/zalando-incubator/kube-metrics-adapter
  61. DOWNSCALING DURING OFF-HOURS (chart label: Weekend). github.com/hjacobs/kube-downscaler
  62. DOWNSCALING DURING OFF-HOURS
          DEFAULT_UPTIME="Mon-Fri 07:30-20:30 CET"

          annotations:
            downscaler/exclude: "true"
      github.com/hjacobs/kube-downscaler
  63. KUBERNETES JANITOR
      ● TTL and expiry date annotations, e.g.
          ○ set time-to-live for your test deployment
      ● Custom rules, e.g.
          ○ delete everything without "app" label after 7 days
      github.com/hjacobs/kube-janitor
  64. JANITOR TTL ANNOTATION
          # let's try out nginx, but only for 1 hour
          kubectl run nginx --image=nginx
          kubectl annotate deploy nginx janitor/ttl=1h
      github.com/hjacobs/kube-janitor
  65. CUSTOM JANITOR RULES
          # require "app" label for new pods starting April 2019
          - id: require-app-label-april-2019
            resources:
              - deployments
              - statefulsets
            jmespath: "!(spec.template.metadata.labels.app) && metadata.creationTimestamp > '2019-04-01'"
            ttl: 7d
      github.com/hjacobs/kube-janitor
  66. EC2 SPOT NODES: 72% savings
  67. SPOT ASG / LAUNCH TEMPLATE: Not upstream in cluster-autoscaler (yet)
  68. OPEN SOURCE
      Kubernetes on AWS: github.com/zalando-incubator/kubernetes-on-aws
      AWS ALB Ingress controller: github.com/zalando-incubator/kube-ingress-aws-controller
      External DNS: github.com/kubernetes-incubator/external-dns
      Postgres Operator: github.com/zalando/postgres-operator
      Kubernetes Resource Report: github.com/hjacobs/kube-resource-report
      Kubernetes Downscaler: github.com/hjacobs/kube-downscaler
      Kubernetes Janitor: github.com/hjacobs/kube-janitor
  69. MORE INFO
      ● DevOps Gathering 2019: Ensuring Kubernetes Cost Efficiency across (many) Clusters (slides)
      ● DevOpsCon Munich 2018: Running Kubernetes in Production: A Million Ways to Crash Your Cluster
      ● HighLoad++ Moscow 2018: Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency (slides)
      ● DevOps Lisbon Meetup 2018: Kubernetes at Zalando
      kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/public-presentations.html
  70. QUESTIONS? HENNING JACOBS, HEAD OF DEVELOPER PRODUCTIVITY. henning@zalando.de, @try_except_. Illustrations by @01k
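
Sketch for the TEMPLATING: MUSTACHE slide. That slide replaces the hard-coded application name in ingress.yaml with a {{{APPLICATION}}} placeholder. The Python below is a minimal illustration of that substitution, not the renderer CDP uses (the slides do not show it). The names render and TRIPLE_BRACE are hypothetical. The sketch handles only triple-brace variables and, unlike a full Mustache implementation, raises on a missing variable instead of rendering an empty string.

```python
import re

# Matches only unescaped triple-brace variables such as {{{APPLICATION}}}.
TRIPLE_BRACE = re.compile(r"\{\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}\}")


def render(template: str, variables: dict) -> str:
    """Replace every {{{NAME}}} placeholder with variables[NAME]."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Assumption: fail loudly rather than emit an empty host name.
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    return TRIPLE_BRACE.sub(substitute, template)


# Template text taken from the slide; the elided name stays elided.
INGRESS_TEMPLATE = """\
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
    # DNS name your application should be exposed on
    - host: "{{{APPLICATION}}}.example.org"
      http:
        paths:
          - backend:
              serviceName: "{{{APPLICATION}}}"
              servicePort: 80
"""

rendered = render(INGRESS_TEMPLATE, {"APPLICATION": "myapp"})
print(rendered)
```

With APPLICATION set to "myapp", the host line becomes "myapp.example.org" and the backend service name becomes "myapp".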
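
Sketch for the DOWNSCALING DURING OFF-HOURS slide. The slide configures DEFAULT_UPTIME="Mon-Fri 07:30-20:30 CET". The Python below shows one way such a window could be evaluated; it is not kube-downscaler's own code. The function is_uptime and its pattern are hypothetical. It assumes the timestamp passed in is already expressed in the zone named by the spec (the zone name is parsed but not applied), and it does not handle weekday ranges that wrap around the week.

```python
import re
from datetime import datetime, time

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# "<day>-<day> <HH:MM>-<HH:MM> <zone>", e.g. "Mon-Fri 07:30-20:30 CET".
UPTIME_PATTERN = re.compile(
    r"^(\w{3})-(\w{3}) (\d{2}):(\d{2})-(\d{2}):(\d{2}) (\S+)$"
)


def is_uptime(spec: str, local_now: datetime) -> bool:
    """Return True if local_now falls inside the uptime window.

    Assumption: local_now is already in the time zone named in spec.
    """
    match = UPTIME_PATTERN.match(spec)
    if match is None:
        raise ValueError(f"unsupported uptime spec: {spec!r}")
    first_day = WEEKDAYS.index(match.group(1))
    last_day = WEEKDAYS.index(match.group(2))
    start = time(int(match.group(3)), int(match.group(4)))
    end = time(int(match.group(5)), int(match.group(6)))
    in_days = first_day <= local_now.weekday() <= last_day
    in_hours = start <= local_now.time() < end
    return in_days and in_hours


DEFAULT_UPTIME = "Mon-Fri 07:30-20:30 CET"

# 2019-04-18 was a Thursday, 2019-04-20 a Saturday.
print(is_uptime(DEFAULT_UPTIME, datetime(2019, 4, 18, 12, 0)))  # True
print(is_uptime(DEFAULT_UPTIME, datetime(2019, 4, 18, 21, 0)))  # False
print(is_uptime(DEFAULT_UPTIME, datetime(2019, 4, 20, 12, 0)))  # False
```

A controller built this way would scale workloads down whenever is_uptime returns False, skipping any resource annotated with downscaler/exclude: "true".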
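
Sketch for the JANITOR TTL ANNOTATION and CUSTOM JANITOR RULES slides. Both slides attach a time-to-live to a resource (janitor/ttl=1h, ttl: 7d). The Python below illustrates how such a value could be turned into an expiry decision; it is not kube-janitor's implementation. The functions parse_ttl and is_expired are hypothetical, and the set of unit suffixes is an assumption, since the slides only show "h" and "d".

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; the slides only show "1h" and "7d".
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
TTL_PATTERN = re.compile(r"^(\d+)([smhdw])$")


def parse_ttl(value: str) -> timedelta:
    """Convert a TTL string such as "1h" or "7d" into a timedelta."""
    match = TTL_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unsupported TTL: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(seconds=amount * UNIT_SECONDS[unit])


def is_expired(created: datetime, ttl: str, now: datetime) -> bool:
    """A resource is expired once its age exceeds the TTL."""
    return now > created + parse_ttl(ttl)


created = datetime(2019, 4, 18, 10, 0, tzinfo=timezone.utc)
print(is_expired(created, "1h", datetime(2019, 4, 18, 10, 30, tzinfo=timezone.utc)))  # False
print(is_expired(created, "1h", datetime(2019, 4, 18, 11, 30, tzinfo=timezone.utc)))  # True
```

In this reading, created corresponds to a resource's metadata.creationTimestamp, so the nginx deployment from the slide would become eligible for deletion one hour after it was created.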
