A Guide to Event-Driven SRE-inspired DevOps

This talk was given at the Online Kubernetes Meetup in July 2020 and at DevOps Fusion 2020. It discusses three major problems in current delivery and operations: too much time spent on delivery, monolithic delivery pipelines that are hard to maintain, and a lack of auto-remediation for production problems.
The talk focuses on new approaches to these problems, inspired by SRE practices and event-driven architectures.
As an implementation of this approach we use Keptn (www.keptn.sh), a CNCF open source project.

  1. A Guide to Event-driven SRE-inspired DevOps: A modern approach to delivery & operations with Keptn. Andreas Grabner, DevOps Activist at Dynatrace, @grabnerandi, https://www.linkedin.com/in/grabnerandi. Star us @ https://github.com/keptn/keptn. Follow us @keptnProject. More tutorials @ https://tutorials.keptn.sh. Online Kubernetes Meetup, July 2020.
  2. Let's start with a polling question: which statements hold true for your Continuous Delivery implementation? (Multiple choices possible.)
     1. It is very hard to troubleshoot broken pipelines!
     2. Pipeline code is heavily customized and therefore hard to maintain!
     3. We still have too many manual steps from dev to production!
     4. Overall our delivery is good!
  3. Problem #1: Classical monolithic pipelines are hard to maintain. Solution: Break the monolithic, hard-wired delivery pipelines with an event-driven control plane.
  4. Delivery pipelines look like their monolithic source code counterparts (350+ lines).
     Mixed information about:
     • Process (build, deploy, test, evaluate, …)
     • Target platform (k8s, …)
     • Environments (dev, hardening, …)
     • Tools (Terraform, Helm, hey, …)
     No clear separation of concerns:
     • Developers: define which artifact to use; want fast feedback on their code
     • DevOps Engineers: define which tools to use; ensure tools are properly configured
     • Site Reliability Engineers: define delivery processes; define operations workflows
  5. And we get a lot of copies that make it harder to maintain or fix issues: 1 Service = 1 Pipeline, 1 Project = x Pipelines, n Teams = n*x Pipelines. The slide shows five near-identical copies of the following pipeline:
       pipeline {
         stages {
           stage('Deploy to dev namespace') { steps { container('helm') { } } }
           stage('Run tests') { steps { container('hey') { } } }
           stage('Evaluate performance') { steps { container('curl') { } } }
           if (evaluation.passed) {
             stage('Deploy to staging') { steps { container('helm') { } } }
           }
         }
       }
     The other four copies differ only in the containers they use:
     • two copies run the tests with jmeter instead of hey
     • one copy deploys to dev with kustomize instead of helm and tests with jmeter
     • one copy tests with selenium and has no 'Evaluate performance' stage
  6. Solution: Remove hard dependencies and integrations. Diagram labels: Build, Prepare, Deploy, Test, Notify, Rollback; Config Mgmt., Deploy, Test, Monitoring, ChatOps, Rollback.
  7. Solution: Remove hard dependencies and integrations. Diagram labels: Build, Prepare, Deploy, Test, Notify, Rollback; Config Mgmt., Deploy, Test, Monitoring, ChatOps, Rollback; Eventing. Example event: Event: Deploy, Artifact: container1, Stage: Dev, Strategy: Blue/Green. Which events to generate: Process Definition. Who consumes events: Tool Definition.
  8. Solution: Keptn is built on an architecture that supports this paradigm.
     • Application Plane (= Process Definition): define the overall process for delivery and operations
     • Control Plane: follow application logic and communicate with/configure required services; reached through an API by Site Reliability Engineer, DevOps, Developer
     • Execution Plane (= Tool Definition): Deploy Service (Helm, Jenkins, …), Test Service (JMeter, Neotys, …), Validation Service (Keptn Lighthouse, …), Remediation Service (Keptn Remediation, SNOW, …), Config Service (Git, …), Monitoring Service (Prometheus, Dynatrace, …)
     • shipyard.yaml: dev: direct, functional; staging: blue/green, perf; prod: canary, real-user
     • uniform.yaml: config-change*: helm; deploy*: JMeter; deploy-finish: Lighthouse; problem*: Remediation; all: Slack, Dynatrace
     • Eventing for an artifact / microservice: config.change: artifact:x.y; deploy.finished: http://service1; tests.finished: OK; evaluation.done: 98% Score; problem.open: High Failure
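The uniform.yaml entries on this slide read like topic subscriptions, so event routing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patterns are treated as shell-style wildcards, "all" is modelled as "*", and the function name `subscribers` is hypothetical. It is not Keptn's actual dispatch code.

```python
from fnmatch import fnmatchcase

# Subscriptions copied from the slide's uniform.yaml.
# Assumption: "all" means "every event", modelled here as the pattern "*".
UNIFORM = {
    "config-change*": ["helm"],
    "deploy*": ["JMeter"],
    "deploy-finish": ["Lighthouse"],
    "problem*": ["Remediation"],
    "*": ["Slack", "Dynatrace"],
}


def subscribers(event_type):
    """Return every tool whose pattern matches the event type, in declaration order."""
    matched = []
    for pattern, tools in UNIFORM.items():
        # fnmatchcase gives case-sensitive, platform-independent wildcard matching.
        if fnmatchcase(event_type, pattern):
            matched.extend(tools)
    return matched


if __name__ == "__main__":
    for event in ("config-change", "deploy-finish", "problem-open"):
        print(event, "->", subscribers(event))
```

The producer of an event never names a tool. Swapping JMeter for another test tool would mean changing one entry in the mapping, which is the separation of process and tools that the slide argues for.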
  9. Demo #1: Event-Driven Progressive Delivery with Keptn. My sample app: grabnerandi/simplenodeservice:x.0.0 (v1.0.0, v2.0.0, v3.0.0, v4.0.0). Command: $ keptn send event new-artifact simplenodeservice:4.0.0. Diagram labels: Direct, Direct, Blue/Green; automated approval; manual approval; Keep or rollback; Promote or not?
  10. User Example: Progressive Delivery with Keptn. Patrick Hofmann, Sr. Consultant. Diagram labels: CI, CD.
  11. Problem #2: Too much manual effort in deployment validation. Solution: Leverage SLIs/SLOs not only for production SLA reporting but also for automating quality gates.
  12. Learning from Google's SRE Practices: SLIs drive SLOs, which inform SLAs.
      • Service Level Indicators (SLIs). Definition: measurable metrics as the base for evaluation. Example: error rate of login requests.
      • Service Level Objectives (SLOs). Definition: binding targets for Service Level Indicators. Example: login error rate must be less than 2% over a 30-day period.
      • Service Level Agreements (SLAs). Definition: business agreement between consumer and provider, typically based on SLOs. Example: logins must be reliable & fast (error rate, response time, throughput) 99% within a 30-day window.
      • Google Cloud YouTube video: SLIs, SLOs, SLAs, oh my! (class SRE implements DevOps): https://www.youtube.com/watch?v=tEylFyxbDLE
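To make the SLI/SLO distinction concrete, here is a minimal Python sketch of the slide's login example. The request counts are hypothetical, and the function names `error_rate` and `meets_slo` are illustrative, not part of any tool mentioned in the talk.

```python
def error_rate(failed, total):
    """SLI: percentage of failed requests (0.0 when there was no traffic)."""
    return 100.0 * failed / total if total else 0.0


def meets_slo(sli_value, max_percent=2.0):
    """SLO: the SLI must stay below the target (2% in the slide's example)."""
    return sli_value < max_percent


if __name__ == "__main__":
    # Hypothetical login request counts over a 30-day window.
    login_error_rate = error_rate(failed=150, total=10_000)
    print(f"login error rate: {login_error_rate}%")
    print("SLO met:", meets_slo(login_error_rate))
```

The SLI is only a measurement. The SLO turns it into a yes/no decision, which is what makes it usable as an automated gate.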
  13. Applying SRE Best Practices across the lifecycle.
      • Production: Authentication Service, May 2020: 0.89s, 0.5%, 1000/s; June 2020: 0.61s, 2.5%, 1600/s. Service X: xxs, xx%, xx/s; yys, yy%, yy/s.
      • Shift-Left into Continuous Delivery: Quality Gates for Authentication Service and Service X on Commit #1, Commit #2, Commit #3, Commit #4.
  14. Explainer on SLI/SLO validation as part of Continuous Delivery with Dynatrace & Keptn!
      SLIs (Service Level Indicators) and their SLO criteria:
      • Overall Failure Rate. Query: builtin:service.errors.total. Pass: <= 2%; warn: <= 5%
      • Test Step LOGIN Response Time. Query: calc:service.teststeprt:filter(Test, LOGIN). Pass: <= 150ms & <= +10%; warn: <= 400ms
      • Test Step LOGIN # Service Calls. Query: calc:service.testsvc:filter(tx, LOGIN). Pass: <= +0%
      • Response Time 95th Perc. Query: builtin:service.responsetime(p95). Pass: <= 100ms; warn: <= 250ms
      SLO overall score goal: pass 90%, warn 75%.
      Results per build (values in slide order, then overall score):
      • Build 1: 0%, 80ms, 100ms, 1; score 100%
      • Build 2: 4%, 120ms, 90ms, 1; score 75%
      • Build 3: 1%, 90ms, 120ms, 2; score 62.5%
      • Build 4: 0%, 95ms, 95ms, 1; score 100%
      Evaluations triggered by DevOps:
        $ keptn send event start-evaluation myproject myservice starttime=build1_deploy endtime=build1_testsdone
        $ keptn send event start-evaluation myproject myservice starttime=build2_deploy endtime=build2_testsdone
        $ keptn send event start-evaluation myproject myservice starttime=build3_deploy endtime=build3_testsdone
        $ keptn send event start-evaluation myproject myservice starttime=build4_teststart endtime=build4_testsend
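The build scores on this slide can be reproduced with a short Python sketch. Several things here are assumptions rather than facts from the slide: the column order of the two millisecond values (taken as p95 response time first, then LOGIN response time), the use of the immediately previous build as the baseline for relative criteria, and the 1.0 / 0.5 / 0.0 points per pass / warn / fail. All function names are hypothetical.

```python
# Assumed column order:
# (failure rate %, p95 response time ms, LOGIN response time ms, LOGIN service calls)
BUILDS = [
    (0, 80, 100, 1),
    (4, 120, 90, 1),
    (1, 90, 120, 2),
    (0, 95, 95, 1),
]


def grade(passed, warned):
    """Assumed scoring: 1.0 for pass, 0.5 for warn, 0.0 otherwise."""
    return 1.0 if passed else 0.5 if warned else 0.0


def score(build, previous):
    """Overall score in percent for one build, compared against the previous build."""
    fail_rate, p95, login_rt, calls = build
    # Relative criteria need a baseline; the first build is compared with itself.
    prev_rt = previous[2] if previous else login_rt
    prev_calls = previous[3] if previous else calls
    points = [
        grade(fail_rate <= 2, fail_rate <= 5),
        grade(p95 <= 100, p95 <= 250),
        grade(login_rt <= 150 and login_rt <= prev_rt * 1.10, login_rt <= 400),
        grade(calls <= prev_calls, False),  # "<= +0%" has no warn level on the slide
    ]
    return 100.0 * sum(points) / len(points)


def verdict(percent):
    """Apply the slide's overall goal: pass at 90%, warn at 75%."""
    return "pass" if percent >= 90 else "warning" if percent >= 75 else "fail"


scores = [score(b, BUILDS[i - 1] if i else None) for i, b in enumerate(BUILDS)]

if __name__ == "__main__":
    for number, value in enumerate(scores, start=1):
        print(f"Build {number}: {value}% -> {verdict(value)}")
```

Under these assumptions the sketch yields 100%, 75%, 62.5% and 100%, matching the slide. Build 3 would be stopped by the quality gate, even though its failure rate alone looks healthy.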
  15. SLI/SLO-based evaluation implementation in Keptn.
      • SLIs are defined per SLI provider as YAML: SLI-provider-specific queries, e.g. Dynatrace Metrics Query. SLI providers: Dynatrace, Prometheus, Neoload, Tool X, ...
          indicators:
            error_rate: "builtin:service.errors.total.count:merge(0):avg"
            count_dbcalls: "calc:service.toptestdbcalls:merge(0):sum"
            jvm_memory: "builtin:tech.jvm.memory.pool.committed:merge(0):sum"
      • SLOs are defined on Keptn service level as YAML: list of objectives with fixed or relative pass & warn criteria.
          objectives:
            - sli: error_rate
              pass:
                - criteria:
                    - "<=1" # We expect a max error rate of 1%
            - sli: jvm_memory
            - sli: count_dbcalls
              pass:
                - criteria:
                    - "=+2%" # We allow a 2% increase in DB Calls to previous runs
              warning:
                - criteria:
                    - "<=10" # We expect no more than 10 DB Calls per TX
          total_score:
            pass: "90%"
            warning: "75%"
      • Evaluation: $ keptn start-evaluation 30m myservice sli.yaml slo.yaml
      • Flow (steps 1 to 4 on the slide): Quality Gates, SLI Providers with SLI Definitions & Timeframe, Queries, SLIs, Scores
      • SLI Value: 5 DB Calls, 360MB, 4.3%. SLI Score: 0.5, 1.0, 0.0, info. Total Score: 7/8 (87.5%), 4/8 (50%)
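The criteria strings in the SLO file mix fixed thresholds ("<=10") and thresholds relative to previous runs ("+2%"). The Python sketch below shows one way such strings could be interpreted. The parsing rules are assumptions made for illustration (a leading sign marks a relative criterion, a trailing % on a signed value means percent of the baseline), and `check` is a hypothetical helper, not Keptn's implementation.

```python
import operator
import re

OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
}

# operator, optional sign, number, optional percent sign
_CRITERION = re.compile(r"^(<=|>=|<|>|=)\s*([+-]?)(\d+(?:\.\d+)?)(%?)$")


def check(criterion, value, baseline=None):
    """Return True if value satisfies the criterion string.

    Assumed rules: a signed number is relative to the baseline (previous
    runs); with a trailing % it is a percentage of the baseline. An
    unsigned number is a fixed threshold.
    """
    match = _CRITERION.match(criterion.strip())
    if match is None:
        raise ValueError(f"unsupported criterion: {criterion!r}")
    op, sign, number, percent = match.groups()
    amount = float(number)
    if sign:
        if baseline is None:
            raise ValueError("a relative criterion needs a baseline value")
        delta = baseline * amount / 100 if percent else amount
        target = baseline + delta if sign == "+" else baseline - delta
    else:
        target = amount
    return OPS[op](value, target)


if __name__ == "__main__":
    print(check("<=1", 4.3))                # fixed threshold on the error rate
    print(check("<=10", 5))                 # fixed threshold on DB calls
    print(check("<=+2%", 5, baseline=5))    # relative threshold against previous runs
```

Relative criteria are what let the same SLO file follow a service over time: the threshold moves with the baseline instead of being hard-coded.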
  16. Demo: Automated SLI/SLO validation based on Dynatrace dashboards. Just build a dashboard! Example scores: 15.5/16 (97%), 8/16 (50%).
  17. User Example: Automating build approvals using Keptn's SLIs/SLOs in GitLab. Christian Heckelmann, Senior Systems Engineer. Diagram labels: Trigger Evaluation; Automated SLI/SLO-based Quality Gates; 87.5%: passed.
  18. Bonus Problem #3: Too much manual effort in incident troubleshooting. Solution: Leverage the event-driven approach for auto-remediation, and SLIs/SLOs to validate the impact.
  19. Keptn: Closed-Loop Remediation, coming with Keptn 0.7.
        version: 0.2.0
        kind: Remediation
        metadata:
          name: remediation-ecommerce
        spec:
          remediations:
            - problemType: Conversion Rate Dropped
              actionsOnOpen:
                - name: Scaling ReplicaSet by 1
                  action: scaling
                  values:
                    increment: +1
                - name: Stop Ad Campaign
                  action: googleadtoggle
                  values:
                    enable: off
                    campaign: $campaignid
      Workflow: Problem (ConversionRateDropped); Get remediation action(s); Execute remediation action(s) (1: scaling, 2: Google Ad toggle); Re-validate SLO/BLO; Escalate.
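The closed loop on this slide (get actions, execute, re-validate, escalate) can be sketched as a small Python function. The sketch assumes that the SLO is re-validated after each action and that escalation happens only when no action helps; the slide's diagram suggests this but does not spell it out. `remediate`, `run_action` and `slo_ok` are hypothetical names, and the action executor is a stand-in for real tooling.

```python
# Action names copied from the slide's remediation file.
REMEDIATIONS = {
    "Conversion Rate Dropped": ["scaling", "googleadtoggle"],
}


def remediate(problem_type, run_action, slo_ok):
    """Run the configured actions in order, re-validating after each one.

    run_action(name) executes one action; slo_ok() re-checks the SLO/BLO.
    Returns a short status string; "escalate" means nothing helped.
    """
    for action in REMEDIATIONS.get(problem_type, []):
        run_action(action)
        if slo_ok():
            return f"resolved by {action}"
    return "escalate"


if __name__ == "__main__":
    # Stand-in environment: only stopping the ad campaign fixes the problem.
    executed = []
    state = {"healthy": False}

    def fake_run(action):
        executed.append(action)
        if action == "googleadtoggle":
            state["healthy"] = True

    print(remediate("Conversion Rate Dropped", fake_run, lambda: state["healthy"]))
    print(executed)
```

Re-validating after every step keeps the loop from running further actions once the objective is met, and gives a clear point at which to hand over to a human.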
  20. Custom Example: Toggle Feature Flags (planned for this year). Abigail Wilson, Site Reliability Architect.
  21. Let's wrap it up!
  22. What is Keptn? An event-based control plane for continuous delivery and automated operations for cloud-native applications (www.keptn.sh).
      • Define application delivery and operations processes declaratively
      • Use predefined CloudEvents to separate the process from the tools
      • Easy way to integrate and switch between different tools
      Diagram labels: Blue/Green Deployments, Automated Quality Gates, Automated Operations; Standardized communication protocol; Keptn's uniform.
  23. Tutorials: tutorials.keptn.sh
  24. Questions & Answers
  25. Keptn Architecture
