This talk was given at the Online Kubernetes Meetup in July 2020 and at DevOps Fusion 2020. It discusses three major problems in current delivery and operations: too much time spent in delivery, hard-to-maintain monolithic delivery pipelines, and a lack of auto-remediation of production problems.
The talk focuses on new approaches to solve these problems inspired by SRE practices and event-driven architectures.
As an implementation of this new approach we use Keptn (www.keptn.sh), a CNCF open source project.
1. A Guide to Event-driven SRE-inspired DevOps
Andreas Grabner
DevOps Activist at Dynatrace
@grabnerandi
https://www.linkedin.com/in/grabnerandi
A modern approach to delivery & operations with Keptn
Star us @ https://github.com/keptn/keptn
Follow us @keptnProject
More tutorials @ https://tutorials.keptn.sh
Online Kubernetes Meetup, July 2020
2. Confidential 2
Let's start with a POLLING question
Which statements hold true for your Continuous Delivery implementation?
1. It is very hard to troubleshoot broken pipelines!
2. Pipeline code is heavily customized and therefore hard to maintain!
3. We still have too many manual steps from dev to production!
4. Overall our delivery is good!
(multiple choice possible)
7. Confidential 7
Solution: Remove hard dependencies and integrations
Build
Prepare
Deploy
Test
Notify
Rollback
Config Mgmt.
Deploy
Test
Monitoring
ChatOps
Rollback
Eventing
Event: Deploy
Artifact: container1
Stage: Dev
Strategy: Blue/Green
Process Definition: which events to generate
Tool Definition: who consumes events
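The decoupling on this slide can be sketched as a tiny publish/subscribe bus: the process side only publishes a "deploy" event, and tools decide for themselves whether to consume it. This is a minimal in-process illustration, not Keptn code; the `EventBus` class, the handler names, and the lower-case event name are assumptions made for the example. Only the payload fields (artifact, stage, strategy) come from the slide.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        # Tool definition: a tool registers for the events it consumes.
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> int:
        # Process definition: emit an event without knowing who listens.
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


log: List[str] = []


def helm_deployer(event: dict) -> None:
    # Hypothetical deployment tool reacting to the event.
    log.append(
        f"helm: deploy {event['artifact']} to {event['stage']} ({event['strategy']})"
    )


def chat_notifier(event: dict) -> None:
    # Hypothetical ChatOps tool reacting to the same event.
    log.append(f"chat: deployment of {event['artifact']} requested")


bus = EventBus()
bus.subscribe("deploy", helm_deployer)
bus.subscribe("deploy", chat_notifier)

notified = bus.publish(
    "deploy",
    {"artifact": "container1", "stage": "Dev", "strategy": "Blue/Green"},
)
print(notified, log)
```

Swapping the deployment tool then means subscribing a different handler; the publishing side does not change.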
8. Confidential 8
Eventing
Solution: Keptn is built on an architecture that supports this paradigm
Application Plane (=Process Definition)
Define overall process for delivery and operations
Control Plane
Follow application logic and communicate/configure required services
API
Site Reliability Engineer
DevOps
Developer
shipyard.yaml
- dev: direct, functional
- staging: blue/green, perf
- prod: canary, real-user
uniform.yaml
config-change*: helm
deploy*: JMeter
deploy-finish: Lighthouse
problem*: Remediation
all: Slack, Dynatrace
Execution Plane (=Tool Definition)
Deploy Service (Helm, Jenkins …)
Test Service (JMeter, Neotys, ..)
Validation Service (Keptn Lighthouse …)
Remediation Service (Keptn Remediation, SNOW …)
Config Service (Git, …)
Monitoring Service (Prometheus, Dynatrace, …)
Artifact / Microservice
config.change: artifact:x.y
deploy.finished: http://service1
tests.finished: OK
evaluation.done: 98% Score
problem.open: High Failure
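How the uniform maps events to tools can be sketched as a pattern table. The patterns and tool names below are copied from the slide's uniform.yaml; resolving them with shell-style wildcards via `fnmatchcase`, treating "all" as a catch-all, and the hyphenated event names are assumptions for illustration and not Keptn's actual matching logic.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Pattern table taken from the slide's uniform.yaml.
UNIFORM: Dict[str, List[str]] = {
    "config-change*": ["helm"],
    "deploy*": ["JMeter"],
    "deploy-finish": ["Lighthouse"],
    "problem*": ["Remediation"],
    "all": ["Slack", "Dynatrace"],
}


def subscribers(event_type: str) -> List[str]:
    """Return every tool whose pattern matches the event type, in table order."""
    matched: List[str] = []
    for pattern, tools in UNIFORM.items():
        # Assumption: "all" is a catch-all, everything else is a wildcard pattern.
        if pattern == "all" or fnmatchcase(event_type, pattern):
            matched.extend(tools)
    return matched


print(subscribers("deploy-finish"))
print(subscribers("problem-open"))
```

Because the mapping lives in data rather than in pipeline code, replacing JMeter with another test tool is a one-line change to the table.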
9. Confidential 9
Demo #1: Event-Driven Progressive Delivery with Keptn
$ keptn send event new-artifact simplenodeservice:4.0.0
v1.0.0 v2.0.0 v3.0.0 v4.0.0
My sample app: grabnerandi/simplenodeservice:x.0.0
Direct Direct Blue/Green
automated approval manual approval
Keep or rollback
Promote or not?
11. Confidential 11
Problem #2: Too much manual effort in deployment validation
Solution: Leverage SLIs/SLOs not only for production SLA reporting but also for automating quality gates
12. Confidential 12
Learning from Google's SRE Practices
• Service Level Indicators (SLIs)
  • Definition: Measurable metrics as the base for evaluation
  • Example: Error rate of login requests
• Service Level Objectives (SLOs)
  • Definition: Binding targets for Service Level Indicators
  • Example: Login error rate must be less than 2% over a 30-day period
• Service Level Agreements (SLAs)
  • Definition: Business agreement between consumer and provider, typically based on SLOs
  • Example: Logins must be reliable & fast (Error Rate, Response Time, Throughput) 99% within a 30-day window
• Google Cloud YouTube Video
  • SLIs, SLOs, SLAs, oh my! (class SRE implements DevOps): https://www.youtube.com/watch?v=tEylFyxbDLE
SLIs drive SLOs which inform SLAs
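The relationship between an SLI and an SLO can be made concrete with the login example above. This is a minimal sketch: the `LoginStats` type and the request counts are hypothetical, and only the 2% threshold comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class LoginStats:
    """Hypothetical raw counts collected over the evaluation window."""
    total_requests: int
    failed_requests: int


# SLO from the slide: login error rate must be less than 2%.
SLO_MAX_ERROR_RATE_PERCENT = 2.0


def error_rate_percent(stats: LoginStats) -> float:
    """SLI: error rate of login requests, in percent."""
    if stats.total_requests == 0:
        return 0.0
    return 100.0 * stats.failed_requests / stats.total_requests


def meets_slo(stats: LoginStats) -> bool:
    """SLO check: the measured SLI must stay below the target."""
    return error_rate_percent(stats) < SLO_MAX_ERROR_RATE_PERCENT


healthy = LoginStats(total_requests=10_000, failed_requests=50)
degraded = LoginStats(total_requests=10_000, failed_requests=250)
print(error_rate_percent(healthy), meets_slo(healthy))
print(error_rate_percent(degraded), meets_slo(degraded))
```

The SLI is just a measurement; the SLO is the comparison against a target, which is what makes it usable as an automated gate.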
13. Confidential 13
Applying SRE Best Practices across the lifecycle
Production
Authentication Service
May 2020: 0.89s, 0.5%, 1000/s
June 2020: 0.61s, 2.5%, 1600/s
Service X
xxs, xx%, xx/s
yys, yy%, yy/s
Shift-Left: Continuous Delivery
Authentication Service
Commit #1, Commit #2, Commit #3, Commit #4
Service X
Quality Gates
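A quality gate that applies the same objectives to every build can be sketched as a small scoring function. This is an illustration under stated assumptions, not Keptn's evaluation algorithm: the scoring rule (share of objectives met), the pass and warning thresholds, the objective limits, and the reading of the sample values as response time and error rate are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical objectives: upper limits per SLI.
OBJECTIVES: Dict[str, float] = {
    "response_time_s": 1.0,
    "error_rate_percent": 2.0,
}


def evaluate(
    slis: Dict[str, float],
    objectives: Dict[str, float] = OBJECTIVES,
    pass_score: float = 90.0,
    warn_score: float = 75.0,
) -> Tuple[float, str]:
    """Score a build as the percentage of objectives met, then classify it."""
    met = sum(1 for name, limit in objectives.items() if slis[name] <= limit)
    score = 100.0 * met / len(objectives)
    if score >= pass_score:
        result = "pass"
    elif score >= warn_score:
        result = "warning"
    else:
        result = "fail"
    return score, result


# Sample measurements for two builds (values assumed for illustration).
build_a = {"response_time_s": 0.89, "error_rate_percent": 0.5}
build_b = {"response_time_s": 0.61, "error_rate_percent": 2.5}
print(evaluate(build_a))
print(evaluate(build_b))
```

The point of the sketch is that the same definition can gate a commit in the pipeline and report on a service in production.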
18. Confidential 18
Bonus Problem #3: Too much manual effort in incident troubleshooting
Solution: Leverage an event-driven approach for auto-remediation and SLIs/SLOs to validate the impact
19. Confidential 19
Keptn: Closed-Loop Remediation coming with Keptn 0.7
version: 0.2.0
kind: Remediation
metadata:
  name: remediation-ecommerce
spec:
  remediations:
  - problemType: Conversion Rate Dropped
    actionsOnOpen:
    - name: Scaling ReplicaSet by 1
      action: scaling
      values:
        increment: +1
    - name: Stop Ad Campaign
      action: googleadtoggle
      values:
        enable: off
        campaign: $campaignid
Problem: Conversion Rate Dropped
Get remediation action(s)
Execute remediation action(s)
Re-validate SLO/BLO
Escalate
Actions: 1. scaling, 2. Google Ad toggle
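The closed loop on this slide (get actions, execute, re-validate, escalate) can be sketched as a short function. This is a simplified illustration, not the Keptn remediation service: the `remediate` function, the simulated `state`, and the rule that only stopping the ad campaign restores the objective are assumptions made for the example. The action names follow the remediation file above.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, Callable[[], None]]


def remediate(actions: List[Action], slo_ok: Callable[[], bool]) -> Tuple[str, List[str]]:
    """Run actions in order, re-validating after each; escalate if none helps."""
    executed: List[str] = []
    for name, run in actions:
        run()
        executed.append(name)
        if slo_ok():
            return "resolved", executed
    return "escalate", executed


# Simulated environment (hypothetical).
state: Dict[str, object] = {"replicas": 2, "ads_enabled": True}


def scale_up() -> None:
    state["replicas"] = int(state["replicas"]) + 1


def stop_ad_campaign() -> None:
    state["ads_enabled"] = False


def slo_ok() -> bool:
    # Assumption for the demo: the objective recovers only once ads are off.
    return not state["ads_enabled"]


status, executed = remediate(
    [("scaling", scale_up), ("googleadtoggle", stop_ad_campaign)],
    slo_ok,
)
print(status, executed, state)
```

Re-validating after every action is what closes the loop: an action counts as successful only if the objective recovers, and escalation happens only when the list is exhausted.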
22. Confidential 22
What is Keptn?
Define application delivery and operations processes declaratively
Use predefined CloudEvents to separate the process from the tools
Easy way to integrate and switch between different tools
Blue/Green Deployments
Automated Quality Gates
Automated Operations
Standardized communication protocol
Keptn's uniform
www.keptn.sh
an event-based control plane for continuous delivery and automated operations for cloud-native applications
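To show what a CloudEvents-style message looks like, here is a minimal sketch that builds an envelope with the attributes the CloudEvents 1.0 specification requires (`specversion`, `id`, `source`, `type`) plus a few optional ones. The event type, source, and payload fields are hypothetical placeholders and are not Keptn's actual event names.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict


def make_cloudevent(event_type: str, source: str, data: Dict[str, Any]) -> Dict[str, Any]:
    """Build a CloudEvents 1.0 envelope as a plain dictionary."""
    return {
        "specversion": "1.0",                      # required
        "id": str(uuid.uuid4()),                   # required, unique per event
        "source": source,                          # required, identifies the producer
        "type": event_type,                        # required, what happened
        "time": datetime.now(timezone.utc).isoformat(),  # optional timestamp
        "datacontenttype": "application/json",     # optional
        "data": data,                              # event payload
    }


event = make_cloudevent(
    "example.delivery.deploy",      # hypothetical event type
    "/example/control-plane",       # hypothetical source
    {"artifact": "container1", "stage": "Dev", "strategy": "Blue/Green"},
)
print(json.dumps(event, indent=2))
```

Because every tool reads the same envelope, the control plane needs no tool-specific integration code.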
24. Questions & Answers