Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:Invent 2018
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Shift-Left SRE: Self-Healing with AWS Lambda
Andreas Grabner
Global Technology Lead & DevOps Activist
Dynatrace
DEV313-S
Agenda
1. Remediation use cases
2. PREVENT in CI/CD vs. Repair in PROD with AWS Lambda
3. “Auto-Remediation as Code” with Lambda
4. The “Unbreakable Delivery Pipeline”
Crash -> Restart
Full or slow disk -> Clean up
$ find ./my_dir -mtime +10 -type f -delete
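The same cleanup could also run from a Lambda function or a maintenance script. The following is a minimal Python sketch, not code from the talk: the function name `delete_old_files` and its parameters are hypothetical. It mirrors `find -mtime +10 -type f`, which matches regular files whose age in whole 24-hour periods is greater than 10.

```python
import time
from pathlib import Path


def delete_old_files(root, max_age_days=10, now=None):
    """Delete regular files under root whose age in whole days exceeds max_age_days.

    Mirrors `find root -mtime +N -type f -delete`. Returns the deleted paths.
    """
    now = time.time() if now is None else now
    deleted = []
    # Materialize the listing first so deletions do not disturb the directory walk.
    for path in list(Path(root).rglob("*")):
        # -type f matches regular files only, so skip symlinks and directories.
        if path.is_symlink() or not path.is_file():
            continue
        age_days = int((now - path.stat().st_mtime) // 86400)
        if age_days > max_age_days:
            path.unlink()
            deleted.append(path)
    return deleted
```

Calling `delete_old_files("./my_dir")` would remove the same files as the command above; a dry-run flag would be a sensible addition before pointing it at production disks.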
Bad configuration changes -> Revert
Low on resources -> Scale up
Overprovisioned after drop in traffic -> Scale down
Blue vs. Green -> Redirect traffic
BLUE
GREEN
End user impact -> Reverse Blue / Green
Deploy Blue Back to Green
List of remediation actions we discussed
• Process restarts
• Resource (for example, disk) cleanup
• Revert bad configuration changes
• Scale up
• Scale down
• Blue vs. Green switching
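One way to treat these actions "as code" is a lookup table from detected problem type to remediation action. The sketch below is illustrative only: the problem type names (`PROCESS_CRASHED`, `DISK_FULL`, and so on) and the handler functions are hypothetical placeholders, and each handler only describes the action instead of performing it.

```python
from typing import Callable, Dict


# Placeholder handlers: a real handler would call the platform API
# (restart a service, clean a disk, switch a load balancer, ...).
def restart_process(target: str) -> str:
    return f"restart process on {target}"


def clean_up_disk(target: str) -> str:
    return f"clean up disk on {target}"


def revert_configuration(target: str) -> str:
    return f"revert configuration change on {target}"


def scale_up(target: str) -> str:
    return f"scale up {target}"


def scale_down(target: str) -> str:
    return f"scale down {target}"


def switch_blue_green(target: str) -> str:
    return f"switch blue/green traffic for {target}"


# Hypothetical problem types mapped to the six actions listed above.
REMEDIATIONS: Dict[str, Callable[[str], str]] = {
    "PROCESS_CRASHED": restart_process,
    "DISK_FULL": clean_up_disk,
    "BAD_CONFIG_CHANGE": revert_configuration,
    "LOW_RESOURCES": scale_up,
    "OVERPROVISIONED": scale_down,
    "END_USER_IMPACT": switch_blue_green,
}


def remediate(problem_type: str, target: str) -> str:
    """Pick the remediation for a problem type, or fall back to escalation."""
    action = REMEDIATIONS.get(problem_type)
    if action is None:
        return f"no automated remediation for {problem_type}; escalate"
    return action(target)
```

Keeping the mapping in one table makes it easy to review which problems are handled automatically and which still page a human.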
Add key metrics from incidents to quality gates
[Diagram: Code / Config change -> CI/CD -> Staging -> CI/CD -> Production -> End users; an issue impacting SLAs -> add metric to quality gate]
Detect change in log behavior
• Use cases
• Are we logging too much? Did we turn on verbose logging by accident?
• Metrics
• Total log size
• Number of total and critical log messages
• How to query?
• For example: Using Amazon CloudWatch log filters
aws logs put-metric-filter \
  --log-group-name MyApp/access.log \
  --filter-name EventCount \
  --filter-pattern "" \
  --metric-transformations \
  metricName=MyAppEventCount,metricNamespace=MyNamespace,metricValue=1,defaultValue=0
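The same metric filter can be created from Python, for example inside a Lambda function. This is a sketch under assumptions: it uses the boto3 CloudWatch Logs client (`put_metric_filter`), which is a third-party dependency, and the helper names `build_metric_filter_request` and `put_metric_filter` are hypothetical. Building the request separately keeps it inspectable without AWS credentials.

```python
def build_metric_filter_request(log_group, filter_name, metric_name,
                                namespace, pattern=""):
    """Build keyword arguments for the CloudWatch Logs PutMetricFilter call.

    An empty pattern matches every log event, so the metric counts all events.
    """
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,
        "metricTransformations": [
            {
                "metricName": metric_name,
                "metricNamespace": namespace,
                "metricValue": "1",   # the API expects a string here
                "defaultValue": 0.0,  # emitted when no event matches
            }
        ],
    }


def put_metric_filter(request, client=None):
    """Send the request; imports boto3 lazily so the builder works without it."""
    if client is None:
        import boto3  # third-party; assumed to be installed and configured
        client = boto3.client("logs")
    return client.put_metric_filter(**request)
```

Usage would look like `put_metric_filter(build_metric_filter_request("MyApp/access.log", "EventCount", "MyAppEventCount", "MyNamespace"))`.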
Detect change in resource consumption
• Use cases
• Bad coding leads to higher costs?
• Metrics
• Memory usage
• Bytes sent/received
• Overall CPU
• CPU per transaction type
• How to query?
• Some through CloudWatch API
• Dynatrace Timeseries API
Detect change in dependencies
• Use cases
• Do we have new dependencies? On purpose?
• Are we connecting to the services we are supposed to connect to?
• How many container instances are required?
• Metrics
• Number of incoming / outgoing dependencies
• Number of instances running on
• How to query?
• Maybe CloudWatch API
• Dynatrace SmartScape API
Detect change in application exception handling
• Use cases
• Did we introduce new “hidden” exceptions?
• Metrics
• Total exceptions
• Exceptions by class & service
• How to query?
• Dynatrace Timeseries API
Detect change in performance behavior
• Use cases
• Are we jeopardizing our SLAs?
• Does load balancing work?
• Difference between canaries?
• Metrics
• Response time (percentiles)
• Throughput & perf per instance / canary
• How to query?
• Dynatrace Timeseries API
• Dynatrace SmartScape API
Detect change in error behavior
• Use cases
• New unexpected error conditions?
• Metrics
• HTTP Failure Rate
• JavaScript Error Rate
• Query through
• Real user monitoring (RUM) solution
• Dynatrace Timeseries API
Detect change in end user scenarios
• Use cases
• Average number of page requests per user increased?
• How does this impact resource and capacity requirements?
• Metrics
• Number of user interactions / session
• Page sizes, number of resources
• Query through
• RUM solution
• Dynatrace Timeseries API
List of metrics we just discussed
• Logging
• Total log size
• Number of total and critical log messages
• Resources
• Memory usage
• Bytes sent / received
• Overall CPU
• CPU per transaction type
• Dependencies
• Number of incoming / outgoing dependencies
• Number of instances running on
• Exceptions
• Total exceptions
• Exceptions by class & service
• Performance
• Response time (percentiles)
• Throughput & perf per instance
• Errors
• HTTP failure rate
• JavaScript error rate
• End user scenarios
• Number of user interactions / session
• Page sizes, number of resources
How to add this to our pipeline?
[Diagram: Code / Config change -> CI/CD -> Staging -> CI/CD -> Production -> End users]
Inspiration from Curtis Bray (re:Invent 2017)
Inspiration from Thomas Steinmaurer @ Dynatrace
“Performance Signature” for Build Nov 16
“Performance Signature” for Build Nov 17
“Performance Signature” for every build
“Multiple Metrics” compared to previous timeframe
Simple Regression Detection per metric
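The idea of a per-build signature with simple regression detection per metric can be sketched in a few lines of Python. This is an illustration, not the Dynatrace implementation: the function names, the choice of `avg` and `p90` as aggregates, and the 10% tolerance are assumptions, and the comparison treats higher values as worse.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def signature(samples_by_metric):
    """Aggregate raw samples into one signature (avg and p90) per metric."""
    return {
        metric: {"avg": statistics.fmean(values), "p90": percentile(values, 90)}
        for metric, values in samples_by_metric.items()
    }


def regressions(previous, current, tolerance=0.10):
    """List (metric, aggregate) pairs that rose more than tolerance build over build."""
    found = []
    for metric, prev_aggregates in previous.items():
        for aggregate, prev_value in prev_aggregates.items():
            if current[metric][aggregate] > prev_value * (1 + tolerance):
                found.append((metric, aggregate))
    return found
```

Comparing the signature of one build with the signature of the next then yields the list of metrics that regressed, which is the input a quality gate needs.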
“Build validation / Monitoring as code”
monspec.json
{
  ...
  "perfsignature" : [
    {
      "timeseries" : "com.dynatrace.builtin:service.responsetime",
      "aggregate" : "p90", // min, max, avg, sum, median, count, percentile
      "validate" : "upper", // upper or lower
      // "upperlimit" : 100, // Optional: Can be used to define a FIXED THRESHOLD
      // "lowerlimit" : 50, // Optional: Can be used to define a FIXED THRESHOLD
    },
    {
      "timeseries" : "com.dynatrace.builtin:service.failurerate",
      "aggregate" : "avg"
    },
    {
      "timeseries" : "com.dynatrace.builtin:service.requestspermin",
      "aggregate" : "count",
      "validate" : "lower"
    },
    {
      "smartscape" : "toRelationships:calls",
      "aggregate" : "count",
      "upperlimit" : 1 // Validate that we only call to the one backend service and nowhere else!
    }
  ],
Metrics: Which metrics, aggregation, upper/lower boundaries?
Dependencies: How many involved services?
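A validator for entries like these could look as follows. This Python sketch rests on an interpretation that the slide does not spell out: a fixed `upperlimit` / `lowerlimit` takes precedence when present, otherwise `"validate": "upper"` means the value must not exceed the previous build and `"validate": "lower"` means it must not fall below it. The function name `check_entry` is hypothetical.

```python
def check_entry(entry, current, previous=None):
    """Validate one perfsignature entry; returns (passed, reason)."""
    name = entry.get("timeseries") or entry.get("smartscape")

    # Fixed thresholds, when defined, decide the result on their own.
    if "upperlimit" in entry or "lowerlimit" in entry:
        if "upperlimit" in entry and current > entry["upperlimit"]:
            return False, f"{name}: {current} is above upper limit {entry['upperlimit']}"
        if "lowerlimit" in entry and current < entry["lowerlimit"]:
            return False, f"{name}: {current} is below lower limit {entry['lowerlimit']}"
        return True, f"{name}: within fixed limits"

    # Otherwise compare against the previous build in the declared direction.
    direction = entry.get("validate")
    if previous is not None and direction == "upper" and current > previous:
        return False, f"{name}: {current} is above previous value {previous}"
    if previous is not None and direction == "lower" and current < previous:
        return False, f"{name}: {current} is below previous value {previous}"
    return True, f"{name}: ok"
```

An entry without `validate` and without limits, such as the failure rate above, is only recorded and never fails the build in this sketch.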
Automate validation into AWS CodePipeline with Lambda
[Screenshot labels: pipeline action RegisterStagingValidation (Invoke, AWS Lambda); function registerDynatraceBuildValidation; parameters StagingToProduction,5,ApproveStaging; input Monspec]
Automate validation into AWS CodePipeline with Lambda
Staging: Register Build Validation!
registerDynatraceBuildValidation: adds build validation request (adds item)
Build validation request item
- Pipeline Information
- Monspec
- Timestamp + Timeframe
- Comparison Definition Name
- Action Name to Approve / Reject
validateBuildDynatraceWork: triggered by CloudWatch Events (e.g., 1 min)
Approves / rejects IF “In Progress” and if RegisterBuildValidation was called with that Action Name
Monspec from Amazon S3
Dynatrace entities & Timeseries REST API
- Resolves tags and gets list of entities
- Queries metrics for these entities
Updates build validation request
- Updated Monspec
- Updated status
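The approve / reject step at the end of this flow might be sketched as below. The sketch assumes the CodePipeline client call `put_approval_result` from boto3; the layout of the request item (`status`, `violations`, `pipelineName`, `stageName`, `actionName`, `approvalToken`) is hypothetical, as are the function names. The client is passed in, so the logic can be exercised without AWS access.

```python
def summarize(violations):
    """Build the summary text shown on the approval action."""
    if not violations:
        return "All monspec checks passed"
    return "Violations: " + "; ".join(violations)


def resolve_validation(request, codepipeline):
    """Approve or reject the pipeline action named in a finished validation request.

    Returns the status that was sent, or None if the request is not ready yet.
    """
    if request["status"] != "done":
        return None  # metrics not evaluated yet; wait for the next scheduled run
    status = "Rejected" if request["violations"] else "Approved"
    codepipeline.put_approval_result(
        pipelineName=request["pipelineName"],
        stageName=request["stageName"],
        actionName=request["actionName"],
        result={"summary": summarize(request["violations"]), "status": status},
        token=request["approvalToken"],
    )
    return status
```

In a scheduled Lambda function, `codepipeline` would be `boto3.client("codepipeline")` and the approval token would come from the pipeline state of the manual approval action.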
Build over build results pulled from Amazon DynamoDB
[Chart: build-over-build results; individual builds are labeled GoodBuild or BadBuild]
Inspiration from Beachbody
ChatOps
Erik Landsness, Beachbody
Problem evolution
Self-healing: Path to autonomous Ops
Auto-mitigate!
1 CPU exhausted? Add a new service instance to distribute load!
2 Exhausted connection pool? Increase pool size!
3 Caused by Canary Release? Redirect traffic to main canary!
Impact mitigated? Still ongoing? Escalate?
How to escalate? Update teams
• Inform #WebTeam about JavaScript issue on IE
• Push status update to inform our customers
• Inform Support about potential incoming user complaints!
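The mitigate, re-check, escalate loop on this slide can be written down as a small control function. This is a sketch only: `self_heal` and its callback parameters are hypothetical names, and a real implementation would wait for monitoring data between steps rather than re-checking immediately.

```python
def self_heal(mitigations, impact_mitigated, escalate):
    """Apply mitigations in order until the impact is gone; otherwise escalate.

    mitigations: list of (name, action) pairs, where action is a callable.
    impact_mitigated: callable returning True once the problem is resolved.
    escalate: callable receiving the names of the mitigations already tried.
    """
    applied = []
    for name, action in mitigations:
        action()
        applied.append(name)
        if impact_mitigated():
            return {"resolved": True, "applied": applied}
    # Nothing helped: hand over to humans with the context of what was tried.
    escalate(applied)
    return {"resolved": False, "applied": applied}
```

The escalation callback is where team updates would go, such as posting to a chat channel or pushing a status page update.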
“Auto-Remediation as Code” triggered by Dynatrace
#1: Push deployment information, e.g., CodeDeploy DeploymentId
#2: Calling Lambda via API Gateway (handleDynatraceProblemNotification)
#3: Uses Dynatrace Events API to pull CUSTOM_DEPLOYMENT events
#4: Redeploy previous revision
#5: Push comment to Dynatrace
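A handler for steps #2 to #5 might be structured as in the sketch below. It is not the talk's implementation: the payload fields (`ProblemID`, `State`, `ImpactedEntities`), the deployment record fields (`deploymentId`, `startTime`), and the three injected callables are all assumptions. The real Dynatrace and CodeDeploy calls would live inside those callables.

```python
import json


def handle_problem_notification(event, fetch_deployments, redeploy_previous, push_comment):
    """Handle a problem notification delivered through API Gateway.

    fetch_deployments(entities) -> list of deployment events for those entities
    redeploy_previous(deployment_id) -> triggers redeploy of the previous revision
    push_comment(problem_id, text) -> records what was done on the problem
    """
    body = json.loads(event.get("body") or "{}")
    problem_id = body.get("ProblemID")

    # Only act on newly opened problems; resolved notifications need no action.
    if body.get("State") != "OPEN":
        return {"statusCode": 200, "body": json.dumps({"action": "ignored"})}

    deployments = fetch_deployments(body.get("ImpactedEntities", []))
    if not deployments:
        push_comment(problem_id, "No recent deployment found; no automatic rollback.")
        return {"statusCode": 200, "body": json.dumps({"action": "none"})}

    # Assume the most recent deployment on the impacted entities is the suspect.
    latest = max(deployments, key=lambda d: d["startTime"])
    redeploy_previous(latest["deploymentId"])
    push_comment(problem_id, f"Rolled back deployment {latest['deploymentId']}.")
    return {
        "statusCode": 200,
        "body": json.dumps({"action": "rollback", "deploymentId": latest["deploymentId"]}),
    }
```

Passing the integrations in as callables keeps the decision logic testable and makes it explicit which external systems the function touches.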
Summary: Unbreakable cloud-native pipelines
Pipeline stages (CI/CD at each): Staging -> Approve staging -> Production -> Approve production
1 Pushes deployment info into Dynatrace entities
2 Compares builds and approves / rejects pipeline
3 Pushes deployment info into Dynatrace entities
4 Validates production and approves / rejects pipeline
5 Executes auto-remediating actions, e.g., roll-back
Build #17, Build #18
Thank you!
Andreas Grabner
twitter: @grabnerandi
email: andreas.grabner@dynatrace.com