© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Shift-Left SRE: Self-Healing with AWS Lambda
Andreas Grabner
Global Technology Lead & DevOps Activist
Dynatrace
DEV313-S
Agenda
1. Remediation use cases
2. PREVENT in CI/CD vs. Repair in PROD with AWS Lambda
3. “Auto-Remediation as Code” with Lambda
4. The “Unbreakable Delivery Pipeline”
Crash -> Restart
Full or slow disk -> Clean up
$ find ./my_dir -mtime +10 -type f -delete
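A minimal Python sketch of the same cleanup, for cases where the remediation runs as a function rather than a shell command. The function name and the `max_age_days` parameter are illustrative assumptions, and the age cutoff only approximates `find -mtime +10`, which rounds file age down to whole days.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def delete_old_files(root: str, max_age_days: float,
                     now: Optional[float] = None) -> List[str]:
    """Delete regular files under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    deleted = []
    for path in Path(root).rglob("*"):
        # Skip directories and symlinks so only real files are removed.
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(str(path))
    return sorted(deleted)
```

Returning the list of deleted paths makes it easy to log what the remediation actually removed.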
Bad configuration changes -> Revert
Low on resources -> Scale up
Overprovisioned after drop in traffic -> Scale down
Blue vs. Green -> Redirect traffic
BLUE
GREEN
End user impact -> Reverse Blue / Green
Deploy Blue Back to Green
List of remediation actions we discussed
• Process restarts
• Resource (for example, disk) cleanup
• Revert bad configuration changes
• Scale up
• Scale down
• Blue vs. Green switching
Add key metrics from incidents to quality gates
[Diagram: Code / Config change → CI/CD (Staging) → CI/CD (Production) → (4) End users → (5) Issue impacting SLAs → (6) Add metric to quality gate]
Use cases and metrics we can “Shift-Left”!
Detect change in log behavior
• Use cases
  • Are we logging too much? Did we turn on verbose logging by accident?
• Metrics
  • Total log size
  • Number of total and critical log messages
• How to query?
  • For example: Using Amazon CloudWatch log filters
aws logs put-metric-filter \
  --log-group-name MyApp/access.log \
  --filter-name EventCount \
  --filter-pattern "" \
  --metric-transformations \
  metricName=MyAppEventCount,metricNamespace=MyNamespace,metricValue=1,defaultValue=0
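The same metric filter could also be registered from Python, for example inside a Lambda function. The sketch below assumes the boto3 CloudWatch Logs client and its `put_metric_filter` call; the helper names are illustrative, and the parameter builder is kept separate so it can be checked without AWS access.

```python
def metric_filter_params(log_group, filter_name, metric_name, namespace, pattern=""):
    """Build keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,  # an empty pattern matches every log event
        "metricTransformations": [{
            "metricName": metric_name,
            "metricNamespace": namespace,
            "metricValue": "1",    # publish 1 for each matching event
            "defaultValue": 0.0,   # published when no event matches
        }],
    }


def create_metric_filter(params):
    """Send the request; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder works without boto3 installed
    boto3.client("logs").put_metric_filter(**params)
```

Usage, with the values from the CLI example: `create_metric_filter(metric_filter_params("MyApp/access.log", "EventCount", "MyAppEventCount", "MyNamespace"))`.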
Detect change in resource consumption
• Use cases
  • Bad coding leads to higher costs?
• Metrics
  • Memory usage
  • Bytes sent / received
  • Overall CPU
  • CPU per transaction type
• How to query?
  • Some through CloudWatch API
  • Dynatrace Timeseries API
Detect change in dependencies
• Use cases
  • Do we have new dependencies? On purpose?
  • Are we connecting to the services we are supposed to connect to?
  • How many container instances are required?
• Metrics
  • Number of incoming / outgoing dependencies
  • Number of instances running on
• How to query?
  • Maybe CloudWatch API
  • Dynatrace SmartScape API
Detect change in application exception handling
• Use cases
  • Did we introduce new “hidden” exceptions?
• Metrics
  • Total exceptions
  • Exceptions by class & service
• How to query?
  • Dynatrace Timeseries API
Detect change in performance behavior
• Use cases
  • Are we jeopardizing our SLAs?
  • Does load balancing work?
  • Difference between canaries?
• Metrics
  • Response time (percentiles)
  • Throughput & perf per instance / canary
• How to query?
  • Dynatrace Timeseries API
  • Dynatrace SmartScape API
Detect change in error behavior
• Use cases
  • New unexpected error conditions?
• Metrics
  • HTTP failure rate
  • JavaScript error rate
• Query through
  • Real user monitoring (RUM) solution
  • Dynatrace Timeseries API
Detect change in end user scenarios
• Use cases
  • Average number of page requests per user increased?
  • How does this impact resource and capacity requirements?
• Metrics
  • Number of user interactions / session
  • Page sizes, number of resources
• Query through
  • RUM solution
  • Dynatrace Timeseries API
List of metrics we just discussed
• Logging
  • Total log size
  • Number of total and critical log messages
• Resources
  • Memory usage
  • Bytes sent / received
  • Overall CPU
  • CPU per transaction type
• Dependencies
  • Number of incoming / outgoing dependencies
  • Number of instances running on
• Exceptions
  • Total exceptions
  • Exceptions by class & service
• Performance
  • Response time (percentiles)
  • Throughput & perf per instance
• Errors
  • HTTP failure rate
  • JavaScript error rate
• End user scenarios
  • Number of user interactions / session
  • Page sizes, number of resources
How to add this to our pipeline?
[Diagram: Code / Config change → CI/CD (Staging) → CI/CD (Production) → (4) End users]
Inspiration from Curtis Bray (re:Invent 2017)
Inspiration from Thomas Steinmaurer @ Dynatrace
• “Performance Signature” for Build Nov 16
• “Performance Signature” for Build Nov 17
• “Performance Signature” for every build
• “Multiple Metrics” compared to prev timeframe
• Simple Regression Detection per metric
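A minimal sketch of what a per-metric regression check between two build signatures could look like. The nearest-rank percentile, the 10% tolerance, and the metric names are assumptions for illustration, not the method from the talk. This version only flags increases, so a "lower is worse" metric such as throughput would need the comparison reversed.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def detect_regressions(previous, current, tolerance=0.10):
    """Return {metric: relative increase} for metrics that grew beyond tolerance."""
    regressions = {}
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None or old == 0:
            continue  # nothing to compare against
        change = (new - old) / old
        if change > tolerance:
            regressions[metric] = round(change, 3)
    return regressions
```

Each build's signature is a plain dictionary of aggregated values, for example `{"responsetime_p90": percentile(samples, 90)}`, so signatures from consecutive builds can be stored and compared one metric at a time.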
“Build validation / Monitoring as code”
monspec.json
{
  ...
  "perfsignature" : [
    {
      "timeseries" : "com.dynatrace.builtin:service.responsetime",
      "aggregate" : "p90", // min, max, avg, sum, median, count, percentile
      "validate" : "upper", // upper or lower
      // "upperlimit" : 100, // Optional: Can be used to define a FIXED THRESHOLD
      // "lowerlimit" : 50, // Optional: Can be used to define a FIXED THRESHOLD
    },
    {
      "timeseries" : "com.dynatrace.builtin:service.failurerate",
      "aggregate" : "avg"
    },
    {
      "timeseries" : "com.dynatrace.builtin:service.requestspermin",
      "aggregate" : "count",
      "validate" : "lower"
    },
    {
      "smartscape" : "toRelationships:calls",
      "aggregate" : "count",
      "upperlimit" : 1 // Validate that we only call to the one backend service and nowhere else!
    }
  ],
Metrics: Which metrics, aggregation, upper/lower boundaries?
Dependencies: How many involved services?
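A sketch of how one `perfsignature` entry might be evaluated. The semantics are assumptions inferred from the comments in the spec: fixed limits are checked first, `"validate": "upper"` means the current value must not exceed the baseline build, and `"validate": "lower"` means it must not fall below it. The function name `check_entry` is hypothetical.

```python
def check_entry(entry, current, baseline=None):
    """Return (passed, reason) for one perfsignature entry."""
    # Fixed thresholds, when present, apply regardless of the baseline.
    if "upperlimit" in entry and current > entry["upperlimit"]:
        return False, f"{current} is above fixed upper limit {entry['upperlimit']}"
    if "lowerlimit" in entry and current < entry["lowerlimit"]:
        return False, f"{current} is below fixed lower limit {entry['lowerlimit']}"
    # Otherwise compare against the previous build, if one is available.
    direction = entry.get("validate")
    if baseline is not None:
        if direction == "upper" and current > baseline:
            return False, f"{current} is above baseline {baseline}"
        if direction == "lower" and current < baseline:
            return False, f"{current} is below baseline {baseline}"
    return True, "ok"
```

A build would pass the quality gate only if every entry in the list passes.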
Automate validation into AWS CodePipeline with Lambda
[Screenshot: CodePipeline action RegisterStagingValidation invokes the AWS Lambda function registerDynatraceBuildValidation; inputs shown are the Monspec and the parameter string StagingToProduction,5,ApproveStaging]
Staging: Register Build Validation!
registerDynatraceBuildValidation
• Adds a build validation request item containing:
  - Pipeline information
  - Monspec
  - Timestamp + timeframe
  - Comparison definition name
  - Action name to approve / reject
validateBuildDynatraceWork
• Triggered by CloudWatch Events (e.g., every 1 min)
• Resolves tags and gets the list of entities
• Queries metrics for these entities
• Updates the build validation request:
  - Updated Monspec
  - Updated status
• Approves / rejects the pipeline action if it is “In Progress” and RegisterBuildValidation was called with that action name
Inputs: Monspec from Amazon S3; Dynatrace entities & Timeseries REST API
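The worker's decision step could be sketched as below. The request item's field names (`pipeline`, `stage`, `approval_action`, `registered_at`, `timeframe_minutes`) are hypothetical, since the slide does not show the stored item. The returned dictionary follows the parameters of boto3's CodePipeline `put_approval_result`; the token would come from the in-progress approval action reported by `get_pipeline_state`.

```python
def is_due(request, now):
    """True once the request's monitoring timeframe has fully elapsed."""
    return now >= request["registered_at"] + request["timeframe_minutes"] * 60


def approval_result(request, violations, token):
    """Build put_approval_result arguments: reject if any check was violated."""
    status = "Rejected" if violations else "Approved"
    summary = "; ".join(violations) if violations else "All monspec checks passed"
    return {
        "pipelineName": request["pipeline"],
        "stageName": request["stage"],
        "actionName": request["approval_action"],
        "result": {"summary": summary, "status": status},
        "token": token,
    }
```

Keeping the decision separate from the AWS call lets the same logic be exercised locally before wiring it to `boto3.client("codepipeline").put_approval_result(**args)`.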
Build over build results pulled from Amazon DynamoDB
[Chart: build-over-build results, each build labeled GoodBuild or BadBuild]
Inspiration from Beachbody
ChatOps
Erik Landsness, Beachbody
Problem evolution
Self-healing: Path to autonomous Ops
Auto-mitigate!
1 CPU exhausted? Add a new service instance to distribute load!
2 Exhausted connection pool? Increase pool size!
3 Caused by canary release? Redirect traffic to main canary!
Impact mitigated? Still ongoing? Escalate?
How to escalate? Update teams:
• Inform #WebTeam about JavaScript issue on IE
• Push status update to inform our customers
• Inform Support about potential incoming user complaints!
• …
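The mitigate-then-escalate loop can be sketched in a few lines. All names here are illustrative: `actions` is an ordered list of (name, callable) pairs, `still_impacted` re-checks the problem after each action, and `notify` stands in for whatever channel updates the teams.

```python
def self_heal(actions, still_impacted, notify):
    """Run remediation actions in order; escalate if the impact persists."""
    for name, action in actions:
        action()
        if not still_impacted():
            notify(f"Impact mitigated by: {name}")
            return name
    notify("Escalating: all remediation actions tried, impact still ongoing")
    return None
```

Ordering the actions from least to most disruptive keeps the cheaper fixes, such as adding an instance, ahead of traffic redirection.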
“Auto-Remediation as Code” triggered by Dynatrace
#1: Push deployment information, e.g., CodeDeploy DeploymentId
#2: Calling Lambda via API Gateway: handleDynatraceProblemNotification
#3: Uses Dynatrace Events API to pull CUSTOM_DEPLOYMENT events
#4: Redeploy previous revision
#5: Push comment to Dynatrace
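Steps #3 to #5 could be sketched as follows. This is not the actual handleDynatraceProblemNotification code: the payload field `State`, the event fields `eventType`, `startTime`, and `customProperties`, and the property key `CodeDeploy.DeploymentId` are assumptions, and the Dynatrace and CodeDeploy calls are passed in as functions so the logic stays testable.

```python
def latest_deployment_id(events):
    """Return the deployment ID of the newest CUSTOM_DEPLOYMENT event, or None."""
    deployments = [e for e in events if e.get("eventType") == "CUSTOM_DEPLOYMENT"]
    if not deployments:
        return None
    newest = max(deployments, key=lambda e: e.get("startTime", 0))
    # Assumed layout: the pipeline stored the ID as a custom property in step #1.
    return newest.get("customProperties", {}).get("CodeDeploy.DeploymentId")


def handle_problem(problem, fetch_events, redeploy_previous, push_comment):
    """Roll back the most recent deployment for an open problem."""
    if problem.get("State") != "OPEN":
        return "ignored"
    deployment_id = latest_deployment_id(fetch_events(problem))  # step #3
    if deployment_id is None:
        push_comment(problem, "No recent deployment found; nothing to roll back")
        return "no-deployment"
    redeploy_previous(deployment_id)  # step #4
    push_comment(problem, f"Triggered rollback of deployment {deployment_id}")  # step #5
    return "rolled-back"
```

Writing the outcome back as a comment leaves an audit trail on the problem itself, which matches step #5.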
Summary: Unbreakable cloud-native pipelines
[Diagram: CI/CD pipeline with stages Staging → Approve staging → Production → Approve production, shown for Build #17 and Build #18]
1 Pushes deployment info into Dynatrace entities
2 Compares builds and approves / rejects pipeline
3 Pushes deployment info into Dynatrace entities
4 Validates production and approves / rejects pipeline
5 Executes auto-remediating actions, e.g., rollback
Thank you!
Andreas Grabner
twitter: @grabnerandi
email: andreas.grabner@dynatrace.com

Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Shift-Left SRE: Self-Healing with AWS Lambda Andreas Grabner Global Technology Lead & DevOps Activist Dynatrace D E V 3 1 3 - S
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Agenda 1. Remediation use cases 2. PREVENT in CI/CD vs. Repair in PROD with AWS Lambda 3. “Auto-Remediation as Code” with Lambda 4. The “Unbreakable Delivery Pipeline”
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Crash -> Restart
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Full or slow disk -> Clean up $ find ./my_dir -mtime +10 -type f - delete
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Bad configuration changes -> Revert
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Bad configuration changes -> Revert
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Low on resources -> Scale up
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Overprovisioned after drop in traffic -> Scale down
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Blue vs. Green -> Redirect traffic BLUE GREEN
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. End user impact -> Reverse Blue / Green Deploy Blue Back to Green
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. List of remediation action we discussed • Process restarts • Resource (for example, disk) cleanup • Revert bad configuration changes • Scale up • Scale down • Blue vs. Green switching
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Add key metrics from incidents to quality gates 1 2 3Staging Production CI CD CI CD Code / Config change 4 End users 5 Issue impacting SLAs6 Add metric to quality gate
  • 16. Use cases and metrics we can “Shift-Left”!
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Detect change in log behavior • Use cases • Are we logging too much? Did we turn on verbose logging by accident? • Metrics • Total log size • Number of total and critical log messages • How to query? • For example: Using Amazon CloudWatch log filters aws logs put-metric-filter --log-group-name MyApp/access.log --filter-name EventCount --filter-pattern "" --metric-transformations metricName=MyAppEventCount,metricNamespace=MyNamespace,metricValue=1,defaultValue=0
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Detect change in resource consumption • Use cases • Bad coding leads to higher costs? • Metrics • Memory usage • Bytes sent/received • Overall CPU • CPU per transaction type • How to query? • Some through CloudWatch API • Dynatrace Timeseries API
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Detect change in dependencies • Use cases • Do we have new dependencies? On purpose? • Are we connecting to the services we are supposed to connect? • How many container instances are required? • Metrics • Number of incoming / outgoing dependencies • Number of instances running on • How to query? • Maybe CloudWatch API • Dynatrace SmartScape API
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. • Use cases • Did we introduce new “hidden” exceptions? • Metrics • Total exceptions • Exceptions by class & service • How to query? • Dynatrace Timeseries API Detect change in application exception handling
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Detect change in performance behavior • Use case • Are we jeopardizing our SLAs? • Does load balancing work? • Difference between canaries? • Metrics • Response time (percentiles) • Throughput & perf per instance / canary • How to query • Dynatrace Timeseries API • Dynatrace SmartScape API
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Detect change in error behavior • Use cases • New unexpected error conditions? • Metrics • HTTP Failure Rate • JavaScript Error Rate • Query through • Real user monitoring (RUM) solution • Dynatrace Timeseries API
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Detect change in end user scenarios • Use cases • Average number of page requests per user increased? • How does this impact resource and capacity requirements? • Metrics • Number of user interactions / session • Page sizes, number of resources • Query through • RUM solution • Dynatrace Timeseries API
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. List of metrics we just discussed • Logging • Total log size • Number of total and critical log messages • Resources • Memory usage • Bytes sent / received • Overall CPU • CPU per transaction type • Dependencies • Number of incoming / outgoing dependencies • Number of instances running on • Exceptions • Total exceptions • Exceptions by class & service • Performance • Response time (percentiles) • Throughput & perf per instance • Errors • HTTP failure rate • JavaScript error rate • End user scenarios • Number of user interactions / session • Page sizes, number of resources
  • 25. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to add this to our pipeline? 1 2 3Staging Production CI CD CI CD Code / Config change 4 End users
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Inspiration from Curtis Bray (re:Invent 2017)
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Inspiration from Thomas Steinmaurer @ Dynatrace “Performance Signature” for Build Nov 16 “Performance Signature” for Build Nov 17 “Performance Signature” for every build “Multiple Metrics” compared to prev timeframe Simple Regression Detection per metric
  • 28. “Build validation / Monitoring as code”: monspec.json
      {
        ...
        "perfsignature" : [
          {
            "timeseries" : "com.dynatrace.builtin:service.responsetime",
            "aggregate" : "p90",    // min, max, avg, sum, median, count, percentile
            "validate" : "upper",   // upper or lower
            // "upperlimit" : 100,  // Optional: Can be used to define a FIXED THRESHOLD
            // "lowerlimit" : 50,   // Optional: Can be used to define a FIXED THRESHOLD
          },
          {
            "timeseries" : "com.dynatrace.builtin:service.failurerate",
            "aggregate" : "avg"
          },
          {
            "timeseries" : "com.dynatrace.builtin:service.requestspermin",
            "aggregate" : "count",
            "validate" : "lower"
          },
          {
            "smartscape" : "toRelationships:calls",
            "aggregate" : "count",
            "upperlimit" : 1        // Validate that we only call to the one backend service and nowhere else!
          }
        ],
      • Metrics: Which metrics, aggregation, upper/lower boundaries?
      • Dependencies: How many involved services?
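The per-metric check that a monspec like this describes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation shown in the talk: the function name `validate_signature`, the `tolerance` parameter, the default of `"upper"` when `validate` is missing, and the plain-dict shape of the measured values are all hypothetical. Only the keys `timeseries`, `smartscape`, `validate`, `upperlimit`, and `lowerlimit` come from the slide.

```python
from typing import Dict, List


def validate_signature(
    perfsignature: List[dict],
    current: Dict[str, float],
    previous: Dict[str, float],
    tolerance: float = 0.10,  # hypothetical: allowed relative drift vs. the previous build
) -> List[str]:
    """Return violation messages; an empty list means the build passes."""
    violations = []
    for entry in perfsignature:
        key = entry.get("timeseries") or entry.get("smartscape")
        value = current[key]

        # Fixed thresholds, when present, replace the build-over-build comparison.
        if "upperlimit" in entry or "lowerlimit" in entry:
            if "upperlimit" in entry and value > entry["upperlimit"]:
                violations.append(f"{key}: {value} above fixed limit {entry['upperlimit']}")
            if "lowerlimit" in entry and value < entry["lowerlimit"]:
                violations.append(f"{key}: {value} below fixed limit {entry['lowerlimit']}")
            continue

        baseline = previous.get(key)
        if baseline is None:
            continue  # no earlier build to compare against

        direction = entry.get("validate", "upper")  # assumption: default to "upper"
        if direction == "upper" and value > baseline * (1 + tolerance):
            violations.append(f"{key}: {value} regressed above baseline {baseline}")
        elif direction == "lower" and value < baseline * (1 - tolerance):
            violations.append(f"{key}: {value} dropped below baseline {baseline}")
    return violations
```

A build would be rejected whenever the returned list is non-empty; the messages could be stored alongside the build result for later comparison.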
  • 29. Automate validation into AWS CodePipeline with Lambda
      [Diagram labels: Invoke action RegisterStagingValidation; AWS Lambda function registerDynatraceBuildValidation; parameters "StagingToProduction,5,ApproveStaging"; Monspec]
  • 30. Automate validation into AWS CodePipeline with Lambda
      • Staging: Register Build Validation!
      • registerDynatraceBuildValidation adds a build validation request item containing:
          • Pipeline information
          • Monspec
          • Timestamp + timeframe
          • Comparison definition name
          • Action name to approve / reject
      • validateBuildDynatraceWork is triggered by CloudWatch Events (e.g., every 1 min):
          • Reads the Monspec from Amazon S3
          • Uses the Dynatrace entities & Timeseries REST API: resolves tags and gets the list of entities, then queries metrics for these entities
          • Updates the build validation request (updated Monspec, updated status)
          • Approves / rejects IF the action is “In Progress” and RegisterBuildValidation was called with that action name
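The register-then-poll flow on this slide can be sketched as follows. This is a simplified sketch, not the Lambda code from the talk: a plain Python list stands in for the request table, the `evaluate` and `approve` callables stand in for the Dynatrace queries and the CodePipeline approval, and the status values `waiting`, `passed`, and `failed` are hypothetical names.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class BuildValidationRequest:
    pipeline_name: str
    monspec: dict
    timestamp: float          # when the validation was registered (epoch seconds)
    timeframe_minutes: int    # how long to observe before validating
    comparison_name: str
    approval_action: str      # pipeline action to approve or reject
    status: str = "waiting"   # hypothetical states: waiting -> passed / failed
    violations: List[str] = field(default_factory=list)


def register_build_validation(table: list, pipeline_name: str, monspec: dict,
                              timeframe_minutes: int, comparison_name: str,
                              approval_action: str,
                              now: Optional[float] = None) -> BuildValidationRequest:
    """Add a validation request item; called once per pipeline run."""
    request = BuildValidationRequest(
        pipeline_name, monspec, time.time() if now is None else now,
        timeframe_minutes, comparison_name, approval_action)
    table.append(request)
    return request


def validate_pending(table: list,
                     evaluate: Callable[[BuildValidationRequest], List[str]],
                     approve: Callable[[str, str, bool], None],
                     now: Optional[float] = None) -> None:
    """Periodic worker: validate every request whose timeframe has elapsed."""
    now = time.time() if now is None else now
    for request in table:
        if request.status != "waiting":
            continue  # already decided
        if now < request.timestamp + request.timeframe_minutes * 60:
            continue  # observation timeframe not over yet
        request.violations = evaluate(request)
        request.status = "failed" if request.violations else "passed"
        approve(request.pipeline_name, request.approval_action, not request.violations)
```

Splitting registration from validation keeps the pipeline action short-lived: the worker can run on a schedule and only acts once the observation timeframe for a request has passed.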
  • 31. Automate validation into AWS CodePipeline with Lambda
  • 32. Build over build results pulled from Amazon DynamoDB
      [Chart: builds labeled GoodBuild or BadBuild]
  • 35. Inspiration from Beachbody ChatOps (Erik Landsness, Beachbody): Problem evolution
  • 36. Self-healing: Path to autonomous Ops
      • Auto-mitigate!
          1. CPU exhausted? Add a new service instance to distribute load!
          2. Exhausted connection pool? Increase pool size!
          3. Caused by Canary Release? Redirect traffic to main canary!
      • Impact mitigated?
      • Escalate? Still ongoing? How to escalate?
      • Update teams …
          • Inform #WebTeam about JavaScript issue on IE
          • Push status update to inform our customers
          • Inform Support about potential incoming user complaints!
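The mitigate-then-escalate flow on this slide can be sketched as a small dispatch table. This is an illustrative sketch only: `handle_problem`, the root-cause keys, and the `still_ongoing` and `notify` callables are hypothetical names, and real mitigations would call cloud or deployment APIs instead of plain Python functions.

```python
from typing import Callable, Dict, List


def handle_problem(
    root_cause: str,
    mitigations: Dict[str, Callable[[], None]],
    still_ongoing: Callable[[], bool],
    notify: Callable[[str], None],
) -> List[str]:
    """Try a known mitigation first; escalate if none exists or the impact persists."""
    steps = []
    action = mitigations.get(root_cause)
    if action is not None:
        action()  # e.g. add an instance, grow the pool, redirect traffic
        steps.append("mitigated")
    if action is None or still_ongoing():
        notify(f"Problem '{root_cause}' needs attention")
        steps.append("escalated")
    return steps
```

Keeping the mapping in a dictionary means new remediation actions can be added without touching the escalation logic.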
  • 37. “Auto-Remediation as Code” triggered by Dynatrace
      • #1: Push deployment information, e.g., CodeDeploy DeploymentId
      • #2: Calling Lambda (handleDynatraceProblemNotification) via API Gateway
      • #3: Uses Dynatrace Events API to pull CUSTOM_DEPLOYMENT events
      • #4: Redeploy previous revision
      • #5: Push comment to Dynatrace
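Steps #3 and #4 come down to picking the deployment that most likely caused the problem and the revision that preceded it. Below is a minimal sketch of that selection, assuming the events have already been fetched: the function name `pick_rollback` and the event fields `eventType`, `startTime`, and `deploymentId` are assumptions for illustration, not a confirmed response schema, and the actual redeploy call is left out.

```python
from typing import List, Optional


def pick_rollback(events: List[dict], problem_start: int) -> Optional[dict]:
    """Find the last deployment before the problem and the one to redeploy instead."""
    deployments = sorted(
        (e for e in events
         if e.get("eventType") == "CUSTOM_DEPLOYMENT" and e["startTime"] <= problem_start),
        key=lambda e: e["startTime"],
    )
    if len(deployments) < 2:
        return None  # no earlier revision known, so nothing to roll back to
    suspect, previous = deployments[-1], deployments[-2]
    return {"suspect": suspect["deploymentId"], "redeploy": previous["deploymentId"]}
```

Returning `None` when fewer than two deployments are known gives the caller a clear signal to escalate instead of attempting a rollback.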
  • 40. Summary: Unbreakable cloud-native pipelines
      [Pipeline diagram, steps 1 to 5: CI/CD stages for Staging, Approve staging, Production, Approve production; Build #17 and Build #18]
      • Pushes deployment into Dynatrace entities
      • Compares builds and approves / rejects pipeline
      • Pushes deployment info into Dynatrace entities
      • Validates production and approves / rejects pipeline
      • Executes auto-remediating actions, e.g., roll-back
  • 41. Thank you! Andreas Grabner, Twitter: @grabnerandi, email: andreas.grabner@dynatrace.com