Neptune : Re-thinking Incident Response Automation
- 2. Brief Intro: Myself & Neptune
Neptune.io © 2015
• Architected an incident response automation platform for AWS
• Founding team at Amazon S3, DynamoDB for 5 years
Strong engineering-heavy team
- 3. Agenda
• State of incident response automation today
• Our learnings from building such a platform for AWS/Neptune
• Best practices
• Intro to Neptune
• Examples : Incident response workflows
• Q/A
- 4. What is Incident Response?
How to handle incidents/outages?
Many more..
Alerts
- 6. 95% of Time To Recovery (TTR) is still manual today
Source: DevOps survey; VictorOps incident response
Alert
Troubleshooting: Triage | Investigate | Identify
Resolution
Documentation
73% 10% 5% 12%
Snapshots
• Graphs & metrics
• Logs
• Webpages
Service health checks
• Internal
• External
Host/App diagnostics
• “top”, “df -H” etc.
• Heap dumps/Stack traces
Runbooks
• On single/cluster of hosts
• Any script, any language
Cloud API/CLI actions
• Start/Stop/Reboot
• Scale up/down
Root-cause analysis & Audit
• Heap dumps
• Logs
• Graphs
Post-mortem
• History
• Diagnostics
- 7. What has changed?
• Automation, uptime, and agility : #1 priority for businesses
• e.g. People can’t imagine Gmail going down
• The number of servers, VMs, containers, and apps launched is exploding!
• Maintenance has become a huge burden
• 13 different tools for managing an app
• Difficult to track down the root cause and what’s going on where
• Cloud, dynamic environments => knowledge sharing is a problem
Typical incident takes 1-2 hours to diagnose & fix
- 8. Big companies built custom automation tools internally
FBAR : Facebook Auto Remediation Platform
“…It’s doing the work of approximately 200 sys admins…”
“We built one for Amazon Web Services!”
40-60% of alerts get fixed automatically without human intervention
The rest are diagnosed in minutes instead of hours
- 9. Key takeaways
• Uptime, automation, and agility are critical drivers for your business
• Incident response automation gives you:
• More uptime, better customer experience
• Reduction in MTTR
• Happier engineers
- 10. Maturity level of Incident Response Teams
@jpaulreed @kfinnbraun DevOps Enterprise Summit
- 11. 3 core pieces of an incident response platform
- 12. 1. Analytics
• Helps identify the top 20% of alerts that cause 80% of the pain
• Sorted by frequency and MTTR
• Capture:
• MTTA (mean time to acknowledgement)
• MTTR (mean time to resolution)
• Frequency of occurrence (#times a particular alert has occurred)
• Reporting + Auditing
• Audit all activity (both manual + automated)
• Leads to data-driven post mortems
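The three captured metrics can be computed from a plain incident log. The following is a minimal sketch, not Neptune's implementation: the record fields (`alert`, `fired`, `acked`, `resolved`) and the function name `summarize_alerts` are hypothetical, and both MTTA and MTTR are assumed to be measured from the time the alert fired.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def summarize_alerts(incidents):
    """Group incidents by alert name; compute frequency, MTTA and MTTR (minutes)."""
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc["alert"]].append(inc)

    rows = []
    for alert, items in grouped.items():
        # Assumption: both metrics are measured from the moment the alert fired.
        mtta = mean((i["acked"] - i["fired"]).total_seconds() for i in items) / 60
        mttr = mean((i["resolved"] - i["fired"]).total_seconds() for i in items) / 60
        rows.append({"alert": alert, "count": len(items),
                     "mtta_min": mtta, "mttr_min": mttr})

    # Most frequent alerts first; ties broken by the longest MTTR.
    rows.sort(key=lambda r: (r["count"], r["mttr_min"]), reverse=True)
    return rows


if __name__ == "__main__":
    t0 = datetime(2015, 1, 1, 12, 0)  # made-up sample data
    sample = [
        {"alert": "high_memory", "fired": t0,
         "acked": t0 + timedelta(minutes=5), "resolved": t0 + timedelta(minutes=65)},
        {"alert": "disk_full", "fired": t0,
         "acked": t0 + timedelta(minutes=2), "resolved": t0 + timedelta(minutes=20)},
    ]
    for row in summarize_alerts(sample):
        print(row)
```

Sorting by frequency and then MTTR puts the alerts that hurt most at the top, which is the list to automate first.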
- 13. 2. Context
• When an alert occurs:
• Gather context automatically from 13 different tools
• Monitoring tools, logging tools, health checks, dependent services
Use cases:
• High memory => capture top-10 memory hogs, memory usage graphs
• High app error rate => capture error rate, latency trends, app logs with 5xx errors
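One way to picture automatic context gathering is a table that maps each alert type to the collectors that should run for it. This is only an illustrative sketch: the snapshot layout, the collector names, and `gather_context` are hypothetical, and a real system would pull this data from monitoring and logging tools rather than an in-memory dictionary.

```python
def top_memory_hogs(snapshot, n=10):
    """Return the n processes using the most resident memory."""
    procs = sorted(snapshot["processes"], key=lambda p: p["rss_mb"], reverse=True)
    return procs[:n]


def server_error_logs(snapshot):
    """Return app log entries whose HTTP status is in the 5xx range."""
    return [e for e in snapshot["app_logs"] if 500 <= e["status"] <= 599]


# Hypothetical mapping from alert type to the context worth capturing.
COLLECTORS = {
    "high_memory": {"top_memory_hogs": top_memory_hogs},
    "high_error_rate": {"server_error_logs": server_error_logs},
}


def gather_context(alert_type, snapshot):
    """Run every collector registered for this alert type; unknown types yield {}."""
    collectors = COLLECTORS.get(alert_type, {})
    return {name: collect(snapshot) for name, collect in collectors.items()}


if __name__ == "__main__":
    snapshot = {  # made-up sample data
        "processes": [{"name": "java", "rss_mb": 900}, {"name": "sshd", "rss_mb": 10}],
        "app_logs": [{"status": 200, "msg": "ok"}, {"status": 503, "msg": "upstream down"}],
    }
    print(gather_context("high_memory", snapshot))
    print(gather_context("high_error_rate", snapshot))
```

Keeping the mapping as data means a new alert type only needs a new table entry, not new control flow.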
- 14. 3. Remediation
• When an alert occurs:
• If it’s a known alert => Run a remediation runbook
• Use cases:
• Process crashed => restart process
• Host is unpingable => restart up to 3 times and escalate if it still fails
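The "restart 3 times and escalate" use case is a bounded retry loop. The sketch below assumes three caller-supplied callables (`restart`, `is_healthy`, `escalate`); these names are hypothetical stand-ins for whatever a real runbook would invoke.

```python
def remediate(restart, is_healthy, escalate, max_attempts=3):
    """Restart up to max_attempts times; escalate to a human if still unhealthy."""
    for attempt in range(1, max_attempts + 1):
        restart()
        if is_healthy():
            return {"fixed": True, "attempts": attempt}
    # All automated attempts failed, so hand the incident to a person.
    escalate()
    return {"fixed": False, "attempts": max_attempts}


if __name__ == "__main__":
    # Simulated host that only comes back after the second restart.
    state = {"restarts": 0}

    def restart():
        state["restarts"] += 1

    def is_healthy():
        return state["restarts"] >= 2

    def escalate():
        print("paging on-call engineer")

    print(remediate(restart, is_healthy, escalate))
```

Capping the attempts keeps the automation from looping on a fault it cannot fix.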
- 15. Our learnings
• Automate simple things first
• Have checks in place to avoid cascading failures
• Don’t automatically fix when you don’t know root cause
• We started with more focus on remediation, but customers really wanted automated context gathering
• Customers were not at the maturity level we expected, though they’d like to be
• Security is of paramount importance
• Customers prefer vetted runbooks to running arbitrary scripts
• Use GitHub or Chef/Puppet recipes for runbooks (code reviewed/vetted)
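One possible form of the "checks to avoid cascading failures" mentioned above is a cap on how many automated actions may run in a time window, so that a bad rule cannot restart an entire fleet. This is a hedged sketch of that idea only; the class name `ActionGuard` and its parameters are hypothetical and not taken from the talk.

```python
import time
from collections import deque


class ActionGuard:
    """Allow at most max_actions automated actions per sliding time window."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the guard can be tested without sleeping
        self.timestamps = deque()

    def allow(self):
        now = self.clock()
        # Forget actions that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_actions:
            return False  # too many recent actions: stop and involve a human
        self.timestamps.append(now)
        return True


if __name__ == "__main__":
    guard = ActionGuard(max_actions=2, window_seconds=60)
    for host in ["web-1", "web-2", "web-3"]:  # made-up host names
        if guard.allow():
            print("restarting", host)
        else:
            print("limit reached, escalating instead of restarting", host)
```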
- 16. Neptune: Incident Response Automation-as-a-Service
IRA as a Service | Monitoring as a Service | Alerting as a Service
Existing tools just alert somebody, without any context or diagnostics
We provide diagnostics for unknown issues, and for known issues, we fix them automatically
- 17. Deployment Models
• SaaS Model - available today!
• GitHub/vetted runbooks
• On-premise AWS VPC deployment model – available today!
• Enterprise customers
• On-premise deployment model (roadmap)
- 18. Deep Dive: Architecture
[Architecture diagram. Labels: Event Queue, Policy-based Rule Engine, Action Queue, Neptune Web Service, dedicated queue per customer, publish action results, Neptune Agent, REST API-based, runbook repo, custom tool, read-only]
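The labels in the architecture slide suggest a flow from an event queue, through a policy-based rule engine, into a dedicated action queue per customer. The sketch below illustrates that reading only; the exact data flow is an assumption drawn from the labels, and the rule format, event fields, and function names are hypothetical.

```python
from queue import Queue

# Hypothetical policy rules: match on event fields, emit an action name.
RULES = [
    {"match": {"alert": "process_down"}, "action": "restart_process"},
    {"match": {"alert": "high_memory"}, "action": "collect_top_memory_hogs"},
]


def match_rule(event, rules):
    """Return the action of the first rule whose fields all match the event."""
    for rule in rules:
        if all(event.get(key) == value for key, value in rule["match"].items()):
            return rule["action"]
    return None


def route_events(event_queue, rules):
    """Drain the event queue and place matched actions on per-customer queues."""
    action_queues = {}
    while not event_queue.empty():
        event = event_queue.get()
        action = match_rule(event, rules)
        if action is None:
            continue  # no policy for this event, so nothing is automated
        customer_queue = action_queues.setdefault(event["customer"], Queue())
        customer_queue.put({"action": action, "host": event["host"]})
    return action_queues


if __name__ == "__main__":
    events = Queue()
    events.put({"customer": "acme", "alert": "process_down", "host": "web-1"})
    events.put({"customer": "acme", "alert": "unknown", "host": "web-2"})
    for customer, q in route_events(events, RULES).items():
        print(customer, q.get())
```

A separate queue per customer keeps one tenant's backlog of actions from delaying another's.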
- 19. Quick Demo
Use case 1: Auto-Remediation
Host-level alert: high memory
• Collect top-10 memory hogs
• Restart the process
Use case 2: Auto-Diagnosis
App-level alert: high error rate
• Collect graph snapshots, logs
• Run script on cluster of machines
- 22. You can get started in 10 min
1. Configure monitoring tool to send alerts to Neptune
2. Install a lightweight agent on a few servers
- 24. SaaS Model: Why are we secure?
• Go-based Agent:
• No dependencies (agent code is open source)
• Outbound access only
• No need to open any inbound firewall ports
• Agent is lightweight, dumb, and not chatty
• Sits idle unless there is something to do
• Consumes < 0.01% CPU, 20 MB memory
• Authentication:
• Leverage AWS STS token Auth: use temp credentials, rotate every 4 hours
• Neptune API_KEY
Refer to On-prem AWS VPC deployment model if SaaS doesn’t work for you
- 25. SaaS Model: You Control Runbooks
• All Runbooks stay within your firewall
• Runbooks are version controlled (e.g. Github)
• No one can edit your runbooks
• Even Agent has read-only access to runbook repository
Refer to On-prem AWS VPC deployment model if SaaS doesn’t work for you
Editor's notes
- Get started in 10 minutes