Neptune: Re-thinking Incident Response Automation

Re-thinking Incident Response Automation
Kiran Gollu, Co-founder/CEO
Neptune.io © 2015

Brief Intro: Myself & Neptune
•  Architected an incident response automation platform for AWS
•  Founding team at Amazon S3, DynamoDB for 5 years
Strong engineering-heavy team

Agenda
•  State of incident response automation today
•  Our learnings from building such a platform for AWS/Neptune
•  Best practices
•  Intro to Neptune
•  Examples: Incident response workflows
•  Q/A

What is Incident Response?
How to handle incidents/outages?
[Figure: "Alerts", "Many more.."]

Incident response automation is broken today!

Source: DevOps survey; VictorOps incident response
95% of Time To Recovery (TTR) is still manual today
Incident phases: Alert | Troubleshooting (Triage | Investigate | Identify) | Resolution | Documentation
Share of TTR shown on the slide: 73%, 10%, 5%, 12%
Snapshots
•  Graphs & metrics
•  Logs
•  Webpages
Service health checks
•  Internal
•  External
Host/App diagnostics
•  “top”, “df -H”, etc.
•  Heap dumps/Stack traces
Runbooks
•  On single/cluster of hosts
•  Any script, any language
Cloud API/CLI actions
•  Start/Stop/Reboot
•  Scale up/down
Root-cause analysis & Audit
•  Heap dumps
•  Logs
•  Graphs
Post-mortem
•  History
•  Diagnostics

What has changed?
•  Automation, uptime, and agility: #1 priority for businesses
•  e.g. People can’t imagine Gmail going down
•  The number of servers, VMs, containers, and apps launched is exploding!
•  Maintenance has become a huge burden
•  13 different tools for managing an app
•  Difficult to track down the root cause and what’s going on where
•  Cloud, dynamic environments => knowledge sharing is a problem
A typical incident takes 1-2 hours to diagnose & fix

Big companies built custom automation tools internally
FBAR: Facebook Auto Remediation Platform
“…It’s doing the work of approximately 200 sysadmins…”
“We built one for Amazon Web Services!”
40-60% of alerts get fixed automatically without human intervention
The rest are diagnosed in minutes instead of hours

Key takeaways
•  Uptime, automation, and agility are critical drivers for your business
•  Incident response automation gives you:
•  More uptime, better customer experience
•  Reduction in MTTR
•  Happier engineers

Maturity level of Incident Response Teams
[Figure: maturity levels of incident response teams. Source: @jpaulreed, @kfinnbraun, DevOps Enterprise Summit]

3 core pieces of an incident response platform

1. Analytics
•  Helps identify the top 20% of alerts causing 80% of the pain
•  Sorted by frequency and MTTR
•  Capture:
•  MTTA (mean time to acknowledgement)
•  MTTR (mean time to resolution)
•  Frequency of occurrence (#times a particular alert has occurred)
•  Reporting + Auditing
•  Audit all activity (both manual + automated)
•  Leads to data-driven post-mortems
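
A minimal sketch of the analytics above, assuming each incident record carries trigger, acknowledgement, and resolution timestamps. The `Incident` fields, the function names, and the "time lost" score (frequency x MTTR) are illustrative assumptions, not Neptune's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    alert: str            # alert name, e.g. "disk-full" (hypothetical)
    triggered_at: float   # seconds since epoch
    acked_at: float
    resolved_at: float


def alert_stats(incidents):
    """Return per-alert frequency, MTTA and MTTR (seconds), worst first."""
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc.alert].append(inc)

    rows = []
    for alert, items in grouped.items():
        rows.append({
            "alert": alert,
            "frequency": len(items),
            "mtta": mean(i.acked_at - i.triggered_at for i in items),
            "mttr": mean(i.resolved_at - i.triggered_at for i in items),
        })
    # Assumed "pain" score: total time lost = frequency x MTTR.
    rows.sort(key=lambda r: r["frequency"] * r["mttr"], reverse=True)
    return rows


def top_alerts(rows, pain_share=0.8):
    """Smallest prefix of sorted rows covering pain_share of total time lost."""
    total = sum(r["frequency"] * r["mttr"] for r in rows)
    picked, covered = [], 0.0
    for r in rows:
        if total == 0 or covered >= pain_share * total:
            break
        picked.append(r["alert"])
        covered += r["frequency"] * r["mttr"]
    return picked
```

Feeding `top_alerts(alert_stats(incidents))` with a few weeks of history would give a candidate list of alerts worth automating first.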

2. Context
•  When an alert occurs:
•  Gather context automatically from 13 different tools
•  Monitoring tools, logging tools, health checks, dependent services
Use cases:
•  High memory → capture top-10 memory hogs, memory usage graphs
•  High app error rate → capture error rate, latency trends, app logs with 5xx errors
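
The high-memory use case could look roughly like this on a Linux host. This is a sketch only: it assumes `ps -eo pid,rss,comm` is available, and the function names and the shape of the returned context are hypothetical.

```python
import subprocess


def top_memory_hogs(ps_output, n=10):
    """Parse `ps -eo pid,rss,comm` text; return the n largest by RSS (KiB)."""
    procs = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[1].isdigit():
            continue  # ignore malformed rows
        pid, rss_kib, command = parts
        procs.append({
            "pid": int(pid),
            "rss_kib": int(rss_kib),
            "command": command.strip(),
        })
    return sorted(procs, key=lambda p: p["rss_kib"], reverse=True)[:n]


def capture_memory_context(n=10):
    """Run ps on this host and attach the result to the alert context."""
    out = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {"alert": "high-memory", "top_processes": top_memory_hogs(out, n)}
```

Keeping the parser separate from the `ps` call makes it easy to test against captured output.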

3. Remediation
•  When an alert occurs:
•  If it’s a known alert => Run a remediation runbook
•  Use cases:
•  Process crashed → restart process
•  Host is unpingable → restart up to 3 times and escalate if it still fails
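
The "restart, then escalate" runbook can be sketched as a small retry loop. The three callables (`is_pingable`, `restart`, `escalate`) are placeholders for whatever health check, cloud API call, and paging tool are in use; they are assumptions, not part of any Neptune API.

```python
import time


def remediate_unpingable(is_pingable, restart, escalate,
                         max_restarts=3, wait_seconds=0.0):
    """Restart up to max_restarts times; escalate if the host never recovers.

    Returns the number of restarts performed.
    """
    for attempt in range(1, max_restarts + 1):
        restart()
        time.sleep(wait_seconds)  # give the host time to come back
        if is_pingable():
            return attempt
    escalate(f"host still unpingable after {max_restarts} restarts")
    return max_restarts
```

In practice `wait_seconds` would be long enough for a reboot to finish before the next ping.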

Our learnings
•  Automate simple things first
•  Have checks in place to avoid cascading failures
•  Don’t automatically fix when you don’t know root cause
•  We started with more focus on remediation, but customers really wanted automated context gathering
•  Customers were not at the maturity level we expected, though they’d like to be
•  Security is of paramount importance
•  Customers prefer vetted runbooks to running arbitrary scripts
•  Use GitHub or Chef/Puppet recipes for runbooks (code reviewed/vetted)
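
One possible "check in place" against cascading failures is a cap on how many automated actions may run in a time window, so a bad runbook cannot restart a whole fleet at once. This is an illustrative sketch; the class name and thresholds are assumptions, not something the deck describes.

```python
import time
from collections import deque


class RemediationGuard:
    """Allow at most max_actions automated actions per sliding window."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock          # injectable for testing
        self._recent = deque()      # timestamps of allowed actions

    def allow(self):
        now = self.clock()
        # Forget actions that have aged out of the window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_actions:
            return False            # over budget: hand off to a human
        self._recent.append(now)
        return True
```

A runbook runner would call `guard.allow()` before each automated fix and escalate instead of acting when it returns `False`.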

Neptune: Incident Response Automation-as-a-Service
IRA as a Service | Monitoring as a Service | Alerting as a Service
Existing tools just alert somebody, without any context or diagnostics
We provide diagnostics for unknown issues, and for known issues, we fix them automatically

Deployment Models
•  SaaS Model - available today!
•  GitHub/vetted runbooks
•  On-premise AWS VPC deployment model – available today!
•  Enterprise customers
•  On-premise deployment model (roadmap)

Deep Dive: Architecture
[Figure: architecture diagram. Labels: Event Queue; Policy-based Rule Engine; Action Queue; Neptune Web Service; Dedicated queue per customer; Publish action results; Neptune Agent; REST API-based; Runbook repo; Custom Tool; Read-only]
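
One way to read the diagram labels is as a pipeline: events enter a queue, a policy-based rule engine matches them, and actions land on a dedicated per-customer queue for the agent to pick up. The sketch below illustrates that reading only; the flow, the policy table, and all names are assumptions, since the diagram itself did not survive extraction.

```python
import queue

# Hypothetical policy table: alert name -> runbook to run.
POLICIES = {
    "process-crashed": "restart_process.sh",
    "high-memory": "collect_memory_hogs.sh",
}


def run_rule_engine(event_queue, action_queues, policies=POLICIES):
    """Drain event_queue and route matched events to per-customer queues.

    Returns the events that matched no policy (unknown alerts).
    """
    unmatched = []
    while True:
        try:
            event = event_queue.get_nowait()
        except queue.Empty:
            break
        runbook = policies.get(event["alert"])
        if runbook is None:
            unmatched.append(event)
            continue
        # Each customer gets its own dedicated action queue.
        customer_queue = action_queues.setdefault(event["customer"], queue.Queue())
        customer_queue.put({"host": event["host"], "runbook": runbook})
    return unmatched
```

An agent would then poll only its own customer's queue, which fits the outbound-only access model described later in the deck.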

Quick Demo
Use case 1: Auto-Remediation
Host-level alert: high memory
•  Collect top-10 memory hogs
•  Restart the process
Use case 2: Auto-Diagnosis
App-level alert: high error rate
•  Collect graph snapshots, logs
•  Run script on cluster of machines

You can get started in 10 min
1.  Configure your monitoring tool to send alerts to Neptune
2.  Install a lightweight agent on a few servers

SaaS Model: Why are we secure?
•  Go-based Agent:
•  No dependencies (agent code is open source)
•  Outbound access only
•  No need to open any inbound firewall ports
•  Agent is lightweight, dumb, and not chatty
•  Sits idle unless there is something to do
•  Consumes < 0.01% CPU, 20 MB memory
•  Authentication:
•  Leverages AWS STS token auth: temporary credentials, rotated every 4 hours
•  Neptune API_KEY
Refer to the on-prem AWS VPC deployment model if SaaS doesn’t work for you

SaaS Model: You Control Runbooks
•  All runbooks stay within your firewall
•  Runbooks are version controlled (e.g. GitHub)
•  No one can edit your runbooks
•  Even the agent has read-only access to the runbook repository
Refer to the on-prem AWS VPC deployment model if SaaS doesn’t work for you
