Re-thinking Incident Response Automation
Kiran Gollu, Co-founder/CEO
Neptune.io © 2015
Brief Intro: Myself & Neptune
•  Architected an incident response automation platform for AWS
•  Founding team at Amazon S3, DynamoDB for 5 years
Strong engineering-heavy team
Agenda
•  State of incident response automation today
•  Our learnings from building such a platform for AWS/Neptune
•  Best practices
•  Intro to Neptune
•  Examples: incident response workflows
•  Q/A
What is Incident Response?
How to handle
incidents/outages?
Many more..
Alerts
Incident response automation is broken today!
Source: DevOps survey; VictorOps incident response
95% of Time To Recovery (TTR) is still manual today
Alert Troubleshooting
Triage | Investigate | Identify
Resolution Documentation
73% 10% 5% 12%
Snapshots
•  Graphs & metrics
•  Logs
•  Webpages
Service health checks
•  Internal
•  External
Host/App diagnostics
•  “top”, “df -H”, etc.
•  Heap dumps/Stack traces
Runbooks
•  On single/cluster of hosts
•  Any script, any language
Cloud API/CLI actions
•  Start/Stop/Reboot
•  Scale up/down
Root-cause analysis & Audit
•  Heap dumps
•  Logs
•  Graphs
Post-mortem
•  History
•  Diagnostics
What has changed?
•  Automation, uptime, and agility: #1 priority for businesses
•  e.g. People can’t imagine Gmail going down
•  The number of servers, VMs, containers, and apps launched is exploding!
•  Maintenance has become a huge burden
•  13 different tools for managing an app
•  Difficult to track down the root cause and what’s going on where
•  Cloud, dynamic environments => knowledge sharing is a problem
A typical incident takes 1-2 hours to diagnose & fix
Big companies built custom automation tools internally
FBAR: Facebook Auto Remediation Platform
“…It’s doing the work of approximately 200 sysadmins…”
“We built one for Amazon Web Services!”
40-60% of alerts get fixed automatically without human intervention
The rest are diagnosed in minutes instead of hours
Key takeaways
•  Uptime, automation, and agility are critical drivers for your business
•  Incident response automation gives you:
•  More uptime, better customer experience
•  Reduction in MTTR
•  Happier engineers
Maturity level of Incident Response Teams
@jpaulreed @kfinnbraun DevOps Enterprise Summit
3 core pieces of an incident response platform
1. Analytics
•  Helps identify the top 20% of alerts causing 80% of the pain
•  Sorted by frequency and MTTR
•  Capture:
•  MTTA (mean time to acknowledgement)
•  MTTR (mean time to resolution)
•  Frequency of occurrence (number of times a particular alert has occurred)
•  Reporting + Auditing
•  Audit all activity (both manual + automated)
•  Leads to data-driven post-mortems
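As a rough illustration of the metrics listed above, the following sketch computes MTTA, MTTR, and frequency per alert and ranks alerts by total time spent. The `Incident` record, its field names, and the ranking key (frequency times MTTR) are assumptions made for this example, not Neptune's data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; all timestamps are seconds on a common clock.
    alert: str
    triggered: float
    acknowledged: float
    resolved: float


def summarize(incidents):
    """Return one row per alert, worst offenders (frequency * MTTR) first."""
    groups = defaultdict(list)
    for inc in incidents:
        groups[inc.alert].append(inc)

    rows = []
    for alert, items in groups.items():
        rows.append({
            "alert": alert,
            "frequency": len(items),
            # MTTA: mean time from trigger to acknowledgement
            "mtta": mean(i.acknowledged - i.triggered for i in items),
            # MTTR: mean time from trigger to resolution
            "mttr": mean(i.resolved - i.triggered for i in items),
        })
    rows.sort(key=lambda r: r["frequency"] * r["mttr"], reverse=True)
    return rows
```

Sorting by the product of frequency and MTTR is one simple way to surface the small set of alerts that account for most of the on-call time; other rankings would work equally well.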
2. Context
•  When an alert occurs:
•  Gather context automatically from 13 different tools
•  Monitoring tools, logging tools, health checks, dependent services
Use cases:
•  High memory → capture top-10 memory hogs, memory usage graphs
•  High app error rate → capture error rate, latency trends, app logs with 5xx errors
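The two use cases above could be sketched as a lookup from alert type to a context collector. Everything here is hypothetical: the alert type names, the `snapshot` dictionary layout, and the assumption that a log line ends with its HTTP status code are chosen only to keep the example self-contained.

```python
def top_memory_hogs(processes, n=10):
    """Return the n processes using the most memory, largest first."""
    return sorted(processes, key=lambda p: p["rss_mb"], reverse=True)[:n]


def server_error_lines(log_lines):
    """Keep log lines whose last field is a 5xx status code (assumed format)."""
    kept = []
    for line in log_lines:
        fields = line.split()
        if fields and fields[-1].isdigit() and 500 <= int(fields[-1]) <= 599:
            kept.append(line)
    return kept


# Hypothetical mapping: alert type -> function that builds the context.
COLLECTORS = {
    "high_memory": lambda snap: {
        "top_memory_hogs": top_memory_hogs(snap["processes"]),
    },
    "high_error_rate": lambda snap: {
        "5xx_log_lines": server_error_lines(snap["log_lines"]),
    },
}


def gather_context(alert_type, snapshot):
    """Return context for a known alert type, or an empty dict otherwise."""
    collector = COLLECTORS.get(alert_type)
    return collector(snapshot) if collector else {}
```

In a real deployment the snapshot would come from the monitoring and logging tools rather than an in-memory dictionary.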
3. Remediation
•  When an alert occurs:
•  If it’s a known alert => Run a remediation runbook
•  Use cases:
•  Process crashed → restart process
•  Host is unpingable → restart up to 3 times and escalate if it still fails
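The "restart 3 times, then escalate" use case is a bounded retry loop. A minimal sketch, assuming the caller supplies `ping`, `restart`, and `escalate` callables (all hypothetical names; in practice they might wrap a cloud API call and a paging tool):

```python
def remediate_unpingable(host, ping, restart, escalate, max_restarts=3):
    """Restart host up to max_restarts times; escalate if it never recovers."""
    for attempt in range(1, max_restarts + 1):
        restart(host)
        if ping(host):
            return f"recovered after {attempt} restart(s)"
    # Automation has run out of attempts, so hand over to a human.
    escalate(host)
    return "escalated"
```

Capping the number of attempts keeps the runbook from looping forever on a host that a restart cannot fix.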
Our learnings
•  Automate simple things first
•  Have checks in place to avoid cascading failures
•  Don’t automatically fix when you don’t know root cause
•  We started with more focus on remediation, but customers really wanted automated context gathering
•  Customers were not at the maturity level we expected, though they’d like to be
•  Security is of paramount importance
•  Customers prefer vetted runbooks to running arbitrary scripts
•  Use GitHub or Chef/Puppet recipes for runbooks (code reviewed/vetted)
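One possible form of the "checks in place to avoid cascading failures" is a cap on how many automated actions may run within a time window. The sketch below is an assumption about what such a check could look like, not a description of Neptune's implementation; the class name and parameters are made up for illustration.

```python
import time
from collections import deque


class RemediationGuard:
    """Allow at most max_actions automated actions per sliding window."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window = window_seconds
        self.clock = clock          # injectable so the guard can be tested
        self.recent = deque()       # timestamps of recently allowed actions

    def allow(self):
        now = self.clock()
        # Forget actions that have aged out of the window.
        while self.recent and now - self.recent[0] >= self.window:
            self.recent.popleft()
        if len(self.recent) >= self.max_actions:
            return False            # too many recent actions: escalate instead
        self.recent.append(now)
        return True
```

A runbook would call `allow()` before each automated restart and fall back to paging a human when it returns `False`, so a fleet-wide problem does not trigger a fleet-wide restart.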
Neptune: Incident Response Automation-as-a-Service
IRA as a Service | Monitoring as a Service | Alerting as a Service
Existing tools just alert somebody, without any context or diagnostics
We provide diagnostics for unknown issues, and for known issues, we fix them automatically
Deployment Models
•  SaaS Model - available today!
•  GitHub/vetted runbooks
•  On-premise AWS VPC deployment model – available today!
•  Enterprise customers
•  On-premise deployment model (roadmap)
Deep Dive: Architecture
[Architecture diagram; labels:]
Event Queue
Policy-based Rule Engine
Action Queue
Neptune Web Service
Dedicated queue per customer
Publish action results
Neptune Agent
REST API-based
Runbook repo
Custom Tool
Read-only
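The diagram labels suggest a flow from an event queue through a policy-based rule engine into a dedicated action queue per customer. The sketch below assumes that reading; the rule format, event fields, and action names are hypothetical and serve only to make the flow concrete.

```python
import queue
from collections import defaultdict

# Hypothetical policies: (predicate on the event, action to enqueue).
RULES = [
    (lambda e: e["alert"] == "process_crashed", "restart_process"),
    (lambda e: e["alert"] == "high_memory", "collect_top_memory_hogs"),
]


def run_rule_engine(event_queue, action_queues):
    """Drain event_queue and enqueue matching actions per customer."""
    while True:
        try:
            event = event_queue.get_nowait()
        except queue.Empty:
            return
        for matches, action in RULES:
            if matches(event):
                # Each customer gets a dedicated action queue.
                action_queues[event["customer"]].put(
                    {"host": event["host"], "action": action}
                )


# Usage: one shared event queue, one action queue per customer.
events = queue.Queue()
events.put({"customer": "acme", "host": "web-1", "alert": "process_crashed"})
actions = defaultdict(queue.Queue)
run_rule_engine(events, actions)
```

An agent for a given customer would then poll only its own queue, run the named runbook, and publish the result back to the web service.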
Quick Demo
Use case 1: Auto-Remediation
Host-level alert – high memory
•  Collect top-10 memory hogs
•  Restart the process
Use case 2: Auto-Diagnosis
App-level alert – high error rate
•  Collect graph snapshots, logs
•  Run script on cluster of machines
Sample error rate incident today (before Neptune)
Sample error rate incident (after Neptune)
You can get started in 10 min
1.  Configure monitoring tool to send alerts to Neptune
2.  Install a lightweight agent on a few servers
Thanks!
Check out our 2-week free trial!
kiran@neptune.io
SaaS Model: Why are we secure?
•  Go-based Agent:
•  No dependencies (agent code is open source)
•  Outbound access only
•  No need to open any inbound firewall ports
•  Agent is lightweight, dumb, and not chatty
•  Sits idle unless there is something to do
•  Consumes < 0.01% CPU, 20 MB memory
•  Authentication:
•  Leverage AWS STS token Auth: use temp credentials, rotate every 4 hours
•  Neptune API_KEY
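The 4-hour rotation of temporary credentials can be sketched as a small cache that refetches once the credentials reach the rotation age. `fetch` is a hypothetical callable standing in for whatever obtains temporary credentials (for example, a request to AWS STS); the class and its structure are assumptions for illustration, not the agent's actual code.

```python
import time

ROTATION_SECONDS = 4 * 60 * 60  # rotate every 4 hours, as stated above


class RotatingCredentials:
    """Cache temporary credentials and refetch them when they are too old."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch          # hypothetical: returns fresh credentials
        self._clock = clock          # injectable for testing
        self._creds = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = (
            self._creds is None
            or now - self._fetched_at >= ROTATION_SECONDS
        )
        if expired:
            self._creds = self._fetch()
            self._fetched_at = now
        return self._creds
```

Because the credentials are short-lived, a leaked token is only useful until the next rotation.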
Refer to On-prem AWS VPC deployment model if SaaS doesn’t work for you
SaaS Model: You Control Runbooks
•  All Runbooks stay within your firewall
•  Runbooks are version controlled (e.g. GitHub)
•  No one can edit your runbooks
•  Even the agent has only read-only access to the runbook repository
Refer to On-prem AWS VPC deployment model if SaaS doesn’t work for you



Editor's notes

  1. Get started in 10 minutes