Site outages and incidents are par for the course at tech companies. Getting to the root cause often involves tense conversations and high-pressure situations. The war room serves as a dedicated space where the most critical team members work through these issues, but what happens when everyone gets in the room?
The goal is clear: restore site service as quickly as possible, but there can be many approaches to getting there. In this talk, Rashi Khurana will share best practices in leading teams to troubleshoot issues and navigate incident resolution. She will address communication strategies, people management, process implementation, and managing retrospective evaluations.
Besides uptime, we measure:
● Mean Time To Detect the issue - MTTD
○ the time between when the incident started and when we first became aware of it (got paged).
● Mean Time To Resolve the issue - MTTR
○ the time from when the incident was reported to when it was fully resolved.
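As a quick worked example (the timestamps below are made up, not from a real incident), both metrics fall straight out of the incident timeline; the "mean" comes from averaging these durations across incidents:

```python
from datetime import datetime

# Hypothetical incident timeline (illustrative timestamps only).
incident_started = datetime(2019, 6, 8, 1, 12)  # first customer-facing impact
first_page = datetime(2019, 6, 8, 1, 33)        # on-call gets paged
fully_resolved = datetime(2019, 6, 8, 4, 5)     # all dashboards back to green

mttd = first_page - incident_started  # detect: incident start -> first page
mttr = fully_resolved - first_page    # resolve: reported -> fully resolved

print(f"MTTD: {mttd}")  # 0:21:00
print(f"MTTR: {mttr}")  # 2:32:00
```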
Prepare for MTTD
● Monitoring and alerting (see the sketch after this list)
● Logging
● Service ownership
● PagerDuty on-calls
○ Triage or escalations
● Organize a war room
● Get the right crew online
● Traceroutes and similar developer tests
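The monitoring-and-alerting bullet above can be as simple as a synthetic check that pages the on-call when the health endpoint misbehaves. This is a minimal sketch: the URL and routing key are placeholders, and it assumes the PagerDuty Events API v2, but any paging integration works the same way.

```python
import requests  # third-party HTTP client

HEALTH_URL = "https://www.example.com/health"    # placeholder endpoint
PAGERDUTY_ROUTING_KEY = "YOUR_ROUTING_KEY_HERE"  # placeholder key

def page_oncall(summary: str) -> None:
    """Trigger an incident via the PagerDuty Events API v2 (assumed integration)."""
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": "synthetic-health-check",
                "severity": "critical",
            },
        },
        timeout=10,
    )

def check_site() -> None:
    """One synthetic check; run it on a schedule (cron, a monitoring agent, etc.)."""
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
        if resp.status_code != 200:
            page_oncall(f"Health check returned HTTP {resp.status_code}")
    except requests.RequestException as exc:
        page_oncall(f"Health check failed to connect: {exc}")

if __name__ == "__main__":
    check_site()
```

In practice this lives in your monitoring stack rather than a hand-rolled script; the point is only that detection should page a human without a human having to look.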
Prepare for MTTR
● Documentation and Runbooks
○ Set-up requirements like Okta,
SumoLogic, LDAP, etc.
○ Runbooks for oncall
● Skills and Training - “I got paged, now what?”
Welcome! You are a war room warrior!
○ On-call runbook walkthroughs
○ On-call expectations
The “Follow The Sun”
Approach
● Multiple tiers of respondents
● Tier 1 and escalations
● Set up your Service Operating Centers globally (see the routing sketch after this list)
● Training and documentation
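As a tiny sketch of what "follow the sun" routing looks like in code (the region names and shift hours here are purely illustrative; real rotations live in your paging tool):

```python
from datetime import datetime, timezone
from typing import Optional

# Illustrative 8-hour shifts in UTC.
SHIFTS = [
    (0, 8, "soc-apac"),    # 00:00-08:00 UTC -> APAC Service Operating Center
    (8, 16, "soc-emea"),   # 08:00-16:00 UTC -> EMEA
    (16, 24, "soc-amer"),  # 16:00-24:00 UTC -> Americas
]

def tier1_for(now: Optional[datetime] = None) -> str:
    """Return the Tier 1 team that should receive the first page right now."""
    hour = (now or datetime.now(timezone.utc)).hour
    for start, end, team in SHIFTS:
        if start <= hour < end:
            return team
    raise ValueError("shift table does not cover this hour")

print(tier1_for())  # e.g. "soc-emea" during a European afternoon
```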
Change Management
Changes are managed, not controlled.
Create a framework for frequent changes:
● SDLC full cycle includes change requests
● DevOps version of Change Control
● CI/CD and iterate frequently
● Changes are still logged (Jira)
● Easy to access and revisit (deployment markers; see the sketch after this list)
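As a sketch of the deployment-markers bullet: have the CI/CD pipeline record every deploy somewhere your dashboards can overlay it. This example assumes New Relic's REST v2 deployments endpoint purely for illustration; the API key and application ID are placeholders, and the same idea applies to any tool that supports markers.

```python
import requests

NEW_RELIC_API_KEY = "YOUR_API_KEY"  # placeholder
APPLICATION_ID = "123456"           # placeholder application id

def record_deployment_marker(revision: str, description: str) -> None:
    """Post a deployment marker so graphs can be correlated with releases."""
    requests.post(
        f"https://api.newrelic.com/v2/applications/{APPLICATION_ID}/deployments.json",
        headers={"X-Api-Key": NEW_RELIC_API_KEY},
        json={"deployment": {"revision": revision, "description": description}},
        timeout=10,
    )

# Called from the CI/CD pipeline right after a deploy, e.g.:
# record_deployment_marker("abc123", "JIRA-1234: checkout retry fix")  # hypothetical values
```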
Change Control
Change requires risk assessments at specific points in time.
Create a framework for risky changes:
● Risk profiles
● Conservative approvals
● Dev + Ops version of Change Control
Change Advisory
Board (CAB Approval)
Every change that is critical to all services, such as DNS routing changes or incoming proxy updates, goes through CAB approval.
Questions to ask (sketched as a simple gate after this list):
● Do we have a roll-back procedure?
● Does the plan include the time it takes to execute the roll-back?
● What services or products can it impact?
● Are any other changes scheduled around the
same time?
● Was change tested in pre-production?
● Is execution happening at peak customer hours?
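These questions can be encoded as a simple pre-approval gate. A minimal sketch, with the checklist fields purely illustrative:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChangeRequest:
    """Answers a requester brings to the CAB (illustrative fields)."""
    has_rollback_procedure: bool
    rollback_time_estimated: bool
    impacted_services_listed: bool
    conflicting_changes_checked: bool
    tested_in_preprod: bool
    scheduled_at_peak_hours: bool

def cab_concerns(change: ChangeRequest) -> List[str]:
    """Return open concerns; an empty list means nothing obvious blocks approval."""
    concerns = []
    if not change.has_rollback_procedure:
        concerns.append("No roll-back procedure documented.")
    if not change.rollback_time_estimated:
        concerns.append("Roll-back execution time not estimated.")
    if not change.impacted_services_listed:
        concerns.append("Impacted services/products not listed.")
    if not change.conflicting_changes_checked:
        concerns.append("Changes scheduled around the same time not checked.")
    if not change.tested_in_preprod:
        concerns.append("Change was not tested in pre-production.")
    if change.scheduled_at_peak_hours:
        concerns.append("Execution is scheduled during peak customer hours.")
    return concerns
```

The human judgment still happens in the CAB meeting; the gate just makes sure the basic questions were answered before the change gets there.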
Internal
Communication
Strategy
How do we communicate what is
going on?
● Accessible, repeatable email template (see the sketch after this list)
● Set up email groups for :from and :to
● Easy to read color coding
○ Red - Critical impact
○ Orange - Parts of critical flow impacted
○ Green - All back to normal
● Slack channel #warroom
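A minimal sketch of the repeatable template idea; the statuses mirror the color coding above, while the addresses and wording are placeholders:

```python
from enum import Enum

class Impact(Enum):
    RED = "Critical impact"
    ORANGE = "Parts of critical flow impacted"
    GREEN = "All back to normal"

def status_email(incident_id: str, impact: Impact, summary: str) -> str:
    """Render the internal status update from a fixed, easy-to-draft template."""
    return (
        f"To: ops-status@example.com\n"        # placeholder :to group
        f"From: incident-comms@example.com\n"  # placeholder :from group
        f"Subject: [{impact.name}] Incident {incident_id} status\n"
        f"\n"
        f"Status: {impact.value}\n"
        f"Summary: {summary}\n"
        f"Updates: #warroom on Slack\n"
    )

print(status_email("INC-042", Impact.ORANGE, "Fallback link is up; some flows degraded."))
```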
2. In the war room
Recap - What do we
have so far?
● Severity level is determined
● Communication is started
● There is an Incident Manager
● There is a Tech Recovery Manager
● Staff who were paged are present
● There is a decision maker
● Let’s look into the difficult part...
Impact Detection
● Is there customer impact? - SEV 0 (see the triage sketch after this list)
● Is the impact functionality-specific or
sitewide? - Dashboards
● Any changes in CAB that day?
● Is the impact persistent or intermittent?
● Am I able to reproduce the issue in
Production?
● Are customers starting to contact
Customer Care?
● What percentage of customers are
impacted?
● Am I able to reproduce the issue in QA?
(hint)
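One way to make the first few of these questions less stressful at 1:00 am is to encode them as a rough triage helper. The mapping below is illustrative, not a real severity policy:

```python
def initial_severity(customer_impact: bool, sitewide: bool, reproducible_in_prod: bool) -> str:
    """Rough first-pass severity from the opening impact-detection questions (illustrative)."""
    if customer_impact and sitewide:
        return "SEV 0"  # customers blocked across the site
    if customer_impact:
        return "SEV 1"  # customers impacted in a specific flow
    if reproducible_in_prod:
        return "SEV 2"  # visible in production but no customer impact yet
    return "SEV 3"      # monitor and investigate

print(initial_severity(customer_impact=True, sitewide=True, reproducible_in_prod=True))  # SEV 0
```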
Code that runs our
applications and
services
● Issue is siloed to my application.
○ Is the issue reproducible in QA?
● When was the last deployment?
● What part of the site is impacted, and was
there a code change in downstream
dependency?
● Are CPU, throughput, or memory trends erratic?
● Check memcache and DB connections for the application
● Are there any A/B tests running?
Infrastructure that
runs our application
● Includes: load balancers, KVMs, network, nodes, storage, Puppet, Chef, AWS and K8s, EMC, etc.
● Are multiple teams getting paged?
● Is the issue not reproducible in
DEV/QA?
● What is the common denominator for
the paged application?
● Are errors on a single route for the application, or has the overall error rate spiked? (see the sketch after this list)
● Check Network, Load Balancer graphs
● Check the dependency map view of New
Relic to see if there is something red.
● Catch-22: it's possible the traffic does not even reach us.
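A small sketch of the single-route-versus-sitewide question: aggregate recent requests by route and compare error rates. The request records are made up; in practice this is what your APM dashboards show you.

```python
from collections import defaultdict

# Hypothetical recent request log: (route, http_status).
requests_log = [
    ("/search", 200), ("/search", 200), ("/search", 500),
    ("/download", 200), ("/download", 200),
    ("/checkout", 502), ("/checkout", 502), ("/checkout", 200),
]

def error_rates_by_route(log):
    """Return {route: error_rate} so a single bad route stands out from a sitewide spike."""
    totals, errors = defaultdict(int), defaultdict(int)
    for route, status in log:
        totals[route] += 1
        if status >= 500:
            errors[route] += 1
    return {route: errors[route] / totals[route] for route in totals}

print(error_rates_by_route(requests_log))
# Roughly {'/search': 0.33, '/download': 0.0, '/checkout': 0.67}: errors are
# concentrated on /checkout rather than spread sitewide.
```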
Infrastructure as Code
● Best practices for code apply to
infrastructure:
○ code reviews
○ versioning
○ automation tests (see the sketch after this list)
○ e.g. Puppet, Chef, Ansible, Terraform, Helm charts, Jenkinsfiles, Dockerfiles
● Application teams own issues that are
infrastructural.
● Self-serve: you built it, you run it!
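A small sketch of the automation-tests bullet applied to infrastructure as code: a check that a hypothetical service config is sane before it can be applied, run in CI exactly like an application test. The config format and rules here are illustrative.

```python
from typing import List
import yaml  # PyYAML

REQUIRED_KEYS = {"replicas", "memory_limit_mb", "health_check_path"}

def validate_service_config(path: str) -> List[str]:
    """Return problems with the service config; an empty list means it may be applied."""
    with open(path) as fh:
        config = yaml.safe_load(fh) or {}

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - config.keys())]
    if config.get("replicas", 0) < 2:
        problems.append("replicas < 2: no redundancy if a node is lost")
    if not str(config.get("health_check_path", "")).startswith("/"):
        problems.append("health_check_path should be an absolute path")
    return problems

# In CI: fail the pipeline if validate_service_config("service.yaml") returns anything.
```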
“Not my issue.”
● Lead from behind
● Listen and gather information
● Be curious, probe from different angles
● Broader context - use your expertise to give
feedback
● Help with trivial tasks
● Moral support
● But don’t get in the way
Sometimes it’s not
your issue until it is.
● Application teams may be needed to restart
● Rebuild a lost image
● Verify post changes
● Lingering issues in the aftermath
● e.g. an Artifactory issue
Note the slip-ups
● Are monitoring thresholds set up correctly?
● Are we hearing from our customers before
we are aware of the issue?
● Was there a warning before the alert?
● Could this have been caught by an
automated test in pre-prod environment?
3. Post-incident and Ownership
It’s a learning
opportunity
Setting up a Postmortem or Root
Cause Analysis
● Postmortem presenters - owners
● Audience
● Knowledge sharing for the organization
● Details published in email
● When is a postmortem closed?
No Blame!
It was a sunny Saturday. Breakneck hike at Cold Springs, upstate NY, was more than I had anticipated. It was not just a hike; it was more like rock climbing, though it was a lot of fun. After 10 hours outside with multiple delayed trains, I get home by 8:00 pm.
Head swinging, I get ready to hit the hay by 10:30 and then…
I get paged
It's 1:00 am and my phone rings obnoxiously. There is a work incident. Scratching my head and mustering the strength to get out of bed, I look for my laptop, set it up, and log in.
I get my wits about me to understand that there is a manhole fire that has cut all access to our datacenter.
I am in warrior mode
My talk is a short presentation built around a real-life incident: the manhole fire. The talk and discussion will help you be the best warrior in similar situations - the war room warrior. My talk will cover:
Being prepared for the situation
Managing from the center of a tornado (in the situation)
(And what happens afterwards.) What you need to take away from each situation.
I am Rashi Khurana … PAUSE .. Vice President of Engineering at Shutterstock.
Shutterstock is a leading global technology company offering a creative platform, tools and services to manage high quality assets.
I lead the eCommerce site (where customers search and download images) and the Front-end Platform on which it's built.
For those of you who may not be familiar with the term war room: it's a term used in tech organizations for a place where people collaborate to resolve issues. Some companies call it the Service Operating Center.
Why is the situation when you are paged so critical?
Firstly, stress: we all know how it feels to be stressed out. And stress is subjective; a bear running towards you can be as stressful as a spider crawling towards you (if you are afraid of spiders, that is).
Secondly, let’s talk about Revenue and Customers -
Research shows that B2C business reached a $3 trillion market in 2016.
A more recent one shows that 38% of mobile customers find a website down when they come to it and
100% of them are turned away. It's like finding a locked store when you arrive.
That’s a hit not only to the customer experience and their loyalty when they find a competitor up at that time, it is a huge loss of revenue to the company.
And for customers it's actually more than just experience:
what if they are in the middle of a bank transaction that is time critical, or
it's an ad agency creating a project against deadlines using a site like Shutterstock.
So let's go back to the version of me scratching my head at 1:33 am, logged into the laptop. The next 2, 5, 15 minutes are critical for deciding the course to take, and many hours and days of preparation go into making those first few minutes the most effective and the least stressful.
What are some questions to ask to know whether you are dealing with these situations well?
What is the uptime of our websites? Do we recover quickly when an incident happens, or does it take us a long time every time we go down?
Is there on-call pager-duty fatigue in the company that is getting in the way of actually innovating and building new products? This can increase attrition as well.
Are we getting a lot of bad press because of instability?
Are we losing customers and staff?
When we prepare for war room situations, we are driving for these two metrics to be low: Mean Time To Detect the issue and Mean Time To Resolve it (this is where we spend the bulk of our time).
How can we get ready for a shorter MTTD?
Here is a video of what happened when healthcare.gov was launched with no monitoring. Millions of people in the United States went online to buy health insurance and the site was down.
Start at 7:57
Here is an example (not from the manhole fire but a similar incident) of a monitored system that sends you an incident page when it sees erratic behavior.
To control MTTD and MTTR, there are also other tactics that organizations apply. Since the manhole fire that I am talking about, we have set up a Tier 1 triage layer with 24-hour support, eyes on the glass.
The team sits at a global location, in this case India, and is the first line of support for production issues. They escalate and page the right teams when needed, freeing up engineering teams.
It's called the Follow the Sun approach since the sun is always up somewhere in the world. This is a cost investment and should be thought through.
This is where I asked my teams to spend a lot of time. They have documented in detail all the runbooks for our front-end applications. This comes in handy not just during incidents but also for onboarding new engineers.
When changes are introduced to our production environments, more often than not we spend time trying to keep change from happening, and put relatively less effort into ensuring that when (not if, but when) change happens, we manage it effectively.
In the Agile development world that we live in, we need to “welcome change”. Change Management is the process we put into place to deal with changes in general.
It ensures that all the things that should happen for a change actually happen. We achieve this through communication and collaboration: sprint planning, estimation, prioritization of work, changes in requirements from product owners or customers, demos, and documentation.
This enables us to repeat low risk changes with confidence since you are adhering to a framework.
Change Control, on the other hand, evaluates each change to ensure it's the right thing to do; during change control, authority is vested in individuals and boards to make decisions based on the risk profiles of changes.
For instance, extremely high-risk changes need sign-off from the technical functional head of the group requesting the change, and changes may be asked to be scheduled for after peak traffic hours.
Change Control is done to avoid the disruptive effect of changes. It is a subset of Change Management.
Change control can be achieved with a CAB meeting.
It's my favorite meeting since it gives us one more opportunity to ensure that high-risk changes are not disruptive and are organized. In this meeting we discuss all BIG changes that will be released to production in the next 24-48 hours!
My role in this meeting is to check that all changes from my teams are bullet-proof. Last Thursday we had a change we wanted to roll to prod and pre-prod on the same day, which meant pre-prod was not yet tested, so I kindly rejected it!
To avoid unnecessary stress for people who want to know what is going on but don't know whom to ask (and who might distract the task force), we need internal comms.
The comms need to be easily digestible, so we use color coding to communicate impact, and easy to draft, so we use a repeatable template.
Here is a sample operations email to the org from the manhole fire; we'll see in a bit how we mitigated from all services impacted down to a few (Red to Orange).
And the whole reason we are here is our customers, so external comms are crucial to let them know what is going on and that we are on it. We should prepare an external comms strategy.
Alright, this is the real deal. I'm online in the war room hangout, acting as a liaison to many parts of our engineering org. In the next 5 minutes I need
to assess the situation,
see if the right people are online and
understand what has already happened.
Now we’ll see how all our preparation comes in handy because back in my apartment at 1:35 am, there is a manhole fire and I need to start acting.
How many people here have been a part of those stressful situations where something was not working and they had to act quickly? And how many wish they acted differently or needed a resource that was not available to them during that stressful moment?
How many prepared to be ready for the next time?
https://kids.frontiersin.org/article/10.3389/frym.2017.00071
Let's get some definitions out of the way.
Using the grid above, we know the manhole fire has cut all access to the DC, that is, our 10GB connection between our data center and AWS. Our backup is a 1GB connection. The site is unresponsive; it is a severity 0.
My peer incident manager on-call is drafting the initial internal comms email (the sample we saw before)
So I see an Incident Manager is on call, I'm the Tech Recovery Manager, and all application and infrastructure teams that got alerted got paged; they are online. Our CTO is online as well; he will be our ultimate decision maker.
No further action has happened, and the first 5 minutes of this incident are over.
So going further..
Site is unresponsive for our customers.
I checked our New Relic monitoring dashboard and saw that it was not only my core application that was inaccessible; many brands were down too, including our internal admin site. That confirmed that the issue was external to the applications. Networking engineers were paged too.
I also checked whether any of the CAB changes approved for the day had gone out and could have caused the issue.
Our instant reaction: we have a 1GB backup; let's fall back.
With the fallback we thought we'd restore functionality, but we only recovered parts of our system, with extreme slowness, and there was a knock-on effect on other parts.
Stay with me; the next two or three slides are the only technical slides of the deck, and hopefully they will make sense at a high level.
None of the symptoms for the origin of the issue seemed to be in code and that was a relief.
Our understanding was that re-routing the network to the fallback should work, but it only brought back a slow website; we realized that we had overwhelmed our firewalls with a surge of traffic, and some things recovered while others did not.
We uncovered
that there were retries in the system that caused the traffic to surge,
and that due to the slowness some applications were not connecting to their downstreams, causing even more retries.
Thanks to all the documentation, we knew what the timeout settings for the applications were and where the retries were coming from.
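This is the classic failure mode of unbounded, immediate retries. As a minimal sketch of the safer pattern (names and limits are illustrative): cap the attempts, back off exponentially, and add jitter so thousands of clients don't retry in lockstep.

```python
import random
import time

def call_with_retries(call, max_attempts=3, base_delay=0.2, max_delay=5.0):
    """Call a downstream with bounded retries, exponential backoff, and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # give up; bubbling up beats amplifying the traffic surge
            # Exponential backoff (0.2s, 0.4s, 0.8s, ...) capped, with random jitter.
            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.5))

# Usage with a hypothetical downstream client that has its own timeout:
# result = call_with_retries(lambda: downstream_client.get("/licenses", timeout=2))
```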
Boundaries between infrastructure and application code are melting. In times of troubleshooting, it is also necessary to understand ownership around the edges.
Example: not from the manhole fire, but for an outage where an AWS pod/cluster hits its auto-scaling capacity, I have seen infrastructure folks get paged because the infrastructure is not scaling, even though that config and scaling code is owned by application teams. This understanding is very important for a lower MTTR. In the case of the manhole fire, we realized that this was not a config owned by application teams, and for complex configs like these, infrastructure teams need to be involved. We had the NetEng engineers.
There is a quote from Werner Vogels, CTO and Vice President at Amazon (a provider of DevOps toolsets): "You built it, you run it."
So this was not my team's application issue; it was my peer team's infrastructure issue. How could I still be useful?
Can I go back to sleep? NO!!!! I'm still a warrior! I still need to be supportive of the rest of the teams.
While the manhole fire was not the application teams' issue,
there were teams changing code to do whatever they could to bring back the site, in this case changing retry logic.
We needed different teams to check different experiences: we own sites like Bigstock, PremiumBeat for music, and Offset, and to ensure complete recovery, some teams' help was needed to confirm that all dashboards returned to health.
Offset, one of our sites, was still in the old DC and had a dependency on a legacy RPC connection; that team had to reboot that application even after the rest of the systems had recovered. The same applied to our internal admin site.
The hangout can sometimes become interesting with too many players, and that's another example of where the OCC can help provide that buffer.
Some services were not paged but the front ends were; are the alerts properly sequenced? Also, did we rush to the fallback without thinking of the repercussions?
The situation is resolved and I can go back to sleep, but I know the next day at work will bring a good retrospective.
We’ll do our due diligence to
understand the root cause of the failures and
its ripple effect on our infrastructure, and
learn and document lessons for future reference.
It is time for Root Cause Analysis or Postmortems
This is the meeting where most people are collected and calm since the storm is over.
Blameless Postmortems - Rules of Engagement
There are no punishments for systems that were not built as expected or for decisions that didn't work out in our favor. Instead, we dig deeper into what allowed vulnerabilities into the systems, where we missed catching it in our process, why a decision was taken, and what the thinking was behind it.
The best postmortems are ones where the person who made the mistake can speak to it, and to how not to make that mistake again.
These help make the technical and cultural environment safer.
Actually, this also makes me think: if someone cuts me off in traffic, instead of blaming them as a bad driver, I think about what may have caused them to do so. A blameless postmortem :)
https://www.cogneurosociety.org/fae_moran/
Some very important pieces of a PIR (post-incident review) are:
The 5 Whys is a strategy used in blameless postmortems to get to the root of an issue; instead of coming down to a person or team, it exposes a miss in the process or system that we can fix.
How was resolution achieved and what was the root cause? The 10GB backup was delayed and did not arrive in time for a successful backup.
Next steps -
Audit and remove all retries, timeouts, firewall overwhelming and more funnel
Lessons learned: we cannot control an external manhole fire, but we can build resiliency into our systems.
And in this case we realized that proper procedures were followed with the availability of resources at that time.
This concludes my talk. The next time I get paged after midnight to awaken my inner warrior:
I know I won't stress as much, because I am prepared for it.
I know what to do during the incident, and
I know my teams and I will always learn from it.