Site outages and incidents are par for the course at tech companies. Getting to the root cause often involves tense conversations and high-pressure situations. The war room serves as a dedicated space where the most critical team members work through these issues, but what happens when everyone gets in the room?
The goal is clear: restore site service as quickly as possible, but there can be many approaches to getting there. In this talk, Rashi Khurana will share best practices in leading teams to troubleshoot issues and navigate incident resolution. She will address communication strategies, people management, process implementation, and managing retrospective evaluations.
Besides uptime, we measure:
● Mean Time To Detect the issue - MTTD
○ the time between when the incident started and when we first became aware of it (got paged).
● Mean Time To Resolve the issue - MTTR
○ the time from when the incident was reported to when it was fully resolved.
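As a quick worked example (the timestamps below are made up, not from a real incident), both metrics fall straight out of the incident timeline; the "mean" comes from averaging these durations across incidents:

```python
from datetime import datetime

# Hypothetical incident timeline (illustrative timestamps only).
incident_started = datetime(2019, 6, 8, 1, 12)  # first customer-facing impact
first_page = datetime(2019, 6, 8, 1, 33)        # on-call gets paged
fully_resolved = datetime(2019, 6, 8, 4, 5)     # all dashboards back to green

mttd = first_page - incident_started  # detect: incident start -> first page
mttr = fully_resolved - first_page    # resolve: reported -> fully resolved

print(f"MTTD: {mttd}")  # 0:21:00
print(f"MTTR: {mttr}")  # 2:32:00
```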
Prepare for MTTD
● Monitoring and alerting (see the sketch after this list)
● Logging
● Service ownership
● PagerDuty on-calls
○ Triage or escalations
● Organize a war room
● Get the right crew online
● Traceroutes and similar developer tests
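The monitoring-and-alerting bullet above can be as simple as a synthetic check that pages the on-call when the health endpoint misbehaves. This is a minimal sketch: the URL and routing key are placeholders, and it assumes the PagerDuty Events API v2, but any paging integration works the same way.

```python
import requests  # third-party HTTP client

HEALTH_URL = "https://www.example.com/health"    # placeholder endpoint
PAGERDUTY_ROUTING_KEY = "YOUR_ROUTING_KEY_HERE"  # placeholder key

def page_oncall(summary: str) -> None:
    """Trigger an incident via the PagerDuty Events API v2 (assumed integration)."""
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": "synthetic-health-check",
                "severity": "critical",
            },
        },
        timeout=10,
    )

def check_site() -> None:
    """One synthetic check; run it on a schedule (cron, a monitoring agent, etc.)."""
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
        if resp.status_code != 200:
            page_oncall(f"Health check returned HTTP {resp.status_code}")
    except requests.RequestException as exc:
        page_oncall(f"Health check failed to connect: {exc}")

if __name__ == "__main__":
    check_site()
```

In practice this lives in your monitoring stack rather than a hand-rolled script; the point is only that detection should page a human without a human having to look.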
Prepare for MTTR
● Documentation and Runbooks
○ Set-up requirements like Okta,
SumoLogic, LDAP, etc.
○ Runbooks for oncall
● Skills and Training - “I got paged, now what?”
Welcome! You are a war room warrior!
○ On-call runbook walkthroughs
○ On-call expectations
The “Follow The Sun”
Approach
● Multiple tiers of respondents
● Tier 1 and escalations
● Set up your Service Operating Centers globally (see the routing sketch after this list)
● Training and documentation
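As a tiny sketch of what "follow the sun" routing looks like in code (the region names and shift hours here are purely illustrative; real rotations live in your paging tool):

```python
from datetime import datetime, timezone
from typing import Optional

# Illustrative 8-hour shifts in UTC.
SHIFTS = [
    (0, 8, "soc-apac"),    # 00:00-08:00 UTC -> APAC Service Operating Center
    (8, 16, "soc-emea"),   # 08:00-16:00 UTC -> EMEA
    (16, 24, "soc-amer"),  # 16:00-24:00 UTC -> Americas
]

def tier1_for(now: Optional[datetime] = None) -> str:
    """Return the Tier 1 team that should receive the first page right now."""
    hour = (now or datetime.now(timezone.utc)).hour
    for start, end, team in SHIFTS:
        if start <= hour < end:
            return team
    raise ValueError("shift table does not cover this hour")

print(tier1_for())  # e.g. "soc-emea" during a European afternoon
```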
Change Management
Changes are managed, not controlled.
Create a framework for frequent changes:
● SDLC full cycle includes change requests
● DevOps version of Change Control
● CI/CD and iterate frequently
● Changes are still logged (Jira)
● Easy to access and revisit (deployment markers; see the sketch after this list)
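As a sketch of the deployment-markers bullet: have the CI/CD pipeline record every deploy somewhere your dashboards can overlay it. This example assumes New Relic's REST v2 deployments endpoint purely for illustration; the API key and application ID are placeholders, and the same idea applies to any tool that supports markers.

```python
import requests

NEW_RELIC_API_KEY = "YOUR_API_KEY"  # placeholder
APPLICATION_ID = "123456"           # placeholder application id

def record_deployment_marker(revision: str, description: str) -> None:
    """Post a deployment marker so graphs can be correlated with releases."""
    requests.post(
        f"https://api.newrelic.com/v2/applications/{APPLICATION_ID}/deployments.json",
        headers={"X-Api-Key": NEW_RELIC_API_KEY},
        json={"deployment": {"revision": revision, "description": description}},
        timeout=10,
    )

# Called from the CI/CD pipeline right after a deploy, e.g.:
# record_deployment_marker("abc123", "JIRA-1234: checkout retry fix")  # hypothetical values
```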
Change Control
Change requires risk assessments at specific points in time.
Create a framework for risky changes:
● Risk profiles
● Conservative approvals
● Dev + Ops version of Change Control
Change Advisory
Board (CAB Approval)
Every change that is critical to all services, such as DNS routing changes or incoming proxy updates, goes through CAB approval.
Questions to ask (sketched as a simple gate after this list):
● Do we have a roll-back procedure?
● Does the plan include the time it takes to execute the roll-back?
● What services or products can it impact?
● Are any other changes scheduled around the
same time?
● Was change tested in pre-production?
● Is execution happening at peak customer hours?
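These questions can be encoded as a simple pre-approval gate. A minimal sketch, with the checklist fields purely illustrative:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChangeRequest:
    """Answers a requester brings to the CAB (illustrative fields)."""
    has_rollback_procedure: bool
    rollback_time_estimated: bool
    impacted_services_listed: bool
    conflicting_changes_checked: bool
    tested_in_preprod: bool
    scheduled_at_peak_hours: bool

def cab_concerns(change: ChangeRequest) -> List[str]:
    """Return open concerns; an empty list means nothing obvious blocks approval."""
    concerns = []
    if not change.has_rollback_procedure:
        concerns.append("No roll-back procedure documented.")
    if not change.rollback_time_estimated:
        concerns.append("Roll-back execution time not estimated.")
    if not change.impacted_services_listed:
        concerns.append("Impacted services/products not listed.")
    if not change.conflicting_changes_checked:
        concerns.append("Changes scheduled around the same time not checked.")
    if not change.tested_in_preprod:
        concerns.append("Change was not tested in pre-production.")
    if change.scheduled_at_peak_hours:
        concerns.append("Execution is scheduled during peak customer hours.")
    return concerns
```

The human judgment still happens in the CAB meeting; the gate just makes sure the basic questions were answered before the change gets there.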
Internal
Communication
Strategy
How do we communicate what is
going on?
● Accessible, repeatable email template (see the sketch after this list)
● Set up email groups for :from and :to
● Easy to read color coding
○ Red - Critical impact
○ Orange - Parts of critical flow impacted
○ Green - All back to normal
● Slack channel #warroom
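A minimal sketch of the repeatable template idea; the statuses mirror the color coding above, while the addresses and wording are placeholders:

```python
from enum import Enum

class Impact(Enum):
    RED = "Critical impact"
    ORANGE = "Parts of critical flow impacted"
    GREEN = "All back to normal"

def status_email(incident_id: str, impact: Impact, summary: str) -> str:
    """Render the internal status update from a fixed, easy-to-draft template."""
    return (
        f"To: ops-status@example.com\n"        # placeholder :to group
        f"From: incident-comms@example.com\n"  # placeholder :from group
        f"Subject: [{impact.name}] Incident {incident_id} status\n"
        f"\n"
        f"Status: {impact.value}\n"
        f"Summary: {summary}\n"
        f"Updates: #warroom on Slack\n"
    )

print(status_email("INC-042", Impact.ORANGE, "Fallback link is up; some flows degraded."))
```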
2. In the war room
Recap - What do we
have so far?
● Severity level is determined
● Communication is started
● There is an Incident Manager
● There is a Tech Recovery Manager
● Staff who were paged are present
● There is a decision maker
● Let’s look into the difficult part...
Impact Detection
● Is there customer impact? - SEV 0 (see the triage sketch after this list)
● Is the impact functionality-specific or
sitewide? - Dashboards
● Any changes in CAB that day?
● Is the impact persistent or intermittent?
● Am I able to reproduce the issue in
Production?
● Are customers starting to contact
Customer Care?
● What percentage of customers are
impacted?
● Am I able to reproduce the issue in QA?
(hint)
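One way to make the first few of these questions less stressful at 1:00 am is to encode them as a rough triage helper. The mapping below is illustrative, not a real severity policy:

```python
def initial_severity(customer_impact: bool, sitewide: bool, reproducible_in_prod: bool) -> str:
    """Rough first-pass severity from the opening impact-detection questions (illustrative)."""
    if customer_impact and sitewide:
        return "SEV 0"  # customers blocked across the site
    if customer_impact:
        return "SEV 1"  # customers impacted in a specific flow
    if reproducible_in_prod:
        return "SEV 2"  # visible in production but no customer impact yet
    return "SEV 3"      # monitor and investigate

print(initial_severity(customer_impact=True, sitewide=True, reproducible_in_prod=True))  # SEV 0
```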
Code that runs our
applications and
services
● Issue is siloed to my application.
○ Is the issue reproducible in QA?
● When was the last deployment?
● What part of the site is impacted, and was
there a code change in downstream
dependency?
● Are CPU, throughput, or memory trends erratic?
● Check memcache and DB connections for the application
● Are there any A/B tests running?
Infrastructure that
runs our application
● Includes: load balancers, KVMs, network, nodes, storage, Puppet, Chef, AWS and K8s, EMC, etc.
● Are multiple teams getting paged?
● Is the issue not reproducible in
DEV/QA?
● What is the common denominator for
the paged application?
● Are errors on a single route for the application, or has the overall error rate spiked? (see the sketch after this list)
● Check Network, Load Balancer graphs
● Check the dependency map view of New
Relic to see if there is something red.
● Catch-22: it's possible the traffic does not even reach us.
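A small sketch of the single-route-versus-sitewide question: aggregate recent requests by route and compare error rates. The request records are made up; in practice this is what your APM dashboards show you.

```python
from collections import defaultdict

# Hypothetical recent request log: (route, http_status).
requests_log = [
    ("/search", 200), ("/search", 200), ("/search", 500),
    ("/download", 200), ("/download", 200),
    ("/checkout", 502), ("/checkout", 502), ("/checkout", 200),
]

def error_rates_by_route(log):
    """Return {route: error_rate} so a single bad route stands out from a sitewide spike."""
    totals, errors = defaultdict(int), defaultdict(int)
    for route, status in log:
        totals[route] += 1
        if status >= 500:
            errors[route] += 1
    return {route: errors[route] / totals[route] for route in totals}

print(error_rates_by_route(requests_log))
# Roughly {'/search': 0.33, '/download': 0.0, '/checkout': 0.67}: errors are
# concentrated on /checkout rather than spread sitewide.
```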
Infrastructure as Code
● Best practices for code apply to
infrastructure:
○ code reviews
○ versioning
○ automation tests (see the sketch after this list)
○ e.g. Puppet, Chef, Ansible, Terraform, Helm charts, Jenkinsfiles, Dockerfiles
● Application teams own issues that are
infrastructural.
● Self-serve: you built it, you run it!
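A small sketch of the automation-tests bullet applied to infrastructure as code: a check that a hypothetical service config is sane before it can be applied, run in CI exactly like an application test. The config format and rules here are illustrative.

```python
from typing import List
import yaml  # PyYAML

REQUIRED_KEYS = {"replicas", "memory_limit_mb", "health_check_path"}

def validate_service_config(path: str) -> List[str]:
    """Return problems with the service config; an empty list means it may be applied."""
    with open(path) as fh:
        config = yaml.safe_load(fh) or {}

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - config.keys())]
    if config.get("replicas", 0) < 2:
        problems.append("replicas < 2: no redundancy if a node is lost")
    if not str(config.get("health_check_path", "")).startswith("/"):
        problems.append("health_check_path should be an absolute path")
    return problems

# In CI: fail the pipeline if validate_service_config("service.yaml") returns anything.
```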
“Not my issue.”
● Lead from behind
● Listen and gather information
● Be curious, probe from different angles
● Broader context - use your expertise to give
feedback
● Help with trivial tasks
● Moral support
● But don’t get in the way
Sometimes it’s not
your issue until it is.
● Application teams may be needed to restart
● Rebuild a lost image
● Verify post changes
● Lingering issues in the aftermath
● e.g. an Artifactory issue
Note the slip-ups
● Are monitoring thresholds set up correctly?
● Are we hearing from our customers before
we are aware of the issue?
● Was there a warning before the alert?
● Could this have been caught by an
automated test in pre-prod environment?
3. Post-incident and Ownership
It’s a learning
opportunity
Setting up a Postmortem or Root
Cause Analysis
● Postmortem presenters - owners
● Audience
● Knowledge sharing for the organization
● Details published in email
● When is a postmortem closed?
No Blame!
It was a sunny Saturday. Breakneck hike at Cold Springs, upstate NY, was more than I had anticipated. It was not just a hike; it was more like rock climbing, though it was a lot of fun. After 10 hours outside with multiple delayed trains, I get home by 8:00 pm.
Head swinging, I get ready to hit the hay by 10:30 and then…
I get paged
It's 1:00 am and my phone rings obnoxiously. There is a work incident. Scratching my head and mustering the strength to get out of bed, I look for my laptop, set it up, and log in.
I get my wits about me to understand that there is a manhole fire that has cut all access to our datacenter.
I am in warrior mode
My talk is a short presentation built around a real-life incident: the manhole fire. The talk and discussion will help you be the best warrior in similar situations - the war room warrior. My talk will cover:
Being prepared for the situation
Managing from the center of a tornado (in the situation)
(And what happens afterwards.) What you need to take away from each situation.
I am Rashi Khurana … PAUSE .. Vice President of Engineering at Shutterstock.
Shutterstock is a leading global technology company offering a creative platform, tools and services to manage high quality assets.
I lead the eCommerce site (where customers search and download images) and the Front-end Platform on which it's built.
For those of you who may not be familiar with the term war room: it's a term used in tech organizations for a place where people collaborate to resolve issues. Some companies call it the Service Operating Center.
Why is the situation when you are paged so critical?
Firstly, stress: we all know how it feels to be stressed out. And stress is subjective; a bear running towards you can be as stressful as a spider crawling towards you (if you are afraid of spiders, that is).
Secondly, let’s talk about Revenue and Customers -
Research shows that B2C business reached a $3 trillion market in 2016.
A more recent one shows that 38% of mobile customers find a website down when they come to it and
100% of them are turned away. It's like finding a locked store when you arrive.
That’s a hit not only to the customer experience and their loyalty when they find a competitor up at that time, it is a huge loss of revenue to the company.
And for customers it's actually more than just experience:
what if they are in the middle of a bank transaction that is time critical, or
it's an ad agency creating a project against deadlines using a site like Shutterstock.
So let's go back to the version of me scratching my head at 1:33 am, logged into the laptop. The next 2, 5, 15 minutes are critical for deciding the course to take, and many hours and days of preparation go into making those first few minutes the most effective and the least stressful.
What are some questions to ask to know whether you are dealing with these situations well?
What is the uptime of our websites? Do we recover quickly when an incident happens, or does it take us a long time every time we go down?
Is there on-call pager-duty fatigue in the company that is getting in the way of actually innovating and building new products? This can increase attrition as well.
Are we getting a lot of bad press because of instability?
Are we losing customers and staff?
When we prepare for war room situations, we are driving for these two metrics to be low: Mean Time To Detect the issue and Mean Time To Resolve it (this is where we spend the bulk of our time).
How can we get ready for a shorter MTTD?
Here is a video of what happened when healthcare.gov was launched with no monitoring. Millions of people in the United States went online to buy health insurance and the site was down.
Start at 7:57
Here is an example (not from the manhole fire but a similar incident) of a monitored system that sends you an incident page when it sees erratic behavior.
To control MTTD and MTTR, there are also other tactics that organizations apply. Since the manhole fire that I am talking about, we have set up a Tier 1 triage layer with 24-hour support, eyes on the glass.
The team sits at a global location, in this case India, and is the first line of support for production issues. They escalate and page the right teams when needed, freeing up engineering teams.
It's called the Follow the Sun approach since the sun is always up somewhere in the world. This is a cost investment and should be thought through.
This is where I asked my teams to spend a lot of time. They have documented in detail all the runbooks for our front-end applications. This comes in handy not just during incidents but also for onboarding new engineers.
When changes are introduced to our production environments, more often than not we spend time trying to keep change from happening, and put relatively less effort into ensuring that when (not if, but when) change happens, we manage it effectively.
In the Agile development world that we live in, we need to “welcome change”. Change Management is the process we put into place to deal with changes in general.
It ensures that all the things that should happen for a change actually happen. We achieve this through communication and collaboration: sprint planning, estimation, prioritization of work, changes in requirements from product owners or customers, demos, and documentation.
This enables us to repeat low risk changes with confidence since you are adhering to a framework.
Change Control, on the other hand, evaluates each change to ensure it's the right thing to do; during change control, authority is vested in individuals and boards to make decisions based on the risk profiles of changes.
For instance, extremely high-risk changes need sign-off from the technical functional head of the group requesting the change, and changes may be asked to be scheduled for after peak traffic hours.
Change Control is done to avoid the disruptive effect of changes. It is a subset of Change Management.
Change control can be achieved with a CAB meeting.
It's my favorite meeting since it gives us one more opportunity to ensure that high-risk changes are not disruptive and are organized. In this meeting we discuss all BIG changes that will be released to production in the next 24-48 hours!
My role in this meeting is to check that all changes from my teams are bullet-proof. Last Thursday we had a change we wanted to roll to prod and pre-prod on the same day, which meant pre-prod was not yet tested, so I kindly rejected it!
To avoid unnecessary stress for people who want to know what is going on but don't know whom to ask (and who might distract the task force), we need internal comms.
The comms need to be easily digestible, so we use color coding to communicate impact, and easy to draft, so we use a repeatable template.
Here is a sample operations email to the org from the manhole fire; we'll see in a bit how we mitigated from all services impacted down to a few (Red to Orange).
And the whole reason we are here is our customers, so external comms are crucial to let them know what is going on and that we are on it. We should prepare an external comms strategy.
Alright, this is the real deal. I'm online in the war room hangout, acting as a liaison to many parts of our engineering org. In the next 5 minutes I need
to assess the situation,
see if the right people are online and
understand what has already happened.
Now we’ll see how all our preparation comes in handy because back in my apartment at 1:35 am, there is a manhole fire and I need to start acting.
How many people here have been a part of those stressful situations where something was not working and they had to act quickly? And how many wish they acted differently or needed a resource that was not available to them during that stressful moment?
How many prepared to be ready for the next time?
https://kids.frontiersin.org/article/10.3389/frym.2017.00071
Let's get some definitions out of the way.
Using the grid above, we know the manhole fire has cut all access to the DC, that is, our 10GB connection between our data center and AWS. Our backup is a 1GB connection. The site is unresponsive; it is a severity 0.
My peer incident manager on-call is drafting the initial internal comms email (the sample we saw before)
So I see an Incident Manager is on call, I'm the Tech Recovery Manager, and all application and infrastructure teams that got alerted got paged; they are online. Our CTO is online as well; he will be our ultimate decision maker.
No further action has happened, and the first 5 minutes of this incident are over.
So going further..
Site is unresponsive for our customers.
I checked our New Relic monitoring dashboard and saw that it was not only my core application that was inaccessible; many brands were down too, including our internal admin site. That confirmed that the issue was external to the applications. Networking engineers were paged too.
I also checked whether any of the CAB changes approved for the day had gone out and could have caused the issue.
Our instant reaction: we have a 1GB backup; let's fall back.
With the fallback we thought we'd restore functionality, but we only recovered parts of our system, with extreme slowness, and there was a knock-on effect on other parts.
Stay with me; the next two or three slides are the only technical slides of the deck, and hopefully they will make sense at a high level.
None of the symptoms for the origin of the issue seemed to be in code and that was a relief.
Our understanding was that re-routing the network to the fallback should work, but it only brought back a slow website; we realized that we had overwhelmed our firewalls with a surge of traffic, and some things recovered while others did not.
We uncovered
that there were retries in the system that caused the traffic to surge,
and that due to the slowness some applications were not connecting to their downstreams, causing even more retries.
Thanks to all the documentation, we knew what the timeout settings for the applications were and where the retries were coming from.
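This is the classic failure mode of unbounded, immediate retries. As a minimal sketch of the safer pattern (names and limits are illustrative): cap the attempts, back off exponentially, and add jitter so thousands of clients don't retry in lockstep.

```python
import random
import time

def call_with_retries(call, max_attempts=3, base_delay=0.2, max_delay=5.0):
    """Call a downstream with bounded retries, exponential backoff, and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # give up; bubbling up beats amplifying the traffic surge
            # Exponential backoff (0.2s, 0.4s, 0.8s, ...) capped, with random jitter.
            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.5))

# Usage with a hypothetical downstream client that has its own timeout:
# result = call_with_retries(lambda: downstream_client.get("/licenses", timeout=2))
```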
Boundaries between infrastructure and application code are melting. In times of troubleshooting, it is also necessary to understand ownership around the edges.
Example: not from the manhole fire, but for an outage where an AWS pod/cluster hits its auto-scaling capacity, I have seen infrastructure folks get paged because the infrastructure is not scaling, even though that config and scaling code is owned by application teams. This understanding is very important for a lower MTTR. In the case of the manhole fire, we realized that this was not a config owned by application teams, and for complex configs like these, infrastructure teams need to be involved. We had the NetEng engineers.
There is a quote from Werner Vogels, CTO and Vice President at Amazon (a provider of DevOps toolsets): "You built it, you run it."
So this was not my team's application issue; it was my peer team's infrastructure issue. How could I still be useful?
Can I go back to sleep? NO!!!! I'm still a warrior! I still need to be supportive of the rest of the teams.
While the manhole fire was not the application teams' issue,
there were teams changing code to do whatever they could to bring back the site, in this case changing retry logic.
We needed different teams to check different experiences: we own sites like Bigstock, PremiumBeat for music, and Offset, and to ensure complete recovery, some teams' help was needed to confirm that all dashboards returned to health.
Offset, one of our sites, was still in the old DC and had a dependency on a legacy RPC connection; that team had to reboot that application even after the rest of the systems had recovered. The same applied to our internal admin site.
The hangout can sometimes become interesting with too many players, and that's another example of where the OCC can help provide that buffer.
Some services were not paged but the front ends were; are the alerts properly sequenced? Also, did we rush to the fallback without thinking of the repercussions?
The situation is resolved and I can go back to sleep, but I know the next day at work will bring a good retrospective.
We’ll do our due diligence to
understand the root cause of the failures and
its ripple effect on our infrastructure, and
learn and document lessons for future reference.
It is time for Root Cause Analysis or Postmortems
This is the meeting where most people are collected and calm since the storm is over.
Blameless Postmortems - Rules of Engagement
There are no punishments for systems that were not built as expected or for decisions that didn't work out in our favor. Instead, we dig deeper into what allowed vulnerabilities into the systems, where we missed catching it in our process, why a decision was taken, and what the thinking was behind it.
The best postmortems are ones where the person who made the mistake can speak to it, and to how not to make that mistake again.
These help make the technical and cultural environment safer.
Actually, this also makes me think: if someone cuts me off in traffic, instead of blaming them as a bad driver, I think about what may have caused them to do so. A blameless postmortem :)
https://www.cogneurosociety.org/fae_moran/
Some very important pieces of a PIR (post-incident review) are:
The 5 Whys is a strategy used in blameless postmortems to get to the root of an issue; instead of coming down to a person or team, it exposes a miss in the process or system that we can fix.
How was resolution achieved and what was the root cause? The 10GB backup was delayed and did not arrive in time for a successful backup.
Next steps -
Audit and remove all retries, timeouts, firewall overwhelming and more funnel
Lessons learned: we cannot control an external manhole fire, but we can build resiliency into our systems.
And in this case we realized that proper procedures were followed with the availability of resources at that time.
This concludes my talk. The next time I get paged after midnight to awaken my inner warrior:
I know I won't stress as much, because I am prepared for it.
I know what to do during the incident, and
I know my teams and I will always learn from it.