8. It Takes Buy-In At All Levels
DevOps can’t just go it alone
Every business function has something to contribute
Messaging needs to be accurate, but also accessible
9. Agree on When, and How Often
Generally: Communicate as soon as possible
Some issues need to be contained before they become public
Set a regular cadence for updates
10. Agree on What Gets Communicated
Help your users make intelligent choices and you’ll like their decisions better
Provide context and realistic ETAs
Use Incident Templates to ease messaging
11. Agree on Who Manages the Message
Speak with a single, consistent voice
16. The CC Process Can Help During the Firefight
Context
Insight
Data
17. Applied DevOps: Retrospect on the CC Process
• What went well?
• What could have gone better?
• Action items:
• Continue doing...
• Stop doing...
• Start doing...
18. Applied DevOps: X-Functional Post-Mortems
What was the technical impact?
What was the business impact?
What was the reputational impact?
20. Key Takeaways
• Your failures will become public
• Who controls when and how the information gets out: you, or someone else?
• Having a plan makes crisis communication less of a burden on the people fighting the fire
• Doing it right will build trust and loyalty with your users
Editor's Notes
Hello, and thanks for joining us today!
I’m Mike Merideth, the Sr. Director of IT at VictorOps in Boulder, and I’m here with Blake Thorne, the Content Director at StatusPage. We’re here to talk about crisis communication, and some of the lessons and best practices that we’ve learned over the years here at VictorOps.
At the end of the presentation we’ll have some time for Q and A, so please enter any questions you have into the chat window and we’ll get to as many as we can.
VictorOps doesn’t just take an academic interest in crisis communication. It’s an important part of our DevOps culture. Part of taking ownership of operations is knowing your users and engaging with them to make their experience better.
It’s also important to the business as a whole. It's part of how we keep our brand promise, that we're making life better for our users, and that we won’t let them down in a crisis.
We promote a culture of open communication internally, and that’s how we want it with our customers too. After all, showing trust and respect is a good way to generate trust and respect.
[BLAKE] Customers deserve to hear from you first. We’ve found this makes the difference with crisis communication -- if you can tell your team and customers before they find out on their own, it’s a much better customer experience. It’s actually a completely different experience.
This hasn't always been a core competency for us at VictorOps
In our alpha and beta phases, we leaned on informal channels of communication that don’t really scale. A phone call can have a nice personal touch to it, but once you have hundreds or thousands of customers, you need to be able to broadcast the message.
As we came out of beta we knew we needed to step up our game, so we made a conscious effort to do so. This meant thinking about process, and thinking about channels of communication.
One of the first and easiest things we did was to sign up for StatusPage. Their product is a great fit for our culture, since it removes barriers between users that need up-to-date information and the technical people who are in a position to provide it.
Crisis Communication is a work in progress for us, and it always will be
We retrospect, we learn something and we adjust the process after almost every event. Small, iterative changes can make a big difference in how efficiently a process runs.
The whole company takes an interest, and everyone has something to contribute.
Marketing and Sales know the audience and hear first-hand what kind of messaging is important to our customers.
Operations support can tell us in real-time how issues are affecting our customers, and they are the first people that our users turn to when they’re seeking information.
The DevOps team knows the most about the root causes of events, and can offer insight into how to surface that information quickly. That not only helps us communicate better, it helps us solve problems faster.
Even if we feel like everything went great during an event, we take time to reinforce what went well and ask ourselves what could have gone better
Crisis Communication is scary!
It’s human instinct: when you’re having problems, the last thing you want to do is draw attention to them. Talking publicly about a failure, maybe even before you know what caused it, can feel like an admission of incompetence.
You could publish information that later turns out to be inaccurate, and end up looking foolish or uninformed. You could accidentally disclose information that an attacker could use to make the situation worse.
Your competitors or detractors might use your candor against you. Maybe the trade press will report on your outage and damage your reputation in the marketplace.
Or, maybe your customers didn’t notice the outage! You could be drawing attention to a problem that your users weren’t aware of, and creating ill-will where there was none before.
The alternative is worse!
Don’t kid yourself. Your customers did notice, and information has a way of becoming public quickly, especially when people have been impacted. If you don’t tweet about an issue, it will still show up in your mentions within minutes, believe me.
If you're seen as dragging your feet with information or covering something up, you'll destroy trust with your users. This is especially true if you’re in the software, platform, or infrastructure-as-a-service industries. People understand that no one is online 100% of the time, but you have to make real commitments and keep them in order to convince people to outsource key functions of their business to you.
Once you lose control of the message, it can be really hard to get it back. An old adage in politics and chess is that if you’re reacting, you’re losing, and it applies here as well. If you’re spending the day responding to other people’s tweets or Facebook posts or accounts in the press, then the perception becomes that you’re trying to hide, that information must be extracted from you.
Your users will reward honesty, candor and timeliness with loyalty
Every organization experiences failures of one kind or another. Not every organization has the courage to own their failures and communicate with their users about them.
Customers know that no one is perfect and value a company that can admit that. “We’re still learning” is a fundamental bit of truth-telling that lays the foundation for constructive engagement going forward. It means that interactions with the customer will be a dialogue, and the customer will see that their feedback is listened to and acted on. One of the most valuable things in any vendor relationship is the feeling that the other party is truly hearing you.
Crisis Communication takes buy-in from all parts of the business
The DevOps team can’t just unilaterally decide to start posting to StatusPage or tweeting about production problems. Crisis communication has to happen in the context of protecting company IP assets, and especially protecting the privacy and security of your users.
If Crisis Communication is done without the involvement of management and the business-focused teams, the messaging may be poorly crafted and poorly aimed. Your business teams know where your users are looking for information and know how to talk to them in a way they find useful.
On the other hand, if the technical team isn’t involved, the messaging may be inaccurate and your technical reputation could suffer. You don’t want to give the impression that you don’t know what you’re doing!
There are a few key things that everyone needs to agree on when it comes to Crisis Communication strategy...
Agree on When, and how often, alerts go out
When there’s something affecting your customers, you need to communicate as soon as possible, but depending on the type of issue, going public might cause more damage in the short term.
The key is to identify different types of issues before they happen, and have a strategy for how to communicate each type. A simple component outage can be reported quickly. A security breach might need to be contained before you can talk about it.
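One way to picture "identify the types ahead of time" is a small lookup table that the team agrees on before anything breaks. This is only a minimal sketch in Python: the names IssueType, CommsStrategy, STRATEGIES and can_go_public are made up for illustration, and the two issue types are just the two examples mentioned above.

```python
from dataclasses import dataclass
from enum import Enum


class IssueType(Enum):
    COMPONENT_OUTAGE = "component outage"
    SECURITY_BREACH = "security breach"


@dataclass(frozen=True)
class CommsStrategy:
    contain_before_publishing: bool  # must the issue be contained first?


# Hypothetical strategy table, agreed on before any incident happens.
STRATEGIES = {
    IssueType.COMPONENT_OUTAGE: CommsStrategy(contain_before_publishing=False),
    IssueType.SECURITY_BREACH: CommsStrategy(contain_before_publishing=True),
}


def can_go_public(issue_type, contained):
    """Return True if a public notification may go out now."""
    strategy = STRATEGIES[issue_type]
    return contained or not strategy.contain_before_publishing
```

With a table like this, the decision during the incident is a lookup rather than a debate: can_go_public(IssueType.SECURITY_BREACH, contained=False) says to hold the message until the breach is contained.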
Whenever you post that first notification, you should update on a regular cadence. Stick to that cadence as much as possible, even if there isn’t anything new to report. It shows that you’re still on the case, and won’t leave everyone wondering what’s happening. Obviously if something important occurs, like an issue gets resolved, then get that information out as quickly as possible.
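The cadence rule can be sketched in a few lines of Python. The 30-minute interval and the wording of the "nothing new" message are assumptions for illustration, not a recommendation from the talk, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed cadence; pick whatever interval your team agrees on.
UPDATE_INTERVAL = timedelta(minutes=30)


def should_post(now, last_update, new_information=None, interval=UPDATE_INTERVAL):
    """Post right away when there is news; otherwise post when the cadence is due."""
    return bool(new_information) or now >= last_update + interval


def update_text(new_information=None):
    """Use the news if there is any, else a short 'still on the case' message."""
    if new_information:
        return new_information
    return "We are continuing to investigate. No new information at this time."
```

The point of the second function is that a scheduled update still goes out even when there is nothing new to say.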
Agree on What gets communicated
Not every detail can be shared, but as a rule, you should report everything you can. Don’t overwhelm the public with minutiae, but try to put yourself in the shoes of your users. What would you want to know about this incident and its cause?
Users need enough information to make intelligent and timely decisions of their own. If their own platform depends on yours, they need to know if it’s time to deploy backup or failover measures. You will like the decisions they make better if you are part of the dialogue. Leave people trying to guess and make decisions on their own, and they may decide to seek another vendor.
[BLAKE] Incident Templates: We see a lot of customers having a way easier time with this communication by setting up some templates and language ahead of time. If you have routine or expected incidents, it’s much easier to execute a pre-planned strategy than figure stuff out on the fly.
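As a rough illustration of what pre-written templates buy you, here is a minimal Python sketch using the standard library's string.Template. The template wording, the status names and the render helper are all hypothetical; this is not how StatusPage's own template feature is implemented.

```python
from string import Template

# Hypothetical templates, written and reviewed before any incident.
TEMPLATES = {
    "investigating": Template(
        "We are investigating an issue affecting $component. "
        "Next update by $next_update."
    ),
    "resolved": Template("The issue affecting $component has been resolved."),
}


def render(status, **fields):
    """Fill in a template; raises KeyError if a placeholder is left unfilled."""
    return TEMPLATES[status].substitute(**fields)
```

Because substitute() refuses to render with a missing field, a half-filled message can't go out by accident. For example, render("resolved", component="the API") returns "The issue affecting the API has been resolved."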
Agree on Who is managing the message
It's important to speak with a single, consistent voice. If different information is going out on different channels, it will make you look dumb and you’ll lose trust with your users. The message going out on your StatusPage has to be the same as the message going out on Twitter or in email. Don’t make your customers guess which source is more correct or timely.
You need to make sure there is internal consensus before communicating a status change. Make sure you have multiple confirmations before you report that an incident is resolved.
The person doing the reporting can and should act as a gatekeeper here, ensuring the I’s are dotted and the T’s crossed before the message goes out. This is DevOps, which means ownership. If you hit that “post” button, you own that message.
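The gatekeeper idea can be sketched as one function that checks for consensus and then sends the identical text to every channel. This is only an illustration in Python: the threshold of two confirmations is an assumption, and the channels are plain callables standing in for whatever actually posts to your status page, Twitter or email.

```python
# Assumed threshold; the talk only says "multiple confirmations".
REQUIRED_CONFIRMATIONS = 2


def broadcast(message, channels, confirmations, required=REQUIRED_CONFIRMATIONS):
    """Send the same message to every channel once enough people have confirmed.

    channels: list of callables, each taking the message text.
    confirmations: names of the people who independently confirmed the status.
    """
    if len(set(confirmations)) < required:
        raise ValueError("not enough independent confirmations to post")
    for send in channels:
        send(message)  # identical text everywhere: one voice
    return len(channels)
```

Duplicate names only count once, so the same person confirming twice doesn't satisfy the check.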
Agree on Where the message is being shared
There should be well-known and publicized channels for your users to get information. Establish these channels and stick to them. Never make your users hunt or guess where to go.
Twitter, email, and the telephone are all important tools for us and we use them as supplemental channels, especially when an issue is affecting one customer particularly.
StatusPage is our main point of publication for our real-time platform status. Our page is at status.victorops.com. status-dot-whatever is a pretty standard construction for this, and it’s the first URL I try when I’m trying to guess where to find status information for a third party.
[BLAKE] Private pages. A lot of people don’t realize they might be leaving team members out of the loop. Maybe you have some internal technology that you don’t want public facing. Have you thought about how you’re sharing that info with the team? We’ve had a lot of customers use private pages to keep their team up to speed when internal things break. It can do wonders for team morale when they feel in the loop on these things.
At VO we have a cross-functional crisis communication team that gets involved in incidents in real-time
Members of the team come from business and technical disciplines. The mix means that we’re sure the different departments are being represented during incidents, and we can call on a wide range of skill sets.
The CC team has an on-call rotation, like all of our critical platform teams. In this way, business stakeholders take part in platform operations, and extend the DevOps paradigm throughout the company. For us at VictorOps, this also promotes a better understanding of how our product works and how people use it, so it’s a win for everyone.
Members of our CC team attend training and get certification
They learn how to identify types of crises and handle messaging for each kind. This minimizes the time spent debating the messaging approach. They already know what to do; they just need to execute.
They learn how to plan in advance for handling crises, and to learn from incidents in retrospect
They learn how to identify the key stakeholders and communicate with them. This means not only our end users and account holders, but our internal stakeholders as well. Does the executive team need to be notified? How about sales people getting ready to do demos?
We aim to manage CC without disrupting critical work. The way to do that is:
Have someone act as a liaison between first responders and the Crisis Communication team. Make sure the people working the problem know who that point person is.
Keep conversations about CC separate from firefighting. Use a different Slack channel or conference bridge, but keep it open to others who want to contribute when they have a moment.
Have plans and runbooks ready for how to get the message out. Even if someone is managing an incident for the first time, they should know when, where and how to share information
The Crisis Communication effort can actually help with troubleshooting, by ensuring everyone has the same understanding of what’s going on
Taking a breath to answer a question can sometimes give everyone a chance to gain perspective, or maybe spot a critical detail that was missed. Sometimes the person with the key piece of data and the person with the key piece of context just need to get together.
You may have people wasting time on a dead-end approach to a problem, because they didn’t hear about a critical status update. Getting everyone together to gather status can help get people off of wild goose chases.
If the team is fighting multiple fires, the Crisis Communication team can help relay feedback from the users, and help focus the effort on the most important issues.
It’s important to retrospect on the Crisis Communication process. This is separate from incident post-mortems. The Crisis Communication team and other stakeholders should get together from time to time to check in on process. Focus on:
Was our communication accurate and timely?
Did we make bad situations better for our users?
Did our efforts to communicate help or hinder the front-line process?
Is there anything in the process that we can automate/better document/eliminate?
Another great tool is cross-functional incident post-mortems. Try to get together as soon as possible after an incident, once everyone is rested and has had a chance to gather their thoughts. These meetings are usually technical, but other departments can learn and contribute here.
Having business stakeholders at the technical post-mortem can help raise questions from an end-user perspective
The better the CC team understands this incident and the platform as a whole, the fewer questions they'll need to ask during the next incident
We’re a DevOps company, and we use a DevOps toolset for managing both incidents and Crisis Communication
We use Slack integrated with VictorOps to facilitate conversation.
We have a channel with our VictorOps timeline, and different channels for side conversations and for coordinating messaging. This keeps all of the conversations focused
Our integration with statuspage.io makes it easy to do quick updates to component status from the VictorOps timeline window, even as the Crisis Communication team crafts detailed messages to post.
Key takeaways
Information about your failures will become public. That is the one fact you cannot change
It's a question of who controls when and how that information gets out. It can be you, or it can be someone else. Believe me, you’ll like the outcome better if you’re the one leading the conversation
Having a plan makes Crisis Communication less of a burden on the people fighting the fire and can even help. It also means that your message can get out more quickly, and more accurately
Doing this right will build trust with your users. They’ll know you’re not perfect, but they’ll also know that they can count on you to keep them informed when things aren’t going right, and during an incident, they will have a sense that you’re doing everything you can.