What went wrong? Why does this always happen? How can we ensure it Never Happens Again? For most of the internet age, engineering teams have focused on finding the cause of an outage. A belief existed, and persists, that every error or behavior can be traced back to a single causal entity. Root Cause Analysis (RCA) is conducted in service of finding that entity and correcting it. By doing so, we have been taught, we prevent recurrence of the error in question.
Much of RCA thinking comes from manufacturing and electrical systems, where simple causality can exist: a fuse that fails often is caused by poor wiring. In computing environments, there is rarely so simple a cause. Even the simplest application contains dependencies, logic, bottlenecks, and inefficiency. By wrapping that application in an operating system, on a server, on a network, on the internet, managed by process, and actioned by people, we add enough complexity to force us to reconsider the Root Cause Analysis approach.
Modern tools and practices, like DevOps, enable engineering teams to adopt significant complexity at relatively low operational cost. Once unthinkable, microservice architecture in a public cloud environment is now a common choice for new software projects. Consider, for a moment, the layers of complexity captured in that decision. Now consider how opaque the agents in those systems are to the operators (us).
Emergence is a phenomenon whereby larger entities arise through interactions among smaller or simpler entities. In theory, complex systems exhibit highly unpredictable behavior and generate surprising patterns. In practice, teams operating complex engineering systems always see deeply interrelated causality: a blend of people, process, and the systems themselves. So why do we still focus our after-action analysis on a Single Cause?
In this talk, we’ll explore these conflicting realities for incident management teams. Attendees will learn about the differences between Root Cause Analysis and other techniques such as the Postmortem. While this is a technical talk with examples of both simple and complex infrastructures, much time will be spent considering the impact of people and process on those same systems. Attendees will leave with some actionable ideas to bring back to their teams to improve their own after-action analysis activities.
Speaker
Matthew Boeckman
Matthew is an 18-year veteran of building infrastructure and leading engineering teams. Despite his heavy Ops background, Matthew has been a longtime friend of Developers and considers DevOps his primary passion and focus. Most recently VP of Infrastructure at Craftsy, Matthew now owns Dryas.io, a consulting practice focused on DevOps, Cloud adoption, and startup growth strategy.
8. “We can easily identify the cause of faults in our digital offerings” - same guy
Simple systems
9. Let’s change once a year, then it will be easier to point fingers at Dev.
Deployment Schedules
10. ● It took a long time to create requirements
● It took a long time to write software
● It took a very long time to deploy applications
● It took a really, really long time to test software
● Testing patches was hard
● Deploying patches was all or nothing
● Managing Hardware was an entire department’s job
● Software and Hardware changes often required orchestration (that was hard)
Playing the long game
There were some good reasons
11. ● It took a long time to create requirements
● It took a long time to write software
● It took a very long time to deploy applications
● It took a really, really long time to test software
● Testing patches was hard
● Deploying patches was all or nothing
● Managing Hardware was an entire department’s job
● Software and Hardware changes often required orchestration (that was hard)
Playing the long game
There aren’t anymore
12. Root Cause = Static model, Binary Thinking
GOOD
Working
Expected
Certain
Understood
Responsible
Uptime
BAD
Broken
Problem
Disaster
Confused
Wrong
FAILURE
13. “3 tiers should be enough tiers for anybody” - some guy, probably
Simple Systems
19. “... refers to the existence or formation of collective
behaviors — what parts of a system do together that they
would not do alone.” [1]
Emergence and Complex Systems
[1] Bar-Yam, Concepts: Emergence
Properties and behaviors of systems arise from both the
fine structures that compose those systems, and the
interrelationships between the systems’ discrete parts.
25. Cynefin
● Created by Dave Snowden @snowded
● Originally for managing IBM Intellectual Capital
● Draws on research in systems, complexity, network and learning theories
34. Adopting Cynefin
In the moment:
What Quadrant does this map to?
In the PIR:
How did we manage the pattern?
In your sprint planning:
What patterns can we manage clockwise?
35. Root Cause Analysis vs. Cynefin
Root Cause Analysis
● Simple Causality
● Static Model
● Binary Thinking
● After-Action
● Focus on Blame
Cynefin
● Dynamic
● Expects Change
● Embraces Emergence
● Present in the Moment
● Call to Action