13. IT ALL STARTED
WITH A BUG…
Missing a large number of
tracking events for one
customer… 🤔
A customer is telling us we have
a problem with our system.
@ShaneCarroll84
14. WE WERE IN FIRE
FIGHTING MODE
Scrambling to find the
cause of the issue…
We didn’t know the full impact,
which caused stress for the team.
16. LET’S LOOK
AT THE LOGS…
Logs were noisy and not easily
searchable…
Not all the information needed
to isolate the issue was logged.
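The complaint above, noisy free-text logs that omit the fields needed to isolate a problem, is commonly addressed with structured logging. A minimal sketch in Python; the field names and `tracking-ingest` component are illustrative assumptions, not details from the talk:

```python
import json
import logging

logger = logging.getLogger("tracking")

def log_event_failure(customer_id, event_id, reason):
    # Emit one JSON object per line so logs can be searched by field
    # (customer_id, event_id, reason) instead of by free text.
    record = {
        "level": "error",
        "component": "tracking-ingest",  # assumed component name
        "customer_id": customer_id,
        "event_id": event_id,
        "reason": reason,
    }
    logger.error(json.dumps(record, sort_keys=True))
    return record
```

Because every record carries the identifiers up front, a single grep or log-index query over `customer_id` would have surfaced all of one customer's failures at once.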
17. MACGYVER,
WE HAVE A PROBLEM
Issue with titles containing
special characters… 🤦
Lots of customers impacted
and over a million tracking
events lost.
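The talk doesn't show the actual defect, but a plausible reconstruction of this class of bug is building an event payload by string concatenation, which breaks as soon as a title contains a quote or backslash, while a real serializer escapes them. A hypothetical sketch:

```python
import json

def build_payload_naive(title):
    # Fragile: a quote inside the title produces a broken JSON document.
    return '{"title": "' + title + '"}'

def build_payload_safe(title):
    # json.dumps escapes quotes, backslashes, and control characters.
    return json.dumps({"title": title})

def parses(payload):
    # True if the payload is valid JSON, False otherwise.
    try:
        json.loads(payload)
        return True
    except ValueError:
        return False
```

Events built the naive way would be silently dropped by any downstream consumer that rejects malformed payloads, matching the "lost tracking events" symptom.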
18. WHY DID WE NOT
GET AN ALERT?
IT WAS LOST IN
A SEA OF ALERTS.
19. OUR PROCESS
WAS BROKEN!
It took weeks to gather the missing
data and reprocess the events…
Meanwhile the problem continued to
impact customers and other teams.
20. WE DIDN’T KNOW OUR
SYSTEM’S HEALTH
What can we learn from this?
Use a bad experience, such as
a bug, as a chance to learn
and spark change.
21. WHEN YOU LOOK AT YOUR
CURRENT SYSTEM,
HOW DO YOU KNOW
IT’S HEALTHY?
26. THREE AMIGOS
George Dinwiddie first described
this strategy on his blog.
The Three Amigos – Product,
Developer, and Tester – discuss
the new feature together.
27. THREE AMIGOS
Allows for discussion of risk
before beginning work on a
feature…
We now ‘Three Amigos’ each
new story before starting any
new development.
28. REFINING ALERTS
Reduce noise and only alert on
what is important to the team.
Entire team takes responsibility
for investigating and fixing
alerts.
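The slide's idea of reducing noise so important alerts stand out can be sketched as a simple severity filter with duplicate suppression. The severity names and alert shape are illustrative assumptions; real teams typically express this in their alerting tool's own rules:

```python
def refine_alerts(alerts, page_on=frozenset({"critical"})):
    # Keep only alerts at a severity the team has agreed to act on,
    # and suppress repeats of the same alert so one incident pages once.
    seen = set()
    actionable = []
    for alert in alerts:
        key = (alert["name"], alert["severity"])
        if alert["severity"] in page_on and key not in seen:
            seen.add(key)
            actionable.append(alert)
    return actionable
```

The point is the policy, not the code: each surviving alert is one the whole team is expected to investigate and fix, so anything that doesn't meet that bar shouldn't page at all.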
29. DAILY STAND-UP
We changed our stand-up to
include a new question:
“Any new alerts today?”
A small change, but now it’s
part of our team process!