13. IT ALL STARTED
WITH A BUG…
Missing a large number of
tracking events for one
customer… 🤔
A customer is telling us we have
a problem with our system.
@ShaneCarroll84
14. WE WERE IN FIRE
FIGHTING MODE
Scrambling to find the
cause of the issue…
We didn’t know the full impact,
which caused stress for the team.
16. LET’S LOOK
AT THE LOGS…
Logs were noisy and not easily
searchable…
Not all the information needed
to isolate the issue was logged.
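The complaint above, noisy free-text logs that omit the fields needed to isolate a problem, is commonly addressed with structured logging. A minimal sketch in Python; the field names and `tracking-ingest` component are illustrative assumptions, not details from the talk:

```python
import json
import logging

logger = logging.getLogger("tracking")

def log_event_failure(customer_id, event_id, reason):
    # Emit one JSON object per line so logs can be searched by field
    # (customer_id, event_id, reason) instead of by free text.
    record = {
        "level": "error",
        "component": "tracking-ingest",  # assumed component name
        "customer_id": customer_id,
        "event_id": event_id,
        "reason": reason,
    }
    logger.error(json.dumps(record, sort_keys=True))
    return record
```

Because every record carries the identifiers up front, a single grep or log-index query over `customer_id` would have surfaced all of one customer's failures at once.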
17. MACGYVER,
WE HAVE A PROBLEM
Issue with titles containing
special characters… 🤦
Lots of customers impacted
and over a million tracking
events lost.
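The talk doesn't show the actual defect, but a plausible reconstruction of this class of bug is building an event payload by string concatenation, which breaks as soon as a title contains a quote or backslash, while a real serializer escapes them. A hypothetical sketch:

```python
import json

def build_payload_naive(title):
    # Fragile: a quote inside the title produces a broken JSON document.
    return '{"title": "' + title + '"}'

def build_payload_safe(title):
    # json.dumps escapes quotes, backslashes, and control characters.
    return json.dumps({"title": title})

def parses(payload):
    # True if the payload is valid JSON, False otherwise.
    try:
        json.loads(payload)
        return True
    except ValueError:
        return False
```

Events built the naive way would be silently dropped by any downstream consumer that rejects malformed payloads, matching the "lost tracking events" symptom.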
18. WHY DID WE NOT
GET AN ALERT?
IT WAS LOST IN
A SEA OF ALERTS.
19. OUR PROCESS
WAS BROKEN!
It took weeks to gather the missing
data and reprocess the events…
Meanwhile the problem continued to
impact customers and other teams.
20. WE DIDN’T KNOW OUR
SYSTEM’S HEALTH
What can we learn from this?
Use a bad experience, such as
a bug, as a chance to learn
and spark change.
21. WHEN YOU LOOK AT YOUR
CURRENT SYSTEM,
HOW DO YOU KNOW
IT’S HEALTHY?
26. THREE AMIGOS
George Dinwiddie first described
this strategy on his blog.
The Three Amigos – Product,
Developer, and Tester – discuss
the new feature together.
27. THREE AMIGOS
Allows for discussion of risk
before beginning work on a
feature…
We now ‘Three Amigos’ each
new story before starting any
new development.
28. REFINING ALERTS
Reduce noise and only alert on
what is important to the team.
Entire team takes responsibility
for investigating and fixing
alerts.
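The slide's idea of reducing noise so important alerts stand out can be sketched as a simple severity filter with duplicate suppression. The severity names and alert shape are illustrative assumptions; real teams typically express this in their alerting tool's own rules:

```python
def refine_alerts(alerts, page_on=frozenset({"critical"})):
    # Keep only alerts at a severity the team has agreed to act on,
    # and suppress repeats of the same alert so one incident pages once.
    seen = set()
    actionable = []
    for alert in alerts:
        key = (alert["name"], alert["severity"])
        if alert["severity"] in page_on and key not in seen:
            seen.add(key)
            actionable.append(alert)
    return actionable
```

The point is the policy, not the code: each surviving alert is one the whole team is expected to investigate and fix, so anything that doesn't meet that bar shouldn't page at all.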
29. DAILY STAND-UP
We changed our stand-up to
include a new question:
“Any new alerts today?”
A small change, but now it’s
part of our team process!