Building a Culture of
Observability at Stripe
Maaaaaaaybe?
Cory “gphat” Watson
• Joined Stripe in August, 2015
• Previously at Keen IO and Twitter
• Generalist
Starting Point
• Stripe had some visibility, but not enough.
• No clear ownership, broken windows.
• Lack of confidence, vision for future.
• Very reactive.
This isn’t about a specific
technology. This is about people.
Did it work?
See my resume at:
onemogin.com/resume
(jk)
You’re here because you
know this is important.
How can we get others to
agree and work toward it?
Stripe Org Facts
• ~450 employees, 100% growth in last year
• ~2 dozen teams
• ~200 services
• Thousands of hosts (AWS)
• Ruby, JVM, lots of OSS stuff
• Team: 3 + intern (starting Q2)
Where to begin?
Start Over, Kinda
• Spend time with the tools
• Improve if possible
• Replace if not
• Leverage past knowledge
Empathy and Respect
• People are not generally evil, but they are busy!
• Stressed, doing their best with what they have
• Being a hater is lazy
• Help people be great at their jobs
Replaced Existing System
• Maybe a bad call, technically better
• Overcoming momentum is hard, adds work
• Declaring bankruptcy
• Saved us ops headaches
• Still going
Tip: Nemawashi
• Start small, you’re a great guinea pig
• Quietly lay a foundation and gather feedback
• Ask how you can improve, follow up!
• Engage discontent! Usually fine. Sometimes you need whisky.
Identify Power Users
• Find interested parties
• Talk to them, give them what they need
• Empower them to help others
• Watch them grow!
Value
• What are you improving?
• How can you measure it?
• Is this the best way?
What is Observability?
Why do we want it?
In control theory, observability is a measure
for how well internal states of a system can
be inferred by knowledge of its external
outputs.
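For the linear time-invariant case, which the slide does not spell out, the standard textbook form of this definition can be written down directly. The symbols A, B, C, D, x, u, y and n below are the usual state-space notation and are not taken from the talk.

```latex
% State-space model with state x(t) in R^n, input u(t), output y(t)
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)

% Observability matrix built from the output map C and the dynamics A
\mathcal{O} =
\begin{bmatrix}
C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1}
\end{bmatrix}

% The internal state can be inferred from the outputs exactly when
\operatorname{rank}(\mathcal{O}) = n
```

Read loosely for software: if the outputs you collect cannot distinguish two different internal states, no amount of dashboarding will tell them apart.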
Systems output work.
If the internal state goes bad,
the work goes bad.
We need to add sensors!
[Diagram labels: Reference, Programmer, System, Sensor(s), Work; callout: "Make This Great"]
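One way to picture "adding sensors" to a unit of work is a small wrapper that records an outcome count and a duration on every call. This is a minimal sketch, not Stripe's tooling: the `sensor` decorator, the in-memory `COUNTS` and `DURATIONS` stores, and the `charge` function are all hypothetical names invented for illustration.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory sink; a real system would ship these to a metrics pipeline.
COUNTS = defaultdict(int)
DURATIONS = defaultdict(list)


def sensor(name):
    """Wrap a unit of work so each call records an outcome count and a duration."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
            except Exception:
                # The work went bad: count it, then let the error propagate.
                COUNTS[f"{name}.failure"] += 1
                raise
            else:
                COUNTS[f"{name}.success"] += 1
                return result
            finally:
                # Runs on both paths, so failures are timed too.
                DURATIONS[name].append(time.perf_counter() - start)
        return wrapper
    return decorate


@sensor("charge")
def charge(amount_cents):
    # Stand-in for real work; the name and the validation rule are made up.
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return {"charged": amount_cents}
```

Calling `charge(500)` and then a failing `charge(0)` leaves one success, one failure, and two duration samples behind, which is enough to graph throughput, error ratio, and latency for that piece of work.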
Flat Org Work Ethic
• Probably the biggest challenge, getting started
• So, ya know, get started
• Be willing to do the work, shave the preposterous line of yaks
• Stigmergy
• Strike when good opportunities arise (incidents, etc)
Advertise
• Don’t be afraid!
• Promote team accomplishments.
• More so, promote the accomplishments of others.
• Humbly ask to help, then learn.
• We send monthly “State of” addresses…
Make It Easy & Good
• Harder than it sounds (email!)
• Make it easy/automatic to do things right and hard to do wrong.
• Quality is important.
Automated Monitors
• Baseline monitoring
• Common problems, common solutions
• Users have no state, are surprised
• People care when you show them failure and how to fix it.
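The baseline-monitoring idea can be sketched as a function that hands every service the same default monitors, each paired with a pointer to the fix. Everything here is an assumption made for illustration: the `Monitor` type, the metric names, the thresholds, and the runbook paths do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    service: str
    metric: str
    threshold: float
    runbook: str  # how to fix it, shown alongside the failure


# Hypothetical baseline: metric names, thresholds, and runbook paths are invented.
BASELINE = [
    ("error_rate", 0.05, "runbooks/high-error-rate"),
    ("disk_used_fraction", 0.90, "runbooks/disk-full"),
]


def baseline_monitors(services):
    """Give every service the same default monitors without the owner asking."""
    return [
        Monitor(service, metric, limit, runbook)
        for service in services
        for metric, limit, runbook in BASELINE
    ]


def evaluate(monitors, readings):
    """Return (monitor, value) pairs whose latest reading exceeds the threshold."""
    firing = []
    for monitor in monitors:
        value = readings.get((monitor.service, monitor.metric))
        if value is not None and value > monitor.threshold:
            firing.append((monitor, value))
    return firing
```

Because each firing monitor carries its runbook, whatever consumes `evaluate` can show the failure and the fix together, which is the point the last bullet makes.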
Automatic Ticket Creation
And Resolution!
Investigation Dashboard
Such Helpful!
Getting Feedback
How we improve.
Teach the Basics
• Company curriculum: Teach ‘em early!
• Measuring work metrics
• Metrics types
• Schemas (dotted, tags, etc)
• Rates, histograms
• Visualizations
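A short sketch of the basics listed above: the two naming schemas, a rate, and a histogram-style percentile. The function names and sample values are hypothetical. The tagged form uses the DogStatsD-style `name:value|type|#key:value` line as one example of a tag syntax, which is an assumption, since the slides do not name a wire format.

```python
import math


def dotted_name(service, host, metric):
    """Dotted schema: every dimension is baked into the metric name."""
    return f"{service}.{host}.{metric}"


def tagged_line(metric, value, kind, tags):
    """Tagged schema: one metric name, dimensions carried as key:value tags."""
    tag_part = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    return f"{metric}:{value}|{kind}|#{tag_part}"


def rate_per_second(count_then, count_now, seconds):
    """A rate is the change in a counter divided by the elapsed time."""
    return (count_now - count_then) / seconds


def percentile(samples, p):
    """Nearest-rank percentile over raw samples, with p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

With the dotted schema, `api.web1.requests` and `api.web2.requests` are two unrelated names. With tags, both are `requests` and can be grouped or filtered by `host`, which is why counts of distinct metrics are hard to compare across the two styles.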
Ownership
• Poor story for this
• Org was ready for this, management was on board.
• Evolving, tools are lacking.
Did it work?
Yes, but not done.
• Some teams? Hell yes. Strong champions, huge improvement.
• Some other teams, kinda the same.
• Some other other teams, what is Observability and why do I care? Rare!
Usage?
• 200+ dashboards created, 339 in old (over 2 years)
• 200+ monitors created, dozens in old (nobody trusted, was unreliable!)
• ~3000 distinct metrics (can’t compare, tags now!)
• All positive feedback from automation. (Avg 4.5, 2.5% response)
Tools?
• Dozens of OSS PRs, OSS *StatsD library (Scala), internal libraries (we own)
• Vast improvement over old pipeline, no loss
• New styles, better naming, more consistency
• Being tied to a commercial product cuts both ways
Adjustments?
• Embracing other tools (log analysis, error catching)
• Beginning to work on strategic things (global timers, histograms and sets)
• Need to improve metrics on our own work (we got by easy for a while)
• Monitoring is hard, need to fix.
Summary
• Start small
• Seek feedback
• Think on your value
• Measure effectiveness
• Enjoy!
Thanks
Team @antifuchs and @shu, all of Stripe
onemogin.com
@gphat
github.com/gphat
cory@stripe.com
Questions?
@gphat
Info
Slides
Feedback
Talk
Help me improve.
