Turn the lights on
Creating a culture of ownership and trust
With Visibility & transparency
Shai Peretz
About myself:
- Technologist, DevOps culture advocate
- Held various technology management positions (Outbrain, Cyota,
Shopping.com among others)
- Piano player, mostly Jazz
- Amateur photographer
- Co-founded a Waldorf education (Anthroposophy) school in Tel Aviv
So, what is a culture of Ownership?
You built it, you run it!
- You understand best how it is built, and what can go
wrong, hence:
- Responsible to take your code to production
- Create the tests and monitor them
- Set monitoring and alert thresholds
- Decide what is critical and what isn’t
- Say what actions to take when something goes wrong
- Receive the critical alerts and act upon them
- Automate any possible action, so that it will not wake you
up next time:)
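As a rough illustration of the list above, an owner's decisions (threshold, criticality, what to do, and what to automate) could be written down as data next to the code. This is a minimal sketch only; `AlertRule`, `evaluate`, and the metric name are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AlertRule:
    """One alert, defined by the team that owns the service (hypothetical shape)."""
    metric: str                      # what is being watched
    threshold: float                 # the alert threshold the owner chose
    critical: bool                   # critical alerts page the owner
    runbook: str                     # what to do when it fires
    auto_fix: Optional[Callable[[], bool]] = None  # automated action, if any


def evaluate(rule: AlertRule, value: float) -> str:
    """Decide what happens for a single metric reading."""
    if value <= rule.threshold:
        return "ok"
    # Try the automated action first, so nobody is woken up needlessly.
    if rule.auto_fix is not None and rule.auto_fix():
        return "auto-fixed"
    return "page" if rule.critical else "ticket"


# Hypothetical rule for illustration only.
queue_rule = AlertRule(
    metric="queue_depth",
    threshold=1000,
    critical=True,
    runbook="Restart the consumer, then check the upstream producer",
)
print(evaluate(queue_rule, 5000))  # critical and no automated fix, so it pages
```

Once an automated action exists, it is tried before anyone is paged, which is the point of the last bullet.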
What is a culture of Ownership?
- Nothing is ‘not my problem’:
- First verify that all my systems are working as they should, then look
elsewhere
- Be transparent and don’t hide your mistakes
- Culture of cooperation and helping others (production party:)
What is a culture of Ownership?
Learn from your mistakes:
- Document your actions (automatically)
- Blameless take-ins (post mortem):
- Led by the event manager, as close as you can to the event
- Include all stakeholders
- 5 whys methodology
- Create tasks with due dates and priority
- Go back and check that tasks are done
Celebrate failure -
it is the best opportunity to learn!
So you have built a new service…
It is a really great app, smart and useful, people love it.
It is responsive, using all the latest technology and buzzwords.
It is communicating with a dozen other services via efficient APIs
It is collecting tons of data, processing and moving it via the latest message queues,
and storing it in several data stores
It grabs the data back using a smart search engine and oops…
The user is getting an error. What happened?
Something went wrong...
Is it the search engine?
Or maybe the data store?
Or maybe a problem with the network?
Or maybe a broken API call?
Or the reverse proxy is down?
Or...
You are in the dark.
Can someone please turn the lights on?
Well, you can start looking for the problem with a flashlight...
But you only have 5 seconds...
That’s because you have committed to a 99.95% SLA, and you have used most of
your allowed downtime already:(
And your system is complicated...
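To make the downtime budget concrete, here is a small worked calculation of what a 99.95% SLA allows. The 30-day and 365-day windows are assumptions for illustration; the actual measurement window depends on the SLA contract.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted by an availability SLA over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100)


# Assumed measurement windows: a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.95, 30)
yearly = allowed_downtime_minutes(99.95, 365)

print(f"Per 30 days:  {monthly:.1f} minutes")
print(f"Per 365 days: {yearly / 60:.2f} hours")
```

Under these assumptions the budget is about 21.6 minutes per 30 days, or roughly 4.4 hours per year, so a single slow investigation can use up most of it.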
So let’s monitor and log everything!
We now have:
3 million application metrics per minute +
1 million system metrics per minute +
750,000 log lines per minute
75 different dashboards rotating on six 65” monitors.
Is that enough light?
No. That’s too much light.
Which still leaves us in the dark.
What we need is some filters:
That’s better:)
Now we can see some details...
Yet, we can’t find where the problem is...
That’s because we have too much information. Well, at least for a human being.
Why don’t we let machines deal with it?
That’s exactly what they are built for*:
- Process tons of information in fractions of a second
- Correlate data from many different sources
- Analyze the data, search for anomalies
- Act upon it automatically
(or at least notify someone)
* Assuming the humans who programmed those machines did a good job:)
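As one possible illustration of "search for anomalies, then act or notify", the sketch below flags a metric reading that is far from its recent history using a rolling z-score. This is a deliberately simple stand-in technique, not the method used in the talk; the class name, window size, and threshold are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


class AnomalyDetector:
    """Flags values that deviate strongly from a rolling window (hypothetical helper)."""

    def __init__(self, window: int = 30, z_threshold: float = 3.0):
        self.values = deque(maxlen=window)  # recent history only
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if the value looks anomalous, then add it to the history."""
        is_anomaly = False
        if len(self.values) >= 5:  # need some history before judging
            mu = mean(self.values)
            sigma = stdev(self.values)
            # A perfectly flat history (sigma == 0) is not judged in this sketch.
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                is_anomaly = True
        self.values.append(value)
        return is_anomaly


detector = AnomalyDetector()
latencies_ms = [100, 101, 99, 100, 102, 98, 100, 101, 500]  # made-up readings
flags = [detector.observe(v) for v in latencies_ms]
if flags[-1]:
    print("Anomaly detected: notify the owner or trigger an automated action")
```

A machine can run this kind of check across millions of series per minute; a human watching 75 dashboards cannot.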
Great visibility:
Encourages prevention - helps prevent problems before
they occur, by forcing you to consider most possible
problems in advance
Enables automatic self-healing when possible
And if not - gives us a laser-focused pointer to where
the problem is, in a timely manner (near real time), and allows
us to fix it quickly (and automate for next time:)
So what tools should we use?
It doesn't really matter.
Well, it actually does:)
Choose the set of tools that fits your organization and that you are comfortable with,
as long as you get good coverage of:
- Automatic testing (visibility of the build & deploy pipeline)
- Infrastructure/system metrics and logs
- Application level metrics and logs
- External (user experience) monitoring
- Prediction and anomaly detection
Select tools that you trust and make their availability a first priority!
Benefits of good visibility
Enabler for Agile and DevOps culture -
easier to take responsibility, better communication
Drives quality up (both code and infrastructure)
Improves MTTD, MTTR and MTTS (better SLA)
Reduces frustration and improves productivity
Helps to achieve business goals
Now, who volunteers to monitor the monitoring system?
Monitoring the monitoring system
No alerts. All dashboards are green. Does it really mean all is good?
Not necessarily…
You have to verify:
- Set another layer of independent monitoring, outside your network
- Create ‘positive’ checks that confirm the system is up
If you don’t trust your monitoring system, it is useless!
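One way to read the "positive check" idea is as a heartbeat watchdog: each part of the monitoring pipeline must regularly prove that it is alive, and silence itself is treated as a failure. The sketch below assumes this interpretation; `HeartbeatWatchdog` and the source names are hypothetical, and in practice it would run on independent infrastructure outside your network.

```python
import time
from typing import Dict, List, Optional


class HeartbeatWatchdog:
    """Treats missing 'I am alive' signals as a problem (hypothetical helper)."""

    def __init__(self, max_silence_seconds: float):
        self.max_silence_seconds = max_silence_seconds
        self.last_seen: Dict[str, float] = {}

    def record(self, source: str, now: Optional[float] = None) -> None:
        """Called whenever a source reports in successfully."""
        self.last_seen[source] = time.time() if now is None else now

    def silent_sources(self, now: Optional[float] = None) -> List[str]:
        """Sources that have been quiet for longer than allowed."""
        now = time.time() if now is None else now
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_silence_seconds
        )


# Explicit timestamps are used here so the example is deterministic.
watchdog = HeartbeatWatchdog(max_silence_seconds=60)
watchdog.record("metrics-pipeline", now=0)
watchdog.record("log-shipper", now=50)
print(watchdog.silent_sources(now=100))  # the metrics pipeline has gone quiet
```

A green dashboard fed by a dead pipeline looks exactly like a healthy system, which is why the absence of a heartbeat has to raise an alert of its own.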
Ownership + Transparency => Trust
Bring facts to your discussions
Take ownership of your stuff
Share your mistakes
Don’t blame others
When trust exists, people are more cooperative and open to learn =>
problems are fixed faster and rarely repeat themselves
Transparency
Status pages (if done properly):
- Can save a lot of time while troubleshooting a problem
- Increase transparency, build trust
- Should be automated wherever possible
- Use multi-level pages - different levels of detail for engineering, business and
customers
Share your plans and progress
- Especially when you have delays...
How transparent should it be?
My rule of thumb - open up everything that will not hurt your organization
In order to be able to do so:
- People need to respect confidentiality
- People should have effective filters as to what is relevant for them
Trust <=> Transparency: this is a fragile circle, very easy to break!
Impact of good visibility and transparency
Visibility
Transparency
Responsibility
Ownership
Communication
Quality
Frustration
Fatigue
MTTD
MTTR
MTTS
Uptime
SLA
Revenue
Customer satisfaction
Employee satisfaction
Thank you for listening:)
shai.peretz@gmail.com

Creating a Culture of Ownership and Trust with Visibility and Transparency by Shai Peretz
