Monitoring
Deeper dive
Who am I
Robert Kubiś
DevOps Engineer
https://www.linkedin.com/in/robertkubis89
Mikey Dickerson's Hierarchy of Needs
Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data
about a system, such as query counts and types, error counts and types,
processing times, and server lifetimes.
● White-box monitoring
● Black-box monitoring
● Dashboard
● Alert
● Root cause
● Push
● Node and machine
Why Monitor?
● Analyzing long-term trends
● Comparing over time or experiment groups
● Alerting
● Building dashboards
● Conducting ad hoc retrospective analysis (i.e., debugging)
Please stop using Nagios (Andy Sykes)
So we can die peacefully...
Who uses it?
Why did you choose it?
Advantages:
● Incredibly simple plugin model
● Simple to use
● Many people know it
● Top of the Google results, and everybody uses it :)
Disadvantages:
● Doesn’t scale - cannot be clustered - Thruk hack
● Millions of lines of configuration - check_mk hack
● Horrible interface
● Only suited to static infrastructure
● Stupid client format - hacks
● Perfdata...
● Doesn’t have an API - livestatus hack
● You always need to hack...
Nagios
When your monitoring sucks...
- Improve the quality of alerts
- Improve the monitoring tools, or even change them
Wait a minute... Before you start solving problems...
UNDERSTAND THE PROBLEMS AND MEASURE THEM!!!
“To measure is to know”
“If you cannot measure it,
you cannot improve it”
William Thomson
(Lord Kelvin)
Over-monitoring and alarm fatigue: For whom do the
bells toll? Hospitals in the USA
- Alarm notifications get ignored
- “Yeah, that is not important”
- 72–99% of alarms are false
- Young parents vs nurses in a hospital
- Monitoring means more money
- More is not better
- A patient could die
- Telemetry as a means of preventing, detecting, and improving
Source:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4926996/
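Before changing tools, the share of noise can be measured from past alerts. A minimal sketch, assuming a hypothetical alert history in which each record was labelled actionable or not during review (the record layout and the check names are illustrative and do not come from any particular tool):

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    check: str        # hypothetical check name, e.g. "disk_usage"
    actionable: bool  # True if an operator actually had to do something


def false_alert_ratio(alerts):
    """Return the fraction of alerts that required no action (0.0 to 1.0)."""
    if not alerts:
        return 0.0
    noise = sum(1 for a in alerts if not a.actionable)
    return noise / len(alerts)


def noisiest_checks(alerts, top=3):
    """Return the checks that produced the most non-actionable alerts."""
    counts = Counter(a.check for a in alerts if not a.actionable)
    return counts.most_common(top)


# Illustrative data only.
history = [
    Alert("disk_usage", False),
    Alert("disk_usage", False),
    Alert("http_health", True),
    Alert("cpu_load", False),
]
print(f"false alert ratio: {false_alert_ratio(history):.0%}")
print("noisiest checks:", noisiest_checks(history))
```

Tracking this ratio per week gives a baseline, so a later change in tooling or thresholds can be judged by whether the number actually drops.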
What to do?
So what to use for monitoring?
What should monitoring be?
Actionable
Compatible
Essential - only alerts which are needed
Fully Automated
Proactive - should predict failures
Easy for operators
State monitoring: what should it look like?
State (black-box) monitoring currently makes the most sense for VMs and bare-metal servers
What should be monitored with this kind of tool?
● Health endpoints
● Service states (like systemctl status *)
What could be monitored?
● Specific endpoints (for example from a satellite node) with HTTP/TCP checks
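The HTTP/TCP checks mentioned above can be sketched with the Python standard library alone. This is an illustration, not how any specific plugin is implemented; the function names and the one-second warning threshold are assumptions, and the return values follow the common Nagios/Icinga plugin convention of 0 = OK, 1 = WARNING, 2 = CRITICAL:

```python
import socket
import time
import urllib.error
import urllib.request

# Plugin-style states (Nagios/Icinga exit code convention).
OK, WARNING, CRITICAL = 0, 1, 2


def tcp_check(host, port, timeout=3.0):
    """Black-box TCP check: OK if a connection can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK
    except OSError:
        return CRITICAL


def classify_http(status, elapsed_s, warn_s=1.0):
    """Map an HTTP status (None = no response) and latency to a state."""
    if status is None or status >= 500:
        return CRITICAL
    if status >= 400 or elapsed_s > warn_s:
        return WARNING
    return OK


def http_check(url, timeout=3.0, warn_s=1.0):
    """Request a health endpoint and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code   # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None       # no usable answer at all
    return classify_http(status, time.monotonic() - start, warn_s)
```

A call such as `http_check("https://service.example.org/health")` (a placeholder URL) returns one of the three states, which a wrapper script could pass to `sys.exit()` so the scheduler sees a plugin-style exit code.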
Icinga2 - began as a Nagios fork but has been rewritten in many places; it has scaling scenarios
(multi-master, with 3 levels of nodes - masters, satellites (e.g. a supervisor per DC),
clients (check executors)) and plugins such as the InfluxDB metric exporter, livestatus etc.
What can we get from Icinga2?
● Highly available and distributed setup
● Nice, well-documented REST API
● (dynamic inventory)
● Less time needed to implement features
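As an illustration of what the REST API makes possible, the sketch below builds a request that lists services which are not in the OK state. It assumes the API listens on the default port 5665 with HTTP basic authentication and accepts a `filter` query parameter on `/v1/objects/services`; the host name, user and password are placeholders, and the details should be checked against the Icinga 2 API documentation for the version in use:

```python
import base64
import urllib.parse
import urllib.request


def build_problem_services_request(base_url, user, password):
    """Build (but do not send) a GET request for services with state != 0."""
    # Assumption: the API accepts an Icinga filter expression in "filter".
    query = urllib.parse.urlencode({"filter": "service.state!=0"})
    url = f"{base_url}/v1/objects/services?{query}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Accept": "application/json",
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, headers=headers)


# Placeholder endpoint and credentials, for illustration only.
req = build_problem_services_request(
    "https://icinga.example.org:5665", "apiuser", "secret"
)
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` additionally needs a TLS context that trusts the Icinga CA certificate, which is left out here.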
Metrics
Metric tools can be used in two ways:
1. Failure prediction
2. Graphing the data for humans - for humans it means SIMPLE
The first case is quite simple - rules that detect anomalies, such as more traffic than
usual, and alert if it can have an impact on other clients
The second case is also simple - just graphs for debugging and for better understanding
what’s happening with applications
Not every metric should have an alert (and notifications)!
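A rule for "more traffic than usual" can be as small as a comparison against recent history. A minimal sketch, assuming a simple z-score rule with a threshold of three standard deviations (both the rule and the sample numbers are illustrative choices, not something prescribed by a particular tool):

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard
    deviations above the mean of `history`."""
    if len(history) < 2:
        return False          # not enough data to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu   # flat history: any increase stands out
    return (current - mu) / sigma > threshold


# Illustrative requests-per-minute samples.
requests_per_minute = [100, 104, 98, 101, 97, 103, 99, 102]
print(is_anomalous(requests_per_minute, 105))  # within normal variation
print(is_anomalous(requests_per_minute, 180))  # far above the usual level
```

Only the upper side is checked because the slide talks about excess traffic; a drop in traffic would need a separate rule.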
Prometheus
Circa 120 ready-to-use dashboards in the Grafana repository (e.g. the MySQL board by
Percona)
Many useful features in one tool - Prometheus has a rich query language, Alertmanager,
support for PagerDuty etc.
Plenty of exporters (collectors) for standard tools: MySQL, HAProxy, NGINX,
Pagespeed, BIND, Jenkins, scollector
Third-party project support for Prometheus: GitLab, Kubernetes, etcd, telegraf,
jmx-exporter, collectd
Logs
Servers, applications, network and security devices generate log files.
Errors, problems, and other information are constantly logged and saved for
analysis.
Once an event is detected, the monitoring system sends an alert, either to a
person or to another software/hardware system.
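The "event detected, alert sent" step can be illustrated with a small scanner. This sketch assumes a hypothetical line layout of `<timestamp> <LEVEL> <message>` and an arbitrary threshold of more than two errors per batch; in a real log pipeline this logic would live in the log platform's own alerting rules rather than in a script:

```python
import re
from collections import Counter

# Assumed layout: "<timestamp> <LEVEL> <message>".
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def count_levels(lines):
    """Count log lines per level, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


def should_alert(lines, max_errors=2):
    """Return True when the batch holds more than `max_errors` ERROR lines."""
    return count_levels(lines)["ERROR"] > max_errors


# Illustrative log lines only.
sample = [
    "2024-05-01T10:00:00Z INFO service started",
    "2024-05-01T10:00:05Z ERROR db connection refused",
    "2024-05-01T10:00:06Z ERROR db connection refused",
    "2024-05-01T10:00:07Z WARN retrying",
    "2024-05-01T10:00:08Z ERROR db connection refused",
]
print(dict(count_levels(sample)), should_alert(sample))
```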
Elasticsearch stack
Monitoring strategy
Icinga 2 for state monitoring on bare metal, VMs and VMs in cloud.
Prometheus for metrics and data from Kubernetes (or other container) clusters.
ELK stack for logs
Is that enough?
What should be the next step?
What is PagerDuty?
User Settings
Notification Rules
Schedules
Escalation Policies
Services
Integrations
Integrations list
Connect with any
tool that provides
incoming event
data.
Extensions
Extensions list
Extend the PagerDuty
workflow to your existing tools.
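"Incoming event data" usually means a small JSON document posted to PagerDuty. The sketch below builds a trigger event for the Events API v2 endpoint; the routing key, summary and source values are placeholders, and the field set is limited to the ones used here, so the official API reference remains the authority on the full schema:

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build the JSON body of a 'trigger' event."""
    if severity not in SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if dedup_key:
        event["dedup_key"] = dedup_key  # groups repeats into one incident
    return event


def build_request(event):
    """Wrap the event in a POST request (built here, not sent)."""
    return urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values, for illustration only.
event = build_trigger_event(
    "YOUR_ROUTING_KEY", "Disk almost full on db-1", "db-1",
    severity="warning", dedup_key="db-1/disk",
)
request = build_request(event)
print(request.get_method(), request.full_url)
```

Reusing the same `dedup_key` for repeated firings of one problem keeps them in a single incident, which fits the goal of fewer and more meaningful notifications.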
Good practices for alerts
● Notify before an incident happens
● Actionable alarms
● The value of measuring things
● Documentation - not only one-liners in the on-call wiki
● Reduce the number of tools
● Terraform
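Notifying before an incident usually means extrapolating a trend instead of waiting for a threshold to be crossed. A minimal sketch, assuming disk usage samples of the form (hour, used percent) and a straight-line least-squares fit; the sample values are invented, and PromQL's `predict_linear` function covers a similar need inside Prometheus:

```python
def linear_fit(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        raise ValueError("need samples taken at different times")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def hours_until_full(samples, capacity=100.0):
    """Estimate hours from the last sample until usage hits `capacity`.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_fit(samples)
    if slope <= 0:
        return None
    full_at = (capacity - intercept) / slope
    return max(0.0, full_at - samples[-1][0])


# Illustrative samples: disk usage grows by 2 percentage points per hour.
usage = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]
print(hours_until_full(usage))
```

An alert rule can then fire when the estimate drops below, say, a working day, which leaves time to act during office hours.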
Let’s say that you’re rich :)
New Relic
STACKDRIVER
● Full-Stack Monitoring, Powered by Google
● For Cloud Platform, AWS, and Hybrid Deployments
● Identify Trends, Prevent Issues
● Reduce Monitoring Overhead
● Improve Signal-to-Noise
● Fix Problems Faster
Stackdriver heatmap
STACKDRIVER MONITORING FEATURES
● Debugger
● Error reporting
● Rapid discovery
● Uptime monitoring
● Integrations
● Smart defaults
● Alerts
● Tracing
● Logging
● Dashboards
● Profiling
MONITORING = PEOPLE
Not only tools...
Team
To make sense now and in the future, changes to the monitoring infrastructure
should be supported by the development and "reacting" teams.
Reacting team:
● People working 24/7 shifts who watch the boards and react to issues
● An incident manager (IM) who makes decisions and investigates how to tune monitoring
● People with “programming” skills responsible for deploying the IM's proposals
(writing new checks, adding some pieces of code)
Plan your work