Monitoring & Alerting
Quick dive
How much do outages cost us?
Facebook - $500k in just 30 min of outage in 2014
Amazon - $66k/min
Industry average - $300k/hour
Industry total lost revenue - $26.5B
What is monitoring?
The process of becoming aware of the state of a system.
Is my website up and accessible?
Does all the important functionality work?
Is each server up?
Are all the applications we deployed up?
What’s my CPU usage per machine? disk? memory?
Swap?
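The questions above can be turned into small automated checks. The following is a minimal sketch using only the Python standard library; the function names `disk_used_percent` and `website_is_up` are hypothetical helpers invented for illustration, not part of any monitoring product mentioned in this deck.

```python
import shutil
import urllib.error
import urllib.request


def disk_used_percent(path="/"):
    """Return the percentage of space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def website_is_up(url, timeout=5.0):
    """Return True if `url` answers with a 2xx or 3xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors (4xx/5xx), network failures, timeouts and malformed URLs.
        return False
```

A scheduler (cron, for example) could call these periodically and record the results; memory, swap and per-core CPU figures usually need a platform-specific source and are left out of this sketch.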
Start simple
Basic monitoring systems that you can try straight away:
● Google Analytics (Android, iOS, Unity, HTTP, analytics.js)
● Fabric (Crashlytics integration for Android and iOS)
You can also check this detailed comparison table of different monitoring systems.
What does monitoring help with?
● Early problem detection
● Decision making
● Automation
Early problem detection
Performance
● Monitoring anomalies in the behavior of the system helps to detect resource
saturation and rare defects (hard to spot by QA)
● Particular types of bugs related to heavy system load are hard to detect in test
environments, but can be consistently reproduced in production
Availability
● Downtime usually translates directly to losses in revenue and credibility
● 99.99% availability is the industry standard (about 53 min of downtime per year)
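The downtime budget behind an availability target is simple arithmetic. The sketch below assumes a 365-day year; `allowed_downtime_minutes` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return how many minutes of downtime an availability target permits per period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_minutes


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
```

For 99.99% this works out to roughly 52.6 minutes per year, and each additional "nine" cuts the budget by a factor of ten.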
Decision making
Baselining
● Know the normal, average state of your system (baseline)
● Data-backed Service-Level Agreements (SLAs)
● In-depth performance analysis, saving costs
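As an illustration of baselining, the sketch below summarises historical samples (for instance, response times in milliseconds) as a mean, a median and a 95th percentile. The choice of these three statistics and the name `baseline` are assumptions made for this example; real systems often baseline per hour of day or per day of week.

```python
import statistics


def baseline(samples):
    """Summarise historical samples as mean, median and 95th percentile."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    percentiles = statistics.quantiles(samples, n=100)
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p95": percentiles[94],
    }
```

A figure such as the 95th percentile is the kind of number a data-backed SLA can be written against.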
Predictions
● Help predict what normal traffic levels are during peaks of activity, like
holidays, social events and such (capacity planning)
● Close interaction with monitoring may help predict business trends
Automation
Monitoring allows a system to adapt automatically to high-load situations.
Bursts of input may saturate a system’s capacity, and it may have to drop
some traffic. To prevent a uniformly bad experience for all users, the system
rejects a portion of its inputs. This is commonly known as
admission control.
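One simple form of admission control caps the number of requests in flight and rejects the rest. The sketch below is an assumption-laden illustration: the class name `AdmissionController` and the fixed `max_in_flight` limit are hypothetical, and the class is not thread-safe as written.

```python
class AdmissionController:
    """Reject new work once the number of in-flight requests reaches a limit."""

    def __init__(self, max_in_flight):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.rejected = 0  # a counter worth exporting to the monitoring system

    def try_admit(self):
        """Return True and reserve a slot, or return False if the system is saturated."""
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Free a slot when a request finishes."""
        if self.in_flight > 0:
            self.in_flight -= 1
```

In practice the limit would be tuned from monitored data (the capacity at which latency starts to degrade) rather than hard-coded.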
Monitoring system architecture
● Data collection
● Data aggregation and storage
● Presentation
Data collection
The sources of data are logs, device statistics, and system measurements:
● Logging network request failure rates (4xx, 5xx)
● Tracking performance of calls to individual remote services
● Database calls and response time
● Disk and CPU usage
● Logging mobile client analytics events
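As an example of the first item, failure rates can be derived from the HTTP status codes seen in a time window. This is a minimal sketch; `failure_rates` is a hypothetical helper and the input is assumed to be a list of integer status codes already parsed from logs.

```python
from collections import Counter


def failure_rates(status_codes):
    """Return the fraction of client-error (4xx) and server-error (5xx) responses."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    # Group codes by their leading digit: 2 for 2xx, 4 for 4xx, and so on.
    classes = Counter(code // 100 for code in status_codes)
    return {"4xx": classes[4] / total, "5xx": classes[5] / total}
```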
Data aggregation and storage
● Incoming data inputs are grouped by their properties and stored as time series
● The resulting time series are submitted to an alarm evaluation engine, which
generates alarms if anomalies are detected (anomaly detection).
One such system is Graphite.
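The grouping step can be sketched as follows. This is not how Graphite is implemented; it is an illustrative assumption in which raw events arrive as `(timestamp, metric_name, value)` tuples and are averaged into fixed-width time buckets. The name `aggregate` and the 60-second default are hypothetical.

```python
from collections import defaultdict


def aggregate(events, bucket_seconds=60):
    """Group (timestamp, name, value) events into per-metric series of bucket means."""
    buckets = defaultdict(list)
    for timestamp, name, value in events:
        # Align each event to the start of its time bucket.
        bucket_start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[(name, bucket_start)].append(value)

    series = defaultdict(list)
    for name, bucket_start in sorted(buckets):
        values = buckets[(name, bucket_start)]
        series[name].append((bucket_start, sum(values) / len(values)))
    return dict(series)
```

Each resulting series is a time-ordered list of `(bucket_start, mean)` points, which is the shape an alarm evaluation engine or a dashboard would consume.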
Presentation
Presentation allows visualisation of the real-time state of the system. When a fault is identified
and fixed, the correction should be immediately visible.
One powerful tool for dashboarding is Grafana:
● Integrates with Graphite, InfluxDB, OpenTSDB, and KairosDB
● Introduction and basic concepts can be found here
● Useful video on how to set up your first dashboard
● Give it a try
Alerting
Alerting is the capability of a
monitoring system to detect and notify
the engineer about meaningful events.
Levels of alert urgency
● Alerts as records - anomalies that do not impact the service functionality.
● Alerts as notifications - do not need immediate attention.
● Alerts as pages - high severity, response time enforced by internal SLAs.
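The three levels can be expressed as a small routing rule. The two boolean inputs and the mapping below are assumptions made for illustration; a real system would derive them from alert rules and severity definitions.

```python
from enum import Enum


class Urgency(Enum):
    RECORD = "record"              # stored for later investigation
    NOTIFICATION = "notification"  # sent to a queue or channel, handled in working hours
    PAGE = "page"                  # wakes up the on-call engineer


def classify(impacts_service, needs_immediate_attention):
    """Map an alert to an urgency level (hypothetical rule, not from a specific tool)."""
    if not impacts_service:
        return Urgency.RECORD
    if needs_immediate_attention:
        return Urgency.PAGE
    return Urgency.NOTIFICATION
```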
Tools
● PagerDuty
● OpsGenie
● VictorOps
Anomaly detection
The identification of items, events or observations which do not conform to an
expected pattern or other items in a dataset.
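A very basic detector flags points that lie far from the mean of the data, measured in standard deviations (a z-score). This is only a sketch of one simple technique, assuming roughly stationary data; it is not the method used by any company named in this deck, and the `threshold` default of 3.0 is a common rule of thumb rather than a recommendation.

```python
import statistics


def find_anomalies(values, threshold=3.0):
    """Return the indices of values more than `threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]
```

Metrics with strong daily or weekly seasonality usually need a baseline per time slot before a rule like this becomes useful.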
Let’s see how Uber does it.
Issue is detected and fixed, now what?
Detecting and fixing an issue are only the first steps. We need to make sure that the
issue does not happen again.
Use of postmortems is one interesting approach.
Challenges
● Baselining
● Coverage
● Manageability
● Accuracy
● Context
● Human nature
Conclusion
● Get in the habit of measuring: you cannot manage what you cannot measure
● Monitor extensively
● Alarm selectively
● Work smart, not hard: learn from the experience of others
● Have a tactic
Further reading: Effective Monitoring and Alerting
Thank you!
Contact:
sabin.roman@gmail.com
https://nl.linkedin.com/in/sabinroman

Monitoring & alerting presentation sabin&mustafa

Editor's notes

  1. Today we will discuss what we love the most in engineering: being woken up at 4 am because of a bug! We will talk about how to detect problems with your application and how to fix them as soon as possible.
  2. Has anybody used these tools?
  3. The ability to predict demands and then match them based on seasonality translates directly into revenue gains.
  4. Record example: when a data store that supports a user-facing service starts serving queries much slower than usual, but not slow enough to make an appreciable difference in the overall service’s response time, that should generate a low-urgency alert that is recorded in your monitoring system for future reference or investigation but does not interrupt anyone’s work. Notification example: the data store is running low on disk space and should be scaled out in the next several days.
  5. Pics, charts, examples, how much time it takes to set up the system, conclusion, pitfalls.
  6. Baselining: “nothing endures but change”. Coverage: systems evolve, so should the coverage.
  7. Tactic: runbooks (e.g. the 80% disk storage issue).