SRE Demystified - 06 - Distributed Monitoring

•

0 j'aime•83 vues

According to Google, SRE is what you get when you treat operations as if it’s a software problem. In this video, I briefly explain distributed monitoring concepts. Youtube channel here: https://youtu.be/EgpCw15fIK8

Technologie

SRE Demystified
Distributed Monitoring
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com,
http://ganeshniyer.com
Dr Ganesh Neelakanta Iyer

SRE
•
2https://image.slidesharecdn.com/devopssreatgooglescale-190121123035/95/devops-sre-at-google-scale-30-638.jpg?cb=1548074257

Why Monitor?
3
Analyzing long-term trends
Comparing time/iterations
Alerting
Building dashboards
How big is my database and how fast is it
growing? How quickly is my daily-active user
count growing?
Are queries faster with Acme Bucket of Bytes
2.72 versus Ajax DB 3.14? How much better
is my memcache hit rate with an extra node?
Is my site slower than it was last week?
Something is broken,
and somebody needs to
fix it right now! Or,
something might break
soon, so attention
needed soon.
Dashboards should answer basic questions
about your service, and normally include
some form of the four golden signals
Conducting ad hoc retrospective analysis
Our latency just shot up; what else happened around the same time?
https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/

Setting Reasonable Expectations for Monitoring
• Simpler and faster monitoring systems, with better tools for post
hoc analysis
• Other uses of monitoring data such as capacity planning and
traffic prediction can tolerate more fragility, and thus, more
complexity
• To keep noise low and signal high, the elements of your
monitoring system that direct to a pager need to be very simple
and robust
• Rules that generate alerts for humans should be simple to
understand and represent a clear failure
4https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/

Example symptoms and causes
5
Symptom Cause
I’m serving HTTP 500s or 404s Database servers are refusing connections
My responses are slow
CPUs are overloaded by a bogosort, or an Ethernet
cable is crimped under a rack, visible as partial
packet loss
Users in Antarctica aren’t receiving animated cat
GIFs
Your Content Distribution Network hates scientists
and felines, and thus blacklisted some client IPs
Private content is world-readable
A new software push caused ACLs to be forgotten
and allowed all requests
https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/

Black-box monitoring White-box monitoring
• Restricted to external
service behaviour
• e.g. ping, http
6
• Internal metrics and logs to
surface system issues
• e.g. request_errors_total,
request_processing_time,
DevOps School

7https://www.devopsschool.com/blog/white-box-monitoring-and-black-box-monitoring-explained/

Choosing an Appropriate Resolution for
Measurements
• Observing CPU load over the time span of a minute won’t reveal even
quite long-lived spikes that drive high tail latencies
• For a web service targeting no more than 9 hours aggregate downtime
per year (99.9% annual uptime), probing for a 200 (success) status
more than once or twice a minute is probably unnecessarily frequent.
• Similarly, checking hard drive fullness for a service targeting 99.9%
availability more than once every 1–2 minutes is probably unnecessary
9

Dr Ganesh Neelakanta Iyer
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com

Recommandé

SRE Demystified - 16 - NALSD - Non-Abstract Large System DesignDr Ganesh Iyer

SRE Demystified - 07 - Practical AlertingDr Ganesh Iyer

SRE Demystified - 03 - Choosing SLIs and SLOsDr Ganesh Iyer

SRE Demystified - 11 - Release management-2Dr Ganesh Iyer

SRE Demystified - 01 - SLO SLI and SLADr Ganesh Iyer

SRE Demystified - 12 - Docs that matter -1 Dr Ganesh Iyer

SRE Demystified - 09 - SimplicityDr Ganesh Iyer

SRE Demystified - 05 - Toil EliminationDr Ganesh Iyer

Recommandé

SRE Demystified - 16 - NALSD - Non-Abstract Large System DesignDr Ganesh Iyer

SRE Demystified - 07 - Practical AlertingDr Ganesh Iyer

SRE Demystified - 03 - Choosing SLIs and SLOsDr Ganesh Iyer

SRE Demystified - 11 - Release management-2Dr Ganesh Iyer

SRE Demystified - 01 - SLO SLI and SLADr Ganesh Iyer

SRE Demystified - 12 - Docs that matter -1 Dr Ganesh Iyer

SRE Demystified - 09 - SimplicityDr Ganesh Iyer

SRE Demystified - 05 - Toil EliminationDr Ganesh Iyer

Understanding parametersniffing sqlsatSanil Mhatre

Performance monitoring - Adoniram Mishra, Rupesh Dubey, ThoughtWorksThoughtworks

Ride the database in JUnit tests with Database RiderMikalai Alimenkou

Ohio 2012-help-sysad-outmralexjuarez

Modern CI/CD in the microservices world with KubernetesMikalai Alimenkou

SFScon 21 - Eduardo Guerra - A Lean Software Analytics Canvas for Agile Small...South Tyrol Free Software Conference

ImprovingSBHSRemotetriggered server and websiteDevdutt Kumar

Performance tuning Grails applications SpringOne 2GX 2014Lari Hotari

Baltimore share point user group june 2015Toby McGrail

Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...rschuppe

Csdn Drdobbs Tenni Theurer Yahooguestb1b95b

Perfmon And Profiler 101Quest Software

The 5S Approach to Performance Tuning by Chuck EzellDatavail

Observability in real time at scaleBalvinder Hira

Plopoakleaf

High Performance Web SitesPáris Neto

High Performance Websites By Souders Stevew3guru

DockerCon SF 2019 - Observability WorkshopKevin Crawley

Big server-is-watching-youmkherlakian

Performance OptimizationNeha Thakur

Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)Sascha Wenninger

From Zero to Performance Hero in Minutes - Agile Testing Days 2014 PotsdamAndreas Grabner

Contenu connexe

Tendances

Understanding parametersniffing sqlsatSanil Mhatre

Performance monitoring - Adoniram Mishra, Rupesh Dubey, ThoughtWorksThoughtworks

Ride the database in JUnit tests with Database RiderMikalai Alimenkou

Ohio 2012-help-sysad-outmralexjuarez

Modern CI/CD in the microservices world with KubernetesMikalai Alimenkou

SFScon 21 - Eduardo Guerra - A Lean Software Analytics Canvas for Agile Small...South Tyrol Free Software Conference

ImprovingSBHSRemotetriggered server and websiteDevdutt Kumar

Performance tuning Grails applications SpringOne 2GX 2014Lari Hotari

Baltimore share point user group june 2015Toby McGrail

Tendances (9)

Understanding parametersniffing sqlsat

Performance monitoring - Adoniram Mishra, Rupesh Dubey, ThoughtWorks

Ride the database in JUnit tests with Database Rider

Ohio 2012-help-sysad-out

Modern CI/CD in the microservices world with Kubernetes

SFScon 21 - Eduardo Guerra - A Lean Software Analytics Canvas for Agile Small...

ImprovingSBHSRemotetriggered server and website

Performance tuning Grails applications SpringOne 2GX 2014

Baltimore share point user group june 2015

Similaire à SRE Demystified - 06 - Distributed Monitoring

Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...rschuppe

Csdn Drdobbs Tenni Theurer Yahooguestb1b95b

Perfmon And Profiler 101Quest Software

The 5S Approach to Performance Tuning by Chuck EzellDatavail

Observability in real time at scaleBalvinder Hira

Plopoakleaf

High Performance Web SitesPáris Neto

High Performance Websites By Souders Stevew3guru

DockerCon SF 2019 - Observability WorkshopKevin Crawley

Big server-is-watching-youmkherlakian

Performance OptimizationNeha Thakur

Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)Sascha Wenninger

From Zero to Performance Hero in Minutes - Agile Testing Days 2014 PotsdamAndreas Grabner

February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network

Os Solomonoscon2007

Super Sizing Youtube with Pythondidip

Orion Network Performance Monitor (NPM) Optimization and Tuning TrainingSolarWinds

Application Performance Troubleshooting 1x1 - Von Schweinen, Schlangen und Pa...rschuppe

Empowering Customers with Personalized InsightsCloudera, Inc.

SenchaCon Roadshow Irvine 2017Speedment, Inc.

Similaire à SRE Demystified - 06 - Distributed Monitoring (20)

Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...

Csdn Drdobbs Tenni Theurer Yahoo

Perfmon And Profiler 101

The 5S Approach to Performance Tuning by Chuck Ezell

Observability in real time at scale

Plop

High Performance Web Sites

High Performance Websites By Souders Steve

DockerCon SF 2019 - Observability Workshop

Big server-is-watching-you

Performance Optimization

Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)

From Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam

February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...

Os Solomon

Super Sizing Youtube with Python

Orion Network Performance Monitor (NPM) Optimization and Tuning Training

Application Performance Troubleshooting 1x1 - Von Schweinen, Schlangen und Pa...

Empowering Customers with Personalized Insights

SenchaCon Roadshow Irvine 2017

Plus de Dr Ganesh Iyer

SRE Demystified - 14 - SRE Practices overviewDr Ganesh Iyer

SRE Demystified - 13 - Docs that matter -2Dr Ganesh Iyer

SRE Demystified - 10 - Release management-1Dr Ganesh Iyer

SRE Demystified - 04 - Engagement ModelDr Ganesh Iyer

Machine Learning for Statisticians - IntroductionDr Ganesh Iyer

Making Decisions - A Game Theoretic approachDr Ganesh Iyer

Cloud and Industry4.0Dr Ganesh Iyer

Game Theory and Engineering ApplicationsDr Ganesh Iyer

Machine Learning and its ApplicationsDr Ganesh Iyer

How to become a successful entrepreneurDr Ganesh Iyer

Dockers and kubernetesDr Ganesh Iyer

Containerization Principles Overview for app development and deploymentDr Ganesh Iyer

Game Theory and Engineering ApplicationsDr Ganesh Iyer

Demystifying Containerization Principles for Data ScientistsDr Ganesh Iyer

Cloud computing for image processing and bio informaticsDr Ganesh Iyer

Io t trendsDr Ganesh Iyer

Trends in IoT 2017Dr Ganesh Iyer

Docker 101 - High level introduction to dockerDr Ganesh Iyer

DrGanesh-Jan-17-Resume-V1.0Dr Ganesh Iyer

Simplify enterprise IT with no code platform - aPaaSDr Ganesh Iyer

Plus de Dr Ganesh Iyer (20)

SRE Demystified - 14 - SRE Practices overview

SRE Demystified - 13 - Docs that matter -2

SRE Demystified - 10 - Release management-1

SRE Demystified - 04 - Engagement Model

Machine Learning for Statisticians - Introduction

Making Decisions - A Game Theoretic approach

Cloud and Industry4.0

Game Theory and Engineering Applications

Machine Learning and its Applications

How to become a successful entrepreneur

Dockers and kubernetes

Containerization Principles Overview for app development and deployment

Game Theory and Engineering Applications

Demystifying Containerization Principles for Data Scientists

Cloud computing for image processing and bio informatics

Io t trends

Trends in IoT 2017

Docker 101 - High level introduction to docker

DrGanesh-Jan-17-Resume-V1.0

Simplify enterprise IT with no code platform - aPaaS

Dernier

Artificial intelligence in cctv survelliance.pptxhariprasad279825

Gen AI in Business - Global Trends Report 2024.pdfAddepto

Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays

Commit 2024 - Secret Management made easyAlfredo García Lavilla

Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro

DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell

Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge

TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey

The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson

"ML in Production",Oleksandr BaganFwdays

Anypoint Exchange: It’s Not Just a Repo!Manik S Magar

Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm

Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB

Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation

From Family Reminiscence to Scholarly Archive .Alan Dix

Dernier (20)

Artificial intelligence in cctv survelliance.pptx

Gen AI in Business - Global Trends Report 2024.pdf

Nell’iperspazio con Rocket: il Framework Web di Rust!

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack

Commit 2024 - Secret Management made easy

Advanced Test Driven-Development @ php[tek] 2024

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

Unraveling Multimodality with Large Language Models.pdf

DSPy a system for AI to Write Prompts and Do Fine Tuning

Designing IA for AI - Information Architecture Conference 2024

TeamStation AI System Report LATAM IT Salaries 2024

The Ultimate Guide to Choosing WordPress Pros and Cons

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

Are Multi-Cloud and Serverless Good or Bad?

"ML in Production",Oleksandr Bagan

Anypoint Exchange: It’s Not Just a Repo!

Streamlining Python Development: A Guide to a Modern Project Setup

Developer Data Modeling Mistakes: From Postgres to NoSQL

Connect Wave/ connectwave Pitch Deck Presentation

From Family Reminiscence to Scholarly Archive .

SRE Demystified - 06 - Distributed Monitoring

1. SRE Demystified Distributed Monitoring ganesh@ganeshniyer.com ganesh.vigneswara@gmail.com, http://ganeshniyer.com Dr Ganesh Neelakanta Iyer

2. SRE • 2https://image.slidesharecdn.com/devopssreatgooglescale-190121123035/95/devops-sre-at-google-scale-30-638.jpg?cb=1548074257

3. Why Monitor? 3 Analyzing long-term trends Comparing time/iterations Alerting Building dashboards How big is my database and how fast is it growing? How quickly is my daily-active user count growing? Are queries faster with Acme Bucket of Bytes 2.72 versus Ajax DB 3.14? How much better is my memcache hit rate with an extra node? Is my site slower than it was last week? Something is broken, and somebody needs to fix it right now! Or, something might break soon, so attention needed soon. Dashboards should answer basic questions about your service, and normally include some form of the four golden signals Conducting ad hoc retrospective analysis Our latency just shot up; what else happened around the same time? https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/

4. Setting Reasonable Expectations for Monitoring • Simpler and faster monitoring systems, with better tools for post hoc analysis • Other uses of monitoring data such as capacity planning and traffic prediction can tolerate more fragility, and thus, more complexity • To keep noise low and signal high, the elements of your monitoring system that direct to a pager need to be very simple and robust • Rules that generate alerts for humans should be simple to understand and represent a clear failure 4https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/

5. Example symptoms and causes 5 Symptom Cause I’m serving HTTP 500s or 404s Database servers are refusing connections My responses are slow CPUs are overloaded by a bogosort, or an Ethernet cable is crimped under a rack, visible as partial packet loss Users in Antarctica aren’t receiving animated cat GIFs Your Content Distribution Network hates scientists and felines, and thus blacklisted some client IPs Private content is world-readable A new software push caused ACLs to be forgotten and allowed all requests https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/

6. Black-box monitoring White-box monitoring • Restricted to external service behaviour • e.g. ping, http 6 • Internal metrics and logs to surface system issues • e.g. request_errors_total, request_processing_time, DevOps School

7. 7https://www.devopsschool.com/blog/white-box-monitoring-and-black-box-monitoring-explained/

8. Four Golden Signals 8

9. Choosing an Appropriate Resolution for Measurements • Observing CPU load over the time span of a minute won’t reveal even quite long-lived spikes that drive high tail latencies • For a web service targeting no more than 9 hours aggregate downtime per year (99.9% annual uptime), probing for a 200 (success) status more than once or twice a minute is probably unnecessarily frequent. • Similarly, checking hard drive fullness for a service targeting 99.9% availability more than once every 1–2 minutes is probably unnecessary 9

10. References 10

11. Dr Ganesh Neelakanta Iyer ganesh@ganeshniyer.com ganesh.vigneswara@gmail.com