SlideShare a Scribd company logo
1 of 13
Singapore Meetup, 2017-11-23
Speaker: Arseny Chernov
So The Story Goes Like…
Full story: https://www.slideshare.net/grobie/the-history-of-prometheus-at-soundcloud
• 2012 - Joined SoundCloud
• Left Google in 2012 after 5+ years
• Side-project for open-source
monitoring system for Not Only IT
(econometrics, biochemical etc.)
• Started LevelDB-backed
Prometheus
• Server, client_golang
• Protocol Buffers
• 2012 - Joined SoundCloud
• Left Google in 2012 after 2+ years
• Configuration, query language
& &
• 2013 - Joined SoundCloud
• Left Google in 2013 after 7+ years
• Storage rewrite (LevelDB to Chunks): March 2014
• Public release: January 2015
• Join Cloud Native Computing Foundation (CNCF): May 2016
• Prometheus 2.0 announced: November 08, 2017
• Singapore Meetup: 23 November, 2017
Motivation Behind - Google SRE Best Practices
Read book: https://landing.google.com/sre/book.html
• SRE: Have software engineers do operations
• Do the same work as an operations team, but with
automation instead of manual labour
• 50% upper bound cap on the amount of “ops”
Google SLI, SLO, SLA
Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html
Service Level Indicators (SLIs)
• A carefully defined quantitative measure of some aspect of the level of service that is provided
• request latency / error rate (often expressed as % of all requests received ) / system throughput,
Service Level Objectives (SLOs)
• Lower bound ≤ SLI ≤ upper bound
• Define the lowest level of reliability, and state that as your Service Level Objective
(SLO).
Service Level Agreements (SLAs)
• SLA is a looser objective than the SLO. Alternatively the SLA might only specify a subset of SLO metrics.
• I.e. availability SLA of 99.9% over 1 month with internal availability SLO of 99.95%
• A promise to someone using a service that its availability should meet a certain level over a certain
period, and if it fails to do so then some kind of penalty will be paid (partial refund of subscription fee
paid by customers for that period, or subscription time added for free)
Example 1
Example 2
Latency
• The time it takes to service a request.
• Successful vs. failed requests
• Slow error is even worse than a fast error. Track error latency.
Traffic
• A measure of how much demand is being placed on your system
• Usually HTTP requests per second (static vs dynamic content)
• Streaming system - network I/O rate or concurrent sessions
• Key-value storage system - TPS.
Errors
• The rate of requests that fail, (e.g.: HTTP 500s or HTTP 200 but coupled with wrong content)
Saturation
• How "full" your service is. CPU, Memory, I/O
• Can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it
currently receives?
• Saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its
hard drive in 4 hours.”
Four Golden Signals
Error Budget = 100% - SLO
Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html
Move fast without breaking SLO
• 100% is the wrong reliability target
• Error Budgets balance the goals of:
• Product development teams (KPI is feature velocity, incentive to push code often)
• SRE teams (KPI is reliability of a service, incentive to pushback against change)
• Error budget can be spent on anything: launching features, etc.
• Error budget provokes for discussion of phased rollouts and 1% experiments
Goal of SRE team isn’t “zero outages”
• SRE and product incentive-aligned to spend error budget and get max. feature velocity
Googlers use Borgmon (a.k.a. Borgmon rules)
Full story: https://landing.google.com/sre/book/chapters/practical-alerting.html
%curl http://webserver:80/varz
http_requests 37
errors_total 12
Each of the major languages used at Google has an implementation of the exported variable interface that automagically
registers with the HTTP server built into every Google binary by default. It’s called “Collection via /varz “
Time Series:
Distributed:
…traditional monitoring in kube era
Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus
A lot of traffic to monitor
Way more targets to monitor
…and they constantly change
Need a fleet-wide view (i..e What’s my overall 99th percentile latency)?
Still need to be able to drill down for troubleshooting
&
Prometheus Relies on Exporters
Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus
Exporters: The endpoint being polled by the prometheus server and answering the GET requests is typically
called exporter, e.g. the host-level metrics exporter is node-exporter.
https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exporters.md
Prometheus Architecture
Full story: https://jaxenter.com/prometheus-monitoring-pros-cons-136019.html
The 3 path-method combinations with the highest number of failing
requests?
topk(3,
sum by(path, method) (
rate(http_requests_total{status=~"5.."}[5m]))
)
The 99th percentile request latency by request path?
histogram_quantile(0.99, sum by(le, path) (
rate(http_requests_duration_seconds_bucket[5m])
))
PromQL:
Prometheus Storage Architecture
• A monitoring system must be more reliabile than the systems it is monitoring
• Prometheus's local storage is not meant as durable long-term storage.
• Chunks of data are in RAM, with WAL on disk
needed_disk_space =
retention_time_seconds *
ingested_samples_per_second *
bytes_per_sample [1…2 bytes]
• Possible LVM solution if _really_ desperate
As of writing (Nov. 2017) moment possible to integrate via adapters to:
Chronix , Cortex , CrateDB , Graphite , InfluxDB , OpenTSDB , PostgreSQL/TimescaleDB , SignalFx , Clickhouse etc.
This is primarily intended for long term storage. It is recommended that you perform careful evaluation of any
solution in this space to confirm it can handle your data volumes.
Full story: https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
What Protetheus Is Not & Best Practice
• Not 100% accurate
• No logs, only metrics
• Not a durable long-term storage
• Not an anomaly detection
• Not a dashboarding solution
Full story: https://prometheus.io/docs/introduction/overview/#when-does-it-not-fit
Run one Prometheus server (or HA pair) in each failure domain / zone / cluster, monitoring jobs only in that zone.
Have a set of global Prometheus servers that monitor (federate from) the per-cluster ones.

More Related Content

What's hot

HBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseCloudera, Inc.
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程HBaseCon
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster Cloudera, Inc.
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQLKonstantin Gredeskoul
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneDouglas Moore
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure DataTaro L. Saito
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsDataWorks Summit/Hadoop Summit
 
Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)Camuel Gilyadov
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera, Inc.
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...Data Con LA
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...Michael Stack
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBasedave_revell
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 

What's hot (20)

HBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to Contribute
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBase
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure Data
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and Improvements
 
Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 

Similar to Introduction to Prometheus Monitoring (Singapore Meetup)

Monitoring microservice applications: An SRE’s perspective
Monitoring microservice applications: An SRE’s perspectiveMonitoring microservice applications: An SRE’s perspective
Monitoring microservice applications: An SRE’s perspectiveDevOpsProdigy
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
Gaelyk - JFokus 2011 - Guillaume Laforge
Gaelyk - JFokus 2011 - Guillaume LaforgeGaelyk - JFokus 2011 - Guillaume Laforge
Gaelyk - JFokus 2011 - Guillaume LaforgeGuillaume Laforge
 
Database performance management
Database performance managementDatabase performance management
Database performance managementscottaver
 
10 Tips for Your Journey to the Public Cloud
10 Tips for Your Journey to the Public Cloud10 Tips for Your Journey to the Public Cloud
10 Tips for Your Journey to the Public CloudIntuit Inc.
 
Testing for Logic App Solutions | Integration Monday
Testing for Logic App Solutions | Integration MondayTesting for Logic App Solutions | Integration Monday
Testing for Logic App Solutions | Integration MondayBizTalk360
 
Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applicationsAmit Kejriwal
 
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...rschuppe
 
DockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging WorkshopDockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging WorkshopBrian Christner
 
No Devops Without Continuous Testing
No Devops Without Continuous TestingNo Devops Without Continuous Testing
No Devops Without Continuous TestingParasoft
 
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 PotsdamFrom Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 PotsdamAndreas Grabner
 
Building Real World Applications using Windows Azure - Scott Guthrie, 2nd Dec...
Building Real World Applications using Windows Azure - Scott Guthrie, 2nd Dec...Building Real World Applications using Windows Azure - Scott Guthrie, 2nd Dec...
Building Real World Applications using Windows Azure - Scott Guthrie, 2nd Dec...Vikas Sahni
 
Building azure applications ireland
Building azure applications irelandBuilding azure applications ireland
Building azure applications irelandMichael Meagher
 
Common Pitfalls of Functional Programming and How to Avoid Them: A Mobile Gam...
Common Pitfalls of Functional Programming and How to Avoid Them: A Mobile Gam...Common Pitfalls of Functional Programming and How to Avoid Them: A Mobile Gam...
Common Pitfalls of Functional Programming and How to Avoid Them: A Mobile Gam...gree_tech
 
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECTFlow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECTSabrina Marechal
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Lari Hotari
 
Azure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challengesAzure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challengesIvo Andreev
 
Deploy secure, scalable, and highly available web apps with Azure Front Door ...
Deploy secure, scalable, and highly available web apps with Azure Front Door ...Deploy secure, scalable, and highly available web apps with Azure Front Door ...
Deploy secure, scalable, and highly available web apps with Azure Front Door ...Stamo Petkov
 

Similar to Introduction to Prometheus Monitoring (Singapore Meetup) (20)

Salesforce Performance hacks - Client Side
Salesforce Performance hacks - Client SideSalesforce Performance hacks - Client Side
Salesforce Performance hacks - Client Side
 
Monitoring microservice applications: An SRE’s perspective
Monitoring microservice applications: An SRE’s perspectiveMonitoring microservice applications: An SRE’s perspective
Monitoring microservice applications: An SRE’s perspective
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Gaelyk - JFokus 2011 - Guillaume Laforge
Gaelyk - JFokus 2011 - Guillaume LaforgeGaelyk - JFokus 2011 - Guillaume Laforge
Gaelyk - JFokus 2011 - Guillaume Laforge
 
Software Performance
Software Performance Software Performance
Software Performance
 
Database performance management
Database performance managementDatabase performance management
Database performance management
 
10 Tips for Your Journey to the Public Cloud
10 Tips for Your Journey to the Public Cloud10 Tips for Your Journey to the Public Cloud
10 Tips for Your Journey to the Public Cloud
 
Testing for Logic App Solutions | Integration Monday
Testing for Logic App Solutions | Integration MondayTesting for Logic App Solutions | Integration Monday
Testing for Logic App Solutions | Integration Monday
 
Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applications
 
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
 
DockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging WorkshopDockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging Workshop
 
No Devops Without Continuous Testing
No Devops Without Continuous TestingNo Devops Without Continuous Testing
No Devops Without Continuous Testing
 
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 PotsdamFrom Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
 
Building Real World Applications using Windows Azure - Scott Guthrie, 2nd Dec...
Building Real World Applications using Windows Azure - Scott Guthrie, 2nd Dec...Building Real World Applications using Windows Azure - Scott Guthrie, 2nd Dec...
Building Real World Applications using Windows Azure - Scott Guthrie, 2nd Dec...
 
Building azure applications ireland
Building azure applications irelandBuilding azure applications ireland
Building azure applications ireland
 
Common Pitfalls of Functional Programming and How to Avoid Them: A Mobile Gam...
Common Pitfalls of Functional Programming and How to Avoid Them: A Mobile Gam...Common Pitfalls of Functional Programming and How to Avoid Them: A Mobile Gam...
Common Pitfalls of Functional Programming and How to Avoid Them: A Mobile Gam...
 
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECTFlow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014
 
Azure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challengesAzure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challenges
 
Deploy secure, scalable, and highly available web apps with Azure Front Door ...
Deploy secure, scalable, and highly available web apps with Azure Front Door ...Deploy secure, scalable, and highly available web apps with Azure Front Door ...
Deploy secure, scalable, and highly available web apps with Azure Front Door ...
 

Recently uploaded

PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleanscorenetworkseo
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
Intellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxIntellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxBipin Adhikari
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxeditsforyah
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 

Recently uploaded (20)

PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleans
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
Intellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxIntellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptx
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptx
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 

Introduction to Prometheus Monitoring (Singapore Meetup)

  • 2. So The Story Goes Like… Full story: https://www.slideshare.net/grobie/the-history-of-prometheus-at-soundcloud • 2012 - Joined SoundCloud • Left Google in 2012 after 5+ years • Side-project for open-source monitoring system for Not Only IT (econometrics, biochemical etc.) • Started LevelDB-backed Prometheus • Server, client_golang • Protocol Buffers • 2012 - Joined SoundCloud • Left Google in 2012 after 2+ years • Configuration, query language & & • 2013 - Joined SoundCloud • Left Google in 2013 after 7+ years • Storage rewrite (LevelDB to Chunks): March 2014 • Public release: January 2015 • Join Cloud Native Computing Foundation (CNCF): May 2016 • Prometheus 2.0 announced: November 08, 2017 • Singapore Meetup: 23 November, 2017
  • 3. Motivation Behind - Google SRE Best Practices Read book: https://landing.google.com/sre/book.html • SRE: Have software engineers do operations • Do the same work as an operations team, but with automation instead of manual labour • 50% upper bound cap on the amount of “ops”
  • 4. Google SLI, SLO, SLA Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html Service Level Indicators (SLIs) • A carefully defined quantitative measure of some aspect of the level of service that is provided • request latency / error rate (often expressed as % of all requests received ) / system throughput, Service Level Objectives (SLOs) • Lower bound ≤ SLI ≤ upper bound • Define the lowest level of reliability, and state that as your Service Level Objective (SLO). Service Level Agreements (SLAs) • SLA is a looser objective than the SLO. Alternatively the SLA might only specify a subset of SLO metrics. • I.e. availability SLA of 99.9% over 1 month with internal availability SLO of 99.95% • A promise to someone using a service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid (partial refund of subscription fee paid by customers for that period, or subscription time added for free)
  • 6. Latency • The time it takes to service a request. • Successful vs. failed requests • Slow error is even worse than a fast error. Track error latency. Traffic • A measure of how much demand is being placed on your system • Usually HTTP requests per second (static vs dynamic content) • Streaming system - network I/O rate or concurrent sessions • Key-value storage system - TPS. Errors • The rate of requests that fail, (e.g.: HTTP 500s or HTTP 200 but coupled with wrong content) Saturation • How "full" your service is. CPU, Memory, I/O • Can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? • Saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours.” Four Golden Signals
  • 7. Error Budget = 100% - SLO Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html Move fast without breaking SLO • 100% is the wrong reliability target • Error Budgets balance the goals of: • Product development teams (KPI is feature velocity, incentive to push code often) • SRE teams (KPI is reliability of a service, incentive to pushback against change) • Error budget can be spent on anything: launching features, etc. • Error budget provokes for discussion of phased rollouts and 1% experiments Goal of SRE team isn’t “zero outages” • SRE and product incentive-aligned to spend error budget and get max. feature velocity
  • 8. Googlers use Borgmon (a.k.a. Borgmon rules) Full story: https://landing.google.com/sre/book/chapters/practical-alerting.html %curl http://webserver:80/varz http_requests 37 errors_total 12 Each of the major languages used at Google has an implementation of the exported variable interface that automagically registers with the HTTP server built into every Google binary by default. It’s called “Collection via /varz “ Time Series: Distributed:
  • 9. …traditional monitoring in kube era Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus A lot of traffic to monitor Way more targets to monitor …and they constantly change Need a fleet-wide view (i..e What’s my overall 99th percentile latency)? Still need to be able to drill down for troubleshooting &
  • 10. Prometheus Relies on Exporters Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus Exporters: The endpoint being polled by the prometheus server and answering the GET requests is typically called exporter, e.g. the host-level metrics exporter is node-exporter. https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exporters.md
  • 11. Prometheus Architecture Full story: https://jaxenter.com/prometheus-monitoring-pros-cons-136019.html The 3 path-method combinations with the highest number of failing requests? topk(3, sum by(path, method) ( rate(http_requests_total{status=~"5.."}[5m])) ) The 99th percentile request latency by request path? histogram_quantile(0.99, sum by(le, path) ( rate(http_requests_duration_seconds_bucket[5m]) )) PromQL:
  • 12. Prometheus Storage Architecture • A monitoring system must be more reliabile than the systems it is monitoring • Prometheus's local storage is not meant as durable long-term storage. • Chunks of data are in RAM, with WAL on disk needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample [1…2 bytes] • Possible LVM solution if _really_ desperate As of writing (Nov. 2017) moment possible to integrate via adapters to: Chronix , Cortex , CrateDB , Graphite , InfluxDB , OpenTSDB , PostgreSQL/TimescaleDB , SignalFx , Clickhouse etc. This is primarily intended for long term storage. It is recommended that you perform careful evaluation of any solution in this space to confirm it can handle your data volumes. Full story: https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
  • 13. What Protetheus Is Not & Best Practice • Not 100% accurate • No logs, only metrics • Not a durable long-term storage • Not an anomaly detection • Not a dashboarding solution Full story: https://prometheus.io/docs/introduction/overview/#when-does-it-not-fit Run one Prometheus server (or HA pair) in each failure domain / zone / cluster, monitoring jobs only in that zone. Have a set of global Prometheus servers that monitor (federate from) the per-cluster ones.