SlideShare une entreprise Scribd logo
1  sur  48
Introduction to Prometheus
An Approach to Whitebox Monitoring
Who am I?
Engineer passionate about running software reliably in production.
Studied Computer Science in Trinity College Dublin.
Google SRE for 7 years, working on high-scale reliable systems.
Contributor to many open source projects, including Prometheus, Ansible,
Python, Aurora and Zookeeper.
Founder of Robust Perception, provider of commercial support and consulting
for Prometheus.
What is Whitebox Monitoring?
Blackbox monitoring
Monitoring from the outside
No knowledge of how the application works internally
Examples: ping, HTTP request, inserting data and waiting for it to appear on
dashboard
Where to use Blackbox
Blackbox monitoring should be treated similarly to smoke tests.
It’s good for finding when things have badly broken in an obvious way, and testing
from outside your network.
Not so good for knowing what’s going on inside a system.
Nor should it be treated like regression testing and try to test every single feature.
Tend to be flaky, as they either pass or fail.
Whitebox Monitoring
Complementary to blackbox monitoring.
Works with information from inside your systems.
Can be simple things like CPU usage, down to the number of requests triggering a
particular obscure codepath.
Prometheus
Inspired by Google’s Borgmon monitoring system.
Started in 2012 by ex-Googlers working in Soundcloud as an open source project.
Mainly written in Go. Version 1.0 released in 2016. Incubating with the CNCF.
500+ companies using it including Digital Ocean, Ericsson, Weave and CoreOS.
What is Monitoring For?
Why monitor?
Know when things go wrong
Be able to debug and gain insight
Trending to see changes over time
Plumbing data to other systems/processes
Knowing when things go wrong
The first thing people think of you say monitoring is alerting.
What is the wrongness we want to detect and alert on?
A blip with no real consequence, or a latency issue affecting users?
Symptoms vs Causes
Humans are limited in what they can handle.
If you alert on every single thing that might be a problem, you'll get overwhelmed
and suffer from alert fatigue.
Key problem: You care about things like user facing latency. There are hundreds
of things that could cause that.
Alerting on every possible cause is a Sisyphean task, but alerting on the symptom
of high latency is just one alert.
Example: CPU usage
Some monitoring systems don't allow you to alert on the latency of your servers.
The closest you can get is CPU usage.
False positives due to e.g. logrotate running too long.
False negatives due to deadlocks.
End result: Spammy alerts which operators learn to ignore, missing real problems.
Many Approaches have Limited Visibility
Services have Internals
Monitor the Internals
Monitor as a Service, not as Machines
Freedom for Alerting
A system like Prometheus gives you the freedom to alert on whatever you like.
Alerting on error ratio across all the machines in a datacenter? No problem.
Alerting on 95th percentile latency for the service being <200ms? No problem.
Alerting on data taking too long to get through your pipeline? No problem.
Alerting on your VIP not giving the right HTTP response codes? No problem.
Produce alerts that require intelligent human action!
Alerting Architecture
Debugging to Gain Insight
After you receive an alert notification you need to investigate it.
How do you work from a high level symptom alert such as increased latency?
You drill down through your stack with dashboards to find the subsystem that's the
cause!
Dashboards
Metrics from All Levels of the Stack
Many existing integrations: Java, JMX, Python, Go, Ruby, .Net, Machine,
Cloudwatch, EC2, MySQL, PostgreSQL, Haskell, Bash, Node.js, SNMP, Consul,
HAProxy, Mesos, Bind, CouchDB, Django, Mtail, Heka, Memcached, RabbitMQ,
Redis, RethinkDB, Rsyslog, Meteor.js, Minecraft and Factorio.
Graphite, Statsd, Collectd, Scollector, Munin, Nagios integrations aid transition.
It’s so easy, most of the above were written without the core team even knowing
about them!
Metrics are just one Tool
Metrics are good for alerting on issues and letting you drill down the focus of your
debugging.
Not a panacea though, as with all approaches fundamental limitations on data
volumes.
For successful debugging of complex problems you need a mix of logs, profiling
and source code analysis.
Complementary Debugging Tools
Trending and Reporting
Alerting and debugging is short term.
Trending is medium to long term.
How is cache hit rate changing over time?
Is anyone still using that obscure feature?
With Prometheus you can do analysis beyond this.
Powerful Query Language
Can multiply, add, aggregate, join, predict, take quantiles across many metrics in
the same query. Can evaluate right now, and graph back in time.
Answer questions like:
What’s the 95th percentile latency in each datacenter over the past month?
How full will the disks be in 4 days?
Which services are the top 5 users of CPU?
Example: Top 5 Docker images by CPU
topk(5,
sum by (image)(
rate(container_cpu_usage_seconds_total{
id=~"/system.slice/docker.*"}[5m]
)
)
)
Structured Data: Labels
Prometheus doesn’t use dotted.strings like metric.grafnacon.nyc.
Multi-dimensional labels instead like
metric{event=”grafanacon”,aircraft_carrier_location=”nyc”}
Can aggregate, cut, and slice along them.
Can come from instrumentation, or be added based on the service you are
monitoring.
Example: Labels from Node Exporter
Python Instrumentation: An example
pip install prometheus_client
from prometheus_client import Summary, start_http_server
REQUEST_DURATION = Summary('request_duration_seconds',
'Request duration in seconds')
@REQUEST_DURATION.time()
def my_handler(request):
pass // Your code here
start_http_server(8000)
Adding Dimensions (No Evil Twins Please)
from prometheus_client import Counter
REQUESTS = Counter('requests_total',
'Total requests', ['method'])
def my_handler(request):
REQUESTS.labels(request.method).inc()
pass // Your code here
Labels go beyond Prometheus
If you're using Kubernetes, Prometheus can take in your labels and annotations
too.
Similar data models and mutual integrations make your life easier!
Plumbing
Prometheus isn't just open source, it's also an open ecosystem.
We know we can't support everything, so at every level there's a generic interface
to let you get data in and/or out.
So for example if you want to run a shell script when an alert fires, you can make
that happen.
Prometheus Clients as a Clearinghouse
Live Demo!
Monitoring What Matters with Prometheus
To summarise, the key things Prometheus empowers you to build:
Alerting on symptoms. Alerts which require intelligent human action.
Debugging dashboards that let you drill down to where the problem is.
The ability to run complex queries to slice and dice your data.
Easy integration points for other systems.
These are good things to have no matter which monitoring system(s) you use.
10 Tips for Monitoring
With potentially millions of time series across your system, can be difficult to know
what is and isn't useful.
What approaches help manage this complexity?
How do you avoid getting caught out?
Here's some tips.
#1: Choose your key statistics
Users don't care that one of your machines is short of CPU.
Users care if the service is slow or throwing errors.
For your primary dashboards focus on high-level metrics that directly impact
users.
#2: Use aggregations
Think about services, not machines.
Once you have more than a handful of machines, you should treat them as an
amorphous blob.
Looking at the key statistics is easier for 10 services than 10 services each of
which is on 10 machines
Once you have isolated a problem to one service, then can see if one machine is
the problem
#3: Avoid the Wall of Graphs
Dashboards tend to grow without bound. Worst I've seen was 600 graphs.
It might look impressive, but humans can't deal with that much data at once.
(and they take forever to load)
Your services will have a rough tree structure, have a dashboard per service and
talk the tree from the top when you have a problem. Similarly for each service,
have dashboards per subsystem.
Rule of Thumb: Limit of 5 graphs per dashboard, and 5 lines per graph.
#4: Client-side quantiles aren't aggregatable
Many instrumentation systems calculate quantiles/percentiles inside each
process, and export it to the TSDB.
It is not statistically possible to aggregate these.
If you want meaningful quantiles, you should track histogram buckets in each
process, aggregate those in your monitoring system and then calculate the
quantile.
This is done using histogram_quantile() and rate() in Prometheus.
#5: Averages are easy to reason about
Q: Say you have a service with two backends. If 95th percentile latency goes up
due to one of the backends, what will you see in 95th percentile latency for that
backend?
A: ?
#5: Averages are easy to reason about
Q: Say you have a service with two backends. If 95th percentile latency goes up
due to one of the backends, what will you see in 95th percentile latency for that
backend?
A: It depends, could be no change. If the latencies are strongly correlated for each
request across the backends, you'll see the same latency bump.
This is tricky to reason about, especially in an emergency.
Averages don't have this problem, as they include all requests.
#6: Costs and Benefits
1s resolution monitoring of all metrics would be handy for debugging.
But is it ten time more valuable than 10s monitoring? And sixty times more
valuable than 60s monitoring?
Monitoring isn't free. It costs resources to run, and resources in the services being
monitored too. Quantiles and histograms can get expensive fast.
60s resolution is generally a good balance. Reserve 1s granularity or a literal
handful of key metrics.
#7: Nyquist-Shannon Sampling Theorem
To reconstruct a signal you need a resolution that's at least double it's frequency.
If you've got a 10s resolution time series, you can't reconstruct patterns that are
less than 20s long.
Higher frequency patterns can cause effects like aliasing, and mislead you.
If you suspect that there's something more to the data, try a higher resolution
temporarily or start profiling.
#8: Correlation is not Causation - Confirmation Bias
Humans are great at spotting patterns. Not all of them are actually there.
Always try to look for evidence that'd falsify your hypothesis.
If two metrics seem to correlate on a graph that doesn't mean that they're related.
They could be independent tasks running on the same schedule.
Or if you zoom out there plenty of times when one spikes but not the other.
Or one could be causing a slight increase in resource contention, pushing the
other over the edge.
#9 Know when to use Logs and Metrics
You want a metrics time series system for your primary monitoring.
Logs have information about every event. This limits the number of fields (<100),
but you have unlimited cardinality.
Metrics aggregate across events, but you can have many metrics (>10000) with
limited cardinality.
Metrics help you determine where in the system the problem is. From there, logs
can help you pinpoint which requests are tickling the problem.
#10 Have a way to deal with non-critical alerts
Most alerts don't justify waking up someone at night, but someone needs to look
at them sometime.
Often they're sent to a mailing list, where everyone promptly filters them away.
Better to have some form of ticketing system that'll assign a single owner for each
alert.
A daily email with all firing alerts that the oncall has to process can also work.
Questions?
Project Website: prometheus.io
Demo: demo.robustperception.io
Company Website: www.robustperception.io

Contenu connexe

Tendances

Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_Tutorial
Tim Vaillancourt
 

Tendances (20)

Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Prometheus workshop
Prometheus workshopPrometheus workshop
Prometheus workshop
 
Introduction to Prometheus
Introduction to PrometheusIntroduction to Prometheus
Introduction to Prometheus
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with Prometheus
 
Prometheus with Grafana - AddWeb Solution
Prometheus with Grafana - AddWeb SolutionPrometheus with Grafana - AddWeb Solution
Prometheus with Grafana - AddWeb Solution
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with Prometheus
 
MySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaMySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & Grafana
 
End to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max Inden
 
Prometheus
PrometheusPrometheus
Prometheus
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
Prometheus monitoring
Prometheus monitoringPrometheus monitoring
Prometheus monitoring
 
An Introduction to Prometheus
An Introduction to PrometheusAn Introduction to Prometheus
An Introduction to Prometheus
 
Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_Tutorial
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
Thanos: Global, durable Prometheus monitoring
Thanos: Global, durable Prometheus monitoringThanos: Global, durable Prometheus monitoring
Thanos: Global, durable Prometheus monitoring
 
OpenTelemetry For Architects
OpenTelemetry For ArchitectsOpenTelemetry For Architects
OpenTelemetry For Architects
 

En vedette

En vedette (20)

What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
 
An Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQL
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
So You Want to Write an Exporter
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an Exporter
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
 
Ansible at FOSDEM (Ansible Dublin, 2016)
Ansible at FOSDEM (Ansible Dublin, 2016)Ansible at FOSDEM (Ansible Dublin, 2016)
Ansible at FOSDEM (Ansible Dublin, 2016)
 
Foreman como provisionador
Foreman como provisionadorForeman como provisionador
Foreman como provisionador
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
 
Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)
 
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
 

Similaire à An Introduction to Prometheus (GrafanaCon 2016)

LOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring EnvironmentLOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring Environment
Mike Julian
 

Similaire à An Introduction to Prometheus (GrafanaCon 2016) (20)

Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018
 
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
 
Security for AWS : Journey to Least Privilege (update)
Security for AWS : Journey to Least Privilege (update)Security for AWS : Journey to Least Privilege (update)
Security for AWS : Journey to Least Privilege (update)
 
Security for AWS: Journey to Least Privilege
Security for AWS: Journey to Least PrivilegeSecurity for AWS: Journey to Least Privilege
Security for AWS: Journey to Least Privilege
 
Operations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from HappeningOperations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from Happening
 
MySQL Monitoring Shoot Out
MySQL Monitoring Shoot OutMySQL Monitoring Shoot Out
MySQL Monitoring Shoot Out
 
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
 
Chaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just ChaosChaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just Chaos
 
Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?Is your Automation Infrastructure ‘Well Architected’?
Is your Automation Infrastructure ‘Well Architected’?
 
Tef con2016 (1)
Tef con2016 (1)Tef con2016 (1)
Tef con2016 (1)
 
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From HappeningStart Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
 
Operations: Production Readiness
Operations: Production ReadinessOperations: Production Readiness
Operations: Production Readiness
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
 
LOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring EnvironmentLOPSA East 2013 - Building a More Effective Monitoring Environment
LOPSA East 2013 - Building a More Effective Monitoring Environment
 
Cartographer, or Building A Next Generation Management Framework
Cartographer, or Building A Next Generation Management FrameworkCartographer, or Building A Next Generation Management Framework
Cartographer, or Building A Next Generation Management Framework
 
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdfPrometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
 
Which watcher watches CloudWatch
Which watcher watches CloudWatch Which watcher watches CloudWatch
Which watcher watches CloudWatch
 

Plus de Brian Brazil

Plus de Brian Brazil (9)

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
 
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
 
Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)
 
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
 
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
 

Dernier

pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
JOHNBEBONYAP1
 
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
@Chandigarh #call #Girls 9053900678 @Call #Girls in @Punjab 9053900678
 
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
nirzagarg
 
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
ydyuyu
 

Dernier (20)

20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
 
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
 
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls DubaiDubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
 
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
 
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
 
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 

An Introduction to Prometheus (GrafanaCon 2016)

  • 1. Introduction to Prometheus An Approach to Whitebox Monitoring
  • 2. Who am I? Engineer passionate about running software reliably in production. Studied Computer Science in Trinity College Dublin. Google SRE for 7 years, working on high-scale reliable systems. Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper. Founder of Robust Perception, provider of commercial support and consulting for Prometheus.
  • 3. What is Whitebox Monitoring?
  • 4. Blackbox monitoring Monitoring from the outside No knowledge of how the application works internally Examples: ping, HTTP request, inserting data and waiting for it to appear on dashboard
  • 5. Where to use Blackbox Blackbox monitoring should be treated similarly to smoke tests. It’s good for finding when things have badly broken in an obvious way, and testing from outside your network. Not so good for knowing what’s going on inside a system. Nor should it be treated like regression testing and try to test every single feature. Tend to be flaky, as they either pass or fail.
  • 6. Whitebox Monitoring Complementary to blackbox monitoring. Works with information from inside your systems. Can be simple things like CPU usage, down to the number of requests triggering a particular obscure codepath.
  • 7. Prometheus Inspired by Google’s Borgmon monitoring system. Started in 2012 by ex-Googlers working in Soundcloud as an open source project. Mainly written in Go. Version 1.0 released in 2016. Incubating with the CNCF. 500+ companies using it including Digital Ocean, Ericsson, Weave and CoreOS.
  • 9. Why monitor? Know when things go wrong Be able to debug and gain insight Trending to see changes over time Plumbing data to other systems/processes
  • 10. Knowing when things go wrong The first thing people think of you say monitoring is alerting. What is the wrongness we want to detect and alert on? A blip with no real consequence, or a latency issue affecting users?
  • 11. Symptoms vs Causes Humans are limited in what they can handle. If you alert on every single thing that might be a problem, you'll get overwhelmed and suffer from alert fatigue. Key problem: You care about things like user facing latency. There are hundreds of things that could cause that. Alerting on every possible cause is a Sisyphean task, but alerting on the symptom of high latency is just one alert.
  • 12. Example: CPU usage Some monitoring systems don't allow you to alert on the latency of your servers. The closest you can get is CPU usage. False positives due to e.g. logrotate running too long. False negatives due to deadlocks. End result: Spammy alerts which operators learn to ignore, missing real problems.
  • 13. Many Approaches have Limited Visibility
  • 16. Monitor as a Service, not as Machines
  • 17. Freedom for Alerting A system like Prometheus gives you the freedom to alert on whatever you like. Alerting on error ratio across all the machines in a datacenter? No problem. Alerting on 95th percentile latency for the service being <200ms? No problem. Alerting on data taking too long to get through your pipeline? No problem. Alerting on your VIP not giving the right HTTP response codes? No problem. Produce alerts that require intelligent human action!
  • 19. Debugging to Gain Insight After you receive an alert notification you need to investigate it. How do you work from a high level symptom alert such as increased latency? You drill down through your stack with dashboards to find the subsystem that's the cause!
  • 21. Metrics from All Levels of the Stack Many existing integrations: Java, JMX, Python, Go, Ruby, .Net, Machine, Cloudwatch, EC2, MySQL, PostgreSQL, Haskell, Bash, Node.js, SNMP, Consul, HAProxy, Mesos, Bind, CouchDB, Django, Mtail, Heka, Memcached, RabbitMQ, Redis, RethinkDB, Rsyslog, Meteor.js, Minecraft and Factorio. Graphite, Statsd, Collectd, Scollector, Munin, Nagios integrations aid transition. It’s so easy, most of the above were written without the core team even knowing about them!
  • 22. Metrics are just one Tool Metrics are good for alerting on issues and letting you drill down the focus of your debugging. Not a panacea though, as with all approaches fundamental limitations on data volumes. For successful debugging of complex problems you need a mix of logs, profiling and source code analysis.
  • 24. Trending and Reporting Alerting and debugging is short term. Trending is medium to long term. How is cache hit rate changing over time? Is anyone still using that obscure feature? With Prometheus you can do analysis beyond this.
  • 25. Powerful Query Language Can multiply, add, aggregate, join, predict, take quantiles across many metrics in the same query. Can evaluate right now, and graph back in time. Answer questions like: What’s the 95th percentile latency in each datacenter over the past month? How full will the disks be in 4 days? Which services are the top 5 users of CPU?
  • 26. Example: Top 5 Docker images by CPU topk(5, sum by (image)( rate(container_cpu_usage_seconds_total{ id=~"/system.slice/docker.*"}[5m] ) ) )
  • 27. Structured Data: Labels Prometheus doesn’t use dotted.strings like metric.grafnacon.nyc. Multi-dimensional labels instead like metric{event=”grafanacon”,aircraft_carrier_location=”nyc”} Can aggregate, cut, and slice along them. Can come from instrumentation, or be added based on the service you are monitoring.
  • 28. Example: Labels from Node Exporter
  • 29. Python Instrumentation: An example pip install prometheus_client from prometheus_client import Summary, start_http_server REQUEST_DURATION = Summary('request_duration_seconds', 'Request duration in seconds') @REQUEST_DURATION.time() def my_handler(request): pass // Your code here start_http_server(8000)
  • 30. Adding Dimensions (No Evil Twins Please) from prometheus_client import Counter REQUESTS = Counter('requests_total', 'Total requests', ['method']) def my_handler(request): REQUESTS.labels(request.method).inc() pass // Your code here
  • 31. Labels go beyond Prometheus If you're using Kubernetes, Prometheus can take in your labels and annotations too. Similar data models and mutual integrations make your life easier!
  • 32. Plumbing Prometheus isn't just open source, it's also an open ecosystem. We know we can't support everything, so at every level there's a generic interface to let you get data in and/or out. So for example if you want to run a shell script when an alert fires, you can make that happen.
  • 33. Prometheus Clients as a Clearinghouse
  • 35. Monitoring What Matters with Prometheus To summarise, the key things Prometheus empowers you to build: Alerting on symptoms. Alerts which require intelligent human action. Debugging dashboards that let you drill down to where the problem is. The ability to run complex queries to slice and dice your data. Easy integration points for other systems. These are good things to have no matter which monitoring system(s) you use.
  • 36. 10 Tips for Monitoring With potentially millions of time series across your system, can be difficult to know what is and isn't useful. What approaches help manage this complexity? How do you avoid getting caught out? Here's some tips.
  • 37. #1: Choose your key statistics Users don't care that one of your machines is short of CPU. Users care if the service is slow or throwing errors. For your primary dashboards focus on high-level metrics that directly impact users.
  • 38. #2: Use aggregations Think about services, not machines. Once you have more than a handful of machines, you should treat them as an amorphous blob. Looking at the key statistics is easier for 10 services than 10 services each of which is on 10 machines Once you have isolated a problem to one service, then can see if one machine is the problem
  • 39. #3: Avoid the Wall of Graphs Dashboards tend to grow without bound. Worst I've seen was 600 graphs. It might look impressive, but humans can't deal with that much data at once. (and they take forever to load) Your services will have a rough tree structure, have a dashboard per service and talk the tree from the top when you have a problem. Similarly for each service, have dashboards per subsystem. Rule of Thumb: Limit of 5 graphs per dashboard, and 5 lines per graph.
  • 40. #4: Client-side quantiles aren't aggregatable Many instrumentation systems calculate quantiles/percentiles inside each process, and export it to the TSDB. It is not statistically possible to aggregate these. If you want meaningful quantiles, you should track histogram buckets in each process, aggregate those in your monitoring system and then calculate the quantile. This is done using histogram_quantile() and rate() in Prometheus.
  • 41. #5: Averages are easy to reason about Q: Say you have a service with two backends. If 95th percentile latency goes up due to one of the backends, what will you see in 95th percentile latency for that backend? A: ?
  • 42. #5: Averages are easy to reason about Q: Say you have a service with two backends. If 95th percentile latency goes up due to one of the backends, what will you see in 95th percentile latency for that backend? A: It depends, could be no change. If the latencies are strongly correlated for each request across the backends, you'll see the same latency bump. This is tricky to reason about, especially in an emergency. Averages don't have this problem, as they include all requests.
  • 43. #6: Costs and Benefits 1s resolution monitoring of all metrics would be handy for debugging. But is it ten time more valuable than 10s monitoring? And sixty times more valuable than 60s monitoring? Monitoring isn't free. It costs resources to run, and resources in the services being monitored too. Quantiles and histograms can get expensive fast. 60s resolution is generally a good balance. Reserve 1s granularity or a literal handful of key metrics.
  • 44. #7: Nyquist-Shannon Sampling Theorem To reconstruct a signal you need a resolution that's at least double it's frequency. If you've got a 10s resolution time series, you can't reconstruct patterns that are less than 20s long. Higher frequency patterns can cause effects like aliasing, and mislead you. If you suspect that there's something more to the data, try a higher resolution temporarily or start profiling.
  • 45. #8: Correlation is not Causation - Confirmation Bias Humans are great at spotting patterns. Not all of them are actually there. Always try to look for evidence that'd falsify your hypothesis. If two metrics seem to correlate on a graph that doesn't mean that they're related. They could be independent tasks running on the same schedule. Or if you zoom out there plenty of times when one spikes but not the other. Or one could be causing a slight increase in resource contention, pushing the other over the edge.
  • 46. #9 Know when to use Logs and Metrics You want a metrics time series system for your primary monitoring. Logs have information about every event. This limits the number of fields (<100), but you have unlimited cardinality. Metrics aggregate across events, but you can have many metrics (>10000) with limited cardinality. Metrics help you determine where in the system the problem is. From there, logs can help you pinpoint which requests are tickling the problem.
  • 47. #10 Have a way to deal with non-critical alerts Most alerts don't justify waking up someone at night, but someone needs to look at them sometime. Often they're sent to a mailing list, where everyone promptly filters them away. Better to have some form of ticketing system that'll assign a single owner for each alert. A daily email with all firing alerts that the oncall has to process can also work.
  • 48. Questions? Project Website: prometheus.io Demo: demo.robustperception.io Company Website: www.robustperception.io