SlideShare une entreprise Scribd logo
1  sur  39
Télécharger pour lire hors ligne
Brian Brazil
Founder
Monitoring
Kubernetes
with Prometheus
Who am I?
Engineer passionate about running software reliably in production.
● TCD CS Degree
● Google SRE for 7 years, working on high-scale reliable systems such as
Adwords, Adsense, Ad Exchange, Billing, Database
● Boxever TL Systems&Infrastructure, applied processes and technology to let
allow company to scale and reduce operational load
● Contributor to many open source projects, including Prometheus, Ansible,
Python, Aurora and Zookeeper.
● Founder of Robust Perception, making scalability and efficiency available to
everyone
Why monitor?
● Know when things go wrong
○ To call in a human to prevent a business-level issue, or prevent an issue in advance
● Be able to debug and gain insight
● Trending to see changes over time, and drive technical/business decisions
● To feed into other systems/processes (e.g. QA, security, automation)
Common Monitoring Challenges
Themes common among companies I’ve talk to:
● Monitoring tools are limited, both technically and conceptually
● Tools don’t scale well and are unwieldy to manage
● Operational practices don’t align with the business
For example:
Your customers care about increased latency and it’s in your SLAs. You can only
alert on individual machine CPU usage.
Result: Engineers continuously woken up for non-issues, get fatigued
Fundamental Challenge is Limited Visibility
Prometheus
Inspired by Google’s Borgmon monitoring system.
(Kubernetes is the next version of Google’s Borg.)
Started in 2012 by ex-Googlers working in Soundcloud as an open source project,
mainly written in Go. Publically launched in early 2015, and continues to be
independent of any one company.
Over 100 companies have started relying on it since then.
What does Prometheus offer?
● Inclusive Monitoring
● Powerful data model
● Powerful query language
● Manageable and Reliable
● Efficient
● Scalable
● Easy to integrate with
● Dashboards
Services have Internals
Monitor the Internals
Monitor as a Service, not as Machines
Inclusive Monitoring
Don’t monitor just at the edges:
● Instrument client libraries
● Instrument server libraries (e.g. HTTP/RPC)
● Instrument business logic
Library authors get information about usage.
Application developers get monitoring of common components for free.
Dashboards and alerting can be provided out of the box, customised for your
organisation!
How to instrument your code?
Several common approaches:
● Custom endpoint/interface to dump stats (usually JSON or CSV)
● Use one mildly standard instrumentation system (e.g. JMX in Java)
● For libraries, have some type of hooks
● Don’t
This isn’t great
Users run more than just your software project, with a variety of monitoring tools.
As a user you’re left with a choice: Have N monitoring systems, or run extra
services to act as shims to translate.
As a monitoring project, we have to write code and/or configuration for every
individual library/application.
This is a sub-optimal for everyone.
Open ecosystem
Prometheus client libraries don’t tie you into Prometheus.
For example the Python and Java clients can output to Graphite, with no need to
run any Prometheus components.
This means that you as a library author can instrument with Prometheus, and your
users with just a few lines of code can output to whatever monitoring system they
want. No need for users to worry about a new standard.
This can be done incrementally.
It goes the other way too
It’s unlikely that everyone is going to switch to Prometheus all at once, so we’ve
integrations that can take in data from other monitoring systems and make it
useful.
Graphite, Collectd, Statsd, SNMP, JMX, Dropwizard, AWS Cloudwatch, New
Relic, Rsyslog and Scollector (Bosun) are some examples.
Prometheus and its client libraries can act a clearinghouse to convert between
monitoring systems.
For example Zalando’s Zmon uses the Python client to parse Prometheus metrics
from directly instrumented binaries.
Instrumentation made easy
Prometheus clients don’t just marshall data.
They take care of the nitty gritty details like concurrency and state tracking.
We take advantage of the strengths of each language.
In Python for example that means context managers and decorators.
Python Instrumentation Example
pip install prometheus_client
from prometheus_client import Summary, start_http_server
REQUEST_DURATION = Summary('request_duration_seconds',
'Request duration in seconds')
@REQUEST_DURATION.time()
def my_handler(request):
pass # Your code here
start_http_server(8000)
Exceptional Circumstances In Progress
from prometheus_client import Counter, Gauge
EXCEPTIONS = Counter('exceptions_total', 'Total exceptions')
IN_PROGRESS = Gauge('inprogress_requests', 'In progress')
@EXCEPTIONS.count_exceptions()
@IN_PROGRESS.track_inprogress()
def my_handler(request):
pass // Your code here
Data and Query Language: Labels
Prometheus doesn’t use dotted.strings like metric.kubernetes.dublin.
Multi-dimensional labels instead like metric{tech=”kubernetes”,city=”dublin”}
Similar to Kubernetes' labels.
Can aggregate, cut, and slice along them.
Can come from instrumentation, or be added based on the service you are
monitoring.
Example: Labels from Node Exporter
Adding Dimensions (No Evil Twins Please)
from prometheus_client import Counter
REQUESTS = Counter('requests_total',
'Total requests', ['method'])
def my_handler(request):
REQUESTS.labels(request.method).inc()
pass // Your code here
Powerful Query Language
Can multiply, add, aggregate, join, predict, take quantiles across many metrics in
the same query. Can evaluate right now, and graph back in time.
Answer questions like:
● What’s the 95th percentile latency in the European datacenter?
● How full will the disks be in 4 hours?
● Which services are the top 5 users of CPU?
Can alert based on any query.
Example: Top 5 Docker images by CPU
topk(5,
sum by (image)(
rate(container_cpu_usage_seconds_total{
id=~"/system.slice/docker.*"}[5m]
)
)
)
Exactly how powerful is the query language?
In August 2015 it was demonstrated to be Turing Complete.
I did this by implementing Conway’s Life in Prometheus.
Don’t try this in production :)
Manageable and Reliable
Core Prometheus server is a single binary.
Doesn’t depend on Zookeeper, Consul, Cassandra, Hadoop or the Internet.
Only requires local disk (SSD recommended). No potential for cascading failure.
Pull based, so easy to on run a workstation for testing and rogue servers can’t
push bad metrics.
Advanced service discovery finds what to monitor.
Running Prometheus under Kubernetes
Docker images at https://hub.docker.com/r/prom/
To run Prometheus:
kubectl run prometheus --image=prom/prometheus:latest 
--port=9090
This monitors itself. With a service around it, you can easily access it.
Easy to integrate with
Many existing integrations: Java, JMX, Python, Go, Ruby, .Net, Machine,
Cloudwatch, EC2, MySQL, PostgreSQL, Haskell, Bash, Node.js, SNMP, Consul,
HAProxy, Mesos, Bind, CouchDB, Django, Mtail, Heka, Memcached, RabbitMQ,
Redis, RethinkDB, Rsyslog, Meteor.js, Minecraft...
Graphite, Statsd, Collectd, Scollector, Munin, Nagios integrations aid transition.
It’s so easy, most of the above were written without the core team even knowing
about them!
Kubernetes natively supports Prometheus
If you go to the Kubernetes apiserver, the /metrics endpoint has Prometheus
metrics
Example: http://localhost:8080/metrics
Cadvisor natively supports Prometheus
http://localhost:4194/metrics will have metrics for all your containers
If you’re not already running it, you can do:
docker run -v /var/run/:/var/run -v /sys:/sys -p 4194:8080
google/cadvisor
Or use Kubernetes DaemonSets to run it everywhere.
Service Discovery
It’s vital to know where each of your instances is meant to be running.
Systems such as Kubernetes spread applications across machines, so we need
some form of service discovery to find them.
Prometheus supports Kubernetes as a discovery mechanism
Prometheus configuration to monitor kubelets
scrape_configs:
- job_name: 'kubelet'
kubernetes_sd_configs:
- api_servers: 'http://127.0.0.1:8500'
relabel_configs:
- source_labels: [__meta_kubernetes_role]
action: keep
regex: node
Prometheus configuration to monitor nodes
scrape_configs:
- job_name: 'node'
kubernetes_sd_configs:
- api_servers: 'http://127.0.0.1:8500'
relabel_configs:
- source_labels: [__meta_kubernetes_role]
action: keep
regex: node
- source_labels: [__address__]
regex: (.*):.*
target_label: __address__
replacement: ${1}:9100
Discovering Services
Can also work via service endpoints.
Can use annotations and labels of the service to select and set labels.
How this works in terms of containers/pods/services/endpoints is currently under
active discussion, and how it works will likely change in the next week or two.
Efficient
Instrumenting everything means a lot of data.
Prometheus is best in class for lossless storage efficiency, 4.5 bytes per datapoint.
A single server can handle:
● millions of metrics
● hundreds of thousands of datapoints per second
Scalable
Prometheus is easy to run, can give one to each team in each datacenter.
Federation allows pulling key metrics from other Prometheus servers.
When one job is too big for a single Prometheus server, can use
sharding+federation to scale out. Needed with thousands of machines.
Dashboards
What does Prometheus offer?
● Inclusive Monitoring
● Powerful data model
● Powerful query language
● Manageable and Reliable
● Efficient
● Scalable
● Easy to integrate with
● Dashboards
What do we do?
Robust Perception provides consulting and training to give you confidence in your
production service's ability to run efficiently and scale with your business.
We can help you:
● Decide if Prometheus is for you
● Manage your transition to Prometheus and resolve issues that arise
● With capacity planning, debugging, infrastructure etc.
We are proud to be among the core contributors to the Prometheus project.
Resources
Official Project Website: prometheus.io
Official Mailing List: prometheus-developers@googlegroups.com
Demo: demo.robustperception.io
Robust Perception Website: www.robustperception.io
Queries: prometheus@robustperception.io

Contenu connexe

Tendances

Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_Tutorial
Tim Vaillancourt
 

Tendances (20)

Prometheus monitoring
Prometheus monitoringPrometheus monitoring
Prometheus monitoring
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Terraform modules and some of best-practices - March 2019
Terraform modules and some of best-practices - March 2019Terraform modules and some of best-practices - March 2019
Terraform modules and some of best-practices - March 2019
 
Introduction to helm
Introduction to helmIntroduction to helm
Introduction to helm
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy
 
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
 
Azure Container Apps
Azure Container AppsAzure Container Apps
Azure Container Apps
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using Prometheus
 
Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_Tutorial
 
Prometheus workshop
Prometheus workshopPrometheus workshop
Prometheus workshop
 
Building Repeatable Infrastructure using Terraform
Building Repeatable Infrastructure using TerraformBuilding Repeatable Infrastructure using Terraform
Building Repeatable Infrastructure using Terraform
 
Prometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is coming
 
Kubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive OverviewKubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive Overview
 
Comprehensive Terraform Training
Comprehensive Terraform TrainingComprehensive Terraform Training
Comprehensive Terraform Training
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
 
Terraform
TerraformTerraform
Terraform
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
 

Similaire à Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)

Similaire à Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016) (20)

Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
Build cloud native solution using open source
Build cloud native solution using open source Build cloud native solution using open source
Build cloud native solution using open source
 
Ultimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on KubernetesUltimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on Kubernetes
 
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
 
Monitoring Cloud Native Applications with Prometheus
Monitoring Cloud Native Applications with PrometheusMonitoring Cloud Native Applications with Prometheus
Monitoring Cloud Native Applications with Prometheus
 
From nothing to Prometheus : one year after
From nothing to Prometheus : one year afterFrom nothing to Prometheus : one year after
From nothing to Prometheus : one year after
 
Cloud APIs Overview Tucker
Cloud APIs Overview   TuckerCloud APIs Overview   Tucker
Cloud APIs Overview Tucker
 
Monitoring kubernetes with prometheus-operator
Monitoring kubernetes with prometheus-operatorMonitoring kubernetes with prometheus-operator
Monitoring kubernetes with prometheus-operator
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
System monitoring
System monitoringSystem monitoring
System monitoring
 
Slide DevSecOps Microservices
Slide DevSecOps Microservices Slide DevSecOps Microservices
Slide DevSecOps Microservices
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
 
Consul and Consul Pusher
Consul and Consul PusherConsul and Consul Pusher
Consul and Consul Pusher
 
Cloud Native Development
Cloud Native DevelopmentCloud Native Development
Cloud Native Development
 

Plus de Brian Brazil

Plus de Brian Brazil (20)

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
 
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
 
Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)
 
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
 
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
 
So You Want to Write an Exporter
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an Exporter
 
An Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQL
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)
 
Ansible at FOSDEM (Ansible Dublin, 2016)
Ansible at FOSDEM (Ansible Dublin, 2016)Ansible at FOSDEM (Ansible Dublin, 2016)
Ansible at FOSDEM (Ansible Dublin, 2016)
 

Dernier

Dernier (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)

  • 2. Who am I? Engineer passionate about running software reliably in production. ● TCD CS Degree ● Google SRE for 7 years, working on high-scale reliable systems such as Adwords, Adsense, Ad Exchange, Billing, Database ● Boxever TL Systems&Infrastructure, applied processes and technology to let allow company to scale and reduce operational load ● Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper. ● Founder of Robust Perception, making scalability and efficiency available to everyone
  • 3. Why monitor? ● Know when things go wrong ○ To call in a human to prevent a business-level issue, or prevent an issue in advance ● Be able to debug and gain insight ● Trending to see changes over time, and drive technical/business decisions ● To feed into other systems/processes (e.g. QA, security, automation)
  • 4. Common Monitoring Challenges Themes common among companies I’ve talk to: ● Monitoring tools are limited, both technically and conceptually ● Tools don’t scale well and are unwieldy to manage ● Operational practices don’t align with the business For example: Your customers care about increased latency and it’s in your SLAs. You can only alert on individual machine CPU usage. Result: Engineers continuously woken up for non-issues, get fatigued
  • 5. Fundamental Challenge is Limited Visibility
  • 6. Prometheus Inspired by Google’s Borgmon monitoring system. (Kubernetes is the next version of Google’s Borg.) Started in 2012 by ex-Googlers working in Soundcloud as an open source project, mainly written in Go. Publically launched in early 2015, and continues to be independent of any one company. Over 100 companies have started relying on it since then.
  • 7. What does Prometheus offer? ● Inclusive Monitoring ● Powerful data model ● Powerful query language ● Manageable and Reliable ● Efficient ● Scalable ● Easy to integrate with ● Dashboards
  • 10. Monitor as a Service, not as Machines
  • 11. Inclusive Monitoring Don’t monitor just at the edges: ● Instrument client libraries ● Instrument server libraries (e.g. HTTP/RPC) ● Instrument business logic Library authors get information about usage. Application developers get monitoring of common components for free. Dashboards and alerting can be provided out of the box, customised for your organisation!
  • 12. How to instrument your code? Several common approaches: ● Custom endpoint/interface to dump stats (usually JSON or CSV) ● Use one mildly standard instrumentation system (e.g. JMX in Java) ● For libraries, have some type of hooks ● Don’t
  • 13. This isn’t great Users run more than just your software project, with a variety of monitoring tools. As a user you’re left with a choice: Have N monitoring systems, or run extra services to act as shims to translate. As a monitoring project, we have to write code and/or configuration for every individual library/application. This is a sub-optimal for everyone.
  • 14. Open ecosystem Prometheus client libraries don’t tie you into Prometheus. For example the Python and Java clients can output to Graphite, with no need to run any Prometheus components. This means that you as a library author can instrument with Prometheus, and your users with just a few lines of code can output to whatever monitoring system they want. No need for users to worry about a new standard. This can be done incrementally.
  • 15. It goes the other way too It’s unlikely that everyone is going to switch to Prometheus all at once, so we’ve integrations that can take in data from other monitoring systems and make it useful. Graphite, Collectd, Statsd, SNMP, JMX, Dropwizard, AWS Cloudwatch, New Relic, Rsyslog and Scollector (Bosun) are some examples. Prometheus and its client libraries can act a clearinghouse to convert between monitoring systems. For example Zalando’s Zmon uses the Python client to parse Prometheus metrics from directly instrumented binaries.
  • 16. Instrumentation made easy Prometheus clients don’t just marshall data. They take care of the nitty gritty details like concurrency and state tracking. We take advantage of the strengths of each language. In Python for example that means context managers and decorators.
  • 17. Python Instrumentation Example pip install prometheus_client from prometheus_client import Summary, start_http_server REQUEST_DURATION = Summary('request_duration_seconds', 'Request duration in seconds') @REQUEST_DURATION.time() def my_handler(request): pass # Your code here start_http_server(8000)
  • 18. Exceptional Circumstances In Progress from prometheus_client import Counter, Gauge EXCEPTIONS = Counter('exceptions_total', 'Total exceptions') IN_PROGRESS = Gauge('inprogress_requests', 'In progress') @EXCEPTIONS.count_exceptions() @IN_PROGRESS.track_inprogress() def my_handler(request): pass // Your code here
  • 19. Data and Query Language: Labels Prometheus doesn’t use dotted.strings like metric.kubernetes.dublin. Multi-dimensional labels instead like metric{tech=”kubernetes”,city=”dublin”} Similar to Kubernetes' labels. Can aggregate, cut, and slice along them. Can come from instrumentation, or be added based on the service you are monitoring.
  • 20. Example: Labels from Node Exporter
  • 21. Adding Dimensions (No Evil Twins Please) from prometheus_client import Counter REQUESTS = Counter('requests_total', 'Total requests', ['method']) def my_handler(request): REQUESTS.labels(request.method).inc() pass // Your code here
  • 22. Powerful Query Language Can multiply, add, aggregate, join, predict, take quantiles across many metrics in the same query. Can evaluate right now, and graph back in time. Answer questions like: ● What’s the 95th percentile latency in the European datacenter? ● How full will the disks be in 4 hours? ● Which services are the top 5 users of CPU? Can alert based on any query.
  • 23. Example: Top 5 Docker images by CPU topk(5, sum by (image)( rate(container_cpu_usage_seconds_total{ id=~"/system.slice/docker.*"}[5m] ) ) )
  • 24. Exactly how powerful is the query language? In August 2015 it was demonstrated to be Turing Complete. I did this by implementing Conway’s Life in Prometheus. Don’t try this in production :)
  • 25. Manageable and Reliable Core Prometheus server is a single binary. Doesn’t depend on Zookeeper, Consul, Cassandra, Hadoop or the Internet. Only requires local disk (SSD recommended). No potential for cascading failure. Pull based, so easy to on run a workstation for testing and rogue servers can’t push bad metrics. Advanced service discovery finds what to monitor.
  • 26. Running Prometheus under Kubernetes Docker images at https://hub.docker.com/r/prom/ To run Prometheus: kubectl run prometheus --image=prom/prometheus:latest --port=9090 This monitors itself. With a service around it, you can easily access it.
  • 27. Easy to integrate with Many existing integrations: Java, JMX, Python, Go, Ruby, .Net, Machine, Cloudwatch, EC2, MySQL, PostgreSQL, Haskell, Bash, Node.js, SNMP, Consul, HAProxy, Mesos, Bind, CouchDB, Django, Mtail, Heka, Memcached, RabbitMQ, Redis, RethinkDB, Rsyslog, Meteor.js, Minecraft... Graphite, Statsd, Collectd, Scollector, Munin, Nagios integrations aid transition. It’s so easy, most of the above were written without the core team even knowing about them!
  • 28. Kubernetes natively supports Prometheus If you go to the Kubernetes apiserver, the /metrics endpoint has Prometheus metrics Example: http://localhost:8080/metrics
  • 29. Cadvisor natively supports Prometheus http://localhost:4194/metrics will have metrics for all your containers If you’re not already running it, you can do: docker run -v /var/run/:/var/run -v /sys:/sys -p 4194:8080 google/cadvisor Or use Kubernetes DaemonSets to run it everywhere.
  • 30. Service Discovery It’s vital to know where each of your instances is meant to be running. Systems such as Kubernetes spread applications across machines, so we need some form of service discovery to find them. Prometheus supports Kubernetes as a discovery mechanism
  • 31. Prometheus configuration to monitor kubelets scrape_configs: - job_name: 'kubelet' kubernetes_sd_configs: - api_servers: 'http://127.0.0.1:8500' relabel_configs: - source_labels: [__meta_kubernetes_role] action: keep regex: node
  • 32. Prometheus configuration to monitor nodes scrape_configs: - job_name: 'node' kubernetes_sd_configs: - api_servers: 'http://127.0.0.1:8500' relabel_configs: - source_labels: [__meta_kubernetes_role] action: keep regex: node - source_labels: [__address__] regex: (.*):.* target_label: __address__ replacement: ${1}:9100
  • 33. Discovering Services Can also work via service endpoints. Can use annotations and labels of the service to select and set labels. How this works in terms of containers/pods/services/endpoints is currently under active discussion, and how it works will likely change in the next week or two.
  • 34. Efficient Instrumenting everything means a lot of data. Prometheus is best in class for lossless storage efficiency, 4.5 bytes per datapoint. A single server can handle: ● millions of metrics ● hundreds of thousands of datapoints per second
  • 35. Scalable Prometheus is easy to run, can give one to each team in each datacenter. Federation allows pulling key metrics from other Prometheus servers. When one job is too big for a single Prometheus server, can use sharding+federation to scale out. Needed with thousands of machines.
  • 37. What does Prometheus offer? ● Inclusive Monitoring ● Powerful data model ● Powerful query language ● Manageable and Reliable ● Efficient ● Scalable ● Easy to integrate with ● Dashboards
  • 38. What do we do? Robust Perception provides consulting and training to give you confidence in your production service's ability to run efficiently and scale with your business. We can help you: ● Decide if Prometheus is for you ● Manage your transition to Prometheus and resolve issues that arise ● With capacity planning, debugging, infrastructure etc. We are proud to be among the core contributors to the Prometheus project.
  • 39. Resources Official Project Website: prometheus.io Official Mailing List: prometheus-developers@googlegroups.com Demo: demo.robustperception.io Robust Perception Website: www.robustperception.io Queries: prometheus@robustperception.io