SlideShare une entreprise Scribd logo
1  sur  33
Lessons from Cloud
Scaling Prometheus
metrics in
Kubernetes with
Telegraf
The curious case of the missing metrics
One Label too far...
© 2019 InfluxData. All rights reserved. 3
The Suspects
● Prometheus
● Kubernetes
● Gateway
● Queryd
© 2019 InfluxData. All rights reserved. 4
Prometheus
http://gateway.twodotoh.svc.cluster.local:9999/metrics
© 2019 InfluxData. All rights reserved. 5
Prometheus
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
scrape_interval: 15s
scrape_configs:
- job_name: prod_twodotoh
kubernetes_sd_configs:
- role: service
© 2019 InfluxData. All rights reserved. 6
Kubernetes
© 2019 InfluxData. All rights reserved. 7
InfluxCloud
Gateway Gateway
Queryd
Gateway
Queryd Queryd
Ingress
© 2019 InfluxData. All rights reserved. 8
Problem: Prometheus Debugging is Hard
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.01"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.05"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.5"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.9"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.99"} 0.012562015
prometheus_target_sync_length_seconds_sum{scrape_job="prod_twodotoh"} 0.012562015
prometheus_target_sync_length_seconds_count{scrape_job="prod_twodotoh"} 1
© 2019 InfluxData. All rights reserved. 9
Problem: Prometheus Scaling is Hard
global:
scrape_interval: 15s
scrape_configs:
- job_name: prod_twodotoh_ns_a
kubernetes_sd_configs:
- role: service
namespaces:
names:
- a
global:
scrape_interval: 15s
scrape_configs:
- job_name: prod_twodotoh_ns_a
kubernetes_sd_configs:
- role: service
namespaces:
names:
- b
© 2019 InfluxData. All rights reserved. 10
Solution: Isolatation with Telegraf Sidecar
© 2019 InfluxData. All rights reserved. 11
Solution: Isolation with Telegraf Sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
name: "gateway"
labels:
spec:
serviceName: "gateway"
replicas: 100
template:
metadata:
name: "gateway"
labels:
app: "gateway"
spec:
containers:
- name: "telegraf"
image: "docker.io/library/telegraf:1.12"
- name: "gateway"
image: "quay.io/influxdb/gateway:latest"
[[inputs.internal]]
[[inputs.prometheus]]
urls = ["http://127.0.0.1:9999/metrics"]
[[outputs.influxdb]]
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
[[outputs.influxdb_v2]]
urls=["http://us-west-2-1.aws.cloud2.influxdata.c
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal"]
© 2019 InfluxData. All rights reserved. 12
Solution: Isolatation with Telegraf Sidecar
© 2019 InfluxData. All rights reserved. 13
Problem: Prom has 1 and only 1 value
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
scrape_interval: 15s
scrape_configs:
- job_name: prod_twodotoh
kubernetes_sd_configs:
- role: service
metric_relabel_configs:
- regex: user_agent
action: labeldrop
© 2019 InfluxData. All rights reserved. 14
Solution: Influx for more context
http://gateway.twodotoh.svc.cluster.local:9999/metrics
[[inputs.internal]]
[[inputs.prometheus]]
urls = ["http://127.0.0.1:9999/metrics"]
[[processors.converter]]
[processors.converter.tags]
string = ["user_agent"]
[[outputs.influxdb]]
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
[[outputs.influxdb_v2]]
urls=["http://us-west-2-1.aws.cloud2.influxdata.com"]
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal"]
© 2019 InfluxData. All rights reserved. 15
Problem: Is there a way to prevent?
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
scrape_interval: 15s
scrape_configs:
- job_name: prod_twodotoh
kubernetes_sd_configs:
- role: service
metric_relabel_configs:
- regex: user_agent
action: labeldrop
© 2019 InfluxData. All rights reserved. 16
Solution: Telegraf Guard Rails
http://gateway.twodotoh.svc.cluster.local:9999/metrics
[[inputs.internal]]
[[inputs.prometheus]]
urls = ["http://127.0.0.1:9999/metrics"]
[[processors.tag_limit]]
limit = 4
## List of tags to preferentially preserve
keep = ["handler", "method", "status"]
[[outputs.influxdb]]
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
[[outputs.influxdb_v2]]
urls=["http://us-west-2-1.aws.cloud2.influxdata.com"]
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal"]
© 2019 InfluxData. All rights reserved. 17
Problem: Hard to Rotate Prom Passwords
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
scrape_interval: 15s
scrape_configs:
- job_name: prod_twodotoh
kubernetes_sd_configs:
- role: service
bearer_token_file: /etc/hunter2
© 2019 InfluxData. All rights reserved. 18
Solution: Per Pod Credentials
http://gateway.twodotoh.svc.cluster.local:9999/metrics
[[inputs.internal]]
[[inputs.prometheus]]
urls = ["http://127.0.0.1:9999/metrics"]
bearer_token = "/etc/telegraf/hunter2"
© 2019 InfluxData. All rights reserved. 19
Lessons
Scaling is NOT More Manual Processes
Scaling is NOT saying “You’re Doing it Wrong”
Scaling IS Empowering Developers
Scaling IS Predictability of Failure Modes
The time when we were
Watching the watchers...
© 2019 InfluxData. All rights reserved. 21
Problem: Am I scraping all the pods?
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
scrape_interval: 15s
scrape_configs:
- job_name: prod_twodotoh
kubernetes_sd_configs:
- role: service
© 2019 InfluxData. All rights reserved. 22
Solution: Telegraf K8s Inventory
[[inputs.internal]]
[[inputs.kube_inventory]]
url = "http://1.1.1.1:10255"
[[outputs.influxdb]]
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
[[outputs.influxdb_v2]]
urls=["http://us-west-2-1.aws.cloud2.influxdata.com"]
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal"]
Prometheus Scraping Designs
© 2019 InfluxData. All rights reserved. 24
Scaling even more
© 2019 InfluxData. All rights reserved. 25
Scaling even more with Influx Enterprise
Load
Balancer
© 2019 InfluxData. All rights reserved. 26
Scaling even more with Kafka and Influx
Enterprise
Kafka
© 2019 InfluxData. All rights reserved. 27
Core Idea
● Measure and test metrics scaling
○ Are you missing metrics?
● Decentralize metrics gathering
○ Consider metrics as part of the program
● Empower Developers
○ They know their metrics the best. Allow them local tooling control
© 2019 InfluxData. All rights reserved. 28
First Order Conclusion
● Too easy to shoot yourself in the foot with prometheus metrics.
● Too much in prometheus needs operation heroes.
● Too difficult to express vital information in prometheus about your
program without a ton of centralized control.
● One mistake can impact everyone.
© 2019 InfluxData. All rights reserved. 29
Second Order Conclusion
● Prometheus is not descriptive enough.
● Extremely difficult to change over time.
● The metrics game is not a solved problem.
○ Opentelemetry?
○ SNMP?
● Probably not one answer to everything.
© 2019 InfluxData. All rights reserved. 30
Future
● Flux into Telegraf
○ Processor for transformation
○ Moving the program near the data
○ Flux Output
○ Monitoring and alerting at edge
● Telegraf Flux scripts hosted in InfluxDB API
○ Runtime plugins without re-compiling
○ Sampling rules from server-side
■ Aggregation on server with input to client
● What else?
© 2019 InfluxData. All rights reserved. 31
Thank You!
The time when collecting metrics impacted storage...
Measure, measure, measure
© 2019 InfluxData. All rights reserved. 33
Problem: Prometheus metrics are heavy
weight

Contenu connexe

Tendances

High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
ScyllaDB
 

Tendances (20)

macvlan and ipvlan
macvlan and ipvlanmacvlan and ipvlan
macvlan and ipvlan
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Comparison of SRv6 Extensions uSID, SRv6+, C-SRH
Comparison of SRv6 Extensions uSID, SRv6+, C-SRHComparison of SRv6 Extensions uSID, SRv6+, C-SRH
Comparison of SRv6 Extensions uSID, SRv6+, C-SRH
 
MongoDB on EC2 and EBS
MongoDB on EC2 and EBSMongoDB on EC2 and EBS
MongoDB on EC2 and EBS
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David Anderson
 
MySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELKMySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELK
 
Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDP
 
Zookeeper vs Raft: Stateful distributed coordination with HA and Fault Tolerance
Zookeeper vs Raft: Stateful distributed coordination with HA and Fault ToleranceZookeeper vs Raft: Stateful distributed coordination with HA and Fault Tolerance
Zookeeper vs Raft: Stateful distributed coordination with HA and Fault Tolerance
 
Cilium - Network security for microservices
Cilium - Network security for microservicesCilium - Network security for microservices
Cilium - Network security for microservices
 
Overview of kubernetes network functions
Overview of kubernetes network functionsOverview of kubernetes network functions
Overview of kubernetes network functions
 
A whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizerA whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizer
 
Deep dive into Kubernetes Networking
Deep dive into Kubernetes NetworkingDeep dive into Kubernetes Networking
Deep dive into Kubernetes Networking
 
IPTABLES Introduction
IPTABLES IntroductionIPTABLES Introduction
IPTABLES Introduction
 
eBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux KerneleBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux Kernel
 
Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
KubeCon 2022 EU Flux Security.pdf
KubeCon 2022 EU Flux Security.pdfKubeCon 2022 EU Flux Security.pdf
KubeCon 2022 EU Flux Security.pdf
 
ClickHouse on Kubernetes, by Alexander Zaitsev, Altinity CTO
ClickHouse on Kubernetes, by Alexander Zaitsev, Altinity CTOClickHouse on Kubernetes, by Alexander Zaitsev, Altinity CTO
ClickHouse on Kubernetes, by Alexander Zaitsev, Altinity CTO
 
eBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to UserspaceeBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to Userspace
 

Similaire à Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | InfluxData

Similaire à Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | InfluxData (20)

The rise of microservices
The rise of microservicesThe rise of microservices
The rise of microservices
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wild
 
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
 
A hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackA hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stack
 
InfluxDB Live Product Training
InfluxDB Live Product TrainingInfluxDB Live Product Training
InfluxDB Live Product Training
 
PRO TALK - Kubernetes Security Workshop.pdf
PRO TALK - Kubernetes Security Workshop.pdfPRO TALK - Kubernetes Security Workshop.pdf
PRO TALK - Kubernetes Security Workshop.pdf
 
Kubernetes Security Workshop
Kubernetes Security WorkshopKubernetes Security Workshop
Kubernetes Security Workshop
 
Taming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using TelegrafTaming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using Telegraf
 
Introduction to PaaS and Heroku
Introduction to PaaS and HerokuIntroduction to PaaS and Heroku
Introduction to PaaS and Heroku
 
P4 Introduction
P4 Introduction P4 Introduction
P4 Introduction
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
RTBkit Meetup - Developer Spotlight, Behind the Scenes of RTBkit and Intro to...
RTBkit Meetup - Developer Spotlight, Behind the Scenes of RTBkit and Intro to...RTBkit Meetup - Developer Spotlight, Behind the Scenes of RTBkit and Intro to...
RTBkit Meetup - Developer Spotlight, Behind the Scenes of RTBkit and Intro to...
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps Workshop
 
Running your Spring Apps in the Cloud Javaone 2014
Running your Spring Apps in the Cloud Javaone 2014Running your Spring Apps in the Cloud Javaone 2014
Running your Spring Apps in the Cloud Javaone 2014
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
 
Industrial IoT bootcamp
Industrial IoT bootcampIndustrial IoT bootcamp
Industrial IoT bootcamp
 
OSDC 2019 | Introducing Kudo – Kubernetes Operators the easy way by Matt Jarvis
OSDC 2019 | Introducing Kudo – Kubernetes Operators the easy way by Matt JarvisOSDC 2019 | Introducing Kudo – Kubernetes Operators the easy way by Matt Jarvis
OSDC 2019 | Introducing Kudo – Kubernetes Operators the easy way by Matt Jarvis
 
Dockerize a Django app elegantly
Dockerize a Django app elegantlyDockerize a Django app elegantly
Dockerize a Django app elegantly
 

Plus de InfluxData

How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
InfluxData
 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
InfluxData
 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
InfluxData
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
InfluxData
 

Plus de InfluxData (20)

Announcing InfluxDB Clustered
Announcing InfluxDB ClusteredAnnouncing InfluxDB Clustered
Announcing InfluxDB Clustered
 
Best Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow EcosystemBest Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow Ecosystem
 
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
 
Power Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDBPower Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDB
 
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
 
Build an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING StackBuild an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING Stack
 
Meet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using RustMeet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using Rust
 
Introducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud DedicatedIntroducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud Dedicated
 
Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB
 
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
 
Introducing InfluxDB’s New Time Series Database Storage Engine
Introducing InfluxDB’s New Time Series Database Storage EngineIntroducing InfluxDB’s New Time Series Database Storage Engine
Introducing InfluxDB’s New Time Series Database Storage Engine
 
Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena
 
Understanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage EngineUnderstanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage Engine
 
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDBStreamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
 
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
 
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
 

Dernier

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 

Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | InfluxData

  • 1. Lessons from Cloud Scaling Prometheus metrics in Kubernetes with Telegraf
  • 2. The curious case of the missing metrics One Label too far...
  • 3. © 2019 InfluxData. All rights reserved. 3 The Suspects ● Prometheus ● Kubernetes ● Gateway ● Queryd
  • 4. © 2019 InfluxData. All rights reserved. 4 Prometheus http://gateway.twodotoh.svc.cluster.local:9999/metrics
  • 5. © 2019 InfluxData. All rights reserved. 5 Prometheus http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service
  • 6. © 2019 InfluxData. All rights reserved. 6 Kubernetes
  • 7. © 2019 InfluxData. All rights reserved. 7 InfluxCloud Gateway Gateway Queryd Gateway Queryd Queryd Ingress
  • 8. © 2019 InfluxData. All rights reserved. 8 Problem: Prometheus Debugging is Hard prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.01"} 0.012562015 prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.05"} 0.012562015 prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.5"} 0.012562015 prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.9"} 0.012562015 prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.99"} 0.012562015 prometheus_target_sync_length_seconds_sum{scrape_job="prod_twodotoh"} 0.012562015 prometheus_target_sync_length_seconds_count{scrape_job="prod_twodotoh"} 1
  • 9. © 2019 InfluxData. All rights reserved. 9 Problem: Prometheus Scaling is Hard global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh_ns_a kubernetes_sd_configs: - role: service namespaces: names: - a global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh_ns_a kubernetes_sd_configs: - role: service namespaces: names: - b
  • 10. © 2019 InfluxData. All rights reserved. 10 Solution: Isolatation with Telegraf Sidecar
  • 11. © 2019 InfluxData. All rights reserved. 11 Solution: Isolation with Telegraf Sidecar apiVersion: apps/v1 kind: Deployment metadata: name: "gateway" labels: spec: serviceName: "gateway" replicas: 100 template: metadata: name: "gateway" labels: app: "gateway" spec: containers: - name: "telegraf" image: "docker.io/library/telegraf:1.12" - name: "gateway" image: "quay.io/influxdb/gateway:latest" [[inputs.internal]] [[inputs.prometheus]] urls = ["http://127.0.0.1:9999/metrics"] [[outputs.influxdb]] urls = ["$MONITOR_HOST"] database = "$MONITOR_DATABASE" timeout = "5s" [[outputs.influxdb_v2]] urls=["http://us-west-2-1.aws.cloud2.influxdata.c token = "$TOKEN" organization = "$ORG" bucket = "$BUCKET" timeout = "5s" namepass = ["internal"]
  • 12. © 2019 InfluxData. All rights reserved. 12 Solution: Isolatation with Telegraf Sidecar
  • 13. © 2019 InfluxData. All rights reserved. 13 Problem: Prom has 1 and only 1 value http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service metric_relabel_configs: - regex: user_agent action: labeldrop
  • 14. © 2019 InfluxData. All rights reserved. 14 Solution: Influx for more context http://gateway.twodotoh.svc.cluster.local:9999/metrics [[inputs.internal]] [[inputs.prometheus]] urls = ["http://127.0.0.1:9999/metrics"] [[processors.converter]] [processors.converter.tags] string = ["user_agent"] [[outputs.influxdb]] urls = ["$MONITOR_HOST"] database = "$MONITOR_DATABASE" timeout = "5s" [[outputs.influxdb_v2]] urls=["http://us-west-2-1.aws.cloud2.influxdata.com"] token = "$TOKEN" organization = "$ORG" bucket = "$BUCKET" timeout = "5s" namepass = ["internal"]
  • 15. © 2019 InfluxData. All rights reserved. 15 Problem: Is there a way to prevent? http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service metric_relabel_configs: - regex: user_agent action: labeldrop
  • 16. © 2019 InfluxData. All rights reserved. 16 Solution: Telegraf Guard Rails http://gateway.twodotoh.svc.cluster.local:9999/metrics [[inputs.internal]] [[inputs.prometheus]] urls = ["http://127.0.0.1:9999/metrics"] [[processors.tag_limit]] limit = 4 ## List of tags to preferentially preserve keep = ["handler", "method", "status"] [[outputs.influxdb]] urls = ["$MONITOR_HOST"] database = "$MONITOR_DATABASE" timeout = "5s" [[outputs.influxdb_v2]] urls=["http://us-west-2-1.aws.cloud2.influxdata.com"] token = "$TOKEN" organization = "$ORG" bucket = "$BUCKET" timeout = "5s" namepass = ["internal"]
  • 17. © 2019 InfluxData. All rights reserved. 17 Problem: Hard to Rotate Prom Passwords http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service bearer_token_file: /etc/hunter2
  • 18. © 2019 InfluxData. All rights reserved. 18 Solution: Per Pod Credentials http://gateway.twodotoh.svc.cluster.local:9999/metrics [[inputs.internal]] [[inputs.prometheus]] urls = ["http://127.0.0.1:9999/metrics"] bearer_token = "/etc/telegraf/hunter2"
  • 19. © 2019 InfluxData. All rights reserved. 19 Lessons Scaling is NOT More Manual Processes Scaling is NOT saying “You’re Doing it Wrong” Scaling IS Empowering Developers Scaling IS Predictability of Failure Modes
  • 20. The time when we were Watching the watchers...
  • 21. © 2019 InfluxData. All rights reserved. 21 Problem: Am I scraping all the pods? http://gateway.twodotoh.svc.cluster.local:9999/metrics global: scrape_interval: 15s scrape_configs: - job_name: prod_twodotoh kubernetes_sd_configs: - role: service
  • 22. © 2019 InfluxData. All rights reserved. 22 Solution: Telegraf K8s Inventory [[inputs.internal]] [[inputs.kube_inventory]] url = "http://1.1.1.1:10255" [[outputs.influxdb]] urls = ["$MONITOR_HOST"] database = "$MONITOR_DATABASE" timeout = "5s" [[outputs.influxdb_v2]] urls=["http://us-west-2-1.aws.cloud2.influxdata.com"] token = "$TOKEN" organization = "$ORG" bucket = "$BUCKET" timeout = "5s" namepass = ["internal"]
  • 24. © 2019 InfluxData. All rights reserved. 24 Scaling even more
  • 25. © 2019 InfluxData. All rights reserved. 25 Scaling even more with Influx Enterprise Load Balancer
  • 26. © 2019 InfluxData. All rights reserved. 26 Scaling even more with Kafka and Influx Enterprise Kafka
  • 27. © 2019 InfluxData. All rights reserved. 27 Core Idea ● Measure and test metrics scaling ○ Are you missing metrics? ● Decentralize metrics gathering ○ Consider metrics as part of the program ● Empower Developers ○ They know their metrics the best. Allow them local tooling control
  • 28. © 2019 InfluxData. All rights reserved. 28 First Order Conclusion ● Too easy to shoot yourself in the foot with prometheus metrics. ● Too much in prometheus needs operation heroes. ● Too difficult to express vital information in prometheus about your program without a ton of centralized control. ● One mistake can impact everyone.
  • 29. © 2019 InfluxData. All rights reserved. 29 Second Order Conclusion ● Prometheus is not descriptive enough. ● Extremely difficult to change over time. ● The metrics game is not a solved problem. ○ Opentelemetry? ○ SNMP? ● Probably not one answer to everything.
  • 30. © 2019 InfluxData. All rights reserved. 30 Future ● Flux into Telegraf ○ Processor for transformation ○ Moving the program near the data ○ Flux Output ○ Monitoring and alerting at edge ● Telegraf Flux scripts hosted in InfluxDB API ○ Runtime plugins without re-compiling ○ Sampling rules from server-side ■ Aggregation on server with input to client ● What else?
  • 31. © 2019 InfluxData. All rights reserved. 31 Thank You!
  • 32. The time when collecting metrics impacted storage... Measure, measure, measure
  • 33. © 2019 InfluxData. All rights reserved. 33 Problem: Prometheus metrics are heavy weight