Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | InfluxData
The Suspects
● Prometheus
● Kubernetes
● Gateway
● Queryd
Prometheus
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
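For context, the scrape URL above implies a Kubernetes Service named gateway in the twodotoh namespace exposing port 9999. A minimal sketch of such a Service; the selector and everything not in the URL are assumptions, not from the slides:

# Hypothetical Service matching the scrape URL; details inferred, not from the slides.
apiVersion: v1
kind: Service
metadata:
  name: gateway            # -> gateway.twodotoh.svc.cluster.local
  namespace: twodotoh
spec:
  selector:
    app: gateway           # assumed pod label
  ports:
    - name: http
      port: 9999           # -> :9999/metrics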
InfluxCloud
[Architecture diagram: Ingress routing to multiple Gateway pods, which fan out to Queryd pods]
Problem: Prometheus Debugging is Hard
Even Prometheus's own self-monitoring output is hard to interpret: in the scrape below, every quantile of the summary reports the same value because only one observation (count = 1) has been made.
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.01"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.05"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.5"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.9"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.99"} 0.012562015
prometheus_target_sync_length_seconds_sum{scrape_job="prod_twodotoh"} 0.012562015
prometheus_target_sync_length_seconds_count{scrape_job="prod_twodotoh"} 1
Problem: Prometheus Scaling is Hard
Sharding scrape jobs by namespace means hand-maintaining a near-identical config block for every namespace:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prod_twodotoh_ns_a
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - a

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prod_twodotoh_ns_b
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - b
Solution: Isolation with Telegraf Sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "gateway"
  labels:
    app: "gateway"
spec:
  replicas: 100
  selector:
    matchLabels:
      app: "gateway"
  template:
    metadata:
      name: "gateway"
      labels:
        app: "gateway"
    spec:
      containers:
        # Telegraf runs as a sidecar next to the container it scrapes
        - name: "telegraf"
          image: "docker.io/library/telegraf:1.12"
        - name: "gateway"
          image: "quay.io/influxdb/gateway:latest"
The sidecar's telegraf.conf:

[[inputs.internal]]

[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]

[[outputs.influxdb]]
  urls = ["$MONITOR_HOST"]
  database = "$MONITOR_DATABASE"
  timeout = "5s"

[[outputs.influxdb_v2]]
  urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"]
  token = "$TOKEN"
  organization = "$ORG"
  bucket = "$BUCKET"
  timeout = "5s"
  namepass = ["internal"]
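The slides don't show how telegraf.conf reaches the sidecar; one minimal sketch, assuming the config lives in a ConfigMap named telegraf-config (the name and wiring are assumptions), extends the pod spec above:

# Hypothetical wiring, not from the original slides:
# mount the ConfigMap so the sidecar finds /etc/telegraf/telegraf.conf.
      containers:
        - name: "telegraf"
          image: "docker.io/library/telegraf:1.12"
          volumeMounts:
            - name: telegraf-config
              mountPath: /etc/telegraf
      volumes:
        - name: telegraf-config
          configMap:
            name: telegraf-config   # assumed ConfigMap holding telegraf.conf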
Problem: Prometheus Has One and Only One Value
Each sample is a single float; extra context such as user_agent can only live in a label, and high-cardinality labels end up getting dropped.
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
    metric_relabel_configs:
      - regex: user_agent
        action: labeldrop
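Concretely, putting user_agent in a label mints a new series for every distinct agent string; the metric name and values below are hypothetical:

# Hypothetical exposition: one series per distinct user agent.
http_requests_total{handler="/api/v2/write",user_agent="Telegraf/1.12"} 1027
http_requests_total{handler="/api/v2/write",user_agent="curl/7.58.0"} 3
http_requests_total{handler="/api/v2/write",user_agent="Mozilla/5.0"} 1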
Solution: Influx for more context
http://gateway.twodotoh.svc.cluster.local:9999/metrics
[[inputs.internal]]

[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]

[[processors.converter]]
  ## Convert the high-cardinality user_agent tag into a string field
  [processors.converter.tags]
    string = ["user_agent"]

[[outputs.influxdb]]
  urls = ["$MONITOR_HOST"]
  database = "$MONITOR_DATABASE"
  timeout = "5s"

[[outputs.influxdb_v2]]
  urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"]
  token = "$TOKEN"
  organization = "$ORG"
  bucket = "$BUCKET"
  timeout = "5s"
  namepass = ["internal"]
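The effect, shown in hypothetical line protocol (metric and field names are illustrative): user_agent moves from the tag set to the field set, so it no longer creates a new series per agent:

# Before the converter: user_agent is a tag (part of the series key)
http_requests_total,handler=/api/v2/write,user_agent=Telegraf/1.12 counter=1027
# After the converter: user_agent is a string field on a single series
http_requests_total,handler=/api/v2/write counter=1027,user_agent="Telegraf/1.12"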
Problem: Is there a way to prevent this in the first place?
Dropping labels after the fact is reactive; the centralized config has to grow a new rule every time a service ships a bad label.
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
    metric_relabel_configs:
      - regex: user_agent
        action: labeldrop
Solution: Telegraf Guard Rails
http://gateway.twodotoh.svc.cluster.local:9999/metrics
[[inputs.internal]]

[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]

[[processors.tag_limit]]
  ## Maximum number of tags to preserve on each point
  limit = 4
  ## List of tags to preferentially preserve
  keep = ["handler", "method", "status"]

[[outputs.influxdb]]
  urls = ["$MONITOR_HOST"]
  database = "$MONITOR_DATABASE"
  timeout = "5s"

[[outputs.influxdb_v2]]
  urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"]
  token = "$TOKEN"
  organization = "$ORG"
  bucket = "$BUCKET"
  timeout = "5s"
  namepass = ["internal"]
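Illustratively (metric and tag names hypothetical), a point arriving with five tags gets trimmed back to the limit, with the keep list surviving preferentially:

# Before tag_limit: five tags on the point
http_requests_total,handler=/write,method=POST,status=204,user_agent=curl,region=us-west counter=1
# After tag_limit (limit = 4): handler, method, status are kept preferentially;
# which of the remaining tags survives is left to the processor
http_requests_total,handler=/write,method=POST,status=204,region=us-west counter=1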
Problem: Hard to Rotate Prometheus Credentials
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
    bearer_token_file: /etc/hunter2
Solution: Per-Pod Credentials
http://gateway.twodotoh.svc.cluster.local:9999/metrics
[[inputs.internal]]

[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]
  ## Path to a file containing the bearer token for this pod's scrape
  bearer_token = "/etc/telegraf/hunter2"
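Delivering /etc/telegraf/hunter2 isn't shown on the slide; a minimal sketch, assuming a per-pod Secret named gateway-scrape-token with a hunter2 key (all names illustrative), extends the pod spec:

# Hypothetical per-pod credential mount, not from the original slides:
      containers:
        - name: "telegraf"
          image: "docker.io/library/telegraf:1.12"
          volumeMounts:
            - name: scrape-token
              mountPath: /etc/telegraf/hunter2   # the path bearer_token points at
              subPath: hunter2
      volumes:
        - name: scrape-token
          secret:
            secretName: gateway-scrape-token   # assumed per-pod credential Secret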
Lessons
Scaling is NOT More Manual Processes
Scaling is NOT saying “You’re Doing it Wrong”
Scaling IS Empowering Developers
Scaling IS Predictability of Failure Modes
Problem: Am I scraping all the pods?
http://gateway.twodotoh.svc.cluster.local:9999/metrics
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
Solution: Telegraf K8s Inventory
[[inputs.internal]]

[[inputs.kube_inventory]]
  url = "http://1.1.1.1:10255"

[[outputs.influxdb]]
  urls = ["$MONITOR_HOST"]
  database = "$MONITOR_DATABASE"
  timeout = "5s"

[[outputs.influxdb_v2]]
  urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"]
  token = "$TOKEN"
  organization = "$ORG"
  bucket = "$BUCKET"
  timeout = "5s"
  namepass = ["internal"]
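With inventory data stored next to the scraped metrics, the question becomes answerable with a query. A hypothetical Flux sketch, assuming both plugins write to a bucket named telegraf:

// Hypothetical check, not from the slides: count the pods Kubernetes
// reports in the namespace, then compare against the pods actually
// emitting scraped metrics over the same window.
from(bucket: "telegraf")
    |> range(start: -15m)
    |> filter(fn: (r) => r._measurement == "kubernetes_pod_container")
    |> filter(fn: (r) => r.namespace == "twodotoh")
    |> group()
    |> distinct(column: "pod_name")
    |> count()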
Scaling even more with Influx Enterprise
[Architecture diagram: a load balancer fronting the InfluxDB Enterprise cluster]
Scaling even more with Kafka and Influx Enterprise
[Architecture diagram: Kafka buffering writes in front of the InfluxDB Enterprise cluster]
Core Idea
● Measure and test metrics scaling
○ Are you missing metrics?
● Decentralize metrics gathering
○ Consider metrics as part of the program
● Empower Developers
○ They know their metrics best; give them control of the local tooling
First Order Conclusion
● Too easy to shoot yourself in the foot with Prometheus metrics.
● Too much in Prometheus requires operations heroes.
● Too difficult to express vital information about your program in Prometheus without a ton of centralized control.
● One mistake can impact everyone.
Second Order Conclusion
● Prometheus is not descriptive enough.
● Extremely difficult to change over time.
● The metrics game is not a solved problem.
○ OpenTelemetry?
○ SNMP?
● Probably not one answer to everything.
Future
● Flux into Telegraf
○ Processor for transformation
○ Moving the program near the data
○ Flux Output
○ Monitoring and alerting at edge
● Telegraf Flux scripts hosted in InfluxDB API
○ Runtime plugins without re-compiling
○ Sampling rules from server-side
■ Aggregation on server with input to client
● What else?
The time when collecting metrics impacted storage...
Measure, measure, measure
Problem: Prometheus metrics are heavyweight