How to reduce expenses on monitoring

How to reduce expenses on monitoring
with VictoriaMetrics
Roman Khavronenko | github.com/hagen1778
Roman Khavronenko
Co-founder of VictoriaMetrics
Software engineer with experience in distributed systems,
monitoring and high-performance services.
https://github.com/hagen1778
https://twitter.com/hagen1778
What this talk is about
1. The best ways to store and process metrics
2. Open source tools only
3. For people familiar with Prometheus,
Thanos, Mimir, VictoriaMetrics
Expenses!
You can either have a faster car…
…or be a smarter driver!
What can you gain from a simple replacement?
Prometheus remote-write benchmark
Prometheus vs VictoriaMetrics benchmark
# the number of nodeexporter instances to scrape
targetsCount: 1000
# how frequently to scrape nodeexporter targets
scrapeInterval: 15s
# rules evaluation interval
# https://awesome-prometheus-alerts.grep.to/rules.html#host-and-hardware-1
queryInterval: 30s
# scrapeConfigUpdatePercent is a churn rate generated once
# per scrapeConfigUpdateInterval
scrapeConfigUpdatePercent: 5
scrapeConfigUpdateInterval: 10m
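A back-of-the-envelope sketch of the load these settings generate. It assumes scrapeConfigUpdatePercent is the share of targets replaced on every scrapeConfigUpdateInterval; the variable names are illustrative and not part of the benchmark.

```python
# Rough load estimate from the benchmark settings above.
# Assumption: scrapeConfigUpdatePercent is the share of targets replaced
# once per scrapeConfigUpdateInterval.
targets_count = 1000
scrape_interval_s = 15
update_percent = 5
update_interval_s = 10 * 60

scrapes_per_second = targets_count / scrape_interval_s
targets_replaced_per_update = targets_count * update_percent // 100
updates_per_day = 24 * 60 * 60 // update_interval_s
targets_replaced_per_day = targets_replaced_per_update * updates_per_day

print(f"scrapes per second: {scrapes_per_second:.1f}")
print(f"targets replaced per update: {targets_replaced_per_update}")
print(f"targets replaced per day: {targets_replaced_per_day}")
```

Every replaced target brings a new set of series, so the churn setting matters as much as the raw scrape rate.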
16x faster! (read latency, 50th percentile)
1.9x faster! (read latency, 99th percentile)
1.7x less memory!
2.5x less disk space!
Summary after 7d benchmark (1k nodeexporter targets)
                         Prometheus           VictoriaMetrics
CPU avg used             0.79 / 3 cores       0.76 / 3 cores
Disk occupied            83.5 GiB             33 GiB
Mem max used             8.12 GiB / 12 GiB    4.5 GiB / 12 GiB
Read latency avg, 50th   70.5ms               4.3ms
Read latency avg, 99th   7s                   3.6s
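The summary figures can be turned into ratios with a few lines of arithmetic. This is only a sketch over the numbers listed above; the dictionary keys are illustrative names.

```python
# Figures copied from the 7d benchmark summary above (latencies in ms).
prometheus = {"disk_gib": 83.5, "mem_max_gib": 8.12, "read_p50_ms": 70.5, "read_p99_ms": 7000.0}
victoriametrics = {"disk_gib": 33.0, "mem_max_gib": 4.5, "read_p50_ms": 4.3, "read_p99_ms": 3600.0}

def improvement(metric):
    """How many times lower the VictoriaMetrics figure is than the Prometheus one."""
    return prometheus[metric] / victoriametrics[metric]

ratios = {name: round(improvement(name), 1) for name in prometheus}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x lower")
```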
Data transfer costs
Network data transfer costs
4.5x less!
Improving network compression
1. Increase compression level, trade CPU for network savings:
a. -remoteWrite.vmProtoCompressLevel
2. Increase batch size, trade latency for compression:
a. -remoteWrite.maxBlockSize
b. -remoteWrite.maxRowsPerBlock
c. -remoteWrite.flushInterval
3. Reduce entropy to improve compression:
a. -remoteWrite.significantFigures
b. -remoteWrite.roundDigits
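Why larger batches compress better can be seen with a small experiment. This sketch uses zlib from the Python standard library as a stand-in compressor (it is not the codec vmagent uses), and the metric and label names are made up.

```python
import zlib

# Synthetic exposition-style lines; the metric and label names are made up.
lines = [
    f'http_requests_total{{instance="host-{i % 50}",path="/api/v{i % 3}"}} {1000 + i * 7}\n'
    for i in range(2000)
]

def compressed_size(batch):
    """Compressed size in bytes of one batch of lines."""
    return len(zlib.compress("".join(batch).encode()))

def total_size(batch_len):
    """Total compressed bytes when the lines are sent in batches of batch_len."""
    return sum(
        compressed_size(lines[start:start + batch_len])
        for start in range(0, len(lines), batch_len)
    )

small_batches = total_size(20)
one_big_batch = total_size(len(lines))
print(f"batches of 20 lines: {small_batches} bytes")
print(f"single batch:        {one_big_batch} bytes")
```

Each small batch has to spell out the repeated label strings again, while a big batch can refer back to them. The price is that samples wait longer before they are sent.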
How to be smarter about data
Keeping only significant figures
instance:cpu_utilization:ratio_avg{instance="foo"} 0.05055757575781
instance:cpu_utilization:ratio_avg{instance="bar"} 0.05058181818236
rules:
  - record: instance:cpu_utilization:ratio_avg
    expr: avg_over_time(instance:node_cpu_utilization:ratio[5m])
Keeping only significant figures
Applying --vm-significant-figures=8 to recording rules:
0.05055757575781 -> 0.050557576
reduced the compressed size from 1.2 bytes to 0.8 bytes per sample
See more at https://medium.com/victoriametrics-how-to-migrate-data-from-prometheus
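A minimal sketch of what keeping 8 significant figures does to a value, and why it helps compression. The round_significant helper is hypothetical and not VictoriaMetrics code; the size comparison compresses decimal text with zlib, which is only a proxy for how a TSDB encodes values, and the series itself is synthetic.

```python
import math
import zlib

def round_significant(value, figures):
    """Round value to the given number of significant decimal figures."""
    return float(f"{value:.{figures}g}")

# The value from the slide above.
print(round_significant(0.05055757575781, 8))

# Synthetic series; the shape of the data is made up for illustration.
values = [0.05 + 0.001 * math.sin(i / 10) for i in range(1000)]

def compressed_size(series):
    """Compressed size in bytes of the series written as decimal text."""
    text = "\n".join(repr(v) for v in series)
    return len(zlib.compress(text.encode()))

full = compressed_size(values)
rounded = compressed_size([round_significant(v, 8) for v in values])
print(f"full precision: {full} bytes, 8 significant figures: {rounded} bytes")
```

The trailing digits of an averaged ratio are mostly noise, so dropping them costs little precision while removing the hardest part to compress.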
Understanding the data - query tracing
VictoriaMetrics supports query tracing for detecting bottlenecks during query processing.
It is similar to EXPLAIN ANALYZE in PostgreSQL!
https://play.victoriametrics.com
Query tracing demo!
If query tracing demo didn't work…
A typical query takes 4s to execute… Why?
Let's check the trace!
91% of the time was spent in vmselect aggregating
9.4k series with 13 million data samples!
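Reading a trace like the one for the 4s query above can be sketched in a few lines. Tracing is requested by adding trace=1 to the query arguments. The tree layout with duration_msec, message and children fields is an assumption here, the sample trace is made up to mirror the slide, and the address is a placeholder. The self-time calculation also assumes child steps do not overlap in time.

```python
from urllib.parse import urlencode

def trace_url(base_url, query):
    """Build an instant-query URL with tracing enabled via trace=1."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": query, "trace": "1"})

def self_time(node):
    """Time spent in a node itself, excluding its children."""
    children = node.get("children") or []
    return node["duration_msec"] - sum(c["duration_msec"] for c in children)

def hottest(node):
    """Return the node with the largest self time in the trace tree."""
    best = node
    for child in node.get("children") or []:
        candidate = hottest(child)
        if self_time(candidate) > self_time(best):
            best = candidate
    return best

# Made-up trace that mirrors the 4s query above; not real output.
sample_trace = {
    "duration_msec": 4000.0,
    "message": "query",
    "children": [
        {"duration_msec": 360.0, "message": "fetch series from storage", "children": []},
        {"duration_msec": 3640.0, "message": "aggregate series", "children": []},
    ],
}

slowest = hottest(sample_trace)
share = self_time(slowest) / sample_trace["duration_msec"]
# Placeholder address and query; adjust to your own setup.
url = trace_url("http://localhost:8428", "sum(rate(http_requests_total[5m]))")
print(url)
print(f"{share:.0%} of the time: {slowest['message']}")
```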
How to improve query speed?
1. Add more resources to monitoring.
2. Or… be smarter about data!
Cardinality explorer demo!
https://play.victoriametrics.com
If cardinality explorer demo didn't work…
Cardinality explorer: summary
VictoriaMetrics lets you explore time series cardinality to identify:
● Metric names with the highest number of series
● Labels with the highest number of series
● Values with the highest number of series for the selected label
● label=value pairs with the highest number of series
● Labels with the highest number of unique values
➔ Built into VictoriaMetrics components
➔ Can also be pointed at a Prometheus URL
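The same kind of ranking can be sketched outside the UI from the TSDB stats endpoint (/api/v1/status/tsdb). The response below is made up, and the exact field names (totalSeries, headStats.numSeries, seriesCountByMetricName) are assumptions about the response shape; check them against your version before relying on them.

```python
import json

# Made-up response in the shape of /api/v1/status/tsdb; not real output.
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "totalSeries": 1000,
    "seriesCountByMetricName": [
      {"name": "up", "value": 100},
      {"name": "grpc_server_handled_total", "value": 600},
      {"name": "node_cpu_seconds_total", "value": 300}
    ]
  }
}
""")

def top_metric_names(response, limit=3):
    """Return (name, series, share of all series) for the heaviest metric names."""
    data = response["data"]
    # Assumption: the total is reported as totalSeries or as headStats.numSeries.
    total = data.get("totalSeries") or data.get("headStats", {}).get("numSeries")
    entries = sorted(data["seriesCountByMetricName"], key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"], e["value"] / total) for e in entries[:limit]]

top = top_metric_names(sample_response)
for name, series, share in top:
    print(f"{name}: {series} series ({share:.0%})")
```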
Streaming aggregation vs Recording rules
The number of time series stored in TSDB
is Data-in + Recording Rules results
Streaming aggregation vs Recording rules
The number of time series stored in TSDB
is only what needs to be persisted
How to use streaming aggregation
- match: "grpc_server_handled_total" # time series selector
  interval: "2m"                     # on 2m interval
  outputs: ["total"]                 # aggregate as counter
  without: ["grpc_method"]           # group without label
Result:
grpc_server_handled_total:2m_without_grpc_method_total
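A simplified sketch of what this rule does to the series count. It is not VictoriaMetrics code: the helpers are hypothetical, the input increases are made up, and the real total output tracks counters over time rather than summing ready-made increases. The naming helper only covers the simple case shown above.

```python
from collections import defaultdict

def output_name(metric, interval, without, output):
    """Name of the aggregated series for a rule with a 'without' list."""
    suffix = "_without_" + "_".join(without) if without else ""
    return f"{metric}:{interval}{suffix}_{output}"

def aggregate_total(samples, without):
    """Sum per-series increases for one interval, grouping without the given labels."""
    totals = defaultdict(float)
    for labels, increase in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in without))
        totals[key] += increase
    return dict(totals)

# Made-up increases of four input series during one 2m interval.
samples = [
    ({"grpc_method": "Get", "grpc_code": "OK"}, 120.0),
    ({"grpc_method": "List", "grpc_code": "OK"}, 30.0),
    ({"grpc_method": "Get", "grpc_code": "Internal"}, 2.0),
    ({"grpc_method": "List", "grpc_code": "Internal"}, 1.0),
]

name = output_name("grpc_server_handled_total", "2m", ["grpc_method"], "total")
aggregated = aggregate_total(samples, without={"grpc_method"})
print(name)
print(f"{len(samples)} input series -> {len(aggregated)} stored series")
```

With a recording rule, all four input series would be stored alongside the results. Here only the aggregated series need to reach storage.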
How to use streaming aggregation
https://play.victoriametrics.com
Streaming aggregation: summary
1. Aggregates incoming samples in streaming mode before data is written to remote storage
2. Applies to all metrics received via any supported data ingestion protocol and/or scraped from Prometheus-compatible targets
3. An alternative to statsd
4. An alternative to recording rules
5. Reduces the number of stored samples
6. Reduces the number of stored series
7. Compatible with tools that support the Prometheus remote write protocol
Complexity penalty
Cortex architecture
Mimir architecture
VictoriaMetrics architecture
Complexity penalty
● Complex systems are harder to maintain
● Complex systems are harder to teach and learn
● Complex systems are more expensive to scale
Additional materials
1. Snapshot of Grafana dashboard from the benchmark
2. Benchmark repo for reproducing the test
3. Save network costs with VictoriaMetrics remote write protocol
4. VictoriaMetrics: achieving better compression than Gorilla for time series data
5. Streaming aggregation
6. VictoriaMetrics playground
Questions?
● https://github.com/VictoriaMetrics
● https://github.com/hagen1778