Delivering a database service is not a simple job, and to be sure that everything is working correctly, your platform needs to be observable. In this talk, I’ll explain how we make our MySQL/MariaDB databases observable. We’ll cover the RED and USE methods and the golden signals. You’ll discover how we dealt with statements such as “We think the database is slow”. This talk will help you make your databases observable with open source solutions.
2. About me
● Senior Site Reliability Engineer at Criteo
● Working on monitoring topics for a few years
● Currently providing the (open source) database service at Criteo
● Previously worked on the observability stack at Criteo
● @Charles_JUDITH on Twitter
12. Why does a system need to be observable?
● Is it working as expected by the users?
● To answer basic questions about your service/platform
● Increase the visibility for you and your users/customers
● Long-term trend analysis
● “If you can’t measure it, you can’t manage it”
19. USE method
● USE was introduced by @brendangregg
● Utilization: disk, CPU usage …
● Saturation: disk I/O
● Errors: network interface errors
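
As a rough illustration of the three USE letters on a Linux host, the sketch below reads disk utilization from the standard library and sums the per-interface error counters found in `/proc/net/dev`. The function names are hypothetical, and the 1-minute load average per CPU is only an assumed stand-in for saturation; it is not the disk I/O saturation named on the slide.

```python
import os
import shutil


def disk_utilization(path="/"):
    """Utilization: fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def cpu_saturation():
    """Saturation (assumed proxy): 1-minute load average per CPU. Unix only."""
    load_1m, _, _ = os.getloadavg()
    return load_1m / (os.cpu_count() or 1)


def interface_errors(proc_net_dev):
    """Errors: receive + transmit error counters per interface.

    `proc_net_dev` is the text content of /proc/net/dev (Linux).
    """
    errors = {}
    for line in proc_net_dev.splitlines()[2:]:  # skip the two header lines
        name, _, counters = line.partition(":")
        fields = counters.split()
        if len(fields) < 16:
            continue  # not a counter line
        # Field 2 is receive "errs", field 10 is transmit "errs".
        errors[name.strip()] = int(fields[2]) + int(fields[10])
    return errors


if __name__ == "__main__":
    print("disk utilization:", round(disk_utilization("/"), 3))
    print("cpu saturation:", round(cpu_saturation(), 3))
    with open("/proc/net/dev") as handle:
        print("interface errors:", interface_errors(handle.read()))
```

In practice an agent such as Node exporter or collectd gathers these values; the sketch only shows which raw numbers sit behind each letter.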
20. The four golden signals
● Introduced in the Google SRE book
● Latency: response time, queue/wait time
● Traffic: A measure of how much demand is being placed on the service
● Errors: The rate of requests that fail
● Saturation: How “full” is the service
21. RED method
● RED was introduced by @tom_wilkie
● (Request) Rate - the number of requests, per second, your services are serving.
● (Request) Errors - the number of failed requests per second.
● (Request) Duration - distributions of the amount of time each request takes.
● Subset of “The Four Golden Signals”
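
To make the three RED numbers concrete, here is a minimal sketch that computes them from a batch of request records observed during a time window. The `Request` record, the function names, and the choice of nearest-rank percentiles for the duration distribution are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    failed: bool       # True if the request ended in an error


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, with q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Rate, Errors, and Duration for requests seen during `window_s` seconds."""
    durations = sorted(r.duration_s for r in requests)
    failures = sum(1 for r in requests if r.failed)
    return {
        "rate_per_s": len(requests) / window_s,
        "errors_per_s": failures / window_s,
        "p50_s": percentile(durations, 50) if durations else None,
        "p99_s": percentile(durations, 99) if durations else None,
    }


if __name__ == "__main__":
    sample = [
        Request(0.01, False),
        Request(0.02, False),
        Request(0.50, True),
        Request(0.03, False),
    ]
    print(red_summary(sample, window_s=2))
```

Reporting percentiles rather than a mean keeps slow outliers visible, which matters when the complaint is that the database "is slow".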
22. The seven golden signals
● CELT + USE introduced by @xaprb
● Concurrency: number of simultaneous requests
● Error rate
● Latency: response time
● Throughput: queries per second (QPS)
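
For MySQL/MariaDB, part of CELT can be derived from two snapshots of `SHOW GLOBAL STATUS`. The sketch below assumes the snapshots have already been fetched into dictionaries (the function name and dictionary shape are hypothetical). It uses `Threads_running` for concurrency, the `Questions` counter for throughput, and `Aborted_connects` as only a partial proxy for the error rate. Latency is left out because these status counters do not expose it; it has to come from another source such as the slow query log.

```python
def celt_from_status(prev, curr):
    """Derive concurrency, an error proxy, and throughput from two snapshots.

    `prev` and `curr` map SHOW GLOBAL STATUS variable names to values,
    taken in that order during the same server run.
    """
    elapsed_s = int(curr["Uptime"]) - int(prev["Uptime"])
    if elapsed_s <= 0:
        raise ValueError("snapshots are out of order or the server restarted")

    def per_second(counter):
        # Status counters are cumulative, so a rate is the delta over time.
        return (int(curr[counter]) - int(prev[counter])) / elapsed_s

    return {
        "concurrency": int(curr["Threads_running"]),
        "aborted_connects_per_s": per_second("Aborted_connects"),
        "throughput_qps": per_second("Questions"),
    }


if __name__ == "__main__":
    before = {"Uptime": "1000", "Questions": "50000",
              "Aborted_connects": "10", "Threads_running": "3"}
    after = {"Uptime": "1060", "Questions": "56000",
             "Aborted_connects": "13", "Threads_running": "5"}
    print(celt_from_status(before, after))
```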
23. CASE method
● CASE was introduced by @gphat
● Context-heavy
● Actionable
● Symptom-based
● Evaluated
25. Preferred approach
● The seven golden signals
● Good to measure the service quality
● System and application metrics are valuable in our case
26. How to collect the metrics?
● collectd
● Node exporter
● MySQLD exporter
● Python MySQL plugin for collectd
● A few others
27. What to do with all these metrics?
● Pick some useful “indicators” like:
○ thread usage
○ service status
○ backup status, duration, size
○ replication lag
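
Two of these indicators can be sketched as simple checks on values the server already reports. The example assumes thread usage means `Threads_connected` divided by `max_connections` and that replication lag is read from `Seconds_Behind_Master` in `SHOW SLAVE STATUS`. The function name and both thresholds are hypothetical, not the ones used in the talk.

```python
def evaluate_indicators(status, variables, replica_status=None,
                        max_thread_usage=0.8, max_lag_s=30):
    """Return a list of human-readable alerts; an empty list means healthy.

    status:         dict from SHOW GLOBAL STATUS
    variables:      dict from SHOW GLOBAL VARIABLES
    replica_status: dict from SHOW SLAVE STATUS, or None if not a replica
    """
    alerts = []

    usage = int(status["Threads_connected"]) / int(variables["max_connections"])
    if usage > max_thread_usage:
        alerts.append(f"thread usage at {usage:.0%}")

    if replica_status is not None:
        lag = replica_status.get("Seconds_Behind_Master")
        if lag is None:
            # The server reports NULL when the lag cannot be computed.
            alerts.append("replication lag unknown")
        elif int(lag) > max_lag_s:
            alerts.append(f"replication lag is {int(lag)}s")

    return alerts


if __name__ == "__main__":
    print(evaluate_indicators(
        {"Threads_connected": "90"},
        {"max_connections": "100"},
        {"Seconds_Behind_Master": "45"},
    ))
```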
42. Logs
● Log all the SQL queries (general log)
● Install an agent to ship those logs with “custom fields”
● Configure MySQL/MariaDB to log the slow queries
● Use Rsyslog with a custom template!
● Make the logs available for our users
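
To show what shipping slow queries with “custom fields” can involve, the sketch below turns slow query log text into one structured record per entry. It is not the Rsyslog template from the talk: the parser, the function name, and the `cluster` field are assumptions for illustration, and it relies only on the `# Query_time: … Lock_time: … Rows_sent: … Rows_examined: …` header line. It is deliberately simplified; for example, a `use db;` line is kept as part of the SQL text.

```python
import json
import re

# Header line that carries the timing and row counters of each entry.
QUERY_TIME = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)\s+"
    r"Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)


def parse_slow_log(text, custom_fields=None):
    """Return one dict per slow-log entry, enriched with `custom_fields`."""
    entries = []
    current = None
    for line in text.splitlines():
        match = QUERY_TIME.match(line)
        if match:
            current = dict(custom_fields or {})
            current.update(
                query_time_s=float(match["query_time"]),
                lock_time_s=float(match["lock_time"]),
                rows_sent=int(match["rows_sent"]),
                rows_examined=int(match["rows_examined"]),
                sql="",
            )
            entries.append(current)
        elif current is not None and not line.startswith(("#", "SET timestamp=")):
            # Multi-line statements are folded into a single line.
            current["sql"] = (current["sql"] + " " + line.strip()).strip()
    return entries


if __name__ == "__main__":
    sample = """# Time: 2019-02-02T10:00:00.000000Z
# User@Host: app[app] @ web1 [10.0.0.5]  Id:    12
# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 250000
SET timestamp=1549101600;
SELECT COUNT(*)
FROM orders WHERE status = 'open';
"""
    for entry in parse_slow_log(sample, {"cluster": "orders-db"}):
        print(json.dumps(entry))
```

Multi-line entries like this one are part of why shipping slow queries is harder than shipping ordinary one-line log messages.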
47. Benefits
● The DBA is not a blocker for the developers
● Better visibility on the database service
● Happy customers/developers/users
48. Conclusions
● Visibility and transparency
● Effective monitoring
● Shipping slow queries is not easy
● In our case, metrics and logs are a good combo, but we want more!
49. Next steps
● Continue to improve the SQL logging
● Make more use of sys_schema
● Metrics per database
● Publish the SLA
● Open source our probe for MySQL/MariaDB