2. Our Agenda for today
• Basics of Monitoring Scylla
• Monitoring Infrastructure
• Understanding Scylla metrics
3. Linux tools
• Linux tools are familiar, widely available, no setup needed
▪iostat, top, sar, netstat, etc.
•Good for tier-1 analysis and overviews
▪but often don’t tell the whole story,
▪and are limited to a node only.
4. The top example
• Scylla uses a polling architecture
▪Scylla running at < 100 % CPU -> definitely underloaded.
▪Scylla running at = 100 % CPU -> impossible to determine.
CPU in use CPU idle
request
poll
period
5. The top example
• Scylla uses a polling architecture
▪Scylla running at < 100 % CPU -> definitely underloaded.
▪Scylla running at = 100 % CPU -> impossible to determine.
CPU in use
poll
period
poll
period
poll
period
9. Not all issues are database issues
• Client can introduce latencies as well
▪most notably, cassandra-stress will do.
▪JHiccup - client instrumentation for client-side hiccups.
10. Our Agenda for today
• Basics of Monitoring Scylla
• Monitoring Infrastructure
• Understanding Scylla metrics
13. How to use those metrics?
• your own infrastructure
▪Whatever works for collectd, works for Scylla
• scyllatop
• prometheus + grafana
14. scyllatop
• easy to use, top-like interface.
• very high resolution
• good for ad-hoc probing
▪not very good for cluster-wide view or time progression
15. List of metrics available
• RESTful API:
$ curl http://scylla-server:10000/collectd | json_reformat
[
…
{
"enable": true,
"id": {
"plugin_instance": "#cpu",
"type_instance": "load",
"type": "gauge",
"plugin": "reactor"
}
},
• scyllatop -l:
▪ includes host metrics
# scylla running with --smp 1
$ scyllatop -l | wc -l
145
16. prometheus + grafana
•easy cluster-wide view, with pre-configured dashboards
•easy system progression view
•easy metric correlation
•adding composite metrics
•harder to setup,
-but we try to make it easier, docker images, pre-loaded dashboards.
-https://github.com/scylladb/scylla-grafana-monitoring
19. Our Agenda for today
• Basics of Monitoring Scylla
• Monitoring Infrastructure
• Understanding Scylla metrics
20. Naming of metrics
Collectd naming scheme:
{host}/{plugin}-{plugin instance}/{type}-{type instance}
• plugin - name of the component
• plugin instance - instance of the component
• type - type of metric’s value
• type instance - name of the metric of given component
22. Naming of metrics
• plugin instances usually correspond to shard numbers.
▪ Example --smp 3:
node1/reactor-0/gauge-load
node1/reactor-1/gauge-load
node1/reactor-2/gauge-load
23. • GAUGE - value as is
▪ collectd types: gauge, bytes, pending_operations, ...
▪ reactor-*/gauge-load, lsa-*/bytes-total_space, ...
• DERIVE - change over time
▪ collectd types: total_operations, derive, ...
▪ database-*/total_operations-total_reads
Data source types
24. Naming of metrics
When exported to prometheus:
collectd_{plugin}_{type} { {plugin}={plugin instance},type={type instance},instance={host} }
E.g.:
collectd_reactor_gauge{reactor=”0”,type=”load”,instance=”node1”}
28. Best reflected by reactor-*/gauge-load
• percentage of time Scylla was executing tasks
▪ excludes busy polling, execution of on-idle tasks, sleeping
▪ Updated every second and reflects past 5 seconds.
• 100 means the server is CPU-bound
CPU Utilization
30. Memory utilization metrics
total memory
standard
allocations
(non-LSA)
LSA free
memtables
(dirty)
cache
lsa-*/bytes-non_lsa_used_space
memory-*/memory-total_memory
lsa-*/bytes-total_space
memory-*/bytes-dirty cache-*/bytes-total
31. Memory utilization metrics
• Useful for detecting:
▪cache getting shrunk down due to pressure from std allocations
▪requests blocking
-only 50 % of memory is allowed to be dirty.
-Requests will block if we can’t clean fast enough.