Performance Monitoring
Understanding your Scylla Cluster
Glauber Costa & Tomasz Grabiec
Our Agenda for today
• Basics of Monitoring Scylla
• Monitoring Infrastructure
• Understanding Scylla metrics
Linux tools
• Linux tools are familiar, widely available, and need no setup
▪ iostat, top, sar, netstat, etc.
• Good for tier-1 analysis and overviews
▪ but often don’t tell the whole story,
▪ and are limited to a single node.
The top example
• Scylla uses a polling architecture
▪ Scylla running at < 100% CPU -> definitely underloaded.
▪ Scylla running at = 100% CPU -> impossible to tell from top alone, because time spent polling also counts as CPU in use.
[Figure: one poll period, split into "CPU in use" while a request is handled and "CPU idle" for the rest]
[Figure: three consecutive poll periods, all marked "CPU in use"]
iostat
• iostat: useful to find disk bottlenecks
$ iostat -x -m 1
[...]
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.00 0.00 0.50 0.00 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdb 1291.00 0.00 3690.50 453.00 234.12 51.34 141.09 8.07 1.95 1.99 1.61 0.23 94.05
xvdc 1332.50 0.00 3808.00 456.00 236.65 51.31 138.31 8.38 1.96 2.01 1.56 0.22 94.70
xvdd 1308.50 0.00 3704.50 449.50 233.14 50.78 139.98 6.83 1.65 1.69 1.27 0.23 93.95
xvde 1285.50 0.00 3632.50 454.50 229.48 51.53 140.81 7.74 1.89 1.94 1.53 0.23 93.40
xvdf 1281.50 0.00 3524.00 459.50 227.91 51.95 143.88 8.08 2.04 2.06 1.86 0.23 93.25
xvdg 1306.00 0.00 3576.50 453.50 231.10 51.70 143.71 7.58 1.89 1.92 1.64 0.23 93.50
xvdh 1302.00 0.00 3566.50 451.50 231.58 51.53 144.30 6.77 1.67 1.72 1.28 0.23 92.90
xvdi 1279.00 0.00 3627.00 448.00 235.86 51.11 144.22 7.92 1.95 1.97 1.73 0.23 93.45
md0 0.00 0.00 34234.50 3570.50 1860.41 411.33 123.07 0.00 0.00 0.00 0.00 0.00 0.00
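One way to turn this output into a quick check is sketched below. It parses the header and a few of the rows shown above and lists the devices whose %util is at or above a threshold. The function names and the 90 percent threshold are assumptions made for illustration, not part of iostat or Scylla. Note that in this sample the md0 array reports 0.00 in %util while its member disks sit around 93 to 95 percent, so the check looks at each device on its own.

```python
# Header and three sample rows copied from the iostat output above.
HEADER = ("Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz "
          "await r_await w_await svctm %util")
ROWS = [
    "xvda 0.00 0.00 0.00 0.50 0.00 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00",
    "xvdb 1291.00 0.00 3690.50 453.00 234.12 51.34 141.09 8.07 1.95 1.99 1.61 0.23 94.05",
    "md0 0.00 0.00 34234.50 3570.50 1860.41 411.33 123.07 0.00 0.00 0.00 0.00 0.00 0.00",
]


def parse_iostat(header, rows):
    """Map each device name to a {column: value} dict."""
    columns = header.split()[1:]  # drop the leading "Device:" label
    table = {}
    for row in rows:
        device, *values = row.split()
        table[device] = dict(zip(columns, map(float, values)))
    return table


def busy_devices(table, threshold=90.0):
    """Return devices whose %util is at or above the (assumed) threshold."""
    return sorted(dev for dev, stats in table.items()
                  if stats["%util"] >= threshold)


if __name__ == "__main__":
    print(busy_devices(parse_iostat(HEADER, ROWS)))
```

A real script would read the rows from the iostat process instead of a hard-coded list; the parsing stays the same as long as the column layout matches the header above.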
Not all issues are database issues
• The client can introduce latencies as well
▪ most notably, cassandra-stress does.
▪ jHiccup: instrumentation for detecting client-side hiccups.
Our Agenda for today
• Basics of Monitoring Scylla
• Monitoring Infrastructure
• Understanding Scylla metrics
collectd metrics
[Diagram: three Scylla / Agent nodes, each serving metrics over HTTP at ip:9103, together with Prometheus, Grafana (ip:3000), and a browser. The diagram also carries the label ip:65534.]
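Scylla & Agent
[Diagram: components on one node and the ports between them: Scylla (ip:25826), collectd, scyllatop, collectd_exporter (ip:65534), and HTTP at ip:9103 toward Scylla Monitoring. Data labels: "Scylla metrics" and "Scylla + OS metrics".]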
How to use those metrics?
• your own infrastructure
▪Whatever works for collectd, works for Scylla
• scyllatop
• prometheus + grafana
scyllatop
• easy to use, top-like interface.
• very high resolution
• good for ad-hoc probing
▪not very good for cluster-wide view or time progression
List of metrics available
• RESTful API:
$ curl http://scylla-server:10000/collectd | json_reformat
[
…
{
"enable": true,
"id": {
"plugin_instance": "#cpu",
"type_instance": "load",
"type": "gauge",
"plugin": "reactor"
}
},
• scyllatop -l:
▪ includes host metrics
# scylla running with --smp 1
$ scyllatop -l | wc -l
145
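As a rough sketch of how the REST output could be turned into readable metric names, the snippet below formats each entry as plugin-instance/type-instance. It assumes the response is a JSON array of entries shaped like the excerpt above. The function name is hypothetical, and skipping entries whose "enable" flag is false is an assumption about what that flag means. The sample payload is inlined so the example runs without a server.

```python
import json

# One entry copied from the excerpt above, wrapped in a JSON array.
SAMPLE = """
[
  {
    "enable": true,
    "id": {
      "plugin_instance": "#cpu",
      "type_instance": "load",
      "type": "gauge",
      "plugin": "reactor"
    }
  }
]
"""


def metric_names(payload):
    """Format each enabled entry as {plugin}-{plugin instance}/{type}-{type instance}."""
    names = []
    for entry in json.loads(payload):
        if not entry.get("enable"):
            continue  # assumption: disabled metrics are not of interest
        ident = entry["id"]
        names.append(
            f'{ident["plugin"]}-{ident["plugin_instance"]}'
            f'/{ident["type"]}-{ident["type_instance"]}'
        )
    return names


if __name__ == "__main__":
    for name in metric_names(SAMPLE):
        print(name)
```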
prometheus + grafana
• easy cluster-wide view, with pre-configured dashboards
• easy view of system progression over time
• easy metric correlation
• supports adding composite metrics
• harder to set up,
▪ but we try to make it easier with Docker images and pre-loaded dashboards:
▪ https://github.com/scylladb/scylla-grafana-monitoring
Correlating metrics
Our Agenda for today
• Basics of Monitoring Scylla
• Monitoring Infrastructure
• Understanding Scylla metrics
Naming of metrics
Collectd naming scheme:
{host}/{plugin}-{plugin instance}/{type}-{type instance}
• plugin - name of the component
• plugin instance - instance of the component
• type - type of the metric’s value
• type instance - name of the metric within the given component
E.g.:
node1/reactor-0/gauge-load
Naming of metrics
• plugin instances usually correspond to shard numbers.
▪ Example --smp 3:
node1/reactor-0/gauge-load
node1/reactor-1/gauge-load
node1/reactor-2/gauge-load
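The naming scheme can be taken apart mechanically. The sketch below splits a name such as node1/reactor-0/gauge-load into its five fields. The class and function names are made up for illustration, and the code assumes that plugin and type names contain no hyphen, which holds for every name shown in these slides.

```python
from typing import NamedTuple


class MetricName(NamedTuple):
    host: str
    plugin: str
    plugin_instance: str
    type: str
    type_instance: str


def parse_metric(name):
    """Split {host}/{plugin}-{plugin instance}/{type}-{type instance}."""
    host, plugin_part, type_part = name.split("/")
    # Split at the first hyphen only; the instance part may contain more.
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    return MetricName(host, plugin, plugin_instance, type_, type_instance)


if __name__ == "__main__":
    names = [
        "node1/reactor-0/gauge-load",
        "node1/reactor-1/gauge-load",
        "node1/reactor-2/gauge-load",
    ]
    # With --smp 3 the plugin instances are the shard numbers 0, 1, 2.
    print([int(parse_metric(n).plugin_instance) for n in names])
```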
Data source types
• GAUGE - value as is
▪ collectd types: gauge, bytes, pending_operations, ...
▪ e.g. reactor-*/gauge-load, lsa-*/bytes-total_space, ...
• DERIVE - change over time
▪ collectd types: total_operations, derive, ...
▪ e.g. database-*/total_operations-total_reads
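To make the difference concrete: a GAUGE sample is used directly, while a DERIVE metric is a running counter that only becomes meaningful as a rate between two samples. The sketch below shows that arithmetic. The function name and the sample numbers are invented for illustration; the monitoring stack normally performs this calculation for you.

```python
def derive_rate(prev, curr):
    """Per-second rate between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = prev, curr
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


if __name__ == "__main__":
    # Hypothetical samples of database-*/total_operations-total_reads,
    # taken 10 seconds apart.
    earlier = (10.0, 1000)
    later = (20.0, 6000)
    print(derive_rate(earlier, later))  # reads per second over the interval
```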
Naming of metrics
When exported to Prometheus:
collectd_{plugin}_{type} { {plugin}={plugin instance},type={type instance},instance={host} }
E.g.:
collectd_reactor_gauge{reactor="0",type="load",instance="node1"}
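The mapping above can be expressed as a small function. This is only a sketch of the renaming rule as written on the slide, not the exporter's actual code. The function name is hypothetical, and it again assumes that plugin and type names contain no hyphen.

```python
def to_prometheus(collectd_name):
    """Rewrite {host}/{plugin}-{instance}/{type}-{type instance} per the rule above."""
    host, plugin_part, type_part = collectd_name.split("/")
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    labels = (
        f'{plugin}="{plugin_instance}",'
        f'type="{type_instance}",'
        f'instance="{host}"'
    )
    return f"collectd_{plugin}_{type_}{{{labels}}}"


if __name__ == "__main__":
    print(to_prometheus("node1/reactor-0/gauge-load"))
```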
Metric plugins
[Diagram: plugins grouped by layer]
• coordinator: transport (CQL server), thrift, storage_proxy
• replica: database, memtables, cache, commitlog
• seastar framework: reactor, memory, io_queue
• also shown: lsa, smp, compaction_manager
Throughput metrics
• transport-*/total_operations-requests_served
▪ counts incoming CQL requests
▪ coordinator-side
• database-*/total_operations-total_{reads|writes}
▪ counts incoming replica read/write requests
• both are DERIVE-typed
Error metrics
• storage_proxy-*/total_operations-{read|write} timeouts
▪ count timed-out read and write requests
▪ coordinator-side
• check coordinator logs
• check replica logs
• check for overload
CPU Utilization
Best reflected by reactor-*/gauge-load
• percentage of time Scylla was executing tasks
▪ excludes busy polling, execution of on-idle tasks, sleeping
▪ updated every second and reflects the past 5 seconds
• 100 means the server is CPU-bound
Memory utilization metrics
[Diagram: total memory divided into standard allocations (non-LSA), LSA free, memtables (dirty), and cache. Metrics labelled on the diagram: memory-*/memory-total_memory, lsa-*/bytes-non_lsa_used_space, lsa-*/bytes-total_space, memory-*/bytes-dirty, cache-*/bytes-total]
Memory utilization metrics
• Useful for detecting:
▪ cache being shrunk due to pressure from standard allocations
▪ requests blocking
- only 50% of memory is allowed to be dirty.
- requests will block if we can’t clean fast enough.
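A simple way to watch for the blocking condition is to compare dirty memory against total memory. The sketch below does that under two assumptions: that the 50% limit applies to the value of memory-*/memory-total_memory, and that memory-*/bytes-dirty is the dirty amount. The function names and the warning margin are invented for illustration.

```python
DIRTY_LIMIT = 0.5  # from the slide: only 50% of memory is allowed to be dirty


def dirty_fraction(dirty_bytes, total_bytes):
    """Fraction of total memory currently held by dirty memtables."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return dirty_bytes / total_bytes


def near_dirty_limit(dirty_bytes, total_bytes, margin=0.05):
    """True when dirty memory is within `margin` of the limit (margin is an assumption)."""
    return dirty_fraction(dirty_bytes, total_bytes) >= DIRTY_LIMIT - margin


if __name__ == "__main__":
    # Hypothetical readings for one shard, in bytes.
    print(near_dirty_limit(dirty_bytes=480, total_bytes=1000))
    print(near_dirty_limit(dirty_bytes=100, total_bytes=1000))
```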
Cache metrics
• cache-*/total_operations-*:
▪ hits, misses - entries found/not found in cache during read
▪ merges - entries updated during memtable flush
▪ insertions - entries added (on miss, memtable flush)
▪ evictions - entries removed due to memory pressure
▪ removals - entries invalidated (ring ownership change)
• currently entries are per-partition
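The hits and misses counters combine into a hit ratio. The sketch below computes it from the increase of each counter over the same interval, since both are running totals. The function name and the sample numbers are hypothetical.

```python
def hit_ratio(hits, misses):
    """Share of cache lookups that were hits, or None if there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


if __name__ == "__main__":
    # Hypothetical increases of cache-*/total_operations-hits and -misses
    # over the same time window.
    print(hit_ratio(hits=900, misses=100))
```

Returning None for an idle interval avoids reporting a misleading 0% or 100% when the cache was simply not used.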
I/O Queue metrics
• Scylla uses the I/O Queue to provide fairness among:
▪ commitlog, memtables, query, etc
• Per-class metrics:
▪ io_queue-*/derive-{class name}: bandwidth (bps)
▪ io_queue-*/delay-{class name}: queue latency, not counting disk access (s)
▪ io_queue-*/queue_length-{class name}: number of requests waiting
▪ io_queue-*/total_operations-{class name}: IOPS
Thank You!
github.com/scylladb/scylla-grafana-monitoring
Tomasz: tgrabiec@scylladb.com / @tgrabiec
Glauber: glauber@scylladb.com / @glcst

Performance Monitoring: Understanding Your Scylla Cluster

  • 1. Performance Monitoring Understanding your Scylla Cluster Glauber Costa & Tomasz Grabiec
  • 2. Our Agenda for today • Basics of Monitoring Scylla • Monitoring Infrastructure • Understanding Scylla metrics
  • 3. Linux tools • Linux tools are familiar, widely available, no setup needed ▪iostat, top, sar, netstat, etc. •Good for tier-1 analysis and overviews ▪but often don’t tell the whole story, ▪and are limited to a node only.
  • 4. The top example • Scylla uses a polling architecture ▪Scylla running at < 100 % CPU -> definitely underloaded. ▪Scylla running at = 100 % CPU -> impossible to determine. CPU in use CPU idle request poll period
  • 5. The top example • Scylla uses a polling architecture ▪Scylla running at < 100 % CPU -> definitely underloaded. ▪Scylla running at = 100 % CPU -> impossible to determine. CPU in use poll period poll period poll period
  • 6. iostat • iostat: useful to find disk bottlenecks $ iostat -x -m 1 [...] Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.00 0.00 0.50 0.00 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 xvdb 1291.00 0.00 3690.50 453.00 234.12 51.34 141.09 8.07 1.95 1.99 1.61 0.23 94.05 xvdc 1332.50 0.00 3808.00 456.00 236.65 51.31 138.31 8.38 1.96 2.01 1.56 0.22 94.70 xvdd 1308.50 0.00 3704.50 449.50 233.14 50.78 139.98 6.83 1.65 1.69 1.27 0.23 93.95 xvde 1285.50 0.00 3632.50 454.50 229.48 51.53 140.81 7.74 1.89 1.94 1.53 0.23 93.40 xvdf 1281.50 0.00 3524.00 459.50 227.91 51.95 143.88 8.08 2.04 2.06 1.86 0.23 93.25 xvdg 1306.00 0.00 3576.50 453.50 231.10 51.70 143.71 7.58 1.89 1.92 1.64 0.23 93.50 xvdh 1302.00 0.00 3566.50 451.50 231.58 51.53 144.30 6.77 1.67 1.72 1.28 0.23 92.90 xvdi 1279.00 0.00 3627.00 448.00 235.86 51.11 144.22 7.92 1.95 1.97 1.73 0.23 93.45 md0 0.00 0.00 34234.50 3570.50 1860.41 411.33 123.07 0.00 0.00 0.00 0.00 0.00 0.00
  • 7. Linux & Client side metrics • iostat: useful to find disk bottlenecks $ iostat -x -m 1 [...] Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.00 0.00 0.50 0.00 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 xvdb 1291.00 0.00 3690.50 453.00 234.12 51.34 141.09 8.07 1.95 1.99 1.61 0.23 94.05 xvdc 1332.50 0.00 3808.00 456.00 236.65 51.31 138.31 8.38 1.96 2.01 1.56 0.22 94.70 xvdd 1308.50 0.00 3704.50 449.50 233.14 50.78 139.98 6.83 1.65 1.69 1.27 0.23 93.95 xvde 1285.50 0.00 3632.50 454.50 229.48 51.53 140.81 7.74 1.89 1.94 1.53 0.23 93.40 xvdf 1281.50 0.00 3524.00 459.50 227.91 51.95 143.88 8.08 2.04 2.06 1.86 0.23 93.25 xvdg 1306.00 0.00 3576.50 453.50 231.10 51.70 143.71 7.58 1.89 1.92 1.64 0.23 93.50 xvdh 1302.00 0.00 3566.50 451.50 231.58 51.53 144.30 6.77 1.67 1.72 1.28 0.23 92.90 xvdi 1279.00 0.00 3627.00 448.00 235.86 51.11 144.22 7.92 1.95 1.97 1.73 0.23 93.45 md0 0.00 0.00 34234.50 3570.50 1860.41 411.33 123.07 0.00 0.00 0.00 0.00 0.00 0.00
  • 8. Linux & Client side metrics • iostat: useful to find disk bottlenecks $ iostat -x -m 1 [...] Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.00 0.00 0.50 0.00 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 xvdb 1291.00 0.00 3690.50 453.00 234.12 51.34 141.09 8.07 1.95 1.99 1.61 0.23 94.05 xvdc 1332.50 0.00 3808.00 456.00 236.65 51.31 138.31 8.38 1.96 2.01 1.56 0.22 94.70 xvdd 1308.50 0.00 3704.50 449.50 233.14 50.78 139.98 6.83 1.65 1.69 1.27 0.23 93.95 xvde 1285.50 0.00 3632.50 454.50 229.48 51.53 140.81 7.74 1.89 1.94 1.53 0.23 93.40 xvdf 1281.50 0.00 3524.00 459.50 227.91 51.95 143.88 8.08 2.04 2.06 1.86 0.23 93.25 xvdg 1306.00 0.00 3576.50 453.50 231.10 51.70 143.71 7.58 1.89 1.92 1.64 0.23 93.50 xvdh 1302.00 0.00 3566.50 451.50 231.58 51.53 144.30 6.77 1.67 1.72 1.28 0.23 92.90 xvdi 1279.00 0.00 3627.00 448.00 235.86 51.11 144.22 7.92 1.95 1.97 1.73 0.23 93.45 md0 0.00 0.00 34234.50 3570.50 1860.41 411.33 123.07 0.00 0.00 0.00 0.00 0.00 0.00
  • 9. Not all issues are database issues • Client can introduce latencies as well ▪most notably, cassandra-stress will do. ▪JHiccup - client instrumentation for client-side hiccups.
  • 10. Our Agenda for today • Basics of Monitoring Scylla • Monitoring Infrastructure • Understanding Scylla metrics
  • 11. collectd metrics Prometheus Scylla / Agent Browserip:9103 Grafana ip:65534 ip:3000 ip:9103 ip:9103 HTTP Scylla / Agent Scylla / Agent
  • 12. Scylla & Agent Scylla Monitoring collectd collectd_exporter ip:65534 Scylla metrics scyllatop Scylla ip:25826 Scylla + OS metrics Ip:9103 HTTP
  • 13. How to use those metrics? • your own infrastructure ▪Whatever works for collectd, works for Scylla • scyllatop • prometheus + grafana
  • 14. scyllatop • easy to use, top-like interface. • very high resolution • good for ad-hoc probing ▪not very good for cluster-wide view or time progression
  • 15. List of metrics available • RESTful API:
      $ curl http://scylla-server:10000/collectd | json_reformat
      [
        …
        {
          "enable": true,
          "id": {
            "plugin_instance": "#cpu",
            "type_instance": "load",
            "type": "gauge",
            "plugin": "reactor"
          }
        },
    • scyllatop -l: ▪ includes host metrics
      # scylla running with --smp 1
      $ scyllatop -l | wc -l
      145
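The JSON returned by the RESTful API can be turned into readable metric names with a few lines of Python. This is only a sketch: `SAMPLE` is shaped like the single entry shown on the slide, the function name `metric_names` is made up for illustration, and treating `#cpu` as a placeholder for the shard number is an assumption.

```python
import json

# Shaped like the one entry shown on the slide; a real response has many entries.
SAMPLE = '''[
  {"enable": true,
   "id": {"plugin_instance": "#cpu", "type_instance": "load",
          "type": "gauge", "plugin": "reactor"}}
]'''


def metric_names(payload, shard="*"):
    """Return plugin-instance/type-instance names for all enabled metrics."""
    names = []
    for entry in json.loads(payload):
        if not entry.get("enable"):
            continue  # skip metrics that are switched off
        ident = entry["id"]
        # Assumption: "#cpu" stands in for the shard number.
        instance = ident["plugin_instance"].replace("#cpu", shard)
        names.append(
            f'{ident["plugin"]}-{instance}/{ident["type"]}-{ident["type_instance"]}'
        )
    return names


print(metric_names(SAMPLE))
```

Passing `shard="0"` instead of the default wildcard yields the name for one specific shard.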
  • 16. prometheus + grafana •easy cluster-wide view, with pre-configured dashboards •easy system progression view •easy metric correlation •adding composite metrics •harder to set up, but we try to make it easier: Docker images, pre-loaded dashboards. https://github.com/scylladb/scylla-grafana-monitoring
  • 17. prometheus + grafana • prometheus/grafana images, pre-loaded with dashboards: ▪https://github.com/scylladb/scylla-grafana-monitoring
  • 19. Our Agenda for today • Basics of Monitoring Scylla • Monitoring Infrastructure • Understanding Scylla metrics
  • 20. Naming of metrics Collectd naming scheme: {host}/{plugin}-{plugin instance}/{type}-{type instance} • plugin - name of the component • plugin instance - instance of the component • type - type of the metric’s value • type instance - name of the metric within the given component
  • 21. Naming of metrics Collectd naming scheme: {host}/{plugin}-{plugin instance}/{type}-{type instance} E.g.: node1/reactor-0/gauge-load
  • 22. Naming of metrics • plugin instances usually correspond to shard numbers. ▪ Example with --smp 3: node1/reactor-0/gauge-load node1/reactor-1/gauge-load node1/reactor-2/gauge-load
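The naming scheme above can be taken apart mechanically. The following is a minimal sketch, assuming that plugin and type names contain no hyphen (so the first hyphen separates the name from its instance); `MetricId` and `parse_metric` are illustrative names, not part of Scylla or collectd.

```python
from typing import NamedTuple


class MetricId(NamedTuple):
    host: str
    plugin: str
    plugin_instance: str
    type: str
    type_instance: str


def parse_metric(name: str) -> MetricId:
    """Split {host}/{plugin}-{plugin instance}/{type}-{type instance}."""
    host, plugin_part, type_part = name.split("/")
    # Assumption: the first hyphen separates the name from its instance.
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    return MetricId(host, plugin, plugin_instance, type_, type_instance)


# One load metric per shard, as in the --smp 3 example.
for shard in range(3):
    print(parse_metric(f"node1/reactor-{shard}/gauge-load"))
```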
  • 23. Data source types • GAUGE - value as is ▪ collectd types: gauge, bytes, pending_operations, ... ▪ reactor-*/gauge-load, lsa-*/bytes-total_space, ... • DERIVE - change over time ▪ collectd types: total_operations, derive, ... ▪ database-*/total_operations-total_reads
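The difference between the two data source types shows up when plotting: a GAUGE sample is used as is, while a DERIVE sample only becomes meaningful as a rate between two readings. A small sketch of that conversion follows; the sample values are invented and `derive_rate` is a hypothetical helper.

```python
def derive_rate(samples):
    """Turn (timestamp_seconds, counter_value) pairs into per-second rates.

    `samples` must be in time order; each rate covers one interval
    between two consecutive readings.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates


# Invented readings of a total_reads counter taken 10 seconds apart.
total_reads = [(0, 1000), (10, 1500), (20, 2500)]
print(derive_rate(total_reads))
```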
  • 24. Naming of metrics When exported to prometheus: collectd_{plugin}_{type} { {plugin}={plugin instance},type={type instance},instance={host} } E.g.: collectd_reactor_gauge{reactor="0",type="load",instance="node1"}
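The mapping from a collectd name to its prometheus form can be sketched as a string transformation. This only mirrors the template on the slide; `to_prometheus` is an illustrative name, and the sketch assumes plugin and type names contain no hyphen.

```python
def to_prometheus(name: str) -> str:
    """Rewrite {host}/{plugin}-{instance}/{type}-{type instance} as a prometheus series."""
    host, plugin_part, type_part = name.split("/")
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    # Doubled braces produce literal { and } in an f-string.
    labels = (
        f'{{{plugin}="{plugin_instance}",'
        f'type="{type_instance}",'
        f'instance="{host}"}}'
    )
    return f"collectd_{plugin}_{type_}{labels}"


print(to_prometheus("node1/reactor-0/gauge-load"))
```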
  • 25. Metric plugins [Diagram. Groups: coordinator, replica, seastar framework. Plugins: transport (CQL server), thrift, storage_proxy, database, memtables, cache, commitlog, reactor, memory, io_queue, lsa, smp, compaction_manager.]
  • 26. Throughput metrics • transport-*/total_operations-requests_served ▪ counts incoming CQL requests ▪ coordinator-side • database-*/total_operations-total_{reads|writes} ▪ counts incoming replica read/write requests • both are DERIVE-typed
  • 27. Error metrics • storage_proxy-*/total_operations-{read|write} timeouts ▪ count read and write requests that timed out ▪ coordinator-side • check coordinator logs • check replica logs • check for overload
  • 28. CPU Utilization • Best reflected by reactor-*/gauge-load • percentage of time Scylla was executing tasks ▪ excludes busy polling, execution of on-idle tasks, sleeping ▪ updated every second and reflects the past 5 seconds. • 100 means the server is CPU-bound
  • 29. Memory utilization metrics • total memory: standard allocations (non-LSA), LSA, free • LSA: memtables (dirty), cache
  • 30. Memory utilization metrics • total memory: memory-*/memory-total_memory • standard allocations (non-LSA): lsa-*/bytes-non_lsa_used_space • LSA: lsa-*/bytes-total_space • memtables (dirty): memory-*/bytes-dirty • cache: cache-*/bytes-total • free (no metric listed)
  • 31. Memory utilization metrics • Useful for detecting: ▪the cache being shrunk under pressure from standard allocations ▪requests blocking: only 50 % of memory is allowed to be dirty, and requests will block if we can’t clean fast enough.
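A rough way to watch for both conditions is to combine the memory gauges into a few derived numbers. The sketch below is an assumption-heavy illustration: it treats free memory as total minus non-LSA minus LSA space, and it measures the 50 % dirty limit against total memory. The function name and the sample values are invented.

```python
DIRTY_LIMIT = 0.5  # from the slide: only 50 % of memory may be dirty


def memory_report(total, non_lsa_used, lsa_total, dirty):
    """Derive free memory and dirty-memory headroom from four gauges (bytes)."""
    # Assumption: whatever is neither non-LSA nor LSA space is free.
    free = total - non_lsa_used - lsa_total
    return {
        "free": free,
        "dirty_fraction": dirty / total,
        # Assumption: the limit applies to total memory.
        "dirty_headroom": DIRTY_LIMIT * total - dirty,
    }


# Invented gauge readings, in arbitrary units.
report = memory_report(total=1000, non_lsa_used=200, lsa_total=700, dirty=400)
print(report)
```

A shrinking `dirty_headroom` suggests requests may soon block; a growing `non_lsa_used` with a shrinking cache points at pressure from standard allocations.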
  • 33. Cache metrics • cache-*/total_operations-*: ▪ hits, misses - entries found/not found in cache during read ▪ merges - entries updated during memtable flush ▪ insertions - entries added (on miss, memtable flush) ▪ evictions - entries removed due to memory pressure ▪ removals - entries invalidated (ring ownership change) • currently entries are per-partition
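Because these are counters, a hit ratio is best computed over the interval between two readings rather than from the raw totals. A minimal sketch, with an invented helper name and invented readings:

```python
def cache_hit_ratio(prev, curr):
    """Hit ratio over the interval between two readings of the cache counters.

    `prev` and `curr` are dicts holding the cumulative "hits" and "misses"
    values. Returns None when no reads happened in the interval.
    """
    hits = curr["hits"] - prev["hits"]
    misses = curr["misses"] - prev["misses"]
    reads = hits + misses
    return hits / reads if reads else None


# Invented readings taken some time apart.
earlier = {"hits": 1000, "misses": 200}
later = {"hits": 1900, "misses": 300}
print(cache_hit_ratio(earlier, later))
```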
  • 35. I/O Queue metrics • Scylla uses the I/O Queue to provide fairness among: ▪ commitlog, memtables, query, etc. • io_queue-*/derive-{class name}: bandwidth (bps) • io_queue-*/delay-{class name}: queue latency, not counting disk access (s) • io_queue-*/queue_length-{class name}: number of requests waiting • io_queue-*/total_operations-{class name}: IOPS