In this presentation, I explain what Ceph Telemetry is and why it matters to users of the Ceph project. We look at lessons learned, early analysis of the data, and future roadmap items.
This was meant to be presented at Cephalocon 2020 in Seoul and SUSECON in Dublin. A video recording is available on YouTube: https://youtu.be/P2nq2dIiZSk
I look forward to discussing the content with you online!
Agenda
1. Goals and Motivation
2. Data collection methodology
3. Scope and limitations
4. “Pretty” pictures
5. Q&A
Goals and Motivation (developer side)
Improve product/project decisions
Understand actual deployments
Detect anomalies and trends proactively
Automated telemetry augments support
Support cases are only opened once an issue has escalated to human attention
Data from support incidents is biased towards unhealthy environments
We want to identify issues before they escalate to support incidents
… and better understand the impact of a reported support incident
Goals and Motivation (user/customer PoV)
Improve product/project decisions to reflect your usage
Make sure developers understand your deployments
Detect anomalies and trends proactively before they affect your systems
Automated telemetry vs surveys
Surveys are limited in scope and depth
Surveys provide qualitative data and human insights
Telemetry is automated and delivers more frequent updates
Telemetry has fewer typos :-)
Automated telemetry + surveys: <3
Sneak peek: Community Survey’19
404 responses
Total capacity reported: ~1184 PB
– Uncertain, since not all responses used consistent units
33% say they have enabled Telemetry already <3
… does this match the reports?
Full(er) analysis upcoming
Why users have not enabled Telemetry
84 – Weren’t aware the feature existed
74 – Wish to understand data privacy better
54 – Run Ceph versions that do not support it yet
33 – Are in firewalled or air-gapped environments
Telemetry methodology
● Clusters securely report aggregate statistics
– Data is anonymized; no IP addresses, hostnames, ... are stored!
● “Upstream first” via the Ceph Foundation
– Community Data License Agreement – Sharing, Version 1.0
– A shared data corpus improves outcomes
● Opt-in, not (yet) enabled by default
# ceph telemetry on
Ceph community support for telemetry
Upstream support began in Ceph Mimic
Significant enhancements in Nautilus
Backported to Luminous
Supported in all current commercial releases
Examples of data included with telemetry
Basic data:
– Total aggregates for capacity and usage
– Number of OSDs, MONs, hosts
– Version aggregates (Ceph, kernel, distribution)
– CephFS metrics, number of RBDs, pool data
Crashes (can be disabled separately)
Device metrics (can be disabled separately)
# ceph telemetry show
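The command above prints the pending report as JSON. As a minimal sketch (nothing beyond the JSON output format is assumed; the exact section names vary by release), its top-level sections can be listed with:

import json, subprocess

# Fetch the report that would be submitted and list its sections.
report = json.loads(subprocess.check_output(["ceph", "telemetry", "show"]))
print(sorted(report.keys()))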
Limitations – Caveat emptor
Biased sample!
– “Recent” versions only
– Not enabled by default; users need to actively enable it
– Environments need Internet access for the upload
– Enterprise environments likely under-represented
Thus: not representative of the whole population, treat with care!
Look at trends; don’t worry about exact numbers
Exploratory Data Analysis
Python (ipython, pandas)
Data preparation – clean-up, flatten into table
Resample to common intervals (daily, extrapolated)
Start evaluating the data
Find errors in data set, go back to start
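As a minimal sketch of these steps on toy data (the field names below are illustrative placeholders, not the actual telemetry schema):

import pandas as pd

# Toy stand-in for the raw telemetry reports.
reports = [
    {"cluster_id": "a", "ts": "2020-03-01", "usage": {"total_bytes": 10}},
    {"cluster_id": "a", "ts": "2020-03-08", "usage": {"total_bytes": 12}},
    {"cluster_id": "b", "ts": "2020-03-03", "usage": {"total_bytes": 7}},
]

df = pd.json_normalize(reports)        # flatten nested JSON into columns
df["ts"] = pd.to_datetime(df["ts"])
df = df.set_index("ts").sort_index()

# One value per cluster per day, carrying the last report forward so
# clusters that report less often still count on every day in between.
daily = (df.groupby("cluster_id")
           .resample("D")["usage.total_bytes"]
           .last()
           .groupby(level="cluster_id")
           .ffill())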
Time for pretty pictures
● Overall trends
● Example of finding a bug
● Version and feature adoption
● Identifying most common practices
● Sizing in the real world
Cross-checking this with the survey results:
In [183]: t_on = survey[
     ...:     survey['Is telemetry enabled in your cluster?'] == 'Yes']

In [184]: t_on['Total raw capacity'].agg('sum') / 10**3  # scaled to PB
Out[184]: 280.126

In [185]: t_on['How many clusters ...'].agg('sum')
Out[185]: 308.0
When do people update?
Important for staff planning, etc.
Compute rate of change per version for every day
– Excursion: total flow through versions
Aggregate the absolute values per day for total rate of change
Aggregate by day of week
… also a good example of the caveats to be mindful of:
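A minimal sketch of that computation on toy data (the real analysis runs over the resampled telemetry table):

import pandas as pd

# Toy stand-in: number of reporting clusters per version per day.
idx = pd.date_range("2020-02-03", periods=14, freq="D")
counts = pd.DataFrame(
    {"14.2.6": [50, 50, 48, 45, 40, 40, 39, 35, 30, 30, 28, 25, 25, 24],
     "14.2.7": [0, 0, 2, 5, 10, 10, 11, 15, 20, 20, 22, 25, 25, 26]},
    index=idx)

flow = counts.diff()                   # rate of change per version per day
churn = flow.abs().sum(axis=1)         # total movement between versions
by_weekday = churn.groupby(churn.index.dayofweek).mean()  # 0 = Monday
print(by_weekday)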
Placement Groups: How many per pool?
● Quite important for the even balancing of data
● Rule of thumb is to have ~100 PGs per OSD (see the sketch after this list)
● Should be rounded to a power of two
● The exact formula is a bit more involved, as it varies with the data distribution between pools, pool “size” (replica count), ...
● What do users do?
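A minimal sketch of that rule of thumb; the function and its data_fraction parameter are illustrative, and the project’s PG calculator guidance is more nuanced:

import math

def suggested_pg_count(num_osds, pool_size, target_per_osd=100,
                       data_fraction=1.0):
    # (OSDs x ~100 PGs/OSD x share of data) / replica count,
    # rounded to the nearest power of two.
    raw = num_osds * target_per_osd * data_fraction / pool_size
    return 2 ** max(0, round(math.log2(raw)))

# Example: 40 OSDs, 3x replication, pool holding ~half the data
print(suggested_pg_count(40, 3, data_fraction=0.5))  # -> 512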
How did the Ceph project remedy this?
Improve documentation, remove bad example, clarify impact
Improve UI/UX experience
Add HEALTH_WARN if the state is detected
Introduce pg_autoscaler to fully automate this (example below)
– Available in SUSE Enterprise Storage 6 MU
https://ceph.io/community/the-first-telemetry-results-are-in/
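For reference, a minimal example of turning the autoscaler on for one pool (commands as documented for Nautilus; substitute your pool name):

# ceph mgr module enable pg_autoscaler
# ceph osd pool set <pool> pg_autoscale_mode on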
Prioritization
What is the actual usage pattern?
How significant would an issue in a specific feature/area be?
Focus QA and assess support incident impact
But also: understand why some users still rely on a “legacy” feature
Are we ready to deprecate something?
Erasure code k+m trade-offs
Space overhead and write amplification:
– Larger k: more space-efficient
– Larger m: better durability and availability
– More shards mean more network traffic
Data blocks tend to be power-of-two in size (4K, 4M, etc.)
– Divisible by k?
Is this what users really intend? Better docs, guidance?
[Chart: distribution of reported erasure-code profiles by k and m]
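A small sketch of these trade-offs; the arithmetic follows directly from the definitions of k and m:

def ec_tradeoffs(k, m, object_bytes=4 * 2**20):  # 4M objects
    overhead = (k + m) / k       # raw bytes stored per user byte
    return overhead, object_bytes % k == 0

for k, m in [(2, 1), (3, 2), (4, 2), (6, 3)]:
    overhead, divides = ec_tradeoffs(k, m)
    print(f"k={k} m={m}: {overhead:.2f}x overhead, "
          f"4M object divisible by k: {divides}")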
Let’s talk real world sizing
Everyone wants to know what other people do
Reflects market sweet spots
Currently only a snapshot, not enough data to identify hardware trends
Future enhancements
Support different telemetry transport methods (with registration?)
Include more relevant metrics, as identified by questions we cannot yet answer
– Performance metrics, OSD variance, per-pool usage, client versions/numbers …
– Device and fault data for predictive failure analysis
– Data mining crash data
Automated dashboards on Ceph website
Consider how to enable this by default once acceptance is up
Questions? Answers!
# ceph telemetry on
Help Ceph serve you better.
https://ceph.io/resources/
mailto: lmb@suse.com
https://twitter.com/larsmb
https://www.linkedin.com/in/larsmb/
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a
product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making
purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document,
and specifically disclaims any express or implied warranties of merchantability or fitness for any particular
purpose. The development, release, and timing of features or functionality described for SUSE products remains at the
sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content,
at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced
in this presentation are trademarks or registered trademarks of SUSE, LLC, Inc. in the United States and other
countries. All third-party trademarks are the property of their respective owners.