1
Improving Software-Defined-Storage outcomes
through telemetry insights
Ceph Telemetry: why it’s useful to you and
why you should enable it
Lars Marowsky-Brée
Architect Software-Defined-Storage
lmb@suse.com
2
Agenda
1. Goals and Motivation
2. Data collection methodology
3. Scope and limitations
4. “Pretty” pictures
5. Q&A
3
Goals and Motivation (developer side)
Improve product/project decisions
Understand actual deployments
Detect anomalies and trends pro-actively
4
Automated telemetry augments support
Support cases are only opened once an issue has escalated to human attention
Data from support incidents is biased towards unhealthy environments
We want to identify issues before they escalate to support incidents
and better understand the impact of a reported support incident
5
Goals and Motivation (user/customer PoV)
Improve product/project decisions to reflect your usage
Make sure developers understand your deployments
Detect anomalies and trends pro-actively before they affect your systems
6
Automated telemetry vs surveys
Surveys are limited in scope and depth
Surveys provide qualitative data and human insights
Telemetry is automated and delivers more frequent updates
Telemetry has fewer typos :-)
Automated telemetry + surveys: <3
7
Sneak peek: Community Survey’19
404 responses
Total capacity reported: ~1184 PB
– Uncertain, since respondents did not all use consistent units
33% say they have enabled Telemetry already <3
… does this match the reports?
Full(er) analysis upcoming
8
Why users have not enabled Telemetry
84 Weren’t aware the feature existed
74 Wish to understand data privacy better
54 Run Ceph versions that do not support it yet
33 Are in firewalled or airgapped environments
9
Telemetry methodology
● Clusters securely report aggregate statistics
– Data is anonymized, no IP addresses/hostnames/... stored!
● “Upstream first” via the Ceph Foundation
– Community Data License Agreement – Sharing, Version 1.0
– Shared data corpus improves outcomes
● Opt-in, not (yet) enabled by default
# ceph telemetry on
10
Ceph community support for telemetry
Upstream support began in Ceph Mimic
Significant enhancements in Nautilus
Backported to Luminous
Supported in all current commercial releases
11
Examples of data included with telemetry
Basic data:
– Total aggregates for capacity and usage
– Number of OSDs, MONs, hosts
– Versions (Ceph, kernel, distribution) aggregates
– CephFS metrics, number of RBDs, pool data
Crashes (can be disabled separately)
Device metrics (can be disabled separately)
# ceph telemetry show
12
Limitations – Caveat emptor
Biased sample!
– “Recent” versions only
– Not enabled by default, users need to actively enable
– Environments need access to Internet for upload
– Enterprise environments likely under-represented
Thus: not representative of whole population, treat with care!
Look at trends; don’t worry about exact numbers
13
Exploratory Data Analysis
Python (ipython, pandas)
Data preparation – clean-up, flatten into table
Resample to common intervals (daily, extrapolated)
Start evaluating the data
Find errors in data set, go back to start
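The preparation steps above can be sketched in pandas. This is a minimal illustration, not the pipeline used for the talk: the report layout, the field names (`cluster`, `ts`, `osd.count`), and the values are all hypothetical, and forward-filling is just one possible reading of "extrapolated".

```python
import pandas as pd

# Hypothetical, heavily simplified reports; real telemetry reports
# carry many more fields and use different names.
reports = [
    {"cluster": "a", "ts": "2020-01-01T03:00:00", "osd": {"count": 10}},
    {"cluster": "b", "ts": "2020-01-02T12:00:00", "osd": {"count": 4}},
    {"cluster": "a", "ts": "2020-01-03T05:00:00", "osd": {"count": 12}},
]

# Flatten: nested keys become dotted column names such as "osd.count".
df = pd.json_normalize(reports)
df["ts"] = pd.to_datetime(df["ts"])

# One column per cluster, one row per report timestamp.
osds = df.pivot_table(index="ts", columns="cluster",
                      values="osd.count", aggfunc="last")

# Resample to a common daily interval; forward-fill carries the last
# known report into days on which a cluster did not report.
daily = osds.resample("D").last().ffill()

# Total OSDs across all clusters per day.
total = daily.sum(axis=1)
print(total)
```

Forward-filling keeps a cluster in the totals after its last report, so a real analysis would need a cut-off for clusters that stop reporting.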
14
Time for pretty pictures
● Overall trends
● Example of finding a bug
● Version and feature adoption
● Identifying most common practices
● Sizing in the real world
15
How many clusters are reporting in?
16
Total capacity reporting (Petabytes)
17
Cross-checking this with the survey results:
In [183]: t_on = survey[
     ...:     survey['Is telemetry enabled in your cluster?'] == 'Yes']
In [184]: t_on['Total raw capacity'].agg('sum')/10**3
Out[184]: 280.126
In [185]: t_on['How many clusters ...'].agg('sum')
Out[185]: 308.0
18
Major Ceph versions in the field
19
Breakdown of Ceph v14.x.y on OSDs in the field
20
Clusters running at least one node at 14.x.y.z
22
When do people update?
Important for staff planning etc
Compute rate of change per version for every day
– Excursion: total flow through versions
Aggregate the absolute values per day for total rate of change
Aggregate by day of week
… also a good example of the caveats to be mindful of:
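A minimal sketch of the day-of-week computation described above, assuming a table of daily OSD counts per version. The version labels and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical daily OSD counts per Ceph version; 2020-03-02 is a Monday.
days = pd.date_range("2020-03-02", periods=7, freq="D")
counts = pd.DataFrame(
    {
        "14.2.7": [100, 90, 90, 70, 70, 70, 70],
        "14.2.8": [0, 10, 10, 30, 30, 30, 30],
    },
    index=days,
)

# Rate of change per version for every day (first day has no predecessor).
delta = counts.diff().fillna(0)

# Sum of absolute changes per day. An upgraded OSD is counted twice:
# once leaving the old version and once arriving at the new one.
change_per_day = delta.abs().sum(axis=1)

# Aggregate by day of week.
by_weekday = change_per_day.groupby(change_per_day.index.day_name()).sum()
print(by_weekday)
```

A cluster that starts or stops reporting also shows up as "change" here, which is one of the caveats to keep in mind when reading such a chart.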
23
Versions change aggregated by day-of-week
24
Placement Groups: How many per pool?
● Quite important for the even balancing of data
● Rule of thumb is to have ~100 PGs per OSD
● Should be rounded to a power of two
● Exact formula is a bit more difficult as it varies with the data distribution between pools, pool “size”, ...
● What do users do?
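The rule of thumb can be written down as a small helper. This is a sketch only: dividing by the pool "size" (replica count) and scaling by the pool's expected share of the data are assumptions based on the commonly cited heuristic, and the function and parameter names are hypothetical.

```python
def nearest_power_of_two(n: int) -> int:
    """Return the power of two closest to n (ties round up, minimum 1)."""
    if n < 1:
        return 1
    lower = 1 << (n.bit_length() - 1)  # largest power of two <= n
    upper = lower << 1
    return lower if (n - lower) < (upper - n) else upper


def suggest_pg_num(num_osds: int, pool_size: int,
                   data_fraction: float = 1.0,
                   target_per_osd: int = 100) -> int:
    """Rough pg_num suggestion for one pool.

    Assumes each PG is stored pool_size times and that the pool holds
    data_fraction of the cluster's data.
    """
    raw = num_osds * target_per_osd * data_fraction / pool_size
    return nearest_power_of_two(max(1, round(raw)))


print(suggest_pg_num(10, 3))                       # 256
print(suggest_pg_num(100, 3))                      # 4096
print(suggest_pg_num(40, 3, data_fraction=0.5))    # 512
```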
25
Top 20 pg_num values across all pools …?!
26
pg_num – power of two or not
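A sketch of how such a breakdown could be computed from a column of per-pool pg_num values. The sample values are invented and do not reflect the telemetry data.

```python
import pandas as pd


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; a power of two has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# Hypothetical pg_num values, one per pool.
pg_nums = pd.Series([8, 64, 100, 128, 100, 256, 300, 1024])

# Most common values, most frequent first.
top = pg_nums.value_counts().head(20)

# Share of pools whose pg_num is a power of two.
share = pg_nums.map(is_power_of_two).mean()

print(top)
print(f"{share:.1%} of pools use a power-of-two pg_num")
```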
27
How did the Ceph project remedy this?
Improve documentation, remove bad example, clarify impact
Improve UI/UX experience
Add HEALTH_WARN if this state is detected
Introduce pg_autoscaler to fully automate this
– Available in SUSE Enterprise Storage 6 MU
https://ceph.io/community/the-first-telemetry-results-are-in/
28
Adoption of pg_autoscaler functionality
29
Power of two pg_num with pg_autoscaler on:
30
Prioritization
What is the actual usage pattern?
How significant would an issue in a specific feature/area be?
Focus QA and assess support incident impact
But also: understand why some users are holding out on a “legacy” feature
Are we ready to deprecate something?
31
How many OSDs remain on FileStore?
32
Number of pools: Replicated / Erasure Coding
33
Number of clusters: Replicated / Erasure Coding
34
Which Erasure code plugins are used?
35
EC: which k+m values are chosen?
36
Erasure code k+m trade-offs
Space overhead and write amplification:
– Larger k: more efficient
– m: durability and availability
– More shards mean more network traffic
Data blocks tend to be power-of-two in size (4K, 4M, etc)
– Divisible by k?
Is this what users really intend? Better docs, guidance?
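The trade-offs can be made concrete with a few lines of arithmetic. This is a simplified sketch: it only computes the space overhead and checks divisibility of a 4 MiB block by k, and it does not model how Ceph actually stripes or pads data. The function name and the default block size are assumptions.

```python
def ec_profile(k: int, m: int, block_bytes: int = 4 * 1024 * 1024) -> dict:
    """Summarize an erasure-code profile with k data and m coding shards."""
    return {
        "shards": k + m,                    # more shards: more network traffic
        "raw_per_usable": (k + m) / k,      # raw bytes stored per usable byte
        "failures_tolerated": m,            # shards that may be lost
        "block_divisible_by_k": block_bytes % k == 0,
    }


for k, m in [(2, 1), (3, 2), (4, 2), (8, 3)]:
    print(f"k={k} m={m}", ec_profile(k, m))
```

For example, k=4, m=2 stores 1.5 raw bytes per usable byte and splits a 4 MiB block evenly, while k=3, m=2 has a higher overhead and a k that does not divide a power-of-two block size.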
37
Which defaults do users most frequently change?
38
Let’s talk real world sizing
Everyone wants to know what other people do
Reflects market sweet spots
Currently only a snapshot, not enough data to identify hardware trends
39
Deployed densities, device sizes (quartiles)
40
OSDs: rotational vs flash/SSD/NVMe
41
OSDs: rotational vs flash/SSD/NVMe, >=1PB
42
Future enhancements
Support different telemetry transport methods (with registration?)
Include more relevant metrics as identified by yet unanswerable questions
– Performance metrics, OSD variance, per-pool usage, client versions/numbers …
– Device and fault data for predictive failure analysis
– Data mining crash data
Automated dashboards on Ceph website
Consider how to enable this by default once acceptance is up
43
Questions? Answers!
# ceph telemetry on
Help Ceph serve you better.
https://ceph.io/resources/
mailto: lmb@suse.com
https://twitter.com/larsmb
https://www.linkedin.com/in/larsmb/
44
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a
product.  It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making
purchasing decisions.  SUSE makes no representations or warranties with respect to the contents of this document,
and specifically disclaims any express or implied warranties of merchantability or fitness for any particular
purpose.  The development, release, and timing of features or functionality described for SUSE products remains at the
sole discretion of SUSE.  Further, SUSE reserves the right to revise this document and to make changes to its content,
at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced
in this presentation are trademarks or registered trademarks of SUSE, LLC, Inc. in the United States and other
countries.  All third-party trademarks are the property of their respective owners.
