The document summarizes a presentation given by representatives from several companies on optimizing Ceph for high-performance solid state drives. It discusses testing a real workload on a Ceph cluster of 50 all-SSD OSD nodes that achieved roughly 283,000 read and 280,000 write IOPS. Areas for further optimization were identified, such as reducing latency spikes and improving single-threaded performance. The companies then described their contributions to Ceph performance, such as Intel providing hardware and tooling for testing, Samsung discussing SSD interface improvements, and SanDisk covering OSD read- and write-path optimizations.
Ceph Community Talk on High-Performance Solid State Ceph
1. Ceph Community Talk on
High-Performance Solid State Ceph
Warren Wang, Reddy Chagam, Gunna Marripudi, Allen Samuels
Oct 2015
2. 2
DISCLAIMER
The following presentation includes discussions about proposals
that may not yet be accepted in the upstream community. There is no
guarantee that all of the forward-looking items will make it through
the acceptance process, nor is there a guarantee on the timing of the
proposals.
Likewise, there is no guarantee on performance, as it may vary for a
number of reasons. Any configurations discussed should be
validated before being used in production.
4. 4
Growing High Performance Block Workloads in OpenStack
• Increasing trend for high performance, large capacity block workloads
– NoSQL and more traditional databases
• Many OpenStack operators already using Ceph
– Can we continue this trend with high performance block?
– Linear scaling performance?
• During Giant timeframe, many read improvements were made
– What about write performance?
• 90% read stats are boring and unrealistic
– Lots of talk and experimentation on user list about performance changes
– The amount of work going on was evident, and the pace of improvement in
performance characteristics was picking up
• Work directly with some of the contributors of performance changes
5. 5
Test workload
• Researched a real workload moving to OpenStack, which amounted to:
– 200K read / 200K write IOPS @ ~7 KB avg
– 100 TB data
– ~ 16 gigabits/sec read
– ~ 16 gigabits/sec write
• Borrowed some solid state compute nodes and formed a big SSD Ceph cluster for
performance testing
– 50 OSD nodes
– 3 separate MON servers
– 400 SATA SSD OSDs
– Single 10GbE
– 2x replication
– 4096 placement groups
– Clients: bare metal hosts with the kernel RBD client running fio (see the job sketch below)
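A minimal fio job sketch that approximates the workload above, assuming a kernel-mapped RBD device at /dev/rbd0; the block size, queue depth, and job count are illustrative, not the exact parameters used in this test:

# random-rw.fio — ~7 KB mixed random read/write against a kernel-mapped RBD device
[global]
ioengine=libaio
direct=1
bs=7k
rw=randrw
rwmixread=50
time_based=1
runtime=600
group_reporting=1

[rbd-kernel]
filename=/dev/rbd0      # hypothetical device; map it first with: rbd map <pool>/<image>
iodepth=32
numjobs=8

Run it with fio random-rw.fio on each client; aggregate IOPS and latency come from the group report.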
6. 6
Results
• Actual results (yours may vary); 10-minute runs
– 283,000 Read IOPS @ 2.5ms avg
– 280,000 Write IOPS @ 4.3ms avg
– Over 500,000 client IOPS, and over 1 million backend Ceph IOPS
– Performance scaled linearly with the addition of OSD nodes and OSDs
• Is this good enough?
– Reduce avg latency and spikes. 95th+ percentile starts to exceed 20ms
– Improve single threaded perf
– Better utilize each available IOP by reducing write amp in Ceph
– RGW performance
• Improvements from Dumpling days are astonishing
– Not just performance, but overall maturity
– Great time to be involved with the Ceph community
9. 9
Intel: Ceph Community Performance Contributions
• First ever Intel-hosted Ceph Hackathon with focus on performance optimization
• Intel donated an 8-node Ceph community performance cluster named ‘Incerta’
• One common baseline for performance regression tests and trend analysis
• Accessible to community contributors
• Periodic automated performance regression tests with latest builds
• Performance as a gate (desired end state)
From Mark Nelson @ Red Hat: http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/26635
• High-performance hardware: 3rd Generation Intel Xeon™ E5 processors, 3.2TB NVMe, 40GbE networking
• Supports all-HDD, hybrid (HDD + PCIe SSD), or all-PCIe-SSD configs
10. 10
• Performance tools
• Worked with Red Hat to designate CBT (the Ceph Benchmarking Toolkit) as the open source Ceph
benchmarking solution (one common tool for Ceph performance testing and analysis)
• Helped to develop standard workloads for block and object for integration into CBT (VDI, VOD,
backup, etc.)
• CBT tool hardening (e.g., error handling, reporting) in progress
• Intel upstreamed COSBench integration into CBT for RADOS Gateway testing
• CeTune for an end-user-friendly GUI and visualization for Ceph clusters
• Open source repo: https://github.com/01org/CeTune
• Performance analysis
• Developed an additional function-level LTTng tracing methodology (see the sketch after this slide)
• Created post-processing scripts that build a workload-focused, per-IO latency breakdown to find
areas for optimization
• Virtual Storage Manager (VSM)
• Open source Ceph management software to simplify deployments and speed up enterprise adoption
• Focus on how to deploy flash optimized configurations
Intel: Ceph Community Performance Contributions
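A rough sketch of the tracing workflow implied above, using the stock LTTng CLI; it assumes a Ceph build with LTTng userspace tracepoints compiled in, and the 'osd:*' provider wildcard is an assumption that depends on which tracepoint providers exist in the build:

# Capture a userspace trace while a benchmark runs
lttng create ceph-osd-trace
lttng enable-event --userspace 'osd:*'   # provider names depend on the tracepoints in the build
lttng start
# ... run the workload ...
lttng stop
lttng view > osd-trace.txt               # text dump for the per-IO post-processing scripts
lttng destroy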
11. 11
Intel: Ceph Optimization Focus Areas
• Upstream PMstore Ceph backend for persistent memory support
• Ceph client-side caching enhancement (blueprint submission underway)
- Two tiers for caching (DRAM and SSD)
- Configurable cache partitions for sequential and random IO
- Shared cache on host side (accelerate VDI workloads)
- 3rd party cache integration via pluggable architecture
• Upstream lockless C++ wrapper classes for queue, hash
• Client RBD, RADOS data-path optimization (with reduced locking, lockless queues)
• “Cache Tier” optimizations
13. 13
• The information provided in the following
presentation describes features that are still in
development. Statements regarding features or
performance are forecasts and do not constitute
guarantees of actual results, which may vary
depending on a number of factors.
DISCLAIMER
14. 14
• Latency is ~60% lower
• IOPS is 2 to 8x better
• Throughput is 2 to 6x better
Samsung: SSD Interface Improvements
• High-performance networking and higher-performance SSD devices
• New design considerations to achieve high performance!
16. 16
[Chart: cumulative IOPS (0–600K) vs. page cache hit rate (5%–100%)]
Samsung: Ceph Performance: 4 OSDs – 1 SSD, 100% Random 4KB Reads on RBD
Average IOPS from SSD: ~225K
SSD spec: ~250K
Test configuration for the scenario:
Ceph Hammer
40GbE RDMA; XIO Messenger
Samsung SM951 NVMe SSDs
FIO on RBD
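For reference, a minimal sketch of an fio job driving RBD through librbd, assuming fio was built with the rbd engine; the pool, image, and client names are placeholders:

# randread-rbd.fio — 4KB random reads through fio's librbd engine
[global]
ioengine=rbd
clientname=admin       # cephx user (placeholder)
pool=rbd               # placeholder pool name
rbdname=testimage      # placeholder image name
bs=4k
rw=randread
time_based=1
runtime=300

[rbd-randread]
iodepth=32
numjobs=4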
17. 17
• Various options under Ceph architecture
– Increase number of PGs per pool
– Increase shards per OSD (see the config sketch after this slide)
– Etc.
• Existing read path is synchronous in OSD layer
• Extend it to support asynchronous read in OSD layer
Samsung: Increase Parallelism at SSD
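A minimal ceph.conf and CLI sketch for the first two options above, assuming the post-Giant sharded op queue; the option names are the stock osd_op_num_* settings, while the values and the pool name are purely illustrative:

# ceph.conf — more parallel shards/threads per OSD (illustrative values)
[osd]
    osd op num shards = 10             # default was 5 in this era
    osd op num threads per shard = 2

# Raise the PG count on an existing pool (placeholder pool name and PG count)
ceph osd pool set rbd pg_num 8192
ceph osd pool set rbd pgp_num 8192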
18. 18
• Ceph architecture supports multiple messenger layers
– SimpleMessenger
– AsyncMessenger
– XIO Messenger
• On a 40GbE RDMA-capable NIC, 4K random read performance with IOs served
from RAM (results below; messenger config sketch after this slide)
• XIO Messenger is still experimental
• Enabling XIO Messenger to support multiple RDMA NIC ports available
on a system
Samsung: Messenger Performance Enhancements
XIO Messenger w/RDMA: ~540K IOPS
SimpleMessenger w/TCP: ~320K IOPS
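A hedged ceph.conf sketch for selecting the messenger implementation; ms_type is the real knob, but XIO support requires an Accelio-enabled build, both alternatives were experimental in this era, and some releases gate them behind an explicit experimental-features setting, so verify the exact names against your release:

# ceph.conf — messenger selection (Hammer-era, verify per release)
[global]
    ms type = simple          # default TCP SimpleMessenger
    # ms type = async         # AsyncMessenger over TCP
    # ms type = xio           # XIO Messenger over RDMA (needs an Accelio/XIO build)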
21. 21
• Began in summer of ‘13 with the Ceph Dumpling release
• Ceph optimized for HDD
– Tuning AND algorithm changes needed for Flash optimization
– Leave defaults for HDD
• Quickly determined that the OSD was the major bottleneck
– OSD maxed out at about 1000 IOPS on fastest CPUs (using ~4.5 cores)
• Examined and rejected multiple OSDs per SSD
– Failure Domain / Crush rules would be a nightmare
SanDisk: Optimizing Ceph for the all-flash Future
22. 22
• Dumpling OSD was a good design for HDD I/O rates
– Parallelism with a single HDD head in mind
– Heavy CPU / IOP – who cares???
• Need more parallelism and less CPU / IOP
• Evolution not revolution
– Eliminate bottlenecks iteratively
• Initially focused on read-path optimizations for block and object
SanDisk: OSD Optimization
23. 23
• Context switches matter at flash rates
– Too much “put it in a queue for another thread”
– Too much lock contention
• Socket handling matters too!
– Too many “get 1 byte” calls to the kernel for sockets
– Disable Nagle’s algorithm to shorten operation latency
• Lots of other simple things
– Eliminate repeated look-ups in maps, caches, etc.
– Eliminate redundant string copies (especially strings returned by value)
– Large variables passed by value instead of by const reference
• Contributed improvements to Emperor, Firefly and Giant releases
• Now obtain >80K IOPS per OSD using around 9 CPU cores per OSD
(Hammer) *
SanDisk: OSD Read path Optimization
* Internal testing normalized from 3 OSDs / 132GB DRAM / 8 Clients / 2.2 GHz XEON 2x8 Cores / Optimus Max SSDs
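The Nagle point above also maps onto an operator-visible option; a small sketch, assuming Hammer-era option names (ms_tcp_nodelay already defaults to true):

# ceph.conf — messenger socket behaviour related to the Nagle / latency point above
[global]
    ms tcp nodelay = true     # disable Nagle's algorithm on messenger sockets
    ms tcp rcvbuf = 0         # 0 = keep the kernel's default receive buffer size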
24. 24
• Write path strategy was classic HDD
– Journal writes for minimum foreground latency
– Process journal in batches in the background
• The batch-oriented journal processing was very inefficient on flash
• Modified buffering/writing strategy for Flash
– Recently committed to Infernalis release
– Yields 2.5x write throughput improvement over Hammer
– Average latency is half that of Hammer
SanDisk: OSD Write path Optimization
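The write-path change itself is code rather than configuration, but the FileStore journal and queue knobs it interacts with are the ones operators typically revisit on flash; a sketch with purely illustrative values, assuming Hammer/Infernalis-era option names:

# ceph.conf — FileStore journal / queue tunables often revisited on flash (values illustrative)
[osd]
    journal max write entries = 1000
    journal max write bytes = 104857600
    filestore max sync interval = 10
    filestore queue max ops = 500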
25. 25
• RDMA intra-cluster communication
– Significant reduction in CPU / IOP
• NewStore
– Significant reduction in write amplification -> even higher write performance
• Memory allocation
– tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs. default (see the sketch after this slide) *
* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
SanDisk: Potential Future Improvements
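A sketch of how the allocator tuning referenced above is usually applied, assuming packaging that reads /etc/sysconfig/ceph (Debian-family systems use /etc/default/ceph); the cache size shown and the jemalloc library path are illustrative:

# /etc/sysconfig/ceph — enlarge tcmalloc's per-process thread cache (default was 32 MB in this era)
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

# Or experiment with jemalloc by preloading it for an OSD (path is distro-specific)
# LD_PRELOAD=/usr/lib64/libjemalloc.so.1 ceph-osd -f -i 0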
26. Thank you Ceph and OpenStack community!