The document summarizes a presentation given by representatives from several companies on optimizing Ceph for high-performance solid state drives. It discusses testing a real workload on a Ceph cluster of 50 all-SSD OSD nodes that achieved roughly 283,000 read and 280,000 write IOPS. Areas for further optimization were identified, such as reducing latency spikes and improving single-threaded performance. The companies then described their contributions to Ceph performance, such as Intel providing hardware and tooling for testing, Samsung discussing SSD interface improvements, and SanDisk covering OSD read- and write-path optimizations.
Ceph Community Talk on High-Performance Solid State Ceph
1. Ceph Community Talk on
High-Performance Solid State Ceph
Warren Wang, Reddy Chagam, Gunna Marripudi, Allen Samuels
Oct 2015
2. 2
DISCLAIMER
The following presentation includes discussions about proposals
that may not yet be accepted in the upstream community. There is no
guarantee that all of the forward-looking items will make it through
the acceptance process, nor is there a guarantee on the timing of the
proposals.
Likewise, there is no guarantee on performance, as it may vary for a
number of reasons. Any configurations discussed should be
validated before being used in production.
4. 4
Growing High Performance Block Workloads in OpenStack
• Increasing trend for high performance, large capacity block workloads
– NoSQL and more traditional databases
• Many OpenStack operators already using Ceph
– Can we continue this trend with high performance block?
– Linear scaling performance?
• During Giant timeframe, many read improvements were made
– What about write performance?
• 90% read stats are boring and unrealistic
– Lots of talk and experimentation on user list about performance changes
– The amount of work going on was evident, and the pace of improvement in
performance characteristics was picking up
• Work directly with some of the contributors of performance changes
5. 5
Test workload
• Researched a real workload moving to OpenStack, which amounted to:
– 200K read / 200K write IOPS @ ~7 KB avg
– 100 TB data
– ~ 16 gigabits/sec read
– ~ 16 gigabits/sec write
• Borrowed some solid state compute nodes and formed a big SSD Ceph cluster for
performance testing
– 50 OSD nodes
– 3 separate MON servers
– 400 SATA SSD OSDs
– Single 10GbE
– 2x replication
– 4096 placement groups
– Clients: bare metal hosts with the kernel RBD client running fio (see the job sketch below)
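A minimal fio job sketch that approximates the workload above, assuming a kernel-mapped RBD device at /dev/rbd0; the block size, queue depth, and job count are illustrative, not the exact parameters used in this test:

# random-rw.fio — ~7 KB mixed random read/write against a kernel-mapped RBD device
[global]
ioengine=libaio
direct=1
bs=7k
rw=randrw
rwmixread=50
time_based=1
runtime=600
group_reporting=1

[rbd-kernel]
filename=/dev/rbd0      # hypothetical device; map it first with: rbd map <pool>/<image>
iodepth=32
numjobs=8

Run it with fio random-rw.fio on each client; aggregate IOPS and latency come from the group report.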
6. 6
Results
• Actual results (yours may vary); 10-minute runs
– 283,000 Read IOPS @ 2.5ms avg
– 280,000 Write IOPS @ 4.3ms avg
– Over 500,000 client IOPS, and over 1 million backend Ceph IOPS
– Performance scaled linearly with the addition of OSD nodes and OSDs
• Is this good enough?
– Reduce avg latency and spikes. 95th+ percentile starts to exceed 20ms
– Improve single threaded perf
– Better utilize each available IOP by reducing write amp in Ceph
– RGW performance
• Improvements from Dumpling days are astonishing
– Not just performance, but overall maturity
– Great time to be involved with the Ceph community
9. 9
Intel: Ceph Community Performance Contributions
• First ever Intel-hosted Ceph Hackathon with focus on performance optimization
• Intel donated an 8-node Ceph community performance cluster named ‘Incerta’
• One common baseline for performance regression tests and trend analysis
• Accessible to community contributors
• Periodic automated performance regression tests with latest builds
• Performance as a gate (desired end state)
From Mark Nelson @ Red Hat: http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/26635
• High-performance hardware: 3rd Generation Intel Xeon™ E5 processors, 3.2TB NVMe, 40GbE networking
• Supports all-HDD, hybrid (HDD + PCIe SSD), or all-PCIe-SSD configs
10. 10
• Performance tools
• Worked with Red Hat to designate CBT (the Ceph Benchmarking Toolkit) as the open source Ceph
benchmarking solution (one common tool for Ceph performance testing and analysis)
• Helped to develop standard workloads for block and object for integration into CBT (VDI, VOD,
backup, etc.)
• CBT tool hardening (e.g., error handling, reporting) in progress
• Intel upstreamed COSBench integration into CBT for RADOS Gateway testing
• CeTune for an end-user-friendly GUI and visualization for Ceph clusters
• Open source repo: https://github.com/01org/CeTune
• Performance analysis
• Developed an additional function-level LTTng tracing methodology (see the sketch after this slide)
• Created post-processing scripts that build a workload-focused, per-IO latency breakdown to find
areas for optimization
• Virtual Storage Manager (VSM)
• Open source Ceph management software to simplify deployments and speed up enterprise adoption
• Focus on how to deploy flash optimized configurations
Intel: Ceph Community Performance Contributions
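A rough sketch of the tracing workflow implied above, using the stock LTTng CLI; it assumes a Ceph build with LTTng userspace tracepoints compiled in, and the 'osd:*' provider wildcard is an assumption that depends on which tracepoint providers exist in the build:

# Capture a userspace trace while a benchmark runs
lttng create ceph-osd-trace
lttng enable-event --userspace 'osd:*'   # provider names depend on the tracepoints in the build
lttng start
# ... run the workload ...
lttng stop
lttng view > osd-trace.txt               # text dump for the per-IO post-processing scripts
lttng destroy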
11. 11
Intel: Ceph Optimization Focus Areas
• Upstream PMstore Ceph backend for persistent memory support
• Ceph client-side caching enhancement (blueprint submission underway)
- Two tiers for caching (DRAM and SSD)
- Configurable cache partitions for sequential and random IO
- Shared cache on host side (accelerate VDI workloads)
- 3rd party cache integration via pluggable architecture
• Upstream lockless C++ wrapper classes for queue, hash
• Client RBD, RADOS data-path optimization (with reduced locking, lockless queues)
• “Cache Tier” optimizations
13. 13
• The information provided in the following
presentation describes features that are still in
development. Statements regarding features or
performance are forecasts and do not constitute
guarantees of actual results, which may vary
depending on a number of factors.
DISCLAIMER
14. 14
• Latency is ~60% lower
• IOPS is 2 to 8x better
• Throughput is 2 to 6x better
Samsung: SSD Interface Improvements
• High-performance networking and higher-performance SSD devices
• New design considerations to achieve high performance!
16. 16
[Chart: cumulative IOPS (0–600K) vs. page cache hit rate (5%–100%)]
Samsung: Ceph Performance: 4 OSDs – 1 SSD, 100% Random 4KB Reads on RBD
Average IOPS from SSD: ~225K
SSD spec: ~250K
Test configuration for the scenario:
Ceph Hammer
40GbE RDMA; XIO Messenger
Samsung SM951 NVMe SSDs
FIO on RBD
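For reference, a minimal sketch of an fio job driving RBD through librbd, assuming fio was built with the rbd engine; the pool, image, and client names are placeholders:

# randread-rbd.fio — 4KB random reads through fio's librbd engine
[global]
ioengine=rbd
clientname=admin       # cephx user (placeholder)
pool=rbd               # placeholder pool name
rbdname=testimage      # placeholder image name
bs=4k
rw=randread
time_based=1
runtime=300

[rbd-randread]
iodepth=32
numjobs=4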
17. 17
• Various options under Ceph architecture
– Increase number of PGs per pool
– Increase shards per OSD (see the config sketch after this slide)
– Etc.
• Existing read path is synchronous in OSD layer
• Extend it to support asynchronous read in OSD layer
Samsung: Increase Parallelism at SSD
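A minimal ceph.conf and CLI sketch for the first two options above, assuming the post-Giant sharded op queue; the option names are the stock osd_op_num_* settings, while the values and the pool name are purely illustrative:

# ceph.conf — more parallel shards/threads per OSD (illustrative values)
[osd]
    osd op num shards = 10             # default was 5 in this era
    osd op num threads per shard = 2

# Raise the PG count on an existing pool (placeholder pool name and PG count)
ceph osd pool set rbd pg_num 8192
ceph osd pool set rbd pgp_num 8192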
18. 18
• Ceph architecture supports multiple messenger layers
– SimpleMessenger
– AsyncMessenger
– XIO Messenger
• On a 40GbE RDMA-capable NIC, 4K random read performance with IOs served
from RAM (results below; messenger config sketch after this slide)
• XIO Messenger is still experimental
• Enabling XIO Messenger to support multiple RDMA NIC ports available
on a system
Samsung: Messenger Performance Enhancements
XIO Messenger w/RDMA: ~540K IOPS
SimpleMessenger w/TCP: ~320K IOPS
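A hedged ceph.conf sketch for selecting the messenger implementation; ms_type is the real knob, but XIO support requires an Accelio-enabled build, both alternatives were experimental in this era, and some releases gate them behind an explicit experimental-features setting, so verify the exact names against your release:

# ceph.conf — messenger selection (Hammer-era, verify per release)
[global]
    ms type = simple          # default TCP SimpleMessenger
    # ms type = async         # AsyncMessenger over TCP
    # ms type = xio           # XIO Messenger over RDMA (needs an Accelio/XIO build)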
21. 21
• Began in summer of ‘13 with the Ceph Dumpling release
• Ceph optimized for HDD
– Tuning AND algorithm changes needed for Flash optimization
– Leave defaults for HDD
• Quickly determined that the OSD was the major bottleneck
– OSD maxed out at about 1000 IOPS on fastest CPUs (using ~4.5 cores)
• Examined and rejected multiple OSDs per SSD
– Failure Domain / Crush rules would be a nightmare
SanDisk: Optimizing Ceph for the all-flash Future
22. 22
• Dumpling OSD was a good design for HDD I/O rates
– Parallelism with a single HDD head in mind
– Heavy CPU / IOP – who cares???
• Need more parallelism and less CPU / IOP
• Evolution not revolution
– Eliminate bottlenecks iteratively
• Initially focused on read-path optimizations for block and object
SanDisk: OSD Optimization
23. 23
• Context switches matter at flash rates
– Too much “put it in a queue for another thread”
– Too much lock contention
• Socket handling matters too!
– Too many “get 1 byte” calls to the kernel for sockets
– Disable Nagle’s algorithm to shorten operation latency
• Lots of other simple things
– Eliminate repeated look-ups in maps, caches, etc.
– Eliminate redundant string copies (especially strings returned by value)
– Large variables passed by value instead of by const reference
• Contributed improvements to Emperor, Firefly and Giant releases
• Now obtain >80K IOPS per OSD using around 9 CPU cores per OSD
(Hammer) *
SanDisk: OSD Read path Optimization
* Internal testing normalized from 3 OSDs / 132GB DRAM / 8 Clients / 2.2 GHz XEON 2x8 Cores / Optimus Max SSDs
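The Nagle point above also maps onto an operator-visible option; a small sketch, assuming Hammer-era option names (ms_tcp_nodelay already defaults to true):

# ceph.conf — messenger socket behaviour related to the Nagle / latency point above
[global]
    ms tcp nodelay = true     # disable Nagle's algorithm on messenger sockets
    ms tcp rcvbuf = 0         # 0 = keep the kernel's default receive buffer size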
24. 24
• Write path strategy was classic HDD
– Journal writes for minimum foreground latency
– Process journal in batches in the background
• The batch-oriented journal processing was very inefficient on flash
• Modified buffering/writing strategy for Flash
– Recently committed to Infernalis release
– Yields 2.5x write throughput improvement over Hammer
– Average latency is half that of Hammer
SanDisk: OSD Write path Optimization
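The write-path change itself is code rather than configuration, but the FileStore journal and queue knobs it interacts with are the ones operators typically revisit on flash; a sketch with purely illustrative values, assuming Hammer/Infernalis-era option names:

# ceph.conf — FileStore journal / queue tunables often revisited on flash (values illustrative)
[osd]
    journal max write entries = 1000
    journal max write bytes = 104857600
    filestore max sync interval = 10
    filestore queue max ops = 500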
25. 25
• RDMA intra-cluster communication
– Significant reduction in CPU / IOP
• NewStore
– Significant reduction in write amplification -> even higher write performance
• Memory allocation
– tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs. default (see the sketch after this slide) *
* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
SanDisk: Potential Future Improvements
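A sketch of how the allocator tuning referenced above is usually applied, assuming packaging that reads /etc/sysconfig/ceph (Debian-family systems use /etc/default/ceph); the cache size shown and the jemalloc library path are illustrative:

# /etc/sysconfig/ceph — enlarge tcmalloc's per-process thread cache (default was 32 MB in this era)
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

# Or experiment with jemalloc by preloading it for an OSD (path is distro-specific)
# LD_PRELOAD=/usr/lib64/libjemalloc.so.1 ceph-osd -f -i 0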
26. Thank you Ceph and OpenStack community!