Ceph Community Talk on
High-Performance Solid State Ceph
Warren Wang, Reddy Chagam, Gunna Marripudi, Allen Samuels
Oct 2015
2
DISCLAIMER
The following presentation includes discussions about proposals that may not yet be accepted in the upstream community. There is no guarantee that all of the forward-looking items will make it through the acceptance process, nor is there a guarantee on the timing of the proposals.
Likewise, there is no guarantee on performance, as it may vary for a number of reasons. Any configs discussed should be validated before being used in production.
Cloud Powered
3
Introductions
• Warren Wang - Walmart Technology
• Reddy Chagam - Intel
• Gunna Marripudi - Samsung
• Allen Samuels - SanDisk
Cloud Powered
4
Growing High Performance Block Workloads in OpenStack
• Increasing trend for high performance, large capacity block workloads
– NoSQL and more traditional databases
• Many OpenStack operators already using Ceph
– Can we continue this trend with high performance block?
– Linear scaling performance?
• During Giant timeframe, many read improvements were made
– What about write performance?
• 90% read stats are boring and unrealistic
– Lots of talk and experimentation on user list about performance changes
– The amount of ongoing work was evident, and performance characteristics were improving quickly
• Worked directly with some of the contributors of those performance changes
Cloud Powered
5
Test workload
• Researched a real workload moving to OpenStack, which amounted to:
– 200K read / 200K write IOPS @ ~7 KB avg
– 100 TB data
– ~ 16 gigabits/sec read
– ~ 16 gigabits/sec write
• Borrowed some solid state compute nodes and formed a big SSD Ceph cluster for
performance testing
– 50 OSD nodes
– 3 separate MON servers
– 400 SATA SSD OSDs
– Single 10GbE
– 2x replication
– 4096 placement groups (sizing arithmetic sketched below)
– Clients: Bare metal with kernel client running fio
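As a quick sanity check on this layout, the pool's placement-group count works out to roughly 20 PG copies per OSD. A minimal sketch of that back-of-the-envelope arithmetic, using only the numbers from the slide (the program itself is illustrative, not part of the original deck):

```cpp
// Back-of-the-envelope sizing for the test cluster described above.
// Inputs come from the slide; the formula is the usual PG-per-OSD estimate
// (pg_num * replica_count / osd_count).
#include <cstdio>

int main() {
    const double pg_num        = 4096;  // placement groups in the pool
    const double replica_count = 2;     // 2x replication
    const double osd_count     = 400;   // 400 SATA SSD OSDs
    const double osd_nodes     = 50;    // 50 OSD nodes

    std::printf("OSDs per node:     %.0f\n", osd_count / osd_nodes);               // 8
    std::printf("PG copies per OSD: %.1f\n", pg_num * replica_count / osd_count);  // ~20.5
    return 0;
}
```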
Cloud Powered
6
Results
• Actual results, yours may vary. 10-minute runs:
– 283,000 Read IOPS @ 2.5ms avg
– 280,000 Write IOPS @ 4.3ms avg
– Over 500,000 client IOPS, and over 1 million backend Ceph IOPS (see the sketch after this list)
– Performance scaled linearly with the addition of OSD nodes and OSDs
• Is this good enough?
– Reduce avg latency and spikes. 95th+ percentile starts to exceed 20ms
– Improve single threaded perf
– Better utilize each available IOP by reducing write amplification in Ceph
– RGW performance
• Improvements from Dumpling days are astonishing
– Not just performance, but overall maturity
– Great time to be involved with the Ceph community
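One way to see how roughly 563K client IOPS can turn into "over 1 million" backend Ceph IOPS is to account for replication and journaling. The sketch below assumes reads are served from a single replica and that each replicated write also incurs a journal write; those multipliers are FileStore-era assumptions used for illustration, not figures stated in the deck:

```cpp
// Rough model of client-to-backend IOPS fan-out for the results above.
// Assumptions (not from the deck): reads hit one replica; every write goes
// to both replicas, and each replica does a journal write plus a data write.
#include <cstdio>

int main() {
    const double client_read_iops   = 283000;  // measured client reads
    const double client_write_iops  = 280000;  // measured client writes
    const double replicas           = 2;       // 2x replication
    const double writes_per_replica = 2;       // assumed: journal + data write

    const double backend_reads  = client_read_iops;
    const double backend_writes = client_write_iops * replicas * writes_per_replica;

    std::printf("client IOPS:  %.0f\n", client_read_iops + client_write_iops);  // ~563K
    std::printf("backend IOPS: %.0f\n", backend_reads + backend_writes);        // ~1.4M
    return 0;
}
```

This is also why the write-amplification item above matters: every backend write avoided multiplies back into usable client IOPS.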
Cloud Powered
7 Cloud Powered
Reddy Chagam
8
• Intel plans in this presentation do not constitute Intel plan of record product roadmaps.
• All products, dates, and figures specified are preliminary based on current expectations,
and are subject to change without notice. Intel may make changes to specifications and
product descriptions at any time, without notice.
• Intel technologies’ features and benefits depend on system configuration and may
require enabled hardware, software or service activation. Performance varies depending
on system configuration. No computer system can be absolutely secure. Software and
workloads used in performance tests may have been optimized for performance only on
Intel microprocessors.
• Copyright © 2015 Intel Corporation. All rights reserved. Intel, Intel Inside, the Intel logo
are trademarks of Intel Corporation in the United States and other countries.
Intel Disclaimers
9
Intel: Ceph Community Performance Contributions
• First ever Intel-hosted Ceph Hackathon with focus on performance optimization
• Intel donated an 8-node Ceph community performance cluster named ‘Incerta’
• One common baseline for performance regression tests and trend analysis
• Accessible to community contributors
• Periodic automated performance regression tests with latest builds
• Performance as a gate (desired end state)
From Mark Nelson @ RedHat: http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/26635
• High-performance hardware: 3rd Generation Intel Xeon™ E5 processors, 3.2TB NVMe, 40GbE networking
• Supports all-HDD, hybrid (HDD + PCIe SSD), or all-PCIe-SSD configs
10
• Performance tools
• Worked with RedHat to designate CBT (Ceph Benchmarking Toolkit) as the open source Ceph benchmarking solution (one common tool for Ceph performance testing and analysis)
• Helped develop standard workloads for block and object for integration into CBT (VDI, VOD, backup, etc.)
• CBT tool hardening (e.g., error handling, reporting) in progress
• Intel upstreamed COSBench integration into CBT for RADOS Gateway testing
• CeTune for an end-user-friendly GUI and visualization for Ceph clusters
• Open source repo: https://github.com/01org/CeTune
• Performance analysis
• Developed additional function-level LTTng tracing methodology
• Created post-processing scripts that build a workload-focused per-IO latency breakdown to find areas for optimization
• Virtual Storage Manager (VSM)
• Open source Ceph management software to simplify deployments and speed up enterprise adoption
• Focus on how to deploy flash optimized configurations
Intel: Ceph Community Performance Contributions
11
Intel: Ceph Optimization Focus Areas
• Upstream PMstore Ceph backend for persistent memory support
• Ceph client-side caching enhancement (blueprint submission underway)
- Two tiers for caching (DRAM and SSD)
- Configurable cache partitions for sequential and random IO
- Shared cache on host side (accelerate VDI workloads)
- 3rd party cache integration via pluggable architecture
• Upstream lockless C++ wrapper classes for queue, hash (illustrated below)
• Client RBD, RADOS data-path optimization (with reduced locking, lockless queues)
• “Cache Tier” optimizations
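To illustrate the kind of lockless primitive referred to above, here is a minimal single-producer/single-consumer ring buffer; it is a generic sketch of the technique, not the actual wrapper classes being upstreamed:

```cpp
// Minimal single-producer/single-consumer lockless ring buffer.
// Shows the acquire/release handshake that replaces a mutex; the real
// queue/hash wrappers discussed above are more general than this.
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N>
class SpscQueue {
  std::array<T, N> buf_;
  std::atomic<std::size_t> head_{0};  // next slot to read (consumer-owned)
  std::atomic<std::size_t> tail_{0};  // next slot to write (producer-owned)

 public:
  bool push(const T& v) {  // producer thread only
    std::size_t t = tail_.load(std::memory_order_relaxed);
    std::size_t next = (t + 1) % N;
    if (next == head_.load(std::memory_order_acquire)) return false;  // full
    buf_[t] = v;
    tail_.store(next, std::memory_order_release);  // publish the new element
    return true;
  }

  std::optional<T> pop() {  // consumer thread only
    std::size_t h = head_.load(std::memory_order_relaxed);
    if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
    T v = buf_[h];
    head_.store((h + 1) % N, std::memory_order_release);  // free the slot
    return v;
  }
};
```

A multi-producer/multi-consumer variant is what a data path with many worker threads would actually need, but the core idea is the same: replace the mutex-protected queue with atomics so enqueue and dequeue never block or context-switch.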
12
Gunna Marripudi, Principal Storage Architect
13
• The information provided in the following presentation describes features that are still in development. Statements regarding features or performance are forecasts and do not constitute guarantees of actual results, which may vary depending on a number of factors.
DISCLAIMER
14
Samsung: SSD Interface Improvements
• Latency is ~60% lower
• IOPS is 2 to 8x better
• Throughput is 2 to 6x better
High-performance networking and higher-performance SSD devices bring new design considerations to achieve high performance!
15
Samsung: Ceph Performance: 4 OSDs – 4 SSDs, 100% Random Reads, 4KB on RBD
[Chart: aggregate IOPS over a ~300-second run; series: IOPS from Page Cache, IOPS from SM951 NVMe SSD]
Average IOPS per SSD: ~65K
SSD Spec: ~250K
Test configuration for the scenario:
Ceph Hammer
40GbE RDMA; XIO Messenger
Samsung SM951 NVMe SSDs
FIO on RBD
16
Samsung: Ceph Performance: 4 OSDs – 1 SSD, 100% Random Reads, 4KB on RBD
[Chart: cumulative IOPS vs. page cache hit rate, 5%–100%]
Average IOPS from SSD: ~225K
SSD Spec: ~250K
Test configuration for the scenario:
Ceph Hammer
40GbE RDMA; XIO Messenger
Samsung SM951 NVMe SSDs
FIO on RBD
17
• Various options under Ceph architecture
– Increase number of PGs per pool
– Increase shards per OSD
– Etc.
• Existing read path is synchronous in the OSD layer
• Extend it to support asynchronous reads in the OSD layer (sketched below)
Samsung: Increase Parallelism at SSD
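A conceptual sketch of that synchronous-to-asynchronous change. The interfaces below (ObjectReader, ReadCompletion) are hypothetical stand-ins rather than Ceph's actual ObjectStore API, and the "device" is simulated with a thread; the point is only the shape of the change: instead of blocking an OSD worker thread on the SSD, the read is submitted with a completion callback so the thread can service the next request.

```cpp
// Blocking read vs. callback-driven asynchronous read, as a generic sketch.
// ObjectReader is hypothetical, not Ceph's ObjectStore interface; the async
// variant fakes a device completion with a detached thread.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>
#include <vector>

using Buffer = std::vector<char>;
using ReadCompletion = std::function<void(int ret, Buffer data)>;

struct ObjectReader {
  // Synchronous style: the calling OSD worker thread blocks until the SSD
  // returns the data, so one thread drives only one outstanding read.
  int read_sync(const std::string& oid, std::size_t off, std::size_t len, Buffer* out) {
    (void)oid; (void)off;
    *out = Buffer(len, 'x');  // pretend 'len' bytes came back from the device
    return 0;
  }

  // Asynchronous style: submit the read and return immediately; the
  // completion fires later, so one thread can keep many reads in flight.
  void read_async(std::string oid, std::size_t off, std::size_t len, ReadCompletion on_done) {
    (void)oid; (void)off;
    std::thread([len, on_done] {
      on_done(0, Buffer(len, 'x'));  // simulated device completion
    }).detach();
  }
};

int main() {
  ObjectReader store;

  Buffer b;
  store.read_sync("obj1", 0, 4096, &b);
  std::printf("sync read returned %zu bytes\n", b.size());

  store.read_async("obj2", 0, 4096, [](int ret, Buffer data) {
    std::printf("async read completed: ret=%d, %zu bytes\n", ret, data.size());
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(50));  // let the callback run
  return 0;
}
```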
18
• Ceph architecture supports multiple messenger layers
– SimpleMessenger
– AsyncMessenger
– XIO Messenger
• On a 40GbE RDMA-capable NIC – 4K random read performance with IOs served from RAM (comparison below)
• XIO Messenger is still experimental
• Enabling XIO Messenger to support multiple RDMA NIC ports available on a system
Samsung: Messenger Performance Enhancements
XIO Messenger w/RDMA: ~540K IOPS
SimpleMessenger w/TCP: ~320K IOPS
19
Samsung: Summary
• Commodity hardware
• High-performance SSDs and networking
• Community collaboration
• Ceph enhancements for performance
• High-performance workloads on Ceph!
20
Allen Samuels
Software Architect, Software and Systems Solutions
October 28, 2015
21
• Began in summer of ‘13 with the Ceph Dumpling release
• Ceph optimized for HDD
– Tuning AND algorithm changes needed for Flash optimization
– Leave defaults for HDD
• Quickly determined that the OSD was the major bottleneck
– OSD maxed out at about 1000 IOPS on fastest CPUs (using ~4.5 cores)
• Examined and rejected multiple OSDs per SSD
– Failure domain / CRUSH rules would be a nightmare
SanDisk: Optimizing Ceph for the all-flash Future
22
• Dumpling OSD was a good design for HDD I/O rates
– Parallelism with a single HDD head in mind
– Heavy CPU / IOP – who cares???
• Need more parallelism and less CPU / IOP
• Evolution not revolution
– Eliminate bottlenecks iteratively
• Initially focused on read-path optimizations for block and object
SanDisk: OSD Optimization
23
• Context switches matter at flash rates
– Too much “put it in a queue for another thread”
– Too much lock contention
• Socket handling matters too!
– Too many “get 1 byte” calls to the kernel for sockets
– Disable Nagle’s algorithm to shorten operation latency (see the sketch below)
• Lots of other simple things
– Eliminate repeated look-ups in maps, caches, etc.
– Eliminate redundant string copies (especially returning a string by value)
– Fix large variables passed by value rather than by const reference
• Contributed improvements to Emperor, Firefly and Giant releases
• Now obtain >80K IOPS / OSD using around 9 CPU cores / OSD (Hammer) *
SanDisk: OSD Read path Optimization
* Internal testing normalized from 3 OSDs / 132GB DRAM / 8 Clients / 2.2 GHz XEON 2x8 Cores / Optimus Max SSDs
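Two of the fixes above in miniature, as a generic POSIX/C++ sketch rather than the actual Ceph patches: disabling Nagle's algorithm on a connected socket, and taking a large argument by const reference instead of by value.

```cpp
// Illustrations of two read-path items above: disabling Nagle's algorithm
// on a client socket, and avoiding copies of large arguments. Generic
// POSIX/C++ sketch, not the Ceph messenger or OSD code itself.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#include <cstddef>
#include <map>
#include <string>

// Small request/response messages are sent immediately instead of being
// coalesced by Nagle, shortening operation latency.
int disable_nagle(int fd) {
  int one = 1;
  return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

// By-value parameter: copies the whole map on every call.
std::size_t count_by_value(std::map<std::string, std::string> m) {
  return m.size();
}

// Const reference: same behaviour, no copy -- the cheap fix called out above.
std::size_t count_by_ref(const std::map<std::string, std::string>& m) {
  return m.size();
}
```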
24
• Write path strategy was classic HDD
– Journal writes for minimum foreground latency
– Process journal in batches in the background
• The batch-oriented processing was very inefficient on flash (the classic pattern is sketched below)
• Modified buffering/writing strategy for Flash
– Recently committed to Infernalis release
– Yields 2.5x write throughput improvement over Hammer
– Average latency is half that of Hammer
SanDisk: OSD Write path Optimization
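For reference, a much-simplified sketch of the classic journal-then-apply pattern described above (not FileStore's actual code): the foreground path appends to a journal and acknowledges right away, while a background thread applies whatever has accumulated in batches. That division of labour is cheap for a single disk head but wasteful on flash, which is what the Infernalis-era change addresses.

```cpp
// Simplified journal-then-batch-apply write path, illustrating the classic
// HDD-oriented strategy described above; FileStore itself is far more involved.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct WriteOp {
  std::string oid;
  std::string data;
};

class JournaledStore {
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<WriteOp> journal_;  // stand-in for the on-disk journal
  bool stop_ = false;
  std::thread applier_{[this] { apply_loop(); }};

  void apply_loop() {
    for (;;) {
      std::vector<WriteOp> batch;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return stop_ || !journal_.empty(); });
        if (stop_ && journal_.empty()) return;
        // Drain whatever accumulated: the batch-oriented step that is cheap
        // on an HDD but a source of extra work and jitter on flash.
        batch.assign(journal_.begin(), journal_.end());
        journal_.clear();
      }
      for (const WriteOp& op : batch) {
        (void)op;  // apply the op to the backing store here
      }
    }
  }

 public:
  // Foreground path: append to the journal and acknowledge immediately;
  // the real apply happens later, off the latency-critical path.
  void queue_write(WriteOp op) {
    {
      std::lock_guard<std::mutex> l(mu_);
      journal_.push_back(std::move(op));
    }
    cv_.notify_one();
  }

  ~JournaledStore() {
    {
      std::lock_guard<std::mutex> l(mu_);
      stop_ = true;
    }
    cv_.notify_one();
    applier_.join();
  }
};

int main() {
  JournaledStore store;
  for (int i = 0; i < 4; ++i) {
    store.queue_write({"obj" + std::to_string(i), std::string(4096, 'x')});
  }
  // Destructor drains the journal and joins the applier thread.
  return 0;
}
```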
25
• RDMA intra-cluster communication
– Significant reduction in CPU / IOP
• NewStore
– Significant reduction in write amplification -> even higher write performance
• Memory allocation
– tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs. default *
* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
SanDisk: Potential Future Improvements
Thank you Ceph and OpenStack community!