Investor Day 2014 | May 7, 2014
Optimizing Ceph for All-Flash Architectures
Vijayendra Shamanna (Viju)
Senior Manager, Systems and Software Solutions
Forward-Looking Statements
During our meeting today we may make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a
forward-looking statement, including those relating to future technologies, products and applications. Actual results
may differ materially from those expressed in these forward-looking statements due to factors detailed under the
caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including our annual
and quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof.
What is Ceph?
• A Distributed/Clustered Storage System
• File, Object and Block interfaces to a common storage substrate
• Ceph is a mostly LGPL open source project
  – Part of the Linux kernel and most distros since 2.6.x
Ceph Architecture
[Diagram: multiple clients connect over an IP network to a set of storage nodes]
Ceph Deployment Modes
[Diagram: two deployment examples: a web server using the RADOS Gateway (RGW) on top of librados, and a KVM guest whose block interface is provided through /dev/rbd]
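As an illustration of the block deployment path (the deck itself shows only the diagram), the commands below create an RBD image and either expose it through the kernel driver or hand it to a KVM guest via librbd. Pool and image names are placeholders.

    # Create a 10 GiB image and map it through the kernel RBD driver (appears as /dev/rbdN)
    rbd create rbd/vm-disk1 --size 10240
    rbd map rbd/vm-disk1

    # Or attach the same image to a KVM guest through librbd (remaining guest options omitted)
    qemu-system-x86_64 -drive format=raw,file=rbd:rbd/vm-disk1,if=virtio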
Ceph Operation
[Diagram: a client/application uses an interface (block, object, file) on top of librados; each object name is hashed to a placement group (PG), and placement rules map each PG onto a set of OSDs]
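The name-to-OSD mapping sketched above can be inspected with the ceph CLI; the pool and object names below are placeholders.

    # Shows which PG the object name hashes to and which OSDs CRUSH selects for that PG
    ceph osd map rbd my-object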
Ceph Placement Rules and Data Protection
[Diagram: replicas of placement groups PG[a], PG[b] and PG[c] are spread across different servers and racks according to the placement rules]
Ceph’s Built-in Data Integrity & Recovery
• Checksum-based “patrol reads”
  – Local read of object, exchange of checksum over network
• Journal-based recovery
  – Short downtime (tunable)
  – Re-syncs only changed objects
• Object-based recovery
  – Long downtime OR partial storage loss
  – Object checksums exchanged over network
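The patrol reads described above correspond to Ceph scrubbing. Scrubs run on a schedule, and can also be requested by hand; the PG id below is a placeholder.

    ceph pg scrub 1.0        # light scrub: compares object metadata across replicas
    ceph pg deep-scrub 1.0   # deep scrub: reads object data and exchanges checksums over the network
    ceph pg repair 1.0       # repair an inconsistency found by scrubbing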
Enterprise Storage Features
OBJECT STORAGE
• S3 & Swift
• Multi-tenant
• RESTful Interface
• Keystone authentication
• Geo-Replication
• Erasure Coding
• Striped Objects
BLOCK STORAGE
• Thin Provisioning
• Copy-on-Write Clones
• Snapshots
• Incremental backups
• OpenStack integration
• Configurable Striping
• iSCSI
• Linux Kernel
FILE SYSTEM *
• POSIX compliance
• Distributed Metadata
• Dynamic Rebalancing
• Configurable Striping
• CIFS/NFS
* Future
Software Architecture: Key Components of a System Built for Hyperscale
• Storage Clients
  – Standard interfaces to use the data (POSIX, Device, Swift/S3…)
  – Transparent for applications
  – Intelligent, coordinates with peers
• Object Storage Cluster (OSDs)
  – Stores all data and metadata in flexible-sized containers, called objects
• Cluster Monitors
  – Lightweight processes for authentication, cluster membership and critical cluster state
• Clients authenticate with the monitors and send I/O directly to the OSDs for scalable performance
• No gateways, brokers, lookup tables, indirection …
[Diagram: compute-side clients issue block and object I/O directly to the OSDs on the storage side, while the monitors distribute the cluster maps]
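Reflecting the “no gateways, no brokers” design, a client needs little more than the monitor addresses and a key. A minimal client-side ceph.conf sketch (addresses and paths are placeholders):

    [global]
    mon host = 10.0.0.1,10.0.0.2,10.0.0.3
    keyring  = /etc/ceph/ceph.client.admin.keyring
    # The client fetches the cluster maps from the monitors, then sends all data I/O directly to the OSDs.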
Ceph Interface Ecosystem
[Diagram: a client stack on top of RADOS (Reliable, Autonomous, Distributed Object Store): native object access via librados; a file system via libcephfs (kernel device + FUSE, HDFS, NFS, Samba); object storage via the RADOS Gateway (RGW) with S3 and Swift APIs; and a block interface via librbd (kernel client, iSCSI)]
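As a rough illustration (pool, object, image and mount point names are placeholders), each interface in the stack has its own client tooling:

    rados -p mypool put obj1 ./file.bin    # native object access via librados
    rbd map mypool/vol1                    # block device via the kernel RBD client
    ceph-fuse /mnt/cephfs                  # file system via libcephfs/FUSE (needs ceph.conf and a keyring)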
OSD Architecture
[Diagram: network messages from clients and cluster peers arrive at the Messenger and are dispatched onto the OSD thread pools (work queues); requests flow through the PG Backend (replication or erasure coding) into the ObjectStore, which applies transactions via one of its backends: FileStore (a JournalingObjectStore), KeyValueStore or MemStore]
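Which ObjectStore backend an OSD uses is a configuration choice. A sketch of the relevant ceph.conf options for that era; values are examples, not recommendations, and the KeyValueStore name varies by release:

    [osd]
    osd objectstore = filestore    # or keyvaluestore / memstore (MemStore is for testing only)
    osd journal = /var/lib/ceph/osd/ceph-$id/journal
    osd journal size = 10240       # MB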
Messenger layer
• Removed the Dispatcher and introduced a “fast path” mechanism for read/write requests
  – The same mechanism is now present on the client side (librados) as well
• Fine-grained locking in the message transmit path
• Introduced an efficient buffering mechanism for improved throughput
• Configuration options to disable message signing, CRC checks, etc.
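The options referred to above live in ceph.conf. A hedged sketch using Firefly/Giant-era option names (later releases split the CRC option into ms crc data / ms crc header); disabling these only makes sense on trusted networks:

    [global]
    ms nocrc = true                  # skip messenger CRC on message payloads
    cephx sign messages = false      # skip per-message cephx signatures
    cephx require signatures = false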
OSD Request Processing
• Running with the MemStore backend revealed a bottleneck in the OSD thread pool code
• The OSD worker thread pool mutex was heavily contended
• Implemented a sharded worker thread pool; requests are sharded by their PG (placement group) identifier
• Configuration options set the number of shards and the number of worker threads per shard (see the sketch after this list)
• Optimized the OpTracking path (sharded queue, removed redundant locks)
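A sketch of the sharded-pool tunables in ceph.conf; the values shown are illustrative, not recommendations:

    [osd]
    osd op num shards = 8              # number of request shards (requests hash by PG id)
    osd op num threads per shard = 2   # worker threads servicing each shard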
FileStore improvements
• Took backend storage out of the picture by using a small working set (FileStore served data from the page cache)
• Severe lock contention in the LRU FD (file descriptor) cache; implemented a sharded version of the LRU cache (see the sketch after this list)
• The CollectionIndex (per-PG) object was being created on every I/O request; implemented a cache for it, since PG info doesn’t change often
• Optimized the “object name to XFS file name” mapping function
• Removed redundant snapshot-related checks in the parent read processing path
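A sketch of the FD cache tunables in ceph.conf; the shard-count option name is an assumption and may differ by release:

    [osd]
    filestore fd cache size = 10240    # number of cached file descriptors
    filestore fd cache shards = 16     # shards of the LRU FD cache (assumed option name)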
Inconsistent Performance Observation
• Large performance variations on different pools across multiple clients
• The first client after a cluster restart gets maximum performance, irrespective of the pool
• Continued degraded performance from clients starting later
• Issue also observed on read I/O with unpopulated RBD images, ruling out file system issues
• Performance counters show up to a 3x increase in latency through the I/O path with no particular bottleneck
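The latency counters mentioned above can be read from each OSD’s admin socket; osd.0 below is a placeholder:

    ceph daemon osd.0 perf dump            # per-OSD performance counters, including op latencies
    ceph daemon osd.0 dump_historic_ops    # recent slow operations with per-stage timestamps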
Issue with TCMalloc
• perf top shows a rapid increase in time spent in TCMalloc functions:
    14.75% libtcmalloc.so.4.1.2 [.] tcmalloc::CentralFreeList::FetchFromSpans()
     7.46% libtcmalloc.so.4.1.2 [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
• I/O from a different client causes new threads in the sharded thread pool to process I/O
• This causes memory movement between thread caches and increases alloc/free latency
• jemalloc and glibc malloc do not exhibit this behavior
• A jemalloc build option was added to Ceph Hammer
• Setting the TCMalloc tunable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to a larger value (64 MB) alleviates the issue
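TCMalloc reads this tunable from the environment at daemon start-up, so it has to be set in the OSD’s environment; the sysconfig path below is distribution-dependent:

    # e.g. in /etc/sysconfig/ceph or the OSD init script / systemd unit environment
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=67108864   # 64 MB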
Client Optimizations
• Ceph turns Nagle’s algorithm OFF by default (see the sketch after this list)
• The RBD kernel driver ignored the TCP_NODELAY setting
• This caused large latency variations at lower queue depths
• Changes to the RBD kernel driver were submitted upstream
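On the userspace side this behaviour is governed by a messenger option that already defaults to on; a ceph.conf sketch for completeness:

    [global]
    ms tcp nodelay = true    # disable Nagle's algorithm on messenger sockets (Ceph's default)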
Results!
Hardware Topology
8K Random IOPS Performance - 1 RBD/Client
[Charts: IOPS and latency (ms), 1 LUN per client, 4 clients total; x-axis groups queue depths 1/4/16 at read percentages 0/25/50/75/100; series compare the Firefly and Giant releases]
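The deck does not include the benchmark job files; a representative fio job for the 8K random mixed workload using fio’s rbd engine might look like the following (pool/image names, runtime and the 75% read mix are illustrative, and the image must already exist):

    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=fio-test
    bs=8k
    rw=randrw
    rwmixread=75
    iodepth=16
    runtime=300
    time_based

    [rbd-8k-randrw]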
64K Random IOPS Performance - 1 RBD/Client
[Charts: IOPS and latency (ms), 1 LUN per client, 4 clients total; x-axis groups queue depths 1/4/16 at read percentages 0/25/50/75/100; series compare Firefly and Giant]
256K Random IOPS Performance - 1 RBD/Client
[Charts: IOPS and latency (ms), 1 LUN per client, 4 clients total; x-axis groups queue depths 1/4/16 at read percentages 0/25/50/75/100; series compare Firefly and Giant]
8K Random IOPS Performance - 2 RBD/Client
[Charts: IOPS and latency (ms), 2 LUNs per client, 4 clients total; x-axis groups queue depths 1/4/16 at read percentages 0/25/50/75/100; series compare Firefly and Giant]
SanDisk Emerging Storage Solutions (EMS)
Our goal for EMS:
• Expand the usage of flash for newer enterprise workloads (e.g. Ceph, Hadoop, content repositories, media streaming, clustered database applications)
• … through disruptive innovation: “disaggregation” of storage from compute
• … by building shared storage solutions
• … thereby allowing customers to lower TCO relative to HDD-based offerings by scaling storage and compute separately
Mission: Provide the best building blocks for “shared storage” flash solutions.
SanDisk Systems & Software Solutions
Building Blocks Transforming Hyperscale
InfiniFlash™ (massive capacity)
• Capacity and high density for big data analytics, media services, content repositories
• 512TB in 3U, 780K IOPS, SAS storage
• Compelling TCO through space/energy savings
ION Accelerator™ (extreme performance for applications)
• Accelerates database, OLTP and other high-performance workloads
• 1.7M IOPS, 23 GB/s, 56 us latency
• Lowers costs with IT consolidation
FlashSoft® (servers/virtualization)
• Server-side SSD-based caching to reduce I/O latency
• Improves application performance
• Maximizes storage infrastructure investment
ZetaScale™ (servers/virtualization)
• Lowers TCO while retaining performance for in-memory databases
• Run multiple-TB data sets on SSDs vs. GB-scale data sets on DRAM
• Delivers up to 5:1 server consolidation, improving TCO
Thank you! Questions?
vijayendra.shamanna@sandisk.com