Ceph Day Amsterdam 2015 - Building your own disaster? The safe way to make Ceph storage ready!
1. The safe way to make Ceph storage enterprise ready!
Build your own disaster?
Copyright 2015 FUJITSU
Dieter Kasper
CTO Data Center Infrastructure
Emerging Technologies & Solutions, Global Delivery
2015-03-31
2. 1
The safe way to make Ceph storage enterprise ready
ETERNUS CD10k integrated in OpenStack
mSHEC Erasure Code from Fujitsu
Contribution to performance enhancements
3. 2
Building Storage with Ceph looks simple
Ceph
+ some servers
+ network
= storage
4. 3
Building Storage with Ceph looks simple – but…
Many new Complexities
Rightsizing server, disk types, network bandwidth
Silos of management tools (HW, SW, …)
Keeping Ceph versions in sync with versions of server HW, OS, connectivity, drivers
Management of maintenance and support contracts of components
Troubleshooting
Troubleshooting
Build Ceph source storage yourself
5. 4
The challenges of software defined storage
What users want
Open standards
High scalability
High reliability
Lower costs
No vendor lock-in
What users may get
A self-developed storage system based on open / industry standard HW & SW components
High scalability and reliability? If the stack works!
Lower investments but higher operational efforts
Lock-in to their own stack
6. 5
ETERNUS CD10000 – Making Ceph enterprise ready
Build Ceph source storage yourself / Out of the box: ETERNUS CD10000
incl. support
incl. maintenance
ETERNUS CD10000 combines open source storage with enterprise-class quality of service
+ E2E Solution Contract by Fujitsu based on Red Hat Ceph Enterprise
+ Easy Deployment / Management by Fujitsu
+ Lifecycle Management for Hardware & Software by Fujitsu
8. 7
Unlimited Scalability
Cluster of storage nodes
Capacity and performance scales by
adding storage nodes
Three different node types enable
differentiated service levels
Density, capacity optimized
Performance optimized
Optimized for small scale dev & test
1st version of CD10000 (Q3 2014) is
released for a range of 4 to 224 nodes
Scales up to >50 Petabyte
Basic node: 12 TB
Performance node: 35 TB
Capacity node: 252 TB
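As a quick plausibility check of the ">50 Petabyte" figure, the maximum node count can be multiplied by the capacity node size quoted on this slide. The sketch below assumes a cluster built only from capacity nodes and decimal units (1 PB = 1000 TB); both are assumptions, not statements from the slides, and the result is raw capacity before replication or erasure coding.

```python
# Plausibility check of the ">50 Petabyte" figure.
# Assumptions (not from the slides): all 224 nodes are capacity nodes,
# and 1 PB = 1000 TB. The result is raw capacity, not usable capacity.

MAX_NODES = 224          # upper end of the released node range
CAPACITY_NODE_TB = 252   # capacity node size quoted on the slide


def max_raw_capacity_pb(nodes=MAX_NODES, tb_per_node=CAPACITY_NODE_TB):
    """Return the raw cluster capacity in petabytes."""
    return nodes * tb_per_node / 1000


if __name__ == "__main__":
    print(f"{max_raw_capacity_pb():.1f} PB raw")  # 56.4 PB raw
```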
9. 8
Immortal System
[Figure: Node1, Node2, … Node(n); adding nodes, including nodes with a new generation of hardware]
Non-disruptive add / remove / exchange of hardware (disks and nodes)
Mix of nodes of different generations, online technology refresh
Very long lifecycle reduces migration efforts and costs
10. 9
TCO optimized
Based on x86 industry standard architectures
Based on open source software (Ceph)
High-availability and self-optimizing functions are part
of the design at no extra costs
Highly automated and fully integrated management
reduces operational efforts
Online maintenance and technology refresh reduce
costs of downtime dramatically
Extremely long lifecycle delivers investment protection
End-to-end design and maintenance from Fujitsu
reduces evaluation, integration, and maintenance costs
Better service levels at reduced costs – business centric storage
11. 10
One storage – seamless management
ETERNUS CD10000 delivers one seamless
management for the complete stack
Central Ceph software deployment
Central storage node management
Central network management
Central log file management
Central cluster management
Central configuration, administration and
maintenance
SNMP integration of all nodes and
network components
12. 11
Seamless management (2)
Dashboard – Overview of cluster status
Server Management – Management of cluster hardware – add/remove server
(storage node), replace storage devices
Cluster Management – Management of cluster resources – cluster and pool creation
Monitoring the cluster – Monitoring overall capacity, pool utilization, status of OSD,
Monitor, and MDS processes, Placement Group status, and RBD status
Managing OpenStack Interoperation: Connection to OpenStack Server, and
placement of pools in Cinder multi-backend
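To illustrate what "placement of pools in Cinder multi-backend" can look like on the OpenStack side, the sketch below renders a cinder.conf fragment with one RBD backend section per Ceph pool. The backend names, pool names, the Ceph user and the ceph.conf path are hypothetical placeholders; the slides do not show the configuration that the ETERNUS CD10000 GUI actually writes.

```python
import configparser
import io


def render_cinder_conf(backends):
    """Render a cinder.conf fragment with one RBD backend per Ceph pool.

    `backends` maps a backend name to a Ceph pool name. All names are
    hypothetical examples, not values taken from the slides.
    """
    conf = configparser.ConfigParser()
    conf.optionxform = str  # keep option names exactly as written
    conf["DEFAULT"] = {"enabled_backends": ",".join(backends)}
    for backend, pool in backends.items():
        conf[backend] = {
            "volume_driver": "cinder.volume.drivers.rbd.RBDDriver",
            "volume_backend_name": backend,
            "rbd_pool": pool,
            "rbd_ceph_conf": "/etc/ceph/ceph.conf",  # assumed default path
            "rbd_user": "cinder",                    # assumed Ceph user
        }
    buffer = io.StringIO()
    conf.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_cinder_conf({"ceph-perf": "volumes-perf",
                              "ceph-cap": "volumes-cap"}))
```

Each backend can then be exposed as a separate Cinder volume type, so that volumes land in the pool matching the desired service level.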
14. 13
Example: Replacing an HDD
Plain Ceph
Take the failed disk offline in Ceph
Take the failed disk offline on OS / controller level
Identify the (right) hard drive in the server
Exchange the hard drive
Partition the hard drive on OS level
Make and mount the file system
Bring the disk up in Ceph again
On ETERNUS CD10000
vsm_cli <cluster> replace-disk-out <node> <dev>
Exchange the hard drive
vsm_cli <cluster> replace-disk-in <node> <dev>
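The two vsm_cli calls can be wrapped in a small helper that fills in the placeholders. This is only a sketch: the command shapes are the ones shown on the slide, while the cluster, node and device names in the example are hypothetical, and no other vsm_cli options are assumed.

```python
import shlex


def replace_disk_commands(cluster, node, dev):
    """Return the two vsm_cli command lines that frame a disk exchange.

    Only the command shapes from the slide are used; the arguments are
    supplied by the caller.
    """
    take_out = ["vsm_cli", cluster, "replace-disk-out", node, dev]
    bring_in = ["vsm_cli", cluster, "replace-disk-in", node, dev]
    return [shlex.join(take_out), shlex.join(bring_in)]


if __name__ == "__main__":
    # Hypothetical cluster, node and device names.
    before, after = replace_disk_commands("cd10k", "node3", "/dev/sdf")
    print(before)
    print("... exchange the hard drive ...")
    print(after)
```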
15. 14
Example: Adding a Node
Plain Ceph
Install hardware
Install OS
Configure OS
Partition disks (OSDs, Journals)
Make filesystems
Configure network
Configure ssh
Configure Ceph
Add node to cluster
On ETERNUS CD10000
Install hardware
• hardware will automatically PXE boot
and install the current cluster
environment including current
configuration
Make node available to GUI
Add node to cluster with mouse click
on GUI
16. 15
Seamless management drives productivity
Manual Ceph Installation
Setting up a 4 node Ceph cluster with 15 OSDs: 1.5 to 2 admin days
Adding an additional node: 3 admin hours up to half a day
Automated Installation through ETERNUS CD10000
Setting up a 4 node Ceph cluster with 15 OSDs: 1 hour
Adding an additional node: 0.5 hours
17. 16
Adding and Integrating Apps
The ETERNUS CD10000 architecture
enables the integration of apps
Fujitsu is working with customers and
software vendors to integrate selected
storage apps
E.g. archiving, sync & share, data
discovery, cloud apps…
[Architecture diagram]
Apps: Cloud Services, Sync & Share, Archive, iRODS, data discovery
ETERNUS CD10000
Object Level Access, Block Level Access, File Level Access
Central Management
Ceph Storage System S/W and Fujitsu Extensions
10GbE Frontend Network
Fast Interconnect Network
Performance Nodes, Capacity Nodes
18. 17
ETERNUS CD10000 at University Mainz
Large university in Germany
Uses iRODS Application for library services
iRODS is an open-source data management software in use at research
organizations and government agencies worldwide
Organizes and manages large depots of distributed digital data
Customer has built an interface from iRODS to Ceph
Stores raw data of measurement instruments (e.g. research in chemistry and
physics) for 10+ years meeting compliance rules of the EU
Need to provide extensive and rapidly growing data volumes online at
reasonable costs
Will implement a sync & share service on top of ETERNUS CD10000
19. 18
How ETERNUS CD10000 supports cloud biz
Cloud IT Trading Platform
A European provider operates a trading platform for cloud
resources (CPU, RAM, Storage)
Cloud IT Resources Supplier
The Darmstadt data center (DARZ) offers
storage capacity via the trading platform
Using ETERNUS CD10000 to provide storage
resources for an unpredictable demand
20. 19
Summary ETERNUS CD10k – Key Values
ETERNUS CD10000
Unlimited Scalability
TCO optimized
The new unified
Immortal System
Zero Downtime
ETERNUS CD10000 combines open source storage with enterprise–class quality of service
21. 20
The safe way to make Ceph storage enterprise ready
ETERNUS CD10k integrated in OpenStack
mSHEC Erasure Code from Fujitsu
Contribution to performance enhancements
22. 21
What is OpenStack
Free open source (Apache license) software governed by a non-profit foundation
(corporation) with a mission to produce the ubiquitous Open Source Cloud
Computing platform that will meet the needs of public and private clouds
regardless of size, by being simple to implement and massively scalable.
[Figure: foundation member tiers: Platinum, Gold, Corporate]
Massively scalable cloud operating system that
controls large pools of compute, storage, and
networking resources
Community OSS with contributions from 1000+
developers and 180+ participating organizations
Open web-based API Programmatic IaaS
Plug-in architecture; allows different hypervisors,
block storage systems, network implementations,
hardware agnostic, etc.
http://www.openstack.org/foundation/companies/
23. 22
OpenStack Summit in Paris Nov. 2014
OpenStack Momentum
Impressively demonstrated at the OpenStack Summit: more than 5,000
participants from 60+ countries, high profile companies from all industries
– e.g. AT&T, BBVA, BMW, CERN, Expedia, Verizon – sharing their
experience and plans around OpenStack
OpenStack @ BMW: Replacement of a self-built IaaS cloud; covers a pool
of x,000 VMs; rapid growth planned; system is up & running but currently
used productively by selected departments only.
OpenStack @ CERN: In production since July 2013; 4 operational IaaS
clouds, the largest one with 70k cores on 3,000 servers; expected to pass
150k cores by Q1 2015.
24. 23
Attained fast growing customer interest
VMware clouds dominate
OpenStack clouds already #2
Worldwide adoption
Source: OpenStack User Survey and Feedback Nov 3rd 2014
Source: OpenStack User Survey and Feedback May 13th 2014
25. 24
Why are Customers so interested?
Source: OpenStack User Survey and Feedback Nov 3rd 2014
Greatest industry & community support
compared to alternative open platforms:
Eucalyptus, CloudStack, OpenNebula
“Ability to Innovate” jumped from #6 to #1
27. 26
OpenStack Cloud Layers
OpenStack and ETERNUS CD10000
[Layer diagram; components shown:]
Dashboard (Horizon), Billing Portal
OpenStack Cloud APIs, EC2 API
Compute (Nova), Network (Neutron) + plugins, Volume (Cinder), Object (Swift), Manila (File), Images (Glance), Authentication (Keystone), Metering (Ceilometer)
Hypervisor: KVM, ESXi, Hyper-V
Fujitsu Open Cloud Storage: RADOS with Block (RBD), S3 (Rados-GW), File (CephFS)
OAM: dhcp, Deploy, LCM
Base Operating System (CentOS)
Physical Server (CPU, Memory, SSD, HDD) and Network
28. 27
The OpenStack – Ceph Ecosystem @Work
[Diagram: an OpenStack Cloud Controller and several OpenStack Compute Nodes on top of a Ceph Storage Cluster. The cluster holds VM Templates, Production VMs and their replicas. Arrows are labeled: create, snapshot / clone, use, move.]
29. 28
The safe way to make Ceph storage enterprise ready
ETERNUS CD10k integrated in OpenStack
mSHEC Erasure Code from Fujitsu
Contribution to performance enhancements
30. 29
Backgrounds (1)
Erasure codes for content data
Content data for ICT services is ever-growing
Demand for higher space efficiency and durability
Reed Solomon code (de facto erasure code) improves both
(Old style) Triple Replication: content data + copy + copy = 3x space
Reed Solomon Code: content data + parity + parity = 1.5x space
However, Reed Solomon code is not so recovery-efficient
Backgrounds (2)
Local parity improves recovery efficiency
Data recovery should be as efficient as possible
• in order to avoid multiple disk failures and data loss
Reed Solomon code was improved by local parity methods
• data read from disks is reduced during recovery
[Figure: data chunks and parity chunks for Reed Solomon Code (no local parities) and for a local parity method; with local parities, less data is read from disks during recovery]
However, multiple disk failures are not taken into consideration
32. 31
Local parity method for multiple disk failures
Existing methods are optimized for single disk failures
• e.g. Microsoft MS-LRC, Facebook Xorbas
However, their recovery overhead is large in case of multiple disk failures
• because they may have to use global parities for recovery
Our goal is a method that handles multiple disk failures efficiently
[Figure: a local parity method under multiple disk failures]
33. 32
SHEC (= Shingled Erasure Code)
An erasure code only with local parity groups
• to improve recovery efficiency in case of multiple disk failures
The calculation ranges of local parities are shifted and partly overlap with each
other (like the shingles on a roof)
• to keep enough durability
Our Proposed Method (SHEC)
k : data chunks (=10)
m : parity chunks (=6)
l : calculation range (=5)
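The shingled layout for k=10, m=6, l=5 can be visualized with a short sketch. It assumes that the start of each parity's range is shifted evenly by k/m (rounded down) and that ranges wrap around the end of the data chunks. Both are assumptions made for illustration; this is not the code of the Ceph SHEC plugin.

```python
def shingled_layout(k=10, m=6, l=5):
    """For each parity chunk, list the data chunk indices it is computed from.

    Assumed placement: parity i starts at floor(i * k / m) and covers l
    consecutive data chunks, wrapping around after chunk k - 1.
    """
    return [[(i * k // m + j) % k for j in range(l)] for i in range(m)]


def coverage(layout, k):
    """Count how many parity chunks cover each data chunk."""
    counts = [0] * k
    for group in layout:
        for index in group:
            counts[index] += 1
    return counts


if __name__ == "__main__":
    layout = shingled_layout()
    for parity, group in enumerate(layout):
        print(f"parity {parity}: data chunks {group}")
    print("coverage per data chunk:", coverage(layout, 10))
```

With these parameters every data chunk is covered by m * l / k = 3 overlapping parity groups, which is the "shingle" effect, while the space overhead stays at (k + m) / k = 1.6.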
34. 33
SHEC’s Implementation on Ceph
SHEC is implemented as an erasure code plugin of Ceph, an open source scalable object storage
4MB objects are split into data/parity chunks, distributed over OSDs
encode/decode logic is separated from main part of Ceph Storage
SHEC plugin
35. 34
Summary mSHEC
1. mSHEC is more adjustable than Reed Solomon code, because SHEC provides many recovery-efficient layouts, including Reed Solomon codes
2. mSHEC’s recovery time was ~20% faster than Reed Solomon code in case of double disk failures
3. mSHEC erasure code was added to Ceph v0.93 = pre-Hammer release
4. For more information see
https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC)
or ask Fujitsu
36. 35
The safe way to make Ceph storage enterprise ready
ETERNUS CD10k integrated in OpenStack
mSHEC Erasure Code from Fujitsu
Contribution to performance enhancements
37. 36
Areas to improve Ceph performance
Ceph has adequate performance today,
but there are performance issues which prevent us from taking full
advantage of our hardware resources.
Two main goals for improvement:
(1) Decrease latency in the Ceph code path
(2) Enhance large cluster scalability with many nodes / OSDs
39. 38
1. LTTng general http://lttng.org/
General
open source tracing framework for Linux
trace Linux kernel and user space applications
low overhead and therefore usable on
production systems
activate tracing at runtime
Ceph code contains LTTng trace points already
Our LTTng based profiling
activate within a function, collect timestamp information at the interesting places
save collected information in a single trace point at the end of the function
transaction profiling instead of function profiling: use Ceph transaction id's to
correlate trace points
focused on primary and secondary write operations
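The profiling approach described above (collect timestamps inside a function, emit them in a single trace point at the end, correlate by transaction id) can be sketched in a few lines. This is a Python illustration of the idea only. The real instrumentation lives in the ceph-osd C++ code and uses LTTng trace points; the class and field names here are hypothetical.

```python
import time
from collections import defaultdict


class TransactionProfile:
    """Collect timestamps inside one function, emit one record at the end."""

    def __init__(self, tid, function, sink):
        self.tid = tid            # transaction id used for correlation
        self.function = function  # name of the profiled function
        self.sink = sink          # stand-in for the trace buffer
        self.marks = []

    def mark(self, label):
        """Remember a timestamp at an interesting place."""
        self.marks.append((label, time.perf_counter_ns()))

    def emit(self):
        """Write all collected timestamps as a single trace record."""
        self.sink.append({"tid": self.tid,
                          "function": self.function,
                          "marks": list(self.marks)})


def correlate(records):
    """Group trace records by transaction id."""
    by_tid = defaultdict(list)
    for record in records:
        by_tid[record["tid"]].append(record)
    return dict(by_tid)


if __name__ == "__main__":
    trace = []
    for function in ("Pipe::reader", "Pipe::writer"):
        profile = TransactionProfile(tid=42, function=function, sink=trace)
        profile.mark("enter")
        profile.mark("exit")
        profile.emit()
    print(correlate(trace)[42])
```

Emitting one record per function keeps the number of trace points low, and the shared transaction id lets the evaluation follow one write across threads.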
40. 39
2. Test setup
Ceph Cluster
3 storage nodes:
2 CPU sockets, 8 cores per socket, Intel E5-2640, 2.00GHz, 128 GB memory
12 OSDs: 4 OSDs per storage node (SAS disks), journals on raw SSD partitions
CentOS 6.6, Linux 3.10.32, Ceph v0.91, storage pools with replication 3
Ceph Client
2 CPU sockets, 6 cores per socket, Intel E5-2630, 2.30GHz, 192 GB memory
CentOS 6.6, Linux 3.10.32
Ceph kernel client (rbd.ko + libceph.ko)
Test Program
fio 2.1.10
randwrite, 4kByte buffersize, libaio / iodepth 16
test writes 1 GByte of data (or 262144 I/O requests)
41. 40
3. LTTng trace session
Ceph cluster is up and running: ceph-osd binaries from standard packages
stop one ceph-osd daemon
restart with ceph-osd binary including LTTng based profiling
wait until cluster healthy
start LTTng session
run fio test
stop LTTng session
collect trace data and evaluate
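The tracing part of this procedure maps onto a handful of standard lttng commands. The sketch below only builds the command sequence. The session name and the fio job file are hypothetical, and enabling all user-space events is an assumption, since the slides do not say which events were enabled.

```python
import subprocess


def lttng_session_commands(session, fio_job):
    """Return the command sequence for one traced fio run.

    Assumptions: all user-space events are enabled, and `fio_job` is the
    path of a job file describing the randwrite test.
    """
    return [
        ["lttng", "create", session],
        ["lttng", "enable-event", "--userspace", "--all"],
        ["lttng", "start"],
        ["fio", fio_job],
        ["lttng", "stop"],
        ["lttng", "view"],
        ["lttng", "destroy", session],
    ]


def run_all(commands):
    """Run the commands in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical session name and job file; print instead of executing.
    for command in lttng_session_commands("osd-profile", "randwrite-4k.fio"):
        print(" ".join(command))
```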
Typical sample size on the osd under test:
22,000 primary writes (approx. 262144 / 12)
44,000 replication writes (approx. (262144 * 2) / 12)
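These sample sizes follow directly from the test parameters. The sketch assumes that 1 GByte means 2^30 bytes (which matches the 262144 requests quoted in the test setup) and that writes are spread evenly over the 12 OSDs.

```python
TOTAL_BYTES = 1 << 30   # 1 GByte, read as 2**30 bytes (assumption)
BLOCK_BYTES = 4 * 1024  # 4 kByte buffer size
OSDS = 12               # 3 storage nodes with 4 OSDs each
REPLICATION = 3         # pool replication factor


def io_requests():
    """Number of 4 kByte write requests issued by the fio run."""
    return TOTAL_BYTES // BLOCK_BYTES


def writes_per_osd():
    """Expected (primary, replication) writes on one OSD, assuming an even spread."""
    primary = io_requests() / OSDS
    replica = io_requests() * (REPLICATION - 1) / OSDS
    return primary, replica


if __name__ == "__main__":
    primary, replica = writes_per_osd()
    print(io_requests())     # 262144
    print(round(primary))    # 21845
    print(round(replica))    # 43691
```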
43. 42
4.1. LTTng data evaluation: Replication Write
Observation:
replication write latency suffers from the "large variance problem"
minimum and average differ by a factor of 2
This is a common problem visible for many ceph-osd components.
Why is variance so large?
Observation: No single hotspot visible.
Observation: Active processing steps do not differ between minimum and average
sample as much as the total latency does.
Additional latency penalty mostly at the switch from
sub_op_modify_commit to Pipe::writer
no indication that queue length is the cause
Question: Can the overall thread load on the system and Linux scheduling be the
reason for the delayed start of the Pipe::writer thread?
44. 43
4.1.1 LTTng Microbenchmark Pipe::reader
"decode": fill message MSG_OSD_SUBOP data structure from bytes in the input
buffer. There is no decoding of the data buffer!
Optimizations:
"decode": a project currently restructures some messages to decrease the effort for
message encoding and decoding.
"authenticate": is currently optimized, too. Disable via "cephx sign messages"
45. 44
4.1.2 LTTng Microbenchmark Pipe::writer
"message setup": buffer allocation and encoding of message structure
"enqueue": enqueue at low level socket layer (not quite sure whether this really
covers the write/sendmsg system call to the socket)
48. 47
5. Thread classes and ceph-osd CPU usage
Threads per ceph-osd depend on the complexity of the Ceph cluster: 3 nodes with 4 OSDs
each: ~700 threads per node; 9 nodes with 40 OSDs each: > 100k threads per node
ThreadPool::WorkThread is a hot spot = work in the ObjectStore / FileStore
Thread class                            CPU seconds    Share
total CPU usage during test                   43.17
Pipe::Writer                                   4.59   10.63%
Pipe::Reader                                   5.81   13.45%
ShardedThreadPool::WorkThreadSharded           8.08   18.70%
ThreadPool::WorkThread                        15.56   36.04%
FileJournal::Writer                            2.41    5.57%
FileJournal::WriteFinisher                     1.01    2.33%
Finisher::finisher_thread_entry                2.86    6.63%
49. 48
5.1. FileStore benchmarking
most of the work is done in FileStore::do_transactions
each write transaction consists of
3 calls to omap_setkeys,
the actual call to write to the file system
2 calls to setattr
Proposal: coalesce calls to omap_setkeys
1 function call instead of 3 calls, set 5 key value pairs instead of 6 (duplicate key)
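The coalescing idea can be shown with plain dictionaries: three separate key/value updates are merged into one, and a key that appears twice is written only once. This is a sketch of the principle, not the patch itself; the key names are hypothetical.

```python
def coalesce_setkeys(calls):
    """Merge several key/value updates into a single update.

    Later calls win for a duplicate key, which matches the result of
    applying the calls one after another.
    """
    merged = {}
    for keys in calls:
        merged.update(keys)
    return merged


if __name__ == "__main__":
    # Hypothetical keys: 3 calls carrying 6 pairs, one key appears twice.
    calls = [
        {"key_a": b"1", "key_b": b"2"},
        {"key_c": b"3", "key_b": b"20"},
        {"key_d": b"4", "key_e": b"5"},
    ]
    merged = coalesce_setkeys(calls)
    print(sum(len(c) for c in calls), "pairs in", len(calls), "calls")  # 6 pairs in 3 calls
    print(len(merged), "pairs in 1 call")                               # 5 pairs in 1 call
```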
51. 50
6. With our omap_setkeys coalescing patch
Reduced latency in ThreadPool::WorkThread by 54 microseconds = 25%
Significant reduction of CPU usage at the ceph-osd: 9% for the complete ceph-osd
Approx 5% better performance at the Ceph client
                                        without patch           with patch
Thread class                            CPU seconds    Share    CPU seconds    Share
total CPU usage during test                   43.17                   39.33
Pipe::Writer                                   4.59   10.63%           4.73   12.02%
Pipe::Reader                                   5.81   13.45%           5.91   15.04%
ShardedThreadPool::WorkThreadSharded           8.08   18.70%           7.94   20.18%
ThreadPool::WorkThread                        15.56   36.04%          12.45   31.66%
FileJournal::Writer                            2.41    5.57%           2.44    6.22%
FileJournal::WriteFinisher                     1.01    2.33%           1.03    2.61%
Finisher::finisher_thread_entry                2.86    6.63%           2.76    7.01%
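The headline numbers on this slide can be re-derived from the CPU figures. The sketch computes the relative reductions and the baseline latency implied by "54 microseconds = 25%"; the implied baseline is a derived value, not one stated on the slide.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


def implied_baseline(saved, fraction):
    """Baseline value implied by an absolute saving and its relative share."""
    return saved / fraction


if __name__ == "__main__":
    # CPU seconds without and with the omap_setkeys coalescing patch.
    print(f"{relative_reduction(43.17, 39.33):.1%}")   # 8.9% for the complete ceph-osd
    print(f"{relative_reduction(15.56, 12.45):.1%}")   # 20.0% for ThreadPool::WorkThread
    # Latency in ThreadPool::WorkThread before the patch, in microseconds.
    print(implied_baseline(54, 0.25))                  # 216.0
```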
52. 51
Summary on Performance
Two main goals for improvement:
(1) Decrease latency in the Ceph code path
(2) Enhance large cluster scalability with many nodes / OSDs
There is a long path to improve the overall Ceph performance.
Many steps are necessary to get a factor of 2. Current performance
work focuses on (1), decreasing latency.
To get an order of magnitude improvement on (2) we have to
master the limits bound to the overall OSD design:
Transaction structure bound across multiple Objects
PG omap data with a high level state logging
54. 53
Summary and Conclusion
ETERNUS CD10k is the safe way to make Ceph enterprise ready
Unlimited Scalability: 4 to 224 nodes, scales up to >50 Petabyte
Immortal System with Zero downtime: Non-disruptive add / remove / exchange of hardware
(disks and nodes) or Software update
TCO optimized: Highly automated and fully integrated management reduces operational efforts
Tight integration in OpenStack with own GUI
Fujitsu mSHEC technology (integrated in Hammer) shortens
recovery time by ~20% compared to Reed Solomon code
We love Ceph! But love is not blind, so we actively contribute to
the performance analysis & code/performance improvements.