4. Interfaces to storage
● Object
– Ceph RGW, S3, Swift
● Block (aka SAN)
– Ceph RBD, iSCSI, FC, SAS
● File (aka scale-out NAS)
– Ceph, GlusterFS, Lustre, proprietary filers
5. Interfaces to storage
[Diagram: three interfaces layered on RADOS — OBJECT STORAGE via RGW (S3 & Swift, multi-tenant, Keystone, geo-replication, native API), BLOCK STORAGE via RBD (snapshots, clones, OpenStack, Linux kernel, iSCSI), and FILE SYSTEM via CephFS (POSIX, Linux kernel client, CIFS/NFS, HDFS, distributed metadata).]
6. Object stores scale out well
● Last writer wins consistency
● Consistency rules only apply to one object at a time
● Clients are stateless (unless explicitly doing lock ops)
● No relationships exist between objects
● Objects have exactly one name
● Scale-out accomplished by mapping objects to nodes
● Single objects may be lost without affecting others
7. POSIX filesystems are hard to scale out
● Extents written from multiple clients must win or lose on an all-or-nothing basis → locking
● Inodes depend on one another (directory hierarchy)
● Clients are stateful: holding files open
● Users have local-filesystem latency expectations: applications assume the FS client will do lots of metadata caching for them.
● Scale-out requires spanning inode/dentry relationships
across servers
● Loss of data can damage whole subtrees
8. Failure cases increase complexity further
● What should we do when... ?
● Filesystem is full
● Client goes dark
● An MDS goes dark
● Memory is running low
● Clients are competing for the same files
● Clients misbehave
● Hard problems in distributed systems generally, and especially hard when we must uphold POSIX semantics designed for local systems.
9. Terminology
● inode: a file. Has unique ID, may be referenced by
one or more dentries.
● dentry: a link between an inode and a directory
● directory: special type of inode that has 0 or more child
dentries
● hard link: many dentries referring to the same inode
● These terms originate from the original (local-disk) filesystems, where they described how a filesystem was represented on disk.
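The relationships are easy to demonstrate on any POSIX filesystem; a minimal sketch with hypothetical file names:
touch file_a
ln file_a file_b                     # hard link: a second dentry for the same inode
stat -c '%i %h %n' file_a file_b     # same inode number for both names, link count 2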
11. CephFS architecture
● Dynamically balanced scale-out metadata
● Inherit flexibility/scalability of RADOS for data
● POSIX compatibility
● Beyond POSIX: Subtree snapshots, recursive statistics
Weil, Sage A., et al. "Ceph: A Scalable, High-Performance Distributed File System." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, 2006.
http://ceph.com/papers/weil-ceph-osdi06.pdf
13. Components
[Diagram: a Linux host mounting CephFS through the ceph.ko kernel client, talking to the Ceph server daemons — monitors, MDS, and OSDs — with file metadata and file data stored in separate RADOS pools.]
14. From application to disk
[Diagram: Application → ceph-fuse / libcephfs / kernel client → client network protocol → ceph-mds → RADOS → disk.]
15. Scaling out FS metadata
● Options for distributing metadata?
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance, ease of implementation
17. Dynamic subtree placement
● Locality: get the dentries in a dir from one MDS
● Support read-heavy workloads by replicating non-authoritative copies (cached with capabilities, just like clients do)
● In practice, work at the directory-fragment level to handle large dirs
18. Data placement
● Stripe file contents across RADOS objects
– Get full RADOS cluster bandwidth from clients
– Delegate all placement/balancing to RADOS
● Control striping with layout vxattrs (example below)
– Layouts also select between multiple data pools
● Deletion is a special case: client deletions mark files 'stray'; RADOS delete ops are sent by the MDS
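Striping and pool choice are visible and settable through those vxattrs; a sketch assuming a CephFS mount at /mnt/ceph and a second data pool named fs_data2 (both hypothetical). Note a file's layout can only be changed while it is still empty:
touch /mnt/ceph/newfile
setfattr -n ceph.file.layout.stripe_unit -v 4194304 /mnt/ceph/newfile   # 4 MiB stripe unit
setfattr -n ceph.file.layout.pool -v fs_data2 /mnt/ceph/newfile         # direct data to another pool
getfattr -n ceph.file.layout /mnt/ceph/newfile                          # read the layout back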
19. Clients
● Two implementations:
● ceph-fuse/libcephfs
● kclient
● Interplay with the VFS page cache; efficiency is harder with FUSE (extraneous stats, etc.)
● Client performance matters for single-client workloads
● A slow client can hold up others if it's hogging metadata locks: include clients in troubleshooting
20. Journaling and caching in MDS
● Metadata ops initially journaled to striped journal "file"
in the metadata pool.
● I/O latency on metadata ops is sum of network latency
and journal commit latency.
● Metadata remains pinned in in-memory cache until
expired from journal.
21. Journaling and caching in MDS
● In some workloads we expect almost all metadata to stay in cache; in others it's more of a stream.
● Control cache size with mds_cache_size
● Cache eviction relies on client cooperation
● MDS journal replay not only recovers data but also
warms up cache. Use standby replay to keep that
cache warm.
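Both knobs are ordinary config options; a ceph.conf sketch, where the value and the daemon name mds.b are hypothetical:
[mds]
mds cache size = 1000000          # inodes to pin in cache (default 100000)
[mds.b]
mds standby replay = true         # tail the active MDS journal to keep this standby's cache warm
mds standby for rank = 0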
22. Lookup by inode
● Sometimes we need inode → path mapping:
● Hard links
● NFS handles
● Costly to store this: mitigate by piggybacking paths
(backtraces) onto data objects
● Con: storing metadata to data pool
● Con: extra IOs to set backtraces
● Pro: disaster recovery from data pool
● Future: improve backtrace writing latency?
23. Extra features
● Snapshots (sketch below):
– Exploit RADOS snapshotting for file data
– … plus some clever code in the MDS
– Fast petabyte snapshots
● Recursive statistics:
– Lazily updated
– Access via vxattr
– Avoid spurious client I/O for df
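Both features are driven from an ordinary client mount; a sketch assuming snapshots are enabled and CephFS is mounted at /mnt/ceph (path hypothetical):
mkdir /mnt/ceph/projects/.snap/before-upgrade    # snapshot the subtree
rmdir /mnt/ceph/projects/.snap/before-upgrade    # drop the snapshot again
getfattr -n ceph.dir.rbytes /mnt/ceph/projects   # recursive bytes under the directory
getfattr -n ceph.dir.rfiles /mnt/ceph/projects   # recursive file count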
25. CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data <pg_num>
ceph osd pool create fs_metadata <pg_num>
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
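In practice the kernel mount usually also needs credentials when cephx is enabled; a sketch with hypothetical addresses and paths, plus the FUSE alternative:
mount -t ceph 192.168.0.1:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret
ceph-fuse /mnt/ceph      # FUSE client; reads /etc/ceph/ceph.conf for monitors and keys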
26. Managing CephFS clients
● New in Giant: see hostnames of connected clients
● Client eviction is sometimes important:
● Skip the wait during reconnect phase on MDS restart
● Allow others to access files locked by crashed client
● Use OpTracker to inspect ongoing operations
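Both live on the MDS admin socket; a sketch assuming a daemon named mds.a:
ceph daemon mds.a session ls             # connected clients, with hostnames as of Giant
ceph daemon mds.a dump_ops_in_flight     # OpTracker's view of ongoing operations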
27. CephFS tips
● Choose MDS servers with lots of RAM
● Investigate clients when diagnosing stuck/slow access
● Use recent Ceph and recent kernel
● Use a conservative configuration:
– Single active MDS, plus one standby
– Dedicated MDS server
– Kernel client
– No snapshots, no inline data
29. APP / APP / HOST/VM / CLIENT
[Diagram: the Ceph stack, with a maturity label on each layer.]
● LIBRADOS (AWESOME): a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RADOSGW (AWESOME): a bucket-based REST gateway, compatible with S3 and Swift
● RBD (AWESOME): a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CEPH FS (NEARLY AWESOME): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS (AWESOME): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
30. Towards a production-ready CephFS
● Focus on resilience:
1. Don't corrupt things
2. Stay up
3. Handle the corner cases
4. When something is wrong, tell me
5. Provide the tools to diagnose and fix problems
● Achieve this first within a conservative single-MDS
configuration
31. Giant → Hammer timeframe
● Initial online fsck (a.k.a. forward scrub)
● Online diagnostics (`session ls`, MDS health alerts)
● Journal resilience & tools (cephfs-journal-tool)
● flock in the FUSE client
● Initial soft quota support
● General resilience: full OSDs, full metadata cache
32. FSCK and repair
● Recover from damage:
– Loss of data objects (which files are damaged?)
– Loss of metadata objects (what subtree is damaged?)
● Continuous verification:
– Are recursive stats consistent?
– Does metadata on disk match cache?
– Does file size metadata match data on disk?
● Repair:
– Automatic where possible
– Manual tools to enable support
33. Client management
● Current eviction is not 100% safe against rogue clients
● Update to client protocol to wait for OSD blacklist
● Client metadata
● Initially domain name, mount point
● Extension to other identifiers?
34. Online diagnostics
● Bugs exposed relate to failures of one client to release
resources for another client: “my filesystem is frozen”.
Introduce new health messages:
● “client xyz is failing to respond to cache pressure”
● “client xyz is ignoring capability release messages”
● Add client metadata to allow us to give domain names instead of IP addresses in messages.
● Opaque behavior in the face of dead clients. Introduce
`session ls`
● Which clients does MDS think are stale?
● Identify clients to evict with `session evict`
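A sketch of that workflow on the MDS admin socket, following the upstream docs' session-evict syntax; the daemon name mds.a and session id 4305 are hypothetical:
ceph daemon mds.a session ls               # which clients does the MDS think are stale?
ceph daemon mds.a session evict id=4305    # drop the dead client so others can proceed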
35. Journal resilience
● Bad journal prevents MDS recovery: “my MDS crashes
on startup”:
● Data loss
● Software bugs
● Updated on-disk format to make recovery from
damage easier
● New tool: cephfs-journal-tool
● Inspect the journal, search/filter
● Chop out unwanted entries/regions
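A sketch of a cautious session with the tool; exporting a backup before any splice is the safe habit:
cephfs-journal-tool journal inspect             # report integrity of the journal
cephfs-journal-tool journal export backup.bin   # keep a copy before any surgery
cephfs-journal-tool event get list              # inspect, search, and filter entries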
36. Handling resource limits
● Write a test, see what breaks!
● Full MDS cache:
– Require some free memory to make progress
– Require client cooperation to unpin cache objects
– Anticipate tuning required for cache behaviour: what should we evict?
● Full OSD cluster:
– Require explicit handling to abort with -ENOSPC
– MDS → RADOS flow control: contention between I/O to flush cache and I/O to journal
37. Test, QA, bug fixes
● The answer to “Is CephFS production ready?”
● teuthology test framework:
● Long running/thrashing test
● Third party FS correctness tests
● Python functional tests
● We dogfood CephFS internally
● Various kclient fixes discovered
● Motivation for new health monitoring metrics
● Third party testing is extremely valuable
38. What's next?
● You tell us!
● Recent survey highlighted:
● FSCK hardening
● Multi-MDS hardening
● Quota support
● Which use cases will matter to community?
● Backup
● Hadoop
● NFS/Samba gateway
● Other?
39. Reporting bugs
● Does the most recent development release or kernel
fix your issue?
● What is your configuration? MDS config, Ceph
version, client version, kclient or fuse
● What is your workload?
● Can you reproduce with debug logging enabled?
http://ceph.com/resources/mailing-list-irc/
http://tracker.ceph.com/projects/ceph/issues
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
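Debug logging can be raised at runtime through the admin socket (daemon name mds.a hypothetical; see the log-and-debug link above):
ceph daemon mds.a config set debug_mds 20    # most verbose MDS logging
ceph daemon mds.a config set debug_ms 1      # add messenger-level tracing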
40. Future
● Ceph Developer Summit:
● When: 8 October
● Where: online
● Post-Hammer work:
● Recent survey highlighted multi-MDS, quota support
● Testing with clustered Samba/NFS?
57. A STORAGE REVOLUTION
[Diagram: the traditional stack — support & maintenance, proprietary software, proprietary hardware (computers and disks) — versus the open stack: enterprise products & services, open source software, standard hardware (computers and disks).]
62. RADOS COMPONENTS
OSDs:
● 10s to 10000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● Do not serve stored objects to clients
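A quick way to see both daemon types in a live cluster, using the standard ceph CLI:
ceph mon stat    # monitor membership and quorum state
ceph osd tree    # OSDs laid out in the CRUSH hierarchy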
70. ACCESSING A RADOS CLUSTER
[Diagram: an APPLICATION links LIBRADOS and talks over a socket directly to the monitors and OSDs of the RADOS CLUSTER.]
71. LIBRADOS: RADOS ACCESS FOR APPS
LIBRADOS:
● Direct access to RADOS for applications
● C, C++, Python, PHP, Java, Erlang
● Direct access to storage nodes
● No HTTP overhead
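The bundled rados CLI exercises the same direct path and is handy for smoke tests; the pool and object names below are hypothetical:
rados -p testpool put greeting ./hello.txt   # store a local file as one RADOS object
rados -p testpool stat greeting              # object size and mtime, straight from the OSDs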
73. THE RADOS GATEWAY
[Diagram: APPLICATIONs speak REST to RADOSGW instances, which link LIBRADOS and talk over a socket to the RADOS CLUSTER.]
74. RADOSGW MAKES RADOS WEBBY
RADOSGW:
● REST-based object storage proxy
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
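Accounts and usage accounting are managed with radosgw-admin; a sketch with a hypothetical user:
radosgw-admin user create --uid=demo --display-name="Demo User"   # emits S3-style keys
radosgw-admin usage show --uid=demo                               # per-user usage for billing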
82. SCALABLE METADATA SERVERS
METADATA SERVER:
● Manages metadata for a POSIX-compliant shared filesystem
– Directory hierarchy
– File metadata (owner, timestamps, mode, etc.)
● Stores metadata in RADOS
● Does not serve file data to clients
● Only required for shared filesystem
83. CEPH AND OPENSTACK
[Diagram: OpenStack services — KEYSTONE, CINDER, GLANCE, SWIFT, NOVA — integrate with Ceph: object access through RADOSGW over LIBRADOS, and block devices through LIBRBD in the hypervisor, all backed by the RADOS CLUSTER.]