4. Interfaces to storage
● Object
– Ceph RGW, S3, Swift
● Block (aka SAN)
– Ceph RBD, iSCSI, FC, SAS
● File (aka scale-out NAS)
– Ceph, GlusterFS, Lustre, proprietary filers
5. Interfaces to storage
[Diagram: three interfaces layered on RADOS — OBJECT STORAGE via RGW (S3 & Swift, multi-tenant, Keystone, geo-replication, native API), BLOCK STORAGE via RBD (snapshots, clones, OpenStack, Linux kernel, iSCSI), and FILE SYSTEM via CephFS (POSIX, Linux kernel client, CIFS/NFS, HDFS, distributed metadata).]
6. Object stores scale out well
● Last writer wins consistency
● Consistency rules only apply to one object at a time
● Clients are stateless (unless explicitly doing lock ops)
● No relationships exist between objects
● Objects have exactly one name
● Scale-out accomplished by mapping objects to nodes
● Single objects may be lost without affecting others
7. POSIX filesystems are hard to scale out
● Extents written from multiple clients must win or lose on an all-or-nothing basis → locking
● Inodes depend on one another (directory hierarchy)
● Clients are stateful: holding files open
● Users have local-filesystem latency expectations: applications assume the FS client will do lots of metadata caching for them.
● Scale-out requires spanning inode/dentry relationships
across servers
● Loss of data can damage whole subtrees
8. Failure cases increase complexity further
● What should we do when... ?
● Filesystem is full
● Client goes dark
● An MDS goes dark
● Memory is running low
● Clients are competing for the same files
● Clients misbehave
● Hard problems in distributed systems generally, and especially hard when we must uphold POSIX semantics designed for local systems.
9. Terminology
● inode: a file. Has unique ID, may be referenced by
one or more dentries.
● dentry: a link between an inode and a directory
● directory: special type of inode that has 0 or more child
dentries
● hard link: many dentries referring to the same inode
● These terms originate from the original (local-disk) filesystems, where they described how a filesystem was represented on disk.
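The relationships are easy to demonstrate on any POSIX filesystem; a minimal sketch with hypothetical file names:
touch file_a
ln file_a file_b                     # hard link: a second dentry for the same inode
stat -c '%i %h %n' file_a file_b     # same inode number for both names, link count 2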
11. CephFS architecture
● Dynamically balanced scale-out metadata
● Inherit flexibility/scalability of RADOS for data
● POSIX compatibility
● Beyond POSIX: Subtree snapshots, recursive statistics
Weil, Sage A., et al. "Ceph: A Scalable, High-Performance Distributed File System." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, 2006.
http://ceph.com/papers/weil-ceph-osdi06.pdf
13. Components
[Diagram: a Linux host mounting CephFS through the ceph.ko kernel client, talking to the Ceph server daemons — monitors, MDS, and OSDs — with file metadata and file data stored in separate RADOS pools.]
14. From application to disk
[Diagram: Application → ceph-fuse / libcephfs / kernel client → client network protocol → ceph-mds → RADOS → disk.]
15. Scaling out FS metadata
● Options for distributing metadata?
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance, ease of implementation
17. Dynamic subtree placement
● Locality: get the dentries in a dir from one MDS
● Support read-heavy workloads by replicating non-authoritative copies (cached with capabilities, just like clients do)
● In practice, work at the directory-fragment level to handle large dirs
18. Data placement
● Stripe file contents across RADOS objects
– Get full RADOS cluster bandwidth from clients
– Delegate all placement/balancing to RADOS
● Control striping with layout vxattrs (example below)
– Layouts also select between multiple data pools
● Deletion is a special case: client deletions mark files 'stray'; RADOS delete ops are sent by the MDS
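Striping and pool choice are visible and settable through those vxattrs; a sketch assuming a CephFS mount at /mnt/ceph and a second data pool named fs_data2 (both hypothetical). Note a file's layout can only be changed while it is still empty:
touch /mnt/ceph/newfile
setfattr -n ceph.file.layout.stripe_unit -v 4194304 /mnt/ceph/newfile   # 4 MiB stripe unit
setfattr -n ceph.file.layout.pool -v fs_data2 /mnt/ceph/newfile         # direct data to another pool
getfattr -n ceph.file.layout /mnt/ceph/newfile                          # read the layout back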
19. Clients
● Two implementations:
● ceph-fuse/libcephfs
● kclient
● Interplay with the VFS page cache; efficiency is harder with FUSE (extraneous stats, etc.)
● Client performance matters for single-client workloads
● A slow client can hold up others if it's hogging metadata locks: include clients in troubleshooting
20. Journaling and caching in MDS
● Metadata ops initially journaled to striped journal "file"
in the metadata pool.
● I/O latency on metadata ops is sum of network latency
and journal commit latency.
● Metadata remains pinned in in-memory cache until
expired from journal.
21. Journaling and caching in MDS
● In some workloads we expect almost all metadata to stay in cache; in others it's more of a stream.
● Control cache size with mds_cache_size
● Cache eviction relies on client cooperation
● MDS journal replay not only recovers data but also
warms up cache. Use standby replay to keep that
cache warm.
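Both knobs are ordinary config options; a ceph.conf sketch, where the value and the daemon name mds.b are hypothetical:
[mds]
mds cache size = 1000000          # inodes to pin in cache (default 100000)
[mds.b]
mds standby replay = true         # tail the active MDS journal to keep this standby's cache warm
mds standby for rank = 0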
22. Lookup by inode
● Sometimes we need inode → path mapping:
● Hard links
● NFS handles
● Costly to store this: mitigate by piggybacking paths
(backtraces) onto data objects
● Con: storing metadata to data pool
● Con: extra IOs to set backtraces
● Pro: disaster recovery from data pool
● Future: improve backtrace writing latency?
23. Extra features
● Snapshots (sketch below):
– Exploit RADOS snapshotting for file data
– … plus some clever code in the MDS
– Fast petabyte snapshots
● Recursive statistics:
– Lazily updated
– Access via vxattr
– Avoid spurious client I/O for df
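Both features are driven from an ordinary client mount; a sketch assuming snapshots are enabled and CephFS is mounted at /mnt/ceph (path hypothetical):
mkdir /mnt/ceph/projects/.snap/before-upgrade    # snapshot the subtree
rmdir /mnt/ceph/projects/.snap/before-upgrade    # drop the snapshot again
getfattr -n ceph.dir.rbytes /mnt/ceph/projects   # recursive bytes under the directory
getfattr -n ceph.dir.rfiles /mnt/ceph/projects   # recursive file count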
25. CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data <pg_num>
ceph osd pool create fs_metadata <pg_num>
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
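In practice the kernel mount usually also needs credentials when cephx is enabled; a sketch with hypothetical addresses and paths, plus the FUSE alternative:
mount -t ceph 192.168.0.1:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret
ceph-fuse /mnt/ceph      # FUSE client; reads /etc/ceph/ceph.conf for monitors and keys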
26. Managing CephFS clients
● New in Giant: see hostnames of connected clients
● Client eviction is sometimes important:
● Skip the wait during reconnect phase on MDS restart
● Allow others to access files locked by crashed client
● Use OpTracker to inspect ongoing operations
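Both live on the MDS admin socket; a sketch assuming a daemon named mds.a:
ceph daemon mds.a session ls             # connected clients, with hostnames as of Giant
ceph daemon mds.a dump_ops_in_flight     # OpTracker's view of ongoing operations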
27. CephFS tips
● Choose MDS servers with lots of RAM
● Investigate clients when diagnosing stuck/slow access
● Use recent Ceph and recent kernel
● Use a conservative configuration:
– Single active MDS, plus one standby
– Dedicated MDS server
– Kernel client
– No snapshots, no inline data
29. APP / APP / HOST/VM / CLIENT
[Diagram: the Ceph stack, with a maturity label on each layer.]
● LIBRADOS (AWESOME): a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RADOSGW (AWESOME): a bucket-based REST gateway, compatible with S3 and Swift
● RBD (AWESOME): a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CEPH FS (NEARLY AWESOME): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS (AWESOME): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
30. Towards a production-ready CephFS
● Focus on resilience:
1. Don't corrupt things
2. Stay up
3. Handle the corner cases
4. When something is wrong, tell me
5. Provide the tools to diagnose and fix problems
● Achieve this first within a conservative single-MDS
configuration
31. Giant → Hammer timeframe
● Initial online fsck (a.k.a. forward scrub)
● Online diagnostics (`session ls`, MDS health alerts)
● Journal resilience & tools (cephfs-journal-tool)
● flock in the FUSE client
● Initial soft quota support
● General resilience: full OSDs, full metadata cache
32. FSCK and repair
● Recover from damage:
– Loss of data objects (which files are damaged?)
– Loss of metadata objects (what subtree is damaged?)
● Continuous verification:
– Are recursive stats consistent?
– Does metadata on disk match cache?
– Does file size metadata match data on disk?
● Repair:
– Automatic where possible
– Manual tools to enable support
33. Client management
● Current eviction is not 100% safe against rogue clients
● Update to client protocol to wait for OSD blacklist
● Client metadata
● Initially domain name, mount point
● Extension to other identifiers?
34. Online diagnostics
● Bugs exposed relate to failures of one client to release
resources for another client: “my filesystem is frozen”.
Introduce new health messages:
● “client xyz is failing to respond to cache pressure”
● “client xyz is ignoring capability release messages”
● Add client metadata to allow us to give domain names instead of IP addresses in messages.
● Opaque behavior in the face of dead clients. Introduce
`session ls`
● Which clients does MDS think are stale?
● Identify clients to evict with `session evict`
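A sketch of that workflow on the MDS admin socket, following the upstream docs' session-evict syntax; the daemon name mds.a and session id 4305 are hypothetical:
ceph daemon mds.a session ls               # which clients does the MDS think are stale?
ceph daemon mds.a session evict id=4305    # drop the dead client so others can proceed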
35. Journal resilience
● Bad journal prevents MDS recovery: “my MDS crashes
on startup”:
● Data loss
● Software bugs
● Updated on-disk format to make recovery from
damage easier
● New tool: cephfs-journal-tool
● Inspect the journal, search/filter
● Chop out unwanted entries/regions
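A sketch of a cautious session with the tool; exporting a backup before any splice is the safe habit:
cephfs-journal-tool journal inspect             # report integrity of the journal
cephfs-journal-tool journal export backup.bin   # keep a copy before any surgery
cephfs-journal-tool event get list              # inspect, search, and filter entries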
36. Handling resource limits
● Write a test, see what breaks!
● Full MDS cache:
– Require some free memory to make progress
– Require client cooperation to unpin cache objects
– Anticipate tuning required for cache behaviour: what should we evict?
● Full OSD cluster:
– Require explicit handling to abort with -ENOSPC
– MDS → RADOS flow control: contention between I/O to flush cache and I/O to journal
37. Test, QA, bug fixes
● The answer to “Is CephFS production ready?”
● teuthology test framework:
● Long running/thrashing test
● Third party FS correctness tests
● Python functional tests
● We dogfood CephFS internally
● Various kclient fixes discovered
● Motivation for new health monitoring metrics
● Third party testing is extremely valuable
38. What's next?
● You tell us!
● Recent survey highlighted:
● FSCK hardening
● Multi-MDS hardening
● Quota support
● Which use cases will matter to community?
● Backup
● Hadoop
● NFS/Samba gateway
● Other?
39. Reporting bugs
● Does the most recent development release or kernel
fix your issue?
● What is your configuration? MDS config, Ceph
version, client version, kclient or fuse
● What is your workload?
● Can you reproduce with debug logging enabled?
http://ceph.com/resources/mailing-list-irc/
http://tracker.ceph.com/projects/ceph/issues
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
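Debug logging can be raised at runtime through the admin socket (daemon name mds.a hypothetical; see the log-and-debug link above):
ceph daemon mds.a config set debug_mds 20    # most verbose MDS logging
ceph daemon mds.a config set debug_ms 1      # add messenger-level tracing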
40. Future
● Ceph Developer Summit:
● When: 8 October
● Where: online
● Post-Hammer work:
● Recent survey highlighted multi-MDS, quota support
● Testing with clustered Samba/NFS?
57. A STORAGE REVOLUTION
[Diagram: the traditional stack — support & maintenance, proprietary software, proprietary hardware (computers and disks) — versus the open stack: enterprise products & services, open source software, standard hardware (computers and disks).]
62. RADOS COMPONENTS
OSDs:
● 10s to 10000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● Do not serve stored objects to clients
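A quick way to see both daemon types in a live cluster, using the standard ceph CLI:
ceph mon stat    # monitor membership and quorum state
ceph osd tree    # OSDs laid out in the CRUSH hierarchy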
70. ACCESSING A RADOS CLUSTER
[Diagram: an APPLICATION links LIBRADOS and talks over a socket directly to the monitors and OSDs of the RADOS CLUSTER.]
71. LIBRADOS: RADOS ACCESS FOR APPS
LIBRADOS:
● Direct access to RADOS for applications
● C, C++, Python, PHP, Java, Erlang
● Direct access to storage nodes
● No HTTP overhead
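The bundled rados CLI exercises the same direct path and is handy for smoke tests; the pool and object names below are hypothetical:
rados -p testpool put greeting ./hello.txt   # store a local file as one RADOS object
rados -p testpool stat greeting              # object size and mtime, straight from the OSDs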
73. THE RADOS GATEWAY
[Diagram: APPLICATIONs speak REST to RADOSGW instances, which link LIBRADOS and talk over a socket to the RADOS CLUSTER.]
74. RADOSGW MAKES RADOS WEBBY
RADOSGW:
● REST-based object storage proxy
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
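Accounts and usage accounting are managed with radosgw-admin; a sketch with a hypothetical user:
radosgw-admin user create --uid=demo --display-name="Demo User"   # emits S3-style keys
radosgw-admin usage show --uid=demo                               # per-user usage for billing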
82. SCALABLE METADATA SERVERS
METADATA SERVER:
● Manages metadata for a POSIX-compliant shared filesystem
– Directory hierarchy
– File metadata (owner, timestamps, mode, etc.)
● Stores metadata in RADOS
● Does not serve file data to clients
● Only required for shared filesystem
83. CEPH AND OPENSTACK
[Diagram: OpenStack services — KEYSTONE, CINDER, GLANCE, SWIFT, NOVA — integrate with Ceph: object access through RADOSGW over LIBRADOS, and block devices through LIBRBD in the hypervisor, all backed by the RADOS CLUSTER.]