Scaling Ceph at CERN
Dan van der Ster (email@example.com)
Data and Storage Service Group | CERN IT Department
CERN’s Mission and Tools
● CERN studies the fundamental laws of nature
○ Why do particles have mass?
○ What is our universe made of?
○ Why is there no antimatter left?
○ What was matter like right after the “Big Bang”?
● The Large Hadron Collider (LHC)
○ Built in a 27km long tunnel, ~200m underground
○ Dipole magnets operated at -271°C (1.9K)
○ Particles do ~11’000 turns/sec, 600 million collisions/sec
○ Four main experiments, each the size of a cathedral
○ DAQ systems processing petabytes/sec
Big Data at CERN
Physics Data on CASTOR/EOS
● LHC experiments produce ~10GB/s
User Data on OpenAFS & DFS
● Home directories for 30k users
● Physics analysis development
● Project spaces for applications
Service Data on AFS/NFS
● Databases, admin applications
Tape archival with CASTOR/TSM
● RAW physics outputs
● Desktop/Server backups
Service | Size   | Files
OpenAFS | 290TB  | 2.3B
CASTOR  | 89.0PB | 325M
EOS     | 20.1PB | 160M
IT Evolution at CERN
Cloudifying CERN’s IT infrastructure ...
● Centrally-managed and uniform hardware
○ No more service-specific storage boxes
● OpenStack VMs for most services
○ Building for 100k nodes (mostly for batch processing)
● Attractive desktop storage services
○ Huge demand for a local Dropbox, Google Drive …
● Remote data centre in Budapest
○ More rack space and power, plus disaster recovery
… brings new storage requirements
● Block storage for OpenStack VMs
○ Images and volumes
● Backend storage for existing and new services
○ AFS, NFS, OwnCloud, Data Preservation, ...
● Regional storage
○ Use of our new data centre in Hungary
● Failure tolerance, data checksumming, easy to operate, security, ...
Ceph at CERN
12 racks of disk server quads
Wiebalck / van der Ster -- Building an organic block storage service at CERN with Ceph
Our 3PB Ceph Cluster
5 monitors:
● Dual Intel Xeon L5640 (24 threads incl. HT)
● Dual 1Gig-E NICs (only one connected)
● 2x 2TB Hitachi system disks
● 1x 240GB OCZ Deneva 2 SSD
47 disk servers (1128 OSDs):
● Dual Intel Xeon E5-2650 (32 threads incl. HT)
● Dual 10Gig-E NICs (only one connected)
● 24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
● 3x 2TB Hitachi system disks
Use-Cases Being Evaluated
1. Images and Volumes for OpenStack
2. S3 Storage for Data Preservation / Public
3. Physics data storage for archival and/or analysis
#1 is moving into production. #2 and #3 are more
exploratory at the moment.
OpenStack Volumes & Images
• Glance: using RBD for ~3 months now.
• Only issue was to increase ulimit -n above 1024 (e.g. to 10k)
• Cinder: testing with close colleagues.
• 126 Cinder Volumes attached today – 56TB used
(Plots: growing number of volumes/images with current usage; usual traffic is ~50-100 MB/s, mostly idle)
RBD for OpenStack Volumes
• Before general availability, we need to test and enable qemu iops/bps throttling (see the sketch after this list)
• Otherwise VMs with many IOs can disrupt other VMs
• One ongoing issue is that a few clients are
getting an (infrequent) segfault of qemu during
a VM reboot.
• Happens on VMs with many attached RBDs.
• Difficult to get a complete (16GB) core dump.
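As a rough illustration of the throttling point above, here is a minimal sketch using the libvirt Python bindings to cap a guest disk's IOPS and bandwidth. The domain name, device name, and limit values are made-up examples; in an OpenStack deployment you would normally express this through Cinder QoS specs or flavor settings rather than calling libvirt directly.

import libvirt

# Connect to the local hypervisor (assumes qemu/KVM).
conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("test-vm")  # hypothetical domain name

# Cap the RBD-backed disk at 500 IOPS and 100 MB/s total.
# A value of 0 means "unlimited" for that field.
limits = {
    "total_iops_sec": 500,
    "total_bytes_sec": 100 * 1024 * 1024,
}
dom.setBlockIoTune("vda", limits, libvirt.VIR_DOMAIN_AFFECT_LIVE)

conn.close()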
CASTOR & XRootD/EOS
• Exploring RADOS backend for these two HEP-developed storage systems
• Gateway model, similar to S3 via RADOSGW
• CASTOR needs raw throughput performance (to feed
many tape drives at 250MBps each).
• Striped RWs across many OSDs are important.
• XRootD/EOS may benefit from the highly scalable
namespace to store O(billion) objects
• Bonus: XRootD also offers http/webdav with X509/kerberos,
possibly even fuse mountable.
• Developments are in early stages.
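To give a feel for the gateway model, below is a minimal python-rados sketch that chunks one logical file across many 4MB RADOS objects so the reads and writes fan out over many OSDs. The pool and object names are hypothetical, and a real CASTOR/EOS backend would use librados/libradosstriper from C++ rather than Python.

import rados

CHUNK = 4 * 1024 * 1024  # 4 MB, the same default chunk size RBD uses

def striped_write(ioctx, name, data, chunk=CHUNK):
    # Spread one logical file over many RADOS objects so that the I/O
    # is serviced by many OSDs in parallel.
    for i in range(0, len(data), chunk):
        ioctx.write_full("%s.%08d" % (name, i // chunk), data[i:i + chunk])

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumes an admin keyring
cluster.connect()
ioctx = cluster.open_ioctx("castor")                   # hypothetical pool name
striped_write(ioctx, "run1234/evt.raw", b"\x00" * (10 * CHUNK))
ioctx.close()
cluster.shutdown()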
Operations & Lessons Learned
Configuration and Deployment
• Dumpling 0.67.7
• Fully Puppet-ized
• Automated server deployment,
automated OSD replacement
• Very few custom ceph.conf settings
• Experimenting with the filestore wbthrottle options
• We find that disabling it completely gives better IOPS
• But don’t do this!!!
• In these ~7 months of running the cluster, there have been very few problems:
• No outages
• No data losses/corruptions
• No unfixable performance issues
• Behaves well during stress tests
• But now we’re starting to get real/varied/creative users, and this
brings up many interesting issues...
• “No amount of stress testing can prepare you for real users”
• (point being, don’t take the next slides to be too negative – I’m
just trying to give helpful advice ;)
Latency & Slow Requests
• Best latency we can achieve is 20-40ms
• Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster,
but could in a smaller limited use-case cluster (e.g. for Cinder-only)
• Latency can increase dramatically with heavy usage
• Don’t mix latency-bound and throughput-bound users on the same cluster
• Local processes scanning the disks can hurt performance
• Add /var/lib/ceph to the updatedb PRUNEPATHS
• If you have slow disks like us, you need to understand your disk IO
scheduler – e.g. deadline prefers reads over writes: writes are given a
5 second deadline vs. 500ms for reads! (see the sketch after this list)
• Kernel tuning: vm.* sysctls for dirty page flushing and memory management
• “Something is flushing the buffers, blocking the OSD processes”
• Slow requests: monitor them, eliminate them.
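A small sketch of the deadline-scheduler point above, assuming a hypothetical device name (sdb) and the standard sysfs layout for the deadline scheduler; check the paths and values on your own kernel before changing anything.

import os

DEV = "sdb"  # hypothetical OSD data disk
IOSCHED = "/sys/block/%s/queue/iosched" % DEV

def get_tunable(name):
    # Deadline tunables are plain integers (milliseconds for the *_expire ones).
    with open(os.path.join(IOSCHED, name)) as f:
        return int(f.read().strip())

def set_tunable(name, value):
    with open(os.path.join(IOSCHED, name), "w") as f:
        f.write(str(value))

# Stock defaults: reads expire after 500 ms, writes only after 5000 ms,
# which is why a write-heavy OSD can starve latency-sensitive reads.
print("read_expire  =", get_tunable("read_expire"))
print("write_expire =", get_tunable("write_expire"))

# Illustrative only: bring the write deadline closer to the read deadline.
# set_tunable("write_expire", 1500)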
Life with 250 million objects
• Recently, a user decided to write 250 million 1kB objects
• Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster
being full of RBD images, at least in terms of # objects
• It worked – no big problems from holding this many objects.
• Tested single OSD failure: ~7 hours to backfill, including a
double-backfill glitch that we’re trying to understand.
• But now we want to cleanup, and it is not trivial to remove 250M objects (see the sketch after this list)
• rados rmpool generated quite a load when we rm’d a 3 million object
pool (some OSDs were temporarily marked down).
• Probably due to a mistake in our wbthrottle tuning
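For the cleanup problem, one alternative to rados rmpool is a throttled delete loop; here is a minimal python-rados sketch (the pool name, batch size and sleep interval are illustrative assumptions, not tuned values).

import time
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumes an admin keyring
cluster.connect()
ioctx = cluster.open_ioctx("testpool")                 # hypothetical pool name

deleted = 0
for obj in ioctx.list_objects():
    ioctx.remove_object(obj.key)
    deleted += 1
    if deleted % 1000 == 0:
        time.sleep(1)  # crude throttle so the OSDs are not flooded with deletes

print("removed", deleted, "objects")
ioctx.close()
cluster.shutdown()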
Other backfilling issues
• During a backfilling event (draining a whole server),
we started observing repeated monitor elections
• Caused by the mons’ LevelDBs being so active that the
local SATA disks couldn’t keep up.
• When a mon falls behind, it calls an election
• Could be due to LevelDB compaction…
• We moved /var/lib/ceph/mon to SSDs – no more
elections during backfilling
• Avoid double backfilling when taking an OSD out of the cluster
• Start with ceph osd crush rm instead
• If you mark the OSD out first, then crush rm it, you will
compute a new CRUSH map twice, i.e. backfill twice.
Fun with CRUSH
• CRUSH is simple yet powerful, so it is tempting to
play with the cluster layout
• But once you have non-zero amounts of data, significant
CRUSH changes will lead to massive data movements,
which create extra disk load and may disrupt users.
• Early CRUSH planning is crucial!
• A network switch is a failure domain, so we should
configure CRUSH to replicate across switches,
• But (assuming we don’t have a private cluster network)
that would send all replication traffic via the switch uplinks
• Unclear tradeoff between uptime and performance.
CRUSH & Data distribution
• CRUSH may give your cluster
an uneven data distribution
• An OSD’s used space will
scale with the number of PGs
assigned to it
• After you have designed your cluster, created your pools, and started adding data, check the PG and volume distributions
• Re-weighting OSDs is useful to iron out an uneven distribution
• The hashpspool flag is also important if you have many pools
(Histogram: number of OSDs having N PGs, for pool = volumes)
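A histogram like the one summarized above can be rebuilt with a few lines of Python, assuming you have already extracted the list of OSDs in each PG's up set (for example from ceph pg dump); the pg_up_sets data below is a made-up placeholder.

from collections import Counter

# Hypothetical input: one entry per PG in the volumes pool, each listing the
# OSD ids in that PG's up set (e.g. parsed from `ceph pg dump`).
pg_up_sets = [
    [12, 407, 881],
    [12, 233, 990],
    # ...
]

pgs_per_osd = Counter(osd for up in pg_up_sets for osd in up)
histogram = Counter(pgs_per_osd.values())

for n_pgs, n_osds in sorted(histogram.items()):
    print("%4d OSDs hold %d PGs" % (n_osds, n_pgs))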
RBD Reliability with 3 Replicas
• RBD devices are chunked across thousands of objects:
• A full 1TB volume is composed of 250,000 4MB objects
• If any single object is lost, the whole RBD can be considered to be corrupted
(obviously, it depends which blocks are lost!)
• If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
• Our incorrect & irrational fears:
• Any simultaneous triple disk failure in the cluster would lead to objects being
lost – and somehow all RBDs would be corrupted.
• As we add OSDs to the cluster, the data gets spread wider, and the chances of
RBD data loss increase.
• But this is wrong!!
• The only triple disk failures that can lead to data loss are those combinations
actively used by PGs – so having e.g. 4096 PGs for RBDs means that only
4096 combinations out of the 10^9 possible combinations matter.
• P(loss) ≈ N_PGs * P_diskfailure^3 / 3! (worked example after this list)
• We use 4 replicas for the RBD volumes, but this is probably overkill.
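To put a number on the estimate above, here is a back-of-the-envelope calculation of N_PGs * P_diskfailure^3 / 3! in Python; the per-disk failure probability is an illustrative assumption, not a measured value.

import math

n_pgs = 4096    # PGs in the RBD pool (from the slide)
n_osds = 1128   # OSDs in the cluster (from the slide)

# Illustrative assumption: probability that a given disk fails within the
# window needed to re-replicate its data (a few hours of backfilling).
p_disk_failure = 1e-4

# Only the OSD triples actually used by PGs matter, not all ~n_osds^3 of them.
fraction_relevant = n_pgs / n_osds**3
p_loss = n_pgs * p_disk_failure**3 / math.factorial(3)

print("relevant fraction of triples: %.2e" % fraction_relevant)
print("P(some PG loses all replicas) ~ %.2e" % p_loss)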
Trust your clients
• There is no server-side per-client throttling
• A few nasty clients can overwhelm an OSD, leading to slow requests
• When you have a high load / slow requests, it is not always
trivial to identify and blacklist/firewall the misbehaving client
• Could use some help in the monitoring: per-client perf stats?
• One of our creative users found a way to make the mons generate 5 × 40 MB/s of outbound network traffic
• Could saturate the mon network, lead to disruptions
• RADOS is not for end-users. A cephx keyring is for trusted
persons only, not for Joe Random User.
• A healthy cluster is always vulnerable to human errors
• We’ve thus far avoided any big mistakes
• Used PG splitting to grow a pool from 8 to 2048 PGs
• Leads to unresponsive OSDs which get marked down → degraded objects
• Safer & now-enforced to grow in 2x or 4x steps
• ulimits, ulimits, ulimits
• With a large number of OSDs (say, more than 500), you will hit num
file and num process limits everywhere (see the sketch after this list):
• Glance, qemu, radosgw, ceph/rados CLI, …
• If you use XFS, don’t put your OSD journal as a file on the disk
• Use a separate partition, the first partition!
• We still need to reinstall our whole cluster to re-partition the OSDs
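As a small illustration of the ulimit point above, a client process (or a wrapper around one) can check and raise its own open-files limit with the standard resource module; the target value below is an assumption, and the hard limit itself still has to be raised system-wide (e.g. in limits.conf) by the administrator.

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d hard=%d" % (soft, hard))

# With O(1000) OSDs a librados client can hold thousands of sockets,
# so the common default soft limit of 1024 is far too low.
target = 16384  # illustrative value
if soft < target:
    # Userspace can only raise the soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(target, hard), hard))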
Scale up and out
• Scale up: we are demonstrating the viability of a
3PB cluster with O(1000) OSDs.
• What about 10,000 or 100,000 OSDs?
• What about 10,000 or 100,000 clients?
• Running many Ceph instances is always an option, but not ideal
• Scale out: our growing data centre in Budapest
brings many options:
• Replicate over the WAN (though, 30ms RTT)
• Tiering / caching pools (a new feature; we need to gain experience with it)
• Data locality – direct IOs to nearby replica or caching pool
• CERN IT infrastructure is undergoing a private cloud revolution, and Ceph is providing the block storage behind it
• Our CASTOR and XRootD physics data use-cases may exploit RADOS for improved scalability
• In seven months with a 3PB cluster, we’ve not had any disasters. Actually, it’s working quite well
• I presented some lessons learned; I hope they prove useful in your Ceph explorations