3. Agenda
First Ceph Environment at Target went live in October of 2014
• “Firefly” Release
Ceph was backing Target’s first ‘official’ Openstack release
• Icehouse Based
• Ceph is used for:
• RBD for Openstack Instances and Volumes
• RADOSGW for Object (instead of Swift)
• RBD backing Ceilometer MongoDB volumes
• Currently DEV is largest environment with ~1700 instances
Replaced traditional array-based approach that was implemented in our
prototype Havana environment.
• Traditional storage model was problematic to integrate
• Maintenance/purchase costs from array vendors can get prohibitive
• Traditional SAN just doesn’t ‘feel’ right in this space.
• Ceph’s tight integration with Openstack made it a natural fit
4. Agenda
4
@
Initial Ceph Deployment:
• 3 x Monitor Nodes – Cisco B200
• 12 x OSD Nodes – Cisco C240 LFF
• 12 x 4TB SATA disks
• 10 OSD per server
• Journal partition co-located on each OSD disk
• 120 OSD Total = ~ 400 TB
• 2 x 10GbE per host
• 1 public_network
• 1 cluster_network (a minimal ceph.conf sketch of this split follows the list)
• Basic LSI ‘MegaRaid’ controller – SAS 2008M-8i
• No supercap or cache capability onboard
• 10xRAID0
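For reference, the public/cluster split above maps onto a ceph.conf fragment like the minimal sketch below; the subnets are placeholders, not our actual ranges:
[global]
# front-side network used by clients, monitors and RADOSGW (placeholder subnet)
public_network = 10.0.0.0/24
# back-side network used for OSD replication and recovery traffic (placeholder subnet)
cluster_network = 10.0.1.0/24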
5. Post-rollout, it became evident that there were performance issues within the environment.
• KitchenCI users would complain of slow Chef converge times
• Yum transactions / app deployments would take abnormal amounts of time to
complete.
• Instance boot times, especially for images using cloud-init, would be excessively long
• General user griping about ‘slowness’
Lesson #1 –
Instrument Your Deployment!
Track statistics/metrics that have a real impact on your users
6. Unacceptable levels of latency even while the cluster was under relatively little load
High levels of CPU IOWait% on the OSD servers & IDLE Openstack Instances
Poor IOPS / Latency - FIO benchmarks running INSIDE Openstack Instances
$ fio --rw=write --ioengine=libaio --runtime=100 --direct=1 --bs=4k --size=10G --iodepth=32 --name=/tmp/testfile.bin
test: (groupid=0, jobs=1): err= 0: pid=1914
read : io=1542.5MB, bw=452383 B/s, iops=110 , runt=3575104msec
write: io=527036KB, bw=150956 B/s, iops=36 , runt=3575104msec
Having more effective instrumentation from the outset would have revealed obvious
problems with our architecture
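For a sense of what instrumentation means in practice, even the stock Ceph and Linux tooling would have surfaced this; a rough sketch (the OSD id is illustrative, and the admin socket command must run on the host carrying that OSD):
$ ceph osd perf # per-OSD commit/apply latency in milliseconds
$ ceph daemon osd.0 perf dump # full counter dump via the OSD admin socket
$ iostat -x 5 # device-level await/%util on the OSD servers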
7. Compounding the performance issues, we began to see mysterious reliability issues.
• OSDs would randomly fall offline
• Cluster would enter a HEALTH_ERR state about once a week with ‘unfound objects’ and/or inconsistent placement groups (PGs) that required manual intervention to fix.
• These problems were usually coupled with a large drop in our already suspect
performance levels
Lesson #2 –
Do your research on the hardware your server vendor provides!
Don’t just blindly accept whatever they have lying around – be proactive!
8. • Root cause of the HEALTH_ERRs was the “unnamed vendor’s” SATA drives in our solution ‘soft-failing’ – slowly accumulating media errors without ever reporting themselves as failed. Don’t rely on SMART alone. Interrogate your disks with an array-level tool like LSI’s MegaCli to identify drives for proactive replacement (a simple sweep built on this check is sketched after this list).
$ /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep Media
• In installations with co-located journal partitions, a RAID solution with
cache+BBU for writeback operation would have been a huge performance gain.
Paying more attention to the suitability of hardware our vendor of choice provided
would have saved a lot of headaches
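As noted above, a rough sketch of turning that MegaCli check into a proactive sweep; treating any non-zero media error count as a replacement candidate is our assumption, so tune the threshold to your fleet:
# list the media error counter for every physical drive, then keep only the non-zero ones
$ /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep "Media Error Count" | grep -vw 0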
9. Which leads us to –
Lesson #3 –
Ceph is not magic. It does the best
with the hardware you give it!
There is much ill-advised guidance floating around claiming that if you throw enough crappy disks at Ceph you will achieve enterprise-grade performance. Garbage in – garbage out. Don’t be greedy and build for capacity if your objective is to create a more performant block storage solution.
10. Agenda
New Ceph OSD Deployment:
• 5 x OSD Nodes – Cisco C240M3 SFF
• 18 x 1.1TB 10k Seagate SAS disks
• 6 x 480GB Intel S3500 SSDs
• I like Intel SSDs for use with Ceph. There is a huge disparity in performance between SSD vendors.
• 18 OSD per server
• Journal partitions on SSD at a 4:1 or 5:1 OSD-to-journal ratio (see the ceph-deploy sketch after this list)
• 90 OSD Total = ~ 100 TB
• Improved LSI ‘MegaRaid’ controller – SAS-9271-8i
• Supercap
• Writeback capability
• 18xRAID0
• Write-through on journals, write-back on spinning OSDs
• Still experimenting with this! Write-back seems to help on systems without JBOD-mode adapters!
• The UCS M4 generation finally has cards that support LSI’s ‘IT mode’ (JBOD)!
• Based on “Hammer” Ceph Release
Lessons learned – we set out to rebuild
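For context, standing up an OSD with its journal on a separate SSD partition in the Hammer era goes through ceph-deploy (ceph-disk under the covers); the host and device names below are placeholders rather than our actual layout:
# prepare and activate an OSD on /dev/sdb with its journal on SSD partition /dev/sds1
$ ceph-deploy osd create osd-node-01:/dev/sdb:/dev/sds1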
11. • Obtaining metrics from our design change was nearly immediate due to having effective monitoring in place
– Latency improvements have been extreme
– IOWait% within Openstack instances has been greatly reduced
– Raw IOPS throughput has skyrocketed
• Testing Ceilometer backed by MongoDB on kRBD, I’ve seen this 5-node / 90-OSD cluster spike to ~25k IOPS
– Throughput testing with RADOS bench and FIO shows an approx. 10-fold increase (example invocations follow the results below)
– User feedback has been extremely positive; the general Openstack experience at Target is much improved.
– Performance within Openstack instances has increased about 10x
Results
Before (original cluster):
test: (groupid=0, jobs=1): err= 0: pid=1914
read : io=1542.5MB, bw=452383 B/s,
iops=110 , runt=3575104msec
write: io=527036KB, bw=150956 B/s,
iops=36 , runt=3575104msec
After (rebuilt cluster):
test: (groupid=0, jobs=1): err= 0: pid=2131
read : io=2046.6MB, bw=11649KB/s,
iops=2912 , runt=179853msec
write: io=2049.1MB, bw=11671KB/s,
iops=2917 , runt=179853msec
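For reference, the RADOS bench side of that comparison looks roughly like the invocations below; the pool name and runtime are illustrative:
$ rados bench -p rbd 60 write --no-cleanup # 60-second write test, keep the objects for the read pass
$ rados bench -p rbd 60 seq # sequential read test against the objects just written
$ rados -p rbd cleanup # remove the benchmark objects when finished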
12.
13. • Before embarking on creating a Ceph environment, have a good idea of what
your objectives are for the environment.
– Capacity?
– Performance?
• If you make the wrong decisions, it can lead to a negative user perception of Ceph and of the technologies that depend on it, like Openstack
• Once you understand your objective, understand that your hardware selection is
crucial to your success
• Unless you are architecting for raw capacity, use SSDs for your journal volumes
without exception
– If you must co-locate journals, use a RAID adapter with BBU+Writeback cache
• A hybrid approach may be feasible with SATA ‘capacity’ disks and SSD journals. I’ve yet to try this, but I’d be interested in seeing benchmark data on a setup like this
• Research, experiment, consult with Red Hat / Inktank
• Monitor, monitor, monitor and provide a very short feedback loop for your users
to engage you with their concerns
Conclusion
14. • Looking to test all SSD pool performance
– All SSD in Ceph has been maturing rapidly
– We have needs for an ‘ultra’ Cinder tier for workloads that require high IOPS / low latency, for use cases such as Kafka and Cassandra
– Also considering Solidfire for this use case
– If anyone has experience with this – I’d love to hear about it!
• Repurposing legacy SATA hardware into a dedicated object pool
– High capacity, low performance drives should work well in an object use case – more
research is needed into end-user requirements
• Automate deployment with Chef to bring parity with our Openstack automation
• Broadening Ceph beyond the cloud niche use case, especially with the improved object offering.
• Repurpose ‘capacity’ frames
– Video archiving for security camera footage
– Register / POS log archiving
Next Steps
15. • Plan time into your deployment schedule to iron out dependency hell, especially if you are moving from Inktank packages to Red Hat packages
• In Hammer, you no longer have to use Apache and the FastCGI shim for RADOSGW object service. Enable civetweb with the following entry in the [client.radosgw.gateway] section of ceph.conf and make sure you shut off Apache!
– rgw_frontends = "civetweb port=80"
• Use the new and improved CRUSH tunables. This WILL trigger a lot of rebalancing activity!
– $ ceph osd crush tunables optimal
• In the [osd] section of ceph.conf set the following directive. This prevents new
OSDs from triggering rebalancing. Nope, setting NOIN won’t do the trick!
– osd_crush_update_on_start = false
• Ceph’s default recovery settings are far too aggressive. Tone them down with the following in the [osd] section or they will impact client IO (a consolidated ceph.conf sketch follows this slide)
osd_max_backfills = 1
osd_recovery_op_priority = 1
osd_client_op_priority = 63
osd_recovery_max_active = 1
osd_recovery_max_single_start = 1
General Tips on Migrations/Upgrade to Hammer
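Pulled together, the directives on this slide land in ceph.conf roughly as sketched below; the ceph tell line is one way to push the recovery throttles to already-running OSDs without a restart (values simply mirror the ones above):
[client.radosgw.gateway]
rgw_frontends = "civetweb port=80"
[osd]
osd_crush_update_on_start = false
osd_max_backfills = 1
osd_recovery_op_priority = 1
osd_client_op_priority = 63
osd_recovery_max_active = 1
osd_recovery_max_single_start = 1
$ ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'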
16. • The best method to ‘drain’ hosts is to adjust the CRUSH weight of the OSDs on those hosts, NOT the OSD weight.
– CRUSH weight dictates cluster-wide data distribution. OSD weights only impact the
host the OSD is on and can cause unpredictability.
• Don’t work serially host by host – drop the CRUSH weight of all the OSDs you are removing across the cluster simultaneously. I used a ‘reduce by 50% and allow recovery’ scheme; your mileage may vary (the pacing is sketched after the commands below).
$ for i in {0..119}; do ceph osd crush reweight osd.$i 3.0; done
$ for i in {0..119}; do ceph osd crush reweight osd.$i 1.5; done
$ for i in {0..119}; do ceph osd crush reweight osd.$i .75; done
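A sketch of how each pass can be paced, assuming you simply want to block until the cluster reports healthy again before the next reduction (the sleep interval is arbitrary):
# between reweight passes, wait for recovery to settle
$ while ! ceph health | grep -q HEALTH_OK; do sleep 60; done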
• Consider numad to auto-magically set NUMA affinities.
– Still experimenting with the impact of this on cluster performance.
• Last but not least – VERY Important. You WILL run out of threads and OSDs
WILL crash if you don’t tune the kernel.pid_max value – especially in servers
with > 12 OSDs
$ echo "kernel.pid_max = 4194303" >> /etc/sysctl.conf