A description of the Sanger Institute's journey with OpenStack to date, covering RHOSP, Ceph, S3, user applications, and future plans. Given at the Sanger Institute's OpenStack Day.
3. What I’ll talk about
● The Sanger Institute
● Motivations for using OpenStack
● Our journey
● Some decisions we made (and why)
● Some problems we encountered (and how we addressed them)
● Projects that are using it so far
● Next steps
4. The Sanger Institute
LSF 9
~10,000 cores in main compute farm
~10,000 cores across smaller project-specific farms
13PB Lustre storage
Almost everything is available everywhere - “isolation” is based on POSIX file
permissions
5. Motivations
LSF great for HPC utilization but…
● It doesn’t address data size/sharing/locality
● It’s quicker to move an image (or an image definition) to the data
○ benefit from existing data security arrangements
○ benefit from tenant isolation
LSF isn’t going away - complementary to cloud-style computing
6. Our journey
● 2015, June: sysadmin training
● July: experiments with RHOSP6 (Juno)
● August: RHOSP7 (Kilo) released
● December: pilot “beta” system opened to testers
● 2016, first half: Science As A Service
● July: pilot “gamma” system opened using proper Ceph hardware
● August: datacentre shutdown
● September: production system hardware installation
● 2017, January: “delta” system opened to early adopters
● February: Sanger Flexible Compute Platform announced
7. Science As A Service
First half of 2016
Proof-of-concept of a user-friendly orchestration portal (CloudForms) on top
of OpenStack and VMware
Consultancy and development input from Red Hat
Presented at Scientific Working Group in Barcelona summit, October 2016
11. Hardware
We approached current vendors, and Supermicro via BIOS-IT
Wanted to get most bang for buck
Arista provided seed switch kit and offered VXLAN support
13. Production OpenStack (1)
• 107 Compute nodes (Supermicro) each with:
• 512GB of RAM, 2 * 25Gb/s network interfaces
• 1 * 960GB local SSD, 2 * Intel E5-2690 v4 (14 cores @ 2.6GHz)
• 6 Control nodes (Supermicro) allow 2 openstack deployments
• 256GB RAM, 2 * 100Gb/s network interfaces
• 1 * 120GB local SSD, 1 * Intel P3600 NVMe (/var)
• 2 * Intel E5-2690 v4 (14 cores @ 2.6GHz)
• Total of 53 TB of RAM, 2996 cores, 5992 with hyperthreading
• RHOSP8 (Liberty) deployed with TripleO
14. Production OpenStack (2)
• 9 Storage nodes (Supermicro) each with:
• 512GB of RAM
• 2 * 100Gb/s network interfaces
• 60 * 6TB SAS discs, 2 system SSDs
• 2 * Intel E5-2690 v4 (14 cores @ 2.6GHz)
• 4TB of Intel P3600 NVMe used for journal
• Ubuntu Xenial
• 3PB of raw disc space, 1PB usable (3-way replication)
• Single instance (1.3 GBytes/sec write, 200 MBytes/sec read)
• Ceph benchmarks imply 7 GBytes/sec
15. Production OpenStack (3)
• 3 racks of equipment, 24 KW load per rack
• 10 Arista 7060CX-32S switches
• 1U, 32 * 100Gb/s -> 128 * 25Gb/s
• Hardware VXLAN support integrated with OpenStack *
• Layer two traffic limited to rack, VXLAN used inter-rack
• Layer three between racks and interconnect to legacy systems
• All network switch software can be upgraded without disruption
• True Linux systems
• 400 Gb/s from racks to spine, 160 Gb/s from spine to legacy systems
* VXLAN in the ML2 plugin not used in first iteration because of software issues
16. OpenStack installation
RHOSP vs Packstack vs …
• Paid-for support from Red Hat
• Terminology confusion: TripleO undercloud and overcloud
• Need wellness checks of undercloud and overcloud before each
(re)deploy
• Keep deployment configuration in git and deploy with a script for
consistency
18. Ceph installation
Integrated or standalone?
• Deployment by RHOSP is easier but is tied to that OpenStack
• A separate self-supported Ceph was more cost effective and a
better fit for staff knowledge at the time
• It’s possible to share a Ceph between multiple OpenStacks
• ceph-ansible is seductive but brings some headaches
• e.g. --check causes problems like changing the fsid
19. Networking
We wanted VXLAN support in switches to enable metal-as-a-service
Unfortunately we’re not there yet…
e.g. ML2 driver bugs: “reserved” is not a valid UUID
We currently have VXLAN double encapsulation
21. Puppet or what?
We chose to use Ansible
• There’s only a single Puppet post-deploy hook
• Wider strategic use of Ansible within Sanger IT
• Keep configuration in git
22. Our customisations
• scheduler tweaks (stack not spread, CPU/RAM overcommit)
• hypervisor tweaks (instance root disk on Ceph or hypervisor)
• enable SSL for Horizon and API
• change syslog destination
• add “MOTD” to Horizon login page
• change session timeouts
• register systems with Red Hat
• and more...
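As an illustration, the overcommit and stack-not-spread scheduler tweaks are plain nova.conf settings; the values below are examples, not our production figures:

```
[DEFAULT]
# Overcommit ratios (illustrative values)
cpu_allocation_ratio = 2.0
ram_allocation_ratio = 1.0
# A negative RAM weight makes the scheduler stack instances onto
# already-loaded hosts instead of spreading them across the farm
ram_weight_multiplier = -1.0
```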
23. Customisation pitfalls
Some customisations become obsolete when moving to a newer
version of OpenStack - can’t blindly carry them forward
A redeploy (e.g. to add compute nodes) overwrites configuration so
the customisations need to be reapplied - and there’s a window when
they’re absent
Restarting too many services too quickly upsets HAproxy, rabbitmq...
24. Flavours and host aggregates
Three main flavour types:
1. Standard “m1.*”
• True cloud-style compute; root disk on hypervisor; 90% of compute
nodes
2. Ceph “c1.*”
• Root disk on Ceph allows live migration; 6 compute nodes support this
3. Reserved “h1.*”
• Limited to tenants running essential availability services
25. Flavours and host aggregates
Per-project flavours:
• For Cancer group “k1.*”
• True cloud-style compute, like “m1.*”
• Sized to fit two instances on each hypervisor: half the disk, half the CPUs,
half the RAM
• Trying to prevent Ceph “double load” caused by data movement:
Ceph→S3→instance→Cinder volume→Ceph
• Only viable with homogeneous hypervisors and known/predictable
resource requirements
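The sizing rule is simple arithmetic; a sketch of it (hypervisor figures taken from the compute-node slides, the helper function is our own illustration):

```python
# Sketch of the "two instances per hypervisor" k1.* sizing rule.
# With hyperthreading each compute node presents 56 vCPUs.
HYPERVISOR = {"vcpus": 56, "ram_gb": 512, "disk_gb": 960}

def half_flavour(hv):
    """Half the CPUs, half the RAM, half the disk."""
    return {k: v // 2 for k, v in hv.items()}

print(half_flavour(HYPERVISOR))
# {'vcpus': 28, 'ram_gb': 256, 'disk_gb': 480}
```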
26. Deployment thoughts
“Premature optimisation is the root of all evil” - Knuth
“Get it working, then make it faster” - my boss Pete
“Keep it simple (because I’m) stupid” - me
Turn off h/w acceleration (10GbE offloads guilty until proven innocent)
Find some enthusiastic early adopters to shake the problems out
Deploy, monitor, tweak, rinse, repeat
28. Metrics
Find the balance between
“if it moves, graph it”
and
“don’t overload the metrics server”
50,000 metrics every 10 seconds is optimistic
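A quick back-of-envelope calculation (using the figures on this slide) shows why:

```python
# 50,000 distinct metrics flushed every 10 seconds
metrics = 50_000
interval_s = 10

writes_per_second = metrics // interval_s    # 5,000 datapoints/s
points_per_day = writes_per_second * 86_400  # 432,000,000 datapoints/day

print(writes_per_second, points_per_day)
```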
29. Architecture
We’re using collectd → graphite/carbon → grafana
Modular plugins make it easy to record new metrics e.g.
entropy_avail
Using the collectd libvirt plugin means new instances are
automatically measured
...although the automatic naming isn’t great:
openstack_flex2.instance-00000097_bbb85e84-6c0c-4fe8-9b3c-db17a665e7ef.libvirt.virt_cpu_total
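A small helper can at least pull the instance UUID back out of such a path for friendlier dashboards (the naming scheme is as observed above; the function itself is our own sketch):

```python
def instance_uuid(metric_path):
    """Extract the Nova instance UUID from a collectd libvirt metric
    path of the form <host>.<domain>_<uuid>.libvirt.<metric>."""
    host_part = metric_path.split(".")[1]  # e.g. instance-00000097_<uuid>
    _, uuid = host_part.split("_", 1)      # drop the libvirt domain name
    return uuid

m = ("openstack_flex2.instance-00000097_"
     "bbb85e84-6c0c-4fe8-9b3c-db17a665e7ef.libvirt.virt_cpu_total")
print(instance_uuid(m))
# bbb85e84-6c0c-4fe8-9b3c-db17a665e7ef
```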
34. Logging
We wanted something like Splunk
...but without the £££
We’re using ELK
Today as a syslog destination; planning to use rsyslog to watch
OpenStack component log files
35. Monitoring
Bare minimum in Opsview (Nagios)
• Horizon and API availability
• Controllers up
• radosgw S3 availability
• Ceph nodes up
We’d like hardware status reporting but Supermicro IPMI is not helpful
37. “Space,” it says, “is big. Really big. You just won't believe how vastly,
hugely, mindbogglingly big it is.”
There’s a substantial learning curve to OpenStack for both admins and
developers
38. Problems with Docker
Docker likes to use 172.17.0.0/16 for its bridge network
Sanger uses 172.16.0.0/12 for its internal network
...oh.
Also problems with bridge MTU > instance MTU and PMTUD not
working. Fix: --bip=192.168.3.3/24 --mtu=1400
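The same fix can live in /etc/docker/daemon.json rather than on the daemon command line (same example values as above):

```
{
  "bip": "192.168.3.3/24",
  "mtu": 1400
}
```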
39. Problems with radosgw
Ceph radosgw implements most but not all AWS S3 features
ACLs are implemented, policies are not
We’re trying to implement a write-only bucket using nginx as a proxy
to rewrite the auth header
40. Problems with DHCP
On Ceph nodes, Ubuntu DHCP client doesn’t request a default
gateway
Infoblox DHCP server sends Classless Static Routes option
DHCP client can override a server-supplied value but not ignore it
The Ceph nodes’ default route ends up pointing down the 1GbE
management NIC not the 2x100GbE bond
...oh.
41. Problems with rabbitmq
rabbitmq partitions are really painful
We sometimes end up rebooting all the controllers - there must be a
better way
Fortunately running instances aren’t affected
42. Problems with deployment
Running the overcloud deployment from the wrong directory is
very bad
The deployer doesn’t find the file containing the service
passwords and proceeds to change them all, which is very tedious
to recover from
The deployment script really really really needs to have
cd ~stack
to prevent accidents
43. Problems with cinder
When a volume is destroyed, cinder overwrites the volume with
zeroes
If a user is running a pipeline which creates and destroys many 1TB
volumes this produces a lot of I/O
Consider setting volume_clear and/or volume_clear_size in
cinder.conf
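For reference, the relevant cinder.conf settings look like this (values are illustrative; "zero" is the default behaviour):

```
[DEFAULT]
# "none" skips the wipe entirely on delete
volume_clear = none
# ...or keep zeroing but only wipe the first N MiB of each volume
# volume_clear_size = 100
```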
45. Prostate cancer analysis
Pan-Prostate builds on previous Pan-Cancer work
Multiple participating institutes using Docker to provide a consistent
analysis framework
In the past that required admin time to build an isolated network,
now OpenStack gives us that for free - and lets the scientists drive it
themselves
48. wr - Workflow Runner
Reimplementation of Vertebrate Resequencing Group’s pipeline
manager in Go
Designed to be fast, powerful and easy to use
Can manage LSF like existing version, and adds OpenStack
https://github.com/VertebrateResequencing/wr
50. wr - Workflow Runner
Lessons learned:
• “There’s a surprising amount of stuff you have to do to get
everything working well”
• There are annoying gaps in the Go SDK
• Lots of things can go wrong if end users bring up servers, so handle
all the details for them
51. New Pipeline Group
Using s3fs as a shim on top of radosgw S3 speeds development
s3fs presents a bucket as a filesystem (but it’s turtles all the way
down)
In tests, launching up to 240 instances, for read-only access to a few
GB of reference sequence data, with caching turned on: up to ~8
might get stuck
52. Human Genetics Informatics
Working towards a production Arvados system
Speedbumps around many tools/SDKs assuming real AWS S3, not
some S3-alike
Sending patches to open-source projects (Packer, Terraform…)
54. More Ceph
...because 1PB isn’t enough…
This has implications for DC placement (due to cooling requirements)
and Ceph CRUSH map (to ensure data replicas are properly
separated)
Should we split rbd pools from radosgw pools?
55. OpenStack version upgrade
We will probably skip to RHOSP10 (Newton)
Need Arista driver integrations for VXLAN for metal-as-a-service
We will install a new system alongside the current one and migrate
users and then compute nodes
56. $THING-as-a-service
metal - deploy instance on bare-metal (Ironic)
key management (Barbican) to enable encrypted volumes
DNS (Designate)
shared filesystem (Manila)
…though many of these can already be achieved with creative use of
images/heat/user-data
57. Federation
JISC Assent looks interesting
Lots of internal process to work through first
Open questions about:
• scheduling - pre-emptible instances would help
• charging - market-based instance pricing?
58. Lustre
We have 13PB of Lustre storage
Consider exposing some of it to tenants using Lustre routers, NID
mapping and sub-mounts
59. Little things
• expose hypervisor RNG to instances
• could make instance key generation go faster
• have LogStash report metrics of “log per host”
• to spot log volume anomalies
• ...
60. Thanks
My colleagues at Sanger - both in Systems and across the institute
The OpenStack community
Helpful people on mailing lists