NodeWeaver is an OpenNebula-based hyperconverged platform designed to keep running despite massive hardware, software and networking faults. The talk covers the kinds of issues we faced preparing OpenNebula to run in the strangest places (like behind an NMR machine, or in a pit in the desert), how to test it in ways that could be featured in a horror film, and what OpenNebula allows us to do that would be difficult on other platforms.
YouTube: https://youtu.be/G75unWZGMQE
5. ● Ensuring that the platform runs well in uncontrolled
environments requires some attention to design (focused on the
target) and lots of testing
● Some basic principles:
○ “perfection is finally attained not when there is no longer
anything to add, but when there is no longer anything to take
away” - Antoine de Saint-Exupéry
○ Complexity may be necessary at scale, but not for every
application. Every piece that is added may break at some
point
6. Source: Werner Vogels, Real-time graph of microservice dependencies at http://amazon.com in 2008.
7. ● If you ever ask the user for something, she becomes part of the
system to be tested! …
● … which means that in principle, you should never ask the user for
information that may be obtained in some other (automated) way
● The user may not understand, may not be there, may be confused
by all the knobs and dials, or may be deliberately destructive
8. ● Testing must be done on the complete system -
software+hardware+configs …
● … because software faults are more common than hardware ones
● Faults are complex: stop, corruption, limping…
● Trust only what you measure (as Grace Hopper said: "One
accurate measurement is worth a thousand expert opinions.")
9. ● We model our system as a Petri Net
● We run a group of NodeWeaver images (within NodeWeaver),
each with a set of disks attached to emulate local storage and
multiple virtual Ethernet links
● Within each emulated node, we run a small set of CentOS images
that receive, through contextualization, the number of FIO runs and
the kind of emulated workload (a sketch of the guest-side script
follows this list)
● And we run our little chaos monkey process (actually, some bash
scripts)
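A minimal sketch of what the guest-side script could look like, assuming the contextualization variables are named FIO_RUNS and FIO_WORKLOAD (those names, the file paths, and the FIO parameters are illustrative, not the actual ones from the talk):

    #!/bin/bash
    # Mount the OpenNebula context CD and source the contextualization variables
    mount -o ro /dev/cdrom /mnt
    . /mnt/context.sh            # assumed to define FIO_RUNS and FIO_WORKLOAD

    # Run the requested number of FIO passes with the requested I/O pattern
    for i in $(seq 1 "${FIO_RUNS:-1}"); do
        fio --name="chaos-$i" --filename=/data/fio.test --size=1G \
            --rw="${FIO_WORKLOAD:-randrw}" --bs=4k --iodepth=8 \
            --direct=1 --runtime=60 --time_based
    done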
10. ● Disks:
○ detach disk, then destroy it
○ detach disk, then attach an empty disk
○ detach, wait (random), then reattach
○ inject random data into a random file within the disk image
○ inject random data into the disk image
● Network: virsh domif-setlink (up, down) to simulate a faulty cable
(hint: https://dev.opennebula.org/issues/3219 pretty pleeeease... )
● Virtual node: hard reset + full cluster reset (see the combined
fault-injection sketch after this list)
● Future: wrong BIOS clock (through qemu -rtc base=XXXX), IPMI
emulation, packet loss/latency/bandwidth (through netem: only
25MB!)
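The fault injector itself can be a short bash loop that picks one of these actions at random; a sketch under assumed names (the libvirt domain, disk target, interface and image path are placeholders):

    #!/bin/bash
    DOM="nodeweaver-3"                       # victim domain (placeholder)
    IMG="/var/lib/images/node3-data.img"     # its backing image (placeholder)

    case $((RANDOM % 5)) in
      0) # detach a disk, wait a random time, then reattach it
         virsh detach-disk "$DOM" vdb
         sleep $((RANDOM % 120))
         virsh attach-disk "$DOM" "$IMG" vdb ;;
      1) # inject random data into a random 4k block of the disk image
         dd if=/dev/urandom of="$IMG" bs=4k count=1 \
            seek=$((RANDOM % 25000)) conv=notrunc ;;
      2) # flap the virtual link to simulate a faulty cable
         virsh domif-setlink "$DOM" vnet0 down
         sleep $((RANDOM % 30))
         virsh domif-setlink "$DOM" vnet0 up ;;
      3) # hard reset the virtual node
         virsh reset "$DOM" ;;
      4) # netem-style degradation: add latency and packet loss on the link
         tc qdisc replace dev vnet0 root netem delay 100ms loss 5% ;;
    esac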
11. ● What we discovered:
○ The underlying filesystem is hugely important
○ EXT4 handles most of it, XFS works (but recovery may be very
slow), BTRFS dies in horrible ways, ZFS barely notices
○ Using MySQL as the OpenNebula DB: roughly every 25 crashes it
requires some manual work, and roughly every 150 crashes it
requires non-trivial manual effort
○ Our custom SQLite (with WAL) survives happily (we compensate
for the lack of concurrency with a query sequencer; see the
sketch below)
○ LizardFS is highly tolerant of multiple, parallel failures - disk,
network, whatever
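For reference, enabling WAL is a one-line pragma on the standard OpenNebula SQLite database, and the sequencer idea can be shown in miniature (the flock wrapper below illustrates the concept, it is not our actual implementation):

    # enable write-ahead logging on the OpenNebula database
    sqlite3 /var/lib/one/one.db 'PRAGMA journal_mode=WAL;'

    # sequencer in miniature: serialize all queries behind a single lock
    run_query() {
        flock /var/lock/one-db.lock sqlite3 /var/lib/one/one.db "$1"
    }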
12. ● We took advantage of the exceptionally simple host probes
mechanism to add extra information that is used by the platform
and by the recovery heuristics
● Adding new probes takes very little time and effort - thanks to
OpenNebula's simplicity (see the probe sketch below)
● We continue to add probes (for example, the P-value for
predicted user experience) and use background processes to add
forecasts
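A probe is just an executable that prints KEY=VALUE pairs on stdout, which OpenNebula merges into the host's monitoring attributes. A minimal sketch (the probe directory depends on the OpenNebula version, and the P-value cache file is a hypothetical name):

    #!/bin/bash
    # Install as an executable file in the IM probe directory
    # (e.g. /var/lib/one/remotes/im/kvm-probes.d/ on OpenNebula 4.x)
    echo "CUSTOM_LOAD1=$(cut -d' ' -f1 /proc/loadavg)"

    # hypothetical forecast precomputed by a background process
    [ -r /var/tmp/pvalue ] && echo "PVALUE=$(cat /var/tmp/pvalue)"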
15. ● OpenNebula works exceptionally well under torture, both in
virtual and physical testing
● LizardFS is amazingly resilient (CRC everywhere helps)...
● ...especially on ZFS with its transaction groups
● Chaos-monkey testing does not guarantee that every possible
fault path is tested…
● ...yet it helps in finding paths that we never thought about - but
our customers surely will