Massive amounts of storage at affordable prices — it almost sounds too good to be true, doesn’t it?
We have 4 “Storinators”: each box some 3U in height, each stuffed with 39 4TB SATA drives and 6 1TB SSDs, 45 drives in total. Two of them are “in production”, meaning we’ve pulled them in to participate in rendering what you see when you use Kolab Now.
A long, long time ago (in fact, on the old kolab.org website), I blogged about benchmarking these storage pods: how would we divide these drives up into RAID arrays, and how would we use the SSD storage for faster access?
We have them in production as replicated GlusterFS nodes, serving up disk images to our virtualization environment.
Boy, can I tell you the result is disappointing. The limited hardware resources in the Storinators cause a factor 4-5 memory overcommit, and their CPUs sit in virtually constant I/O wait. Here’s a CPU graph to illustrate the point:
However, what is the cause? Here’s where we can point at various suspects.
I should mention that we use the storage pods’ GlusterFS volume for “Operating System Disks”, not “payload”, with some exceptions in the write-once-read-many category.
Suspect #1: Software RAID (Topology)
The software RAID topology we laid down is one RAID 10 array per controller (so, 3x), each array containing 10 drives, hot spares not included. Excellent throughput in a couple of fio test cases, sustained over longer periods of time. Seemingly terrible throughput when put to use. But it’s not the only suspect.
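For the record, the kind of fio runs we benchmarked with look roughly like this (file paths, sizes and durations are illustrative, not our exact job definitions):

```shell
# Sustained sequential writes: the case where the arrays looked excellent.
fio --name=seqwrite --filename=/mnt/array/fio.test \
    --rw=write --bs=1M --size=10G \
    --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=600 --time_based

# Small random reads: much closer to what a pile of OS disks actually does.
fio --name=randread --filename=/mnt/array/fio.test \
    --rw=randread --bs=4k --size=10G \
    --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=600 --time_based
```

The gap between those two cases goes a long way toward explaining “excellent in benchmarks, terrible in production”.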
The reason it is a suspect is that software RAID is extremely expensive on the host CPU. This is part of the reason we did not choose RAID 5 to expand the available volume of storage: parity calculations are, in our view, more expensive than dup-and-dump.
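A sketch of creating one such array with mdadm (device names are hypothetical; ours differ per controller):

```shell
# RAID 10 across 10 drives, plus one hot spare (11 devices total):
mdadm --create /dev/md0 --level=10 --raid-devices=10 --spare-devices=1 \
    /dev/sd[b-l]
```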
Suspect #2: GlusterFS
GlusterFS squarely falls into the category of “Ah, yeah, right, f%@k…”. This will be the last time we use it for this purpose, and absolutely the last time we recommend it to our customers. It should be considered an absolute no-go unless you already have the world’s resources to waste.
Had I known in advance that, in a replicated scenario, it is the GlusterFS client (the “mounter”, if you will) that is responsible for the replication, I would have thought long and hard, twice, about applying it. That’s just terrible design. In practice, it means I can’t balance the use of the capacity I do have (network, CPU, memory, dm-cache); instead, all participants just run in circles occupying one another’s personal space.
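For illustration, a replicated volume of this kind is created roughly like so (hostnames and brick paths are made up). The point is that with a FUSE mount of a “replica 2” volume, the client itself writes every block to both bricks:

```shell
# Two-way replicated volume; the client, not the servers, fans out writes:
gluster volume create vmimages replica 2 \
    storinator1:/bricks/brick1 storinator2:/bricks/brick1
gluster volume start vmimages

# The FUSE mount that ends up doing the actual replication work:
mount -t glusterfs storinator1:/vmimages /var/lib/libvirt/images
```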
GlusterFS is going out, and Ceph’s coming in. Because, hell yeah, Ceph.
Suspect #3: dm-cache
I pride myself on having a healthy amount of ignorance, but this one just flabbergasts me. dm-cache is failing to live up to expectations in miserable fashion. If I weren’t aware of my own ignorance (see what I did there?), I would have to conclude that either we don’t read any blocks twice, or it just doesn’t do what it says on the tin.
This particular part failing us effectively negates the awesome kindness of 45drives.com, who sent us controller cards better suited to driving SSDs.
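For reference, a dm-cache layering of this sort is set up along these lines through LVM (volume group, LV names and sizes are hypothetical), and `dmsetup status` is one way to check whether the cache is actually getting hits:

```shell
# SSD-backed cache pool in front of the spinning-disk origin LV:
lvcreate --type cache-pool -L 800G -n ssdcache vg_pod /dev/md_ssd
lvconvert --type cache --cachepool vg_pod/ssdcache vg_pod/brick

# Inspect the hit/miss counters of the resulting dm-cache device:
dmsetup status vg_pod-brick
```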
Nowadays, storage solutions like GlusterFS and Ceph support storage tiering, which will likely be the next attempt we take at this (with Ceph, as I may have mentioned).
Suspect #4: I/O patterns
We’ve learned from experience that particular I/O patterns can wreak havoc on storage otherwise performing excellently. One of our first experiences in this realm nearly knocked down production servers and was caused by our build server environment.
We have kind of a zero-IO policy deployed across the board, meaning that what can be in tmpfs is in tmpfs, and logging to the local OS disk is limited to a minimal amount (error and above, selected info messages for system statistics).
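In practice, the policy amounts to mounts and logging rules along these lines (sizes, paths and the rsyslog rule are examples, not our exact configuration):

```shell
# Keep scratch space in RAM rather than on the OS disk:
mount -t tmpfs -o size=2g,mode=1777 tmpfs /tmp

# Limit local logging to error and above (legacy rsyslog syntax):
echo '*.err  /var/log/errors.log' > /etc/rsyslog.d/00-minimal.conf
```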
Suspects #5 through #n: Assorted
Scheduling used to make a difference (deadline, cfq, noop), but libgfapi for QEMU under virtio does not work, and using disk images off of a glusterfs mount does not seem to give the guest any control over the scheduler. I would love to be able to relax the seemingly assertive “everything matters” policy, which literally waits for all I/O to complete, to a looser “nothing matters” for the lather-rinse-repeat OS disks (all we have to do is set them netboot-enabled and kick them). For now, I’m applying a writeback cache policy.
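Applied to a guest, the writeback policy looks something like this (a minimal QEMU invocation; the image path is hypothetical, and under libvirt the equivalent is `cache='writeback'` on the disk’s driver element):

```shell
qemu-system-x86_64 -m 2048 -enable-kvm \
    -drive file=/var/lib/libvirt/images/guest.qcow2,if=virtio,cache=writeback
```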
Not zero-IO at all? While we intend not to write to the OS disk at all (read: as little as possible), this policy seems to fail:
I have yet to nail down what exactly causes this amount of writing, in contrast with the reading.
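The hunt itself is fairly mundane: accumulating per-process I/O counters for a while usually points at the culprit (intervals and counts here are arbitrary):

```shell
# Batch mode, only processes actually doing I/O, counters accumulated:
iotop -o -b -a -d 5 -n 60 > /root/iotop.log

# Alternatively, per-process disk statistics every 5 seconds (sysstat):
pidstat -d 5
```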