Lessons in Glustration
Despite the negative-sounding title, I am now finding gluster easy(ish) to use, now that I am getting used to some of its quirks. This page is intended as documentation of the things I have learned along the way, and a “do this” / “don't do that” kind of warning for my future self.
In The Beginning
For many years now, I have run a home file server for all of my storage needs. Originally this started as a spin-off of my backup solution. Did you know that the Linux NTFS drivers can create hard links that Windows can read, even though Windows itself gives you no easy way to create them? So my backups were on a USB HDD, and to keep versioned 'snapshots' (in a manner of speaking) I had it attached to a Linux machine that would run rsync every night, keeping the sizes down by using the aforementioned hard links for files that were unchanged. I wanted to keep the file system as something my Windows desktop could read, as at the time I only had two machines, one Windows and one Linux, so a backup HDD that could only be read by one machine was no backup at all! Logically you don't want to be copying the entire dataset across the network every time, so the Linux box became an SMB file server as well. Fast forward several years and that one Linux server is now an entire fleet of machines, NFS has replaced SMB because SMB just wasn't fast enough, and the advent of Windows 8 and its successors has seen Microsoft almost entirely eliminated from my home. The core file server is still just the one machine though, with all the limits that implies.
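For my own future reference, the nightly job was along these lines; this is only a minimal sketch, and the paths (and the use of --link-dest to do the hard linking) are illustrative rather than a copy of the actual script:

    #!/bin/bash
    # Example paths only. --link-dest makes rsync hard-link any unchanged file
    # to the copy in the previous snapshot, so each nightly snapshot only costs
    # the space of whatever actually changed.
    TODAY=$(date +%F)
    PREV=$(date -d yesterday +%F)
    rsync -a --delete \
        --link-dest=/mnt/backup/$PREV/ \
        /mnt/data/ \
        /mnt/backup/$TODAY/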
Before starting down the road that ultimately led me to gluster, I had a primary file server with 6x 6TB storage disks (in btrfs RAID10), backing up to a secondary file server with 8x 3TB disks (in zfs RAIDZ2). Thanks to my data-hoarding tendencies, and a lack of thorough clean-up leading to unnecessarily duplicated data, naturally this wasn't enough. At some point one of the 6TB disks in the primary server got replaced with a 10TB disk (I think it was a disk failure, combined with one particular 10TB model being especially cheap at the time), so the two servers were then mismatched in size. Although I could compensate for this, as some of the data on the primary was itself a backup and therefore didn't need yet another backup on the secondary, that didn't stop me from filling the primary anyway.
A Btrfs Diversion
Before wandering off into cluster territory, I just want to touch on a few issues I had with btrfs. Although these are not the reason I decided to change things up, this seems like the perfect place to keep these notes.
Firstly, and this one is documented in the usual literature, there is that 10TB hard drive I mentioned previously. While it was a simple enough job subbing the new drive in for the failed 6TB drive, the issue comes when you want to use that extra 4TB of space, especially when your array is already nearly full. Now, if you read up on btrfs (or any of these modern advanced file systems, e.g. zfs), one piece of advice you will find is not to let them get too full; somewhere around ~80% utilisation is usually considered “full” for most purposes, especially if you expect reasonable performance. As a self-admitted data hoarder, naturally I know this and yet cheerfully carry on all the way up to 99.9%. The particular issue that crops up here is that because I am running RAID10, btrfs can only allocate as much of the 10TB drive as it can also allocate mirrors for. This is fine for the original 6TB the drive is expected to carry, as the mirrors have already been allocated among the 5 other drives; in fact that is where the data is copied from to rebuild the drive, unless the failed/failing drive is still accessible in some way. But when you want to add data you are limited instead by the free space on those 5 other drives, so approximately 5 drives x 6TB x 0.1% = 30GB (ok, maybe I'm exaggerating how full the array was, but you get the point). The answer, of course, as per all the literature available, is to re-balance the array, but since this involves effectively rewriting all 36TB it is a slow process, and the speed is limited even further by the lack of free space you have on hand. Anyway, these are the natural consequences of ignoring all of the warnings.
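For future me, the re-balance in question looks roughly like this; the mount point is an example, and the usage filters are just one way of reducing how much gets rewritten in a single pass:

    # See how much of each device is allocated versus actually used
    btrfs filesystem usage /mnt/array
    # A full balance rewrites every chunk so data and mirrors are spread across
    # all devices, i.e. the slow whole-array rewrite described above
    btrfs balance start /mnt/array
    # Or start with only the mostly-empty chunks to free up space sooner
    # (the 50% threshold is arbitrary)
    btrfs balance start -dusage=50 -musage=50 /mnt/array
    # Progress can be checked from another terminal
    btrfs balance status /mnt/array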
Secondly, RAID10 can make recovery more difficult. This is a natural consequence of the RAID0 part of things, and should be familiar to anyone who has dabbled in RAID for any length of time. Because RAID0 splits files across disks to improve performance, if you lose just the wrong set of disks you end up losing chunks of files. Funnily enough, btrfs is actually better at this than regular RAID0, mostly because the chunk or stripe width is bigger: any file bigger than the chunk width is at risk of losing at least a portion of itself, and any file smaller than the chunk width could be lost entirely if it happened to land on the missing disks. Btrfs also recovers better than most RAID setups; I can't say I've ever recovered anything from a conventional RAID10 array with a missing stripe the way I have with btrfs.
Thirdly, btrfs has an odd quirk when you need to do recovery. If the array cannot be mounted clean, you cannot mount it at its normal mount point (i.e. the one in fstab) until you've fixed the issue, and that includes mounting it there in degraded mode (-o degraded). I'm not sure what causes this and I haven't seen any notes about it anywhere, but the fix (for me anyway) is to mount the damaged array at a temporary mount point, do whatever you need to do to resolve the error, and then you can reboot and it will mount normally. In my experience this does not mean you have to complete the entire recovery, but you do need to have the array in a state where it is recovering, or able to recover without interaction. I haven't spent much time experimenting with this; I do want my data back online, after all.
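As a rough sketch of what that recovery dance looks like (the device names and devid below are placeholders, not a record of a real recovery):

    # Mounting at the normal fstab mount point fails, even with -o degraded,
    # so mount the damaged array somewhere temporary instead
    mkdir -p /mnt/recovery
    mount -o degraded /dev/sdb /mnt/recovery
    # Find the devid of the missing disk, then rebuild onto its replacement
    btrfs filesystem show /mnt/recovery
    btrfs replace start 3 /dev/sdg /mnt/recovery
    # Once the replace is underway, a reboot and the normal fstab mount work again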
Regardless of the above, btrfs does get a big thumbs up from me; I just have a habit of pushing things beyond where I should and encountering edge cases, because that's where you find them. One of btrfs' main benefits, in my opinion, is just how flexible it is compared to zfs, which was built more in the image of traditional hardware RAID, working at the whole-disk level.
Enter The Cluster
Continuing to expand a single server with ever more storage becomes ever more complicated and ever more expensive, even if you are willing to compromise on things like the spindle speed of your HDDs, or shingled magnetic recording (don't get me started). Cases only have so many drive bays, motherboards only have so many SATA ports (and PCI/PCIe slots for SATA expansion cards, which can be very hit or miss for driver support), and larger hard drives mean replacing entire disks, which gets expensive both in the wallet and on the clock, while also increasing the chances of a disk failure taking out ever larger chunks of data. The next logical step is to transition from RAID, where you plan around dealing with individual disk failures, to RAIN, where you plan around dealing with the loss of an entire server instead. If you are running a mirroring-based setup (i.e. RAID1 or 10), this effectively doubles your storage space right off the bat, if you think of it as moving your 'mirrored copies' off to an entirely separate machine instead of having them take up valuable disk space in your primary server. I had a decent number of idle disks and servers, so simply buying yet more kit just wasn't the right route for me.
This is where I did my usual thing of starting at what may seem like a tangent. I wanted to try Proxmox's “hyper-converged” setup, which combines their VM host clustering (built on corosync and drbd, I think) with Ceph for clustered storage, into an all-in-one clustering solution. Ultimately I wanted to host my main file store on Ceph in a Proxmox cluster, so first I needed a Proxmox cluster. But if that was the aim, then I may as well have my VMs on the cluster too, so “everything” could be all clustery with high availability and automatic failover and all those wonderful things that sound so amazing in theory and yet are oh so painful to realise. So I started small, building a cluster just for the VMs that could later be expanded to take in my primary file server too.
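The cluster-building part itself is only a couple of commands on current Proxmox versions; the cluster name and address here are placeholders rather than my actual setup:

    # On the first node, create the cluster
    pvecm create homelab
    # On each additional node, join by pointing at an existing member
    pvecm add 192.168.1.10
    # Check that all nodes have joined and the cluster has quorum
    pvecm status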
The Ceph Trials
At first Ceph went rather well, running my small number of VMs on a small group of SSDs (literally one per server), and while yes, it did seem slower than a raw SSD or even a RAID1 pair, this was to be expected since the mirrored writes now have to be copied across the network, which in my case is a mere 1Gbps versus the 3Gbps of SATA-II (and SATA-III and NVMe are faster still, of course). This bandwidth also had to be shared with everything the VMs themselves were doing, since I do not have a separate back-end network to segregate the storage traffic. All this to say that yes, I am well aware I was not running Ceph in the most optimal environment, but that is generally how I run things: squeezing everything into the bare minimum of specs and avoiding splashing the cash like a corporation might, especially one whose business depends on it. The VMs were solid if not fast, migration worked well, and recovery from a server crash was robust, although again not fast.
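For the record, the Proxmox side of standing Ceph up was roughly the following; the network and device names are placeholders, and the exact pveceph subcommands vary a little between releases:

    # On every node: install the Ceph packages
    pveceph install
    # On the first node: initialise Ceph, here sharing the single 1Gbps network
    pveceph init --network 192.168.1.0/24
    # On each node: create a monitor, then one OSD per data disk
    pveceph mon create
    pveceph osd create /dev/sdb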
The main datastore, though, is what really demonstrated the weakness of this setup. Adding the 7x 3TB HDDs per server into the mix went fine, but loading the data was excruciatingly slow, and when it was finally all on there it was painful to use too. The lack of speed would also severely hamper any recovery efforts should a server fail.