Lately, I’ve been working with Kubernetes and various storage backends. So far what seems to work best for my environment is NFS backed by ZFS. ZFS has a rich feature set (snapshots, quotas, etc.), is very stable, and offers great performance for a variety of storage access patterns. Below are some notes on how I generally configure this.
Note that these deployments are generally done in a sort of production dev environment: losing data would be bad, but it would not lead to business continuity issues. Keep that level of data criticality in mind when reading about the deployment described below.
My k8s projects have been very data-heavy, but not data-intensive. That is, there is a lot of data being written and read, but the speed at which this happens is not critical to the associated workloads. As such, we are optimizing for total capacity first and performance second. This means using as many large spinning drives as will fit in a machine. Normally, any enterprise or NAS-grade HDD will work for this. The primary thing to avoid when choosing HDDs is shingled magnetic recording (SMR) drives. While these drives offer great capacity, their poor random I/O performance makes them ill-suited for use in a software RAID such as ZFS. I prefer disks with a 5-year warranty, as the lifespan of a disk seems to track closely with the warranty period.
For the example described in this post, we’re using 18 2TB SAS drives set up using multipath, for failover.
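Before building the pool, it's worth confirming that multipath is seeing every drive and both paths to it. A quick sanity check (this assumes user-friendly mpath aliases, matching the device names used below):

```shell
# List all multipath devices and their underlying SAS paths; each mpath
# device should show all of its paths as active.
sudo multipath -ll

# Count the mpath devices -- for this example we expect 18.
sudo multipath -ll | grep -c '^mpath'
```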
In order to increase the performance of the slow HDD backend, I'll generally try to deploy at least 2 NVMe (or Optane) drives to be used as a cache. If your server (or budget) can't handle NVMe drives, SATA SSDs will work as well. Prefer lower-capacity but higher-performance SLC drives over high-capacity QLC drives. Again, I tend to go for the drives with the best warranty and terabytes-written (TBW) rating. The cache disks have to absorb a lot of writes, so it's critical that the drives you choose have the write endurance to support your workload.
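When evaluating cache drives (new or used), the NVMe SMART log is a quick way to see how much rated endurance a drive has already consumed. A sketch using nvme-cli (the device name is an example):

```shell
# "percentage_used" reports consumed rated endurance (it can exceed 100%);
# "data_units_written" counts thousands of 512-byte units written so far.
sudo nvme smart-log /dev/nvme0
```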
The size of these drives is related to your workload and total storage capacity, but in general, you don't need much space. For the example here we'll use 512GB PCIe NVMe drives, which honestly is probably overkill.
I hate to sound like a broken record, but the ZFS RAID level you deploy depends on your workload and data sensitivity. For my environments, I generally use raidz2, which uses 2 drives per vdev for parity (analogous to RAID6). The general consensus seems to be that a raidz2 virtual device (vdev) should never span more than 8-10 drives, so I tend to follow that advice. Using larger raidz2 vdevs would give more usable space, but you run the risk of additional drives failing during long rebuilds.
With our 18 disks, we're going to make two 8-disk raidz2 vdevs and use the remaining two drives as spares.
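As a back-of-the-envelope check on what this layout buys us in capacity (before ZFS metadata and padding overhead):

```shell
# (disks per vdev - parity disks) * disk size in TB * number of vdevs
echo "$(( (8 - 2) * 2 * 2 ))TB"   # two 8-disk raidz2 vdevs of 2TB disks
```

So roughly 24TB usable from 36TB of raw disk, with the two spares contributing nothing until a failure.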
$ sudo zpool create -f data \
    raidz2 \
      mpatha \
      mpathaa \
      mpathab \
      mpathac \
      mpathad \
      mpathae \
      mpathaf \
      mpathag \
    raidz2 \
      mpathah \
      mpathai \
      mpathaj \
      mpathak \
      mpathal \
      mpatham \
      mpathan \
      mpathao
This creates a zpool with two 8-disk raidz2 vdevs:
$ sudo zpool status
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            mpatha    ONLINE       0     0     0
            mpathaa   ONLINE       0     0     0
            mpathab   ONLINE       0     0     0
            mpathac   ONLINE       0     0     0
            mpathad   ONLINE       0     0     0
            mpathae   ONLINE       0     0     0
            mpathaf   ONLINE       0     0     0
            mpathag   ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            mpathah   ONLINE       0     0     0
            mpathai   ONLINE       0     0     0
            mpathaj   ONLINE       0     0     0
            mpathak   ONLINE       0     0     0
            mpathal   ONLINE       0     0     0
            mpatham   ONLINE       0     0     0
            mpathan   ONLINE       0     0     0
            mpathao   ONLINE       0     0     0
Now we add those last two drives as spares
$ sudo zpool add data spare mpathz mpathap
We can confirm that they’re available via
$ sudo zpool status
...
        spares
          mpathz      AVAIL
          mpathap     AVAIL
One partition on each NVMe device will be mirrored to create a SLOG, which acts as a sort of write cache for the synchronous writes our NFS workloads generate. The remaining partitions will be used for the l2arc cache. You can get away with creating a SLOG on an un-mirrored partition, but if that disk and the machine were both to fail, you could lose writes committed to the SLOG but not yet to the pool. In most setups, this amounts to perhaps 10-15 seconds of data. For workloads like databases or VM backends, losing 15 seconds of writes can quickly lead to corruption.
Given the workload of this cluster, we want a relatively large SLOG, so we're going to use a 15GB mirrored partition. This is likely larger than we need, but this specific storage server will see large bursts of writes over a 100Gb connection. Again, it's a sort of dev environment, so we'll run with this for a while, collect metrics, and re-evaluate later.
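The partitions themselves can be laid out with any partitioning tool; here's a sketch using sgdisk (the partition type code and device names are assumptions, adjust for your hardware):

```shell
# Partition 1: 15GB for the mirrored SLOG; partition 2: the rest for l2arc.
sudo sgdisk -n 1:0:+15G -t 1:bf01 /dev/nvme0n1
sudo sgdisk -n 2:0:0    -t 2:bf01 /dev/nvme0n1

# Repeat the same layout on the second drive.
sudo sgdisk -n 1:0:+15G -t 1:bf01 /dev/nvme1n1
sudo sgdisk -n 2:0:0    -t 2:bf01 /dev/nvme1n1
```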
$ sudo zpool add data log mirror nvme0n1p1 nvme1n1p1
Finally, we add the l2arc cache using the remaining NVMe partitions and then confirm that the SLOG and cache are set up appropriately.
$ sudo zpool add data cache nvme0n1p2 nvme1n1p2
$ sudo zpool status
...
        logs
          mirror-9    ONLINE       0     0     0
            nvme0n1p1 ONLINE       0     0     0
            nvme1n1p1 ONLINE       0     0     0
        cache
          nvme0n1p2   ONLINE       0     0     0
          nvme1n1p2   ONLINE       0     0     0
$ sudo zpool status
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            mpatha    ONLINE       0     0     0
            mpathaa   ONLINE       0     0     0
            mpathab   ONLINE       0     0     0
            mpathac   ONLINE       0     0     0
            mpathad   ONLINE       0     0     0
            mpathae   ONLINE       0     0     0
            mpathaf   ONLINE       0     0     0
            mpathag   ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            mpathah   ONLINE       0     0     0
            mpathai   ONLINE       0     0     0
            mpathaj   ONLINE       0     0     0
            mpathak   ONLINE       0     0     0
            mpathal   ONLINE       0     0     0
            mpatham   ONLINE       0     0     0
            mpathan   ONLINE       0     0     0
            mpathao   ONLINE       0     0     0
        logs
          mirror-9    ONLINE       0     0     0
            nvme0n1p1 ONLINE       0     0     0
            nvme1n1p1 ONLINE       0     0     0
        cache
          nvme0n1p2   ONLINE       0     0     0
          nvme1n1p2   ONLINE       0     0     0
        spares
          mpathz      AVAIL
          mpathap     AVAIL

errors: No known data errors
We now have an 18-disk raidz2 setup (two 8-disk vdevs plus two spares) with an l2arc cache and mirrored SLOG. We can create datasets and export them via zfs set sharenfs, or use the standard NFS sharing facilities (e.g. /etc/exports). I've run with both in production and generally prefer using the normal OS /etc/exports to share the datasets, but this is habit more than any technical reason. Once this is complete you can create the appropriate PV/PVC in k8s like you would for any other NFS-based storage.
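As a concrete sketch of the /etc/exports route, creating and exporting a dataset might look like this (the dataset name, subnet, and export options are examples, not part of the setup above):

```shell
# Create a dataset for k8s volumes; it mounts at /data/k8s by default.
sudo zfs create data/k8s

# Export it to the cluster subnet. "sync" matters here: the SLOG only
# pays off when NFS clients issue synchronous writes.
echo '/data/k8s 10.0.0.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# The zfs-native alternative:
# sudo zfs set sharenfs='rw=@10.0.0.0/24' data/k8s
```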
What does this setup offer that’s specific to k8s? Realistically… nothing. I would use this sort of setup for any NFS-based NAS in a multiuser environment.