ZFS as an NFS back-end for Kubernetes

Introduction

Lately, I’ve been working with Kubernetes and various storage backends. So far what seems to work best for my environment is NFS backed by ZFS. ZFS has a rich feature set (snapshots, quotas, etc.), is very stable, and offers great performance for a variety of storage access patterns. Below are some notes on how I generally configure this.

Note that these deployments are generally done in a sort of production dev environment: losing data would be bad, but it would not lead to business continuity issues. Keep this level of data criticality in mind when reading about the deployment described below.

Physical hardware

HDD

My k8s projects have been very data-heavy, but not I/O-intensive; e.g., there is a lot of data being written and read, but the speed at which this happens is not particularly relevant to the associated workloads. As such, we are optimizing for total capacity first and performance second. This means using as many large spinning drives as will fit in a machine. Normally, any enterprise- or NAS-grade HDD will work for this. The primary thing to avoid when choosing HDDs is shingled magnetic recording (SMR) drives. While these drives offer great capacity, their poor random I/O performance makes them ill-suited for use in a software RAID such as ZFS. I prefer to use disks with a 5-year warranty, as the lifespan of a disk seems to track closely with the warranty period.

For the example described in this post, we're using eighteen 2TB SAS drives, set up with multipath for failover.
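
Before building the pool, it's worth a quick sanity check that the multipath devices are all present and that each drive shows both of its paths. Assuming multipathd is already configured, a plain listing is enough:

$ sudo multipath -ll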

SSD

In order to increase the performance of the slow HDD backend, I'll generally try to deploy at least two NVMe (or Optane) drives to be used as a cache. If your server (or budget) can't handle NVMe drives, SATA SSDs will work as well. Avoid high-capacity QLC drives in favor of lower-capacity but higher-performance SLC drives. Again, I tend to go for the drives with the best warranty and terabytes-written (TBW) rating. The cache disks have to absorb a lot of writes, so it's critical that the drives you choose have the write endurance to support your workload.
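
Once cache drives are in service, it's worth keeping an eye on how much of that endurance has been consumed. As a rough sketch, assuming smartmontools is installed and the drives appear as nvme0n1/nvme1n1 (as they do later in this post):

$ sudo smartctl -a /dev/nvme0n1 | grep -i -E 'percentage used|data units written'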

The size of these drives depends on your workload and total storage capacity, but in general, you don't need much space. For the example here we'll use 512GB PCIe NVMe drives, which honestly is probably overkill.

Logical setup

Data

I hate to sound like a broken record, but the ZFS RAID level you deploy depends on your workload and data sensitivity. For my environments, I generally use raidz2, which uses two drives' worth of parity (roughly equivalent to RAID 6). The general consensus seems to be that a raidz2 vdev should never be more than 8-10 drives, so I tend to follow that advice. Using larger raidz2 virtual devices (vdevs) would give more usable disk space, but you run the risk of multiple drives failing during rebuilds.
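
For a rough sense of the capacity cost: an 8-disk raidz2 vdev keeps two disks' worth of parity, so with the 2TB drives used here each vdev provides about 6 × 2TB ≈ 12TB of usable space (before ZFS metadata overhead and reserved free space).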

With our 18 disks, we're going to make two 8-disk raidz2 vdevs and use the remaining two drives as spares:

$ sudo zpool create -f data raidz2 \
  mpatha \
  mpathaa \
  mpathab \
  mpathac \
  mpathad \
  mpathae \
  mpathaf \
  mpathag \
raidz2 \
  mpathah \
  mpathai \
  mpathaj \
  mpathak \
  mpathal \
  mpatham \
  mpathan \
  mpathao

This creates a zpool with two 8-disk raidz2 vdevs:

$ sudo zpool status
  pool: data
 state: ONLINE
  scan: none requested
config:

  NAME           STATE     READ WRITE CKSUM
  data           ONLINE       0     0     0
    raidz2-0     ONLINE       0     0     0
      mpatha     ONLINE       0     0     0
      mpathaa    ONLINE       0     0     0
      mpathab    ONLINE       0     0     0
      mpathac    ONLINE       0     0     0
      mpathad    ONLINE       0     0     0
      mpathae    ONLINE       0     0     0
      mpathaf    ONLINE       0     0     0
      mpathag    ONLINE       0     0     0
    raidz2-1     ONLINE       0     0     0
      mpathah    ONLINE       0     0     0
      mpathai    ONLINE       0     0     0
      mpathaj    ONLINE       0     0     0
      mpathak    ONLINE       0     0     0
      mpathal    ONLINE       0     0     0
      mpatham    ONLINE       0     0     0
      mpathan    ONLINE       0     0     0
      mpathao    ONLINE       0     0     0

Now we add those last two drives as spares:

$ sudo zpool add data spare mpathz mpathap

We can confirm that they’re available via

$ sudo zpool status
...

  spares
    mpathz       AVAIL
    mpathap      AVAIL
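
If one of the data drives fails, a spare can be swapped in by hand with zpool replace; the example below assumes mpathaf were the failed drive (on Linux, the ZFS event daemon can also activate a spare automatically when it's running):

$ sudo zpool replace data mpathaf mpathz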

Cache

One partition from each NVMe device will be mirrored to create a SLOG, which acts as a sort of write cache for our NFS workloads, which issue synchronous writes. The other partitions will be used for the L2ARC read cache. You can get away with creating a SLOG on an un-mirrored partition, but if that disk and the machine were to fail at the same time, you could lose writes committed to the SLOG but not yet to the pool. In most setups, this amounts to 10-15 seconds of data. For workloads like databases or VM backends, losing 15 seconds of writes can quickly lead to corruption.

Given the workload of this cluster, we want a relatively large SLOG, so we're going to use a 15GB mirrored partition. This is likely larger than we need, but this specific storage server will see large bursts of writes over a 100Gb connection. Again, this is a sort of dev environment, so we'll run with this for a while, collect metrics, and re-evaluate later.
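
The partitioning itself isn't shown in the commands below, but as a rough sketch it looks something like this, assuming blank drives that appear as nvme0n1 and nvme1n1 (adjust device names and sizes for your hardware):

$ sudo parted -s /dev/nvme0n1 mklabel gpt \
    mkpart slog 1MiB 15GiB \
    mkpart l2arc 15GiB 100%
$ sudo parted -s /dev/nvme1n1 mklabel gpt \
    mkpart slog 1MiB 15GiB \
    mkpart l2arc 15GiB 100%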

$ sudo zpool add data log mirror nvme0n1p1 nvme1n1p1

Finally, we add the L2ARC cache using the remaining NVMe partitions and then confirm that the SLOG and cache are set up appropriately.

$ sudo zpool add data cache nvme0n1p2 nvme1n1p2
$ sudo zpool status
...
  logs
    mirror-9     ONLINE       0     0     0
      nvme0n1p1  ONLINE       0     0     0
      nvme1n1p1  ONLINE       0     0     0
  cache
    nvme0n1p2    ONLINE       0     0     0
    nvme1n1p2    ONLINE       0     0     0
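
Once the pool has seen some real traffic, you can sanity-check whether the SLOG and L2ARC are actually being used by watching per-device activity; a simple option is zpool iostat:

$ sudo zpool iostat -v data 5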

Conclusion

With everything in place, the final pool looks like this:

$ sudo zpool status
  pool: data
 state: ONLINE
  scan: none requested
config:

  NAME           STATE     READ WRITE CKSUM
  data           ONLINE       0     0     0
    raidz2-0     ONLINE       0     0     0
      mpatha     ONLINE       0     0     0
      mpathaa    ONLINE       0     0     0
      mpathab    ONLINE       0     0     0
      mpathac    ONLINE       0     0     0
      mpathad    ONLINE       0     0     0
      mpathae    ONLINE       0     0     0
      mpathaf    ONLINE       0     0     0
      mpathag    ONLINE       0     0     0
    raidz2-1     ONLINE       0     0     0
      mpathah    ONLINE       0     0     0
      mpathai    ONLINE       0     0     0
      mpathaj    ONLINE       0     0     0
      mpathak    ONLINE       0     0     0
      mpathal    ONLINE       0     0     0
      mpatham    ONLINE       0     0     0
      mpathan    ONLINE       0     0     0
      mpathao    ONLINE       0     0     0
  logs
    mirror-9     ONLINE       0     0     0
      nvme0n1p1  ONLINE       0     0     0
      nvme1n1p1  ONLINE       0     0     0
  cache
    nvme0n1p2    ONLINE       0     0     0
    nvme1n1p2    ONLINE       0     0     0
  spares
    mpathz       AVAIL
    mpathap      AVAIL

errors: No known data errors

We now have an 18-disk pool built from two raidz2 vdevs, with two hot spares, an L2ARC cache, and a mirrored SLOG. We can create datasets and export them via zfs set sharenfs, or use the standard NFS sharing facilities (e.g. nfs-kernel-server and /etc/exports). I've run with both in production and generally prefer using the normal OS /etc/exports to share the datasets, but this is habit more than any sort of technical reason. Once this is complete, you can create the appropriate PV/PVC in k8s like you would for any other NFS-based storage.
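
As a minimal sketch of that last step, here's roughly what it looks like end to end. The dataset name, network range, and server address below are placeholders, and the sharenfs line and the /etc/exports entry are alternatives rather than both being required:

$ sudo zfs create data/k8s-volumes
# Option 1: let ZFS manage the export (OpenZFS-on-Linux sharenfs syntax)
$ sudo zfs set sharenfs="rw=@10.0.0.0/24,no_root_squash" data/k8s-volumes
# Option 2: a plain /etc/exports entry managed by nfs-kernel-server
#   /data/k8s-volumes 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)
# (for option 2) reload the export table
$ sudo exportfs -ra

On the Kubernetes side, the PersistentVolume is just a normal NFS PV pointing at the export (server address and size are again placeholders):

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: zfs-nfs-example
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.10
    path: /data/k8s-volumes
EOF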

What does this setup offer that’s specific to k8s? Realistically… nothing. I would use this sort of setup for any NFS-based NAS in a multiuser environment.