Disk onboarding procedure

Disks follow a certain protocol before inclusion in a pool. Since our disks are usually bought used, they need to be tested and reformatted before deployment.

Testing and low-level reformatting

For certain SAS drives, a parity (protection) field is added at the end of each block (512 or 4096 bytes) that allows for error correction. Each block then takes up more raw space, so less usable space is available. Since we use ZFS for checksumming, these protection features aren't necessary. If the protection field is enabled (look for Protection: prot_en=1 when executing sg_readcap /dev/sdx), you can disable it via sg_format -v --format --fmtpinfo=0 --pfu=0 --size=4096 /dev/sdx. Some drives do not support 4096-byte blocks, in which case use --size=512. Note that this doesn't always free up the lost space, but it at least disables the protection field. Also note that this needs to be done for each new drive, and it takes a long while.
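
As a rough sketch of the sequence (the device name is a placeholder; on our drives the protection flags show up in sg_readcap's long output):

sg_readcap --long /dev/sdX                                        # READ CAPACITY(16); shows Protection: prot_en=...
sg_format -v --format --fmtpinfo=0 --pfu=0 --size=4096 /dev/sdX   # disable protection info, switch to 4K blocks
# fall back to --size=512 if the drive rejects 4096-byte blocks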

Once the drive is in 4K block mode with no protection field, run a long SMART test to make sure the drive has no mechanical issues via smartctl -t long /dev/sdx. If the drive is not destined to be part of a ZFS pool, add a badblocks test to be safe.
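
A minimal sketch of the test pass (device name is a placeholder; the badblocks write test is destructive, so only run it on a drive with no data to keep):

smartctl -t long /dev/sdX           # kick off the long self-test (runs inside the drive firmware)
smartctl -a /dev/sdX                # check progress, then the self-test log and error counters once done
badblocks -wsv -b 4096 /dev/sdX     # optional destructive write test for drives not destined for zfs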

With these tests complete, you can add the drive to your ZFS pool.

Drive naming convention

The following naming convention is used for the drives: (letter denoting which group the drive is part of)(drive number in that group).(name of pool).(machine the pool is assigned to). Thus, for a raid10 pool of 4 drives named data on gandalf, the drives would be named a1.data.gandalf, a2.data.gandalf, b1.data.gandalf and b2.data.gandalf.

Each drive is given a physical label with this name, along with its wwn ID. Unfortunately, it is often the case that the wwn ID printed on the drive's factory sticker is not accurate. We should thus use the one that the system reports when the drive is connected and add that to the label.
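
To read the wwn IDs the system actually reports, a couple of equivalent views (sketch):

lsblk -o NAME,MODEL,SERIAL,WWN        # wwn as seen by the kernel for each block device
ls -l /dev/disk/by-id/ | grep wwn     # the same IDs as they will be used for zpool create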

Pool creation

If creating a new pool with these drives, the following command is used to set up the desired default configuration:

zpool create \
   -o ashift=12 \
   -O acltype=posixacl -O compression=lz4 -O xattr=sa -O recordsize=512K \
   -O dnodesize=auto -O normalization=formD -O relatime=on -O special_small_blocks=64K \
   pool_name /dev/disk/by-id/wwn-*

Note the usage of the wwn ID. Using this identifier ensures the device references stay stable across reboots and controller changes, unlike /dev/sdx letters, so there is no risk of the pool pointing at the wrong drive.
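
For the four-drive raid10 example from the naming section, the vdev layout has to be spelled out as mirror pairs. A sketch, with placeholder wwn values:

zpool create \
   -o ashift=12 \
   -O acltype=posixacl -O compression=lz4 -O xattr=sa -O recordsize=512K \
   -O dnodesize=auto -O normalization=formD -O relatime=on -O special_small_blocks=64K \
   data \
   mirror /dev/disk/by-id/wwn-0x5000aaaa00000001 /dev/disk/by-id/wwn-0x5000aaaa00000002 \
   mirror /dev/disk/by-id/wwn-0x5000bbbb00000001 /dev/disk/by-id/wwn-0x5000bbbb00000002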

For ease of use, we rename the partition labels to the drive names so that zpool list -v shows the drive names rather than the wwn IDs. First export the zpool, then for each drive run gdisk /dev/sdx → c → 1 → (drive name) → w. Once all drives are renamed, you can reimport the pool and pick up the new names via partlabel with zpool import -d /dev/disk/by-partlabel pool_name.
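
The rename can also be done non-interactively with sgdisk (shipped alongside gdisk); a sketch for one drive of the example pool:

zpool export data
sgdisk --change-name=1:a1.data.gandalf /dev/sdX   # partition 1 is the data partition zfs created
partprobe /dev/sdX                                # have the kernel re-read the partition table
# repeat for each drive, then:
zpool import -d /dev/disk/by-partlabel data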

Dataset configuration

Each zpool is organized with the following convention: pool_name/encrypted_dataset/(dataset_type)/subvol-1000-disk-0. encrypted_dataset is usually the name of the client / organization; for us, it is ilot. This dataset is what holds the encryption key: anything under it is encrypted with that key. You can create this dataset using zfs create -o encryption=aes-256-gcm -o keylocation=prompt -o keyformat=passphrase pool_name/encrypted_dataset.
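
After a reboot or re-import, the key must be loaded before anything under the encrypted dataset can be mounted; a sketch using the ilot example and the data pool name from above:

zpool import data
zfs load-key data/ilot    # prompts for the passphrase
zfs mount -a              # mount the now-unlocked datasets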

dataset_type is the Proxmox datastore. Optimization is done at this level depending on the type of data that resides under the subvolume. For example, in the case of the psql dataset_type, recordsize is set to 16K to match what postgresql expects. We usually use the following dataset_type values (a creation sketch follows the list):

  • root for system roots
  • psql for postgresql databases (recordsize=16K, logbias=throughput, atime=off, primarycache=all)
  • data for any multi-GB dataset.
  • var for datasets containing many small files (other databases) that should be on SSD (recordsize=64K)
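
A minimal sketch of creating these children under the encrypted dataset (names follow the ilot / data example; the properties are the ones listed above, the other types keep the pool defaults):

zfs create data/ilot/root
zfs create -o recordsize=16K -o logbias=throughput -o atime=off -o primarycache=all data/ilot/psql
zfs create data/ilot/data
zfs create -o recordsize=64K data/ilot/var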

Using settings in /etc/sanoid/sanoid.conf, we can also apply different snapshot schedules to these different dataset types. root, for example, only needs one snapshot a day, while data should be snapshotted every 15 minutes.
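
A sketch of what that could look like in /etc/sanoid/sanoid.conf, assuming the data/ilot layout from above; the template names and retention counts are illustrative, not our actual policy:

[data/ilot/root]
        use_template = daily
        recursive = yes

[data/ilot/data]
        use_template = frequent
        recursive = yes

[template_daily]
        daily = 7
        autosnap = yes
        autoprune = yes

# frequently keeps snapshots taken every frequent_period minutes (15 by default)
[template_frequent]
        frequently = 4
        hourly = 24
        daily = 7
        autosnap = yes
        autoprune = yes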