Published On: 27 February 2026 · Last Updated: 27 February 2026

I encountered an interesting case of a Linux system (Debian) failing to boot after an SSD failed – in this case, the disk disappeared entirely rather than surfacing as I/O errors. The root filesystem was formatted with BTRFS and configured for RAID1 redundancy. The OS didn’t boot and got stuck at the initramfs stage.

Technically speaking, when a disk fails in a RAID1 setup, the array is supposed to keep working with the remaining disks. That is the standard behavior of the traditional Linux approach using mdadm, and ZFS mirrors on Linux behave the same way.

Upon checking the KVM console, I saw an error being thrown up by the kernel when it tried to boot the system:

[365842.920006] BTRFS error (device zd16): devid 2 uuid 888fb8f1-1ae6-4504-a066-cef29c64ad7c is missing
[365842.920009] BTRFS error (device zd16): failed to read chunk tree: -2
[365842.920140] BTRFS error (device zd16): open_ctree failed: -2

As I said earlier, the filesystem was configured with RAID1 redundancy across two NVMe SSDs. After searching for solutions, I discovered that BTRFS will mount such a filesystem only with the degraded mount option.

[366399.344476] BTRFS info (device zd16): using crc32c (crc32c-intel) checksum algorithm
[366399.344929] BTRFS warning (device zd16): devid 2 uuid 888fb8f1-1ae6-4504-a066-cef29c64ad7c is missing
[366399.345433] BTRFS info (device zd16): allowing degraded mounts
[366399.345436] BTRFS info (device zd16): enabling ssd optimizations
[366399.345438] BTRFS info (device zd16): turning on async discard
[366399.345440] BTRFS info (device zd16): enabling free space tree
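The kernel log above is what a successful degraded mount looks like. From a rescue environment it can be produced with something like the following – the device name /dev/zd16 comes from the simulated setup in this post, so adjust it for your system:

```shell
# Mount the surviving BTRFS member read-write in degraded mode.
# Requires root, and /mnt must exist.
mount -o degraded /dev/zd16 /mnt
```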

Since I was connected via a KVM console, it was difficult to time my keystrokes to drop into the GRUB menu and modify rootflags so the system could boot degraded. Instead, I booted into a rescue system and removed the redundancy from the filesystem.
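For reference, the GRUB route would look roughly like this: press `e` on the boot entry and append the degraded flag to the kernel command line. This is only a sketch – the exact linux line varies per system:

```shell
# In the GRUB editor, change the linux line from something like:
#   linux /vmlinuz-... root=UUID=... ro rootflags=subvol=@
# to:
#   linux /vmlinuz-... root=UUID=... ro rootflags=subvol=@,degraded
# then boot once with Ctrl-x (or F10).
```

This only affects the single boot; a degraded rootflags entry in /etc/fstab would be needed to make it persistent.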

btrfs -v balance start -f -mconvert=single -dconvert=single /mnt
Dumping filters: flags 0xf, state 0x0, force is on
  DATA (flags 0x100): converting, target=281474976710656, soft is off
  METADATA (flags 0x100): converting, target=281474976710656, soft is off
  SYSTEM (flags 0x100): converting, target=281474976710656, soft is off
Done, had to relocate 3 out of 3 chunks

btrfs device remove missing /mnt
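After the convert and the removal of the missing device, the filesystem should report a single device and single profiles. A quick sanity check (using the mount point from the simulated setup above):

```shell
btrfs filesystem show /mnt   # should list one device, with no "missing" entry
btrfs filesystem df /mnt     # Data/Metadata/System should now show "single"
```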

This allowed me to boot and use the system while I requested a replacement SSD from the server vendor. After the SSD was replaced, I configured RAID1 again.

btrfs device add /dev/zvol/rpool/btrfs_3 /mnt
Performing full device TRIM /dev/zvol/rpool/btrfs_3 (2.00GiB)
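Adding the device alone does not re-mirror the existing data – a balance with convert filters is needed to get back to the RAID1 profiles shown in the usage output below. That step isn’t in the captured output, but it would be along these lines:

```shell
# Convert data and metadata back to RAID1 across both devices.
# -mconvert also converts the system chunks.
btrfs -v balance start -mconvert=raid1 -dconvert=raid1 /mnt
```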

btrfs device usage /mnt
/dev/zd16, ID: 1
   Device size:             2.00GiB
   Device slack:              0.00B
   Data,RAID1:            416.00MiB
   Metadata,RAID1:        256.00MiB
   System,RAID1:           32.00MiB
   Unallocated:             1.31GiB

/dev/zd32, ID: 2
   Device size:             2.00GiB
   Device slack:              0.00B
   Data,RAID1:            416.00MiB
   Metadata,RAID1:        256.00MiB
   System,RAID1:           32.00MiB
   Unallocated:             1.31GiB

It is quite perplexing that BTRFS chooses to behave this way in the default configuration while most other RAID1 solutions continue working with a failed disk. Perhaps there’s a strong design decision behind it – one commonly cited reason is that writing while degraded could historically create single-profile chunks, which could then block even a degraded mount later – but I’d have to read up to understand more. Or maybe it’s just a quirk, and there are more quirks.

ZFS seems to be the best option for flexibility and performance. On this server, though, I couldn’t get ZFS on root working, so I chose BTRFS thinking the two were quite similar. This experience taught me they aren’t – a new learning.

Also, the snippets shown above are simulated, which is why you see BTRFS running on top of ZFS zvols (/dev/zd16, /dev/zd32) – I use ZFS on my system. I have described my setup in this post.
