Simultaneous failure/offline of 2 drives in draid2 results in metadata/checksum errors #14041

@samwyc

System information

Type | Version/Name
--- | ---
Distribution Name | Red Hat Enterprise Linux release 8.6 (Ootpa)
Distribution Version | 8.6
Kernel Version | 4.18.0-372.26.1.el8_6.x86_64
Architecture | x86_64
OpenZFS Version | zfs-2.1.6

Describe the problem you're observing

When 2 vdevs fail simultaneously on an empty draid2 zpool with 2 dspares, we intermittently (about 2 out of 3 attempts) observe permanent metadata errors and checksum errors on nearly all vdevs reported in zpool status.
The issue occurs less frequently as the pool fills up.

When the failure of the first drive is detected, a rebuild to the first dspare starts.
When the failure of the second drive is detected, the vdev_rebuild_reset_wanted flag is set. This happens because the existing rebuild thread has already completed (vdev_rebuild_thread is NULL), but vdev_rebuild_complete_sync has not yet cleared vdev_rebuilding. So the vdev_rebuild_reset_wanted signal is raised but never handled.
As a result, even though the 2nd dspare gets attached, no rebuild ever happens for the 2nd faulted drive, which is the issue seen in zpool status.
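
The ordering described above can be sketched with a toy state model. This is only an illustration of the suspected window, not the OpenZFS code: the class, its methods, and the spare labels are hypothetical, and the three booleans merely stand in for vdev_rebuild_thread, vdev_rebuilding and vdev_rebuild_reset_wanted as described in this report.

```python
class ToyRebuild:
    """Toy model of the rebuild state of one top-level vdev (hypothetical)."""

    def __init__(self):
        self.thread_alive = False   # stands in for vdev_rebuild_thread != NULL
        self.rebuilding = False     # stands in for vdev_rebuilding
        self.reset_wanted = False   # stands in for vdev_rebuild_reset_wanted
        self.attached = []          # spares attached so far
        self.rebuilt = set()        # spares that actually received a rebuild

    def attach_spare(self, name):
        self.attached.append(name)
        if self.rebuilding:
            # A rebuild still looks active, so only request a restart.
            self.reset_wanted = True
        else:
            self.rebuilding = True
            self.thread_alive = True

    def thread_run_to_exit(self):
        # A live thread honors a pending reset and covers every attached spare.
        assert self.thread_alive
        self.reset_wanted = False
        self.rebuilt.update(self.attached)
        self.thread_alive = False

    def complete_sync(self):
        # Assumption of this model: the sync step clears the rebuilding
        # state without looking at reset_wanted.
        self.rebuilding = False


def simulate(second_failure_in_window):
    """Replay two spare attachments; the flag picks where the second one lands."""
    r = ToyRebuild()
    r.attach_spare("first-dspare")
    r.thread_run_to_exit()
    if second_failure_in_window:
        # Thread already exited, rebuilding not yet cleared: the bad window.
        r.attach_spare("second-dspare")
        r.complete_sync()
    else:
        r.complete_sync()
        r.attach_spare("second-dspare")
        r.thread_run_to_exit()
        r.complete_sync()
    return r


if __name__ == "__main__":
    for in_window in (True, False):
        r = simulate(in_window)
        print(in_window, sorted(r.rebuilt), r.reset_wanted)
```

In this model, a second attachment that lands inside the window leaves reset_wanted set with nobody left to act on it, and the second spare is never rebuilt. The same attachment after the sync step starts a fresh rebuild as expected.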

Describe how to reproduce the problem

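 # Create 53 sparse 1G backing files in ~/disks, the path used by zpool create below
 mkdir -p ~/disks
 truncate -s 1G ~/disks/d{1..53}
 zpool create -f -o cachefile=none -o failmode=panic -O canmount=off tank draid2:11d:53c:2s ~/disks/d{1..53}
 zfs create -o mountpoint=/mnt/data tank/ds
 # Offline two children at (nearly) the same time to trigger the race
 zpool offline -f tank ~/disks/d1 & zpool offline -f tank ~/disks/d3
 wait
 zpool status -v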

Include any warning/errors/backtraces from the system logs

[root@localhost zfs]# zpool status -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:01 with 21 errors on Mon Oct 17 01:50:13 2022
  scan: resilvered (draid2:11d:53c:2s-0) 35K in 00:00:00 with 0 errors on Mon Oct 17 01:50:12 2022
config:

        NAME                   STATE     READ WRITE CKSUM
        tank                   DEGRADED     0     0     0
          draid2:11d:53c:2s-0  DEGRADED     0     0     0
            spare-0            DEGRADED     0     0    28
              /root/disks/d1   FAULTED      0     0     0  external device fault
              draid2-0-1       ONLINE       0     0     0
            /root/disks/d2     ONLINE       0     0     0
            spare-2            DEGRADED     0     0    20
              /root/disks/d3   FAULTED      0     0     0  external device fault
              draid2-0-0       ONLINE       0     0     0
            /root/disks/d4     ONLINE       0     0    16
            /root/disks/d5     ONLINE       0     0    28
            /root/disks/d6     ONLINE       0     0    20
            /root/disks/d7     ONLINE       0     0    20
            /root/disks/d8     ONLINE       0     0    20
            /root/disks/d9     ONLINE       0     0    28
            /root/disks/d10    ONLINE       0     0    16
            /root/disks/d11    ONLINE       0     0    16
            /root/disks/d12    ONLINE       0     0    20
            /root/disks/d13    ONLINE       0     0    28
            /root/disks/d14    ONLINE       0     0    28
            /root/disks/d15    ONLINE       0     0    20
            /root/disks/d16    ONLINE       0     0    28
            /root/disks/d17    ONLINE       0     0    20
            /root/disks/d18    ONLINE       0     0     4
            /root/disks/d19    ONLINE       0     0    24
            /root/disks/d20    ONLINE       0     0    16
            /root/disks/d21    ONLINE       0     0    20
            /root/disks/d22    ONLINE       0     0    12
            /root/disks/d23    ONLINE       0     0    20
            /root/disks/d24    ONLINE       0     0    20
            /root/disks/d25    ONLINE       0     0    24
            /root/disks/d26    ONLINE       0     0    16
            /root/disks/d27    ONLINE       0     0    16
            /root/disks/d28    ONLINE       0     0    20
            /root/disks/d29    ONLINE       0     0    32
            /root/disks/d30    ONLINE       0     0    20
            /root/disks/d31    ONLINE       0     0    20
            /root/disks/d32    ONLINE       0     0    28
            /root/disks/d33    ONLINE       0     0    28
            /root/disks/d34    ONLINE       0     0    16
            /root/disks/d35    ONLINE       0     0    32
            /root/disks/d36    ONLINE       0     0    28
            /root/disks/d37    ONLINE       0     0     0
            /root/disks/d38    ONLINE       0     0    16
            /root/disks/d39    ONLINE       0     0    16
            /root/disks/d40    ONLINE       0     0    28
            /root/disks/d41    ONLINE       0     0    28
            /root/disks/d42    ONLINE       0     0    20
            /root/disks/d43    ONLINE       0     0    20
            /root/disks/d44    ONLINE       0     0    20
            /root/disks/d45    ONLINE       0     0    40
            /root/disks/d46    ONLINE       0     0    28
            /root/disks/d47    ONLINE       0     0    20
            /root/disks/d48    ONLINE       0     0    16
            /root/disks/d49    ONLINE       0     0    12
            /root/disks/d50    ONLINE       0     0    12
            /root/disks/d51    ONLINE       0     0    20
            /root/disks/d52    ONLINE       0     0    20
            /root/disks/d53    ONLINE       0     0    24
        spares
          draid2-0-0           INUSE     currently in use
          draid2-0-1           INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x3d>
