Zpool import first fails then succeeds after typing Ctrl + D #26

Open
hadrienk opened this issue May 8, 2018 · 7 comments

hadrienk commented May 8, 2018

Hi, thanks for sharing your work. I am trying to create a minimal initrd. I configured the hooks as follows:

HOOKS=(base udev block systemd sd-plymouth autodetect modconf keyboard keymap sd-zfs)

I am using rEFInd with the following menu entry:

menuentry "Arch Linux (ck-surface4)" {
    icon     /EFI/refind/icons/os_arch.png
    loader   vmlinuz-linux-ck-surface4
    options  "initrd=intel-ucode.img initrd=initramfs-linux-ck-surface4-minimal.img rw root=zfs:zroot/root/default zfs_wait=30"
    submenuentry "Boot using default initramfs" {
        initrd initramfs-linux-ck-surface4.img
    }
    submenuentry "Boot using fallback initramfs" {
        initrd initramfs-linux-ck-surface4-fallback.img
        add_options "break=postmount"
    }
    submenuentry "Boot to terminal" {
        add_options "systemd.unit=multi-user.target"
    }
}
 

When booting, the zpool import first fails. When I type Ctrl + D, it seems to try again and the system starts normally.
Any idea what I did wrong?

hadrienk changed the title from "Zpool import first fails then succeed after typing Ctrl + D" to "Zpool import first fails then succeeds after typing Ctrl + D" on May 8, 2018

dasJ (Owner) commented May 25, 2018

Are there any relevant systemd messages around it? You should be able to see them from your running system with journalctl -b

@kerberizer

I'm seeing the same issue on one system. zpool complains about "no such pool or dataset", but it does succeed in importing the pool when the zfs-import-cache service is run from the shell after Ctrl+D. I suspect a timing problem, probably related to #25: perhaps the devices are not yet properly initialized when the import cache service is run for the first time. It's an important system, so unfortunately I can't experiment at will, but if I have new information, I'll report it.

@kerberizer

The logs seem to confirm my suspicions:

Sep 06 15:17:09 archlinux systemd[1]: Started udev Wait for Complete Device Initialization.
Sep 06 15:17:09 archlinux systemd[1]: Reached target System Initialization.
Sep 06 15:17:09 archlinux systemd[1]: Reached target Basic System.
Sep 06 15:17:09 archlinux systemd[1]: System is tainted: var-run-bad
Sep 06 15:17:09 archlinux systemd[1]: Starting Import ZFS pools by cache file...
Sep 06 15:17:09 archlinux kernel: spl: loading out-of-tree module taints kernel.
Sep 06 15:17:09 archlinux kernel: icp: module license 'CDDL' taints kernel.
Sep 06 15:17:09 archlinux kernel: Disabling lock debugging due to kernel taint
Sep 06 15:17:09 archlinux kernel: usb 2-12: new high-speed USB device number 2 using xhci_hcd
Sep 06 15:17:09 archlinux kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: usb 1-1: new high-speed USB device number 2 using ehci-pci
Sep 06 15:17:09 archlinux kernel: usb 4-1: new high-speed USB device number 2 using ehci-pci
Sep 06 15:17:09 archlinux kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata10: SATA link down (SStatus 0 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata9: SATA link down (SStatus 0 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata2.00: NCQ Send/Recv Log not supported
(...snip...)
Sep 06 15:17:11 archlinux kernel: ZFS: Loaded module v0.7.0-1551_gcc99f275a, ZFS pool version 5000, ZFS filesystem version 5
Sep 06 15:17:11 archlinux kernel: random: crng init done
Sep 06 15:17:11 archlinux kernel: random: 7 urandom warning(s) missed due to ratelimiting
Sep 06 15:17:11 archlinux zpool[281]: cannot import '<redacted>': no such pool or dataset
Sep 06 15:17:11 archlinux zpool[281]:         Destroy and re-create the pool from
Sep 06 15:17:11 archlinux zpool[281]:         a backup source.
Sep 06 15:17:11 archlinux systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Sep 06 15:17:11 archlinux systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Sep 06 15:17:11 archlinux systemd[1]: Failed to start Import ZFS pools by cache file.

Apparently a lot of device initialization happens after udevadm settle on this particular system.

@kerberizer

@dasJ I can confirm being able to avoid the issue by inserting an appropriate delay before pool import. My test solution was rather crude: if the first import would fail, it would sleep 2 seconds, then try again and sleep another 4 seconds on failure before trying one last time. I'm afraid I don't know right now what would be the most elegant and efficient approach. In any case, the ability to configure a delay before the pool import—possibly via a kernel parameter—may at least be a reasonable interim solution.
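The staged retry described above (try, wait 2 seconds, try again, wait 4 seconds, try one last time) could be sketched in POSIX sh roughly as follows. The original test patch is not shown in this thread, so this is only an illustration of the idea; retry_with_delays is a hypothetical helper name, not something shipped by sd-zfs or ZFS.

```shell
#!/bin/sh
# Sketch only: retry a command with a fixed schedule of delays.
# retry_with_delays is a hypothetical helper, not part of sd-zfs or ZFS.
# Usage: retry_with_delays "2 4" command [args...]
retry_with_delays() {
    delays=$1
    shift
    # First attempt runs immediately, without any delay.
    "$@" && return 0
    # One further attempt per listed delay, sleeping before each.
    for delay in $delays; do
        sleep "$delay"
        "$@" && return 0
    done
    # Every attempt failed.
    return 1
}
```

Assuming the import unit runs something like zpool import -c /etc/zfs/zpool.cache -aN (check the ExecStart line of your own unit), the call would be retry_with_delays "2 4" zpool import -c /etc/zfs/zpool.cache -aN. A fixed schedule keeps the worst-case added boot delay bounded, here at 6 seconds.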

@kerberizer

I've also encountered the issue on another system, but can't tell yet what might be different about those problematic systems. The same workaround of inserting a delay did work there as well.

Klowner commented Sep 3, 2019

@kerberizer I realize it's been a year, but would you be willing to share the modifications you made to introduce the delay? I'm having a heck of a time booting a system with a zpool on a USB device and it appears to be entirely a timing issue.

kerberizer commented Sep 3, 2019

@Klowner No problem sharing at all, but I need to recall what those changes were; it appears that at some point I removed them. Off the top of my head, I'd suggest editing zfs-import-cache.service (or zfs-import-scan.service if not using zpool.cache) and replacing the /usr/bin/zpool import in ExecStart with something like /usr/bin/sh -c "zpool import ... || { sleep N; zpool import ...; } || ..." The braces matter: in a flat chain such as A || sleep N && B, the shell runs B even when A has succeeded. The point is to retry the pool import after some time if it fails, hoping that the devices will have had time to settle in the meantime.

A more robust solution may be unnecessary, as Arch Linux may at some point ditch initcpio and replace it with dracut, or at least that was my impression from some emails on the arch-dev-public mailing list.
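One detail worth checking when writing an sh -c retry chain like the one suggested above: in sh, || and && have equal precedence and are evaluated left to right, so a flat chain of the form A || sleep N && B runs B even when A succeeded. A second import of an already imported pool would presumably fail, which could leave the unit failed anyway (an assumption, not verified against zfs-import-cache.service). The sketch below shows the difference using try_import, a hypothetical stand-in for the real zpool import command that always succeeds and counts its calls.

```shell
#!/bin/sh
# Sketch only: try_import is a hypothetical stand-in for the real
# "zpool import ..." command. It counts each call and always succeeds.
attempts=0
try_import() {
    attempts=$((attempts + 1))
    return 0
}

# Flat chain: || and && have equal precedence and associate left to
# right, so this parses as (try_import || sleep 0) && try_import.
# The second try_import runs even though the first one succeeded.
attempts=0
try_import || sleep 0 && try_import
flat_attempts=$attempts

# Grouped chain: the retry runs only if the first attempt fails.
attempts=0
try_import || { sleep 0; try_import; }
grouped_attempts=$attempts

echo "flat: $flat_attempts, grouped: $grouped_attempts"
# prints "flat: 2, grouped: 1"
```

Grouping each retry in braces keeps a successful first import from being followed by another attempt.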
