Storage pool exceptions handled ungracefully #5723
Labels: C: core, P: default, T: enhancement
**Qubes OS version:**

4.0 (according to `/etc/fedora-release`)

**Affected component(s) or functionality:**

`qubes-core-admin` -> `qubesd` -> storage pool initialization

**Steps to reproduce the behavior:**
1. Use a storage pool driver in `qubes-core-admin` that raises `qubes.storage.StoragePoolException("unable to load")` during initialization.
2. `qubes/storage/__init__.py` doesn't catch the exception on the line that says `pool = self.vm.app.get_pool(volume_config['pool'])`.
3. `sudo journalctl -fu qubesd` spams "failed to start" messages.

I tentatively "fixed" it with the patch below, which turned out to not be a good idea (see notes in the section below).
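To make the failure mode concrete, here is a minimal, self-contained sketch of the uncaught-exception path. The `get_pool`/`load_volume` functions below are hypothetical stand-ins, not the real `qubes-core-admin` code; only the shape of the failure is what matters:

```python
# Minimal stand-in for qubes.storage.StoragePoolException (the real
# class lives in qubes-core-admin; this is illustration only).
class StoragePoolException(Exception):
    pass

def get_pool(name):
    # Simulates a pool driver whose backend fails to load
    # (e.g. a DKMS build failure for the ZFS module).
    raise StoragePoolException("unable to load")

def load_volume(volume_config):
    # Analogous to the uncaught call in qubes/storage/__init__.py:
    #     pool = self.vm.app.get_pool(volume_config['pool'])
    return get_pool(volume_config['pool'])

try:
    load_volume({'pool': 'zfs'})
except StoragePoolException as e:
    # Without a handler like this somewhere up the stack,
    # the exception propagates and qubesd startup dies.
    print("pool failed:", e)
```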
**Expected or desired behavior:**

Storage pool drivers should be allowed to fail (temporarily, ideally) without bringing down the whole system, and without the affected VMs/pools being erased from `qubes.xml`. (I took a backup of my `qubes.xml` before trying to fix this, which turned out to be a good idea, since my "fix" resulted in exactly that.) This applies both when loading the storage pool module itself (which I believe this bug report is about) and during `init_volume`/`init_pool`. The affected pools/VMs should still be visible, even if they are not startable, or at the very least they should reappear when the transient error is corrected.
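One possible shape for this behavior, sketched with hypothetical names (`BrokenPool` and `load_pools` do not exist in `qubes-core-admin`; this is only an illustration of "keep the pool visible instead of dropping it"):

```python
class StoragePoolException(Exception):
    """Stand-in for qubes.storage.StoragePoolException."""

class BrokenPool:
    """Placeholder that keeps a failed pool's config so it is not
    erased from qubes.xml and can reappear once the error clears."""
    def __init__(self, name, config, error):
        self.name, self.config, self.error = name, config, error

    def init_volume(self, *args, **kwargs):
        # Still listed, but any attempt to use it fails loudly.
        raise StoragePoolException(
            f"pool {self.name!r} unavailable: {self.error}")

def load_pools(pool_configs, driver_init):
    pools = {}
    for name, config in pool_configs.items():
        try:
            pools[name] = driver_init(name, config)
        except StoragePoolException as e:
            pools[name] = BrokenPool(name, config, e)  # keep, don't drop
    return pools

# Usage: a driver that fails to load leaves a BrokenPool behind,
# preserving its configuration for later serialization.
def failing_driver(name, config):
    raise StoragePoolException("unable to load")

pools = load_pools({'zfs': {'driver': 'zfs'}}, failing_driver)
print(type(pools['zfs']).__name__)
```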
**Actual behavior:**

`qubesd` fails to start; as a consequence, no VMs are able to start and the `qvm-*` commands fail.

This happened when updating to a new kernel + new `kernel-devel` package which included "support" for a GCC plugin to collect extra entropy (somewhere in `include/linux/random.h`), guarded by an `ifdef`. For some reason the ifdef guard resulted in the new fancy code being expanded by DKMS, but a symbol (`latent_entropy`) was missing, causing DKMS compilation of the ZFS module to fail. That's a separate issue, but I suspect there will be more build failures in the future, so I'd like to have a solution to this problem.

My failed attempts to fix this resulted in the pools being "forgotten," I think because I returned `None` where `qubesd` expects an updated copy of the dict that tracks storage pool parameters, causing `None` to be serialized in place of the original data. I'm not sure if this was caused by the line in `qubes/storage/__init__.py` or somewhere else (after my patch), but that line definitely failed. I'm not super inclined to repeat the experiment if the issue can be resolved without further information, since it takes some time to conduct safely, but I can do that if there's no other path forward.

**General notes:**
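One data point on the serialization pitfall from my failed fix: returning `None` where an updated config dict is expected means `None` is what gets written back. A hedged sketch with stand-in names (none of these functions exist in `qubes-core-admin`):

```python
class StoragePoolException(Exception):
    """Stand-in for qubes.storage.StoragePoolException."""

def failing_driver(volume_config):
    raise StoragePoolException("unable to load")

def init_volume_broken(volume_config):
    # Roughly what my patch did: swallow the error and return None,
    # which then gets serialized in place of the original data.
    try:
        return failing_driver(volume_config)
    except StoragePoolException:
        return None

def init_volume_safer(volume_config):
    # Returning the original dict on failure at least preserves
    # the pool's parameters for qubes.xml.
    try:
        return failing_driver(volume_config)
    except StoragePoolException:
        return volume_config

config = {'pool': 'zfs', 'name': 'root'}
print(init_volume_broken(config))   # None -> pool "forgotten"
print(init_volume_safer(config))    # original parameters survive
```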
I'm not sure if I'm doing it wrong / throwing the wrong exception from the wrong place, but I would appreciate some help understanding how to do this gracefully.

**I have consulted the following relevant documentation:**

I did not :-(

**I am aware of the following related, non-duplicate issues:**

The patchset which led to this (ZFS storage pools): QubesOS/qubes-core-admin#289