Missing LVM snapshot volume results in data loss (R4-rc1) #3164
So attaching an ISO worked for you? Wonder why I'm special.
@0spinboson Yeah, actually that took some work. I ended up doing "losetup -f" on the ISO image and then using the VM Settings GUI to boot the empty VM from the new loop block device.
I experienced (probably) the same. On a test machine sys-net lost its private volume. I think I now understand what's happening:
So we have two problems:
Re (2): I think we should add … Additionally, we might want to consider:
@doug1: Thanks for your report and analysis. This was quite helpful to track down the problem.
It is done if …
You mean set …?
Right, this sounds good.
Even if possible, I'm not sure if this would be enough, since I'm not sure if libvirt guarantees that …
Maybe it should be a lock that is taken at … Currently the code correctly recovers from the case when … This means that even if … So, to answer my question above: if …
To be honest I don't know. Looking at the code it isn't obvious.
It is …
See linked PR for data loss prevention.
If we set … The best similar solution I came up with, which is able to deal with unreliable events, involved counting starts and shutdowns. But it's not ideal, and this is getting quite complex, so I thought about another strategy. The idea is that we register for libvirt events only for the domain we are interested in. By doing so we can stop events by unregistering. While doing this I also moved the lock logic to before firing the event, so that the events are ordered meaningfully and the lock is held inside the event handlers, as is already the case for … Current WIP version at https://github.com/HW42/qubes-core-admin/commits/hw42/wip/stopped-event. Attention: it's barely tested.
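For concreteness, a minimal sketch of that strategy as I understand it; the class, attribute, and method names (`VM`, `startup_lock`, `on_libvirt_stopped`) are illustrative only, not the actual qubes-core-admin API:

```python
import asyncio

class VM:
    """Hypothetical stand-in for the real QubesVM class (sketch only)."""

    def __init__(self, name):
        self.name = name
        # Taken in start() and in the stopped-event handler, so storage
        # teardown is always ordered against any subsequent start.
        self.startup_lock = asyncio.Lock()

    async def start(self):
        async with self.startup_lock:
            # register the per-domain libvirt event callback here, then
            # start storage (snapshot creation) and the domain itself
            print(f'{self.name}: started')

    async def on_libvirt_stopped(self):
        # Called from the per-domain event callback; unregister afterwards
        # so no further events for this domain are delivered.
        async with self.startup_lock:
            # fire 'domain-stopped' handlers here (storage commit/teardown)
            print(f'{self.name}: storage stopped')

async def demo():
    vm = VM('sys-net')
    await vm.start()
    # Even if the callback fires "late", the lock prevents a new start()
    # from overtaking the pending teardown.
    await asyncio.gather(vm.on_libvirt_stopped(), vm.start())

asyncio.run(demo())
```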
This is a race condition, so to make it more likely to fail (if it's broken), perform some steps manually. In normal circumstances this order of actions is also possible, just less likely to happen, but as seen in the bug report, it does happen from time to time. QubesOS/qubes-issues#3164
Automated announcement from builder-github: The package …
Updated my branch with the improved event handling. I not only fixed the logic flaw you noticed but also addressed another problem: the previous version relied on the assumption that after calling … When I added a second lock for the stopped-event handling (instead of using …) … This version should be quite resistant against unreliable stopped events (missing, duplicated, delayed). I have not opened a PR yet since I first want to test whether I can trigger the race (and then verify my patch).
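A minimal sketch of one way such a guard against duplicated or stale stopped events could look; the `_stop_pending` flag is a hypothetical name, not an attribute from the linked branch:

```python
import asyncio

class VM:
    """Sketch of a duplicate/stale-event guard (hypothetical names)."""

    def __init__(self):
        self.startup_lock = asyncio.Lock()
        self._stop_pending = False        # set by start(), cleared by handler

    async def start(self):
        async with self.startup_lock:
            self._stop_pending = True     # a stop must be handled later

    async def on_libvirt_stopped(self):
        async with self.startup_lock:
            if not self._stop_pending:    # duplicated or delayed event
                return
            self._stop_pending = False
            # storage commit/teardown runs exactly once per start

async def demo():
    vm = VM()
    await vm.start()
    await vm.on_libvirt_stopped()
    await vm.on_libvirt_stopped()         # second delivery is a no-op

asyncio.run(demo())
```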
Second set of tests, for the actual events, not the storage handling done by their handlers. cc @HW42 QubesOS/qubes-issues#3164
The previous version did not ensure that the stopped/shutdown event was handled before a new VM start. This can easily lead to problems like in QubesOS/qubes-issues#3164. This improved version now ensures that the stopped/shutdown events are handled before a new VM start. Additionally, this version should be more robust against unreliable events from libvirt: it handles missing, duplicated and delayed stopped events. Instead of one 'domain-shutdown' event there are now 'domain-stopped' and 'domain-shutdown'. The latter is generated after the former. This way it's easy to run code after the VM is shut down, including stopping its storage.
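A minimal sketch of the event ordering described in this commit message; the dispatcher class is hypothetical, only the event names 'domain-stopped' and 'domain-shutdown' come from the message itself:

```python
import asyncio

class VM:
    """Hypothetical event dispatcher illustrating the ordering only."""

    def __init__(self):
        self.handlers = {'domain-stopped': [], 'domain-shutdown': []}

    def on(self, event, handler):
        self.handlers[event].append(handler)

    async def fire_event_async(self, event):
        for handler in self.handlers[event]:
            await handler(self)

    async def _on_libvirt_stopped(self):
        # 'domain-stopped' fires first (storage is stopped by its handlers);
        # 'domain-shutdown' fires afterwards, so its handlers can rely on
        # the storage already being stopped.
        await self.fire_event_async('domain-stopped')
        await self.fire_event_async('domain-shutdown')

async def stop_storage(vm):
    print('domain-stopped: stopping storage (ThinVolume.stop / _commit)')

async def after_shutdown(vm):
    print('domain-shutdown: runs only after storage is stopped')

async def demo():
    vm = VM()
    vm.on('domain-stopped', stop_storage)
    vm.on('domain-shutdown', after_shutdown)
    await vm._on_libvirt_stopped()

asyncio.run(demo())
```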
Automated announcement from builder-github: The package … Or update dom0 via Qubes Manager.
Qubes OS version: R4-rc1 plus some updates from testing
While attempting to install Windows 7 in a standalone HVM, I performed many start/stop/kill cycles, which possibly got things into a strange state. For whatever reason, my Win7 root volume was missing its snapshot volume when it should have been there. The only manual LVM command I had executed was the GUI-recommended LVM resize on the root volume about an hour earlier; after that I performed several more stop/start cycles while installing Win7 patches.
After a later "stop" command, I lost my root volume completely and was unable to start the VM again. I believe that ThinVolume::stop calls ThinVolume::_commit, which does not check that the snapshot volume (self._vid_snap) exists (i.e. is_dirty) before removing the main volume (self.vid) and cloning the snapshot.
https://github.com/QubesOS/qubes-core-admin/blob/b8d45c214d96516f807fa3d4f9f4bb133b1ef4d6/qubes/storage/lvm.py#L274
I believe that adding a check here would prevent anyone who gets into this weird state from losing data in the future. Thank you. -Doug
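For illustration, a rough sketch of what such a guard in ThinVolume._commit could look like; the `qubes_lvm()` call pattern and exact command sequence are assumptions based on the names mentioned above, not a verified patch:

```python
def _commit(self):
    """Replace the main volume with the snapshot's contents on stop.

    Sketch of the guard proposed above, not the actual lvm.py code.
    """
    # Proposed check: if the snapshot volume was never created (or has gone
    # missing), there is nothing to commit, so bail out instead of removing
    # self.vid and losing the only copy of the data.
    if not self.is_dirty():          # i.e. self._vid_snap does not exist
        return

    # Only now is it safe to swap the volumes.
    qubes_lvm(['remove', self.vid], self.log)
    qubes_lvm(['clone', self._vid_snap, self.vid], self.log)
```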