Missing LVM snapshot volume results in data loss (R4-rc1) #3164

Closed
doug1 opened this issue Oct 10, 2017 · 11 comments
Labels
C: core r4.0-dom0-stable T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

@doug1

doug1 commented Oct 10, 2017

R4-rc1 plus some updates from testing

While attempting to install Windows 7 in a standalone HVM, I performed many start/stop/kill cycles which possibly got things into a strange state. For whatever reason, my Win7 root volume was missing its snapshot volume when it should have been there. The only manual LVM command that I had executed was the GUI-recommended LVM resize on the root volume about an hour earlier, and then I performed several more stop/start cycles while installing Win7 patches.

After a later "stop" command, I lost my root volume completely and was unable to start the VM again. I believe that ThinVolume::stop calls ThinVolume::_commit, which does not check that the snapshot volume (self._vid_snap) exists (i.e. is_dirty) before removing the main volume (self.vid) and cloning the snapshot.

https://github.com/QubesOS/qubes-core-admin/blob/b8d45c214d96516f807fa3d4f9f4bb133b1ef4d6/qubes/storage/lvm.py#L274

I believe that adding a check here would prevent anyone who gets into this weird state from losing data in the future. Thank you. -Doug
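
A minimal, self-contained sketch of the kind of guard being proposed, written against the plain LVM command-line tools rather than the internal helpers in qubes/storage/lvm.py; all function and variable names here are illustrative only:

    import subprocess

    def lv_exists(lv):
        """Return True if a logical volume such as
        'qubes_dom0/vm-win7-root-snap' exists; 'lvs' exits non-zero for
        unknown volumes."""
        return subprocess.run(
            ['lvs', '--noheadings', lv],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    def commit_snapshot(vid, vid_snap):
        """Replace the main volume with its snapshot, but only if the
        snapshot actually exists; otherwise keep the current volume."""
        if not lv_exists(vid_snap):
            return False                      # nothing to commit, nothing lost
        name = vid.split('/')[-1]
        subprocess.run(['lvremove', '-f', vid], check=True)
        subprocess.run(['lvcreate', '--snapshot', '--setactivationskip', 'n',
                        '--name', name, vid_snap], check=True)
        subprocess.run(['lvremove', '-f', vid_snap], check=True)
        return True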

@0spinboson

So attaching an ISO worked for you? Wonder why I'm special.

@doug1
Author

doug1 commented Oct 10, 2017

@0spinboson Yeah, actually that took some work. I ended up doing "losetup -f" on the ISO image and then using the VM Settings GUI to boot the empty VM from the new loop block device.

@andrewdavidwong added the T: bug and C: core labels Oct 11, 2017
@andrewdavidwong added this to the Release 4.0 milestone Oct 11, 2017
@HW42

HW42 commented Oct 12, 2017

I experienced (probably) the same. On a test machine sys-net lost its private volume.

I think I now understand what's happening:

  1. vm.start() starts the VM.
  2. The VM halts. domain-shutdown gets generated.
  3. A second vm.start() gets the startup_lock before vm.on_domain_shutdown_coro() gets it. The VM starts.
  4. vm.on_domain_shutdown_coro() acquires the lock and runs vm.storage.stop(). So ['remove', self.vid], then ['clone', self._vid_snap, self.vid] and then ['remove', self._vid_snap].
  5. The VM halts. vm.on_domain_shutdown_coro() runs vm.storage.stop(). Now ['remove', self.vid] deletes the only remaining volume containing the private image.
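
A toy asyncio snippet that mimics the ordering of steps 3-5 (purely illustrative, no Qubes code involved):

    import asyncio

    async def demo():
        startup_lock = asyncio.Lock()

        async def start(n):
            async with startup_lock:
                print('start #%d' % n)

        async def on_domain_shutdown_coro():
            async with startup_lock:
                print('storage.stop()  # removes vid, clones/removes the snapshot')

        await start(1)
        # The VM halts; the shutdown handler is only scheduled, not yet run:
        handler = asyncio.ensure_future(on_domain_shutdown_coro())
        await start(2)   # the second start wins the lock first
        await handler    # storage.stop() now runs against a "running" VM

    asyncio.get_event_loop().run_until_complete(demo())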

So we have two problems:

  1. thin_volume._commit() does not check if the snapshot it wants to commit actually exists.

  2. The TODO in vm.on_domain_shutdown() has not been addressed.

    # TODO: ensure that domain haven't been started _before_ this
    # coroutine got a chance to acquire a lock
    

Re (2): I think we should add vm.halt_pending to keep track if the VM has been started but no stop event has been seen. And then reflect this in the value returned by vm.get_power_state(). Any better ideas?
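
Roughly, the idea sketched as code (a bare illustration with placeholder names, not a final API; libvirt_domain and storage stand in for the real collaborators):

    import asyncio

    class PowerStateSketch:
        def __init__(self, libvirt_domain, storage):
            self.startup_lock = asyncio.Lock()
            self.libvirt_domain = libvirt_domain
            self.storage = storage
            self.halt_pending = False      # proposed flag (name not final)

        async def start(self):
            async with self.startup_lock:
                # ... actual start sequence ...
                self.halt_pending = True   # from now on a cleanup is owed

        async def on_domain_shutdown_coro(self):
            async with self.startup_lock:
                await self.storage.stop()
                self.halt_pending = False  # shutdown fully handled

        def get_power_state(self):
            if self.libvirt_domain.isActive():
                return 'Running'
            # libvirt says the domain is gone, but its shutdown has not been
            # processed yet, so report Transient instead of Halted.
            return 'Transient' if self.halt_pending else 'Halted'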

Additionally we might want to consider:

  1. Log lvm operations at the default log level. This makes reconstructing what happened much easier.
  2. Make a backup before a remove and clone operation (creating a snapshot should be cheap). At least if we remove self.vid. This should prevent data loss even if something breaks synchronization.

cc: @marmarek, @kalkin

@doug1: Thanks for your report and analysis. This was quite helpful to track down the problem.

@marmarek
Member

Make a backup before a remove and clone operation (creating a snapshot should be cheap). At least if we remove self.vid. This should prevent data loss even if something breaks synchronization.

It is done if revisions_to_keep is set to something positive. Anyway, I think this can be improved by renaming the volume instead of removing it in the first step, and removing it only after a successful clone.
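
Sketched with plain LVM commands (illustrative only; the real code drives LVM through its own helper layer, and the backup name is made up):

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    def commit_keeping_backup(vid, vid_snap):
        """Rename the old volume aside, clone the snapshot into place, and
        only then drop the renamed backup, so a failed clone loses nothing."""
        vg, name = vid.split('/')
        backup = name + '-back'
        run('lvrename', vg, name, backup)                 # keep old data around
        try:
            run('lvcreate', '--snapshot', '--name', name, vid_snap)
        except subprocess.CalledProcessError:
            run('lvrename', vg, backup, name)             # roll back on failure
            raise
        run('lvremove', '-f', '%s/%s' % (vg, backup))     # clone succeeded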

Re (2): I think we should add vm.halt_pending to keep track if the VM has been started but no stop event has been seen. And then reflect this in the value returned by vm.get_power_state(). Any better ideas?

You mean set vm.halt_pending in vm.start(), or vm.on_domain_shutdown()? Ideally we'd take vm.startup_lock in vm.on_domain_shutdown(), but unfortunately it is not possible (it isn't a coroutine).

@HW42

HW42 commented Oct 12, 2017

Make a backup before a remove and clone operation (creating a snapshot should be cheap). At least if we remove self.vid. This should prevent data loss even if something breaks synchronization.

It is done if revisions_to_keep is set to something positive. Anyway, I think this can be improved by renaming the volume instead of removing it in the first step, and removing it only after a successful clone.

Right, this sounds good.

Re (2): I think we should add vm.halt_pending to keep track if the VM has been started but no stop event has been seen. And then reflect this in the value returned by vm.get_power_state(). Any better ideas?

You mean set vm.halt_pending in vm.start(), or vm.on_domain_shutdown()?

vm.halt_pending was a bad name proposal. I mean setting it in vm.start() and unsetting it if we saw and handled the shutdown (and until then vm.get_power_state() returns Transient instead of Halted). So better something like vm.running or so. BTW, is it intentional that vm.is_running() does not mean vm.get_power_state() == 'Running'?

Ideally we'd take vm.startup_lock in vm.on_domain_shutdown(), but unfortunately it is not possible (it isn't a coroutine).

Even if possible, I'm not sure this would be enough, since I'm not sure libvirt guarantees that libvirt_domain.isActive() == True until VIR_DOMAIN_EVENT_STOPPED has been delivered.

@marmarek
Member

vm.halt_pending was a bad name proposal. I mean setting it in vm.start() and unsetting it if we saw and handled the shutdown (and until then vm.get_power_state() returns Transient instead of Halted). So better something like vm.running or so.

Maybe it should be a lock that is taken at vm.start() and released by vm.on_domain_shutdown_coro()? But I fear some deadlocks if VIR_DOMAIN_EVENT_STOPPED goes missing for any reason (like a libvirt restart). What should happen if vm.running==True but libvirt_domain.isActive() == False and someone calls vm.start()? Actively wait for vm.running==False? Exception?

Currently the code correctly recovers from the case where VIR_DOMAIN_EVENT_STOPPED isn't delivered at all.
Maybe a better course of action would be to ensure that:
a) vm.on_domain_shutdown_coro() is called before vm.start(), if the domain was running before
b) vm.on_domain_shutdown_coro() does nothing if it was called after the domain was started again

This means that even if VIR_DOMAIN_EVENT_STOPPED arrives at a totally different time (or never), vm.on_domain_shutdown_coro() will still be called after domain shutdown (unless someone stops qubesd for that time).

So, to answer my question above: if vm.running==True and someone calls vm.start(), but vm.is_running() == False, call vm.on_domain_shutdown_coro() and then set vm.running=False. And in vm.on_domain_shutdown_coro(), exit early if vm.running==False. This requires careful handling of vm.startup_lock. What do you think?
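
Transcribed literally into code, the proposal would look roughly like this (all names and stubs are purely illustrative; the reply below points out remaining gaps):

    import asyncio

    class StartSketch:
        def __init__(self):
            self.startup_lock = asyncio.Lock()
            self.running = False           # proposed flag

        def is_running(self):
            return False                   # stub for a libvirt_domain.isActive() check

        async def storage_stop(self):
            pass                           # stub for vm.storage.stop()

        async def on_domain_shutdown_coro(self):
            async with self.startup_lock:
                if not self.running:
                    return                 # shutdown already handled elsewhere
                await self.storage_stop()

        async def start(self):
            if self.running and not self.is_running():
                # The stop event was never processed: run the cleanup ourselves.
                await self.on_domain_shutdown_coro()
                self.running = False
            async with self.startup_lock:
                # ... actual start sequence ...
                self.running = True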

Even if possible, I'm not sure this would be enough, since I'm not sure libvirt guarantees that libvirt_domain.isActive() == True until VIR_DOMAIN_EVENT_STOPPED has been delivered.

To be honest I don't know. Looking at the code it isn't obvious.

BTW, is it intentional that vm.is_running() does not mean vm.get_power_state() == 'Running'?

It is vm.is_running() == (vm.get_power_state() != 'Halted'). This is a shortcut to ease various tests of whether domain files/config can be safely manipulated. Probably a better name would be is_active()...

@marmarek
Member

See linked PR for data loss prevention.

@HW42

HW42 commented Oct 13, 2017

vm.halt_pending was a bad name proposal. I mean setting it in vm.start() and unsetting it if we saw and handled the shutdown (and until then vm.get_power_state() returns Transient instead of Halted). So better something like vm.running or so.

Maybe it should be a lock that is taken at vm.start() and released by vm.on_domain_shutdown_coro()? But I fear some deadlocks if VIR_DOMAIN_EVENT_STOPPED goes missing for any reason (like a libvirt restart). What should happen if vm.running==True but libvirt_domain.isActive() == False and someone calls vm.start()? Actively wait for vm.running==False? Exception?

Currently the code correctly recovers from the case where VIR_DOMAIN_EVENT_STOPPED isn't delivered at all.
Maybe a better course of action would be to ensure that:
a) vm.on_domain_shutdown_coro() is called before vm.start(), if the domain was running before
b) vm.on_domain_shutdown_coro() does nothing if it was called after the domain was started again

This means that even if VIR_DOMAIN_EVENT_STOPPED arrives at a totally different time (or never), vm.on_domain_shutdown_coro() will still be called after domain shutdown (unless someone stops qubesd for that time).

So, to answer my question above: if vm.running==True and someone calls vm.start(), but vm.is_running() == False, call vm.on_domain_shutdown_coro() and then set vm.running=False. And in vm.on_domain_shutdown_coro(), exit early if vm.running==False. This requires careful handling of vm.startup_lock. What do you think?

If we set vm.running=False then cleanup will not work. Also, after thinking about it, I currently see no way to fix this such that (a) cleanup gets run and (b) if the VM stops again very quickly, cleanup is guaranteed to run only once.

The best similar solution I came up with, which is able to deal with unreliable events, involved counting starts and shutdowns. But it's not ideal, and this is getting quite complex, so I thought about another strategy.

The idea is that we register for libvirt events only from the domain we are interested in. By doing so we can stop the events by unregistering. While doing this I also moved the lock logic to before firing the event, so that the events are ordered meaningfully and the lock is held inside the event handlers, as is already the case for domain-spawn, domain-start, etc.

Current WIP version at https://github.com/HW42/qubes-core-admin/commits/hw42/wip/stopped-event. Attention: It's barely tested.
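
For reference, per-domain registration with the libvirt Python bindings looks roughly like this (a bare sketch under the assumption that a libvirt event loop is already running; not the branch's actual code):

    import libvirt

    def watch_domain(conn, dom, on_stopped):
        """Register a lifecycle callback for one specific domain only."""
        def lifecycle_cb(conn, dom, event, detail, opaque):
            if event == libvirt.VIR_DOMAIN_EVENT_STOPPED:
                on_stopped(dom)
        # Passing the domain instead of None limits delivery to this domain.
        return conn.domainEventRegisterAny(
            dom, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)

    def unwatch_domain(conn, callback_id):
        # Note: deregistering does not obviously guarantee that an already
        # queued event can no longer invoke the callback (see below).
        conn.domainEventDeregisterAny(callback_id)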

marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Oct 15, 2017
This is a race condition, so to make it more likely to fail (if it's broken), perform some steps manually. In normal circumstances this order of actions is also possible, just less likely to happen. But as seen in the bug report, it happens from time to time.

QubesOS/qubes-issues#3164
@qubesos-bot

Automated announcement from builder-github

The package qubes-core-dom0-4.0.10-1.fc25 has been pushed to the r4.0 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@HW42

HW42 commented Oct 17, 2017

Updated my branch with the improved event handling.

I not only fixed the logic flaw you noticed but also addressed another problem. The previous version relied on the callback not being called anymore after domainEventDeregisterAny() returns. But the libvirt documentation is not clear on whether this is guaranteed, or whether there might be some unhandled event in the queue which will still trigger the callback after it has been unregistered.

When I added a second lock for the stopped event handling (instead of using startup_lock) I also noticed that the logic can be (IMO) simplified.

This version should be quite resistant against unreliable stopped events (missing, duplicated, delayed).

I have not opened a PR yet since I first want to test whether I can trigger the race (and then verify my patch).

marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Oct 20, 2017
Second set of tests, for the actual events, not the storage handling done by their handler.

cc @HW42

QubesOS/qubes-issues#3164
HW42 added a commit to HW42/qubes-core-admin that referenced this issue Oct 20, 2017
The previous version did not ensure that the stopped/shutdown event was
handled before a new VM start. This can easily lead to problems like in
QubesOS/qubes-issues#3164.

This improved version now ensures that the stopped/shutdown events are
handled before a new VM start.

Additionally this version should be more robust against unreliable
events from libvirt. It handles missing, duplicated and delayed stopped
events.

Instead of one 'domain-shutdown' event there are now 'domain-stopped'
and 'domain-shutdown'. The latter is generated after the former. This way
it's easy to run code after the VM is shut down, including stopping its
storage.
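
The ordering described above, sketched with the approximate event-firing API (illustrative; the exact calls in qubes-core-admin may differ):

    async def handle_domain_stop(vm):
        # 'domain-stopped' handlers do the low-level cleanup, e.g. a storage
        # handler that calls vm.storage.stop().
        await vm.fire_event_async('domain-stopped')
        # 'domain-shutdown' is fired only after every 'domain-stopped' handler
        # has finished, so its handlers can rely on the storage being stopped.
        await vm.fire_event_async('domain-shutdown')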
@qubesos-bot

Automated announcement from builder-github

The package qubes-core-dom0-4.0.11-1.fc25 has been pushed to the r4.0 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update
