Missing LVM snapshot volume results in data loss (R4-rc1) #3164

Closed
doug1 opened this issue Oct 10, 2017 · 11 comments
Labels
C: core r4.0-dom0-stable T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

@doug1

doug1 commented Oct 10, 2017

R4-rc1 plus some updates from testing

While attempting to install Windows 7 in a standalone HVM, I performed many start/stop/kill cycles which possibly got things into a strange state. For whatever reason, my Win7 root volume was missing its snapshot volume when it should have been there. The only manual LVM command that I had executed was the GUI-recommended LVM resize on the root volume about an hour earlier, and then I performed several more stop/start cycles while installing Win7 patches.

After a later "stop" command, I lost my root volume completely and was unable to start the VM again. I believe that ThinVolume::stop calls ThinVolume::_commit, which does not check that the snapshot volume (self._vid_snap) exists (i.e. is_dirty) before removing the main volume (self.vid) and cloning the snapshot.

https://github.com/QubesOS/qubes-core-admin/blob/b8d45c214d96516f807fa3d4f9f4bb133b1ef4d6/qubes/storage/lvm.py#L274

I believe that adding a check here would prevent anyone who gets into this weird state from losing data in the future. Thank you. -Doug
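
A minimal, self-contained sketch of the kind of guard being proposed, written against the plain LVM command-line tools rather than the internal helpers in qubes/storage/lvm.py; all function and variable names here are illustrative only:

    import subprocess

    def lv_exists(lv):
        """Return True if a logical volume such as
        'qubes_dom0/vm-win7-root-snap' exists; 'lvs' exits non-zero for
        unknown volumes."""
        return subprocess.run(
            ['lvs', '--noheadings', lv],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    def commit_snapshot(vid, vid_snap):
        """Replace the main volume with its snapshot, but only if the
        snapshot actually exists; otherwise keep the current volume."""
        if not lv_exists(vid_snap):
            return False                      # nothing to commit, nothing lost
        name = vid.split('/')[-1]
        subprocess.run(['lvremove', '-f', vid], check=True)
        subprocess.run(['lvcreate', '--snapshot', '--setactivationskip', 'n',
                        '--name', name, vid_snap], check=True)
        subprocess.run(['lvremove', '-f', vid_snap], check=True)
        return True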

@0spinboson

So attaching an ISO worked for you? Wonder why I'm special.

@doug1
Author

doug1 commented Oct 10, 2017

@0spinboson Yeah, actually that took some work. I ended up doing "losetup -f" on the ISO image and then using the VM Settings GUI to boot the empty VM from the new loop block device.

@andrewdavidwong added the T: bug and C: core labels Oct 11, 2017
@andrewdavidwong added this to the Release 4.0 milestone Oct 11, 2017
@HW42

HW42 commented Oct 12, 2017

I experienced (probably) the same. On a test machine sys-net lost its private volume.

I think I now understand what's happening:

  1. vm.start() starts the VM.
  2. The VM halts. domain-shutdown gets generated.
  3. A second vm.start() gets the startup_lock before vm.on_domain_shutdown_coro() gets it. The VM starts.
  4. vm.on_domain_shutdown_coro() acquires the lock and runs vm.storage.stop(). So ['remove', self.vid], then ['clone', self._vid_snap, self.vid] and then ['remove', self._vid_snap].
  5. The VM halts. vm.on_domain_shutdown_coro() runs vm.storage.stop(). Now ['remove', self.vid] deletes the only remaining volume containing the private image.
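
A toy asyncio snippet that mimics the ordering of steps 3-5 (purely illustrative, no Qubes code involved):

    import asyncio

    async def demo():
        startup_lock = asyncio.Lock()

        async def start(n):
            async with startup_lock:
                print('start #%d' % n)

        async def on_domain_shutdown_coro():
            async with startup_lock:
                print('storage.stop()  # removes vid, clones/removes the snapshot')

        await start(1)
        # The VM halts; the shutdown handler is only scheduled, not yet run:
        handler = asyncio.ensure_future(on_domain_shutdown_coro())
        await start(2)   # the second start wins the lock first
        await handler    # storage.stop() now runs against a "running" VM

    asyncio.get_event_loop().run_until_complete(demo())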

So we have two problems:

  1. thin_volume._commit() does not check if the snapshot it wants to commit actually exists.

  2. The TODO in vm.on_domain_shutdown() has not been addressed.

    # TODO: ensure that domain haven't been started _before_ this
    # coroutine got a chance to acquire a lock
    

Re (2): I think we should add vm.halt_pending to keep track if the VM has been started but no stop event has been seen. And then reflect this in the value returned by vm.get_power_state(). Any better ideas?
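
Roughly, the idea sketched as code (a bare illustration with placeholder names, not a final API; libvirt_domain and storage stand in for the real collaborators):

    import asyncio

    class PowerStateSketch:
        def __init__(self, libvirt_domain, storage):
            self.startup_lock = asyncio.Lock()
            self.libvirt_domain = libvirt_domain
            self.storage = storage
            self.halt_pending = False      # proposed flag (name not final)

        async def start(self):
            async with self.startup_lock:
                # ... actual start sequence ...
                self.halt_pending = True   # from now on a cleanup is owed

        async def on_domain_shutdown_coro(self):
            async with self.startup_lock:
                await self.storage.stop()
                self.halt_pending = False  # shutdown fully handled

        def get_power_state(self):
            if self.libvirt_domain.isActive():
                return 'Running'
            # libvirt says the domain is gone, but its shutdown has not been
            # processed yet, so report Transient instead of Halted.
            return 'Transient' if self.halt_pending else 'Halted'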

Additionally we might want to consider:

  1. Log lvm operations at the default log level. This makes reconstructing what happened much easier.
  2. Make a backup before a remove and clone operation (creating a snapshot should be cheap). At least if we remove self.vid. This should prevent data loss even if something breaks synchronization.

cc: @marmarek, @kalkin

@doug1: Thanks for your report and analysis. This was quite helpful to track down the problem.

@marmarek
Member

Make a backup before a remove and clone operation (creating a snapshot should be cheap). At least if we remove self.vid. This should prevent data loss even if something breaks synchronization.

It is done if revisions_to_keep is set to something positive. Anyway, I think this can be improved by renaming the volume instead of removing it in the first step, and removing it only after a successful clone.
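
Sketched with plain LVM commands (illustrative only; the real code drives LVM through its own helper layer, and the backup name is made up):

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    def commit_keeping_backup(vid, vid_snap):
        """Rename the old volume aside, clone the snapshot into place, and
        only then drop the renamed backup, so a failed clone loses nothing."""
        vg, name = vid.split('/')
        backup = name + '-back'
        run('lvrename', vg, name, backup)                 # keep old data around
        try:
            run('lvcreate', '--snapshot', '--name', name, vid_snap)
        except subprocess.CalledProcessError:
            run('lvrename', vg, backup, name)             # roll back on failure
            raise
        run('lvremove', '-f', '%s/%s' % (vg, backup))     # clone succeeded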

Re (2): I think we should add vm.halt_pending to keep track if the VM has been started but no stop event has been seen. And then reflect this in the value returned by vm.get_power_state(). Any better ideas?

You mean set vm.halt_pending in vm.start(), or vm.on_domain_shutdown()? Ideally we'd take vm.startup_lock in vm.on_domain_shutdown(), but unfortunately it is not possible (it isn't a coroutine).

@HW42

HW42 commented Oct 12, 2017

Make a backup before a remove and clone operation (creating a snapshot should be cheap). At least if we remove self.vid. This should prevent data loss even if something breaks synchronization.

It is done if revisions_to_keep is set to something positive. Anyway, I think this can be improved by renaming the volume instead of removing it in the first step, and removing it only after a successful clone.

Right, this sounds good.

Re (2): I think we should add vm.halt_pending to keep track if the VM has been started but no stop event has been seen. And then reflect this in the value returned by vm.get_power_state(). Any better ideas?

You mean set vm.halt_pending in vm.start(), or vm.on_domain_shutdown()?

vm.halt_pending was a bad name proposal. I mean setting it in vm.start() and unsetting it if we saw and handled the shutdown (and until then vm.get_power_state() returns Transient instead of Halted). So better something like vm.running or so. BTW, is it intentional that vm.is_running() does not mean vm.get_power_state() == 'Running'?

Ideally we'd take vm.startup_lock in vm.on_domain_shutdown(), but unfortunately it is not possible (it isn't a coroutine).

Even if possible, I'm not sure this would be enough, since I'm not sure libvirt guarantees that libvirt_domain.isActive() == True until VIR_DOMAIN_EVENT_STOPPED has been delivered.

@marmarek
Member

vm.halt_pending was a bad name proposal. I mean setting it in vm.start() and unsetting it if we saw and handled the shutdown (and until then vm.get_power_state() returns Transient instead of Halted). So better something like vm.running or so.

Maybe it should be a lock that is taken at vm.start() and released by vm.on_domain_shutdown_coro()? But I fear some deadlocks if VIR_DOMAIN_EVENT_STOPPED goes missing for any reason (like a libvirt restart). What should happen if vm.running==True but libvirt_domain.isActive() == False and someone calls vm.start()? Actively wait for vm.running==False? Exception?

Currently the code correctly recovers from the case where VIR_DOMAIN_EVENT_STOPPED isn't delivered at all.
Maybe a better course of action would be to ensure that:
a) vm.on_domain_shutdown_coro() is called before vm.start(), if the domain was running before
b) vm.on_domain_shutdown_coro() does nothing if it was called after the domain was started again

This means that even if VIR_DOMAIN_EVENT_STOPPED arrives at a totally different time (or never), vm.on_domain_shutdown_coro() will still be called after domain shutdown (unless someone stops qubesd for that time).

So, to answer my question above: if vm.running==True and someone calls vm.start(), but vm.is_running() == False, call vm.on_domain_shutdown_coro() and then set vm.running=False. And in vm.on_domain_shutdown_coro(), exit early if vm.running==False. This requires careful handling of vm.startup_lock. What do you think?
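
Transcribed literally into code, the proposal would look roughly like this (all names and stubs are purely illustrative; the reply below points out remaining gaps):

    import asyncio

    class StartSketch:
        def __init__(self):
            self.startup_lock = asyncio.Lock()
            self.running = False           # proposed flag

        def is_running(self):
            return False                   # stub for a libvirt_domain.isActive() check

        async def storage_stop(self):
            pass                           # stub for vm.storage.stop()

        async def on_domain_shutdown_coro(self):
            async with self.startup_lock:
                if not self.running:
                    return                 # shutdown already handled elsewhere
                await self.storage_stop()

        async def start(self):
            if self.running and not self.is_running():
                # The stop event was never processed: run the cleanup ourselves.
                await self.on_domain_shutdown_coro()
                self.running = False
            async with self.startup_lock:
                # ... actual start sequence ...
                self.running = True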

Even if possible, I'm not sure this would be enough, since I'm not sure libvirt guarantees that libvirt_domain.isActive() == True until VIR_DOMAIN_EVENT_STOPPED has been delivered.

To be honest I don't know. Looking at the code it isn't obvious.

BTW, is it intentional that vm.is_running() does not mean vm.get_power_state() == 'Running'?

It is vm.is_running() == (vm.get_power_state() != 'Halted'). This is a shortcut to ease various tests of whether domain files/config can be safely manipulated. Probably a better name would be is_active()...

@marmarek
Member

See linked PR for data loss prevention.

@HW42

HW42 commented Oct 13, 2017

vm.halt_pending was a bad name proposal. I mean setting it in vm.start() and unsetting it if we saw and handled the shutdown (and until then vm.get_power_state() returns Transient instead of Halted). So better something like vm.running or so.

Maybe it should be a lock that is taken at vm.start() and released by vm.on_domain_shutdown_coro()? But I fear some deadlocks if VIR_DOMAIN_EVENT_STOPPED goes missing for any reason (like a libvirt restart). What should happen if vm.running==True but libvirt_domain.isActive() == False and someone calls vm.start()? Actively wait for vm.running==False? Exception?

Currently the code correctly recovers from the case where VIR_DOMAIN_EVENT_STOPPED isn't delivered at all.
Maybe a better course of action would be to ensure that:
a) vm.on_domain_shutdown_coro() is called before vm.start(), if the domain was running before
b) vm.on_domain_shutdown_coro() does nothing if it was called after the domain was started again

This means that even if VIR_DOMAIN_EVENT_STOPPED arrives at a totally different time (or never), vm.on_domain_shutdown_coro() will still be called after domain shutdown (unless someone stops qubesd for that time).

So, to answer my question above: if vm.running==True and someone calls vm.start(), but vm.is_running() == False, call vm.on_domain_shutdown_coro() and then set vm.running=False. And in vm.on_domain_shutdown_coro(), exit early if vm.running==False. This requires careful handling of vm.startup_lock. What do you think?

If we set vm.running=False then cleanup will not work. Also, after thinking about it, I currently see no way to fix this such that (a) cleanup gets run and (b) if the VM stops again very quickly, cleanup is guaranteed to run only once.

The best similar solution I came up with, which is able to deal with unreliable events, involved counting starts and shutdowns. But it's not ideal, and this is getting quite complex, so I thought about another strategy.

The idea is that we register for libvirt events only from the domain we are interested in. By doing so we can stop the events by unregistering. While doing this I also moved the lock logic to before firing the event, so that the events are ordered meaningfully and the lock is held inside the event handlers, as is already the case for domain-spawn, domain-start, etc.

Current WIP version at https://github.com/HW42/qubes-core-admin/commits/hw42/wip/stopped-event. Attention: It's barely tested.
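
For reference, per-domain registration with the libvirt Python bindings looks roughly like this (a bare sketch under the assumption that a libvirt event loop is already running; not the branch's actual code):

    import libvirt

    def watch_domain(conn, dom, on_stopped):
        """Register a lifecycle callback for one specific domain only."""
        def lifecycle_cb(conn, dom, event, detail, opaque):
            if event == libvirt.VIR_DOMAIN_EVENT_STOPPED:
                on_stopped(dom)
        # Passing the domain instead of None limits delivery to this domain.
        return conn.domainEventRegisterAny(
            dom, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)

    def unwatch_domain(conn, callback_id):
        # Note: deregistering does not obviously guarantee that an already
        # queued event can no longer invoke the callback (see below).
        conn.domainEventDeregisterAny(callback_id)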

marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Oct 15, 2017
This is a race condition, so to make it more likely to fail (if it's broken), perform some steps manually. In normal circumstances this order of actions is also possible, just less likely to happen. But as seen in the bug report, it happens from time to time.

QubesOS/qubes-issues#3164
@qubesos-bot

Automated announcement from builder-github

The package qubes-core-dom0-4.0.10-1.fc25 has been pushed to the r4.0 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@HW42

HW42 commented Oct 17, 2017

Updated my branch with the improved event handling.

I not only fixed the logic flaw you noticed but also addressed another problem. The previous version relied on the callback not being called anymore after domainEventDeregisterAny() returns. But the libvirt documentation is not clear on whether this is guaranteed, or whether there might be some unhandled event in the queue which will still trigger the callback after it has been unregistered.

When I added a second lock for the stopped event handling (instead of using startup_lock) I also noticed that the logic can be (IMO) simplified.

This version should be quite resistant against unreliable stopped events (missing, duplicated, delayed).

I have not opened a PR yet since I first want to test whether I can trigger the race (and then verify my patch).

marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Oct 20, 2017
Second set of tests, for the actual events, not the storage handling done by their handler.

cc @HW42

QubesOS/qubes-issues#3164
HW42 added a commit to HW42/qubes-core-admin that referenced this issue Oct 20, 2017
The previous version did not ensure that the stopped/shutdown event was
handled before a new VM start. This can easily lead to problems like in
QubesOS/qubes-issues#3164.

This improved version now ensures that the stopped/shutdown events are
handled before a new VM start.

Additionally this version should be more robust against unreliable
events from libvirt. It handles missing, duplicated and delayed stopped
events.

Instead of one 'domain-shutdown' event there are now 'domain-stopped'
and 'domain-shutdown'. The latter is generated after the former. This way
it's easy to run code after the VM is shut down, including stopping its
storage.
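
The ordering described above, sketched with the approximate event-firing API (illustrative; the exact calls in qubes-core-admin may differ):

    async def handle_domain_stop(vm):
        # 'domain-stopped' handlers do the low-level cleanup, e.g. a storage
        # handler that calls vm.storage.stop().
        await vm.fire_event_async('domain-stopped')
        # 'domain-shutdown' is fired only after every 'domain-stopped' handler
        # has finished, so its handlers can rely on the storage being stopped.
        await vm.fire_event_async('domain-shutdown')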
@qubesos-bot

Automated announcement from builder-github

The package qubes-core-dom0-4.0.11-1.fc25 has been pushed to the r4.0 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update
