
LV disposal when removed from pool: deallocated pool space is neither zeroed nor discarded #5077

Closed
brendanhoar opened this issue Jun 5, 2019 · 27 comments
Labels
C: core · community dev (This is being developed by a member of the community rather than a core Qubes developer.) · P: default (Priority: default. Default priority for new issues, to be replaced given sufficient information.) · T: enhancement (Type: enhancement. A new feature that does not yet exist or improvement of existing functionality.)
Comments

@brendanhoar

brendanhoar commented Jun 5, 2019

Community Dev: @tasket
Pull Request: QubesOS/qubes-core-admin#267


The problem you're addressing (if any)
Within a VM, a user can run fstrim on a mounted filesystem to issue discard commands for most of the unused space in that filesystem. This has two benefits: a) releasing unused space in the thin pool and b) sending discard commands down to the physical device (if the user has made the other adjustments documented on the Qubes website).

However, when a VM is deleted (or a disposable VM exits), the lvremove command is used to deallocate the space in the thin pool used by the VM's LVs. That command does not send discards down the chain to the physical device.

Describe the solution you'd like
Under R4.0, adding an invocation of blkdiscard before invoking lvremove (in lvm.py) would help preserve amnesiac behavior for disposable or deleted VMs (again, provided the user has enabled discard on all device/fs layers and has the appropriate hardware).

Poking around on github, I suspect this could be remedied by modifying qubes-core-admin's qubes/storage/lvm.py in the 'remove' handler (line 750) and adding a blkdiscard invocation.

https://github.com/QubesOS/qubes-core-admin/blob/master/qubes/storage/lvm.py
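For a single volume, the manual equivalent of the proposed change is roughly the following (the volume name is illustrative; the enhancement would have lvm.py do this automatically on removal):

# Discard the LV's blocks first so the TRIM reaches the physical device,
# then remove the LV as Qubes already does.
blkdiscard /dev/qubes_dom0/vm-example-private
lvm lvremove -f qubes_dom0/vm-example-private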

Where is the value to a user, and who might that user be?

  • Users who are concerned with remnants of old sessions or old VMs hanging around in the thin pool's unallocated space.
  • Users who might (perhaps incorrectly) believe that disposable VM and/or deleted VM content is not recoverable.

Describe alternatives you've considered
Manually modifying lvm.py on my own install and maintaining the changes over time locally.
Manually running blkdiscard against LVs of a non-running VM I am about to delete.
Loading disposable VMs with a self-destruct script, run out of tmpfs during late shutdown, that blkdiscards the /rw volumes (would this cover all snapshots?).

If you happen to miss the opportunity to blkdiscard before lvremove, there seems to be no other (simple) method to push those discards through, short of offline analysis of the pool metadata and targeted block-range discards on all non-allocated regions.

Additional context
It appears the oVirt developers are considering adding a blkdiscard call against the logical volume block devices before performing lvremove:

https://www.ovirt.org/develop/release-management/features/storage/wipe-volumes-using-blkdiscard.html

Relevant documentation you've consulted
https://www.qubes-os.org/doc/disk-trim/

Related, non-duplicate issues
#3717
#3226
#5055

@brendanhoar brendanhoar added the 'P: default' and 'T: enhancement' labels Jun 5, 2019
@andrewdavidwong
Member

Also related: #904, #1819.

@andrewdavidwong andrewdavidwong added this to the Release 4.1 milestone Jun 5, 2019
@iprid23

iprid23 commented Jun 11, 2019

@brendanhoar How about issue_discards = 1 in /etc/lvm/lvm.conf?

I am pretty sure lvremove should do discards if you have that option enabled.
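For reference, the setting lives in the devices section of lvm.conf:

# /etc/lvm/lvm.conf (excerpt)
devices {
    # Issue discards to an LV's underlying physical volumes when the LV no
    # longer uses the space (e.g. with lvremove, lvreduce).
    issue_discards = 1
}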

@brendanhoar
Author

brendanhoar commented Jun 11, 2019

Hi @iprid23 - I thought so too but I was wrong.

I put together a script in dom0 to monitor discards to the physical drive. After enabling discards and/or pass through in all layers, if I create a VM, create a large file in the VM, delete it in the VM, and execute sync in the VM, discards are passed down to the drive.

If I create another large file in the VM, then shut down the VM and use qubes tools to delete the entire VM, no discards are issued down to the drive.

Therefore, based on the experiment above (and on searching to verify I wasn't missing a config option): LVM does not, and currently cannot, issue discards when a thin LV is deleted. That is, deallocation is not equivalent to discard.

Rather, discards are one method of deallocation, with the bonus of optional passthrough of the discards to the device; that bonus does not apply to LV deletions.

Brendan

PS-I’ll post the script when I get back to my desk.

PPS - Also I do have issue_discards set to 1. The text comments above the setting in lvm.conf are misleading in my opinion (at least for this version of lvm).

@brendanhoar
Author

brendanhoar commented Jun 11, 2019

Script:

#!/bin/bash
# Watch cumulative per-device discard statistics (fields 12-15 of /sys/block/*/stat) once per second.
watch -n 1 -d \
"if [ -d /sys/block/sda ] ; then pat=sd ; else pat=xvd ; fi ; sync;echo --DISCARD TOTALS--;cat /sys/block/\$pat*/stat|awk 'BEGIN {print \"DRIVE IOS    QMERGES   SECTORS      MSWAITS\"} {printf \"%5i %-8s %s %15s %11s\\n\",NR,\$12,\$13,\$14,\$15}'"

The second-to-last column, "SECTORS", is the count of discarded sectors seen by the device, per https://www.kernel.org/doc/Documentation/block/stat.txt

On dom0, it'll report all discards from /dev/sd* since the kernel started keeping track.
Inside VMs, it'll report all discards from /dev/xvd* since the kernel started keeping track.

B

@tasket

tasket commented Jun 17, 2019

I tried different settings in lvm.conf also, but got the same results. (Edit: the disparity does seem to exist between fs ops and lvremove, so testing with the latter is necessary.)

It looks like either thin_trim must (somehow) be employed, or a switch made to Btrfs, to produce the TRIMs at the device level.

Is there an upstream bug for this? I haven't found one.

@tasket

tasket commented Jun 17, 2019

Could be important:

The oVirt page indicates they are using the blkdiscard zero-fill option, but if TRIM is desired then zero-fill should not be used. When I filled a thin LV with data and then ran a plain blkdiscard on it, the discards showed up on /dev/sda. So this could be a quick fix for the issue, without needing to zero-fill or deactivate entire pools.
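For clarity, the two modes on the command line (util-linux blkdiscard; the device path is illustrative):

# Plain discard: issues TRIM/discard requests for the whole LV, which is what we want here.
blkdiscard /dev/qubes_dom0/vm-example-private

# Zero-out mode (what the oVirt page describes): overwrites the range with
# zeroes rather than discarding it.
blkdiscard -z /dev/qubes_dom0/vm-example-private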

@brendanhoar
Author

brendanhoar commented Jun 17, 2019

Welcome to my rabbit hole. :)

No upstream bug captured in LVM2 as far as I can tell. My read is that there were performance issues (stalling?) on some storage arrays some years ago and they (RH?) don't really want to touch it again? Also, the LVM-thin approach of "zero on allocation" leads me to think they don't really want to be adding any performance impacts on de-allocation.

They're aware though (and as easily confused as we are - https://www.redhat.com/archives/linux-lvm/2015-August/msg00006.html ):

On 08/12/2015 09:46 AM, Peter Rajnoha wrote:
> # lvremove vg1/lvol0
> Do you really want to remove and DISCARD logical volume lvol0? [y/n]: 
> > For this, the "devices/issue_discards=1" must be set - it's not
> > by default.
> > 
> 
> Well, sorry, this applies for usual PVs, not thin pools (I overlooked
> the $subject and just saw the last question...)

Anyway, other virtualization environments noticed this issue and added workarounds to their own management stack (oVirt/proxmox/even some of the more "enterprisey" tools in my searches). They all probably noticed because deleted thin-LVs from their discarded VMs were not de-allocated down to their storage arrays as they had expected, even though the thin-pool space was de-allocated from LVM's perspective.

Where to go next:

  • short term: I'll probably write a script to allocate a bunch of data into a temporary thin-pool LV, sized just shy of filling the thin pool; perhaps the script could monitor data/metadata usage and stop appropriately. It takes a while and is hard on the SSD, writing data for no good reason other than convenience and the lack of appropriate tooling. Discard across already-discarded flash is basically a no-op, but writing for the purpose of discarding rubs me the wrong way. So, I'll run it from time to time.
  • medium-term: adding a blkdiscard-before-lvremove in qubes tooling is probably the quickest way to address the issue for disposable/deleted VMs. No extraneous data writes needed.
  • long-term: execute thin_trim on boot and/or shutdown (might be a bit more involved due to not really being an lvm2 tool), but it's a better "hygiene" approach. No extraneous data writes needed.
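For reference, a rough sketch of the thin_trim invocation (the paths are illustrative for a default Qubes install; the pool and its thin LVs must not be active while thin_trim runs, and the steps to expose the hidden _tmeta/_tdata components are left out here):

thin_trim --metadata-dev /dev/mapper/qubes_dom0-pool00_tmeta \
          --data-dev /dev/mapper/qubes_dom0-pool00_tdata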

Brendan

@tasket

tasket commented Jun 17, 2019

Some thoughts:

  • Are you sure this isn't due to an old lvm tools version?
  • It would be interesting to see if reducing with lvresize has the same problem as lvremove.
  • I can't think of any other category of normal use that would cause blkdiscard-before-lvremove to miss data that thin_trim would catch. For example, I am developing an lvm backup tool which does a lot of lvremove, and now plan to use blkdiscard where appropriate. The user would have to install and use lvm-using tools in dom0 that don't take such precautions.
  • Your mid-term case is probably my short-term case: Patching Qubes for this is pretty easy since the lvm.py module handles all lvm changes. Might have to put it in qubes_lvm() in the form of if cmd[0]=='remove': then run blkdiscard on cmd[1].
  • Replacing /sbin/lvm with a wrapper script that inserts blkdiscard commands where appropriate, to catch instances of non-Qubes lvm manipulation, may be practical.

@brendanhoar
Author

brendanhoar commented Jun 17, 2019

* Are you sure this isn't due to an old lvm tools version?

Pretty sure. Looking at: https://github.com/lvmteam/lvm2/blob/master/lib/metadata/lv_manip.c

it appears that seg_type matters, and only AREA_PV gets discards via discard_pv_segment, while pool/mirror data just gets reduced.

* It would be interesting to see if reducing with `lvresize` has the same problem as `lvremove`.

Amusingly, they appear to be mapped to the same code path.

* I can't think of any other category of normal use that would cause _blkdiscard-before-lvremove_ to miss data that `thin_trim` would catch. For example, I am developing an lvm backup tool which does a lot of `lvremove`, and now plan to use `blkdiscard` where appropriate. The user would have to install and use lvm-using tools in dom0 that don't take such precautions.

I see you too have been caught by surprise here. :)

Anyway...I'd be concerned about data unexpectedly hanging around from:
a) before the fix was put into qubes.
b) recovery after an unclean shutdown
c) cleanup tasks when orphaned volumes are found, performed by novices...like me.

And maybe, if volatile volumes are somehow taking advantage of DM/LVM auto-dispose features (do those exist?), that could be a gap (haven't investigated this, though).

* Your mid-term case is probably my short-term case: Patching Qubes for this is pretty easy since the lvm.py module handles all lvm changes. Might have to put it in `qubes_lvm()` in the form of `if cmd[0]=='remove':` then run blkdiscard on `cmd[1]`.

Please send a pull request to Marek et al! :)

* Replacing `/sbin/lvm` with a wrapper script that inserts `blkdiscard` commands where appropriate, to catch instances of non-Qubes lvm manipulation, may be practical.

Maybe...but I could see there being pushback on that (does it survive updates?, etc.).

Brendan

@marmarek
Member

Please send a pull request to Marek et al! :)

Yes please! I have slightly reduced test capacity right now, but if I get a tested commit, I'd gladly accept it.
It should be enough to put it somewhere here, if cmd[0] == 'remove'.
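Roughly the shape being suggested, as a sketch only (where exactly it lands in lvm.py and how failures get logged are for the PR to decide):

# Sketch: inside the helper that executes lvm commands, where cmd is a list
# such as ['remove', 'qubes_dom0/vm-foo-private'].
import subprocess

if cmd[0] == 'remove':
    try:
        subprocess.run(['blkdiscard', '/dev/' + cmd[1]], check=True)
    except (OSError, subprocess.CalledProcessError):
        pass  # e.g. no TRIM support or LV already gone; proceed with lvremove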

And maybe, if volatile volumes are somehow taking advantage of DM/LVM auto-dispose features (do those exist?), that could be a gap (haven't investigated this, though).

No, plain lvremove is in use. In case of unclean shutdown - on the next startup.

@marmarek
Member

As for the long-term solution, it should really be done by lvm automatically (maybe with some configuration option, or by re-using issue_discards)...

@tasket

tasket commented Jun 20, 2019

One question I have about implementation is whether a lvchange -p rw command should be issued before blkdiscard. The reason is that blkdiscard will fail on a read-only lv, and the lvchange was a precaution I had to take with sparsebak. If we don't think Qubes will ever mark a snapshot as read-only then this extra step may not be necessary.

Another question: Should this discard step be controlled by an optional setting and/or detection of trim support for the pool's device?

@brendanhoar
Author

brendanhoar commented Jun 20, 2019

1st Q is probably directed to Marek, but I’d say yes to setting rw before blkdiscard.

Peanut gallery answer to 2nd question :)

  1. The simple answer is to detect whether trim is enabled and use blkdiscard if it is (one quick way to check each layer is sketched at the end of this comment). Of course, one might also want to look at the layer under the thin pool, which is presumably either a LUKS layer or a physical partition (depending on the install). Or maybe all three layers. Oy. Maybe just assume that if issue_discards is 1, the user knew what they were doing.

  2. More complex, but arguably more correct approach is to have a global Qubes setting that lets the user choose among: a) do not clear before removing volumes/snapshots ; b) blkdiscard before removing volumes/snapshots (if qubes determines trim is available/effective) ; c) zero overwrite before removing volumes/snapshots (for certain use cases where the performance tradeoff is worth it).

  3. A stretch would be to have Qubes be aware of which layers pass discards and whether the hardware (assuming that Xen is not virtualized) is capable of DRAT, RZAT, non-deterministic or broken behavior (which can be determined experimentally), and provide the above options and/or advice based on that.*

B

* I just realized what I wrote in No. 3 assumed a single thin pool, but users could have multiple pools across multiple devices, some of which might have differing levels of support down the stack...
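One quick way to check discard support at each layer of the stack (illustrative; non-zero DISC-GRAN/DISC-MAX values mean that layer accepts discards):

lsblk --discard

# or, per physical device:
cat /sys/block/sda/queue/discard_granularity
cat /sys/block/sda/queue/discard_max_bytes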

@marmarek
Member

As for lvchange, I don't think it's necessary; all snapshots are created rw by default. But if blkdiscard fails, I would just log it and proceed with lvremove.

As I understand it, if a lower layer doesn't support TRIM (LUKS without allow_discards enabled, an HDD, etc.), this simply becomes a no-op, right? If so, I'd do blkdiscard unconditionally. If it turns out there is a need for an option, we can add it later.

@brendanhoar
Author

As I understand it, if a lower layer doesn't support TRIM (LUKS without allow_discards enabled, an HDD, etc.), this simply becomes a no-op, right? If so, I'd do blkdiscard unconditionally. If it turns out there is a need for an option, we can add it later.

Correct.

B

@tasket

tasket commented Jun 21, 2019

PR is here: QubesOS/qubes-core-admin#267.

Since it's the convention used in the ThinVolume methods, I simply concatenated '/dev/' and the cmd parameter to create a path blkdiscard could use. I also used the asyncio exec methods already used in qubes_lvm_coro(), although I'm not familiar with asyncio and don't know all the implications of using it; this could be changed to a simple subprocess.run().

@tasket

tasket commented Jun 21, 2019

As for testing, I don't have a setup to run unit tests, unfortunately. I don't even have a fedora dev vm anymore since the last time I had to reinstall.

However, I've manually tested it on a regular Qubes system using brendan's trim watcher script and it works as intended.

@brendanhoar
Author

brendanhoar commented Jun 22, 2019

[See replies below - Chris's code works, I just failed to reboot first.]
Thanks Chris.

I pulled the replacement lvm.py and installed it. Doesn't seem to be working for me:

2019-06-21 20:21:30,899 Removing volume private: qubes_dom0/vm-disp6977-private
2019-06-21 20:21:37,214 unhandled exception while calling src=b'dom0' meth=b'admin.vm.Kill' dest=b'disp6977' arg=b'' len(untrusted_payload)=0
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/qubes/api/__init__.py", line 264, in respond
    self.send_event)
  File "/usr/lib/python3.5/site-packages/qubes/api/__init__.py", line 125, in __init__
    self.dest = self.app.domains[dest.decode('ascii')]
  File "/usr/lib/python3.5/site-packages/qubes/app.py", line 467, in __getitem__
    raise KeyError(key)
KeyError: 'disp6977'

I did try adding a sudo before the blkdiscard but that didn't seem to make it work.

Reverting the code to baseline stopped the error from appearing.

Brendan

@brendanhoar
Author

brendanhoar commented Jun 22, 2019

[See replies below - Chris's code works, I just failed to reboot first.]
Marek said the above error message is unrelated (see #5105).

Still, I created two large random files in the disposable VM. If I delete file #1 inside the VM, I see discards. If I then shut down the VM, I do not see discards.

This is with both the pull request code and a modification replacing 'blkdiscard' with 'sudo', 'blkdiscard'.

Brendan

EDIT: hmm, the tested VM has a private volume size of 16GB, which is also the discard IO max for LVs in the thin pool. Tried it again with a smaller private volume, but same results. Large discard activity on deletion of the first file inside the VM, little to no discard activity on shutdown of disposable VM.

@brendanhoar
Author

brendanhoar commented Jun 22, 2019

Update: rebooting fixed it, as per Chris's suggestion in the PR thread... a Qubes service needed to be restarted to reload the lvm.py file used by the storage components.

Thanks!

@brendanhoar
Copy link
Author

On a 2012-era ThinkPad with an SSD, incorporating blkdiscard into the LV removal process added approximately 1.5 s of latency to the VM shutdown time for a lightly used disposable VM. The LV removals appear to run in parallel, and so do the blkdiscard invocations.

The addition of blkdiscard also added a negligible <100ms latency to the VM start up time, as qubes appears to perform a precautionary 'remove' invocation before every 'create' invocation at startup (also in parallel).

Startup:

23:08:53 exit  17053    256   0.006s blkdiscard /dev/qubes_dom0/vm-disp2843-private-snap
23:08:53 exit  17054    256   0.006s blkdiscard /dev/qubes_dom0/vm-disp2843-volatile
23:08:53 exit  17055    256   0.007s blkdiscard /dev/qubes_dom0/vm-disp2843-root-snap
23:08:53 exit  17056   1280   0.098s lvm lvremove -f qubes_dom0/vm-disp2843-private-snap
23:08:53 exit  17057   1280   0.158s lvm lvremove -f qubes_dom0/vm-disp2843-volatile
23:08:53 exit  17059   1280   0.222s lvm lvremove -f qubes_dom0/vm-disp2843-root-snap
23:08:53 exit  17062      0   0.473s lvm lvcreate -kn -ay -s /dev/qubes_dom0/vm-whonix-ws-14-dvm-private -n qubes_dom0/vm-disp2843-private-snap
23:08:53 exit  17066      0   0.627s lvm lvcreate -T qubes_dom0/pool00 -kn -ay -n vm-disp2843-volatile -V 12884901888B
23:08:54 exit  17068      0   0.840s lvm lvcreate -kn -ay -s /dev/qubes_dom0/vm-whonix-ws-14-root -n qubes_dom0/vm-disp2843-root-snap

Notably, the exit codes in the precautionary remove (and implicit blkdiscard) during startup are non-zero (256/1280), indicating that they failed (as the LVs they were to operate on did not exist).

Very rarely, the startup time may be increased to several seconds if there happens to be an LV to remove that matches the name of the lv to be created.

Shutdown:

23:09:28 exit  17575      0   1.492s blkdiscard /dev/qubes_dom0/vm-disp2843-volatile
23:09:28 exit  17576      0   1.556s blkdiscard /dev/qubes_dom0/vm-disp2843-root-snap
23:09:28 exit  17574      0   1.593s blkdiscard /dev/qubes_dom0/vm-disp2843-private-snap
23:09:28 exit  17580      0   0.281s lvm lvremove -f qubes_dom0/vm-disp2843-volatile
23:09:29 exit  17584      0   0.421s lvm lvremove -f qubes_dom0/vm-disp2843-root-snap
23:09:29 exit  17585      0   0.638s lvm lvremove -f qubes_dom0/vm-disp2843-private-snap

Again, as you can see by the timestamps, the blkdiscard invocations were in parallel.

Brendan

@marmarek
Member

The addition of blkdiscard also added a negligible <100ms latency to the VM start up time, as qubes appears to perform a precautionary 'remove' invocation before every 'create' invocation at startup (also in parallel).

Maybe it's worth checking whether the device exists from the Python code before calling it? And if not, skip the blkdiscard, or maybe even the lvremove too.
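A minimal illustration of that check (hypothetical helper; the real call sites in lvm.py differ):

import os

# Hypothetical helper: return True if /dev/<vg>/<lv> exists,
# e.g. volume_device_exists('qubes_dom0/vm-foo-private').
def volume_device_exists(vg_and_lv):
    return os.path.exists('/dev/' + vg_and_lv)

# If it does not exist, skip the blkdiscard (and possibly the lvremove as well).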

@andrewdavidwong andrewdavidwong added the 'community dev' and 'S: needs review' labels Jun 22, 2019
@tasket

tasket commented Jun 22, 2019

The overhead was in my mind when I asked about checking trim support first.

If we use os.path.exists(), we need to be confident that the /dev/ paths always reflect the pool contents, or else VMs may not be able to start.

@brendanhoar
Copy link
Author

One could argue that inserting blkdiscard into the remove process uncovered the unnecessary lvremove(s) during VM startup, and that once those are addressed, inserting blkdiscard will have led to a performance improvement. :P :)

B

@marmarek
Member

If we use os.path.exists(), we need to be confident that the /dev/ paths always reflect the pool contents, or else VMs may not be able to start.

I see we already rely on that in other functions, so in theory it shouldn't be an issue. But indeed, in this case the impact may be less pleasant than in, say, remove(). Alternatively, there is a cached list of available volumes (lvs output) in the size_cache variable. This may also sound risky, but I think it may be the safer choice, because a) the cache is refreshed frequently (after each changing action, and also periodically by the reads done by the disk space widget) and b) operations on a single volume do not happen in parallel (there is a lock).

Anyway, it's still a risk, so I wouldn't add this optimization to R4.0, only to R4.1.

@brendanhoar
Author

brendanhoar commented Jun 23, 2019

For the gap where unallocated space may contain sensitive data that a user thought was gone via VM deletion, I am putting together a script that creates an LV larger than the remaining thin pool space, writes /dev/zero to it until pool data/metadata allocation reaches a threshold percentage, then blkdiscards and lvremoves the LV.

Proof-of-concept script that limits pool data/metadata usage to 95% before cleanup: scrub_lvm_thin_slack.sh
https://pastebin.com/geGN4fVm
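The general shape of that technique, as a rough sketch rather than the linked script (the VG/pool names, the 1T virtual size, and the 95% threshold are illustrative):

#!/bin/bash
# Fill a temporary thin LV with zeroes in 1 GiB chunks until the pool's data or
# metadata usage reaches the threshold, then discard and remove the LV.
set -e
VG=qubes_dom0
POOL=pool00
LV=scrub-slack
THRESH=95

lvcreate -T "$VG/$POOL" -n "$LV" -V 1T   # oversized virtual size; space is only allocated as it is written
chunk=0
while :; do
    data=$(lvs --noheadings -o data_percent "$VG/$POOL" | awk '{print int($1)}')
    meta=$(lvs --noheadings -o metadata_percent "$VG/$POOL" | awk '{print int($1)}')
    if [ "$data" -ge "$THRESH" ] || [ "$meta" -ge "$THRESH" ]; then break; fi
    dd if=/dev/zero of="/dev/$VG/$LV" bs=1M count=1024 \
       seek=$((chunk * 1024)) conv=notrunc status=none || break
    chunk=$((chunk + 1))
done
blkdiscard "/dev/$VG/$LV"
lvremove -f "$VG/$LV"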

I need to set up another test Qubes build to see how close to full is really safe before pushing toward ~99.9% data usage. Metadata will always have a lower threshold.

B

@andrewdavidwong
Member

I observe that:

Therefore, I infer that the central enhancement for this issue has been satisfactorily implemented and that the issue should be closed. If anyone believes that this issue should remain open, please leave a comment briefly explaining why, and we'll be happy to reopen it. Please keep in mind that other action items that happened to arise in the course of discussing this issue should, in most cases, be filed separately. Thank you.

@andrewdavidwong andrewdavidwong removed the 'S: needs review' label Aug 23, 2020