LV disposal when removed from pool: deallocated pool space is neither zeroed nor discarded #5077
@brendanhoar How about issue_discards = 1 in /etc/lvm/lvm.conf? I am pretty sure lvremove should do discards if you have that option enabled.
Hi @iprid23 - I thought so too, but I was wrong.

I put together a script in dom0 to monitor discards to the physical drive. After enabling discards and/or passthrough in all layers: if I create a VM, create a large file in the VM, delete it in the VM, and execute sync in the VM, discards are passed down to the drive. If I create another large file in the VM, then shut down the VM and use Qubes tools to delete the entire VM, no discards are issued down to the drive.

Therefore, by the experiment above (and googling to verify I wasn't missing a config option): LVM does not issue, and currently cannot issue, discards on LV deletion. That is, deallocation is not equivalent to discard. Rather, discard is one method of deallocation, with the bonus that the discards are passed through to lower layers; that bonus does not apply to LV deletions.

Brendan

PS - I'll post the script when I get back to my desk.

PPS - I do have issue_discards set to 1. The comments above that setting in lvm.conf are misleading, in my opinion (at least for this version of LVM).
Script:
The 2nd-to-last column, "SECTORS", is the count of discarded sectors seen by the device, as per https://www.kernel.org/doc/Documentation/block/stat.txt. In dom0, it will report all discards on /dev/sd* since the kernel started keeping track.

B
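A minimal sketch of such a watcher, assuming the field layout described in stat.txt (where "discard sectors" is the 14th field; older kernels omit the discard fields entirely):

```python
#!/usr/bin/env python3
"""Print discarded-sector counters for /dev/sd* devices.

Sketch only: assumes the stat layout from Documentation/block/stat.txt,
in which 'discard sectors' is the 14th field (kernels >= 4.18).
"""
import glob

def discard_sectors(stat_line):
    # Read/write counters come first; 'discard sectors' is field 14.
    fields = stat_line.split()
    if len(fields) < 14:
        return 0  # kernel without discard accounting
    return int(fields[13])

if __name__ == "__main__":
    for path in sorted(glob.glob("/sys/block/sd*/stat")):
        dev = path.split("/")[3]
        with open(path) as f:
            print(dev, discard_sectors(f.read()))
```

Running it before and after an operation and diffing the counts shows whether discards actually reached the device.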
I tried different settings in lvm.conf also, but got the same results. (Edit: the disparity does seem to exist between fs ops and lvremove, so testing with the latter is necessary.) Looks like … Is there an upstream bug for this? I haven't found one.
Could be important: the oVirt page indicates they are using the …
Welcome to my rabbit hole. :) No upstream bug captured in LVM2, as far as I can tell. My read is that there were performance issues (stalling?) on some storage arrays some years ago, and they (RH?) don't really want to touch it again. Also, the LVM-thin approach of "zero on allocation" leads me to think they don't want to add any performance impact on deallocation. They're aware, though (and as easily confused as we are - https://www.redhat.com/archives/linux-lvm/2015-August/msg00006.html ):
Anyway, other virtualization environments noticed this issue and added workarounds to their own management stacks (oVirt, Proxmox, even some of the more "enterprisey" tools in my searches). They all probably noticed because deleted thin LVs from their discarded VMs were not deallocated down to their storage arrays as expected, even though the thin-pool space was deallocated from LVM's perspective. Where to go next:
Brendan
Some thoughts:
Pretty sure. Looking at https://github.com/lvmteam/lvm2/blob/master/lib/metadata/lv_manip.c, it appears that seg_type matters: only AREA_PV gets discards via discard_pv_segment, while pool/mirror data just gets reduced.
Amusingly, they appear to be mapped to the same code path.
I see you too have been caught by surprise here. :) Anyway...I'd be concerned about data unexpectedly hanging around from: And maybe, if volatile volumes are somehow taking advantage of DM/LVM auto-dispose features (do those exist?), that could be a gap (haven't investigated this, though).
Please send a pull request to Marek et al! :)
Maybe... but I could see there being pushback on that (does it survive updates? etc.).

Brendan
Yes please! I have slightly reduced test capacity right now, but if I get a tested commit, I'd gladly accept it.
No, plain lvremove is in use. In case of unclean shutdown - on the next startup.
As for the long-term solution, it should really be done by LVM automatically (maybe with some configuration option, or by re-using …).
One question I have about the implementation is whether a … Another question: should this discard step be controlled by an optional setting and/or by detection of trim support for the pool's device?
The 1st question is probably directed to Marek, but I'd say yes to setting rw before blkdiscard. Peanut-gallery answer to the 2nd question: :)
B

* I just realized what I wrote in No. 3 assumed a single thin pool, but users could have multiple pools across multiple devices, some of which might have differing levels of support down the stack...
As for lvchange, I don't think it's necessary; all snapshots are created rw by default. But if blkdiscard fails, I would just log it and proceed with lvremove. As I understand it, if the lower layer doesn't support TRIM (LUKS with … not enabled) …
Correct.

B
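The behavior settled on above (log a blkdiscard failure, then proceed with lvremove anyway) could be sketched roughly as follows. The function name, the injectable `run` parameter, and the `/dev/<vg>/<lv>` path convention are illustrative assumptions, not the actual qubes lvm.py code:

```python
import logging
import subprocess

log = logging.getLogger(__name__)

def discard_then_remove(vg, lv, run=subprocess.call):
    """Best-effort blkdiscard of a thin LV, then lvremove.

    A blkdiscard failure (e.g. a lower layer without TRIM support) is
    logged but does not abort the removal; lvremove still runs.
    """
    device = "/dev/{}/{}".format(vg, lv)
    if run(["blkdiscard", device]) != 0:
        log.warning("blkdiscard failed for %s; removing anyway", device)
    return run(["lvremove", "-f", device])
```

The `run` parameter only exists so the command sequence can be exercised without real devices; the real code would call the tools directly.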
The PR is here. Since it's the convention used in …
As for testing, I don't have a setup to run unit tests, unfortunately. I don't even have a Fedora dev VM anymore since the last time I had to reinstall. However, I've manually tested it on a regular Qubes system using Brendan's trim-watcher script, and it works as intended.
[See replies below - Chris's code works, I just failed to reboot first.] I pulled the replacement lvm.py and installed it. Doesn't seem to be working for me:
I did try adding a sudo before the blkdiscard, but that didn't seem to make it work. Reverting the code to baseline stopped the error from appearing.

Brendan
[See replies below - Chris's code works, I just failed to reboot first.] Still: I created two large random files in the disposable VM. If I delete file #1 inside the VM, I see discards. If I then shut down the VM, I do not see discards. This is with both the pull-request code and a modification to replace 'blkdiscard' with 'sudo', 'blkdiscard'.

Brendan

EDIT: hmm, the tested VM has a private volume size of 16GB, which is also the discard IO max for LVs in the thin pool. Tried it again with a smaller private volume, but same results: large discard activity on deletion of the first file inside the VM, little to no discard activity on shutdown of the disposable VM.
Update: rebooting fixed it, as per Chris's suggestion in the PR thread...a qubes service needed to restart to reload the lvm.py file from the storage components. Thanks! |
On a 2012-era ThinkPad with an SSD, incorporating blkdiscard into the LV removal process added approximately 1.5 s of latency to VM shutdown for a lightly used disposable VM. The LV removals appear to run in parallel, and so do the blkdiscard invocations. The addition of blkdiscard also added a negligible (<100 ms) latency to VM startup, as Qubes appears to perform a precautionary 'remove' invocation before every 'create' invocation at startup (also in parallel). Startup:
Notably, the exit codes of the precautionary remove (and implicit blkdiscard) during startup are non-zero (256/1280), indicating that they failed (as the LVs they were to operate on did not exist). Very rarely, the startup time may increase by several seconds if there happens to be an existing LV matching the name of the LV to be created. Shutdown:
Again, as you can see from the timestamps, the blkdiscard invocations ran in parallel.

Brendan
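A side note on those non-zero codes (my interpretation; the thread doesn't decode them): 256 and 1280 look like raw process wait statuses, in which the actual exit code sits in the high byte:

```python
import os

# If 256 and 1280 are raw wait statuses (as returned by os.system and
# friends), the real exit codes are found in the high byte:
for raw in (256, 1280):
    print(raw, "->", os.WEXITSTATUS(raw))  # 256 -> 1, 1280 -> 5
```

Exit code 1 would fit blkdiscard failing on a missing device, and 5 is plausibly LVM's generic "command failed" status.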
Maybe it's worth checking from the Python code whether the device exists before calling it? And if not, skip the blkdiscard, or maybe even the lvremove too.
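A sketch of that existence check, with hypothetical names (`safe_discard_remove`, the `dev_root` parameter exists only to make the sketch testable), and with the caveat that the /dev/<vg>/<lv> node exists only for active LVs:

```python
import os

def safe_discard_remove(vg, lv, run, dev_root="/dev"):
    """Skip blkdiscard (and lvremove) when the LV device node is absent.

    Caveat: udev creates /dev/<vg>/<lv> only for *active* LVs, so an
    inactive-but-existing LV would be skipped by this check.
    """
    device = os.path.join(dev_root, vg, lv)
    if not os.path.exists(device):
        return 0  # nothing to discard or remove
    run(["blkdiscard", device])
    return run(["lvremove", "-f", device])
```

This would avoid spawning two processes per LV for the precautionary 'remove' at VM startup, where the LV usually does not exist.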
The overhead was in my mind when I asked about checking trim support first. Using …
One could argue that inserting blkdiscard into the remove process uncovered the unnecessary lvremove(s) during VM startup, and that once that is addressed, inserting blkdiscard will have led to a performance improvement. :P :)

B
I see we already rely on that in other functions, so in theory it shouldn't be an issue. But indeed, in this case the impact may be less pleasant than, say, … Anyway, it's still a risk, so I wouldn't add this optimization to R4.0, only to R4.1.
For the gap where unallocated space may contain sensitive data that a user thought was gone via VM deletion, I am putting together a script that creates an LV larger than the remaining thin-pool data, writes /dev/zero to it until a threshold data/metadata percentage of pool allocation is reached, then blkdiscards and lvremoves the LV. Proof-of-concept script that limits pool data/metadata usage to 95% before cleanup: scrub_lvm_thin_slack.sh. I need to build another test Qubes system to see how close to full is really safe before continuing to ~99.9% data; metadata will always have a lower threshold.

B
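The stopping condition of such a scrub might look like the following in Python. This is a sketch, not the actual scrub_lvm_thin_slack.sh; it assumes the two-column output of `lvs --noheadings -o data_percent,metadata_percent <vg>/<pool>`, and the threshold names are invented:

```python
DATA_LIMIT = 95.0   # stop filling before thin-pool data space runs out
META_LIMIT = 95.0   # metadata usually exhausts sooner; guard it separately

def parse_pool_usage(lvs_line):
    """Parse 'data_percent metadata_percent' as printed by lvs."""
    data, meta = lvs_line.split()
    return float(data), float(meta)

def should_stop_filling(lvs_line):
    """True once either pool usage figure crosses its safety threshold."""
    data, meta = parse_pool_usage(lvs_line)
    return data >= DATA_LIMIT or meta >= META_LIMIT
```

The surrounding loop would keep writing /dev/zero into a scratch thin LV, re-checking this condition after each chunk, and finally blkdiscard and lvremove the scratch LV.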
I observe that:
Therefore, I infer that the central enhancement for this issue has been satisfactorily implemented and that the issue should be closed. If anyone believes this issue should remain open, please leave a comment briefly explaining why, and we'll be happy to reopen it. Please keep in mind that other action items that happened to arise in the course of discussing this issue should, in most cases, be filed separately. Thank you.
Community Dev: @tasket
Pull Request: QubesOS/qubes-core-admin#267
The problem you're addressing (if any)
Within a VM, a user can run fstrim against a mounted filesystem to issue discard commands for most of the unused space in the filesystem. This has two benefits: a) releasing unused space in the thin pool, and b) sending discard commands down to the physical device (if the user has made the other adjustments documented on the Qubes website).
However, when a VM is deleted (or a disposable VM exits), the lvremove command is used to deallocate the space in the thin pool used by the VM's LVs. That command does not send discards down the chain to the physical device.
Describe the solution you'd like
Under R4.0, adding an invocation of blkdiscard before invoking lvremove (in lvm.py) would help achieve amnesiac behavior for disposable or deleted VMs (again, if the user has activated discard on all device/fs layers with the appropriate hardware).
Poking around on GitHub, I suspect this could be remedied by modifying qubes-core-admin's qubes/storage/lvm.py, adding a blkdiscard invocation in the 'remove' handler (line 750).
https://github.com/QubesOS/qubes-core-admin/blob/master/qubes/storage/lvm.py
Where is the value to a user, and who might that user be?
Describe alternatives you've considered
Manually modifying lvm.py on my own install and maintaining the changes over time locally.
Manually running blkdiscard against LVs of a non-running VM I am about to delete.
Loading disposable VMs with a self-destruct script, run out of tmpfs during late shutdown, that blkdiscards the /rw volumes (would this cover all snapshots?).
If you happen to miss the opportunity to blkdiscard before lvremove, there seems to be no other (simple) method to push those discards through other than offline analysis of the pool metadata and targeted block-range discards on all non-allocated regions.
Additional context
It appears the oVirt developers are considering adding a blkdiscard call against the logical volume block devices before performing lvremove:
https://www.ovirt.org/develop/release-management/features/storage/wipe-volumes-using-blkdiscard.html
Relevant documentation you've consulted
https://www.qubes-os.org/doc/disk-trim/
Related, non-duplicate issues
#3717
#3226
#5055