
Backup restore and template installation should write directly to LVM volumes #3230

Closed
qubesuser opened this issue Oct 27, 2017 · 17 comments
Labels
C: core P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@qubesuser

Qubes OS version:

R4.0-rc2

Steps to reproduce the behavior:

  1. Try to restore a backup or try to install a template package

Expected behavior:

dom0 disk space usage does not change significantly.
Backups with VMs larger than half the size of the disk can be restored.

Actual behavior:

dom0 disk space usage changes significantly because the data is first written to a file in the dom0 root and then copied over.

Backups with VMs larger than half the size of the disk cannot be restored, since there is not enough disk space to hold the data both on the dom0 root and on the LVM volume.

General notes:

This is a big issue for restoring large VMs. Fixing it would also allow using a smaller dom0 root, rather than sizing it to be as large as the thin pool, saving gigabytes otherwise wasted on filesystem structures for an unnecessarily large filesystem (that would also require making sure log files don't grow out of control).

@andrewdavidwong andrewdavidwong added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. C: core labels Oct 28, 2017
@andrewdavidwong andrewdavidwong added this to the Release 4.0 milestone Oct 28, 2017
@marmarek
Member

Template installation is much less of an issue, as templates are limited by the template builder to 10 GB. It is also much trickier to solve, as RPM does not like writing directly to a block device (or rather, to a socket/pipe: remember that it now uses the Admin API, so you can also install a template from a management VM).

As for backup restore: only parts are stored as files (100 MB each), and in parallel they are uploaded to the actual VM volume (using the Admin API). But currently there is no limit on how many such parts are queued. This is because (currently) you can't control the speed of archive extraction (either tar or qfile-unpacker). That would require either adding an additional layer (a cat-like process, used to pause data input when needed) or somehow instructing the extractor process to pause (SIGSTOP/SIGCONT? that could be fragile...).
Not storing those fragments as files at all would be very tricky, because you need to verify a fragment before doing anything with it, and you can do that only when the full fragment is extracted. You should not start parsing its content in any way before verification.
An alternative could be using tmpfs, or using memory directly (a Python object). But that could easily lead to OOM, especially when restoring using a VM (aka "paranoid mode").

@qubesuser
Author

qubesuser commented Oct 29, 2017

I think one could write the fragments directly to an LVM volume (for instance using tar --to-stdout and piping to dd), verify either by reading from the VM volume or by teeing the data to the verification, and then rename the VM volume to $vm-private if verification passes.
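A minimal sketch of that tee-and-verify idea (the `restore_fragment_to_volume` helper, the member name, and the digest parameter are hypothetical, and it writes to an ordinary path standing in for the LVM device node):

```python
import hashlib
import tarfile

def restore_fragment_to_volume(archive, member, device_path, expected_digest,
                               chunk=1 << 20):
    """Stream one archive member straight to the target device while
    hashing it on the fly (the "tee" to verification), so no temporary
    copy ever lands in dom0's root filesystem.  The caller would commit
    (rename the volume to $vm-private) only if this returns True."""
    digest = hashlib.sha256()
    with tarfile.open(archive) as tar, open(device_path, "wb") as dev:
        src = tar.extractfile(member)
        for block in iter(lambda: src.read(chunk), b""):
            digest.update(block)   # verification branch of the tee
            dev.write(block)       # data branch: straight to the volume
    return digest.hexdigest() == expected_digest
```

A real implementation would open the LVM device node instead of a file and would discard (not rename) the volume on a digest mismatch.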

For templates, probably the best solution is not to ship them in the RPMs, but rather to ship them like the installation ISOs: provide only a download link and hash in the RPM, have the RPM install script download the image and pipe it via qrexec to dom0 while the checksum is verified in parallel, and finalize the install only if the checksum verification succeeds.

@marmarek
Member

This is too late if you want to keep clean task separation. The principle is to do nothing with the data until it is verified. While the current implementation may indeed allow that (not using the volume until it is renamed to $vm-private), we should not rely on such an assumption. Also keep in mind that the backup restore tool should not assume direct access to LVM; it uses the Admin API to upload volume content. So such a mechanism would require introducing an additional action to rename a volume, or separate "upload" and "commit" actions. Reading the volume back for verification is intentionally not supported through the Admin API, but that isn't a problem here, because you can calculate the data hash on the fly (and in fact the scrypt tool we use there does that already).
There is also one technical detail: you need to somehow pass the individual fragments to scrypt for decryption and verification. While its output could be redirected somewhere, for input you need to separate the individual VMs' volumes (and their fragments), so plain tar --to-stdout isn't feasible, because you'd get all of them concatenated.

The backup archive is split into fragments exactly to limit the temporary space needed to create a backup and to restore it. The latter is not implemented yet, but the current architecture should allow it.

@na--

na-- commented Oct 29, 2017

@qubesuser: I think that if the other issue you reported is fixed, this one would not be that big of a deal.

@marmarek: If this is up-to-date, that means there's a tar extraction of the huge backup file at the beginning. The tar options --checkpoint= and --checkpoint-action=exec=... can be used to limit the speed of the archive extraction with some artificial sleep. It's an ugly hack, but I use it for a task that needs to pipe the tar extraction of huge files into /tmp and process them as they are being extracted.

Here's the code I use:

```shell
tar --checkpoint=20000 \
    --checkpoint-action=exec='sleep "$(stat -f --format="(((%b-%a)/%b)^5)*30" /tmp | bc -l)"' \
    --extract --verbose __other_tar_args__ | program_to_process_extracted_files
```

Ugly as sin, but it causes tar to sleep progressively longer as /tmp fills up, so that program_to_process_extracted_files can catch up with processing and deleting the already-extracted files. For more complex flow-control logic, tar can call an external script that implements it, for example "pause extraction of file n until file n-2 is processed and removed" or something of the sort, which should be much less fragile than signalling tar externally.

Edit: link to the tar checkpoint documentation: https://www.gnu.org/software/tar/manual/html_section/tar_26.html and https://www.gnu.org/software/tar/manual/html_section/tar_29.html

@marmarek
Member

The tar options --checkpoint= and --checkpoint-action=exec=...... can be used to limit the speed of the archive extraction with some artificial sleep. It's an ugly hack but I use it for a task that needs piping tar extraction of huge files in /tmp and processing them as they are being extracted.

Tar is used there only if the backup file is exposed directly to dom0. If it is loaded from some VM (like sys-usb), then qfile-unpacker is used. But in that case we could add such an option ourselves.

@qubesuser
Author

qubesuser commented Oct 30, 2017

Yeah, it would need some sort of upload+commit interface (with hashes computed on the fly): ideally one where a qrexec connection is kept open until a commit command is sent, and the VM/volume is deleted automatically when the connection is broken or upon booting the system (to handle the system being hard rebooted during restore).

Not totally sure how to set up the input with tar. Maybe it would be possible to create the private.img.XXX files as UNIX sockets or fifos and convince tar to write into them instead of recreating them (perhaps tar --overwrite does that, not sure). Or use tar --to-stdout with a single-file filelist, if tar can seek efficiently (but this requires the input to be a file and not a pipe from another VM, unless tar is run in the other VM). Alternatively, one could just use tar --to-stdout with all the files, let them be concatenated, and split them afterwards, since the size of each fragment is known (or can be determined by separately running tar -t).
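A hedged illustration of the FIFO variant (all names here are hypothetical, and a plain file stands in for the LVM volume): the fragment path is pre-created as a FIFO where tar would otherwise create a regular file, and a reader drains it into the target device.

```python
import os
import tempfile
import threading

def drain_fifo_to_device(fifo_path, device_path, chunk=1 << 20):
    """Read whatever the archiver writes into the FIFO and stream it to
    the target device, so no regular private.img.XXX file is ever
    created in dom0's root filesystem."""
    with open(fifo_path, "rb") as fifo, open(device_path, "wb") as dev:
        for block in iter(lambda: fifo.read(chunk), b""):
            dev.write(block)

# Pre-create the fragment path as a FIFO at the spot where tar would
# otherwise create a regular file.
workdir = tempfile.mkdtemp()
fifo_path = os.path.join(workdir, "private.img.000")
os.mkfifo(fifo_path)
```

A writer standing in for tar would open the FIFO and write the fragment while drain_fifo_to_device runs concurrently; whether tar itself can be convinced to write into an existing FIFO is exactly the open question above.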

@marmarek
Member

Generally it is too late for major changes to the backup (or other) architecture for Qubes 4.0. Upload+commit may be a good idea for Qubes 4.1. Splitting concatenated files, or placing fifos for tar to write to, is IMO too fragile to consider at all. The backup mechanism is complex enough already.

One thing we may consider at this stage is slowing down tar/qfile-unpacker enough that it doesn't require too much space in /tmp. --checkpoint-action is interesting, but the exact command there needs to be adjusted. I'd put there something controlled from the Python script, and from there make sure no more than X files/size units are waiting to be handled. For example: reading 1 byte from a pipe, with the Python side writing 1 byte after each file is handled, and X bytes placed there at the beginning. A classic token solution.
What is the "checkpoint" ("record") unit? I thought it might be one tar block (512 bytes), but according to a simple test with --checkpoint=1 it is closer to "a file" (though sometimes two small files fit between checkpoints). Do you know of any documentation about this? @na--
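The token scheme described above can be sketched with a plain pipe (MAX_PENDING, take_token, and return_token are hypothetical names; in practice take_token would be invoked from tar's --checkpoint-action or from a patched qfile-unpacker):

```python
import os

MAX_PENDING = 4                    # at most 4 fragments extracted but unhandled
rfd, wfd = os.pipe()
os.write(wfd, b"t" * MAX_PENDING)  # seed the pipe with the initial tokens

def take_token():
    """Extractor side: blocks as soon as MAX_PENDING fragments have been
    extracted but not yet handled, pausing further extraction."""
    os.read(rfd, 1)

def return_token():
    """Restore-script side: called after a fragment has been uploaded to
    the VM volume and its temporary file deleted, resuming extraction."""
    os.write(wfd, b"t")
```

Because the pipe starts with X tokens and each handled fragment returns one, the number of fragments sitting in /tmp can never exceed X.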

@na--

na-- commented Oct 31, 2017

@marmarek: sorry, I'm not sure. I've read only what's in the tar manual and it's not very specific. I remember fiddling with the options until it was good enough and leaving it at that, since in my case it was not for something very important. I thought that a record is one tar block, but apparently not.

@jpouellet
Contributor

@qubesuser can you elaborate on what exactly you see an upload+commit interface performing and looking like?

Just the ability to write a stream directly to pool storage with some temporary name guaranteed to never be used by any VM, returning perhaps some token to be used by admin.vm.volume.CloneTo or such?

@marmarek
Member

@jpouellet take a look at the --checkpoint option discussed above.

@andrewdavidwong andrewdavidwong added the P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. label Aug 21, 2019
@heinrich-ulbricht

heinrich-ulbricht commented Sep 7, 2019

Currently I'm in a position where I need to find a solution for restoring huge backups while having nearly no space left for the restore in dom0.
I started a thread over in Google Groups, and the community has been tremendously helpful so far.
Unfortunately, I now seem to need a fix/hack applied to restore.py that prevents the restore operation from generating hundreds of GB of temporary data, and the "sleep fix" is a hot candidate.
I made a (naive?) suggestion for a restore.py modification here. Maybe somebody could have a look at whether this could work?

@github-actions

github-actions bot commented Aug 5, 2023

This issue is being closed because:

If anyone believes that this issue should be reopened and reassigned to an active milestone, please leave a brief comment.
(For example, if a bug still affects Qubes OS 4.1, then the comment "Affects 4.1" will suffice.)

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Aug 5, 2023
@DemiMarie

@marmarek Has this actually been fixed?

@marmarek
Member

I knew we had an issue for this! Yes, #8876

@DemiMarie

Did QubesOS/qubes-core-admin-client#278 fix backup restore too?

@marmarek
Member

No, that one is independent, and it doesn't suffer from the same issue as templates. The restore issue was fixed differently: QubesOS/qubes-core-admin-client@9360865

@DemiMarie

Closing as “completed”.

@DemiMarie DemiMarie removed the eol-4.0 Closed because Qubes 4.0 has reached end-of-life (EOL) label Apr 14, 2024