
file-reflink, a storage driver optimized for CoW filesystems #188

Merged: 2 commits merged into QubesOS:master from rustybird:file-reflink on Feb 13, 2018

Conversation

@rustybird (Contributor) commented Jan 23, 2018

This adds the file-reflink storage driver. It is never selected automatically for pool creation, in particular not for the creation of varlibqubes (though it can be used if set up manually).

The code is quite small:

               reflink.py  lvm.py      file.py + block-snapshot
    sloccount  334 lines   447 (134%)  570 (171%)

Background: btrfs and XFS (but not yet ZFS) support instant copies of individual files through the FICLONE ioctl behind cp --reflink, which file-reflink uses to snapshot VM image files without an extra device-mapper layer. All the snapshots are essentially freestanding; there's no functional origin vs. snapshot distinction.
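
For illustration, here is a minimal Python sketch of such a reflink copy (the FICLONE constant and ioctl call pattern are standard Linux, but this is not the driver's actual code):

    import fcntl

    FICLONE = 0x40049409  # Linux _IOW(0x94, 9, int); the call fails where reflinks are unsupported

    def reflink_copy(src_path, dst_path):
        # Clone src into dst instantly; both files then share extents copy-on-write.
        with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())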

In contrast to 'file'-on-btrfs, file-reflink inherently avoids CoW-on-CoW, which is a bigger issue on R4.0, where even AppVMs' private volumes are CoW. (And turning off the lower, filesystem-level CoW for 'file'-on-btrfs images would turn off data checksums too, i.e. protection against bit rot.)

Also in contrast to 'file', all storage features are supported, including

  • any number of revisions_to_keep
  • volume.revert()
  • volume.is_outdated
  • online fstrim/discard

Example tree of a file-reflink pool - *-dirty.img are connected to Xen:

  • /var/lib/testpool/appvms/foo/volatile-dirty.img
  • /var/lib/testpool/appvms/foo/root-dirty.img
  • /var/lib/testpool/appvms/foo/root.img
  • /var/lib/testpool/appvms/foo/private-dirty.img
  • /var/lib/testpool/appvms/foo/private.img
  • /var/lib/testpool/appvms/foo/private.img@2018-01-02T03:04:05Z
  • /var/lib/testpool/appvms/foo/private.img@2018-01-02T04:05:06Z
  • /var/lib/testpool/appvms/foo/private.img@2018-01-02T05:06:07Z
  • /var/lib/testpool/appvms/bar/...
  • /var/lib/testpool/appvms/...
  • /var/lib/testpool/template-vms/fedora-26/...
  • /var/lib/testpool/template-vms/...

It looks similar to a 'file' pool tree, and in fact file-reflink is drop-in compatible:

$ qvm-shutdown --all --wait
$ systemctl stop qubesd
$ sed 's/ driver="file"/ driver="file-reflink"/g' -i.bak /var/lib/qubes/qubes.xml
$ systemctl start qubesd
$ sudo rm -f /path/to/pool/*/*/*-cow.img*

If the user tries to create a fresh file-reflink pool on a filesystem that doesn't support reflinks, qvm-pool will abort and mention the setup_check=no option, which can be passed to force a fallback to regular sparse copies, at the cost of considerable time/space overhead. The same fallback code is also used when initially cloning a VM from a foreign pool, or from another file-reflink pool on a different mountpoint.
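
As a rough sketch of that fallback idea (hypothetical helper, not the actual qubes.storage.reflink code): try the reflink clone first, and fall back to a regular sparse copy when it is not supported, e.g. on a non-reflink filesystem or across mountpoints:

    import errno
    import fcntl
    import subprocess

    FICLONE = 0x40049409

    def copy_image(src_path, dst_path):
        try:
            # fast path: instant reflink clone
            with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
                fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        except OSError as e:
            if e.errno not in (errno.EOPNOTSUPP, errno.EXDEV, errno.ENOTTY):
                raise
            # slow path: regular copy, keeping holes sparse to limit space overhead
            subprocess.run(['cp', '--sparse=always', '--', src_path, dst_path],
                           check=True)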

journalctl -fu qubesd will show all file-reflink copy/rename/remove operations on VM creation/startup/shutdown/etc.


TODO: port unit tests

@rustybird force-pushed the file-reflink branch 3 times, most recently from 80d2ada to b3a24c8 (January 23, 2018 20:13)
@rustybird (Contributor, Author) commented Jan 23, 2018

Does CI policy require no new pylint warnings? There's one protected-access warning that I had planned on leaving in, because the accessed attribute isn't mine.

(I've also fixed one warning and silenced one false positive error. They weren't printed by my local pylint version.)

@rustybird (Contributor, Author):

There's one protected-access warning that I had planned on leaving in, because the accessed attribute isn't mine.

On second thought, let me handle that more cleanly...

@marmarek (Member):

Does CI policy require no new pylint warnings?

Generally yes. But in some specific cases, muting with # pylint: disable=... is ok too. It's there at least to force the developer to look at those cases.
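
For example, a narrowly scoped suppression on the offending line (illustrative only, not taken from this PR):

    size = volume._size  # pylint: disable=protected-access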

@codecov bot commented Jan 23, 2018

Codecov Report

Merging #188 into master will decrease coverage by 1.66%.
The diff coverage is 0.37%.


@@            Coverage Diff             @@
##           master     #188      +/-   ##
==========================================
- Coverage   56.32%   54.65%   -1.67%     
==========================================
  Files          55       56       +1     
  Lines        8734     9000     +266     
==========================================
  Hits         4919     4919              
- Misses       3815     4081     +266
Flag          Coverage Δ
#unittests    54.65% <0.37%> (-1.67%) ⬇️

Impacted Files             Coverage Δ
qubes/vm/dispvm.py         67.12% <ø> (ø) ⬆️
qubes/vm/appvm.py          90% <ø> (ø) ⬆️
qubes/storage/reflink.py   0% <0%> (ø)
qubes/app.py               74.47% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9b5256f...1695a73.

@rustybird (Contributor, Author):

I've fixed the underlying cause of the warning (and added a previously missing fsync after flush).
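
The general pattern being referred to, as a sketch rather than the PR's exact code: flush() only pushes Python's userspace buffer to the kernel, so an fsync() is still needed before the data can be considered durable:

    import os

    def write_durably(path, data):
        with open(path, 'wb') as f:
            f.write(data)
            f.flush()             # move Python's buffer into the kernel page cache
            os.fsync(f.fileno())  # then ask the kernel to commit it to stable storage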

@rustybird (Contributor, Author) commented Jan 26, 2018

In contrast to 'file'-on-btrfs, file-reflink inherently avoids CoW-on-CoW.

Huh, that's not actually true for the root volume of TemplateBasedVMs and DispVMs - even in R4.0, they still implement their own device-mapper based CoW thing, /dev/mapper/dmroot, inside the VM, using the second partition of volatile. Is my understanding correct that this approach is simply a remnant of the pre-R4.0 days? @marmarek @kalkin

If so, I propose to make these root volumes work exactly like the private volume of DispVMs, using a two-step transition:

  1. Flip their root volume from read-only to read-write in dom0.
  2. At some later point (maybe R4.1), rip out the dmroot stuff inside the VMs.

Old VMs (e.g. restored from backup) still using dmroot would continue to work after that.

@marmarek (Member):

Huh, that's not actually true for the root volume of TemplateBasedVMs and DispVMs - even in R4.0, they still implement their own device-mapper based CoW thing, /dev/mapper/dmroot, inside the VM, using the second partition of volatile. Is my understanding correct that this approach is simply a remnant of the pre-R4.0 days? @marmarek @kalkin

This is intentionally left in place, to allow the VM to verify the original root image. This would allow a semi-untrusted storage domain. See the original architecture paper. This part is yet to be implemented...

Since this is unused right now, I think it's ok to disable it, but not remove it completely. And leave a clear comment why it is this way. So:

  1. Flip their root volume from read-only to read-write in dom0.

Yes.

  2. At some later point (maybe R4.1), rip out the dmroot stuff inside the VMs.

No. And not really needed, as currently if /dev/xvda is read-write, then /dev/mapper/dmroot is just a symlink (not dm-linear as it was in R3.2).

BTW Are you waiting for me on this PR? I've skimmed through it and looks fine, but haven't done careful review yet. Are you going to port tests in this PR, or separate at some future time?

@rustybird (Contributor, Author):

This would allow a semi-untrusted storage domain.

Ah, okay.

Since this is unused right now, I think it's ok to disable it, but not remove it completely. And leave a clear comment why it is this way. So:

Flip their root volume from read-only to read-write in dom0.

Yes.

Done.

At some later point (maybe R4.1), rip out the dmroot stuff inside the VMs.

No. And not really needed, as currently if /dev/xvda is read-write, then /dev/mapper/dmroot is just a symlink (not dm-linear as it was in R3.2).

That's fantastic!

BTW Are you waiting for me on this PR? I've skimmed through it and looks fine, but haven't done careful review yet. Are you going to port tests in this PR, or separate at some future time?

If possible, I'd implement tests in a separate PR, before submitting a third one to automatically use file-reflink (instead of 'file') for varlibqubes when a btrfs-without-LVM layout has been set up during R4.1 installation. As for this first PR, getting it merged would conveniently stop breaking my qubesd during qubes-core-admin package upgrades. 8)

Hey, while I have your ear: should is_outdated() return True if the source volume has been started but not yet stopped since the snapshot? file-reflink currently returns True only after source volume commit (because restarting the consumer volume would still give the same data), but I can change that.
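
A conceptual sketch of those semantics, using made-up attribute names rather than the driver's real ones:

    def is_outdated(volume):
        # Only meaningful while the volume is running from a snapshot.
        if volume.snapshot_started_from is None:
            return False
        # Outdated only once the source has committed a newer image than the
        # snapshot this volume was started from; merely starting the source VM
        # does not change its committed image.
        return volume.source.latest_committed_image != volume.snapshot_started_from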

@marmarek (Member):

As for this first PR, getting it merged would conveniently stop breaking my qubesd during qubes-core-admin package upgrades. 8)

Not sure if you know, but you can install additional storage pool drivers from outside of qubes-core-admin. Just provide an appropriate entry point (as you've done here in setup.py). That wouldn't work for default_pool, but you can set it using qubes-prefs.
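
A sketch of what such an out-of-tree driver package's setup.py might look like; the entry point group name and class path below are assumptions based on how in-tree drivers are registered, not something copied from this PR:

    import setuptools

    setuptools.setup(
        name='qubes-storage-mydriver',        # hypothetical package
        version='0.1',
        packages=['qubes_storage_mydriver'],
        entry_points={
            'qubes.storage': [                # assumed entry point group
                'my-driver = qubes_storage_mydriver.pool:MyDriverPool',
            ],
        },
    )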

file-reflink currently returns True only after source volume commit (because restarting the consumer volume would still give the same data),

That's ok.

    # inefficient CoW-on-CoW setup. Avoid this by always
    # overriding root to be read-write - which may become
    # incompatible with a future untrusted storage domain!
    volume_config['rw'] = True
Member (review comment on the diff above):

Better patch it directly where volume_config is defined (and similar for other VM types).

Member (review comment):

And please use a separate commit for such changes.


@rustybird force-pushed the file-reflink branch 2 times, most recently from 6da472e to 958d6c7 (January 27, 2018 01:26)
@rustybird (Contributor, Author):

Not sure if you know, but you can install additional storage pool drivers from outside of qubes-core-admin. Just provide an appropriate entry point (as you've done here in setup.py).

Thanks, I'll give that a try.

rustybird added a commit to rustybird/qubes-core-admin-client that referenced this pull request Feb 12, 2018
I don't know if any template currently hits this code path; even the fedora-26-minimal root.img is large enough to be split into multiple parts. Maybe Arch Linux?

Related to QubesOS/qubes-core-admin#188
@marmarek merged commit 1695a73 into QubesOS:master on Feb 13, 2018
@rustybird deleted the file-reflink branch on September 12, 2018 14:32