
Default to zfs_bclone_wait_dirty=1 #17455


Open

behlendorf wants to merge 1 commit into master
Conversation

behlendorf
Contributor

Motivation and Context

#17446

Description

Update the default FICLONE and FICLONERANGE ioctl behavior to wait on dirty blocks. While this does remove some control from the application, in practice ZFS is better positioned to do the optimal thing and immediately force a TXG sync.
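
For reference, a minimal userspace sketch of the interface this change affects (illustrative only, not part of the patch; file names are placeholders):

```c
/* Minimal FICLONE example; error handling trimmed. */
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int src = open("src.dat", O_RDONLY);
	int dst = open("dst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/*
	 * With zfs_bclone_wait_dirty=1, ZFS waits for any dirty source
	 * blocks to be synced instead of returning an error to the caller.
	 */
	if (ioctl(dst, FICLONE, src) < 0) {
		perror("FICLONE");
		return 1;
	}

	close(src);
	close(dst);
	return 0;
}
```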

How Has This Been Tested?

Existing test case.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@robn
Member

robn commented Jun 11, 2025

Would it be worthwhile to plumb a flag through to indicate whether or not a fallback is available (i.e., is this copy_file_range() vs FICLONE/FICLONERANGE)? I worry about a conventional copy offload on small, newly-created files; it seems like blocking might be worse than just falling back to a content copy. Not sure.

(now thinking very hard about my use of ccache, which I have configured to clone between cachedir and builddir).
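
For context, the distinction robn is drawing looks roughly like the sketch below (illustrative only, not OpenZFS code): a copy_file_range() caller usually has a content-copy fallback, while a FICLONE caller explicitly asked for a clone and has none, which is why a hint about the originating interface could be useful.

```c
#define _GNU_SOURCE
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */
#include <unistd.h>

/* Clone-or-fail: the application explicitly asked for a clone. */
static int clone_file(int src, int dst)
{
	return ioctl(dst, FICLONE, src);
}

/* Clone-if-possible: the caller can fall back to copying bytes. */
static int copy_file(int src, int dst, size_t len)
{
	while (len > 0) {
		ssize_t n = copy_file_range(src, NULL, dst, NULL, len, 0);
		if (n < 0)
			return -1;	/* caller does a read/write loop instead */
		if (n == 0)
			break;
		len -= (size_t)n;
	}
	return 0;
}
```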

@behlendorf
Contributor Author

I worry about a conventional copy offload on small, newly-created files

Agreed. But on the flip side, we also want to consider the single large newly created file. It'd be a shame to end up making a full copy of it because we didn't want to wait for the sync.

Yeah, plumbing a flag through seems like the thing to do.

@amotin
Member

amotin commented Jun 11, 2025

@robn I have thought several times about limiting the minimum block size for cloning or deduplication, since it may not be efficient to clone or dedup files that are too small, unless there is going to be a huge number of those blocks (though we don't know that in advance). But if an application explicitly uses FICLONE/FICLONERANGE instead of copy_file_range(), there is a better chance that it knows what it is doing, so the mentioned flag might make sense from that perspective as well. On the other hand, FreeBSD has only copy_file_range(), so it can't use this as a hint.

I don't think I will object to this change, though. We already have exactly the same situation with zfs_dmu_offset_next_sync, which is enabled by default. And we actually saw a pretty pathological scenario for it with Samba, where IIRC Windows clients first truncate the destination file to the target size (making the dnode dirty) and then try to copy into it, calling lseek(). That's what actually made me create the recent PR there. And Veeam has already complained that zfs_bclone_wait_dirty being disabled breaks their legitimate workload, and I don't have anything else to propose to them.
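
A sketch of the pathological client pattern described above (illustrative only, not the actual Samba code): the destination is truncated to its final size first, which dirties the dnode, and every subsequent SEEK_DATA/SEEK_HOLE probe on the dirty dnode can force a txg sync when zfs_dmu_offset_next_sync=1.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void truncate_then_probe(int fd, off_t size)
{
	/* Client sets the final size up front; the dnode is now dirty. */
	(void) ftruncate(fd, size);

	/*
	 * Hole/data probes on a dirty dnode have to wait for it to be
	 * synced before they can answer accurately.
	 */
	off_t data = lseek(fd, 0, SEEK_DATA);   /* may block on a txg sync */
	off_t hole = lseek(fd, data < 0 ? 0 : data, SEEK_HOLE);
	(void) hole;
}
```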

@robn
Member

robn left a comment

Alright, I'm persuaded for now. I'll flip it locally (2.3) and see how it feels for the next little while. Most likely I won't even notice!

@IvanVolosyuk
Contributor

Will it cause a txg storm similar to #11353? @snajpa voiced a pretty strong opinion about requiring a txg to progress.

@snajpa
Contributor

snajpa commented Jun 12, 2025

Any feature relying on wait_txg_synced might as well not exist for our use-case. We're not using block cloning at all; I don't think the design is OK. Out of 1500+ members, not one has ever reported wanting this behavior in any way, so we're unlikely to enable it in the near future. My focus currently is on getting rid of wait_txg_synced in the zfs_link_create path for TMPFILE, which forces us to run with sync=disabled, which is even worse.

@snajpa
Contributor

snajpa commented Jun 12, 2025

One of the original strengths of ZFS, aside from the reliability and stability of the implementation, was its well thought-out caching design. The ARC has traditionally seen 99%+ long-term hit rates in all setups we have ever had.

And similarly, with dirty data write-back, we appreciated the clever design from the start: it turned a lot of what would otherwise be random writes all over the place into one (almost) linear write-out, limited only by the sequential write speed of the underlying hardware. That was especially nice for rotational drives, and may not be as relevant for current M.2 devices, even though they too have a huge discrepancy between linear and random write access; we don't reach those speeds yet. But when we do, it will be noticeable in comparison with other CoW filesystems doing a similar thing, if we're stuck in a writeback sync loop.

In our latest setup on the 128-core / 2 TB RAM boxes, we have patched ZFS to allow a far greater zfs_dirty_data_max_max, which in turn enables a far greater zfs_dirty_data_sync, since its default is 20-something percent of that IIRC. With the stock default it would start a txg sync way too early, so the pool was continuously in a txg sync loop. I don't find such behavior normal or tolerable even in this mild case, where there isn't any application specifically hanging in a wait_txg_synced loop:

  • approximately 1/3 of the data is constant rewrites of the same blocks that were written in the previous txg, because the systems are heavily multi-tenant and run a wide variety of workloads; they aren't idle, but what they do ends up heavily aggregated
  • we want to run with zfs_txg_timeout set to 15 seconds, which is the sweet spot in the trade-off between data safety and performance for our use-case
  • it is normal to have around 3-4 GB of dirty data, but it can go up to 12 GB before we start syncing; the average sync rate is circa 1 GB/s on this class of systems

In cases where wait_txg_synced would sit in regularly-used paths and thus slow the system down to a crawl, we have to take the route of not using such features, if possible. The zfs_holey behavior is sadly one of those things; in our setup, zfs_dmu_offset_next_sync=0 is the default.
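
The tunables being debated here, including the zfs_bclone_wait_dirty default this PR flips, are ordinary runtime module parameters, so a site that disagrees with the default can still override it. A small sketch of reading one via the standard Linux module-parameter path (assumes the usual /sys/module/zfs/parameters layout):

```c
#include <stdio.h>

/*
 * Read an integer ZFS module parameter, e.g. "zfs_bclone_wait_dirty"
 * or "zfs_dmu_offset_next_sync". Returns 0 on success, -1 on error.
 */
static int read_zfs_tunable(const char *name, long *val)
{
	char path[256];
	snprintf(path, sizeof (path),
	    "/sys/module/zfs/parameters/%s", name);

	FILE *f = fopen(path, "r");
	if (f == NULL)
		return (-1);
	int rc = (fscanf(f, "%ld", val) == 1) ? 0 : -1;
	fclose(f);
	return (rc);
}
```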

If tuned properly, OpenZFS is able to take on a pretty huge aggregation of very random workloads.

So far the situation with OpenZFS is shaping up very similarly to MySQL: defaults set for very small systems, and if you want to run something larger and fully take advantage of what the code can do, there's a 'High Performance MySQL' bible explaining all the tunables and how to tune them. Well, except such a book is yet to be written for OpenZFS :)

I am absolutely fine with defaults that work well for the desktop in terms of various thresholds, data sizes, etc. Overly relying on wait_txg_synced, though, can I think kill even the desktop use-case if it's used more than lightly.

On "my" systems, I always establish a loadavg baseline, around which the system hovers if everything is working properly and then go on a hunt if things aren't OK - maybe ZED could do some fiddling of the tunables if we come up with heuristics that work in most use-cases. A huge if, but may be worth a try. (That could save the need for a book on how to tune OpenZFS)

Update the default FICLONE and FICLONERANGE ioctl behavior to wait
on dirty blocks.  While this does remove some control from the
application, in practice ZFS is better positioned to do the optimal
thing and immediately force a TXG sync.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>