
Default to zfs_bclone_wait_dirty=1 #17455


Open

behlendorf wants to merge 1 commit into master
Conversation

behlendorf
Contributor

Motivation and Context

#17446

Description

Update the default FICLONE and FICLONERANGE ioctl behavior to wait on dirty blocks. While this does remove some control from the application, in practice ZFS is better positioned to do the optimal thing and immediately force a TXG sync.
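
For reference, a minimal userspace sketch of the interface this change affects (illustrative only, not part of the patch; file names are placeholders):

```c
/* Minimal FICLONE example; error handling trimmed. */
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int src = open("src.dat", O_RDONLY);
	int dst = open("dst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/*
	 * With zfs_bclone_wait_dirty=1, ZFS waits for any dirty source
	 * blocks to be synced instead of returning an error to the caller.
	 */
	if (ioctl(dst, FICLONE, src) < 0) {
		perror("FICLONE");
		return 1;
	}

	close(src);
	close(dst);
	return 0;
}
```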

How Has This Been Tested?

Existing test case.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@robn
Member

robn commented Jun 11, 2025

Would it be worthwhile to plumb a flag through to indicate whether or not a fallback is available (i.e., is this copy_file_range() vs FICLONE/FICLONERANGE)? I worry about a conventional copy offload on small, newly-created files; it seems like blocking might be worse than just falling back to a content copy. Not sure.

(now thinking very hard about my use of ccache, which I have configured to clone between cachedir and builddir).
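
For context, the distinction robn is drawing looks roughly like the sketch below (illustrative only, not OpenZFS code): a copy_file_range() caller usually has a content-copy fallback, while a FICLONE caller explicitly asked for a clone and has none, which is why a hint about the originating interface could be useful.

```c
#define _GNU_SOURCE
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */
#include <unistd.h>

/* Clone-or-fail: the application explicitly asked for a clone. */
static int clone_file(int src, int dst)
{
	return ioctl(dst, FICLONE, src);
}

/* Clone-if-possible: the caller can fall back to copying bytes. */
static int copy_file(int src, int dst, size_t len)
{
	while (len > 0) {
		ssize_t n = copy_file_range(src, NULL, dst, NULL, len, 0);
		if (n < 0)
			return -1;	/* caller does a read/write loop instead */
		if (n == 0)
			break;
		len -= (size_t)n;
	}
	return 0;
}
```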

@behlendorf
Contributor Author

I worry about a conventional copy offload on small, newly-created files

Agreed. But on the flip side, we also want to consider the single large newly created file. It'd be a shame to end up making a full copy of it because we didn't want to wait for the sync.

Yeah, plumbing a flag through seems like the thing to do.

@amotin
Member

amotin commented Jun 11, 2025

@robn I have thought several times about limiting the minimum block size for cloning or deduplication, since it may not be efficient to clone or dedup files that are too small, unless there is going to be a huge number of those blocks (though we don't know that in advance). But if an application explicitly uses FICLONE/FICLONERANGE instead of copy_file_range(), there is a better chance that it knows what it is doing, so the mentioned flag might make sense from that perspective as well. On the other hand, FreeBSD has only copy_file_range(), so it can't use this as a hint.

I don't think I will object to this change, though. We already have exactly the same situation with zfs_dmu_offset_next_sync, which is enabled by default. And we actually saw a pretty pathological scenario for it with Samba, where IIRC Windows clients first truncate the destination file to the target size (making the dnode dirty) and then try to copy into it, calling lseek(). That's what actually made me create the recent PR there. And Veeam has already complained that zfs_bclone_wait_dirty being disabled breaks their legitimate workload, and I don't have anything else to propose to them.
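
A sketch of the pathological client pattern described above (illustrative only, not the actual Samba code): the destination is truncated to its final size first, which dirties the dnode, and every subsequent SEEK_DATA/SEEK_HOLE probe on the dirty dnode can force a txg sync when zfs_dmu_offset_next_sync=1.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void truncate_then_probe(int fd, off_t size)
{
	/* Client sets the final size up front; the dnode is now dirty. */
	(void) ftruncate(fd, size);

	/*
	 * Hole/data probes on a dirty dnode have to wait for it to be
	 * synced before they can answer accurately.
	 */
	off_t data = lseek(fd, 0, SEEK_DATA);   /* may block on a txg sync */
	off_t hole = lseek(fd, data < 0 ? 0 : data, SEEK_HOLE);
	(void) hole;
}
```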

@robn
Member

robn left a comment

Alright, I'm persuaded for now. I'll flip it locally (2.3) and see how it feels for the next little while. Most likely I won't even notice!

@IvanVolosyuk
Contributor

Will it cause a txg storm similar to #11353? @snajpa voiced a pretty strong opinion about requiring a txg to progress.

@snajpa
Contributor

snajpa commented Jun 12, 2025

Any feature relying on wait_txg_synced might as well not exist for our use-case. We're not using block cloning at all; I don't think the design is OK. Out of 1500+ members, not one has ever reported wanting this behavior in any way, so we're unlikely to enable it in the near future. My focus currently is on getting rid of wait_txg_synced in the zfs_link_create path for TMPFILE, which forces us to run with sync=disabled, which is even worse.

@snajpa
Contributor

snajpa commented Jun 12, 2025

One of the original strengths of ZFS, aside from the reliability and stability of the implementation, was its well thought-out caching design. The ARC has traditionally seen 99%+ long-term hit rates in all setups we have ever had.

And similarly, with dirty data write-back, we appreciated the clever design from the start: it turned a lot of what would otherwise be random writes all over the place into one (almost) linear write-out, limited only by the sequential write speed of the underlying hardware. That was especially nice for rotational drives, and may not be as relevant for current M.2 devices, even though they too have a huge discrepancy between linear and random write access; we don't reach those speeds yet. But when we do, it will be noticeable in comparison with other CoW filesystems doing a similar thing, if we're stuck in a writeback sync loop.

In our latest setup on the 128-core / 2 TB RAM boxes, we have patched ZFS to allow a far greater zfs_dirty_data_max_max, which in turn enables a far greater zfs_dirty_data_sync, since its default is 20-something percent of that IIRC. With the stock default it would start a txg sync way too early, so the pool was continuously in a txg sync loop. I don't find such behavior normal or tolerable even in this mild case, where there isn't any application specifically hanging in a wait_txg_synced loop:

  • approximately 1/3 of the data is constant rewrites of the same blocks that were written in the previous txg, because the systems are heavily multi-tenant and run a wide variety of workloads; they aren't idle, but what they do ends up heavily aggregated
  • we want to run with zfs_txg_timeout set to 15 seconds, which is the sweet spot in the trade-off between data safety and performance for our use-case
  • it is normal to have around 3-4 GB of dirty data, but it can go up to 12 GB before we start syncing; the average sync rate is circa 1 GB/s on this class of systems

In cases where wait_txg_synced would sit in regularly-used paths and thus slow the system down to a crawl, we have to take the route of not using such features, if possible. The zfs_holey behavior is sadly one of those things; in our setup, zfs_dmu_offset_next_sync=0 is the default.
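
The tunables being debated here, including the zfs_bclone_wait_dirty default this PR flips, are ordinary runtime module parameters, so a site that disagrees with the default can still override it. A small sketch of reading one via the standard Linux module-parameter path (assumes the usual /sys/module/zfs/parameters layout):

```c
#include <stdio.h>

/*
 * Read an integer ZFS module parameter, e.g. "zfs_bclone_wait_dirty"
 * or "zfs_dmu_offset_next_sync". Returns 0 on success, -1 on error.
 */
static int read_zfs_tunable(const char *name, long *val)
{
	char path[256];
	snprintf(path, sizeof (path),
	    "/sys/module/zfs/parameters/%s", name);

	FILE *f = fopen(path, "r");
	if (f == NULL)
		return (-1);
	int rc = (fscanf(f, "%ld", val) == 1) ? 0 : -1;
	fclose(f);
	return (rc);
}
```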

If tuned properly, OpenZFS is able to take on a pretty huge aggregation of very random workloads.

So far the situation with OpenZFS is shaping up very similarly to MySQL: defaults set for very small systems, and if you want to run something larger and fully take advantage of what the code can do, there's a 'High Performance MySQL' bible explaining all the tunables and how to tune them. Well, except such a book is yet to be written for OpenZFS :)

I am absolutely fine with defaults that work well for the desktop in terms of various thresholds, data sizes, etc. Overly relying on wait_txg_synced, though, can I think kill even the desktop use-case if it's used more than lightly.

On "my" systems, I always establish a loadavg baseline, around which the system hovers if everything is working properly and then go on a hunt if things aren't OK - maybe ZED could do some fiddling of the tunables if we come up with heuristics that work in most use-cases. A huge if, but may be worth a try. (That could save the need for a book on how to tune OpenZFS)

Update the default FICLONE and FICLONERANGE ioctl behavior to wait
on dirty blocks.  While this does remove some control from the
application, in practice ZFS is better positioned to do the optimal
thing and immediately force a TXG sync.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>